
International Workshop on Advances in

Regularization, Optimization, Kernel Methods and

Support Vector Machines: theory and applications

July 8-10, 2013

LEUVEN, BELGIUM

Book of Abstracts

Editors: Johan Suykens, Andreas Argyriou, Kris De Brabanter, Moritz Diehl, Kristiaan Pelckmans, Marco Signoretto, Vanya Van Belle, Joos Vandewalle


Johan Suykens, Andreas Argyriou, Kris De Brabanter, Moritz Diehl, Kristiaan Pelckmans, Marco Signoretto, Vanya Van Belle, Joos Vandewalle (Eds.) Book of Abstracts of the International Workshop on Advances in Regularization, Optimization, Kernel Methods and Support Vector Machines: theory and applications (ROKS 2013), Leuven, Belgium, July 8-10, 2013

© Katholieke Universiteit Leuven Department of Electrical Engineering ESAT-SCD/SISTA Kasteelpark Arenberg 10 B-3001, Heverlee-Leuven, Belgium

Layout by: Jacqueline De bruyn, Marco Signoretto, Johan Suykens, Vanya Van Belle, Liesbeth Van Meerbeek

ISBN 978-94-6018-700-1 Wettelijk depot D/2013/7515/86

All rights reserved. No part of this book may be reproduced without prior written permission from the publisher.


Table of Contents

1. Welcome ..................................................... 5
2. Scientific Program .......................................... 11
3. Invited talk abstracts ...................................... 23
4. Oral and poster session abstracts ........................... 37
5. List of participants ....................................... 113
6. Practical info ............................................. 123


Welcome


Welcome!

Dear Participants,

A warm welcome to the ROKS 2013 International Workshop on Advances in Regularization, Optimization, Kernel Methods and Support Vector Machines: theory and applications in Leuven, Belgium. In recent years considerable progress has been made in the areas of kernel methods & support vector machines and compressed sensing & sparsity, where convex optimization plays an important common role. The aim of ROKS 2013 is to provide a multi-disciplinary forum where researchers of different communities can meet, to find new synergies across these areas, both at the level of theory and applications. As written in the acronym, the main themes include:

Regularization
Optimization
Kernel methods
Support vector machines.

We are happy that many distinguished speakers have accepted our invitation to deliver a plenary talk on the occasion of this workshop: Francis Bach, Stephen Boyd, Martin Jaggi, James Kwok, Yurii Nesterov, Massimiliano Pontil, Justin Romberg, Bernhard Scholkopf, John Shawe-Taylor, Joel Tropp, Ding-Xuan Zhou. In addition the workshop features five oral sessions organized around the themes: (i) Feature selection and sparsity; (ii) Optimization algorithms; (iii) Kernel methods and support vector machines; (iv) Structured low-rank approximation; (v) Robustness; and two poster sessions together with poster spotlight presentations. The reception in the Salons of the Arenberg castle and the workshop dinner in the Faculty Club offer plenty of additional opportunities for interaction and scientific discussions.

The organization of this workshop would have been impossible without the help of many people. We are very grateful to Jacqueline De bruyn, Liesbeth Van Meerbeek, Elsy Vermoesen, Ida Tassens, Mimi Deprez and several other team members for the workshop secretariat support, local arrangements and assistance. We thank all Scientific committee members and Reviewers for their valuable comments, which enabled the authors to prepare well-polished extended abstracts. ROKS 2013 has been organized within the framework of the European Research Council (ERC) Advanced Grant project 290923 A-DATADRIVE-B. We also gratefully acknowledge KU Leuven, FWO, IUAP DYSCO, CoE OPTEC EF/05/006, GOA MANET, iMinds Future Health Department, IWT.

On behalf also of all other organizing committee members Andreas Argyriou, Kris De Brabanter, Moritz Diehl, Kristiaan Pelckmans, Marco Signoretto, Vanya Van Belle, Joos Vandewalle, we wish all participants a stimulating workshop with many exciting discussions. Let it roks!

Johan Suykens
Chair ROKS 2013


Organizing committee

• Johan Suykens, KU Leuven, Chair

• Andreas Argyriou, Ecole Centrale Paris

• Kris De Brabanter, KU Leuven

• Moritz Diehl, KU Leuven

• Kristiaan Pelckmans, Uppsala University

• Marco Signoretto, KU Leuven

• Vanya Van Belle, KU Leuven

• Joos Vandewalle, KU Leuven

Local arrangements

• Jacqueline De bruyn

• Mimi Deprez

• Ida Tassens

• Elsy Vermoesen

• Liesbeth Van Meerbeek


Scientific committee

• Carlos Alzate, IBM Research

• Gavin Cawley, University of East Anglia

• Alessandro Chiuso, University of Padova

• Andreas Christmann, University of Bayreuth

• Arnak Dalalyan, ENSAE-CREST

• Francesco Dinuzzo, Max Planck Institute, Tuebingen

• Yonina Eldar, Technion Israel Institute of Technology

• Irene Gijbels, KU Leuven

• Francois Glineur, Catholic University of Louvain

• Kurt Jetter, Universitat Hohenheim

• S. Sathiya Keerthi, Microsoft

• Irwin King, Chinese University of Hong Kong

• Vera Kurkova, Academy of Sciences Czech Republic

• Gert Lanckriet, University of California, San Diego

• Lek-Heng Lim, University of Chicago

• Chih-Jen Lin, National Taiwan University

• Ivan Markovsky, Vrije Universiteit Brussel

• Sayan Mukherjee, Duke University

• Henrik Ohlsson, Linkoping University

• Francesco Orabona, Toyota Technological Institute Chicago

• Pradeep Ravikumar, University of Texas, Austin

• Lorenzo Rosasco, MIT

• Marcello Sanguineti, University of Genova


• Shai Shalev-Shwartz, The Hebrew University of Jerusalem

• Suvrit Sra, Max Planck Institute, Tuebingen

• Ambuj Tewari, University of Michigan

• Sergios Theodoridis, University of Athens

• Ryota Tomioka, The University of Tokyo

• Lieven Vandenberghe, University of California, Los Angeles

• Alessandro Verri, Universita degli Studi di Genova

• Bo Wahlberg, KTH Royal Institute of Technology

• Yuesheng Xu, Syracuse University

• Yiming Ying, University of Exeter

• Ming Yuan, Georgia Institute of Technology

• Luca Zanni, University of Modena and Reggio Emilia


Scientific Program


Program

Monday July 8

12:00-13:00 Registration and welcome coffee in Arenberg castle

13:00-13:10 Welcome by Johan Suykens

13:10-14:00 Deep-er Kernels

John Shawe-Taylor (University College London)

[Chair: Kristiaan Pelckmans]

14:00-14:50 Connections between the Lasso and Support Vector Machines

Martin Jaggi (Ecole Polytechnique Paris)

[Chair: Andreas Argyriou]

14:50-15:10 Coffee break

15:10-16:40 Oral session 1 : Feature selection and sparsity

[Chair: Johan Suykens]

16:40-17:30 Kernel Mean Embeddings applied to Fourier Optics

Bernhard Scholkopf (Max Planck Institute Tuebingen)

[Chair: Johan Suykens]

17:30-19:00 Reception in Salons Arenberg Castle


Program

Tuesday July 9

09:00-09:50 Large-scale Convex Optimization for Machine Learning

Francis Bach (INRIA)

[Chair: Joos Vandewalle]

09:50-10:40 Domain-Specific Languages for Large-Scale Convex Optimization

Stephen Boyd (Stanford University)

[Chair: Joos Vandewalle]

10:40-11:00 Coffee break

11:00-12:00 Oral session 2: Optimization algorithms

[Chair: Joos Vandewalle]

12:00-12:20 Spotlight presentations Poster session 1 (2 min/poster)

[Chair: Vanya Van Belle]

12:20-14:30 Group picture and Lunch in De Moete

Poster session 1 in Rooms S

14:30-15:20 Dynamic L1 Reconstruction

Justin Romberg (Georgia Tech)

[Chair: Kristiaan Pelckmans]

15:20-16:10 Multi-task Learning

Massimiliano Pontil (University College London)

[Chair: Andreas Argyriou]

16:10-16:30 Coffee break

16:30-18:30 Oral session 3: Kernel methods and support vector machines

[Chair: Marco Signoretto]

19:00 Dinner in Faculty Club


Program

Wednesday July 10

09:00-09:50 Subgradient methods for huge-scale optimization problems

Yurii Nesterov (Catholic University of Louvain)

[Chair: Andreas Argyriou]

09:50-10:40 Living on the edge: A geometric theory of phase transitions

in convex optimization

Joel Tropp (California Institute of Technology)

[Chair: Kristiaan Pelckmans]

10:40-11:00 Coffee break

11:00-12:30 Oral session 4: Structured low-rank approximation

[Chair: Kristiaan Pelckmans]

12:30-12:50 Spotlight presentations Poster session 2 (2 min/poster)

[Chair: Vanya Van Belle]

12:50-14:30 Lunch in De Moete

Poster session 2 in Rooms S

14:30-15:20 Minimum error entropy principle for learning

Ding-Xuan Zhou (City University of Hong Kong)

[Chair: Johan Suykens]

15:20-16:10 Learning from Weakly Labeled Data

James Kwok (Hong Kong University of Science and Technology)

[Chair: Johan Suykens]

16:10-16:30 Coffee break

16:30-18:00 Oral session 5: Robustness

[Chair: Kris De Brabanter]

18:00 Closing


Program - Oral sessions

Oral session 1: Feature selection and sparsity (July 8, 15:10-16:40)

15:10-15:40 The graph-guided group lasso for genome-wide association studies

Zi Wang and Giovanni Montana

Mathematics Department, Imperial College London

15:40-16:10 Feature Selection via Detecting Ineffective Features

Kris De Brabanter1 and Laszlo Gyorfi2

1 KU Leuven ESAT-SCD, 2 Dep. Comp. Sc. & Inf. Theory, Budapest Univ. of Techn. and Econ.

16:10-16:40 Sparse network-based models for patient classification using fMRI

Maria J. Rosa, Liana Portugal, John Shawe-Taylor and Janaina Mourao-Miranda: Computer Science Department, University College London

Oral session 2: Optimization algorithms (July 9, 11:00-12:00)

11:00-11:30 Incremental Forward Stagewise Regression: Computational Complexity and

Connections to LASSO

Robert M. Freund1, Paul Grigas2 and Rahul Mazumder2

1 MIT Sloan School of Management, 2 MIT Operations Research Center

11:30-12:00 Fixed-Size Pegasos for Large Scale Pinball Loss SVM

Vilen Jumutc, Xiaolin Huang and Johan A.K. Suykens

KU Leuven ESAT-SCD


Program - Oral sessions

Oral session 3: Kernel methods and support vector machines (July 9, 16:30-18:30)

16:30-17:00 Output Kernel Learning Methods

Francesco Dinuzzo1, Cheng Soon Ong2 and Kenji Fukumizu3

1 MPI for Intelligent Systems Tuebingen, 2 NICTA, Melbourne, 3 Institute of Statistical Mathematics, Tokyo

17:00-17:30 Deep Support Vector Machines for Regression Problems

M.A. Wiering, M. Schutten, A. Millea, A. Meijster and L.R.B. Schomaker

Institute of Artif. Intell. and Cognitive Eng., Univ. of Groningen

17:30-18:00 Subspace Learning and Empirical Operator Estimation

Alessandro Rudi1, Guillermo D. Canas2 and Lorenzo Rosasco2,3

1 Istituto Italiano di Tecnologia, 2 MIT-IIT, 3 Universita di Genova

18:00-18:30 Kernel based identification of systems with multiple outputs

using nuclear norm regularization

Tillmann Falck, Bart De Moor and Johan A.K. Suykens

KU Leuven, ESAT-SCD and iMinds Future Health


Program - Oral sessions

Oral session 4: Structured low-rank approximation (July 10, 11:00-12:30)

11:00-11:30 First-order methods for low-rank matrix factorization applied to

informed source separation

Augustin Lefevre1 and Francois Glineur1,2

1 ICTEAM Institute and 2 CORE Institute, Universite catholique de Louvain

11:30-12:00 Structured low-rank approximation as optimization on a Grassmann manifold

Konstantin Usevich and Ivan Markovsky

Dep. ELEC, Vrije Universiteit Brussel

12:00-12:30 Scalable Structured Low Rank Matrix Optimization Problems

Marco Signoretto1, Volkan Cevher2 and Johan A.K. Suykens1

1 KU Leuven, ESAT-SCD, 2 LIONS, EPFL Lausanne

Oral session 5: Robustness (July 10, 16:30-18:00)

16:30-17:00 Learning with Marginalized Corrupted Features

Laurens van der Maaten1, Minmin Chen2, Stephen Tyree2 and Kilian

Weinberger2: 1 Delft University of Technology, 2 Washington Univ. St. Louis

17:00-17:30 Robust regularized M-estimators of regression parameters and covariance matrix

Esa Ollila, Hyon-Jung Kim and Visa Koivunen

Dep. of Signal Processing and Acoustics, Aalto University

17:30-18:00 Robust Near-Separable Nonnegative Matrix Factorization Using Linear

Optimization, Nicolas Gillis1 and Robert Luce2

1 ICTEAM Institute, Univ. catholique de Louvain, 2 Technische Univ. Berlin


Program - Poster sessions

Poster session 1 (July 9, 13:15-14:30)

• Data-Driven and Problem-Oriented Multiple-Kernel Learning

Valeriya Naumova and Sergei V. Pereverzyev

Johann Radon Institute for Computational and Applied Mathematics (RICAM)

Austrian Academy of Sciences, Linz

• Support Vector Machine with spatial regularization for pixel classification

Remi Flamary1 and Alain Rakotomamonjy2

1 Lagrange Lab., CNRS, Universite de Nice Sophia-Antipolis, 2 LITIS Lab., Universite de Rouen

• Regularized structured low-rank approximation

Mariya Ishteva and Konstantin Usevich and Ivan Markovsky

Dep. ELEC, Vrije Universiteit Brussel

• A Heuristic Approach to Model Selection for Online Support Vector Machines

Davide Anguita, Alessandro Ghio, Isah A. Lawal and Luca Oneto

DITEN, University of Genoa

• Lasso and Adaptive Lasso with Convex Loss Functions

Wojciech Rejchel

Nicolaus Copernicus University, Torun

• Conditional Gaussian Graphical Models for Multi-output Regression of

Neuroimaging Data

Andre F. Marquand1, Maria Joao Rosa2 and Orla Doyle1

1 King’s College London, 2 University College London


• High-dimensional convex optimization via optimal affine subgradient algorithms

Masoud Ahookhosh and Arnold Neumaier

Faculty of Mathematics, University of Vienna

• Joint Estimation of Modular Gaussian Graphical Models

Jose Sanchez and Rebecka Jornsten

Mathematical Sciences, Chalmers Univ, of Technology and University of Gothenburg

• Learning Rates of l1-regularized Kernel Regression

Lei Shi, Xiaolin Huang and Johan A.K. Suykens

KU Leuven, ESAT-SCD

• Reduced Fixed-Size LSSVM for Large Scale Data

Raghvendra Mall and Johan A.K. Suykens

KU Leuven, ESAT-SCD


Program - Poster sessions

Poster session 2 (July 10, 13:15-14:30)

• Pattern Recognition for Neuroimaging Toolbox

Jessica Schrouff1, Maria J. Rosa2, Jane Rondina2, Andre F. Marquand3, Carlton Chu4, John Ashburner5, Jonas Richiardi6, Christophe Phillips1 and Janaina Mourao-Miranda2

1 Cyclotron Research Centre, University of Liege, 2 Computer Science Dep., University College London, 3 Institute of Psychology, King’s College, London, 4 NIMH, NIH, Bethesda, 5 Wellcome Trust Centre for Neuroimaging, University College London, 6 Stanford University

• Stable LASSO for High-Dimensional Feature Selection through Proximal Optimization

Roman Zakharov and Pierre Dupont

ICTEAM Institute, Universite catholique de Louvain

• Regularization in topology optimization

Atsushi Kawamoto, Tadayoshi Matsumori, Daisuke Murai and Tsuguo Kondoh

Toyota Central R&D Labs., Inc., Nagakute

• Classification of MCI and AD patients combining PET data and psychological scores

Fermin Segovia, Christine Bastin, Eric Salmon and Christophe Phillips

Cyclotron Research Centre, University of Liege

• Kernels design for Internet traffic classification

Emmanuel Herbert1, Stephane Senecal1 and Stephane Canu2

1 Orange Labs, Issy-les-Moulineux, 2 LITIS/INSA, Rouen


• Kernel Adaptive Filtering: Which Technique to Choose in Practice

Steven Van Vaerenbergh and Ignacio Santamaria

Dep. of Communications Engineering, University of Cantabria

• Structured Machine Learning for Mapping Natural Language to Spatial Ontologies

Parisa Kordjamshidi and Marie-Francine Moens

Dep. of Computer Science, Katholieke Universiteit Leuven

• Windowing strategies for on-line multiple kernel regression

Manuel Herrera and Rajan Filomeno Coelho

BATir Dep., Universite libre de Bruxelles

• Non-parallel semi-supervised classification

Siamak Mehrkanoon and Johan A.K. Suykens

KU Leuven, ESAT-SCD

• Visualisation of neural networks for model reduction

Tamas Kenesei and Janos Abonyi

Dep. of Process Engineering, University of Pannonia

• Convergence analysis of stochastic gradient descent on strongly convex objective

functions

Cheng Tang and Claire Monteleoni

Dep. of Computer Science, The George Washington University


Invited talk abstracts


Deep-er Kernels

John Shawe-Taylor

Centre for Computational Statistics and Machine Learning, University College London

Abstract: Kernels can be viewed as shallow in that learning is only applied in a single (output) layer. Recent successes with deeper networks highlight the need to consider richer function classes. The talk will review and discuss methods that have been developed to enable richer kernel classes to be learned. While some of these methods rely on greedy procedures, many are supported by statistical learning analyses and/or convergence bounds. The talk will highlight the potential for further research on this topic.


Connections between the Lasso and Support Vector

Machines

Martin Jaggi

CMAP, Ecole Polytechnique, Paris

Abstract: We discuss the relation of two fundamental tools in machine learning and signal processing, that is the support vector machine (SVM) for classification, and the Lasso technique used in regression. By outlining a simple equivalence between the Lasso primal and the SVM dual problem, we argue that many existing optimization algorithms can also be applied to the respective other task, and that many known theoretical insights can be translated between the two settings. One such consequence is that the sparsity of a Lasso solution is equal to the number of support vectors for the corresponding SVM instance, and that one can use screening rules to prune the set of support vectors. Another example is a kernelized version of the Lasso, analogous to the kernel trick in the SVM setting. On the algorithms side, we will discuss popular greedy first-order methods used in both settings.


Kernel Mean Embeddings applied to Fourier Optics

Bernhard Scholkopf

Empirical Inference Department, Max Planck Institute for Intelligent Systems


Large-scale Convex Optimization for Machine Learning

Francis Bach

INRIA - SIERRA project-team, Laboratoire d’Informatique de l’Ecole Normale Superieure, Paris

Abstract: Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will present several recent results, showing that in the ideal infinite-data setting, online learning algorithms based on stochastic approximation should be preferred, but that in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt)


Domain-Specific Languages for Large-Scale Convex

Optimization

Stephen Boyd

Information Systems Laboratory, Stanford University

Abstract: joint work with Eric Chu - Specialized languages for describing convex optimization problems, and associated parsers that automatically transform them to canonical form, have greatly increased the use of convex optimization in applications, especially those where the problem instances are not very large scale. CVX and YALMIP, for example, allow users to rapidly prototype applications based on solving (modest size) convex optimization problems. More recently, similar techniques were used in CVXGEN to automatically generate super-efficient small footprint code for solving families of small convex optimization problems, as might be used in real-time control. In this talk I will describe the general methods used in such systems, and describe methods by which they can be adapted for large-scale problems.


Dynamic L1 Reconstruction

Justin Romberg

School of Electrical and Computer Engineering, Georgia Tech

Abstract: Sparse signal recovery often involves solving an L1-regularized optimization problem. Most of the existing algorithms focus on the static setting, where the goal is to recover a fixed signal from a fixed system of equations. This talk will have two parts. In the first, we present a collection of homotopy-based algorithms that dynamically update the solution of the underlying L1 problem as the system changes. The sparse Kalman filter solves an L1-regularized Kalman filter equation for a time-varying signal that follows a linear dynamical system. Our proposed algorithm sequentially updates the solution as the new measurements are added and the old measurements are removed from the system. In the second part of the talk, we will discuss a continuous-time "algorithm" (i.e. a set of coupled nonlinear differential equations) for solving a class of sparsity-regularized least-squares problems. We characterize the convergence properties of this neural-net type system, with a special emphasis on the case when the final solution is indeed sparse. This is joint work with M. Salman Asif, Aurele Balavoine, and Chris Rozell.


Multi-task Learning

Massimiliano Pontil

Centre for Computational Statistics and Machine Learning, University College London

Abstract: A fundamental limitation of standard machine learning methods is the cost incurred by the preparation of the large training samples required for good generalization. A potential remedy is offered by multi-task learning: in many cases, while individual sample sizes are rather small, there are samples to represent a large number of learning tasks (linear regression problems), which share some constraining or generative property. If this property is sufficiently simple it should allow for better learning of the individual tasks despite their small individual sample sizes. In this talk I will review a wide class of multi-task learning methods which encourage low-dimensional representations of the regression vectors. I will describe techniques to solve the underlying optimization problems and present an analysis of the generalization performance of these learning methods which provides a proof of the superiority of multi-task learning under specific conditions.


Subgradient methods for huge-scale optimization problems

Yurii Nesterov

CORE, Catholic University of Louvain, Louvain-la-Neuve

Abstract: We consider a new class of huge-scale problems, the problems with sparse subgradients. The most important functions of this type are piece-wise linear. For optimization problems with uniform sparsity of corresponding linear operators, we suggest a very efficient implementation of subgradient iterations, whose total cost depends logarithmically on the dimension. This technique is based on a recursive update of the results of matrix/vector products and the values of symmetric functions. It works well, for example, for matrices with few nonzero diagonals and for max-type functions. We show that the updating technique can be efficiently coupled with the simplest subgradient methods, the unconstrained minimization method by Polyak, and the constrained minimization scheme by Shor. Similar results can be obtained for a new non-smooth random variant of a coordinate descent scheme. We discuss an extension of this technique to conic optimization problems.


Living on the edge: A geometric theory of phase

transitions in convex optimization

Joel Tropp

Department of Computing and Mathematical Sciences, California Institute of Technology

Abstract: Recent empirical research indicates that many convex optimization problems with random constraints exhibit a phase transition as the number of constraints increases. For example, this phenomenon emerges in the l1 minimization method for identifying a sparse vector from random linear samples. Indeed, this approach succeeds with high probability when the number of samples exceeds a threshold that depends on the sparsity level; otherwise, it fails with high probability. This talk summarizes a rigorous analysis that explains why phase transitions are ubiquitous in random convex optimization problems. It also describes tools for making reliable predictions about the quantitative aspects of the transition, including the location and the width of the transition region. These techniques apply to regularized linear inverse problems with random measurements, to demixing problems under a random incoherence model, and also to cone programs with random affine constraints. Joint work with D. Amelunxen, M. Lotz, and M. B. McCoy.


Minimum error entropy principle for learning

Ding-Xuan Zhou

Department of Mathematics, City University of Hong Kong

Abstract: Information theoretical learning is inspired by introducing information theory ideas into a machine learning paradigm. Minimum error entropy is a principle of information theoretical learning and provides a family of supervised learning algorithms. It is a substitution for the classical least squares method when the noise is non-Gaussian. Its idea is to extract from data as much information as possible about the data generating systems by minimizing error entropies. In this talk we will discuss some minimum error entropy algorithms in a regression setting by minimizing empirical Renyi’s entropy of order 2. Consistency results and learning rates are presented. In particular, some error estimates dealing with heavy-tailed noise will be given.


Learning from Weakly Labeled Data

James Kwok

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

Abstract: In many machine learning problems, the labels of the training examples are incomplete. These include, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. In this talk, focusing on the SVM as the learner, I will describe a label generation strategy that leads to a convex relaxation of the underlying mixed integer programming problem. Computationally, it can be solved via a sequence of SVM subproblems that are much more scalable than other convex SDP relaxations. Empirical results on the three weakly labeled learning tasks above also demonstrate improved performance. (joint work with Yu-Feng Li, Ivor W. Tsang, and Zhi-Hua Zhou)



Oral and poster session abstracts


An Equivalence between the Lasso

and Support Vector Machines

Martin Jaggi

CMAP, Ecole Polytechnique, Palaiseau, France

[email protected]

Abstract: We investigate the relation of two fundamental tools in machine learning and signal processing, that is the support vector machine (SVM) for classification, and the Lasso technique used in regression. We show [7] that the resulting optimization problems are equivalent, in the following sense: Given any instance of one of the two problems, we construct an instance of the other, having the same optimal solution. In consequence, many existing optimization algorithms for both SVMs and Lasso can also be applied to the respective other problem instances. Also, the equivalence allows for many known theoretical insights for SVM and Lasso to be translated between the two settings. One such implication gives a simple kernelized version of the Lasso, analogous to the kernels used in the SVM setting. Another consequence is that the sparsity of a Lasso solution is equal to the number of support vectors for the corresponding SVM instance, and that one can use screening rules to prune the set of support vectors. Furthermore, we can relate sublinear time algorithms for the two problems, and give a new such algorithm variant for the Lasso.

Keywords: Lasso, SVM, Kernel Methods, ℓ1-Regularized Least Squares, Screening Rules

1 Introduction

Large margin classification and kernel methods, and in particular the support vector machine (SVM) [3], are among the most popular standard tools for classification. On the other hand, ℓ1-regularized least squares regression, i.e. the Lasso estimator [8], is one of the most widely used tools for robust regression and sparse estimation. However, the two research topics developed largely independently and were not much set into context with each other.

Support Vector Machines. In this work, we focus on SVM large margin classifiers whose dual optimization problem is of the form

$$\min_{x \in \Delta} \; \|Ax\|_2^2. \qquad (1)$$

Here the matrix $A \in \mathbb{R}^{d \times n}$ contains all $n$ datapoints as its columns, and $\Delta$ is the unit simplex in $\mathbb{R}^n$, i.e. the set of probability vectors. The formulation (1) includes the commonly used soft-margin SVM with $\ell_2$-loss (for one or two classes, with regularized or no offset, with or without using a kernel), as given by

$$\min_{w \in \mathbb{R}^d,\, \rho \in \mathbb{R},\, \xi \in \mathbb{R}^n} \; \tfrac{1}{2}\|w\|_2^2 - \rho + \tfrac{C}{2} \sum_i \xi_i^2 \quad \text{s.t.} \quad y_i \cdot w^T X_i \ge \rho - \xi_i \;\; \forall i \in [1..n]. \qquad (2)$$

Here $C > 0$ is the regularization parameter. (Note that optionally, this also allows the use of a regularised offset variable $b \in \mathbb{R}$ to obtain a classifier that does not necessarily pass through the origin, by the trick of increasing the dimensionality of $X_i$ and $w$ by one.)

Lasso. On the other hand, the Lasso [8] is given by the quadratic program

$$\min_{x \in \Diamond} \; \|Ax - b\|_2^2, \qquad (3)$$

also known as the constrained variant of $\ell_1$-regularized least squares regression. Here the right hand side $b \in \mathbb{R}^d$ is a fixed vector, and $\Diamond$ is the $\ell_1$-unit-ball in $\mathbb{R}^n$. Note that if the desired $\ell_1$-regularization constraint is not $\|x\|_1 \le 1$ but $\|x\|_1 \le r$ for some $r > 0$ instead, then it is enough to simply re-scale the input matrix $A$ by a factor of $\frac{1}{r}$ in order to obtain our above formulation (3) for any general Lasso problem. In applications of the Lasso, it is important to distinguish two alternative interpretations of the data matrix $A$, which defines the problem instance (3): on one hand, in the setting of sparse regression, the goal is to approximate the single vector $b$ by a combination of few dictionary vectors, being the columns of $A$, called the dictionary matrix. On the other hand, if the Lasso problem is interpreted as feature selection, then each row $A_{i:}$ of $A$ is interpreted as an input vector, and for each of those, the Lasso is approximating the response $b_i$ to input $A_{i:}$. See e.g. [1] for a recent overview of Lasso-type methods.

Related Work. The early work of [6] has already significantly deepened the joint understanding of kernel methods and the sparse coding setting of the Lasso. However, despite its title, [6] is not addressing SVM classifiers, but in fact the ε-insensitive loss variant of support vector regression (SVR). In the application paper [5], the authors already suggested to make use of the


“easier” direction of our reduction, reducing the Lasso to a very particular SVM instance, but not addressing the SVM regularization parameter.

2 The Equivalence

We prove that the two problems (1) and (3) are indeed equivalent, by constructing instances of the respective other problem, having the same solutions.

2.1 (Lasso → SVM): Given a Lasso instance, construct an equivalent SVM instance

In order to represent the $\ell_1$-ball $\Diamond$ by a simplex $\Delta$, the standard concept of barycentric coordinates comes to help, stating that every polytope can be represented as the convex hull of its vertices [9], which in the $\ell_1$-ball case are $\{\pm e_i \mid i \in [1..n]\}$. Given a Lasso instance of the form (3), that is, $\min_{x \in \Diamond} \|Ax - b\|_2^2$, we can therefore directly parameterize the $\ell_1$-ball by the $2n$-dimensional simplex. By writing $(I_n \mid -I_n)\,x$ for any $x \in \Delta$, the objective function becomes $\|(A \mid -A)\,x - b\|_2^2$. This means we have obtained the equivalent non-negative regression problem over the domain $x \in \Delta \subset \mathbb{R}^{2n}$ which, by translation, is equivalent to the (hard-margin) SVM formulation (1), i.e. $\min_{x \in \Delta} \|\tilde{A}x\|_2^2$, where the data matrix is given by $\tilde{A} := (A \mid -A) - b\,\mathbf{1}^T \in \mathbb{R}^{d \times 2n}$. This reduction gives us a one-to-one correspondence of all feasible solutions, preserving the objective values: for any feasible solution $x \in \mathbb{R}^n$ to the Lasso, we have a feasible SVM solution $\tilde{x} \in \mathbb{R}^{2n}$ of the same objective value, and vice versa.
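This correspondence is easy to check numerically. The following is a minimal sketch (not taken from the paper; the names A_tilde and x_tilde are illustrative) that lifts a feasible Lasso point in the ℓ1-ball to barycentric coordinates on the 2n-simplex and verifies that the two objective values coincide:

```python
# Minimal sketch of the Lasso -> SVM reduction described above (illustrative
# names, not the author's code): build A_tilde = (A | -A) - b 1^T, lift a
# feasible Lasso point x to simplex coordinates x_tilde, compare objectives.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
A = rng.standard_normal((d, n))               # Lasso data matrix
b = rng.standard_normal(d)                    # Lasso right-hand side

A_tilde = np.hstack([A, -A]) - np.outer(b, np.ones(2 * n))

x = rng.standard_normal(n)
x /= 2 * np.abs(x).sum()                      # feasible: ||x||_1 = 1/2 <= 1

# Barycentric lift: positive parts go to the +e_i vertices, negative parts to
# the -e_i vertices; the slack 1 - ||x||_1 is split over the pair (+e_1, -e_1),
# whose contributions cancel in (A | -A) x_tilde.
x_tilde = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)])
slack = 1.0 - x_tilde.sum()
x_tilde[0] += slack / 2
x_tilde[n] += slack / 2

print(np.sum((A @ x - b) ** 2))               # ||A x - b||_2^2
print(np.sum((A_tilde @ x_tilde) ** 2))       # ||A_tilde x_tilde||_2^2 (equal)
```

The key point is that $\mathbf{1}^T \tilde{x} = 1$ on the simplex, so the $-b\,\mathbf{1}^T$ term reproduces the offset $-b$ exactly.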

2.2 (SVM → Lasso): Given an SVM instance, constructing an equivalent Lasso instance

This reduction is harder to accomplish than the other direction we explained above. Given an instance of a (soft- or hard-margin) SVM problem (1), we suppose that we have a (possibly non-optimal) weakly-separating vector $w \in \mathbb{R}^d$ available. Given $w$, we define a Lasso instance $(\hat{A}, \hat{b})$ as the translated datapoints $\hat{A} := \{ A_i + \hat{b} \mid i \in [1..n] \}$ together with the right hand side $\hat{b} := -C\,\frac{w}{\|w\|_2}$, that is a translation in direction of $w$ for some $C > 0$. Details are given in the full paper [7], showing that a weakly-separating vector $w$ is trivial to obtain for the $\ell_2$-loss soft-margin SVM (2), even if the SVM input data is not separable.

3 Implications & Remarks

3.1 Some Implications for the Lasso

Sublinear Time Algorithms. Using our reduction, we observe that the recent breakthrough SVM algorithm of [2] can also be applied to the Lasso, returning an ε-accurate solution to (3) in time $O(\varepsilon^{-2}(n+d) \log n)$.

A Lasso in Kernel Space. Using our reduction, we can kernelize the Lasso fully analogously to the classical kernel trick for SVMs, resulting in the formulation $\min_{x \in \Diamond} \|\sum_i \Psi(A_i)\,x_i - \Psi(b)\|_{\mathcal{H}}^2$. In the light of the success of the kernel idea for classification with its existing well-developed theory, it will be interesting to relate these results to the above proposed kernelized version of the Lasso, and to study how different kernels will affect the solution $x$ for applications of the Lasso.
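To make the kernelized formulation concrete, note that the squared feature-space norm can be evaluated with kernel values only, since $\|\sum_i \Psi(A_i)x_i - \Psi(b)\|_{\mathcal{H}}^2 = x^T K x - 2\,x^T k + k(b,b)$ with $K_{ij} = k(A_i, A_j)$ and $k_i = k(A_i, b)$. A minimal sketch follows (an illustration, not code from the paper; the Gaussian RBF kernel is an assumption):

```python
# Evaluate the kernelized Lasso objective ||sum_i Psi(A_i) x_i - Psi(b)||_H^2
# using only kernel evaluations: x^T K x - 2 x^T k + k(b, b).
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # assumed kernel choice, for illustration only
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_lasso_objective(A, b, x, gamma=1.0):
    n = A.shape[1]
    K = np.array([[rbf_kernel(A[:, i], A[:, j], gamma) for j in range(n)]
                  for i in range(n)])
    k = np.array([rbf_kernel(A[:, i], b, gamma) for i in range(n)])
    return x @ K @ x - 2 * x @ k + rbf_kernel(b, b, gamma)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))     # columns A_i act as dictionary vectors
b = rng.standard_normal(5)
x = rng.standard_normal(8)
x /= np.abs(x).sum()                # a feasible point of the l1-unit-ball
print(kernel_lasso_objective(A, b, x))
```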

3.2 Some Implications for SVMs

Structure and Sparsity of the Support Vectors, in the View of Lasso Sparsity. There has been a very significant boost of new literature studying the sparsity of solutions to the Lasso and related $\ell_1$-regularized methods, in particular the study of the sparsity of $x$ when $A$ and $b$ are from distributions with certain properties. Using our construction of the equivalent Lasso instance for a given SVM, such results then directly apply to the sparsity pattern of the solution to our original SVM (i.e. the pattern and the number of support vectors). More precisely, any distribution of matrices $A$ and corresponding $b$ for which the Lasso sparsity is well characterized, will also give the same patterns of support vectors for the equivalent SVM (and in particular the same number of support vectors).

Screening Rules for Support Vector Machines. For the Lasso, screening rules have been developed recently. This approach consists of a single pre-processing pass through the data $A$, in order to immediately discard those predictors $A_i$ that can be guaranteed to be inactive for the optimal solution [4]. Translated to the SVM setting by our reduction, any such Lasso screening rule can be used to permanently discard input points before the SVM optimization is started.

References

[1] P. Buhlmann and S. van de Geer. Statistics for High-Dimensional Data - Methods, Theory and Applications. Springer, 2011.

[2] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear Optimization for Machine Learning. FOCS, 2010.

[3] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20(3):273-297, 1995.

[4] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems. arXiv.org, cs.LG, 2010.

[5] D. Ghosh and A. M. Chinnaiyan. Classification and Selection of Biomarkers in Genomic Data Using LASSO. Journal of Biomedicine and Biotechnology, 2005(2):147-154, 2005.

[6] F. Girosi. An Equivalence Between Sparse Approximation and Support Vector Machines. Neural Computation, 10(6):1455-1480, 1998.

[7] M. Jaggi. An Equivalence between the Lasso and Support Vector Machines. arXiv, 2013.

[8] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

[9] G. M. Ziegler. Lectures on Polytopes, volume 152 of Graduate Texts in Mathematics. Springer, 1995.


The graph-guided group lasso for genome-wide association

studies

Zi Wang
Mathematics Department, Imperial College
180 Queen’s Gate, London SW7
[email protected]

Giovanni Montana
Mathematics Department, Imperial College
180 Queen’s Gate, London SW7
[email protected]

Abstract: In this work we propose a penalised regression model in which the covariates are known to be clustered into groups, and the clusters are arranged as nodes in a graph. We are motivated by an application to genome-wide association studies in which the objective is to identify important predictors, single nucleotide polymorphisms (SNPs), that account for the variability of a quantitative trait. In this application, SNPs naturally cluster into SNP sets representing genes, and genes are treated as nodes of a biological network encoding the functional relatedness of genes. Our proposed graph-guided group lasso (GGGL) takes into account such prior knowledge available on the covariates at two different levels, and allows to select important SNP sets while also favouring the selection of functionally related genes. We describe a computationally efficient algorithm for parameter estimation, provide experimental results and present a GWA study on lipids levels in two Asian populations.

Keywords: sparse group lasso, Laplacian penalty, genome-wide association studies

1 Introduction

Genome-wide association studies (GWAs) are concerned with the search for common genetic variants across the human genome that are associated to a disease status or quantitative trait. The genetic markers are often taken to be single-nucleotide polymorphisms (SNPs), and are treated as covariates in a linear regression model in which the response is a continuous measurement. We let X be the n × p design matrix containing n independent samples for which p SNPs have been observed, and y be the n-dimensional vector containing the univariate quantitative traits. We further assume that X and y are column-wise normalized to have sum zero and unit length.

Since the objective is to carry out variable selection, the empirical loss is minimised subject to some constraint conditions placed on the coefficients, in order to regularise the solution and carry out variable selection [1, 2].

One way of improving variable selection accuracy in GWAs is to make use of available prior knowledge about the genetic markers and the functional relationships between genes. Such knowledge typically includes the grouping of SNPs into genes, and it has been observed that selecting groups of SNPs in a single block, rather than individual SNPs in isolation, may increase the power to detect true causative and rare variants (e.g. [3]). However, additional information can also be obtained from publicly available data bases in the form of

biological networks encoding pairwise interactions between genes or proteins associated to those genes. Under the assumption that such networks describe true biological processes, there are reasons to believe that using this additional information to guide the SNP selection process may produce results that are biologically more plausible and easy to interpret, as well as increase power. Regularised regression models that take into consideration the integrated effects of all SNPs that belong to functionally related genes are also believed to achieve superior performance in terms of detecting the true causative markers [4, 5]. In previous GWA studies, this has been accomplished by using variations of the group lasso [6] and the sparse group lasso [7]. When groups overlap, for instance when a SNP is mapped to more than one single gene, variables selected by the overlapping group lasso [8] are the union of some groups.

To the best of our knowledge, graph structures placed on genetic markers or genes are not yet used to drive the variable selection process in GWA studies with quantitative traits. In a typical gene network, two nodes are connected by an edge if the associated genes belong to the same genetic pathway or are deemed to share related functions. In the case of a weighted graph, the weights may be a probability measure of the uncertainty of the link between the genes. In this work we consider the case where prior knowledge is available at two different levels: SNPs are grouped into genes, and a weighted gene network encodes the functional relatedness of all pairs of genes. We propose a penalised regression model, the graph-guided group lasso,


which selects important SNP groups, while also fusing information between adjacent SNP groups in the given biological network.

2 Graph-guided group lasso

Suppose that the $p$ available SNPs are grouped into mutually exclusive genes $R_1, R_2, \ldots, R_r$. The size of a group $R_l$ is denoted by $|R_l|$. We let $X_{R_l}$ denote the $n \times |R_l|$ matrix whose columns correspond to the SNPs in $R_l$, and $G = G(V,E)$ the gene network with vertex set $V$ corresponding to the $r$ genes in $R$. The weight of the edge $k-l$ is denoted by $w_{kl}$. For simplicity, we assume that all the weights are non-negative.

The regression coefficients are obtained by minimising $\|y - X\beta\|_2^2$ plus a penalty term given by

$$2\lambda_1 \sum_{g=1}^{r} \sqrt{|R_g|}\,\|\beta_{R_g}\|_2 + 2\lambda_2 \|\beta\|_1 + \mu \sum_{i \in R_k,\, j \in R_l,\, R_k \sim R_l} w_{kl} (\beta_i - \beta_j)^2 \qquad (1)$$

where $\lambda_1$, $\lambda_2$, and $\mu$ are non-negative regularization parameters, and $R_k \sim R_l$ if and only if they are connected in the network $G$. This model has two main features. Firstly, by making use of the Laplacian penalty on the complete bipartite graph $(R_i, R_j)$ for all $R_j \sim R_i$, information is fused from all other genes interacting with $R_i$ in $G$ so that these functionally related genes are encouraged to be selected in and out of the model altogether [4]. Secondly, there is a grouping effect, in the sense that all SNPs within a gene $R_i$ are either selected together or not selected. This feature follows from the properties of the sparse group lasso penalty [7], in which sparsity of genes and SNPs is regularized by $\lambda_1$ and $\lambda_2$ respectively.

Note that the prior knowledge represented by the grouping and the graph is at heterogeneous levels, hence how the pairwise relations at the genes’ level influence variable selection for individual SNPs may have different answers. The proposed penalty also has the effect of smoothing the regression coefficients corresponding to all SNPs that belong to interacting genes. When $\mu \to \infty$, all these coefficients are expected to be equal.

In some cases, a modification of the model above may be preferred. If two genes are directly connected in $G$, it may be preferable to encourage them to be selected or discarded altogether without smoothing the individual SNP coefficients within a gene. For this reason, we also propose a second version of the GGGL model by replacing the last term in (1) by:

$$\mu \sum_{R_k \sim R_l} w_{kl} \big( \bar{\beta}_{R_k} - \bar{\beta}_{R_l} \big)^2 \qquad (2)$$

where $\bar{\beta}_{R_k}$ denotes the average coefficient for predictors in $R_k$. We show that, using (2), the interacting genes are indeed encouraged to be selected in or out of the model altogether; nonetheless, no smoothing effect is imposed on the coefficients corresponding to the SNPs within a gene. In summary, the penalty (1) is more desirable when the interest is only in selecting genes, regardless of the specific SNP effects that drive the gene selection process, whereas (2) is more appropriate for the detection of localised SNP effects.
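For illustration, the GGGL penalty (1) can be evaluated directly from the grouping and the gene network; the sketch below is a minimal example (not the authors' implementation; `groups` and `edges` are illustrative data structures):

```python
# Minimal sketch of the graph-guided group lasso penalty (1):
#   2*lam1 * sum_g sqrt(|R_g|) ||beta_{R_g}||_2 + 2*lam2 * ||beta||_1
#   + mu * sum_{R_k ~ R_l} w_kl * sum_{i in R_k, j in R_l} (beta_i - beta_j)^2
import numpy as np

def gggl_penalty(beta, groups, edges, lam1, lam2, mu):
    group_term = sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)
    l1_term = np.abs(beta).sum()
    graph_term = 0.0
    for k, l, w in edges:                       # weighted edges of the gene network G
        diffs = beta[groups[k]][:, None] - beta[groups[l]][None, :]
        graph_term += w * np.sum(diffs ** 2)    # complete bipartite coupling of SNP pairs
    return 2 * lam1 * group_term + 2 * lam2 * l1_term + mu * graph_term

# toy example: 5 SNPs in 2 genes connected by a single edge of weight 0.5
beta = np.array([0.3, -0.1, 0.0, 0.2, 0.4])
groups = [np.array([0, 1, 2]), np.array([3, 4])]
edges = [(0, 1, 0.5)]
print(gggl_penalty(beta, groups, edges, lam1=1.0, lam2=0.1, mu=0.01))
```

The second variant (2) would instead couple the group averages, replacing the double sum over SNP pairs with $w_{kl}(\bar{\beta}_{R_k} - \bar{\beta}_{R_l})^2$.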

In summary, we propose a sparse regression model, the graph-guided group lasso, for GWA studies that allows to incorporate prior knowledge at two different levels. We describe a computationally efficient estimation algorithm for both versions of the model, which is based on coordinate descent methods. We also carry out extensive power studies using realistically simulated data, and compare the proposed model to the original group lasso [6] and a regression model with a network-constrained penalty [4]. Finally, we present a real application to detect genetic effects associated to lipids levels in two Asian cohorts.

References

[1] Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58:267-288, 1996.

[2] Wu et al. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714-721, 2009.

[3] Zhou et al. Association screening of common and rare genetic variants by penalized regression. Bioinformatics, 26(19):2375-2382, 2010.

[4] Li and Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175-1182, 2008.

[5] Silver and Montana. Fast identification of biological pathways associated with a quantitative trait using group lasso with overlaps. Statistical Applications in Genetics and Molecular Biology, 11(1), article 7, 2012.

[6] Yuan and Lin. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68(1):49-67, 2006.

[7] Friedman et al. A note on the group lasso and a sparse group lasso. arXiv:1001.0736, 2010.

[8] Jacob et al. Group Lasso with overlap and graph Lasso. International Conference on Machine Learning (ICML 26), 2009.


Feature Selection via Detecting Ineffective Features

Kris De Brabanter

Dep. Electrical Engineering

Kasteelpark Arenberg 10, 3001 Leuven

Katholieke Universiteit Leuven

[email protected]

Laszlo Gyorfi

Dep. Computer Science & Information Theory

Magyar Tudosok korutja 2., Budapest

Budapest University of Technology and Economics

[email protected]

Abstract: Consider the regression problem with a response variable $Y$ and with a feature vector $X$. For the regression function $m(x) = \mathbb{E}\{Y \mid X = x\}$, we introduce a new and simple estimator of the minimum mean squared error $L^* = \mathbb{E}\{(Y - m(X))^2\}$. Let $X^{(-k)}$ be the feature vector, in which the $k$-th component of $X$ is missing. In this paper we analyze a nonparametric test for the hypothesis that the $k$-th component is ineffective, i.e., $\mathbb{E}\{Y \mid X\} = \mathbb{E}\{Y \mid X^{(-k)}\}$ a.s.

Keywords: feature selection, minimum mean squared error, hypothesis test

1 Introduction

Let the label $Y$ be a real valued random variable and let the feature vector $X = (X_1, \ldots, X_d)$ be a $d$-dimensional random vector. The regression function $m$ is defined by

$$m(x) = \mathbb{E}\{Y \mid X = x\}.$$

The minimum mean squared error, also called the variance of the residual $Y - m(X)$, is denoted by

$$L^* := \mathbb{E}\{(Y - m(X))^2\} = \min_f \mathbb{E}\{(Y - f(X))^2\}.$$

The regression function $m$ and the minimum mean squared error $L^*$ cannot be calculated when the distribution of $(X, Y)$ is unknown. Assume, however, that we observe data

$$D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$$

consisting of independent and identically distributed copies of $(X, Y)$. $D_n$ can be used to produce an estimate of $L^*$. Nonparametric estimates of the minimum mean squared error are given in [2, 4].

2 New estimate of the minimum mean squared error

One can derive a new and simple estimator of L^* by considering the definition

L^* = E(Y − m(X))^2 = E[Y^2] − E[m(X)^2].   (1)

The first and second terms on the right-hand side of (1) can be estimated by (1/n) Σ_{i=1}^n Y_i^2 and (1/n) Σ_{i=1}^n Y_i Y_{n,i,1}, respectively, where Y_{n,i,1} denotes the label of the first nearest neighbor of X_i among X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n. Therefore, the minimum mean squared error L^* can be estimated by

L_n := (1/n) Σ_{i=1}^n Y_i^2 − (1/n) Σ_{i=1}^n Y_i Y_{n,i,1}.   (2)

One can show without any conditions that L_n → L^* a.s. Moreover, for bounded |Y| and ‖X‖, for Lipschitz continuous m, and for d ≥ 2, we have (cf. [3])

E|L_n − L^*| ≤ c_1 n^{−1/2} + c_2 n^{−2/d}.
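As a concrete illustration (ours, not part of the original abstract), the estimator (2) can be computed with a standard nearest-neighbor search; the helper names and the use of scikit-learn are our own choices, and the simulated data mimic the first example of Section 4.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def minimum_mse_estimate(X, Y):
        """Estimate L* via Eq. (2): mean(Y_i^2) - mean(Y_i * Y_{n,i,1})."""
        _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
        Y_nn = Y[idx[:, 1]]          # idx[:, 0] is the point itself, so take the second column
        return np.mean(Y ** 2) - np.mean(Y * Y_nn)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(1000, 4))
    Y = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 3]) + rng.normal(scale=0.1, size=1000)
    print(minimum_mse_estimate(X, Y))   # should be close to the noise variance 0.01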

3 Feature Selection and Hypothesis Test

One way of performing feature selection would be to detect ineffective components of the feature vector. Let X^(−k) = (X_1, . . . , X_{k−1}, X_{k+1}, . . . , X_d) be the (d − 1)-dimensional feature vector obtained by leaving out the k-th component of X. Then the corresponding minimum error is

L^*_(−k) := E[(Y − E[Y | X^(−k)])^2].

We want to test the following (null) hypothesis:

H_k : L^*_(−k) = L^*,

which means that leaving out the k-th component does not increase the minimum mean squared error. The hypothesis H_k means that

m(X) = E[Y | X] = E[Y | X^(−k)] =: m^(−k)(X^(−k))   a.s.

By using the data

D_n^(−k) = {(X_1^(−k), Y_1), . . . , (X_n^(−k), Y_n)},


L^*_(−k) can be estimated by

L_n^(−k) := (1/n) Σ_{i=1}^n Y_i^2 − (1/n) Σ_{i=1}^n Y_i Y_{n,i,1}^(−k),

so the corresponding test statistic is

L_n^(−k) − L_n = (1/n) Σ_{i=1}^n Y_i (Y_{n,i,1} − Y_{n,i,1}^(−k)).

We can accept the hypothesis H_k if L_n^(−k) − L_n is "close" to zero. Since with large probability the first nearest neighbors of X_i and of X_i^(−k) are the same, many terms Y_{n,i,1} − Y_{n,i,1}^(−k) in the test statistic are zero. We know that P(Y_{n,i,1} = Y_{n,i,1}^(−k)) is decreasing as n increases (while d remains fixed) and, vice versa, increasing as d increases (while n remains fixed). Hence, this test statistic is small even when the hypothesis H_k is not true.

To correct for this problem we modify the test statistic: put

(Ỹ_{n,i,1}, Ỹ_{n,i,1}^(−k)) = (Y_{n,i,1}, Y_{n,i,1}^(−k))   if Y_{n,i,1} ≠ Y_{n,i,1}^(−k),

and

(Ỹ_{n,i,1}, Ỹ_{n,i,1}^(−k)) = I_i (Y_{n,i,2}, Y_{n,i,1}^(−k)) + (1 − I_i)(Y_{n,i,1}, Y_{n,i,2}^(−k))

otherwise (where Y_{n,i,2} denotes the label of the second nearest neighbor of X_i among X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n), with

I_i = 0 with probability 1/2,  1 with probability 1/2,

yielding the modified test statistic

L_n^(−k) − L_n = (1/n) Σ_{i=1}^n Y_i (Ỹ_{n,i,1} − Ỹ_{n,i,1}^(−k)).

As in classical hypothesis testing, we need to find the limit distribution of the test statistic. The main difficulty here is that L_n^(−k) − L_n is an average of dependent random variables. However, this dependence has a special property: exchangeability. Based on a central limit theorem for exchangeable arrays [1], we can show the following result.

Theorem 1  Under the conditions of [1, Theorem 2], we have that

√n (L_n^(−k) − L_n) →_d N(0, 2 L^* E[Y^2])

under the null hypothesis H_k.

In the above theorem, L^* and E[Y^2] can be estimated by (2) and (1/n) Σ_{i=1}^n Y_i^2, respectively. Note that such a result is quite surprising, since under H_k the smoothness of the regression function m and the dimension d do not play a role.
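For illustration only (ours, not from the abstract), the studentized statistic of Theorem 1 could be computed as follows; for simplicity this sketch uses the unmodified statistic L_n^(−k) − L_n and omits the second-nearest-neighbor tie-breaking described above, so it is only a rough approximation of the actual test.

    import numpy as np
    from scipy.stats import norm
    from sklearn.neighbors import NearestNeighbors

    def first_nn_labels(X, Y):
        # Label of the first nearest neighbor of X_i among the remaining points.
        _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
        return Y[idx[:, 1]]

    def ineffectiveness_test(X, Y, k):
        n = len(Y)
        Y_nn = first_nn_labels(X, Y)
        Y_nn_k = first_nn_labels(np.delete(X, k, axis=1), Y)   # leave out the k-th feature
        stat = np.sqrt(n) * np.mean(Y * (Y_nn - Y_nn_k))       # sqrt(n) (L_n^(-k) - L_n)
        L_n = max(np.mean(Y ** 2) - np.mean(Y * Y_nn), 1e-12)  # estimate of L*, Eq. (2)
        sigma2 = 2.0 * L_n * np.mean(Y ** 2)                   # limiting variance under H_k
        return stat, 2.0 * norm.sf(abs(stat) / np.sqrt(sigma2))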

4 Simulations

First, consider the following nonlinear function with 4 uniformly distributed inputs on [0, 1]^4 and n = 1,000: Y = sin(πX^(1)) cos(πX^(4)) + ε, with ε ∼ N(0, 0.1^2). Figure 1(a) illustrates the frequency with which the true subset, the true subset with one additional component, and the full subset were selected by the proposed test procedure over 1,000 runs. The significance level is set to 0.05.

Second, we experimentally verify Theorem 1 by means of the bootstrap (10,000 replications). Consider the following five-dimensional function with additive noise: Y = Σ_{i=1}^5 c_i X^(i) + ε, where c_1 = 0 and c_i = 1 for i = 2, . . . , 5. Let X be uniform on [0, 1]^5 and ε ∼ N(0, 0.05^2). Figure 1(b) shows the histogram of √n(L_n^(−k) − L_n) under the null hypothesis, i.e., H_k for k = 1. A Kolmogorov-Smirnov test confirms this result.

[Figure 1 appears here: panel (a) is a bar chart with y-axis "Number of times selected (%)" and categories Rest / True Subset / True Subset+1 / ALL; panel (b) is a density histogram.]

Fig. 1: (a) Illustration of the frequency of the true selected subset, the true subset with an additional component and the full subset. Rest denotes at least one component of the true subset is selected; (b) Density histogram of √n(L_n^(−k) − L_n) under the null hypothesis with the corresponding normal fit.

5 Conclusion

We have presented a simple nonparametric hypothesis test for detecting ineffective features. The simulations show the capability of the proposed methodology.

Acknowledgments

Kris De Brabanter is supported by an FWO fellowship grant. Laszlo Gyorfi was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013).

References

[1] J. R. Blum, H. Chernoff, M. Rosenblatt & H. Teicher. Central limit theorems for interchangeable processes. Canad. J. Math., 10:222-229, 1958.

[2] L. Devroye, D. Schafer, L. Gyorfi & H. Walk. The estimation problem of minimum mean squared error. Statistics and Decisions, 21(1):15-28, 2003.

[3] L. Devroye, P. Ferrario, L. Gyorfi & H. Walk. Strong universal consistent estimate of the minimum mean squared error. Submitted, 2013.

[4] E. Liitiainen, F. Corona & A. Lendasse. Residual variance estimation using a nearest neighbor statistic. J. Multivariate Anal., 101(4):811-823, 2010.


Sparse network-based models for patient classification using fMRI

Maria J. Rosa
Computer Science Department
University College London, UK
[email protected]

Liana Portugal
Computer Science Department
University College London, UK

John Shawe-Taylor
Computer Science Department
University College London, UK

Janaina Mourao-Miranda
Computer Science Department
University College London, UK

Abstract: Pattern recognition applied to whole-brain neuroimaging data, such as functional Magnetic Resonance Imaging (fMRI), has been successful at discriminating psychiatric patients from healthy subjects. However, predictive patterns obtained from whole-brain data are difficult to interpret in terms of the underlying neurobiology. As is generally accepted, most psychiatric disorders are brain connectivity disorders. Therefore, pattern recognition based on network models, in particular sparse models, should provide more scientific insight and potentially more powerful predictions than whole-brain approaches. Here, we build a novel sparse network-based discriminative modelling framework, based on Gaussian graphical models and L1-norm linear Support Vector Machines (SVM). This framework provides easier pattern interpretation, in terms of network changes, and we illustrate it by classifying patients with depression and controls, using fMRI data from a sad facial processing task.

Keywords: sparse models, graphical LASSO, L1-norm SVM, brain connectivity, fMRI

1 Introduction

Brain connectivity measures provide ways of assessing statistical relationships between signals from different brain regions [5]. These methods have revealed new insights into brain network function in general, and into network dysfunction in psychiatric disorders [1]. A way of measuring connectivity is by estimating the inverse covariance (iCOV) matrix between brain regions under sparsity constraints. The zero entries in this matrix correspond to conditional independence between regions (and missing links in a Gaussian graphical model). This approach has been used for classification tasks [3] but, to our knowledge, it has not yet been combined with sparse discriminative classifiers to provide a fully sparse predictive modelling framework.

Here, we build a novel connectivity-based discriminative framework combining the graphical Least Absolute Shrinkage and Selection Operator (gLASSO, [4]) and an L1-norm linear SVM. We illustrate our technique by classifying patients with depression and controls, using fMRI data from a sad faces task. The resulting patterns are easier to interpret than whole-brain and non-sparse ones, by revealing a small set of connections that best discriminate between the groups.

2 Sparse network-based predictive modelling framework

2.1 Data preprocessing
For each subject, fMRI images are motion corrected and coregistered to an MNI template¹. The images then undergo parcellation into p regions from a brain atlas. Regional mean time-series are estimated by averaging the fMRI signals over all spatial elements within each region. The pairwise inter-regional covariance matrix, Σ = [p × p], is then computed from the averaged time-series.

2.2 Sparse graphs and predictive model
We then use gLASSO [4] to estimate a sparse (via regularisation, not thresholding) iCOV matrix for each subject, Ω = Σ^{−1}. gLASSO tries to find the Ω that maximises the penalised Gaussian log-likelihood log det Ω − tr(ΣΩ) − λ‖Ω‖_1 using a coordinate descent optimisation procedure [4], where λ is the regularisation parameter. The lower triangular entries of the iCOV matrices are then vectorised, divided by their norm, and used as features in a linear SVM [6] for classification. The linear L1-norm SVM solves the following optimisation problem: min_w f(w) ≡ ‖w‖_1 + C Σ_{i=1}^k max(1 − y_i w^T x_i, 0)^2, where C > 0, k is the number of examples, x_i ∈ R^q

¹ Montreal Neurological Institute template.


are the feature vectors and y_i ∈ {−1, +1} the labels (e.g. control and patient). We use a leave-one-subject-per-group-out (LOSGO) cross-validation (CV) scheme (number of folds n_f = number of subjects in each group, n_s), with a nested LOSGO-CV (n_f = n_s − 1) to independently optimise the gLASSO regularisation parameter λ, via maximum likelihood, and the C-parameter of the SVM.
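To make the two building blocks concrete, a minimal scikit-learn sketch (ours, not the authors' code) could chain a graphical-lasso estimate of the sparse inverse covariance with an L1-penalised linear SVM using the squared hinge loss; the helper names and the fixed λ and C values are our own assumptions, whereas in the actual framework both parameters are tuned by nested LOSGO cross-validation.

    import numpy as np
    from sklearn.covariance import GraphicalLasso
    from sklearn.svm import LinearSVC

    def connectivity_features(timeseries, lam=0.1):
        """timeseries: (T, p) array of regional mean time-series for one subject."""
        omega = GraphicalLasso(alpha=lam).fit(timeseries).precision_   # sparse iCOV matrix
        feats = omega[np.tril_indices_from(omega, k=-1)]               # lower-triangular entries
        return feats / np.linalg.norm(feats)

    def fit_sparse_network_classifier(subject_timeseries, y, lam=0.1, C=1.0):
        """subject_timeseries: list of (T, p) arrays; y: labels in {-1, +1}."""
        F = np.vstack([connectivity_features(ts, lam) for ts in subject_timeseries])
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C).fit(F, y)
        return clf   # clf.coef_ is the sparse weight vector over connections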

2.3 Pattern interpretation
The weight vector w is sparse, and its elements correspond to brain connections. Because each CV fold yields a different weight vector, we calculate how often each feature was selected across the repetitions. The connections that were selected in at least half of the folds are referred to as the most discriminative set.

3 Experiments

We use the fMRI data of [2], from 19 medication-free patients with depression and 19 healthy volunteers. The paradigm involved implicit processing of sad faces of different emotional intensity (low, medium, and high). In addition to the above preprocessing, the data were smoothed in space using an 8 mm Gaussian filter. The images were parcellated into 137 regions: 122 from the BSA atlas² and 15 from the Harvard-Oxford atlas³. Our sparse network model correctly classified 74% of patients and 89% of controls, corresponding to an accuracy of 82% (p-value < 0.05, permutation test with 1000 samples, Table 1). The inverse covariances were on average 83% sparse (total number of connections = 9316), while the most discriminative set of connections was 99% sparse and included: amygdala and hippocampus with temporal cortex; cingulate cortex with frontal cortex and thalamus; amygdala with precentral cortex; insula with frontal cortex (Figure 1).

Tab. 1: Accuracies of the sparse network-based framework and of an L2-norm SVM based whole-brain (WB) approach on the same data [2]. WB (betas) uses the WB coefficients of a general linear model from a univariate analysis for classification. The * denotes a p-value < 0.05.

Model             Sens. (%)   Spec. (%)   Acc. (%)
Network-based     74          89          82*
WB [2]            53          58          56
WB (betas) [2]    72          82          77*

4 Discussion

Our results show that it is possible to discriminate patients with depression from controls based on sparse network-based predictive models. Compared to whole-brain analyses on the same data (using all emotional stimuli) [2], our approach provided higher accuracy,

² http://lnao.lixium.fr/spip.php?article=229
³ http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases

Fig. 1: Sparse networks that best discriminate patients with depression from controls. The connections shown were selected in at least half of the cross-validation folds. The size of each region corresponds to the number of connected nodes.

with the advantage of easier pattern interpretation, which aids the development of interpretable diagnostic tools for psychiatry. The most discriminative features are consistent with the literature [1], and highlight differences between the groups in circuitry associated with emotional regulation. Future work includes better pattern interpretation and validation with other datasets.

Acknowledgments
This work was funded by the Wellcome Trust (UK) [WT086565/Z/08/Z].

References

[1] A. Stuhrmann et al. Facial emotion processing in major depression: a systematic review of neuroimaging findings. Biol Mood Anxiety Disord, 1(1):10, 2011.

[2] C. H. Fu et al. Pattern classification of sad facial processing: toward the development of neurobiological markers in depression. Biol. Psychiatry, 63(7):656-662, 2008.

[3] G. Cecchi et al. Discriminative network models of schizophrenia. In Adv. in Neural Inform. Proc. Syst. 22, pages 252-260, 2009.

[4] J. Friedman et al. Sparse inverse covariance estimation with the graphical LASSO. Biostatistics, 9(3):432-441, 2008.

[5] K. Li et al. Review of methods for functional brain connectivity detection using fMRI. Comp. Med. Im. and Graph., 33(2):131, 2009.

[6] R. Fan et al. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.


Incremental Forward Stagewise Regression: Computational Complexity and Connections to LASSO

Robert M. Freund
MIT Sloan School of Management

[email protected]

Paul Grigas, Rahul Mazumder
MIT Operations Research Center
pgrigas, [email protected]

Abstract: Incremental Forward Stagewise Regression (FSε) is a statistical algorithm that produces sparse coefficient profiles for linear regression. Using the tools of first-order methods in convex optimization, we analyze the computational complexity of FSε and its flexible variants with adaptive shrinkage parameters. We also show that a simple modification to FSε yields an O(1/k) convergent algorithm for the least squares LASSO fit for any regularization parameter and any data-set — thereby quantitatively characterizing the nature of regularization implicitly induced by FSε.

Keywords: Incremental Forward Stagewise Regression, first-order methods, sparsity, ℓ1 regularization

1 Introduction

Using tools from first-order methods in convex optimization, this paper establishes connections between sparse ℓ1 regression [4], the Incremental Forward Stagewise (Boosting) Algorithm [10] and the notion of regularization in boosting [1], and presents new computational guarantees for these methods. We consider the linear regression model y = Xβ + e, with response y ∈ R^n, model matrix X ∈ R^{n×p}, regression coefficients β ∈ R^p and errors e ∈ R^n. In the high-dimensional statistical regime, especially with p ≫ n, a sparse linear model with few non-zero coefficients is often desirable. In this context ℓ1 penalized regression, i.e., LASSO [4]:

minimize_β  (1/2)‖y − Xβ‖_2^2   subject to  ‖β‖_1 ≤ δ   (1)

is often used to perform variable selection and shrinkage in the coefficients — and is known to yield models with good predictive performance.

Incremental Forward Stagewise: The Incremental Forward Stagewise algorithm [1, 10] (FSε for short) with shrinkage factor ε is a type of boosting algorithm for the linear regression problem. It generates a coefficient profile¹ by repeatedly updating (by a small amount ε) the coefficient of the variable most correlated with the current residuals. The algorithm is initialized with residual r^0 = y and β^0 = 0, and updates at iteration k as follows:

Compute j_k ∈ arg max_{j∈{1,...,p}} |(r^k)^T X_j| and update:

β^{k+1}_{j_k} ← β^k_{j_k} + ε sgn((r^k)^T X_{j_k})   (2)

r^{k+1} ← r^k − ε sgn((r^k)^T X_{j_k}) X_{j_k}   (3)

where β^k_{j_k} is the j_k-th coordinate of β^k. Due to the update scheme (2), FSε has the following desirable sparsity properties²:

‖β^k‖_1 ≤ kε  and  ‖β^k‖_0 ≤ k.   (4)

¹A coefficient profile is a path of coefficients {β(α)}, where α parametrizes the path. In the context of FSε, α indexes the ℓ1 arc-length of the coefficients.

In the presence of noise, and specifically in the high-dimensional regime, it is desirable to have a regularized solution to obtain a proper bias-variance tradeoff. A principal reason why FSε (and other boosting algorithms in general) is attractive from a statistical viewpoint is its ability to deliver regularized solutions (4) by controlling the number of iterations along with the shrinkage parameter. Different choices of ε lead to different algorithms: the choice ε = |(r^k)^T X_{j_k}| in (2)-(3) yields the Forward Stagewise algorithm (FS) [1] — which is a greedy version of best-subset selection. It is therefore natural to ask: what criterion does the FSε algorithm optimize, and what are its computational guarantees? Furthermore, since FSε produces a coefficient profile with implicit regularization, is it possible to characterize it via a constrained least squares fit (e.g. LASSO)? To the best of our knowledge, simple and complete answers to the above questions were heretofore unknown, apart from some special cases. It is known that Infinitesimal Incremental Forward Stagewise Regression (FS_0), the limit of FSε as ε → 0^+, is the solution to a complicated differential equation [1, 10] and is in general different from the LASSO coefficient profile.

2For a vector x, ‖x‖0 counts the number of non-zero entries.
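As an illustration only (ours, not the authors' code), the update scheme (2)-(3) translates directly into a few lines of NumPy; the function and variable names are chosen for readability.

    import numpy as np

    def incremental_forward_stagewise(X, y, eps=0.01, num_iters=1000):
        """FS_eps: repeatedly nudge the coefficient most correlated with the residual."""
        n, p = X.shape
        beta = np.zeros(p)
        r = y.copy()                                  # r^0 = y
        for _ in range(num_iters):
            corr = X.T @ r
            j = np.argmax(np.abs(corr))               # j_k = argmax_j |(r^k)^T X_j|
            s = np.sign(corr[j])
            beta[j] += eps * s                        # update (2)
            r -= eps * s * X[:, j]                    # update (3)
        return beta                                   # satisfies ||beta||_1 <= num_iters * eps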


Our Contributions: We derive novel complexity bounds for FSε and its flexible variants. We also show that a simple modification to FSε yields an O(1/k) convergent algorithm for LASSO for any (y, X, δ).

2 FSε as Subgradient Descent

Consider the non-smooth convex optimization problem:

minimize_{r∈P_res}  f(r) := ‖X^T r‖_∞   (5)

where P_res := {r ∈ R^n : r = y − Xβ for some β ∈ R^p} is the set of residuals. Note that (5) has optimal objective value f^* = 0. We establish the following connection.

Theorem 2.1. The FSε algorithm is an instance of the Subgradient Descent Method to solve problem (5), initialized at r^0 = y and with a constant step-size of ε at each iteration.

Applying the computational complexity tools for the Subgradient Descent Method [3, 6], it is straightforward to establish the following computational guarantees.

Corollary 2.1. (Complexity of FSε) For the constant shrinkage factor ε it holds that:

min_{0≤i≤k} ‖X^T r^i‖_∞ ≤ ‖y‖_2^2 / (2ε(k + 1)) + ε‖X‖_{1,2}^2 / 2.   (6)

If we a priori decide to run FSε for k iterations and set ε := ‖y‖_2 / (‖X‖_{1,2} √(k + 1)), then³

min_{0≤i≤k} ‖X^T r^i‖_∞ ≤ ‖X‖_{1,2} ‖y‖_2 / √(k + 1).   (7)

If instead the shrinkage factor is dynamically chosen as ε = ε_k := |(r^k)^T X_{j_k}| / ‖X_{j_k}‖_2^2, then the bound (7) holds for all values of k without having to set k a priori.

3 A modified FSε for the LASSO

Suppose we modify (5) by adding a regularizing term to the objective function as follows:

f^*_δ = min_{r∈P_res}  f_δ(r) := ‖X^T r‖_∞ + (1/(2δ))‖r − y‖_2^2,   (8)

for some parameter δ > 0. We show that a rescaled version of the above problem is a dual of the LASSO (1), thereby demonstrating that one way in which the LASSO arises is as a quadratic regularization of (5). One method to solve (1) that has desirable sparsity properties similar to (4) is the Frank-Wolfe method (also known as the conditional gradient method [3]). The Frank-Wolfe method with step-size sequence {α_k}, α_k ∈ (0, 1), applied to problem (1) is initialized with residual r^0 = y and β^0 = 0, and updates at iteration k as follows:

Compute j_k ∈ arg max_{j∈{1,...,p}} |(r^k)^T X_j| and update:

β^{k+1} ← (1 − α_k) β^k + α_k δ sgn((r^k)^T X_{j_k}) e_{j_k}   (9)

r^{k+1} ← (1 − α_k) r^k + α_k (y − δ sgn((r^k)^T X_{j_k}) X_{j_k})   (10)

For the fixed step-size α_k := ε/(δ + ε), observe that (9) can be rearranged to:

β^{k+1} ← (δ/(ε + δ)) [β^k + ε sgn((r^k)^T X_{j_k}) e_{j_k}]   (11)

which is equivalent to the FSε update (2) modulo a multiplicative factor which keeps the coefficient profile within {β : ‖β‖_1 ≤ δ}. Using complexity guarantees concerning duality gaps for Frank-Wolfe, we establish the following computational complexity bounds that apply simultaneously to both problems (1) and (8).

³Here, ‖X‖_{1,2} is the largest ℓ2 norm of the columns of X.

Theorem 3.1. (Complexity of modified FSε) Suppose that we run the Frank-Wolfe method to solve problem (1) (with optimal objective function value L^*_δ), with step-size α_i = 2/(i + 2). Then, after k ≥ 1 iterations, there exists an iteration i ∈ {1, . . . , k} for which:

(1/2)‖y − Xβ^i‖_2^2 − L^*_δ ≤ 17.4 ‖X‖_{1,2}^2 δ^2 / k

‖X^T(y − Xβ^i)‖_∞ + (1/(2δ))‖Xβ^i‖_2^2 − f^*_δ ≤ 17.4 ‖X‖_{1,2}^2 δ / k

‖X^T(y − Xβ^i)‖_∞ ≤ (1/(2δ))‖y‖_2^2 + 17.4 ‖X‖_{1,2}^2 δ / k,

in addition to satisfying ‖β^i‖_1 ≤ δ, ‖β^i‖_0 ≤ i. If instead we fix the number of iterations k ≥ 1 a priori, set α_0 := 1 and then use an appropriately chosen constant step-size (11) for k ≥ 1, then the O(1/k) terms in the above inequalities become O(log k / k) terms.
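For illustration (ours, not the authors' code), the Frank-Wolfe updates (9)-(10) for the LASSO problem (1), with the step-size α_i = 2/(i + 2) used in Theorem 3.1, can be sketched in NumPy as follows.

    import numpy as np

    def frank_wolfe_lasso(X, y, delta, num_iters=1000):
        """Frank-Wolfe for: min 0.5*||y - X beta||^2  subject to  ||beta||_1 <= delta."""
        n, p = X.shape
        beta = np.zeros(p)
        for k in range(num_iters):
            r = y - X @ beta                             # current residual r^k
            corr = X.T @ r
            j = np.argmax(np.abs(corr))
            vertex = np.zeros(p)
            vertex[j] = delta * np.sign(corr[j])         # extreme point of the l1 ball
            alpha = 2.0 / (k + 2.0)                      # step-size of Theorem 3.1
            beta = (1 - alpha) * beta + alpha * vertex   # update (9)
        return beta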

4 Extensions and Conclusions

Many of the results of this paper extend to the problem of boosting. Moreover, all of the methods considered in this paper fall into the broad family of dual averaging methods to solve (5) [2], which immediately yield bounds for (5). Additional references and comments can be found in the Supplementary Materials, Section A.

References

[1] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Prediction, Inference and Data Mining. 2nd edition, Springer Verlag, New York, 2009.

[2] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

[3] B. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, 1987.

[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267-288, 1996.


Fixed-Size Pegasos for Large Scale Pinball Loss SVM

Vilen Jumutc
KU Leuven, ESAT-SISTA
Kasteelpark Arenberg 10, B-3001, Leuven, Belgium
[email protected]

Xiaolin Huang
KU Leuven, ESAT-SISTA
Kasteelpark Arenberg 10, B-3001, Leuven, Belgium

Johan A.K. Suykens
KU Leuven, ESAT-SISTA
Kasteelpark Arenberg 10, B-3001, Leuven, Belgium

Abstract: Here we address the problem of learning a pinball loss SVM model. We utilize a modified Pegasos algorithm for large scale problems. The approach is extended to linearly non-separable cases via the Nystrom approximation and the Fixed-Size approach. We present an outline of the improved Pegasos algorithm and a complete learning procedure within the Fixed-Size setting.

Keywords: Stochastic gradient method, Pinball loss SVM, Fixed-Size approach.

1 Introduction

Recent research in linear Support Vector Machines (SVM) [1, 4, 8] has justified the importance of first order stochastic approaches in bringing these machine learning techniques to large scale. Computation of the full gradient sometimes might not be feasible, while the stochastic approximation [7] to the original optimization problem only to some degree increases the number of iterations needed to converge¹. Here we consider another aspect of learning SVM models in the primal and in particular turn our attention to using another loss function.

Pegasos [8] has become a widely acknowledged algorithm for learning linear SVMs and has attracted research interest because of its strongly convex optimization objective and better convergence bounds. Pegasos utilizes the hinge loss, which replaces the original linear constraints while making the SVM objective unconstrained. With the proper projection step Pegasos achieves a solution of accuracy ε in O(R^2/(λε)) iterations, where λ is the regularization parameter and R is the radius of the smallest ball containing all training samples. We would like to stress the fact that the hinge loss plays an important but not essential role in establishing the results of Pegasos.

Here we enrich the class of loss functions applicable to Pegasos with the pinball loss [3]. We show some advantages and potential strengths of using the pinball loss within the Pegasos framework. We apply Pegasos together with the Fixed-Size approach [10, 11] to achieve better classification accuracy and to extend the method to the nonlinear case.

¹We should note that without any stronger assumptions about our data the stochastic gradient approach delivers a sublinear convergence rate, compared to the linear rate of the full gradient approach.

2 Pinball loss SVM

The sensitivity to noise or the instability with respect to re-sampling comes from the fact that in the hinge loss SVM, the distance between two sets is measured by the nearest points. Hence, one way to overcome this weak point is to change the definition of the distance between two sets. For example, if we use the distance of the nearest 30% of points to measure the distance between two sets, the results are less sensitive. Such a distance is a kind of quantile value, which is closely related to the pinball loss L_τ defined by

L_τ(w; (x, y)) = 1 − y⟨w, x⟩,      if y⟨w, x⟩ ≤ 1,
                 τ(y⟨w, x⟩ − 1),   if y⟨w, x⟩ > 1,   (1)

where the reasonable range of τ is [0, 1] as explained in [3], and the related SVM decision function is defined by f(x) = ⟨w, x⟩. The pinball loss L_τ has been applied to quantile regression, see, e.g., [6], [2] and [9]. Motivated by the relationship between pinball loss and quantile value, we proposed the following pinball loss SVM in [3]:

min_w  (λ/2)‖w‖^2 + (1/m) Σ_{(x,y)∈S} L_τ(w; (x, y)).   (2)

The hinge loss is a special case of the pinball loss in Eq. (1) with τ = 0. Accordingly, the pinball loss SVM in Eq. (2) is an extension of the hinge loss SVM. It has been shown in [3] that the pinball and hinge loss SVMs have similar computational complexity and consistency properties. Besides, the result of the pinball loss SVM is less sensitive to noise around the boundary.
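For concreteness, Eq. (1) translates directly into the following small NumPy function (our own illustration, not code from the paper).

    import numpy as np

    def pinball_loss(w, x, y, tau):
        """L_tau(w; (x, y)) from Eq. (1); tau = 0 recovers the hinge loss."""
        margin = y * np.dot(w, x)
        return 1.0 - margin if margin <= 1.0 else tau * (margin - 1.0)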

3 Pegasos with Pinball Loss

3.1 Outline of the algorithm
In Algorithm 1 we can see a major "for" loop, where the gradient and projection steps take place, and a


Algorithm 1: Pegasos with pinball loss
  Data: S, λ, τ, T, k, ε
  Select w_1 randomly s.t. ‖w_1‖ ≤ 1/√λ
  for t = 1 → T do
    Set η_t = 1/(λt)
    Select A_t ⊆ S, where |A_t| = k
    ρ = (1/|S|) Σ_{(x,y)∈A_t} (y − ⟨w_t, x⟩)
    A_t^+ = {(x, y) ∈ A_t : y(⟨w_t, x⟩ + ρ) < 1}
    A_t^− = {(x, y) ∈ A_t : y(⟨w_t, x⟩ + ρ) > 1}
    w_{t+1/2} = w_t − η_t (λ w_t − (1/k)[Σ_{(x,y)∈A_t^+} y x − Σ_{(x,y)∈A_t^−} τ y x])
    w_{t+1} = min{1, (1/√λ) / ‖w_{t+1/2}‖} · w_{t+1/2}
    if ‖w_{t+1} − w_t‖ ≤ ε then
      return (w_{t+1}, (1/|S|) Σ_{(x,y)∈S} (y − ⟨w_t, x⟩))
    end
  end
  return (w_{T+1}, (1/|S|) Σ_{(x,y)∈S} (y − ⟨w_t, x⟩))

minor "if" condition which terminates execution if the norm of the difference of two subsequent w vectors is less than ε. In Algorithm 1 we denote the whole dataset by S and at each iteration randomly select k samples for the computation of the subgradient

∇_t = λ w_t − (1/|A_t|) [ Σ_{(x,y)∈A_t^+} y x − Σ_{(x,y)∈A_t^−} τ y x ],   (3)

where ∇_t additionally depends on the τ parameter of the pinball loss, A_t^− stands for the subset of A_t where y⟨w, x⟩ > 1, and A_t^+ is the reciprocal subset where y⟨w, x⟩ < 1. For additional details and supplementary analysis we refer to our paper [5]. Another important issue is related to the computation of the bias term. We should emphasize that the bias term ρ is not part of our instantaneous optimization objective; we compute it only to return the convenient and ubiquitous representation of the SVM decision function y = sign(⟨w, x⟩ + ρ), where ρ is returned in the two return statements of Algorithm 1 together with the w vector.

3.2 Complete procedure
In Algorithm 2 the "PegasosPBL" function stands for the shortcut of Algorithm 1. The "ComputeNystromApprox" function denotes the Fixed-Size part, where we first compute an m × m RBF kernel matrix² from the data points found by the maximization of Renyi entropy in the "FindActiveSet" function, and then we apply the Nystrom approximation

Φ_i(x) = (1/√λ_{i,m}) Σ_{t=1}^m u_{ti,m} k(x_t, x),   (4)

where λ_{i,m} and u_{i,m} denote the i-th eigenvalue and the i-th eigenvector of the RBF kernel matrix K, used to derive our approximate feature map Φ(x). Finally we stack our explicit feature vectors in the matrix X and proceed to the function "PegasosPBL".

Algorithm 2: Fixed-Size Pegasos with pinball loss
  input:  training data S with |S| = n, labeling Y, parameters λ, τ, T, k, ε, m
  output: mapping Φ(x), ∀x ∈ S, SVM model given by w and ρ
  begin
    S_r ← FindActiveSet(S, m);
    Φ(x) ← ComputeNystromApprox(S_r);
    X ← [Φ(x_1)^T, . . . , Φ(x_n)^T];
    [w, ρ] ← PegasosPBL(X, Y, λ, τ, T, k, ε);
  end

²We assume m ≪ |S|.
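The Nystrom feature map (4) can be sketched as follows (our own NumPy illustration; the RBF bandwidth, the eigenvalue cut-off and the variable names are arbitrary choices).

    import numpy as np

    def rbf_kernel(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / sigma ** 2)

    def nystrom_features(S_r, X, sigma):
        """Map all points in X to explicit features using the m active-set points S_r."""
        K_mm = rbf_kernel(S_r, S_r, sigma)
        lam, U = np.linalg.eigh(K_mm)                     # eigenpairs of the m x m kernel matrix
        keep = lam > 1e-10                                # drop numerically zero eigenvalues
        K_nm = rbf_kernel(X, S_r, sigma)                  # kernel between data and active set
        return (K_nm @ U[:, keep]) / np.sqrt(lam[keep])   # Eq. (4), one column per feature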

Acknowledgments
This work was supported by Research Council KUL, ERC AdG A-DATADRIVE-B, GOA/10/09 MaNet, CoE EF/05/006, FWO G.0588.09, G.0377.12, SBO POM, IUAP P6/04 DYSCO, COST intelliCIS.

References

[1] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155-1178, 2007.

[2] A. Christmann and I. Steinwart. How SVMs can estimate quantiles and the median. In NIPS, pages 305-312, 2007.

[3] X. Huang, L. Shi, and J. A. K. Suykens. Support vector machine classifier with pinball loss. Technical Report KUL-12-162, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Leuven, 2012. Available: http://homes.esat.kuleuven.be/~sistawww/cgi-bin/newsearch.pl?Name=Huang+X

[4] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '06, pages 217-226, New York, NY, USA, 2006. ACM.

[5] V. Jumutc, X. Huang, and J. A. K. Suykens. Fixed-size Pegasos for hinge and pinball loss SVM. Technical Report KUL-13-31, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Leuven, 2013.

[6] R. Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.

[7] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574-1609, January 2009.

[8] S. Shalev-Shwartz, Y. Singer and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th international conference on Machine learning, ICML '07, pages 807-814, New York, NY, USA, 2007.

[9] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211-225, 2011.

[10] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

[11] C. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682-688. MIT Press, 2001.


Output Kernel Learning Methods

Francesco Dinuzzo
MPI for Intelligent Systems
Tubingen, Germany
[email protected]

Cheng Soon Ong
NICTA
Melbourne, Australia
[email protected]

Kenji Fukumizu
Institute of Statistical Mathematics
Tachikawa, Tokyo, Japan
[email protected]

Abstract: A rather flexible approach to multi-task learning consists in solving a regularization problem where a suitable kernel is used to model joint relationships between both inputs and tasks. Since specifying an appropriate multi-task kernel in advance is not always possible, estimating one from the data is often desirable. Herein, we overview a class of techniques for learning a multi-task kernel that can be decomposed as the product of a kernel on the inputs and one on the task indices. The kernel on the task indices (output kernel) is optimized simultaneously with the predictive function by solving a joint two-level regularization problem.

Keywords: regularization, multi-task learning, kernel learning

1 Learning Multi-Task Kernels

Predictive performances of kernel-based regularization methods are highly influenced by the choice of the kernel function. Such influence is especially evident in the case of multi-task learning where, in addition to specifying input similarities, it is crucial to correctly model inter-task relationships. Designing the kernel allows one to incorporate domain knowledge by properly constraining the function class over which the solution is searched. Unfortunately, in many problems the available knowledge is not sufficient to uniquely determine a good kernel in advance, making it highly desirable to have data-driven automatic selection tools. This need has motivated a fruitful research stream which has led to the development of a variety of techniques for learning the kernel.

For a broad class of multi-task (or multi-output) learning problems, a kernel can be used to specify the joint relationships between inputs and tasks [1]. Generally, it is necessary to specify similarities of the form K((x_1, i), (x_2, j)) for every pair of input data (x_1, x_2) and every pair of task indices (i, j). However, a very common way to simplify modeling is to utilize a multiplicative decomposition of the form

K((x_1, i), (x_2, j)) = K_X(x_1, x_2) K_Y(i, j),

where the input kernel K_X is decoupled from the output kernel K_Y. The same structure can be equivalently represented in terms of a matrix-valued kernel

H(x_1, x_2) = K_X(x_1, x_2) · L,   (1)

where L is a symmetric and positive semidefinite matrix with entries L_ij = K_Y(i, j).
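As a small illustration (ours, not from the paper), for a finite set of inputs the decomposed multi-task kernel yields a joint Gram matrix that is simply a Kronecker product of the input Gram matrix with L.

    import numpy as np

    def multitask_gram(K_X, L):
        """Gram matrix of K((x1, i), (x2, j)) = K_X(x1, x2) * L[i, j].

        K_X: (n, n) input Gram matrix; L: (m, m) output kernel matrix.
        Rows/columns are ordered as (x_1, task_1), ..., (x_1, task_m), (x_2, task_1), ...
        """
        return np.kron(K_X, L)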

Even after imposing such a simplified model, specifying the inter-task similarities in advance may still be impractical. Indeed, it is often the case that multiple learning tasks are known to be related, but no precise information about the structure or the intensity of such relationships is available. Simply fixing L to the identity is clearly suboptimal, since it amounts to sharing no information between the tasks. On the other hand, wrongly specifying the entries may lead to a severe performance degradation. It is therefore clear that, whenever the task relationships are subject to uncertainty, learning them from the data is the only meaningful way to proceed.

The most widely developed approach to automatic kernel selection, known as Multiple Kernel Learning (MKL), consists in learning a conic combination of basis kernels of the form

K = Σ_{k=1}^N d_k K_k,   d_k ≥ 0.

Appealing properties of MKL methods include the ability to perform selection of a subset of kernels via sparsity, and the tractability of the associated optimization


problem, typically (re)formulated as a convex program. Apparently, the MKL approach can also be used to learn a multi-task kernel of the form

K((x_1, i), (x_2, j)) = Σ_{k=1}^N d_k K_X^k(x_1, x_2) K_Y^k(i, j),

which includes the possibility of optimizing the matrix L in (1) as a conic combination of basis matrices. In principle, proper complexity control allows one to combine an arbitrarily large, even infinite [2], number of kernels. However, computational and memory constraints force the user to specify a relatively small dictionary of basis kernels to be combined, which again calls for a certain amount of domain knowledge.

2 Output Kernel Learning

A more direct approach to synthesizing the output kernel from the data consists in solving a two-level regularization problem of the form

min_{L∈S_+} { min_{f∈H_L} Σ_i Σ_{j=1}^m V(y_ij, f_j(x_ij)) + ‖f‖^2_{H_L} + Ω(L) },

where V is a suitable loss function, H_L is the Reproducing Kernel Hilbert Space of vector-valued functions associated with the reproducing kernel (1), Ω is a suitable matrix regularizer, and S_+ is the cone of symmetric and positive semidefinite matrices. We call such an approach Output Kernel Learning (OKL).

A technique of this kind was introduced in [3] for the case where V is a square loss function, Ω is the squared Frobenius norm, and the input data x_ij are the same for all the output components f_j. Such special structure of the objective functional allows one to develop an effective block coordinate descent strategy where each step involves the solution of a Sylvester linear matrix equation. Regularizing the output kernel with a squared Frobenius norm leads to a simple and effective computational scheme. However, we may want to encourage different types of relationship structures in the output space. Along this line, [4] introduces low-rank OKL, a method to discover relevant low dimensional subspaces of the output space by learning a low-rank kernel matrix. This is achieved by regularizing the output kernel with a combination of the trace and a rank indicator function, namely

Ω(L) = tr(L) + I(rank(L) ≤ p).

For p = m, the hard-rank constraint disappears and Ω reduces to the trace norm which, as is well known, encourages low-rank solutions. Setting p < m gives up convexity of the regularizer but, on the other hand, allows one to set a hard bound on the rank of the output kernel, which can be useful for both computational and interpretative reasons. Low-rank OKL enjoys interesting properties and interpretations. Just as sparse MKL with a square loss can be seen as a nonlinear generalization of the (grouped) Lasso, low-rank OKL is a natural kernel-based generalization of reduced-rank regression, a popular multivariate technique in statistics.

For problems where the inputs x_ij are the same for all the tasks, optimization for low-rank OKL can be performed by means of a rather effective procedure that iteratively computes eigendecompositions. Importantly, the size of the involved matrices can be controlled by selecting the parameter p. Unfortunately, more general multi-task learning problems, where each task is sampled in correspondence with different inputs, require completely different methods. If a square loss is adopted, it turns out that an effective strategy to approach these problems consists in iteratively applying inexact Preconditioned Conjugate Gradient (PCG) solvers [5] to suitable linear operator equations that arise from the optimality conditions.
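To give a feel for the inner computation, here is a small sketch (ours, not the authors' code) of the coefficient step under the simplifying assumptions of [3] (square loss, all outputs sampled at the same inputs, output kernel L held fixed): with kernel matrix K and coefficient matrix C, the first-order conditions lead to the Sylvester-type system K C L + λ C = Y, which is solved here by brute-force vectorization; the full OKL procedure alternates such a step with an update of L.

    import numpy as np

    def okl_coefficient_step(K, L, Y, lam):
        """Solve K C L + lam * C = Y for the (n x m) coefficient matrix C."""
        n, m = Y.shape
        # vec(K C L) = (L kron K) vec(C) for symmetric L, using column-major vec.
        A = np.kron(L, K) + lam * np.eye(n * m)
        c = np.linalg.solve(A, Y.reshape(-1, order="F"))
        return c.reshape(n, m, order="F")

    # Predictions at the training inputs are then F = K @ C @ L.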

3 Concluding remarks and future directions

Learning output kernels via regularization is an effective way to solve multi-task learning problems where the relationships between the tasks are highly uncertain. The OKL framework that we have sketched in the previous section is rather general and can be developed in various directions. Effective optimization techniques for more general (non-quadratic) loss functions are still lacking, and the use of a variety of matrix penalties for the output kernel matrix is yet to be explored.

References

[1] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.

[2] A. Argyriou, C. A. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In Peter Auer and Ron Meir, editors, Learning Theory, volume 3559 of Lecture Notes in Computer Science, pages 338-352. Springer Berlin / Heidelberg, 2005.

[3] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In Proceedings of the 28th Annual International Conference on Machine Learning, Bellevue, WA, USA, 2011.

[4] F. Dinuzzo and K. Fukumizu. Learning low-rank output kernels. Journal of Machine Learning Research - Proceedings Track, 20:181-196, 2011.

[5] F. Dinuzzo. Learning output kernels for multi-task problems. Neurocomputing, (to appear), 2013.


Deep Support Vector Machines for Regression Problems

M.A. Wiering, M. Schutten, A. Millea, A. Meijster, and L.R.B. Schomaker
Institute of Artificial Intelligence and Cognitive Engineering
University of Groningen, the Netherlands
Contact e-mail: [email protected]

Abstract: In this paper we describe a novel extension of the support vector machine, called the deep support vector machine (DSVM). The original SVM has a single layer with kernel functions and is therefore a shallow model. The DSVM can use an arbitrary number of layers, in which lower-level layers contain support vector machines that learn to extract relevant features from the input patterns or from the extracted features of one layer below. The highest level SVM performs the actual prediction using the highest-level extracted features as inputs. The system is trained by a simple gradient ascent learning rule on a min-max formulation of the optimization problem. A two-layer DSVM is compared to the regular SVM on ten regression datasets and the results show that the DSVM outperforms the SVM.

Keywords: Support Vector Machines, Kernel Learning, Deep Architectures

1 Introduction

Machine learning algorithms are very useful for regression and classification problems. These algorithms learn to extract a predictive model from a dataset of examples containing input vectors and target outputs. Among all machine learning algorithms, one of the most popular methods is the SVM. SVMs have been used for many engineering applications such as object recognition, document classification, and different applications in bio-informatics, medicine and chemistry.

Limitations of the SVM. There are two important limitations of the standard SVM. The first one is that the standard SVM only has a single adjustable layer of model parameters. Instead of using such "shallow models", deep architectures are a promising alternative [4]. Furthermore, SVMs use a-priori chosen kernel functions to compute similarities between input vectors. A problem is that using the best kernel function is important, but kernel functions are not very flexible.

Related Work. Currently there is a lot of research in multi-kernel learning (MKL) [1, 5]. In MKL, different kernels are combined in a linear or non-linear way to create more powerful similarity functions for comparing input vectors. However, often only few parameters are adapted in the (non-linear) combination functions. In [2], another framework for two-layer kernel machines is described, but no experiments were performed in which both layers used non-linear kernels.

Contributions. We propose the deep SVM (DSVM), a novel algorithm that uses SVMs to learn to extract higher-level features from the input vectors, after which these features are given to the main SVM to do the actual prediction. The whole system is trained with simple gradient ascent and descent learning algorithms on the dual objective of the main SVM. The main SVM learns to maximize this objective, while the feature-layer SVMs learn to minimize it. Instead of adapting few kernel weights, we use large DSVM architectures, sometimes consisting of a hundred SVMs in the first layer. Still, the complexity of our DSVM scales only linearly with the number of SVMs compared to the standard SVM. Furthermore, the strong regularization power of the main SVM prevents overfitting.

2 The Deep Support Vector Machine

[Figure 1 appears here: a diagram with input nodes [x]_1, . . . , [x]_D feeding three feature-extracting SVMs S_1, S_2, S_3, whose outputs feed the main SVM M, producing f(x).]

Fig. 1: Architecture of a two-layer DSVM. In this example, the feature layer consists of three SVMs S_a.

We use regression datasets: {(x_1, y_1), . . . , (x_ℓ, y_ℓ)}, where x_i are input vectors and y_i are the target outputs. The architecture of a two-layer DSVM is shown in Figure 1. First, it contains an input layer of D inputs. Then, there are a total of d pseudo-randomly initialized SVMs S_a, each one learning to extract one feature f(x)_a from an input pattern x. Finally, there is the main support vector machine M that approximates the target function using the extracted feature vector as


input. For computing the feature-layer representation f(x) of input vector x, we use:

f(x)_a = Σ_{i=1}^ℓ (α*_i(a) − α_i(a)) K(x_i, x) + b_a,

which iteratively computes each element f(x)_a. In this equation, α*_i(a) and α_i(a) are the SVM coefficients for SVM S_a, b_a is its bias, and K(·,·) is a kernel function. For computing the output of the whole system, we use:

g(f(x)) = Σ_{i=1}^ℓ (α*_i − α_i) K(f(x_i), f(x)) + b.

Learning Algorithm. The learning algorithm adjusts the SVM coefficients of all SVMs through a min-max formulation of the dual objective W of the main SVM:

min_{f(x)} max_{α,α*} W(f(x), α^(*)) = −ε Σ_{i=1}^ℓ (α*_i + α_i) + Σ_{i=1}^ℓ (α*_i − α_i) y_i − (1/2) Σ_{i,j=1}^ℓ (α*_i − α_i)(α*_j − α_j) K(f(x_i), f(x_j)).

We have developed a simple gradient ascent algorithm to train the SVMs. The method adapts the SVM coefficients α^(*) (standing for all α*_i and α_i) toward a (local) maximum of W, where λ is the learning rate: α^(*)_i ← α^(*)_i + λ · ∂W/∂α^(*)_i. The resulting gradient ascent learning rule for α_i is:

α_i = α_i + λ(−ε − y_i + Σ_j (α*_j − α_j) K(f(x_i), f(x_j))).

We use radial basis function (RBF) kernels in both layers of a two-layered DSVM. Results with other kernels were worse. For the main SVM:

K(f(x_i), f(x)) = exp(− Σ_a (f(x_i)_a − f(x)_a)^2 / σ_m).

The system constructs a new dataset for each feature-layer SVM S_a with a backpropagation-like technique for making examples: (x_i, f(x_i)_a − μ · δW/δf(x_i)_a), where μ is some learning rate, and δW/δf(x_i)_a is given by:

δW/δf(x_i)_a = (α*_i − α_i) Σ_{j=1}^ℓ (α*_j − α_j) ((f(x_i)_a − f(x_j)_a)/σ_m) · K(f(x_i), f(x_j)).

The feature extracting SVMs are pseudo-randomly initialized, and then alternated training of the main SVM and the feature-layer SVMs is executed for a number of epochs. The bias values are computed from the average errors.
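As an illustration of the coefficient update of the main SVM (ours, not the authors' implementation), one sweep of the gradient ascent rule on precomputed feature-layer kernel values could look as follows; the rule for α*_i is the analogous one obtained from ∂W/∂α*_i, and the feature-layer SVMs are updated separately on the constructed datasets.

    import numpy as np

    def gradient_ascent_sweep(alpha, alpha_star, K_f, y, eps, lr):
        """One sweep of gradient ascent on the dual objective W of the main SVM.

        K_f: (l, l) kernel matrix between the extracted feature vectors f(x_i).
        """
        s = (alpha_star - alpha) @ K_f                  # s_i = sum_j (a*_j - a_j) K(f(x_i), f(x_j))
        alpha = alpha + lr * (-eps - y + s)             # learning rule for alpha_i from the text
        alpha_star = alpha_star + lr * (-eps + y - s)   # analogous rule from dW/dalpha*_i
        return alpha, alpha_star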

3 Experimental Results

We experimented with 10 regression datasets to compare the DSVM to an SVM, both using RBF kernels. Both methods are trained with our simple gradient ascent learning rule, adapted to also consider penalties, e.g. for obeying the bias constraint. The first 8 datasets are described in [3] and the other 2 datasets are taken from the UCI repository. The number of examples per dataset ranges from 43 to 1049, and the number of features is between 2 and 13. The datasets are split into 90% training data and 10% testing data. For optimizing the learning parameters we have used particle swarm optimization. Finally, we used 1000 or 4000 times cross-validation with the best found parameters to compute the mean squared error and its standard error.

Dataset        SVM results             DSVM results
Baseball       0.02413 ± 0.00011       0.02294 ± 0.00010
Boston H.      0.006838 ± 0.000095     0.006381 ± 0.000090
Concrete       0.00706 ± 0.00007       0.00621 ± 0.00005
Electrical     0.00638 ± 0.00007       0.00641 ± 0.00007
Diabetes       0.02719 ± 0.00026       0.02327 ± 0.00022
Machine-CPU    0.00805 ± 0.00018       0.00638 ± 0.00012
Mortgage       0.000080 ± 0.000001     0.000080 ± 0.000001
Stock          0.000862 ± 0.000006     0.000757 ± 0.000005
Auto-MPG       6.852 ± 0.091           6.715 ± 0.092
Housing        8.71 ± 0.14             9.30 ± 0.15

Tab. 1: The mean squared errors and standard errors of the SVM and the two-layer DSVM on 10 regression datasets.

Table 1 shows the results. The results of the DSVM are significantly better for 6 datasets (p < 0.001) and worse on one. From the results we can conclude that the DSVM is a powerful novel machine learning algorithm. More research, such as adding more layers and implementing more powerful techniques to scale up to big datasets, can be done to discover its full potential.

References

[1] F.R. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 6-15, 2004.

[2] F. Dinuzzo. Kernel machines with two layers and multiple kernel learning. CoRR, 2010.

[3] M. Graczyk, T. Lasota, Z. Telec, and B. Trawinski. Nonparametric statistical analysis of machine learning algorithms for regression problems. In Knowledge-Based and Intelligent Information and Engineering Systems, pages 111-120, 2010.

[4] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

[5] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006.


Subspace Learning and Empirical Operator Estimation

Alessandro Rudi
Istituto Italiano di Tecnologia
[email protected]

Guillermo D. Canas
MIT-IIT
[email protected]

Lorenzo Rosasco
Universita di Genova & MIT-IIT
[email protected]

Abstract: This work deals with the problem of linear subspace estimation in a general, Hilbert space setting. We provide bounds that are considerably sharper than existing ones, under equal assumptions. These bounds are also competitive with bounds that are allowed to make strong, further assumptions (on the fourth order moments), even when we do not. Finally, we generalize these results to a family of metrics, allowing for a more general definition of performance.

Keywords: Subspace Learning, PCA, Kernel-PCA

1 Introduction

Estimating the smallest linear space supporting data drawn from an unknown distribution is a classical problem in machine learning and statistics, with several established algorithms addressing it, most notably PCA and kernel PCA [2, 3]. Thus knowledge of the speed of convergence of these estimators, with respect to the sample size and the algorithms' parameters, is of considerable practical importance.

We use tools from linear operator theory to arrive at novel learning rates for the subspace estimation problem. These rates are significantly sharper than existing ones, under typical assumptions on the eigenvalue decay rate of the covariance. Furthermore, they cover a wider range of performance metrics.

Problem statement. Given a measure ρ with support M in the unit ball of a separable Hilbert space H, we consider the problem of estimating, from n i.i.d. samples X_n = {x_i}_{i=1}^n, the smallest linear subspace S_ρ that includes M, that is, S_ρ := span(M).

The quality of an estimated subspace S, for a given metric (pseudo-metric) d, is characterized in terms of bounds of the form

P[ d(S_ρ, S) ≤ ε(δ, n) ] ≥ 1 − δ,   0 < δ < 1.   (1)

Common choices of S are the empirical estimate S_n := span(X_n), and the k-truncated (kernel) PCA estimate S_n^k (where S_n^n = S_n).

Performance criteria. A natural performance criterion is the so-called reconstruction pseudometric

d_R(S_ρ, S) := E_{x∼ρ} ‖P_{S_ρ}(x) − P_S(x)‖_H^2,

where P_V is the metric projection onto a subspace V. Another important criterion is the gap distance d_G(S_ρ, S) := ‖P_{S_ρ} − P_S‖_∞, which measures, inside the unit ball, the maximum distance to S over points in S_ρ.

By letting C_ρ := E_{x∼ρ} x ⊗ x be the (uncentered) covariance operator associated to ρ, and defining

d_{α,p}(S_ρ, S) := ‖(P_{S_ρ} − P_S) C_ρ^α‖_p,   (2)

we notice that both d_R(S_ρ, ·) = d_{1/2,2}(S_ρ, ·)^2 and d_G = d_{0,∞}, and therefore the two metrics can be analyzed using common tools.

Figure 1: Expected distance from a random sample to the empirical k-truncated kernel-PCA subspace estimate, as a function of k (n = 1000, 1000 trials shown in a boxplot). Our predicted plateau threshold k* (Cor. 2.2) is a good estimate of the value of k past which the distance stabilizes.

1.1 Example
Consider a simple one-dimensional uniform distribution embedded into a reproducing-kernel Hilbert space H (using the exponential of the ℓ1 distance as kernel). Figure 1 is a box plot of d_R(S_ρ, S_n^k), where S_n^k is the k-truncated kernel-PCA estimate, with n = 1000 and varying k. Note that, while d_R is computed analytically in this example, and S_ρ is fixed, the estimate S_n^k is a random variable, hence the variability in the graph. The graph is highly concentrated around a curve with a steep initial drop, until reaching some sufficiently high k, past which the reconstruction (pseudo) distance becomes stable and does not vanish. In our experiments, this behavior is typical for the reconstruction distance

k-truncated kernel-PCA estimate, with n = 1000 andvarying k. Note that, while dR is computed analyticallyin this example, and Sρ is fixed, the estimate Skn is arandom variable, and hence the variability in the graph.The graph is highly concentrated around a curve witha steep intial drop, until reaching some sufficiently highk, past which the reconstruction (pseudo) distance be-comes stable, and does not vanish. In our experiments,this behavior is typical for the reconstruction distance


and high-dimensional problems.

Notice that our bound for this case (Cor. 2.3) similarly predicts a steep performance drop until a value k = k* (indicated in the figure by the vertical blue line), and a plateau afterwards.
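The experiment behind Figure 1 can be approximated numerically with the following sketch (ours, not the authors' code): estimate the k-truncated kernel-PCA subspace on a training sample and evaluate the average squared feature-space residual of fresh samples projected onto it, an empirical surrogate for d_R; the kernel, its bandwidth and the sample sizes are arbitrary choices.

    import numpy as np

    def rbf(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / sigma ** 2)

    def reconstruction_error(X_train, X_test, k, kernel=rbf):
        """Mean squared distance of phi(x_test) to the span of the top-k kernel-PCA directions."""
        K = kernel(X_train, X_train)
        evals, evecs = np.linalg.eigh(K)
        evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]   # top-k eigenpairs of K
        K_xt = kernel(X_test, X_train)                          # cross-kernel, shape (m, n)
        proj = (K_xt @ evecs) ** 2 / evals                      # squared projections per direction
        return np.mean(np.diag(kernel(X_test, X_test)) - proj.sum(axis=1))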

2 Learning Rates

Our main technical contribution is a bound of the form of Eq. (1). We begin by stating it in the most general form in Th. 2.1, which bounds the general distance d_{α,p} given a known covariance C_ρ.

Theorem 2.1. Let {x_i}_{i=1}^n be drawn i.i.d. according to a probability measure ρ supported on the unit ball of a separable Hilbert space H, with covariance C_ρ. Assuming n > 3, 0 < δ < 1, 0 ≤ α ≤ 1/2 and k ≥ 1, it holds

P[ d_{α,p}(S_ρ, S_n^k) ≤ 4 t^α ‖C_ρ^α (C_ρ + t)^{−α}‖_p ] ≥ 1 − δ,

where t = max{σ_k, (9/n) log(n/δ)}, and σ_k is the k-th top eigenvalue of C_ρ.

We say that C_ρ has eigenvalue decay rate of order r if there are constants q, Q > 0 such that q j^{−r} ≤ σ_j ≤ Q j^{−r}, where σ_j are the (decreasingly ordered) eigenvalues of C_ρ, and r > 1. Such knowledge can be incorporated into Th. 2.1 to obtain explicit learning rates, as follows.

Corollary 2.2 (Polynomial eigenvalue decay). Let C_ρ have eigenvalue decay rate of order r. Under the assumptions of Th. 2.1, it is, with probability 1 − δ,

d_{α,p}(S_ρ, S_n^k) ≤ Q′ min{k, k*}^{−rα + 1/p},

where k* = ( q n / (9 log(n δ^{−1})) )^{1/r}, and Q′ is a constant whose value is omitted here for brevity.

Finally, by particularizing Cor. 2.2 to the reconstruction distance d_R, and choosing the predicted optimal k = k*, we obtain the following result.

Corollary 2.3 (Reconstruction distance). Letting C_ρ have eigenvalue decay rate of order r, and k* be as in Cor. 2.2, it holds with probability 1 − δ:

d_R(S_ρ, S_n^{k*}) = O( (log n / n)^c ),   where c = 1 − 1/r.

Note that, as is, Th. 2.1 does not provide a useful bound for the gap-metric d_{0,∞}. We obtain bounds for this metric independently, but they are omitted here in the interest of conciseness.

3 Discussion

Figure 2 shows a comparison of our learning rates with existing rates in the literature [1, 4]. The plot shows the polynomial decay rate c of d_R(S_ρ, S_n^k) = O(n^{−c}), as a function of the eigenvalue decay rate r of the covariance C_ρ, computed at the best value k* (which minimizes the bound).

[Figure 2 appears here: decay rate c (y-axis, 0.2 to 0.8) versus eigenvalue decay rate r (x-axis, 4 to 10).]

Figure 2: Known upper bounds for the polynomial decay rate c (for the best choice of k), for the expected distance from a random sample to the empirical k-truncated kernel-PCA estimate, as a function of the covariance eigenvalue decay rate (higher is better). Our bound (purple line) consistently outperforms previous ones ([4], black line). The top (dashed) line has significantly stronger assumptions, and is only included for completeness.

The rate exponent c, under a polynomial eigenvalue decay assumption for C_ρ, is c = s(r − 1)/(r − s + sr) for [1] and c = (r − 1)/(2r − 1) for [4], where s is related to the fourth moment. Note that, among the two (purple and black) that operate under the same assumptions, ours (purple line) is the best by a wide margin. The top, best performing, dashed line [1] is obtained for the best possible fourth-order moment s = 2r, and therefore it is not a fair comparison. However, it is worth noting that our bounds perform almost as well as the most restrictive one, even when we do not include any fourth-order moment constraints.

References

[1] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2):259-294, 2007.

[2] I. Jolliffe. Principal Component Analysis. Wiley Online Library, 2005.

[3] B. Scholkopf, A. Smola, and K.R. Muller. Kernel principal component analysis. Artificial Neural Networks - ICANN'97, pages 583-588, 1997.

[4] J. Shawe-Taylor, C. K. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510-2522, 2005.


Kernel based identification of systems with multiple outputs using nuclear norm regularization

Tillmann Falck, Bart De Moor, Johan A. K. Suykens
SCD/SISTA, ESAT, KU Leuven

tillmann.falck, bart.demoor, [email protected]

Abstract: This contribution introduces a novel identification scheme for nonlinear systems with multiple outputs based on nuclear norm regularization. A kernel based formulation is derived in a primal-dual setting. Then the model representation in terms of the kernel function is established.

Keywords: kernel methods, system identification, nuclear norm, primal-dual framework

1 Introduction

Kernel based methods have proved to be a powerful technique for nonlinear system identification [6]. The established approach to handle systems with multiple outputs is to estimate independent models for each output. This approach is suboptimal, however. Consider the scenario described in [1], which analyzes time series acquired from the Belgian power distribution network. The data set contains measurements from several hundred substations, and the referenced work identified 7 profiles which can be used to explain the measurements of all substations. This illustrates a large amount of redundancy contained in the data which cannot be captured by estimating individual models for each output. To utilize such relationships among several variables for improved model performance, this paper proposes the use of nuclear norm regularization [4].

The combination of a kernel based model and nuclear norm regularization is still challenging. A representer theorem has been proved in [2] for a class of matrix regularizers including the nuclear norm. This contribution, however, adopts a novel primal-dual approach. The primal-dual formulation makes it straightforward to include prior information in terms of additional constraints and in general is very flexible, as demonstrated in [5].

2 Problem statement

The parametric model formulation is a direct extension of support vector models in a primal-dual setting. For a pair of inputs and outputs (x_t, y_t), the model equation is given by

y_t^{(i)} = w_i^T ϕ(x_t) + b_i   (1)

for i = 1, . . . , M and y_t = [y_t^{(1)}, . . . , y_t^{(M)}]^T. Here w_i ∈ R^{n_h} denotes the parameter vector for the i-th output and b_i is the corresponding intercept. The nonlinear map ϕ : R^D → R^{n_h} is referred to as the feature map and is often only implicitly defined by the relation K(x, y) = ϕ(x)^T ϕ(y). A popular choice for the positive semidefinite kernel function K is the Gaussian RBF kernel K_RBF(x, y) = exp(−‖x − y‖_2^2 / σ^2).

Given data {(x_t, y_t)}_{t=1}^N, a model of the form specified by (1) can be estimated by solving the following convex problem,

min_{W, b, e_t}  η ‖W‖_* + (1/2) ∑_{t=1}^N e_t^T e_t
subject to  y_t = W^T ϕ(x_t) + b + e_t,  t ∈ S,   (2)

where S = {1, . . . , N}. Note the use of the nuclear norm ‖·‖_* as regularization term. This imposes a connection between the different output variables, as motivated in the introduction. However, the deviation from the classical quadratic regularization term vec(W)^T vec(W) requires a new solution strategy to obtain a kernel based model, which is outlined in the next section.

3 Kernel based formulation

3.1 Dual optimization problem
In kernel based models the parametric problem (2) can often not be solved directly, as the feature map ϕ is not explicitly known. One popular way of obtaining a tractable problem is deriving the Lagrange dual. To handle the nondifferentiable objective one can reformulate the nuclear norm by using the definition of the dual norm

‖W‖_* = max_{‖C‖_2 ≤ 1} tr(C^T W)   (3)

where ‖·‖_2 is the spectral norm. This allows the derivation of the following lemma.

Lemma 1. The solution to (2) is equivalent to the solution of its Lagrange dual

max_{A ∈ R^{M×N}}  tr(A^T Y) − (1/2) tr(A^T A)
subject to  A 1_N = 0_M,  A Ω A^T ⪯ η^2 I_M   (4)


with Y = [y_1, . . . , y_N] ∈ R^{M×N}, 1_N ∈ R^N a vector of all ones, 0_M ∈ R^M a vector of all zeros and I_M the identity matrix of size M. The elements of the Gram matrix Ω are computed according to Ω_{ij} = ϕ(x_i)^T ϕ(x_j) = K(x_i, x_j) for i, j = 1, . . . , N.

The proof of this lemma can be found in Chapter 6 of [3]. Note that the quadratic matrix inequality constraint in (4) can be reformulated into a linear constraint by the Schur complement or by taking a square root on both sides of the inequality. In the reformulated form the problem can be solved with general purpose semidefinite programming solvers like SDPT3 or CVXOPT.
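For concreteness, a minimal CVXPY sketch of the dual problem (4) using the square-root reformulation mentioned above (Ω = L L^T, so that A Ω A^T ⪯ η² I_M becomes a spectral-norm bound on A L). The function name and the jitter added to the Gram matrix are illustrative choices, not part of the original formulation.

```python
import cvxpy as cp
import numpy as np

def solve_dual(Y, Omega, eta):
    """Sketch of the dual (4): maximize tr(A^T Y) - 0.5 tr(A^T A)
    subject to A 1_N = 0_M and A Omega A^T <= eta^2 I_M."""
    M, N = Y.shape
    # small jitter so that a Cholesky factor of the PSD Gram matrix exists
    L = np.linalg.cholesky(Omega + 1e-8 * np.eye(N))
    A = cp.Variable((M, N))
    objective = cp.Maximize(cp.sum(cp.multiply(A, Y)) - 0.5 * cp.sum_squares(A))
    constraints = [A @ np.ones(N) == 0,          # A 1_N = 0_M
                   cp.sigma_max(A @ L) <= eta]   # equivalent to A Omega A^T <= eta^2 I_M
    cp.Problem(objective, constraints).solve()
    return A.value
```

The spectral-norm constraint keeps the problem solvable by any SDP-capable solver, mirroring the Schur-complement route described above.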

3.2 Kernel based model representation
The kernel based estimation problem given in Lemma 1 is not useful on its own, but only when a model representation in terms of the dual variable A can be established. The derivation of this representation requires establishing a link between the primal model parameters W, b and the Lagrange multipliers A. To obtain such a connection, the set of all possible values for W corresponding to the optimal C in (3) is characterized in [3]. Based on this characterization one can find the following relations

W = Φ A^T M  and  b = (1/N)(Y − M A Ω) 1_N

with M = η^{−2} P_η (Y A^T − A A^T) P_η, where P_η = V_η V_η^T. The matrix V_η contains the eigenvectors corresponding to the largest eigenvalue of A Ω A^T, where λ_max = η^2.

The dual model representation and the corresponding predictive model for a new point z are obtained by the substitution of these relations into the parametric model (1),

ŷ = f(z) = ∑_{t=1}^N α̃_t K(x_t, z) + b.

The variables α̃_t form the matrix Ã = [α̃_1, . . . , α̃_N], which is computed as Ã = M A.
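Once Ã and b have been recovered as above, evaluating the predictive model is a plain kernel expansion. The sketch below assumes a generic kernel callable and the training inputs stored row-wise; the names are illustrative.

```python
import numpy as np

def predict(z, X_train, A_tilde, b, kernel):
    """Dual prediction y_hat(z) = sum_t alpha_t K(x_t, z) + b, where the
    alpha_t are the columns of A_tilde (shape M x N)."""
    k = np.array([kernel(x_t, z) for x_t in X_train])  # k_t = K(x_t, z)
    return A_tilde @ k + b
```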

4 Numerical example

Figure 1 compares the validation performance for different estimation problems on a toy data set. The data is generated using a model of the form y_t = W_0^T ϕ(x_t) + e_t with M = 20 outputs. The parameter matrix W_0 is constructed as W_{0,B} W_{0,M}. W_{0,B} is a 50 × 3 matrix, while W_{0,M} is 3 × 20. The elements of both matrices are drawn from a standard normal distribution.

The figure compares the estimation technique described in this abstract, denoted by MIMO, with models with a simple quadratic regularization term w_i^T w_i trained on each output, denoted by RR, and two ordinary least squares (OLS) estimates without any regularization.

[Figure: RMSE as a function of the regularization parameter η for MIMO, RR, OLS and OLS + Oracle.]
Fig. 1: Validation performance of different multivariate model structures for the toy dataset. The predictive performance is shown as a function of the regularization parameter. The root mean squared error in this figure is computed with respect to time t and output i.

The model with oracle was given the matrix W_{0,M}, such that it only had to estimate the parameters of W_{0,B}, which amounts to a factor of 60 fewer parameters. One can observe that the proposed scheme clearly outperforms OLS as well as RR. Knowledge of the true dependencies still yields a better performance, though.

5 Conclusions

This contribution proposed a novel formulation to estimate kernel based models with multiple outputs in an identification setting and illustrated the effectiveness on a simple toy example. Furthermore, the kernel based model representation for a model with nuclear norm regularization has been stated in a primal-dual setting.

Acknowledgments
This work was supported by CoE EF/05/006 OPTEC, GOA MANET, IUAP DYSCO, FWO G.0377.12, ERC AdG A-DATADRIVE-B. J.A.K. Suykens is a professor and B. De Moor is a full professor at KU Leuven, Belgium.

References

[1] C. Alzate, M. Espinoza, B. De Moor, and J.A.K. Suykens. Identifying customer profiles in power load time series using spectral clustering. In Proc. of the 19th International Conference on Artificial Neural Networks, pages 315–324, 2009.

[2] A. Argyriou, C.A. Micchelli, and M.A. Pontil. When is there a representer theorem? Vector versus matrix regularizers. Journal of Machine Learning Research, 10:2507–2529, 2009.

[3] T. Falck. Nonlinear system identification using structured kernel based modeling. PhD thesis, Katholieke Universiteit Leuven, Belgium, April 2013.

[4] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford, 2002.

[5] J.A.K. Suykens, C. Alzate, and K. Pelckmans. Primal and dual model representations in kernel-based learning. Statistics Surveys, 4:148–183, August 2010.

[6] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.


First-order methods for low-rank matrix factorization applied to informed source separation

Augustin Lefevre¹
[email protected]
Universite catholique de Louvain - ICTEAM Institute
Avenue Georges Lemaître 4, B-1348 Louvain-la-Neuve - Belgium

Francois Glineur¹,²
[email protected]
Universite catholique de Louvain - CORE
Voie du Roman Pays 34, B-1348 Louvain-la-Neuve - Belgium

Abstract: We study a convex formulation of low-rank matrix factorization, in a special case where additional information on the factors is known. Our formulation is typically adapted to source separation scenarios, where additional information on the sources may be provided by an expert. Our formulation promotes low rank with a nuclear-norm based penalty. As it is non-smooth, generic first-order algorithms suffer from slow convergence rates. We study and compare several algorithms that fully exploit the structure of our problem while keeping memory requirements linear in the size of the problem.

Keywords: source separation, inverse problems, machine learning

1 Low-rank matrix factorization and informed source separation

Given a matrix of observations Y ∈ R^{F×N}, we assume that Y is a sum of G low-rank contributions perturbed by some noise, i.e. Y ≈ ∑_{g=1}^G X_g, where X_g ∈ R^{F×N}_+ is the contribution of source g. The informed source separation problem consists in identifying the contributions X_g with the additional knowledge that some entries in some of the contributions are equal to zero.

Matrix factorization is an essential building block in source separation methods. Indeed, as we seek to represent Y as a sum of low-rank matrices, an equivalent problem is to express Y approximately as a low-rank product of factors D ∈ R^{F×K} and corresponding activation coefficients A ∈ R^{K×N} (where the inner size K is the sum of the ranks of the contributions). More specifically, one seeks to minimize a suitable norm of the difference ‖Y − DA‖. Unfortunately, the problem of identifying the best factors D and A is nonconvex and multimodal, so that in practice only local solutions can be expected to be found in reasonable time (i.e. estimates of D and A in the neighbourhood of which no improvement can be made).

2 Nuclear norm-based convex reformulation

In this work, we substitute the above nonconvex problem with a convex reformulation. A well-known technique to obtain low-rank estimates for the sources X_g is to penalize ‖X_g‖_*, the nuclear norm of X_g, i.e. the sum of its singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_F. This leads us to consider the following problem, introduced earlier in [1]:

min_X  (1/2) ‖Y − ∑_{g=1}^G X_g‖²_F + λ ∑_{g=1}^G ‖X_g‖_*   (1)
subject to  X_g ≥ 0 and M_g · X_g = 0,   (2)

where A · B denotes the coefficientwise product of A and B, and the matrices M_g enforce the constraint that the entry of X_g at coordinates (f, n) satisfies X_{g,fn} = 0 if M_{g,fn} ≠ 0. In what follows, we define a scalar product ⟨U, V⟩ = ∑_g Tr(U_g^T V_g) for the search space. Problem 2 is convex, because we have replaced the strict constraint that each source term X_g be low-rank by a penalty term ∑_g ‖X_g‖_* favouring low-rank solutions. Convexity is desirable because it means that global solutions may be reached from any initial point, without recourse to extensive sampling of the search space R^{F×N×G}.

3 Algorithms for nuclear norm minimization

Because of the nuclear norm penalty term, the objective function in Problem 2 is non-smooth. In previous experiments with a projected subgradient algorithm [1], we observed that satisfactory source separation could be obtained fast enough, provided the step size is carefully selected. We compare in this article algorithms


that enjoy faster convergence rates than subgradient descent, by exploiting the specific structure of the nuclear norm: indeed, the general non-smooth minimization method relies on an arbitrary choice of subgradient at each iteration, even though it might not be a descent direction. In our situation, the subdifferential of the objective function may be described completely, so we can make our choice more wisely, as we will show in Section 3.1. On the other hand, we also consider applying an optimal gradient method to a smooth approximation of the nuclear norm [3], which we will detail in Section 3.2.

3.1 Subgradient methods

Let ∂f(X) be the subdifferential of f at X. Since the objective function f is convex, it also admits directional derivatives f′(X; D) in every direction, and we have f′(X; D) = max{ ⟨U, D⟩ : U ∈ ∂f(X) }. General subgradient methods for minimizing f consist in successively picking a subgradient G ∈ ∂f(X) at a given point X, moving X along the direction −G with an appropriate choice of the step size, and projecting X on the set of constraints, if any. However, while the (projected) gradient is always a descent direction when f is differentiable, this is no longer the case for arbitrary choices of subgradients. Fortunately, the steepest descent direction is related to the minimum norm subgradient:

argmin_D f′(X; D) = −argmin{ ‖Z‖ : Z ∈ ∂f(X) }

In the unconstrained case, the steepest descent direction G is always a descent direction (provided, of course, X is not a global minimum). This means that the step size α can be chosen so as to ensure f(X + αG) ≤ f(X). In our case, additional care must be taken to select a feasible descent direction. As we show, computing the minimum norm subgradient implies roughly twice the computational cost of an arbitrary subgradient, so it is worth comparing its merit experimentally.

3.2 Smoothing based gradient methods

Nesterov [3] showed that for a particular class of non-smooth functions (in which our problem fits), it is possible to obtain a faster convergence rate by applying an accelerated gradient method to a smooth approximation f_µ of the objective function. In our case, we replace the nuclear norm term by:

‖X‖_{*,µ} = ∑_{f=1}^F h_µ(σ_f)  where  h_µ(x) = x²/(2µ) if x ≤ µ,  and  h_µ(x) = x − µ/2 if x > µ.

When µ = 0, ‖X‖_{*,µ} = ‖X‖_*, and for general µ > 0 it is a continuously differentiable function with Lipschitz continuous derivatives with Lipschitz constant 1/µ. We show that the Lipschitz constant of f_µ is L_µ = G + λ/µ. Moreover, f_µ(X) ≤ f(X) ≤ f_µ(X) + µ/2, so the parameter µ trades off the magnitude of the Lipschitz constant L_µ (and hence the rate of convergence of the fast gradient algorithm) versus the quality of the approximation error. Following [3] we implement an accelerated gradient method with fixed step size. Values and first order derivatives of ‖X‖_{*,µ} are obtained by computing the SVD of X, so we can implement a fast gradient method with the same iteration cost as projected subgradient descent, but whose theoretical convergence rate is much faster.
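A minimal NumPy sketch of the smoothed penalty and its gradient, using the standard expression U diag(h_µ′(σ)) V^T for the gradient of a differentiable spectral function; the function name is illustrative and the sketch assumes µ > 0.

```python
import numpy as np

def smoothed_nuclear_norm(X, mu):
    """Huber-type smoothing ||X||_{*,mu} = sum_f h_mu(sigma_f(X)) and its
    gradient, with h_mu(x) = x^2/(2 mu) if x <= mu and x - mu/2 otherwise."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    value = np.sum(np.where(s <= mu, s**2 / (2 * mu), s - mu / 2))
    grad = (U * np.minimum(s / mu, 1.0)) @ Vt   # h_mu'(x) = min(x/mu, 1)
    return value, grad
```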

3.3 Contributions and related work

Algorithms for the minimization of the nuclear norm subject to linear equality constraints have been proposed in [2, 4]. For small problems, the authors show that nuclear norm minimization problems may be reformulated as semidefinite programs (SDP), for which interior-point algorithms with superlinear convergence are available. For the purpose of source separation, interior-point methods are too expensive as they require storage and inversion of matrices of size (FN)², where F ≈ 500, N ≈ 10³. Another interesting category of algorithms are those based on augmented Lagrangians and explicit factorizations of the source terms X_g = D_g A_g [4].

4 Experimental results and comments

We compare smoothing based gradient methods and subgradient methods (with or without the minimum norm subgradient) in an informed source separation experiment involving four musical pieces of 14 seconds each. For appropriate values of µ > 0, smoothing based gradient methods achieve both a faster decrease of the objective function and better quality solutions, while the minimum norm subgradient improves much over an arbitrary one.

Acknowledgments

This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors.

References

[1] A. Lefevre, F. Glineur, and P.-A. Absil. A nuclear norm-based convex formulation for informed source separation. Technical Report 1212.3119, arXiv, 2012.

[2] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 2009.

[3] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 2005.

[4] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 2010.


Structured low-rank approximation as optimization on a Grassmann manifold

Konstantin Usevich, Ivan Markovsky
Dept. ELEC, Vrije Universiteit Brussel
kusevich, [email protected]

Abstract: Many data modeling problems can be posed and solved as a structured low-rank approximation problem. Using the variable projection approach, the problem is reformulated as optimization on a Grassmann manifold. We compare local optimization methods based on different parametrizations of the manifold, including the recently proposed penalty method and the method of switching permutations. A numerical example of system identification is provided.

Keywords: structured low-rank approximation, variable projection, Grassmann manifold, local optimization, system identification

1 Introduction

A linear structure is a linear map R^{n_p} → R^{m×n}. For convenience, we assume that m ≤ n. In this paper, we consider the problem of approximating a given structured matrix by a structured matrix of low rank.

Problem 1 (Structured low-rank approximation). Given p ∈ R^{n_p}, structure S, seminorm ‖·‖ and r < m,

minimize over p̂ ∈ R^{n_p}  ‖p − p̂‖  subject to  rank S(p̂) ≤ r.   (1)

Many data modeling problems can be posed and solved as Problem 1, for a suitable structure, seminorm, and rank [4]. In this case, p is an observed data vector, and the approximating vector p̂ should possess a certain property, which is encoded in the low-rank constraint.

Usually, the weighted Euclidean seminorm is used:

‖p‖²_w = ∑_{i=1}^{n_p} w_i p_i²,  where w_i ∈ [0, +∞].   (2)

Infinite weights correspond to fixed (exactly known) values of the approximating vector [6] and zero weights correspond to the case of missing observations [5].

1.1 Variable projection solution

The rank constraint in (1) is equivalent to the existence of a full row rank R ∈ R^{d×m}, where d := m − r, such that R S(p̂) = 0. Hence Problem 1 can be reformulated as

minimize over R : rank R = d  f(R),  where   (3)

f(R) := min_{p̂ ∈ R^{n_p}} ‖p − p̂‖²_w  subject to  R S(p̂) = 0.   (4)

For the weighted seminorm (2) the inner problem (4) has a closed-form solution [5]. For mosaic Hankel structure and positive weights there exist efficient O(d³m²n) methods [6] for evaluating f(R) and its derivatives.

The cost function is homogeneous in the following sense:

f(R) = f(UR)  ∀ nonsingular U ∈ R^{d×d}.   (5)

Therefore, f depends only on the row space of R. Hence, f is defined on a Grassmann manifold GrR(d,m) [3] (the manifold of all d-dimensional subspaces of R^m).

2 Optimization on GrR(d,m)

Consider a homogeneous (in the sense of (5)) function f : R_f → R_+, where

R_f := {R ∈ R^{d×m} : rank R = d},   (6)

and R_+ := [0; +∞). We require the function f to be smooth (from C¹(R_f)). The optimization problem considered is (3), which is minimization on GrR(d,m).

2.1 Orthonormal bases

For any subspace there exists an orthonormal basis. Hence, (3) is equivalent to

minimize over R ∈ R^{d×m}  f(R)  subject to  R R⊤ = I_d.   (7)

Problem (7) can be solved by constrained optimization methods or by methods using Riemannian geometry [3].

2.2 Exact penalty method of [5] (reg)

In [5] it was shown that the hard constraint in (7) can be equivalently replaced by a soft constraint.

Theorem 1. Let f̃ : R^{d×m} → R_+ be a homogeneous extension (not necessarily smooth) of (4), i.e. f̃(R) = f(R) for all R ∈ R_f, and f̃ satisfies (5). Let γ > f(R_0) for an R_0 ∈ R_f. Then the solutions of

minimize over R ∈ R^{d×m}  f̃(R) + γ ‖R R⊤ − I_d‖²_F,   (8)

coincide with the solutions of (3).

Therefore, the penalty method in (8) is exact.
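As an illustration of the reg approach, a sketch that feeds the exact penalty of Theorem 1 to a generic unconstrained solver; the cost f is assumed to be supplied as a black box evaluating (4), and the solver choice (BFGS) is arbitrary, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def slra_reg(f, R0, gamma):
    """Minimize f(R) + gamma * ||R R^T - I_d||_F^2 over R in R^{d x m}
    (exact penalty of Theorem 1); R0 is the initial approximation."""
    d, m = R0.shape
    def cost(r):
        R = r.reshape(d, m)
        return f(R) + gamma * np.linalg.norm(R @ R.T - np.eye(d), 'fro')**2
    result = minimize(cost, R0.ravel(), method='BFGS')
    return result.x.reshape(d, m)
```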


Parameters            perm0             perm              genrtr             fmincon            reg
T      q  m  ℓ      %     it   t      %     it   t      %      it   t      %      it   t      %      it   t
57     7  5  1      0.04  200  0.07   0.04  21   0.01   0.05   200  1.69   0.04   66   0.73   0.04   22   0.26
90     8  5  1      0.06  200  0.27   0.06  200  0.26   0.07   200  6.62   0.07   87   1.4    0.08   12   0.29
801    2  1  2      1.46  19   0.01   1.46  19   0.01   1.46   15   0.2    1.45   49   0.25   1.56   200  0.78
1,000  2  1  5      3.45  80   0.1    3.45  77   0.1    4.25   200  6.43   4.22   67   0.49   4.35   200  1.67
1,024  2  1  5      9.75  44   0.06   9.74  200  0.25   17.46  200  0.69   13.39  56   0.47   13.31  25   0.27

Tab. 1: Comparison of the methods

2.3 Switching between permutations (perm)

Any subspace from GrR(d,m) can be represented by a matrix of the form [X −I_d]Π, where Π is a permutation matrix and X ∈ R^{d×(m−d)}. As shown in [1], Π can be chosen such that |X_{i,j}| ≤ 1 for all i, j. Hence, problem (3) is equivalent to

minimize over Π ∈ {0,1}^{m×m}, ΠΠ⊤ = I_m   min_{X ∈ [−1;1]^{d×(m−d)}}  f( [X −I_d] Π ).   (9)

For local optimization of (9), the combinatorial problem of choosing Π can be avoided by switching permutations in the course of optimization [7].

2.4 Using Riemannian geometry [3] (genrtr)

Let M be a Riemannian manifold, with tangent bundle T_x M. The methods of [3] require a retraction R_x : T_x M → M, and operate as follows. From x_k, a direction ξ_k ∈ T_{x_k} is selected, based on derivatives of f at x_k. The new iterate is x_{k+1} = R_{x_k}(ξ_k), and the process is repeated until a convergence criterion is met. A trust-region method is implemented in [8].

3 Results

We compare the methods on an example of errors-in-variables system identification. The data w ∈ (R^q)^T is a q-variate time series, and the structure is block-Hankel:

H_{ℓ+1}(w) :=  [ w(1)     w(2)    · · ·   w(T−ℓ) ]
               [ w(2)     w(3)             ⋮     ]
               [  ⋮         ⋮               ⋮     ]
               [ w(ℓ+1)   · · ·   · · ·   w(T)   ],

where the number of rows is m = q(ℓ+1). Identification in the class L^q_{m,ℓ} of linear time-invariant systems with at most m inputs and lag at most ℓ is equivalent to Problem 1 for the structure H_{ℓ+1}(w) and r = m − (q − m) [4] (rank reduction by the number of outputs).
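A short NumPy sketch of the block-Hankel map H_{ℓ+1}(w) used above, assuming the time series is stored as a T × q array; the function name is illustrative.

```python
import numpy as np

def block_hankel(w, ell):
    """Build H_{ell+1}(w) for a q-variate series w of shape (T, q):
    block row i (0-based) stacks the samples w(i+1), ..., w(T-ell+i)."""
    T, q = w.shape
    cols = T - ell
    H = np.empty(((ell + 1) * q, cols))
    for i in range(ell + 1):
        H[i * q:(i + 1) * q, :] = w[i:i + cols, :].T
    return H
```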

We compare the methods on examples from the DAISY database [2]. In Table 1, we provide for each method the number of iterations (it), time (t), and the fit (%), equal to 100% · ‖p* − p‖²_2 / ‖p‖²_2, where p* is the computed approximation. fmincon stands for the minimization in orthonormal bases with the Optimization Toolbox of MATLAB. perm0 denotes perm with Π fixed to I_m.

The methods are given the same initial approximation. In perm and perm0, the Levenberg-Marquardt method is used for local optimization over X. Our preliminary results suggest that the methods perm and reg are competitive with the method genrtr (see [7] for more details).

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement no. 258581 "Structured low-rank approximation: Theory, algorithms, and applications".

References

[1] D. E. Knuth. 'Semi-optimal bases for linear dependencies'. Lin. and Multilin. Algebra 17, 1–4, 1985.

[2] B. De Moor, P. De Gersem, B. De Schutter and W. Favoreel. 'DAISY: A database for identification of systems'. Journal A (Benelux publication of the Belg. Fed. of Aut. Control) 38, 4–5, 1997.

[3] P.-A. Absil, R. Mahony and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.

[4] I. Markovsky. Low Rank Approximation: Algorithms, Implementation, Applications. Communications and Control Engineering. Springer, 2012.

[5] I. Markovsky and K. Usevich. 'Structured low-rank approximation with missing values'. SIAM J. Matrix Anal. Appl., in press, 2013.

[6] K. Usevich and I. Markovsky. 'Variable projection methods for affinely structured low-rank approximation in weighted 2-norms', J. Comput. Appl. Math., 2013. doi:10.1016/j.cam.2013.04.034

[7] K. Usevich and I. Markovsky. 'Structured low-rank approximation as optimization on a Grassmann manifold', preprint, 2013. Available from: http://homepages.vub.ac.be/~kusevich/preprints.html.

[8] GenRTR (Generic Riemannian Trust-Region package). Available from: http://www.math.fsu.edu/~cbaker/genrtr/.


Scalable Structured Low Rank Matrix Optimization Problems

Marco Signoretto
ESAT-SCD/SISTA, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven (BELGIUM)
[email protected]

Volkan Cevher
Laboratory for Information and Inference Systems (LIONS), EPFL STI IEL LIONS ELD 243 Station 11, Lausanne (SWITZERLAND)
[email protected]

Johan A. K. Suykens
ESAT-SCD/SISTA, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven (BELGIUM)
[email protected]

Abstract: We consider a class of structured low rank matrix optimization problems. We represent the desired structure by a linear map, termed mutation, that can encode matrices having entries partitioned into known disjoint groups. Our interest arises in particular from concatenated block-Hankel matrices that appear in formulations for input-output linear system identification problems with noisy and/or partially unobserved data. We present an algorithm and test it against an existing alternative.

1 Introduction

Nuclear norm optimization methods for low-rank matrix approximation have been discussed in several recent papers. In many different settings, in fact, general notions of model complexity can be conveniently expressed by the rank of an appropriate matrix; in turn, well known properties of the nuclear norm motivate its use for convex relaxations of rank-based problems. In this contribution, we focus on problems where we need to find a matrix that, in addition to being low-rank, is required to have entries partitioned into known disjoint groups. This setting includes various types of structured matrices such as Hankel, Toeplitz and circulant matrices. Generally speaking, it allows one to deal with matrices that have reduced degrees of freedom. The desired structure is encoded by a linear map, termed mutation, that we characterize and use in our algorithm. We then discuss an application to linear system identification problems with noisy and partially unobserved input-output data. A recently proposed algorithm that applies to this problem involves at each iteration the singular value decomposition (SVD) of a structured matrix Y ∈ R^{M×N}, with M ≤ N. The cost of this step usually grows as O(M²N), which hinders the application to large scale problems.

2 Problem Statement

We present an SVD-free solution strategy that delivers low rank solutions with comparable quality (in terms of objective value, feasibility and model fit) at a substantially lower computational price.

2.1 Main Problem Formulation
Recall that, for a generic matrix A ∈ R^{M×N} with rank R and singular values σ_1(A) ≥ σ_2(A) ≥ · · · ≥ σ_R(A) > 0, the nuclear norm (a.k.a. trace norm or Schatten-1 norm) is defined as ‖A‖_* = ∑_{r∈N_R} σ_r(A), where N_R is used as a shorthand for the set {1, 2, . . . , R}. We focus on the situation where we need to find a matrix that, in addition to being low-rank, is required to have entries partitioned into known disjoint groups. For a significant class of problems this task can be accomplished by solving the convex optimization problem:

min_{x∈R^L}  (1/2)(x − a)^⊤ H (x − a) + ‖B(x)‖_*   (1)

for a given vector a ∈ R^L, a positive definite matrix H ∈ R^{L×L} and a linear map B, the features of which are given below. The corresponding quadratic term usually plays the role of a data fitting measure and can subsume a trade-off parameter λ. A conventional choice is, in particular, H = λ I_L, where I_L is the L × L identity matrix.

2.2 Encoding the Structure by Mutations
Formulation (1) regards a structured matrix as the output of a linear map B : R^L → R^{M×N}, termed mutation, that maps entries of a vector x into disjoint groups P_l, l = 1, 2, . . . , L, forming a partition of the set of entries of the matrix B(x). More formally, if ι : (m,n) ↦ {l : (m,n) ∈ P_l} is a well defined membership function, then:

B : x ↦ ( x_{ι(m,n)} : (m,n) ∈ N_M × N_N ).   (2)

Mutations give rise to structured matrices, with Hankel, Toeplitz and circulant matrices as special cases. It is


not difficult to show that the adjoint operator is:

B* : X ↦ ( ∑_{(m,n)∈P_l} x_{mn} : l ∈ N_L ).   (3)

From (2) and (3) one can then show that B*B is represented by a diagonal matrix with entries equal to the cardinality of the sets P_l, l = 1, 2, . . . , L. Starting from a known partition one can then give a "quick and dirty" implementation of these operators based on linear indexing [5]. More efficient implementations can be given for mutations encoding matrices with more specialized structure.
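A "quick and dirty" sketch of the two operators along the lines of the linear-indexing idea mentioned above, assuming the membership function ι is stored as an integer array idx of shape M × N with values in {0, …, L−1} (an illustrative encoding).

```python
import numpy as np

def mutation(x, idx):
    """B: place x[iota(m, n)] at entry (m, n) of the structured matrix."""
    return x[idx]

def mutation_adjoint(X, idx, L):
    """B*: sum the entries of X over each group P_l, l = 0, ..., L-1."""
    return np.bincount(idx.ravel(), weights=X.ravel(), minlength=L)
```

Note that np.bincount(idx.ravel(), minlength=L) then gives the group cardinalities, i.e. the diagonal representing B*B mentioned above.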

3 SVD-free Algorithm via Augmented Lagrangian Approach

In order to avoid computing SVDs we rely on a known variational characterization of the nuclear norm:

‖Y‖_* = min_{U,V : Y = UV^⊤}  (1/2)( ‖U‖²_F + ‖V‖²_F )   (4)

where we considered full (i.e., not thin) matrices U and V. This suggests the following restatement of problem (1):

min_{x,U,V}  (1/2)(x − a)^⊤ H (x − a) + (1/2)( ‖U‖²_F + ‖V‖²_F )
subject to  B(x) = U V^⊤.   (5)

Note that this problem is non-convex due to the product between U and V. Nonetheless, it is possible to show that if (x*, U*, V*) is a solution to (5), then x* is a solution to (1) and B(x*) = U* V*^⊤. A similar result is found in [4, Section 5.3], which discusses an approach based on (4) to solve an unstructured low rank matrix problem. Problem (5) has a differentiable objective function. It can be tackled via a gradient-based solution strategy after embedding the constraint B(x) = U V^⊤ by an augmented Lagrangian approach. Details of our approach are given in [5]. Note that, in contrast, the objective of problem (1) is non-smooth. Correspondingly, computing a solution to problem (1) requires a sub-gradient approach or an operator splitting technique [1]. This, ultimately, leads to singular value soft-thresholding [2], which requires computing at each iteration the SVD of an M × N matrix.

4 System Identification with Missing Data by Nuclear Norm Optimization

Recently, [3] proposed the following formulation for linear system identification from input-output data u, y:

min_{û,ŷ}  (λ_1/2) ∑_{i∈M_u} (u_i − û_i)² + (λ_2/2) ∑_{i∈M_y} (y_i − ŷ_i)² + ‖F(û, ŷ)‖_*   (6)

where û, ŷ are latent inputs and outputs, M_u (resp. M_y) is a set of indices of observed inputs (resp. outputs) and F(û, ŷ) is a matrix obtained by stacking the block-Hankel matrices H_Q(û) and H_Q(ŷ).

Tab. 1: Results for V% = 20, P = 2, S = 3.

                     obj val    feas. (10^-3)   model fit   CPU time
M = 5,  T = 1500
  SVD-free           1016.79    0.30            79          0.67
  SVD-based          1017.36    0.87            79          5.61
M = 15, T = 1500
  SVD-free           1166.05    0.51            75          1.96
  SVD-based          1165.71    0.30            75          8.65
M = 20, T = 4000
  SVD-free           2079.09    0.22            76          6.78
  SVD-based          2081.47    0.86            76          47.01
M = 40, T = 4000
  SVD-free           2501.61    0.39            70          22.32
  SVD-based          2501.76    0.37            70          120.16
M = 50, T = 10000
  SVD-free           4803.46    0.30            67          111.28
  SVD-based          4803.18    0.92            67          825.77

This method is motivated by the fact that, if the generating dynamical system is linear, then under mild conditions one has rank(F(u, y)) = S + rank(H_Q(u)), where S is the order of the system. It is not difficult to show that this problem is equivalent to (1) for a suitably defined mutation B and diagonal p.s.d. matrix H [5].

5 Experiments

We tested the SVD-free approach presented in [5] against the SVD-based algorithm of [3] on randomly generated linear dynamical systems; details are given in [5]. Table 1 reports a representative set of experiments in which we kept the percentage of missing data (V%), the input dimension (P) and the system order (S) fixed, and varied the output dimension (M) and the number of samples (T). The experiments show that the proposed algorithm delivers low rank solutions with comparable quality (in terms of objective value, feasibility and model fit) at a substantially lower computational price than the existing SVD-based algorithm.

References

[1] H.H. Bauschke and P.L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Verlag, 2011.

[2] J.F. Cai, E.J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[3] Z. Liu, A. Hansson, and L. Vandenberghe. Nuclear norm system identification with missing inputs and outputs. Submitted to Systems and Control Letters, 2012.

[4] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471–501, 2010.

[5] M. Signoretto, V. Cevher, and J.A.K. Suykens. An SVD-free approach to a class of structured low rank matrix optimization problems with application to system identification. Technical report 13-44, ESAT-SISTA, KU Leuven (Leuven, Belgium), submitted, 2013.


Learning with Marginalized Corrupted Features

Laurens van der Maaten
Delft Univ. of Techn.
[email protected]

Minmin Chen
Wash. Univ. in St. Louis
[email protected]

Stephen Tyree
Wash. Univ. in St. Louis
[email protected]

Kilian Weinberger
Wash. Univ. in St. Louis
[email protected]

Abstract: We propose a new framework for regularization, called marginalized corrupted features, that reduces overfitting by increasing the robustness of the model to data corruptions.

Keywords: Regularization; Supervised learning.

1 Introduction

Dealing with overfitting is one of the key problems one encounters when training machine-learning models. Three approaches are commonly used to combat overfitting: (1) early-stopping techniques stop the learning as soon as the performance on a held-out validation set deteriorates; (2) regularizers encourage the learning to find "simple" models by penalizing "complex" models, e.g., models with large parameter values; and (3) Bayesian techniques define a prior distribution over models that favors simple models, and perform predictions by averaging over the model posterior.

We propose an alternative to counter overfitting, called marginalized corrupted features (MCF; [1]). Instead of requiring the user to define priors over model parameters, which can be very counterintuitive, we focus on corruptions of the data. MCF is based on the observation that overfitting would completely disappear if we were to train on infinite data drawn from the data distribution P. Unfortunately, a learning scenario in which we only obtain a finite training set is more realistic. In many learning scenarios, we may however have some additional knowledge about the data distribution: we might know that certain corruptions of data instances do not affect their label. As an example, deleting a few words in a text document rarely changes its topic. With this prior knowledge, we can corrupt existing data to generate new artificial instances that resemble those sampled from the actual data distribution. MCF corrupts the existing finite training examples with a fixed corrupting distribution to construct an infinite training set on which the model is trained. For a wide range of learning models and corrupting distributions, we show that it is practical to train models on such an infinite, augmented training set.

2 Marginalized Corrupted Features

We start by defining a corrupting distribution that specifies how training observations x are transformed into corrupted versions x̃. We assume that the corrupting distribution factorizes over dimensions and that each individual distribution P_E is a member of the natural exponential family:

p(x̃|x) = ∏_{d=1}^D P_E(x̃_d | x_d; η_d),

where η_d represents user-defined hyperparameters of the corrupting distribution on dimension d. Corrupting distributions of interest, P_E, include: (1) independent salt or "blankout" noise in which the d-th feature is randomly set to zero with probability q_d; (2) independent Gaussian noise on the d-th feature with variance σ_d²; and (3) independent Poisson corruptions in which the d-th feature is used as the rate of the distribution.

Assume we are provided with a training data set D = {(x_n, y_n)}_{n=1}^N and a loss function L(x, y; Θ), with model parameters Θ. A simple approach to approximately learn from the distribution p(x̃|x)P(x) is to corrupt each training sample M times, and to train on the resulting corrupted data in an empirical risk minimization framework, by minimizing:

L(D; Θ) = ∑_{n=1}^N (1/M) ∑_{m=1}^M L(x̃_{nm}, y_n; Θ),

with x̃_{nm} ∼ p(x̃_{nm}|x_n). Although such an approach is effective, it lacks elegance and comes with high computational costs: the minimization of L(D; Θ) scales linearly in the number of corrupted observations. MCF addresses these issues by considering the limiting case M → ∞, in which we can rewrite (1/M) ∑_{m=1}^M L(x̃_m, y_m; Θ) as its expectation to obtain:

L(D; Θ) = ∑_{n=1}^N E[L(x̃_n, y_n; Θ)]_{p(x̃_n|x_n)}.

For linear predictors that employ a quadratic or exponential loss function, the required expectation can be computed analytically for all corrupting distributions in the natural exponential family; for linear predictors that employ logistic loss, we derive a practical upper bound on the expected loss. As a result, direct minimization of the resulting MCF loss functions is efficient.


Quadratic loss. Assuming a linear model parametrized by vector w and a target variable y (for regression, y is continuous; for binary classification, y ∈ {−1, +1}), the expected value of the quadratic loss under corrupting distribution p(x̃|x) is given by:

L(D; w) = ∑_{n=1}^N E[ (w^T x̃_n − y_n)² ]_{p(x̃_n|x_n)}
        = w^T H w − 2 ( ∑_{n=1}^N y_n E[x̃_n] )^T w + N,

where the hat matrix H = ∑_{n=1}^N E[x̃_n] E[x̃_n]^T + V[x̃_n], V[x̃] is the variance of x̃, and all expectations are under p(x̃_n|x_n). Hence, to minimize the expected quadratic loss under the corruption, we only need to compute the mean and variance of the corrupting distribution.
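As a concrete instance, a sketch of the resulting closed-form solution for blankout corruption, using the moments E[x̃_d] = (1−q_d) x_d and V[x̃_d] = q_d(1−q_d) x_d² implied by the definition above; the optional ridge term lam is an add-on for the L2-regularized setting used in the experiments, not part of the expectation itself.

```python
import numpy as np

def mcf_quadratic_blankout(X, y, q, lam=0.0):
    """Minimizer of the expected quadratic loss under blankout corruption
    (feature d set to zero with probability q[d]); X is N x D, y in {-1,+1}^N.
    Solves w = H^{-1} (sum_n y_n E[x~_n]) with H as defined in the text."""
    Ex = X * (1 - q)                        # E[x~_n], row-wise
    Vx = (X ** 2) * q * (1 - q)             # per-dimension variances
    H = Ex.T @ Ex + np.diag(Vx.sum(axis=0)) + lam * np.eye(X.shape[1])
    return np.linalg.solve(H, Ex.T @ y)
```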

Exponential loss. Assuming a label variable y ∈ {−1, +1}, the expected value of the exponential loss under corruption model p(x̃|x) is given by:

L(D; w) = ∑_{n=1}^N E[ exp(−y_n w^T x̃_n) ]_{p(x̃_n|x_n)}
        = ∑_{n=1}^N ∏_{d=1}^D E[ exp(−y_n w_d x̃_{nd}) ]_{p(x̃_{nd}|x_{nd})}.

The equation can be recognized as a product of moment-generating functions (MGFs), E[exp(t_{nd} x̃_{nd})] with t_{nd} = −y_n w_d. MGFs can be computed for all corrupting distributions in the natural exponential family.
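For blankout noise, for example, the per-dimension MGF is E[exp(t x̃_d)] = q_d + (1−q_d) exp(t x_d), so the expected exponential loss can be evaluated directly. The helper below is a hypothetical illustration of this formula, not part of the original work.

```python
import numpy as np

def mcf_exponential_loss_blankout(X, y, w, q):
    """Expected exponential loss under blankout corruption, computed as a
    product of MGFs with t_nd = -y_n * w_d; X is N x D, y in {-1,+1}^N."""
    T = -y[:, None] * w[None, :]             # t_nd
    mgf = q + (1 - q) * np.exp(T * X)        # E[exp(t_nd x~_nd)] per dimension
    return np.sum(np.prod(mgf, axis=1))
```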

Logistic loss. The expected logistic loss cannot be computed in closed form. Instead, we derive an upper bound that can be minimized as a surrogate loss:

L(D; w) = ∑_{n=1}^N E[ log( 1 + exp(−y_n w^T x̃_n) ) ]_{p(x̃_n|x_n)}
        ≤ ∑_{n=1}^N log( 1 + ∏_{d=1}^D E[ exp(−y_n w_d x̃_{nd}) ]_{p(x̃_{nd}|x_{nd})} ).

Herein, we have made use of Jensen's inequality to upper-bound E[log(z)]. We again recognize a product of MGFs that can be computed in closed form for corrupting distributions in the natural exponential family.

3 Experiments

Document classification. We perform experiments with blankout MCF and Poisson MCF on four Amazon data sets (using all three loss functions). All data sets have about 5,000 examples and 20,000 (bag-of-words) features. All our classifiers use L2-regularization in addition to MCF, and we cross-validated over the regularization parameter λ. The results of the experiments are shown in Figure 1. The results reveal the potential of MCF to improve the performance of linear predictors (note that "standard" classifiers correspond to blankout MCF with q = 0). The best performances are obtained using blankout MCF with q ≈ 0.7 and λ = 0.

[Figure: four panels (Books, DVD, Electronics, Kitchen) plotting classification error versus noise level q for blankout MCF and Poisson MCF with the quadratic, exponential and logistic losses.]
Fig. 1: Results of the document classification experiment.

Nightmare at test time. We experiment with blankout MCF on a "nightmare at test time" scenario in which a percentage of the features is randomly unobserved at test time on the MNIST handwritten digit data set. Figure 2 presents the classification error of standard and MCF classifiers as a function of the percentage of features deleted from the test data. The results show the strong performance of MCF in this scenario; in particular, MCF outperforms the state-of-the-art technique for this learning setting, FDROP.

[Figure: classification error versus percentage of deleted test features for L2-regularized quadratic, exponential, logistic and hinge losses, hinge loss with FDROP, and the MCF quadratic, exponential and logistic losses.]
Fig. 2: Results of the "nightmare at test time" experiment.

References

[1] L.J.P. van der Maaten, M. Chen, S.W. Tyree, and K.Q. Weinberger. Learning with Marginalized Corrupted Features. Int. Conf. on Machine Learning, 2013.


Robust regularized M-estimators of regression parameters and covariance matrix

Esa Ollila, Hyon-Jung Kim and Visa Koivunen
Department of Signal Processing and Acoustics, SMARAD CoE

1 Introduction

High dimensional data sets are challenging for data analysts. In regression one often resorts to convex relaxation methods such as ridge regression, which seeks a balance in the bias-variance trade-off. HD-LSS data provide a challenge to classical multivariate analysis as well. For example, principal component analysis (PCA) is a common pre-processing and standardization step which cannot be performed due to a possibly rank deficient sample covariance matrix (SCM). Also, impulsive measurement environments and outliers are commonly encountered in many practical applications. In this paper we tackle these two important issues and consider robust shrinkage approaches for regression parameter and covariance (scatter) matrix estimation problems in the case of HD-LSS data. Recently, robust shrinkage approaches for covariance matrix estimation were addressed in [1, 7] and for the regression setting in [4].

2 Shrinkage M-estimates of regression

We consider the multiple linear regression model y_i = φ_i^⊤ s + ε_i, i = 1, . . . , n, which can be more conveniently expressed in matrix form as y = Φs + ε, where y = (y_1, . . . , y_n)^⊤ is the observed data vector (measurements), Φ = (φ_1 · · · φ_n)^⊤ is the known n × p measurement matrix, s = (s_1, . . . , s_p)^⊤ is the unobserved signal vector (or regression coefficients) and ε = (ε_1, . . . , ε_n)^⊤ is the (unobserved) noise vector. The problem is then to estimate the unknown s. All measurements and parameters are assumed to be real-valued. Suppose the ε_i are i.i.d. from a continuous symmetric distribution with p.d.f. f_ε(e) = (1/σ) f_0(e/σ), where σ > 0 denotes the scale parameter and f_0(·) the standard form of the p.d.f.

2.1 Ridge M-estimates with preliminary scale
We first assume that the scale parameter σ is known or replaced by its estimate σ̂. In either case, we replace hereafter σ by σ̂. First recall that the ridge regression (RR) estimator [2] is defined as the (unique) minimizer of the penalized residual sum of squares objective function J_RR(s) = ∑_{i=1}^n (y_i − φ_i^⊤ s)² + λ‖s‖²_2, where λ > 0 is the ridge (shrinkage or regularization) parameter. The bigger the λ, the greater is the amount of shrinkage of the coefficients toward zero. Let us denote the residuals for a given (candidate) s by e_i(s) = y_i − φ_i^⊤ s and write e(s) = (e_1(s), . . . , e_n(s))^⊤. We define the ridge regression M-estimator ŝ_λ as the minimizer of

J(s) = σ̂² ∑_{i=1}^n ρ( e_i(s)/σ̂ ) + λ‖s‖²_2   (1)

where ρ is a continuous and differentiable even function (ρ(e) = ρ(−e)) that is increasing for e ≥ 0. Note that the multiplying factor σ̂² is used so that the objective function coincides with J_RR(s) when ρ(e) = e².

Let us write ψ(e) = ρ′(e) and w(e) = ψ(e)/e, with the convention that w(e) = 0 for e = 0. To obtain robust RR M-estimates, we need ρ-functions that give small or zero weights to large residuals. In the conventional regression model (n > p), the maximum likelihood (ML) estimator of s is found by choosing ρ(e) ∝ −log f_0(e) + c in (1) with λ = 0. In case of Cauchy error terms, ρ_C(e) = (1/2) log(1 + e²) and ψ_C(e) = ρ′_C(e) = e/(1 + e²), whereas Huber's ρ function is

ρ_H(e) = (1/2) e²  for |e| ≤ k,   ρ_H(e) = k|e| − (1/2) k²  for |e| > k,

and the corresponding ψ-function is ψ_H(e) =


max[−k, min(k, e)]. Above, k is a user-defined tuning constant that affects the robustness and efficiency of the method. With Huber's ρ-function, the objective function in (1) is convex, but due to the non-convexity of the Cauchy ρ, the associated optimization problem is non-convex.

Computation: Setting the derivatives of (1) to zero shows that ŝ_λ solves the following estimating equation:

(Φ^⊤ W Φ + 2λI) ŝ_λ = Φ^⊤ W y   (2)

where W = diag({w_i}_{i=1}^n) with w_i = w( e_i(ŝ_λ)/σ̂ ). This suggests computing the estimator by an "iteratively (re)weighted RR" (IWRR) algorithm, which iterates

s_{t+1} = (Φ^⊤ W_t Φ + 2λI)^{−1} Φ^⊤ W_t y   (3)

until convergence. Note that an initial estimate s_0 of s and an estimate σ̂ of the scale of the residuals are needed. Following [4], it can be shown that the objective function (1) descends at each iteration. Thus, for convex problems the IWRR algorithm can be used to find the global minimum.
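A minimal sketch of the IWRR iteration (3) with Huber weights w(e) = ψ_H(e)/e = min{1, k/|e|}; the particular tuning constant, the zero initial estimate and the fixed iteration count are illustrative choices rather than the authors' settings.

```python
import numpy as np

def iwrr_huber(Phi, y, lam, sigma, k=1.345, n_iter=50):
    """Iteratively reweighted ridge regression (3) for the Huber rho-function,
    given a preliminary scale estimate sigma; starts from s = 0."""
    n, p = Phi.shape
    s = np.zeros(p)
    for _ in range(n_iter):
        e = (y - Phi @ s) / sigma
        w = np.minimum(1.0, k / np.maximum(np.abs(e), 1e-12))  # Huber weights
        PhiW = Phi.T * w                                        # Phi^T W
        s = np.linalg.solve(PhiW @ Phi + 2 * lam * np.eye(p), PhiW @ y)
    return s
```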

2.2 Ridge M-estimates of regression and scale
Next we consider the joint estimation of the regression parameter s and the scale σ. We define the joint ridge M-estimators of regression ŝ_λ and scale σ̂ as the minimizers of

J(s, σ) = σ ∑_{i=1}^n ρ( e_i(s)/σ ) + λ‖s‖²_2 + α(nσ).   (4)

where ρ and λ are as earlier and α ≥ 0 is a tuning parameter. Note that if λ = 0, then the equation reduces to the approach proposed by Huber. Setting the derivatives of (4) w.r.t. (s, σ²) to zero shows that (ŝ_λ, σ̂) solves the estimating equation (2) jointly with the estimating equation

(1/n) ∑_{i=1}^n χ( e_i(ŝ_λ)/σ̂ ) = α   (5)

where χ(e) = ψ(e)e − ρ(e) and ψ = ρ′ as earlier.

3 Shrinkage M-estimators of covariance

Here we consider the complex-valued case. Recall that a random vector (r.v.) z ∈ C^p is said to have a p-variate complex elliptically symmetric (CES) distribution [5] if its p.d.f. is of the form f(z) = C_{p,g} |Σ|^{−1} g(z^H Σ^{−1} z), for some positive definite Hermitian (PDH) p × p scatter matrix parameter Σ and a function g : R^+_0 → R^+, called the density generator. We shall write z ∼ CE_p(0, Σ, g). Above, C_{p,g} is a normalizing constant and (·)^H denotes the Hermitian transpose. Note that the covariance matrix of z (when it exists) is equal to E[zz^H] = c · Σ. Consider n i.i.d. samples from a CES distribution. Let us add a regularization term λ Tr(Σ^{−1}) to the −1 × log-likelihood function and minimize

L(Σ) = ∑_{i=1}^n ρ(z_i^H Σ^{−1} z_i) + n ln|Σ| + λ Tr(Σ^{−1}).   (6)

where ρ(t) = −ln g(t) and λ > 0 is the fixed regularization parameter. Thus we impose a bound on Tr(Σ^{−1}) = ∑_{i=1}^p 1/γ_i, where the γ_i denote the eigenvalues of Σ. As a consequence, the solution will not be ill-conditioned when n < p (the HD-LSS underdetermined case). In the Gaussian case, z_i ∼ CN_p(0, Σ), we have that ρ(t) = t and the solution to (6) is easily shown to be Σ̂ = S + λI, where S = (1/n) ∑_{i=1}^n z_i z_i^H denotes the SCM. We shall consider a generalization of the methodology in [1, 7] and consider shrinkage M-estimators of Σ for general ρ functions.

Applications: Conventional beamforming cannot be used in HD-LSS underdetermined problems; e.g., the minimum variance distortionless response (MVDR) beamformer weight vector requires the inverse of the SCM. Applying diagonal loading, i.e., using S + λI in place of S, is the commonly used approach; see [3]. Our simulations show that robust shrinkage covariance matrix estimators (e.g., the shrinkage Tyler's M-estimator [1, 7] or the Huber's M-estimator proposed here) provide superior performance in non-Gaussian impulsive noise.

References

[1] Y. Chen, A. Wiesel, and A. O. Hero. Robust shrinkage estimation of high-dimensional covariance matrices. IEEE Trans. Signal Processing, 59(9):4097–4107, 2011.

[2] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[3] J. Li, P. Stoica, and Z. Wang. On robust Capon beamforming and diagonal loading. IEEE Trans. Signal Processing, 51(7):1702–1715, 2003.

[4] R. A. Maronna. Robust ridge regression for high-dimensional data. Technometrics, 53(1):44–53, 2011.

[5] E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor. Complex elliptically symmetric distributions: survey, new results and applications. IEEE Trans. Signal Processing, 60(11):5597–5625, 2012.

[6] S. A. Razavi, E. Ollila, and V. Koivunen. Robust greedy algorithms for compressed sensing. In Proc. 20th European Signal Processing Conference (EUSIPCO'12), pages 969–973, Bucharest, Romania, 2012.

[7] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Trans. Signal Processing, 60(12):6182–6189, 2012.


Robust Near-Separable Nonnegative Matrix Factorization Using Linear Optimization

Nicolas Gillis

ICTEAM Institute

Universite catholique de Louvain

B-1348 Louvain-la-Neuve

Email: [email protected]

Robert Luce

Institut fur Mathematik

Technische Universitat Berlin

Strasse des 17. Juni 136 - 10623 Berlin, Germany

Email: [email protected]

Abstract: Nonnegative matrix factorization (NMF) has been shown recently to be tractable under the separability assumption, which requires the columns of the input data matrix to belong to the convex cone generated by a small number of columns. Bittorf, Recht, Re and Tropp ('Factoring nonnegative matrices with linear programs', NIPS 2012) proposed a linear programming (LP) model, referred to as HottTopixx, which is robust under any small perturbation of the input matrix. However, HottTopixx has two important drawbacks: (i) the input matrix has to be normalized, and (ii) the factorization rank has to be known in advance. In this talk, we generalize HottTopixx in order to resolve these two drawbacks, that is, we propose a new LP model which does not require normalization and detects the factorization rank automatically. Moreover, the new LP model is more flexible, significantly more tolerant to noise, and can easily be adapted to handle outliers and other noise models. We show on several synthetic datasets that it outperforms HottTopixx while competing favorably with two state-of-the-art methods.

Keywords: nonnegative matrix factorization, linear programming, robustness to noise

1 Introduction

Nonnegative matrix factorization (NMF) is a powerful dimensionality reduction technique as it automatically extracts sparse and meaningful features from a set of nonnegative data vectors. Given n nonnegative m-dimensional vectors gathered in a nonnegative matrix M ∈ R^{m×n}_+ and a factorization rank r, NMF computes two nonnegative matrices W ∈ R^{m×r}_+ and H ∈ R^{r×n}_+ such that M ≈ WH. Unfortunately, NMF is NP-hard in general [8]. However, if the input data matrix M is r-separable, that is, if it can be written as M = W[I_r, H′]Π, where I_r is the r-by-r identity matrix, H′ ≥ 0 and Π is a permutation matrix, then the problem can be solved in polynomial time [2]. Separability means that there exists an NMF (W, H) ≥ 0 of M of rank r where each column of W is equal to a column of M. Geometrically, r-separability means that the cone generated by the columns of M has r extreme rays given by the columns of W. Equivalently, if the columns of M are normalized to sum to one, r-separability means that the convex hull generated by the columns of M has r vertices given by the columns of W; see, e.g., [7]. The separability assumption makes sense in several applications, e.g., text mining, hyperspectral unmixing and blind source separation [6]. Several algorithms have been proposed to solve the near-separable NMF problem, e.g., [2], [6], [7], which refers to the NMF problem of a separable matrix M to which some noise is added; see Section 2. In this talk, our focus is on the LP model proposed by Bittorf, Recht, Re and Tropp [3] and referred to as HottTopixx. It is described in the next section.

2 HottTopixx

A matrix M is r-separable if and only if

M = WH = W[I_r, H′]Π = [W, WH′]Π
  = [W, WH′]Π Π^{−1} [ I_r          H′
                       0_{(n−r)×r}  0_{(n−r)×(n−r)} ] Π,

where X_0 := Π^{−1} [ I_r, H′ ; 0_{(n−r)×r}, 0_{(n−r)×(n−r)} ] Π ∈ R^{n×n}_+, for some permutation Π and some matrix H′ ≥ 0. The matrix X_0 is an n-by-n nonnegative matrix with (n−r) zero rows such that M = M X_0. Assuming the columns of M sum to one, the columns of W and H′ sum to one as well. Based on these observations, Bittorf, Recht, Re and Tropp [3] proposed to solve the following optimization problem in order to identify approximately the columns of the matrix W among the columns of the noisy separable matrix M̃ = WH + N


with ||N||_1 = max_j ||N(:, j)||_1 ≤ ε :

    min_{X ∈ R^{n×n}_+}   p^T diag(X)
    such that   ||M − MX||_1 ≤ 2ε,
                tr(X) = r,                                (1)
                X(i, i) ≤ 1 for all i,
                X(i, j) ≤ X(i, i) for all i, j,

where p is any n-dimensional vector with distinct entries. The r largest diagonal entries of an optimal solution X* of (1) correspond to columns of M close to the columns of W, given that the noise is sufficiently small [4]; a code sketch of (1) follows the list of drawbacks below. However, HottTopixx has two important drawbacks:

• the factorization rank r has to be chosen in advance, so that the LP above has to be re-solved whenever it is modified (in fact, in practice, a ‘good’ factorization rank for the application at hand is typically found by a trial-and-error approach),

• the columns of the input data matrix have to be normalized in order to sum to one. This may introduce important distortions in the dataset and lead to poor performance [7].
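For concreteness, here is a minimal sketch (not the authors' code) of the LP (1) written with CVXPY; the function name hotttopixx, the random choice of p and the use of a generic solver are illustrative assumptions.

import numpy as np
import cvxpy as cp

def hotttopixx(M, r, eps):
    # Sketch of the HottTopixx LP (1); M is assumed column-normalized and near-separable.
    n = M.shape[1]
    p = np.random.permutation(n) + 1.0                    # any vector with distinct entries
    X = cp.Variable((n, n), nonneg=True)
    col_l1 = cp.max(cp.sum(cp.abs(M - M @ X), axis=0))    # ||M - MX||_1 = max_j ||(M - MX)(:, j)||_1
    constraints = [col_l1 <= 2 * eps,
                   cp.trace(X) == r,
                   cp.diag(X) <= 1,
                   # X(i, j) <= X(i, i) for all i, j
                   X <= cp.reshape(cp.diag(X), (n, 1)) @ np.ones((1, n))]
    cp.Problem(cp.Minimize(p @ cp.diag(X)), constraints).solve()
    # the r largest diagonal entries point to columns of M close to the columns of W
    return np.argsort(-np.diag(X.value))[:r]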

3 Contribution

In this talk, we generalize HottTopixx in order to resolve the two drawbacks mentioned above. More precisely, we propose a new LP model which has the following properties:

• It detects the number r of columns of W automatically.

• It can be adapted to deal with outliers.

• It does not require column normalization.

• It is significantly more tolerant to noise than HottTopixx. In fact, we propose a tight robustness analysis of the new LP model proving its superiority.

This is illustrated on several synthetic datasets, where the new LP model is shown to outperform HottTopixx while competing favorably with two state-of-the-art methods, namely the successive projection algorithm (SPA) from [1, 6] and the fast conical hull algorithm (XRAY) from [7]; see Figure 1. We refer the reader to [5] for all the details about the proposed algorithm, including the proof of robustness and more numerical experiments.

Fig. 1: Comparison of near-separable NMF algorithms on synthetic datasets. The noisy separable matrices are generated as follows: each entry of W ∈ R^{50×100}_+ is generated uniformly at random in [0, 1] and then each column of W is normalized to sum to one, each column of H′ in R^{10}_+ is generated using a Dirichlet distribution whose parameters are picked uniformly at random in [0, 1], and the noise N contains one non-zero entry in each column so that ||N||_1 = ε. For each noise level ε, we generate 25 such matrices and display the percentage of columns of W correctly extracted by the different algorithms (hence the higher the curve, the better).

References

[1] U. Araujo, B. Saldanha, R. Galvao, T. Yoneyama, H. Chame, and V. Visani, “The successive projections algorithm for variable selection in spectroscopic multicomponent analysis,” Chemom. and Intell. Lab. Syst., vol. 57, no. 2, pp. 65–73, 2001.

[2] S. Arora, R. Ge, R. Kannan, and A. Moitra, “Computing a nonnegative matrix factorization – provably,” in STOC ’12, 2012, pp. 145–162.

[3] V. Bittorf, B. Recht, E. Re, and J. Tropp, “Factoring nonnegative matrices with linear programs,” in NIPS ’12, 2012, pp. 1223–1231.

[4] N. Gillis, “Robustness analysis of HottTopixx, a linear programming model for factoring nonnegative matrices,” 2012, arXiv:1211.6687.

[5] N. Gillis and R. Luce, “Robust near-separable nonnegative matrix factorization using linear optimization,” 2013, arXiv:1302.4385.

[6] N. Gillis and S. Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” 2012, arXiv:1208.1237.

[7] A. Kumar, V. Sindhwani, and P. Kambadur, “Fast conical hull algorithms for near-separable non-negative matrix factorization,” in International Conference on Machine Learning (ICML), 2013.

[8] S. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM J. on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.


Data-Driven and Problem-Oriented Multiple-Kernel Learning

Valeriya Naumova
Johann Radon Institute for Computational and Applied Mathematics

Altenbergerstrasse 69, 4040 Linz - [email protected]

Sergei V. Pereverzyev
Johann Radon Institute for Computational and Applied Mathematics

Altenbergerstrasse 69, 4040 Linz - [email protected]

Abstract: The paper is concerned with an adaptive kernel design for addressing the problem of function reconstruction inside/outside the scope of given data points. We analyze the state-of-the-art methods and, by doing that, we show the need for a novel and more sophisticated approach to a data-driven and problem-oriented kernel design. Finally, we present such an approach and show its superiority with respect to the known methods in numerical experiments with real data.
Keywords: reproducing kernel Hilbert space, multiple kernel design, approximation theory

1 Introduction

In recent years there has been a fast growing interest in defining and analyzing mathematical predictive models that reconstruct/predict a real-valued function f defined on X ⊂ R^d from available noisy data z = {(xi, yi)} ⊂ X × R, where yi = f(xi) + ξi, and ξi is a measurement error.

It is a well-known fact that such a reconstruction problem is ill-posed [2], and its numerical treatment can be performed in a stable way only by applying special regularization methods.

The most popular among them is the Tikhonov method, which in the present context consists in constructing the approximant for f as the minimizer fλ of the functional

Tλ(f; H, z) := (1/|z|) Σ_{i=1}^{|z|} (f(xi) − yi)^2 + λ‖f‖^2_H,    (1)

where |z| is the cardinality of the set z and λ is a regularization parameter, which trades off data error with smoothness measured in terms of a space H.
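As a reference point for what follows, a minimal sketch of minimizing (1) over an RKHS H_K via the representer theorem is given below; the Gaussian kernel and the function names are illustrative assumptions, not part of the paper.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(t, x) = exp(-||t - x||^2 / (2 sigma^2)); any symmetric positive definite kernel works here
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def tikhonov_rkhs(X, y, lam, kernel=gaussian_kernel):
    # Minimizer of (1): f_lam(x) = sum_i c_i K(x_i, x) with c = (K + lam*|z|*I)^{-1} y
    K = kernel(X, X)
    c = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda Xnew: kernel(Xnew, X) @ c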

At the same time, the efficient application of this scheme is governed by two issues: one of them is the choice of the regularization parameter λ, which has been extensively studied within classical regularization theory, and another one, which is a challenging and central problem [4], is the choice of a space H, whose norm is used for penalization.

However, a proper choice of a suitable space H has so far been elusive, since in most known studies the norm of the penalty term in (1) was assumed to be given a priori, for instance, by the norm of some Sobolev space [9, 10]. At the same time, keeping in mind that a Sobolev space is a particular example of a Reproducing Kernel Hilbert Space (RKHS) HK, induced by a symmetric positive definite function K(t, x), t, x ∈ X, called a kernel, the issue of the choice of a proper regularization space is, in fact, about the choice of a kernel for an RKHS.

One of the leading concepts behind the clarification of this issue in the past few years has been universal kernels [5], which potentially allow one to construct fλ having a good approximating property. However, as can be seen from the numerical experiments in [6], the concept of kernel universality does not guarantee a good approximation property at points outside the scope of seen inputs, which is our main interest. The same is true for kernels given as radial basis functions. Therefore, for prediction of the unknown functional dependency outside the scope of seen inputs, the question about a proper choice of a regularization kernel is, in general, open until now. In this talk we aim to shed light on this important but as of yet under-researched problem. Moreover, we are going to show how the clarification of this issue could help to improve the management of diabetes, namely by providing more accurate prediction of future blood glucose concentration.


2 Data-driven and problem-oriented kernel design

Lanckriet et al. [3] were among the first to emphasize the need to consider multiple kernels or parameterizations of kernels, and not a single a priori fixed kernel. These authors advocate the approach to a data-driven kernel choice, where one tries to find a “good” kernel K as a linear combination

K = Σ_{j=1}^{N} βj Kj    (2)

of a priori prescribed kernels Kj ∈ K({Kj}), j = 1, 2, . . . , N. Such an approach, often referred to as multiple kernel learning (MKL), has been an attractive topic in learning theory (see, e.g., [8] and references therein). At the same time, most MKL methods employ a standard simplex constraint β1 + β2 + . . . + βN = 1, βj > 0, j = 1, 2, . . . , N, on the combination weights of the kernels (2). As a result, the minimization of the Tikhonov functional (1) in this case is performed again with an a priori given regularization term, which means that most MKL methods are not really intended for a data-driven choice of the regularization space.
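A minimal illustration of the fixed linear parameterization (2) under the simplex constraint is sketched below (the function name is an assumption); the combined Gram matrix can be plugged directly into the Tikhonov minimizer sketched above.

import numpy as np

def combine_kernels(kernel_mats, beta):
    # K = sum_j beta_j K_j with simplex-constrained weights, as in (2)
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta > 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernel_mats))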

Moreover, it is worthy of notice that for some practical applications the set of linear combinations of kernels K({Kj}) is not rich enough, and, thus, more general parameterizations are also of interest.

Let us consider the set K(X) of all kernels defined on X ⊂ R^d. Let also Ω be a compact metric space and G : Ω → K(X) be an injection such that for any t, x ∈ X, the function ω → G(ω)(t, x) is a continuous map from Ω to R; here G(ω)(t, x) is the value of the kernel G(ω) ∈ K(X) at (t, x) ∈ X × X.

Each such mapping G determines a set of kernels

K(Ω, G) = {K : K = G(ω), K ∈ K(X), ω ∈ Ω}

parameterized by elements of Ω. In contrast to the set of linear combinations of kernels, K(Ω, G) may be a non-linear manifold.

Assume now that one is interested in choosing a kernel K from some admissible set K(Ω, G) of parameter-dependent kernels K = K(ω; t, x), where all parametric dependences are collected in a parameter vector ω ∈ Ω ⊂ R^l.

To this end, we present a novel approach to the data-driven kernel choice from K(Ω, G) based on the concept of meta-learning [1, 7]. In particular, the kernel is learned to adjust to each given input data x by reconstructing a vector function ω = ω(x), governing the choice of the kernel K = K(ω; t, x) from K(Ω, G), on the basis of experience with similar tasks. In addition,

we illustrate that the presented approach is superior to the “traditional” kernel selection procedure, in which a kernel is chosen “globally” for the whole given training set z but does not account for particular features of the input x. The constructed data-driven approach yields better performance, as seen from the results of extensive numerical experiments with real clinical data for the blood glucose prediction problem [6, 7].

We argue that the presented approach is a new promising direction that could lead to the construction of efficient data-driven algorithms. It appears that such an approach has not been systematically studied so far in the framework of regularization theory. The investigations presented in the current work contribute the first steps in this challenging direction.

References

[1] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Springer-Verlag, Berlin Heidelberg, 2009.

[2] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and Its Applications. Kluwer Academic Publishers, Dordrecht, Boston, London, 1996.

[3] G. Lanckriet, N. Christianini, L. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004.

[4] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach. Learn. Res., 6:1099–1125, 2005.

[5] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651–2667, 2006.

[6] V. Naumova, S. V. Pereverzyev, and S. Sampath. Extrapolation in variable RKHSs with application to the blood glucose reading. Inverse Problems, 27(7):075010, 13 pp., 2011.

[7] V. Naumova, S. V. Pereverzyev, and S. Sivananthan. A meta-learning approach to the regularized learning – case study: Blood Glucose prediction. Neural Networks, 33:181–193, 2012.

[8] C. S. Ong, A. Smola, and R. Williamson. Learning the kernel with hyperkernels. J. Machine Learning Research, 6:1043–1071, 2005.

[9] A. N. Tikhonov and V. B. Glasko. Use of the Regularization Methods in Non-Linear Problems, volume 5. USSR Comput. Math. Phys., 1965.

[10] G. Wahba. Spline Models for Observational Data, volume 59 of Series in Applied Mathematics. CBMS-NSF Regional Conf., SIAM, 1990.


Support Vector Machine with spatial regularization for pixel classification

R. Flamary
Lagrange Lab., Universite de Nice Sophia-Antipolis, CNRS, Observatoire de la Cote d’Azur, Nice, France

[email protected]

A. Rakotomamonjy
LITIS Lab.

Universite de Rouen, [email protected]

Abstract: We propose in this work to regularize the output of an SVM classifier on pixels in order to promote smoothness in the predicted image. The learning problem can be cast as a semi-supervised SVM with a particular structure encoding the pixel neighborhood in the regularization graph. We provide several optimization schemes in order to solve the problem for a linear SVM with ℓ2 or ℓ1 regularization and show the interest of the approach on an image classification example with very few labeled pixels.

Keywords: Support Vector Machine, Large scale learning, semi-supervised learning

1 Introduction

Pixel classification is the problem of assigning a class to every pixel in an image. This is a classical problem with several applications in medical imaging or in geoscience remote sensing, where it is denoted as image classification [2, 6]. A common approach for solving this problem is to use discriminative machine learning techniques and to treat pixels as independent vectors. In order to take into account the spatial prior over the pixels, several approaches have been proposed. One example is to add spatial features or kernels to the pixel representation, such as filter output images [6]. Another approach is to use post-processing on the output of the classifier, for instance by using a Markov Random Field to include spatial information [8]. While the post-processing approach can integrate high-order relations between pixels, it is also more computationally intensive.

Another challenge of pixel classification is the dataset itself. The number of pixels N increases quadratically with the size of the image and the number n of labeled pixels is usually small. This suggests the use of semi-supervised learning methods [2], which have led to dramatic performance improvements when the number of labeled pixels is small. Note that in their work, the large number of pixels is handled using low-rank kernel approximations, leading to the learning of a linear SVM on a small number d of nonlinear features.

In this paper, we focus on a linear SVM applied on d ≪ N features (potentially nonlinear) extracted from the data. We want not only to use unlabeled pixels in the learning problem but also to promote spatial smoothness of the output of the prediction function, which itself exploits the unlabeled pixels. We propose to this end to regularize the SVM output using a term that encodes the spatial neighborhood of the pixels, as seen in [5]. This approach is a particular case of manifold-based regularized semi-supervised learning. We discuss in the following how to solve the learning problem for different regularizations on the linear SVM. Finally, numerical experiments are performed in order to show the interest of the approach on a difficult pixel classification problem.

2 SVM with spatial regularization

The dataset consists of a full image of N pixels with d features per pixel (possibly a hyperspectral spectrum or other features). These pixels x are stored in the matrix X ∈ R^{N×d}. Only n < N of these pixels, with indexes i ∈ L, are labeled with yi ∈ {−1, +1}. We want to learn a prediction function f(·) of the form

f(x) = Σ_i wi xi + b = w^T x + b    (1)

where w ∈ R^d is the normal vector to the separating hyperplane and b ∈ R is a bias term.

2.1 Learning problem

We propose to learn the prediction function with the following optimization problem:

min_f  Σ_{i∈L} H(yi, f(xi)) + λs Σ_{i,j} W_{i,j} (f(xi) − f(xj))^2 + λr Ω(f)    (2)

where H(y, f(x)) = max(0, 1 − y f(x))^2 is the squared hinge loss, W ∈ R^{N×N} is a symmetric matrix of general term W_{i,j} that encodes the similarity between pixels i and j, λs and λr are regularization parameters and Ω(·) is the SVM regularization term.

This problem is a classical semi-supervised learning problem. If the similarity matrix W were chosen to be a Gaussian kernel matrix, the problem would boil down to a Laplacian SVM [2]. But it requires the computation of an O(N^2) kernel matrix. In our case, we want to promote smoothness of the output of the prediction function, i.e. we want neighboring pixels to have similar prediction scores. To this end, we propose a W matrix such that W_{i,j} = 0 everywhere except when pixels xi and xj are spatial neighbors (W_{i,j} = 1). Note that this regularization is similar to a total variation regularization but with a quadratic penalty term. Moreover, the Laplacian regularization term can be computed with a complexity O(N), which is essential for large scale learning.

Fig. 1: Ground truth labels, accuracy ACC, training time t in seconds and decision maps for IID SVM (ACC=0.73, t=0.01), SS-SVM (ACC=0.73, t=2.70) and SSS-SVM with ℓ2 (ACC=0.89, t=0.02) and ℓ1 (ACC=0.89, t=0.03) regularization. The regularization parameters of each method are selected in order to maximize test accuracy.

2.2 Optimization algorithm

Problem (2) can be reformulated in the linear case as

min_{w,b}  Σ_{i∈L} H(yi, w^T xi + b) + λs w^T Σ w + λr Ω(w)    (3)

where Σ = X^T (D − W) X with D the diagonal matrix such that D_{i,i} = Σ_{j=1}^{N} W_{i,j}.

When Ω(w) = ‖w‖^2_2, the problem is a classical ℓ2 SVM with a metric regularization (Σ′ = Σ + (λr/λs) I). One approach, suggested by [7] and [5], is to perform a change of variable w′ = Σ′^{1/2} w and x′ = Σ′^{−1/2} x. The resulting problem can be solved with a classical linear SVM solver such as the one proposed by [3].
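As an illustration of the ℓ2 case of problem (3), the sketch below (not the authors' solver; names and the plain gradient-descent loop are assumptions) minimizes the squared-hinge primal directly, taking Σ = X^T (D − W) X precomputed from the 4-neighbor pixel adjacency W.

import numpy as np

def fit_sss_svm_l2(X, y, labeled, Sigma, lam_s, lam_r, n_iter=1000, lr=1e-3):
    # X: (N, d) pixel features, y: (N,) labels in {-1, +1} (only y[labeled] is used),
    # Sigma: (d, d) spatial regularization metric X^T (D - W) X.
    d = X.shape[1]
    w, b = np.zeros(d), 0.0
    Xl, yl = X[labeled], y[labeled]
    for _ in range(n_iter):
        m = yl * (Xl @ w + b)                    # margins on the labeled pixels
        a = m < 1                                # active squared-hinge terms
        gw = -2 * (yl[a] * (1 - m[a])) @ Xl[a] + 2 * lam_s * Sigma @ w + 2 * lam_r * w
        gb = -2 * np.sum(yl[a] * (1 - m[a]))
        w, b = w - lr * gw, b - lr * gb
    return w, b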

When Ω(·) is a more complex regularization term such as the ℓ1 norm, we propose to use a proximal splitting algorithm such as ADMM to solve the problem [1]. This approach allows us to iteratively use the efficient solver discussed above while integrating prior information into the problem through regularization.

3 Numerical experiments

Numerical experiments are performed on a simulated image of size 100×100. The simulated image, which can be seen in Fig. 1, is generated as follows: i) the ground truth image is obtained by generating random circles in the image that are set to +1 (−1 for the background), ii) 10 discriminant features are generated by applying Gaussian noise to the ground truth image (σ = 5), iii) the previous images are filtered by a 3×3 average filter and a 3×3 median filter, resulting in 20 additional features, iv) 10 images containing only Gaussian noise are added to obtain 40 features.

In order to demonstrate the interest of our approach, we randomly select 10 labeled samples from each class and we learn an independent SVM (IID-SVM), a semi-supervised Laplacian SVM (SS-SVM), and our proposed approach, the spatially regularized semi-supervised SVM (SSS-SVM), for both ℓ2 and ℓ1 regularization. Results show that smooth classification and prediction maps are enforced, leading to an important improvement in recognition performance (see Fig. 1).

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[2] G. Camps-Valls, T. Bandos Marsheva, and D. Zhou. Semi-supervised graph-based hyperspectral image classification. Geoscience and Remote Sensing, IEEE Transactions on, 45(10):3044–3054, 2007.

[3] O. Chapelle. Training a support vector machine in the primal. Neural Comput., 19(5):1155–1178, 2007.

[4] M. Dundar, B. Krishnapuram, J. Bi, and R.B. Rao. Learning classifiers when the training data is not iid. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

[5] M. Dundar, J. Theiler, and S. Perkins. Incorporating spatial contiguity into the design of a support vector machine classifier. In IGARSS, pages 364–367. IEEE, 2006.

[6] M. Fauvel, J. Chanussot, and J. A. Benediktsson. A spatial-spectral kernel-based approach for the classification of remote-sensing images. Pattern Recogn., 45:381–392, January 2012.

[7] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi. Linear manifold regularization for large scale semi-supervised learning. In Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.

[8] Y. Tarabalka, M. Fauvel, J. Chanussot, and J.A. Benediktsson. SVM- and MRF-based method for accurate classification of hyperspectral images. Geoscience and Remote Sensing Letters, IEEE, 7(4):736–740, 2010.


Regularized structured low-rank approximation

Mariya Ishteva, Konstantin Usevich, Ivan Markovsky
Dept. ELEC, Vrije Universiteit Brussel

mariya.ishteva, konstantin.usevich, [email protected]

Abstract: We consider the problem of approximating a linearly structured matrix, for example a Hankel matrix, by a low-rank matrix with the same structure. This problem occurs in system identification, signal processing and computer algebra, among others. We impose the low rank by modeling the approximation as a product of two factors with reduced dimension. The structure of the low-rank model is enforced by introducing a regularization term in the objective function. In contrast to approaches based on kernel representations (in the linear algebraic sense), the proposed algorithm is designed to address the case of small targeted rank.

Keywords: low-rank approximation, affine structure, regularization

1 Introduction

Low-rank approximations are widely used in data mining, machine learning and signal processing as a tool for dimensionality reduction and factor analysis. In system identification, signal processing and computer algebra, the matrices are often structured, e.g., (block) Hankel, (block) Toeplitz, Sylvester, or banded matrices with fixed bandwidth. Note that sparse matrices are also structured matrices. Unstructured matrices can be considered as a special case of structured matrices as well. Therefore, the goal of structured low-rank approximation is to preserve the given structure while obtaining a low-rank approximation.

Existing approaches [1–4, 6] are based on kernel representations or on solving a system of linear equations in the total least squares sense. We propose a new view of the problem that makes a connection with the machine learning literature.

We consider an image representation of the approximation, i.e., given a structured matrix D ∈ R^{m×n} and a number r such that r ≪ m, n, find two factors P ∈ R^{m×r} and L ∈ R^{r×n} such that

D ≈ PL  and  PL is a structured matrix.

Although each of the constraints can easily be handled separately, imposing both low rank and fixed structure on the approximation is nontrivial.

In contrast to kernel representations, which are more efficient for large r, the proposed approach is meant for problems with small r. Finally, for general structures, existing kernel approaches have restrictions on the possible values of the rank r. With the new approach we can overcome this limitation.

2 Problem formulation

2.1 Structures

Affine structures can be defined as

S(p) = S0 + Σ_{k=1}^{n_p} Sk pk,

where S0, S1, . . . , S_{n_p} ∈ R^{m×n}, p ∈ R^{n_p} and n_p ∈ N is the (minimal) number of parameters. Let vec(X) denote the vectorized matrix X and let

S = [ vec(S1) · · · vec(S_{n_p}) ] ∈ R^{mn×n_p}.

Since n_p is minimal, S has full column rank.

For simplicity, we assume that S0 = 0, the elements of S are only 0 and 1, and there is at most one nonzero in each row (non-overlap across the Sk), i.e., every element of the structured matrix corresponds to only one element of p.

2.2 Orthogonal projection on image(S)

Let ΠS = (S^⊤ S)^{−1} S^⊤. It can be shown that the orthogonal projection of a matrix X on image(S) is given by

PS(X) ≡ S(ΠS vec(X)).    (1)

The effect of applying ΠS to a vectorized matrix X is to produce a structure parameter vector pX by averaging the elements corresponding to the same Sk. Note that applying ΠS to a (vectorized) structured matrix extracts its structure vector, since ΠS S p = p. Finally,

vec(PS(X)) = S ΠS vec(X).


2.3 Optimization problem

The weighted structured low-rank approximation problem is formulated as

min_{p̂}  ‖p − p̂‖_W ,  such that  rank(S(p̂)) ≤ r,    (2)

where W ∈ R^{n_p×n_p} is a symmetric positive definite matrix of weights and ‖x‖^2_W = x^⊤ W x. If W is the identity matrix, ‖ · ‖_W = ‖ · ‖_2.

Although each of the constraints can easily be handled separately, imposing both low rank and fixed structure on the approximation is nontrivial. We approach the problem with a regularization technique.

We have the following two choices:

(diagram: one option imposes the low rank and regularizes the structure, the other regularizes the low rank and imposes the structure)

• Regularize the structure constraint

min_{P,L}  ‖D − PL‖^2_W + λ ‖PL − PS(PL)‖^2_F ,    (3)

where W is a weight matrix, λ is a regularization parameter and F stands for the Frobenius norm.

• Regularize the rank constraint

min_{P,L}  ‖D − PS(PL)‖^2_W + λ ‖PL − PS(PL)‖^2_F .    (4)

Note that for λ = ∞ the term ‖PL − PS(PL)‖ has to be 0 and the three problems (2), (3) and (4) are equivalent. The interpretations of (3) and (4) are however different. In (4) the main part is the structure and the low rank is ‘secondary’. In (3) it is the other way around, although in both cases both constraints are satisfied at the solution. We will focus on (4), since it can be formulated using W in the following way

min_{P,L}  ‖p − ΠS vec(PL)‖^2_W + λ ‖PL − PS(PL)‖^2_F .

3 The proposed algorithm

3.1 High-level idea

We solve the minimization problem (4) by alternatingly improving the approximations of P and of L,

min_L  ‖p − ΠS vec(PL)‖^2_W + λ ‖PL − PS(PL)‖^2_F ,
min_P  ‖p − ΠS vec(PL)‖^2_W + λ ‖PL − PS(PL)‖^2_F ,    (5)

until convergence.

3.2 Details

Let In be the n × n identity matrix, ’⊗’ denote the Kronecker product and W = M^⊤ M. Using (1) and the equality vec(XYZ) = (Z^⊤ ⊗ X) vec(Y), (5) can be reformulated as

min_L  ‖ [ M ΠS ; √λ (I_{mn} − S ΠS) ] (In ⊗ P) vec(L) − [ Mp ; 0 ] ‖^2_2 ,

min_P  ‖ [ M ΠS ; √λ (I_{mn} − S ΠS) ] (L^⊤ ⊗ Im) vec(P) − [ Mp ; 0 ] ‖^2_2 .

These are least squares problems and can easily be solved by standard techniques.
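A minimal sketch of one such alternating pass is given below (not the authors' software; it assumes W = I, so M = I, and uses numpy's lstsq with explicit Kronecker products, which is only sensible for small problems).

import numpy as np

def als_step(p, P, L, S, Pi_S, lam):
    # P: (m, r), L: (r, n); S: (m*n, n_p) structure matrix, Pi_S = pinv(S): (n_p, m*n);
    # p: (n_p,) data parameter vector. One pass of the alternating scheme (5) with M = I.
    m, r = P.shape
    n = L.shape[1]
    A = np.vstack([Pi_S, np.sqrt(lam) * (np.eye(m * n) - S @ Pi_S)])
    rhs = np.concatenate([p, np.zeros(m * n)])
    # update L for fixed P: vec(PL) = (I_n ⊗ P) vec(L)  (column-major vec)
    vecL = np.linalg.lstsq(A @ np.kron(np.eye(n), P), rhs, rcond=None)[0]
    L = vecL.reshape(r, n, order='F')
    # update P for fixed L: vec(PL) = (L^T ⊗ I_m) vec(P)
    vecP = np.linalg.lstsq(A @ np.kron(L.T, np.eye(m)), rhs, rcond=None)[0]
    P = vecP.reshape(m, r, order='F')
    return P, L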

The matrix P can be initialized by a matrix representing the left dominant subspace of the data matrix. We declare that PL is a structured matrix if

‖PL − PS(PL)‖^2_F < 10^{−12}.

3.3 Parameter λ

In theory, if we fix λ = ∞, then we have the exact structured low-rank approximation problem. In practice, we start from a small value and increase it with each iteration until it reaches a “large enough” value. This way we allow the algorithm to move to a “good region” quickly and then impose all constraints more strictly. For convergence properties, we rely on the theory of the quadratic penalty method from [5, §17.1].

References

[1] M. Chu and G. H. Golub. Inverse eigenvalue problems. Oxford University Press, 2005.

[2] B. De Moor. Structured total least squares and L2 approximation problems. Linear Algebra Appl., 188–189:163–207, 1993.

[3] I. Markovsky. Structured low-rank approximation and its applications. Automatica, 44(4):891–909, 2008.

[4] I. Markovsky and K. Usevich. Software for weighted structured low-rank approximation. Technical Report 339974, Univ. of Southampton, 2012.

[5] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2nd edition, 2006.

[6] K. Usevich and I. Markovsky. Variable projection for affinely structured low-rank approximation in weighted 2-norm, 2012. Available from http://arxiv.org/abs/1211.3938.


A Heuristic Approach to Model Selection for Online Support Vector Machines

Davide Anguita, Alessandro Ghio, Isah Abdullahi Lawal, Luca Oneto
DITEN - University of Genoa, Via Opera Pia 11A, I-16145, Genoa, Italy

Davide.Anguita, Alessandro.Ghio, Abdullahi.LawalIsah, [email protected]

Abstract: In this abstract, we cope with the problem of Model Selection (MS) of Support Vector Machine (SVM) classifiers in the Online Learning (OL) framework. Though often neglected in OL, MS is paramount to guarantee model effectiveness while exploiting newly gathered samples. In particular, we propose a heuristic approach which can be feasibly applied to SVMs in OL applications. The effectiveness of the proposal is supported by preliminary experimental results.

Keywords: Online Learning, Support Vector Machines, Model Selection

1 Introduction

Support Vector Machines (SVMs) [9, 10] are one of the most effective techniques for classification purposes. The SVM learning phase is a computationally intensive process that consists of two steps: during the training (TR) step, a set of parameters is found [8]; during the Model Selection (MS) phase, a set of additional variables (hyperparameters) is tuned to find the SVM characterized by optimal performance in classifying previously unseen data [1]. In the case of linear SVM classifiers, for example, the set of hyperparameters is C = {C}, where C weights a regularization term [10].

With the growth of Online Learning (OL) applications (involving large data streams with varying distribution) [?], the naïve SVM is incapable of handling such problems without any modifications: in fact, in the OL framework a model update is necessary every time a new sample is gathered [2, 3], but a complete MS-TR run from scratch becomes computationally prohibitive. While incremental TR algorithms have been developed in order to incorporate additional training samples [3, 7], the extension of MS approaches to the OL framework is not as straightforward, despite representing a desirable feature [5, 7].

In this paper, we thus focus on the MS step of SVM learning in the OL framework, by proposing an efficient heuristic approach for updating the set of hyperparameters C when new samples are collected.

2 MS in the OL framework

Let us consider a continuous online stream of data Dt = {(x1, y1), . . . , (xt, yt), . . .}, xi ∈ R^d, yi = ±1, where the relation between x and y is encapsulated by an unknown distribution Pt that can change over time. The goal is to exploit newly gathered data, obtained from the stream, in order to learn, or improve, a function ft that approximates Pt.

Typically, in the OL framework, MS is performed only once in a batch mode by exploiting the first t data collected: by using a conventional MS approach, such as K-Fold Cross Validation (KCV), the SVM hyperparameters are fixed to the optimal value at the t-th step C*_t and a first model ft is trained. When ∆ new samples are gathered at step t + ∆, the learning set is updated, D_{t+∆} = Dt \ {(x1, y1), ..., (x∆, y∆)} ∪ {(x_{t+1}, y_{t+1}), ..., (x_{t+∆}, y_{t+∆})}, and f_{t+∆} is consequently modified, e.g. according to the method proposed in [3]. As C*_t is kept constant in this process, we define this OL approach as NO-MS.

Ideally, instead, we should be able to update both the model and the set of hyperparameters thanks to the newly collected samples. A possible, but computationally expensive, approach consists in performing a full MS-TR procedure at every step, i.e. a complete learning from scratch every time new patterns are collected: we define this approach as Complete-MS (C-MS).

In this work, we propose a new heuristic approach, where MS is not completely neglected in OL, as in NO-MS, but a whole re-learning is avoided, contrarily to C-MS. Analogously to NO-MS, we start by identifying the best hyperparameter set C*_t and model ft at the t-th step with a KCV procedure; differently from NO-MS, we also have to keep track of the k models f^{(1)}_t, . . . , f^{(k)}_t trained while applying KCV. When ∆ new samples are acquired, we want to modify both the hyperparameter set and the model: it is reasonable to assume that, if ∆ is not large (i.e. it is not comparable to t), the hyperparameters will not vary too much from the previous best value. Thus, we can define a neighborhood


Fig. 1: Online k-Fold Cross Validation.

set C^ε_t, centered around C*_t: for example, if linear kernels are concerned, we can define a neighborhood set C ∈ [C*_t/ε, ε C*_t], where ε > 1. We can update the KCV sample sets (refer to Fig. 1) and, accordingly, the KCV models f^{(1,...,k)}_{t+∆} by exploiting the approach proposed in [4] for every value of the hyperparameters included in the neighborhood set: through a conventional MS approach based on KCV, we can thus identify the best hyperparameter configuration C*_{t+∆} ∈ C^ε_t. Finally, f_{t+∆} is trained according to C*_{t+∆} and the ∆ new samples collected, by exploiting the procedure described in [4]. We define this approach as OL-MS.

The supplementary computational burden with respect to NO-MS is limited: in fact, we have to update k + 1 models instead of 1 and we have to perform a KCV MS at each updating step, but limited to a restricted neighborhood set; however, the OL-MS approach copes with (reasonably slow) modifications of Pt, allowing us to properly re-tune the models and the hyperparameter configuration and, thus, representing an effective trade-off between the computationally unfeasible C-MS and the non-adaptive NO-MS procedures.
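A minimal sketch of the hyperparameter-update step is given below (illustrative only: it refits the k fold models with scikit-learn's LinearSVC instead of updating them incrementally as in [4], and all names are assumptions).

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, cross_val_score

def ol_ms_update(X_window, y_window, C_prev, eps=10.0, n_grid=10, k=4):
    # Re-select C by k-fold CV restricted to the neighborhood [C_prev/eps, eps*C_prev].
    grid = np.logspace(np.log10(C_prev / eps), np.log10(C_prev * eps), n_grid)
    scores = [cross_val_score(LinearSVC(C=C), X_window, y_window,
                              cv=KFold(n_splits=k)).mean() for C in grid]
    C_best = float(grid[int(np.argmax(scores))])
    model = LinearSVC(C=C_best).fit(X_window, y_window)
    return C_best, model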

3 Preliminary Results and Discussion

We compare the three approaches introduced above by using the Spam dataset [6], consisting of 9324 samples and d = 39917 features. Since d ≫ n, a linear classifier is sufficient to separate the samples (i.e. C = {C}). We use an initial batch of t = 1000 samples for the first learning, and we perform OL by adding ∆ = 50 samples per iteration. We look for the hyperparameter C in the range [10^{−5}, 10^2] by using 50 points equally spaced in logarithmic scale, while we fix k = 4 for the KCV procedure. We also set ε = 10 in OL-MS.

For the three approaches presented, Fig. 2 shows the error rates on the test set, consisting of the next 10∆ samples not yet exploited for OL, while Tab. 1 presents the learning times on a conventional PC. Results clearly show that the proposed OL-MS procedure represents a good trade-off between accuracy and training time: while the former is comparable to that of C-MS, the latter is more than one order of magnitude smaller than the time needed by C-MS and acceptable in several OL applications. These results are encouraging; nevertheless, the OL-MS approach requires that extensive simulations be performed to assess its performance and to study the effect of variations of parameters (e.g. ε) on the quality of the results (both in terms of error rate and training time).

Fig. 2: Comparison of error rates on the test set as new patterns are considered.

Method Training Time (seconds)

NO–MS 0.60 ± 0.25

C–MS 43.92 ± 1.31

OL–MS 2.57 ± 0.35

Tab. 1: Training times for the different OL approaches.

References

[1] D. Anguita, A. Ghio, L. Oneto, and S. Ridella. In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 23(9):1390–1406, 2012.

[2] A. Blum. On-line algorithms in machine learning. Online algorithms, pages 306–325, 1998.

[3] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. Advances in neural information processing systems, pages 409–415, 2001.

[4] C. P. Diehl and G. Cauwenberghs. SVM incremental learning, adaptation and optimization. In IEEE International Joint Conference on Neural Networks, pages 2685–2690, 2003.

[5] T. Diethe and M. Girolami. Online learning with (multiple) kernels: A review. Neural computation, 25(3):567–625, 2013.

[6] I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.

[7] P. Laskov, C. Gehl, S. Kruger, and K.R. Muller. Incremental support vector learning: Analysis, implementation and applications. The Journal of Machine Learning Research, 7:1909–1936, 2006.

[8] J. Shawe-Taylor and S. Sun. A review of optimization methodologies in support vector machines. Neurocomputing, 74(17):3609–3618, 2011.

[9] J.A.K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural processing letters, 9(3):293–300, 1999.

[10] V. Vapnik. Statistical Learning Theory. Wiley New York, 1998.


Lasso and Adaptive Lasso with Convex Loss Functions

Wojciech Rejchel
Nicolaus Copernicus University

Chopina 12/18, 87-100 Torun, [email protected]

Abstract: Variable selection is a fundamental challenge in statistical learning if one works with data sets containing a huge number of predictors. In the paper we consider procedures popular in model selection: Lasso and adaptive Lasso. Our goal is to investigate properties of estimators based on minimization of a Lasso-type penalized empirical risk with a convex loss function, in particular a nondifferentiable one. We obtain theorems concerning the rate of convergence in estimation, consistency in model selection and oracle properties for Lasso estimators if the number of predictors is fixed, i.e. it does not depend on the sample size. Moreover, we study properties of Lasso and adaptive Lasso estimators on simulated and real data sets.

Keywords: Lasso, Adaptive Lasso, Model Selection, Oracle Property, Convex Loss Function

Variable selection is a fundamental challenge in statistical learning if one works with data sets containing a huge number of predictors. Such problems often occur in computational biology, medicine or banking. Finding significant (relevant) variables helps to better understand the problem and improves statistical inference. One of the methods solving this problem is the Lasso [7]. The main characteristic of this procedure is an ability to select significant variables and estimate unknown parameters simultaneously if the penalty term is chosen in an appropriate way. Our goal is to investigate properties of estimators based on minimization of such a penalized empirical risk with a convex loss function. In the paper we consider a model with X being a d-dimensional random vector and Y a random variable. For convenience we assume that X ∈ R^d and Y ∈ R. We treat X as a vector of predictors and Y as a response variable. The dependence between Y and X is contained in a parameter θ* ∈ R^d. We think of the j-th variable as being relevant in the considered model if θ*_j ≠ 0 and irrelevant otherwise. Let relevant predictors be determined by a set A = {1, . . . , d0} for some 0 < d0 < d, so our model depends on a subset of all predictors. Denote θ*_A = (θ*_1, . . . , θ*_{d0}).

Let f(θ, y, x) be a real loss function. We assume that f is convex with respect to θ for fixed z = (y, x) and measurable with respect to z for fixed θ. Consider the convex function Q(θ) = E f(θ, Z) and its minimizer θ*. If we do not know the distribution of Z, then we cannot calculate θ* directly. However, if we have a sample Z1, . . . , Zn of independent copies of Z, then we can minimize Q_n(θ) = (1/n) Σ_{i=1}^n f(θ, Zi) instead of Q(θ). This standard approach is often modified by adding some penalty to the empirical risk Q_n. Considering the Lasso, we are to minimize the convex (in θ) function Γ_n(θ) = Q_n(θ) + (λ_n/n) |θ|, where | · | is the l1-norm of the vector θ, i.e. |θ| = Σ_{j=1}^d |θ_j|, and λ_n > 0 is a number dependent on n that is chosen by the researcher. It balances minimizing the empirical risk against the penalty. We denote the minimizer of Γ_n(θ) by θ̂. The form of the penalty is crucial, because its singularity at the origin implies that some coordinates of the minimizer θ̂ are exactly equal to zero if λ_n is sufficiently large. Thus, by minimizing the function Γ_n(θ) we estimate unknown parameters and select significant predictors simultaneously.

The Lasso is very popular in model selection and estimation. Its theoretical properties for linear models with the quadratic loss function or for generalized linear models were widely studied [1, 3, 4, 9, 10]. However, there is a comprehensive literature where one uses Lasso estimators with different loss functions, for instance the absolute value or the "hinge" loss well known in machine learning. A natural question is whether, using a non-quadratic loss function more appropriate for the investigated model, we can expect the estimator to behave in model selection similarly to the relatively well-studied quadratic loss case. In our paper we describe properties of estimators based on minimization of a Lasso-type penalized empirical risk with a convex loss function; in particular, this loss function can be nondifferentiable. Our results generalize theorems from the above-mentioned papers if d is fixed, i.e. it does not depend on n. While working in the convex setting we need some standard regularity assumptions:
(a) θ* is unique,
(b) Q is twice differentiable at θ* and H = ∇²Q(θ*) is positive definite,
(c) E|∂f(θ, Z)|² < ∞ for each θ in some neighbourhood of θ*.
We do not require that the subgradient ∂f(θ, z) of a convex function f(θ, z) is unique, we only require that it is a measurable selection of the subgradient.
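For concreteness, a minimal sketch of minimizing Γ_n(θ) for the nondifferentiable absolute-deviation loss f(θ, (y, x)) = |y − θ^T x| is given below; the proximal-subgradient scheme and all names are illustrative assumptions, not the paper's procedure.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lad_lasso(X, y, lam, n_iter=5000, step=1e-2):
    # Minimizes (1/n) sum_i |y_i - theta^T x_i| + (lam/n) * ||theta||_1
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iter + 1):
        g = -X.T @ np.sign(y - X @ theta) / n      # a subgradient of Q_n at theta
        eta = step / np.sqrt(k)                    # diminishing step size
        theta = soft_threshold(theta - eta * g, eta * lam / n)
    return theta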


We prove (Theorem 0.1) that the considered estimator cannot be an "oracle", because if the rate in estimation is optimal, then consistency in model selection is not possible. Denote Â_n = {j ∈ {1, . . . , d} : θ̂_j ≠ 0}.

Theorem 0.1. If λ_n/√n → λ_0 ≥ 0, then
(a) √n (θ̂ − θ*) →_d argmin_θ V(θ), where

V(θ) = (1/2) θ^T H θ + W^T θ + λ_0 Σ_{j∈A} θ_j sign(θ*_j) + λ_0 Σ_{j∉A} |θ_j|

and W ∼ N(0, D) with D = Var ∂f(θ*, Z),
(b) lim sup_n P(Â_n = A) ≤ K < 1, where K is a constant.

Moreover, we show that even if one accepts a worse rate in estimation, consistency in model selection is possible but not guaranteed. We obtain necessary and sufficient conditions (Theorem 0.2) for consistency in model selection in this case.

Theorem 0.2. Suppose that λ_n/n → 0 and λ_n/√n → ∞.
(a) If the Lasso estimator is consistent in model selection, then the inequality

|H_2^T H_1^{−1} sign(θ*_A)| ≤ 1

holds componentwise, where the matrices H_1 and H_2 are taken from the block partition

H = [ H_1  H_2 ; H_2^T  H_3 ],  with H_1 of size d_0 × d_0 and H_2 of size d_0 × (d − d_0).

(b) If the inequality

|H_2^T H_1^{−1} sign(θ*_A)| < 1

holds componentwise, then the Lasso estimator is consistent in model selection.

Furthermore, we describe an improvement of the Lasso that was proposed in [10] and called the "adaptive Lasso". It relies on adding different weights to different predictors in the penalty. Namely, we are to minimize the function

Γ^a_n(θ) = Q_n(θ) + (λ_n/n) Σ_{j=1}^d |θ_j| / |β_j| ,

where β is an arbitrary estimator of θ* such that √n (β − θ*) = O_P(1). We prove that the oracle property holds in the convex setting, which extends previous theorems [8, 10].

Theorem 0.3. If λ_n → ∞ and λ_n/√n → 0, then the adaptive Lasso estimator θ̂^a = argmin_θ Γ^a_n(θ) is an oracle:
(a) lim_{n→∞} P(Â^a_n = A) = 1,
(b) √n (θ̂^a_A − θ*_A) →_d N(0, H_1^{−1} D_1 H_1^{−1}), where the matrix D_1 is the (d_0 × d_0) upper-left submatrix of D.

Notice that convexity of the loss function and methods from the convex empirical process theory [2, 5, 6] play crucial roles in our argumentation. Finally, we discuss the regularity assumptions in the non-trivial case where the loss function is non-differentiable, for instance f(θ, z) = |y − θ^T x|. If we want the risk Q(θ) = E|Y − θ^T X| to be twice differentiable at θ*, we need some regularity assumptions on the probability distribution of (Y, X). Consider the standard linear model Y = (θ*)^T X + ε, where E|X|² < ∞ and, in addition, ε and X are independent. Then we only need that ε has a density l(·) that is continuous and positive in a neighbourhood of zero. In this case one can calculate that H = 2 l(0) E XX^T and D = E XX^T (see [6]).

References

[1] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.

[2] C. J. Geyer. On the asymptotics of convex stochastic optimization. Unpublished manuscript, 1996.

[3] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28:1356–1378, 2000.

[4] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.

[5] W. Niemiro. Asymptotics for M-estimators defined by convex minimization. Annals of Statistics, 20:1514–1533, 1992.

[6] D. Pollard. Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7:186–199, 1991.

[7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[8] H. Wang, G. Li, and G. Jiang. Robust regression shrinkage and consistent variable selection through the lad-lasso. Journal of Business & Economic Statistics, 25:347–355, 2007.

[9] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

[10] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.


Conditional Gaussian Graphical Models for Multi-output Regression of Neuroimaging Data

Andre F Marquand
King's College London

[email protected]

Maria Joao Rosa
University College London

[email protected]

Orla Doyle
King's College [email protected]

Abstract: Pattern recognition has shown considerable promise for automated diagnosis and for predicting outcome in clinical neuroimaging studies. Many of these decision problems can most naturally be framed as multi-output learning problems, but nearly all studies to date have adopted sub-optimal analytical approaches, modeling different outputs independently. In this work, we apply a dual elastic net regularisation to the precision matrix of a conditional Gaussian graphical model to provide true multi-output learning for clinical neuroimaging data. This method improves the accuracy over modeling the outputs independently and can quantify the relationships between outputs, which is useful to help understand disease pathology.

Keywords: Multi-output learning, multi-task learning, conditional Gaussian graphical models, neuroimaging, MRI, multiple system atrophy

1 Introduction

Pattern recognition has been shown to be highly promising for clinical neuroimaging research, both for computer-assisted diagnosis and for predicting future disease outcome. To date, nearly all applications have focussed on binary classification, with only a few applications of (single output) regression. These approaches are over-simplifications for most clinical problems, where it may be necessary to predict multiple outcomes to adequately describe a disease state (e.g. to measure different types of symptoms). Multi-output learning is thus a natural candidate for this type of problem, but has received little attention from the neuroimaging community. Here, we demonstrate its suitability for neuroimaging. We extend a recently proposed approach based on conditional Gaussian graphical models (CGGMs), where we enforce sparsity on the precision matrix using a dual elastic net penalty, derive patterns of predictive weights that are coupled between outputs and automatically estimate correlations between the outputs. We apply this approach to a challenging and clinically relevant problem: predicting clinical scales from neuroimaging data derived from patients with a fatal neurological disorder.

2 Methods

We begin with a dataset {X, Y}, where X = [x1, ..., xn]^T is an n × d matrix containing d-dimensional data vectors. Each data vector is associated with an m-dimensional vector of real-valued outputs, which are collected in the n × m matrix Y = [y1, ..., yn]^T. The goal is to simultaneously predict the outputs while properly accounting for the dependencies between them. Conventionally, this is achieved by generalising single output regression, using a different weight vector to predict each output, i.e. Y = XB + E, where B is a d × m weight matrix and E is an n × m residual matrix. Most existing approaches to estimate B have limitations in that they either: (i) account for dependencies between the weight vectors or between the targets, but not both, or (ii) require the dependency structure between outputs to be specified a priori, which requires knowledge that is unlikely to be available for many applications. In this work, we generalise a recently proposed approach that overcomes these limitations [4]. Here, the problem is recast in a CGGM framework which assumes a joint Gaussian distribution over the inputs and targets, i.e.

[x_i^T, y_i^T]^T ∼ N( 0, ( Σ_xx  Σ_xy ; Σ_xy^T  Σ_yy ) ).

The aim is then to estimate the entries of the precision matrix

Θ = Σ^{−1} = ( Θ_xx  Θ_xy ; Θ_xy^T  Θ_yy ),

which is a natural parameterisation because of the well-known fact that zeros in the precision matrix encode the conditional independence structure between variables. Finally, predictions can be made by p(y*|x*) ∼ N(−Θ_yy^{−1} Θ_xy^T x*, Θ_yy^{−1}). It turns out that this representation has a correspondence to the conventional formulation, with B = −Θ_yy^{−1} Θ_xy^T [4].

Model parameters can be estimated by considering the conditional log-likelihood:

L(X, Y | Θ_xy, Θ_yy) = −0.5 [ −n log det(Θ_yy) + tr[ (Y + X Θ_xy Θ_yy^{−1}) Θ_yy (Y + X Θ_xy Θ_yy^{−1})^T ] ].

Maximum likelihood estimation of this model is equivalent to estimating each output independently, but applying regularisation allows the weights for each output


to borrow strength from one another. We apply two elastic net penalties, which force some entries of Θ_xy and Θ_yy to zero while retaining correlated features. This last property is important because neuroimaging data is characterised by substantial spatial and temporal correlation. Therefore, we aim to maximise: L(X, Y | Θ_xy, Θ_yy) − λ_1 [α_1 |Θ_xy|_1 + (1 − α_1) ‖Θ_xy‖^2_F] − λ_2 [α_2 |Θ_yy|_1 + (1 − α_2) ‖Θ_yy‖^2_F], where |.|_1 is a matrix L1 norm, ‖.‖_F is the Frobenius norm and α_j and λ_j are regularisation hyperparameters. This is a convex optimisation problem (see [4]), which we solve using a projected scaled subgradient method. Note that it is not necessary to explicitly estimate Θ_xx, which is advantageous because neuroimaging data are often very high dimensional.
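A minimal sketch of the penalized objective and the induced predictions is given below (illustrative only; the function names are assumptions and the actual optimisation over Θ_xy and Θ_yy is left to a solver such as the projected scaled subgradient method mentioned above).

import numpy as np

def penalized_cggm_objective(X, Y, Theta_xy, Theta_yy, lam1, alpha1, lam2, alpha2):
    # Conditional log-likelihood L(X, Y | Theta_xy, Theta_yy) minus the two elastic net penalties.
    n = X.shape[0]
    R = Y + X @ Theta_xy @ np.linalg.inv(Theta_yy)
    loglik = -0.5 * (-n * np.linalg.slogdet(Theta_yy)[1] + np.trace(R @ Theta_yy @ R.T))
    pen_xy = lam1 * (alpha1 * np.abs(Theta_xy).sum() + (1 - alpha1) * (Theta_xy ** 2).sum())
    pen_yy = lam2 * (alpha2 * np.abs(Theta_yy).sum() + (1 - alpha2) * (Theta_yy ** 2).sum())
    return loglik - pen_xy - pen_yy            # objective to be maximised

def cggm_predict(Xnew, Theta_xy, Theta_yy):
    # Predictive means of p(y*|x*), i.e. rows of -x*^T Theta_xy inv(Theta_yy)
    return -Xnew @ Theta_xy @ np.linalg.inv(Theta_yy)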

We apply this method to predicting clinical scores from neuroimaging data from patients with a motor neurodegenerative disorder (see [1] for details). Here, we use the structural neuroimaging data from 19 patients with multiple system atrophy. We aim to predict a set of eight standard clinical scales: Unified Parkinson's disease rating scale 1 (UPDRS1 - mentation and mood), UPDRS2 (activities of daily living), UPDRS3 (motor function), PIGD (postural stability), the short motor disability scale (SMDS), Hoehn and Yahr (HY - measuring disease staging) and the Schwab and England activities of daily living scale (SE-ADL). This is a highly challenging problem because the scales are calibrated across different ranges. We use nested cross-validation with a grid search to optimise the hyperparameters. While the CGGM is computationally tractable for whole-brain prediction, a full grid search becomes quite computationally demanding. Therefore, we reduce the dimensionality using agglomerative clustering [2]. We construct features from a set of "scalar momentum" image features [3], with the number of clusters fixed to 200 (100 grey- and 100 white-matter).

3 Results

The error obtained by the CGGM on all symptom scales is reported in Table 1, along with a baseline model that predicts each variable independently (ridge regression). The CGGM produced equivalent or better performance for seven of the eight scales. A second important outcome is an estimate of the partial correlation between variables (ρ), which can be derived from Θ_yy (Fig. 1) and is useful to quantify the relationships between output variables. A third outcome is a visualisation of the predictive weights in the original voxel space, which is particularly crucial for neuroimaging, but a lack of space precludes its presentation here. Note that the optimal parameters from the grid search favoured a largely non-sparse model for both Θ_yy and Θ_xy. Therefore a ridge penalty would probably perform equivalently for this data, although this would have been difficult to determine in advance.

Clinical Scale     Ridge Regression   CGGM
UPDRS 1            1.28               1.14
UPDRS 2            1.05               0.99
UPDRS 3            0.61               0.51
PIGD               1.01               0.92
Stability          0.66               0.74
SMDS               0.62               0.59
Hoehn-Yahr         0.72               0.72
Schwab-England     0.73               0.65

Tab. 1: Standardized mean squared error for the CGGM and a baseline model that predicts each target separately.

Fig. 1: Partial correlation (ρ) between clinical variables

4 Conclusions

We have demonstrated a flexible multi-output learning approach based on CGGMs for neuroimaging data. This approach produced highly promising performance on the clinical dataset investigated and provides measures that are useful to help understand the relationships between different outputs and how well they can be predicted from the neuroimaging data.

Acknowledgments
AFM was supported by the KCL Medical Engineering Centre, funded by the Wellcome Trust and EPSRC.

References

[1] M. Filippone, A. Marquand, C. Blain, S. Williams, J. Mourao-Miranda, and M. Girolami. Probabilistic prediction of neurological disorders with a statistical assessment of neuroimaging data modalities. Annals of Applied Statistics, 6:1883–1905, 2012.

[2] V. Michel, A. Gramfort, G. Varoquaux, E. Eger, C. Keribin, and B. Thirion. A supervised clustering approach for fMRI-based inference of brain states. Pattern Recognition, 45(6):2041–2049, 2012.

[3] N. Singh, P. Fletcher, J. Preston, L. Ha, R. King, J. Marron, M. Wiener, and S. Joshi. Multivariate statistical analysis of deformation momenta relating anatomical shape to neuropsychological measures. Lecture Notes in Computer Science, 6363:529–537, 2010.

[4] K. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. Journal of Machine Learning Research - Proceedings Track, 22:1081–1089, 2012.


High-dimensional convex optimization via optimal affine subgradient algorithms

Masoud Ahookhosh
Faculty of Mathematics, University of Vienna

[email protected]

Arnold Neumaier
Faculty of Mathematics, University of Vienna

[email protected]

Abstract: This study is concerned with algorithms for solving high-dimensional convex optimization problems appearing in applied sciences like signal and image processing, machine learning and statistics. We improve an optimal first-order approach for a class of objective functions including costly affine terms by employing a special multidimensional subspace search. We report numerical results for imaging problems including nonsmooth regularization terms.

1 Introduction and Basic Idea

Many applications in signal and image processing, machine learning, statistics, geophysics, etc. include affine terms in their structure, which are the most costly part of function evaluations. Therefore, we consider the following unconstrained convex optimization problem

minimize f(x) :=

n1∑i=1

fi(x,Ai(x)), (1)

where fi(x, vi) (i = 1, · · · , n1) are convex functions de-fined for x ∈ Rn and vi = Ai(x) ∈ Rmi and the Ai arelinear operators from Rn to Rmi . The aim is to derivean approximating solution x just using function val-ues f(x) and subgradients g(x). Consider the followinggeneric first-order descent algorithm:

Generic first-order descent algorithm

Input: x_0 ∈ R^n; ε > 0;
Output: x_b;
Begin
1  choose x_b;
2  calculate f_{x_b} ← f(x_b); g_{x_b} ← g(x_b);
3  Repeat
4    generate x'_b by an extended algorithm;
5    generate x̄_b by a heuristic algorithm;
6    x_b ← argmin_{x ∈ {x_b, x'_b, x̄_b}} f(x);
7  Until a stopping criterion holds;
End

For the high-dimensional data used in applications, the most costly part of function and subgradient evaluations is applying the linear operators and their transposes. Hence we should decrease the number of linear operator applications as much as possible. To find an appropriate x̄_b in line 5 of this algorithm, we use in each step a multidimensional subspace search as inner algorithm. The main idea is a generalization of line search techniques: by saving M ≪ n columns u and the previously computed vectors v_i = A_i(u) in the matrices U and V_i (i = 1, ..., n_1), we find t* ≈ argmin_{t ∈ R^M} f̂(t), where

    f̂(t) := f(x_b + U t) = Σ_{i=1}^{n_1} f_i(x_b + U t, v_i + V_i t)        (2)

can be calculated without further calls to the A_i.

Multidimensional subspace search procedure

Input: x_b ∈ R^n; U; v_i ∈ R^{m_i}; V_i (i = 1, ..., n_1);
Begin
  approximately solve the M-dimensional problem

      t* ≈ argmin_{t ∈ R^M} f̂(t);        (3)

  x_b = x_b + U t*;
End
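Because U and the products V_i = A_i(U) are stored, evaluating f̂(t) inside the subspace search needs no further applications of the costly operators A_i. A minimal sketch of this cached evaluation (illustrative Python with assumed names; in particular, v_i is assumed to hold A_i(x_b)); in ASGA the M-dimensional subproblem (3) is itself solved by OSGA, but any low-dimensional solver could stand in for it here:

import numpy as np

def subspace_objective(t, xb, U, cached):
    # f_hat(t) = f(xb + U t) = sum_i f_i(xb + U t, v_i + V_i t), where each
    # tuple in `cached` holds (f_i, v_i, V_i) with v_i = A_i(xb) and
    # V_i = A_i applied column-wise to U; no new operator calls are required.
    x = xb + U @ t
    return sum(f_i(x, v_i + V_i @ t) for f_i, v_i, V_i in cached)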

2 Specific version of the algorithm

The general idea is made specific by choosing a particular descent algorithm. We use the OSGA algorithm [3], which monotonically reduces a bound on the error of the function value f(x_b) of the best known point x_b. The modified version of OSGA, including affine scaling and the multidimensional subspace search defined above, is called ASGA and is suitable for the high-dimensional convex optimization problems appearing in applications. In ASGA, the subproblem (3) is also solved by OSGA.

The OSGA algorithm generates and updates linear relaxations

    f(z) ≥ γ + h^T z    for all z ∈ R^n,

where γ ∈ R, h ∈ R^n. In line 3 of the algorithm we set x'_b = x_b + α(u − x_b), where u = U(γ, h) ∈ C solves a minimization problem of the form

    E(γ, h) := − inf_{z ∈ R^n}  (γ + h^T z) / (Q_0 + (1/2)‖z − z_0‖²).        (4)


For details see [3]. It is proved in [3] that OSGA achieves the optimal complexity bounds O(1/√ε) for the optimization of smooth convex functions and O(1/ε²) for nonsmooth convex functions [2], no matter which heuristic choice x̄_b is made, so ASGA is an optimal complexity algorithm.

3 Numerical Results

This section reports numerical results that show the efficiency of the proposed algorithms on practical problems arising in applications. Since OSGA and ASGA only need function and subgradient evaluations, they can be employed to solve a wide range of problems including regularization terms. Examples include the lasso, basis pursuit denoising, l1 and l2 decoding, isotropic and anisotropic total variation, group regularizations, the elastic net, the nuclear norm, linear support vector machines and kernel-based models.

Here, we consider the l2²–ITV regularization problem defined by

    f(x) = (1/2) ‖A(x) − b‖₂² + λ ‖x‖_ITV,

where ‖·‖_ITV denotes the isotropic TV norm and A is a linear operator chosen as in the TwIST package for reconstruction of the 512×512 blurry-noisy Lena image.
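OSGA and ASGA only require values (and subgradients) of this objective; a minimal sketch of the function evaluation for a 2-D image, using a simple forward-difference discretization of the isotropic TV norm and with the blur operator passed in as a callable A (an assumption here; the abstract takes it from the TwIST package):

import numpy as np

def itv_objective(x, A, b, lam):
    # f(x) = 0.5*||A(x) - b||_2^2 + lam*||x||_ITV for a 2-D image x.
    dx = np.diff(x, axis=1)
    dy = np.diff(x, axis=0)
    # isotropic TV: per-pixel Euclidean norm of the discrete gradient
    tv = np.sum(np.sqrt(dx[:-1, :] ** 2 + dy[:, :-1] ** 2))
    r = A(x) - b
    return 0.5 * np.sum(r ** 2) + lam * tv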

Fig. 1: A comparison of function values (×10⁴) versus iterations for IST, TwIST, OSGA and ASGA

If we count the images of Fig. 2 row by row, the first image is the original image, and the second shows a blurry-noisy image constructed by adding noise and a uniform 9×9 blur with BSNR = 40 dB. The remaining images are restored by solving the l2²–ITV minimization problem above, where x_0 is given by a Wiener filter for all considered algorithms: the third image is produced by the IST algorithm, the fourth is recovered by TwIST [4], the fifth is restored by OSGA and the sixth is reconstructed by ASGA.

Fig. 2: Isotropic TV-based image reconstruction from 40% missing samples with the considered algorithms. Panels (row by row): original image; blurry-noisy image; IST (f = 48941.89); TwIST (f = 41840.64); OSGA (f = 40114.38); ASGA (f = 39860.51)

The final function values of the proposed algorithms are lower than those of IST and TwIST and the restored images visually look good. This shows that the proposed algorithms can effectively reconstruct blurry and noisy images at a reasonable cost.

References

[1] M. Ahookhosh, A. Neumaier. An affine subgradient algorithm for large-scale convex optimization. Manuscript, University of Vienna, 2013.

[2] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Dordrecht, 2004.

[3] A. Neumaier. Convex optimization with optimal complexity. Manuscript, University of Vienna, 2013.

[4] TwIST website: http://www.lx.it.pt/~bioucas/TwIST/TwIST.htm


Joint Estimation of Modular Gaussian Graphical Models

Jose Sanchez

Mathematical Sciences, Chalmers University of Technology and University of Gothenburg,

SE-412 96 Gothenburg, Sweden

[email protected]

Rebecka Jornsten

Mathematical Sciences, Chalmers University of Technology and University of Gothenburg,

SE-412 96 Goteborg, Sweden

[email protected]

Abstract: We propose a method to estimate several Gaussian graphical models that share a common structure with modular topology. To encourage modularity we define a novel adaptive fused penalty. We also propose a generalization of the fused penalty to one defined by graphs. We use this penalty to correct for unequal sample sizes in the data or to integrate ordered variables. We optimize the penalized log-likelihood using ADMM and show by simulation that our method performs better than competitors. We apply our method to the study of regulators in glioblastoma, breast and ovarian cancer.

Keywords: Gaussian graphical models, gene networks, lasso, fused lasso, elastic net, precision matrix estimation, data integration.

1 Introduction

Modeling of transcription networks is a popular approach to cancer research at the molecular level. It has been used to integrate several types of genomic cancer data, to identify genes with altered copy number as disease drivers, to construct features for the prediction of patient survival and to identify potential therapeutic targets as hubs [3].

Assume that the transcription information (mRNA) of class k (such as a cancer type), k = 1, 2, . . . , K, can be modeled as a realization of a multivariate normal distribution with mean µ^k and covariance matrix Σ^k. In this case, the problem of estimating the network is equivalent to estimating the precision matrix Ω^k = (Σ^k)^{−1}.

To jointly estimate these networks, we optimize a penalized version of the log-likelihood function using ADMM [1]. We include an elastic net penalty in order to obtain sparse networks for ease of interpretation. Since it is biologically reasonable to assume that some of the classes will share regulators, we also include a fused penalty to achieve equality across classes. Finally, we believe that this equality constraint should be smooth at least locally, thus defining a module or a modular network structure. To encourage modularity we define an adaptive fused penalty through an adaptivity factor. This adaptivity factor is computed from an initial zero-consistent [5] solution and encourages fusion for neighborhoods of links.

2 Methods and Results

Consider K data sets X^1, X^2, . . . , X^K with K ≥ 2, corresponding to K classes. Data set X^k consists of n_k observations and p variables, which are common to all K classes. We assume the observations within each data set to be i.i.d. N(0, Σ^k). Let Θ^k = (Σ^k)^{−1} and let S^k be the empirical covariance matrix of the k-th class. We propose, similarly to [2], to optimize the penalized log-likelihood function

    l(Θ) = Σ_{k=1}^{K} n_k [ ln(det(Θ^k)) − tr(S^k Θ^k) ]        (1)
           − λ_1 Σ_{k=1}^{K} Σ_{i≠j} [ α |θ^k_ij| + (1 − α) (θ^k_ij)² ]
           − λ_2 Σ_{k<k'} Σ_{i,j} ω_ij |θ^k_ij − θ^{k'}_ij|,

where θ^k_ij is the ij element of Θ^k; λ_1 and λ_2 are the tuning parameters for the elastic net and fused penalties, respectively; and ω_ij are the adaptivity factors. We define the latter as follows

    ω_ij = ( Σ_{k<k'} |θ^k_ij − θ^{k'}_ij| + Σ_{k<k'} Σ_{l ∈ N_ij} ( |θ^k_il − θ^{k'}_il| + |θ^k_jl − θ^{k'}_jl| ) )^{−γ},        (2)

where the θ^k_ij in (2) are the initial estimates of the network, N_ij denotes the set of neighbors of link (i, j), that is, the set of links connected to genes i and j; and γ is a positive parameter that controls the level of adaptivity. This adaptivity factor encourages fusion of link (i, j) across all classes when the link is already close across classes, or when its neighbors are. By doing this, we adapt the tuning parameter for the fused penalty, λ_2, based on an initial zero-consistent estimate of the network [7].
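For concreteness, the penalized objective can be evaluated directly from the class-wise precision matrices and the adaptivity factors; a minimal NumPy sketch of evaluating (1) as reconstructed above (variable names are illustrative, not from the abstract):

import numpy as np

def penalized_loglik(Thetas, Ss, ns, lam1, lam2, alpha, omega):
    # Fit term plus elastic net and adaptive fused penalties of equation (1).
    # Thetas, Ss: lists of K precision / empirical covariance matrices;
    # ns: per-class sample sizes; omega: matrix of adaptivity factors from (2).
    K = len(Thetas)
    fit = sum(n * (np.linalg.slogdet(T)[1] - np.trace(S @ T))
              for n, S, T in zip(ns, Ss, Thetas))
    off = ~np.eye(Thetas[0].shape[0], dtype=bool)          # i != j entries
    enet = sum(np.sum(alpha * np.abs(T[off]) + (1 - alpha) * T[off] ** 2)
               for T in Thetas)
    fused = sum(np.sum(omega * np.abs(Thetas[k] - Thetas[kp]))
                for k in range(K) for kp in range(k + 1, K))
    return fit - lam1 * enet - lam2 * fused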

Figure 1 shows the ROC curves, averaged over 50 replications, for modular and non-modular simulated data. They compare our method, Adaptivity I, for γ = 1 and γ = 0.5, with the regular fused lasso (No adaptivity). The figure shows that our method performs better in the presence of modularity and as well as the regular fused lasso when modularity is absent.

Fig. 1: ROC curves (TPR versus FPR) for (a) non-modular networks and (b) modular networks, comparing No adaptivity, Adaptivity I-1 and Adaptivity I-0.5. All methods perform similarly for non-modular networks; Adaptivity I performs better in the presence of modularity.

Generalization to specific pairwise penalties

The adaptivity factor (2) can be modified so that it penalizes particular pairwise differences. This is necessary when looking for neighborhoods where all links are equal across the same subset of classes. In this case the adaptivity factor ω^{kk'}_ij has to be defined for each pair of classes k, k'.

Data sets with unequal sample sizes present a challenge for network estimation. Classes with larger sample sizes tend to dominate the estimation in two ways. They make the estimated networks of classes with smaller sample sizes sparser than those with larger sample sizes, and they cause classes with smaller sample sizes to fuse with each other faster than with classes with larger sample sizes. To alleviate this, we define a class-specific sparsity penalty and a pairwise-specific fused penalty, both corrected by an effective sample size [4].

Another interesting problem arises in the presence of ordered variables. For example, if survival data is available, samples can be grouped into T survival levels, thus creating a total of KT classes. The fact that survival is an ordered variable implies that, for a given cancer class k, network links can be fused only for consecutive survival levels t and t + 1. This is part of our future work.

To optimize equation (1) in the presence of pairwise-specific adaptivity factors ω^{kk'}_ij, we solve the required fused lasso problem following [6].

We simulate data with a pair-specific modular structure (Modular networks 2), as opposed to the previous example (Modular networks 1). In Figure 2 we show the box plots, computed over 100 replications, for the TPR with the FPR constrained to approximately 0.1. No adaptivity corresponds to the regular fused lasso, Adaptivity I to the non-pairwise-specific fused penalty with γ = 1, and Adaptivity II to the pairwise-specific case, also computed with γ = 1.

Fig. 2: Box plots for TPR with FPR = 0.1. All methods perform similarly for non-modular networks; our method performs better in the presence of modularity.

We see that, in the absence of modularity, all methods perform similarly. When the same modularity structure is present across all classes, Adaptivity I performs better. In the case of pair-specific class modularity, Adaptivity II shows superior performance.

Acknowledgments

We thank our collaborators from the Sahlgrenska Academy, Sven Nelander and Teresia Kling, and Holger Hoefling for sharing unpublished results with us.

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[2] P. Danaher, P. Wang, D. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. arXiv:1111.0324v1, 2011.

[3] R. Jornsten et al. Network modeling of the transcriptional effects of copy number aberrations in glioblastoma. Molecular Systems Biology, 7(486):485–516, 2011.

[4] J. Sanchez, T. Kling, P. Johansson, S. Nelander, R. Jornsten. Comparative network analysis of human cancer: sparse graphical models with modular constraints and sample size correction. Technical report, Chalmers University of Technology, 2013.

[5] D. B. Sharma, H. D. Bondell, H. H. Zhang. Consistent Group Identification and Variable Selection in Regression with Correlated Predictors. Journal of Computational and Graphical Statistics, in press, 2012.

[6] G. Ye, X. Xie. Split Bregman method for large scale fused Lasso. Computational Statistics and Data Analysis, 55(4):1552–1569, 2011.

[7] H. Zou. The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.


Learning Rates of l1-regularized Kernel Regression

Lei Shi, Xiaolin Huang and Johan A.K. Suykens
Department of Electrical Engineering, KU Leuven, ESAT-SCD-SISTA, B-3001 Leuven, Belgium
[email protected] (L. Shi), [email protected] (X. Huang), [email protected] (J. Suykens)

Abstract: We study the learning behavior of l1-regularized kernel regression. We show that a function space involved in the error analysis, induced by the l1-regularizer and the kernel function, has nice behavior in terms of the l2-empirical covering numbers of its unit ball. Based on this result, we obtain the best learning rates for the algorithm so far.
Keywords: Learning theory, l1-regularization, l2-empirical covering number, error analysis

1 Introduction

The regression problem aims at estimating functional relations from random samples and occurs in various statistical inference applications. An output estimator of regression algorithms is usually expressed as a linear combination of features, i.e., a collection of candidate functions. As an important issue in statistical learning theory and methodologies, sparsity focuses on studying the sparse representations of such linear combinations resulting from the algorithms. It is widely known that an ideal way to obtain the sparsest representations is to penalize the combination coefficients by the l0-norm. However, algorithms based on the l0-norm often lead to an NP-hard discrete optimization problem, which motivates researchers to consider the lq-norm (0 < q ≤ 1) as a substitute. In particular, l1-norm constrained or penalized algorithms have achieved great success in a wide range of areas from signal recovery to variable selection in statistics. Due to the intensive study of compressed sensing, algorithms involving the l1-norm have drawn much attention in the last few years and have been used in various applications, including image denoising, medical reconstruction and database updating.

Here, we focus on l1-regularized kernel regression. This algorithm minimizes a least-squares loss functional with an additional coefficient-based l1-penalty term over a linear span of features generated by some kernel function. We establish a rigorous mathematical analysis of the asymptotic behavior of the algorithm within the framework of statistical learning theory.

2 Formal Setting

Let X be a compact subset of R^d and Y ⊂ R, and let ρ be a Borel probability distribution on Z = X × Y. For f : X → Y and (x, y) ∈ Z, the least-squares loss is given by (f(x) − y)². The resulting target function is called the regression function and satisfies

    f_ρ = argmin { ∫_Z (f(x) − y)² dρ  :  f : X → Y measurable }.

In the supervised learning framework, ρ is unknown and one estimates f_ρ based on a set of observations z = {(x_i, y_i)}_{i=1}^m ∈ Z^m, which is assumed to be drawn independently according to ρ. We additionally suppose that the conditional distribution ρ(·|x) is supported on [−M, M], for some M ≥ 1 and each x ∈ X.

Given a kernel function K : X × X → R, the output estimator of l1-regularized kernel regression is expressed as f = Σ_{i=1}^m c^z_i K(x, x_i), where its coefficient sequence c^z = (c^z_i)_{i=1}^m is a solution of the optimization problem

    min_{c ∈ R^m}  { (1/m) Σ_{j=1}^m ( y_j − Σ_{i=1}^m c_i K(x_j, x_i) )² + γ ‖c‖_1 }.

Here γ > 0 is called a regularization parameter and ‖c‖_1 denotes the l1-norm of c. Recall that for any sequence w = (w_n)_{n=1}^∞, the l1-norm is defined as

    ‖w‖_1 = Σ_{n ∈ supp(w)} |w_n|,

where supp(w) := {n ∈ N : w_n ≠ 0}. The kernel K here is not necessarily symmetric or positive semi-definite, which leads to much flexibility.
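Computationally, the coefficient-based problem above is an ordinary l1-penalized least-squares problem with the kernel matrix as design matrix; a minimal illustrative sketch (not the authors' code) using scikit-learn, whose Lasso objective (1/(2m))‖y − Kc‖₂² + α‖c‖_1 coincides with the problem above for α = γ/2:

import numpy as np
from sklearn.linear_model import Lasso

def l1_kernel_regression(X, y, kernel, gamma):
    # Build the (not necessarily symmetric) kernel design matrix K[j, i] = K(x_j, x_i)
    # and solve min_c (1/m)||y - K c||^2 + gamma ||c||_1 via scikit-learn's Lasso.
    K = np.array([[kernel(xj, xi) for xi in X] for xj in X])
    c = Lasso(alpha=gamma / 2.0, fit_intercept=False).fit(K, y).coef_
    # output estimator f(x) = sum_i c_i K(x, x_i)
    return c, (lambda x: sum(ci * kernel(x, xi) for ci, xi in zip(c, X)))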

3 Main Results

3.1 Capacity of the hypothesis space under l1-constraint
We shall show a nice feature of the l1-regularizer in the form of tight bounds on the l2-empirical covering numbers of a related function space H1.


Definition 1. Define a Banach space H1 = { f : f = Σ_{j=1}^∞ α_j K_{u_j}, {α_j} ∈ l1, {u_j} ⊂ X } with the norm

    ‖f‖ = inf { Σ_{j=1}^∞ |α_j| : f = Σ_{j=1}^∞ α_j K_{u_j} }.

The continuity of K ensures that H1 consists of continuous functions. Denote the ball of radius R > 0 as B_R = {f ∈ H1 : ‖f‖ ≤ R}.

The l2-empirical covering number is defined by means of the normalized l2-metric d_2 on the Euclidean space R^l, i.e., for any a = (a_i)_{i=1}^l, b = (b_i)_{i=1}^l ∈ R^l, the l2-metric is given by

    d_2(a, b) = ( (1/l) Σ_{i=1}^l |a_i − b_i|² )^{1/2}.

Definition 2. For a subset S of a pseudo-metric space (M, d) and ε > 0, the covering number N(S, ε, d) is defined to be the minimal number of balls of radius ε whose union covers S. For a set F of functions on X and ε > 0, the l2-empirical covering number of F is given by

    N_2(F, ε) = sup_{l ∈ N} sup_{u ∈ X^l} N(F|_u, ε, d_2),

where for l ∈ N and u = (u_i)_{i=1}^l ⊂ X^l, we denote by N(F|_u, ε, d_2) the covering number of the subset F|_u = {(f(u_i))_{i=1}^l : f ∈ F} of the metric space (R^l, d_2).

We will present a bound on the logarithmic l2-empirical covering numbers of B_1. This bound holds for a general class of input spaces satisfying an interior cone condition.

Definition 3. A subset X of R^d is said to satisfy an interior cone condition if there exist an angle θ ∈ (0, π/2), a radius R_X > 0, and a unit vector ξ(x) for every x ∈ X such that the cone

    C(x, ξ(x), θ, R_X) = { x + t y : y ∈ R^d, |y| = 1, y^T ξ(x) ≥ cos θ, 0 ≤ t ≤ R_X }

is contained in X.

Remark 1. The interior cone condition excludes sets X with cusps. It is valid for any convex subset of R^d with Lipschitz boundary [1].

Now we are in a position to give our result on the capacity of B_1.

Theorem 1. Let X be a compact subset of R^d. Suppose that X satisfies an interior cone condition and K ∈ C^s(X × X) with s ≥ 2. Then there exists a constant C_{X,K} that depends on X and K only, such that

    log N_2(B_1, ε) ≤ C_{X,K} ε^{−2d/(d+2⌊s⌋)} log(1/ε),    for all 0 < ε ≤ 1,

where ⌊s⌋ denotes the integral part of s.

ρX

,

where ρX is the marginal distribution of ρ on X [3].We say that K is a Mercer kernel if it is continuous,symmetric and positive semi-definite on X×X. Such akernel can generate a reproducing kernel Hilbert space(RKHS) HK [2]. For a continuous kernel function K,define

K(u, v) =∫

X

K(u, x)K(v, x)dρX(x). (1)

Then one can verify that K is a Mercer kernel.

The following result is stated in terms of properties ofthe input space X, the measure ρ and the kernel K.

Theorem 2. Assume that X is a compact convex subset of R^d with Lipschitz boundary, K ∈ C^s(X × X) with s > 0 and f_ρ ∈ H_{K̃} with K̃ defined by (1). Let 0 < δ < 1 and

    Θ = (d + 2s)/(2d + 2s)          if 0 < s < 1,
        (d + 2⌊s⌋)/(2d + 2⌊s⌋)      otherwise,        (2)

where ⌊s⌋ denotes the integral part of s. Take γ = m^{ε−Θ} with 0 < ε ≤ Θ − 1/2. Then with confidence 1 − δ, there holds

    ‖f − f_ρ‖²_{L²_{ρ_X}} ≤ C_ε log(6/δ) (log(2/δ) + 1)^6 m^{ε−Θ},        (3)

where C_ε > 0 is a constant independent of m or δ.

Acknowledgments
Supported by GOA/10/09 MaNet, CoE EF/05/006 (OPTEC), FWO: G.0226.06, G.0302.07, G.0588.09, SBO POM, IUAP P6/04 (DYSCO, 2007-2011), ERC AdG A-DATADRIVE-B, FWO structured modeling.

References

[1] R. Adams and J. Fournier. Sobolev Spaces. Academic Press, 2003.

[2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

[3] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.


Reduced Fixed-Size LSSVM for Large Scale Data

Raghvendra Mall
KU Leuven, ESAT-SCD, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
[email protected]

Johan A.K. Suykens
KU Leuven, ESAT-SCD, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
[email protected]

Abstract: We present sparse reductions to Fixed-Size LSSVM (FS-LSSVM) which can handle large scale data. The FS-LSSVM model achieves sparsity by solving the problem in the primal using a Nystrom-approximated feature map. To create this feature map a set of prototype vectors (PV) is selected. However, this solution is not the sparsest. We investigate the sparsity-error trade-off by introducing a second level of sparsity. This is done by means of an iterative sparsifying L0-norm based reduction of the FS-LSSVM model. We conduct experiments on two large scale real world datasets - Forest Covertype for classification and Year Prediction for regression - to show the effectiveness of the proposed models and the trade-off between sparsity and error estimations.
Keywords: sparse models, FS-LSSVM, L0-norm

1 Introduction

The LSSVM model [1] has become a state-of-the-art technique in classification and regression. The LSSVM model solves an optimization problem which in the dual leads to solving a system of linear equations. A drawback of the LSSVM model is that each data point becomes a support vector (SV). Several works in the literature, including [2-8], address the problem of sparsity in the LSSVM model, but these techniques cannot guarantee a great reduction in the number of support vectors. One approach that directly enforces sparsity from the beginning was introduced in [9, 11] and is referred to as fixed-size least squares support vector machines. This method uses an explicit expression for the feature map obtained with the Nystrom method [10] on a set of prototype vectors (PV) of cardinality M ≪ N. However, this is not the sparsest solution and the choice of the optimal value of M is an open problem. In [11], M is selected as the number of PV required for the FS-LSSVM performance to be similar to the LSSVM performance.

In recent years, the L0-norm has received increased attention. The L0-norm counts the number of non-zero elements of a vector, so minimizing it can result in extremely sparse models. But since this is an NP-hard problem, several approximations have been discussed in [12, 14]. In this paper, we propose sparse reductions to a FS-LSSVM model using the iterative sparsifying procedure for the L0-norm introduced in [14]. The major motivation for obtaining a sparse solution is that sparseness allows memory- and computationally-efficient techniques and reduces the number of support vectors, which decreases the out-of-sample prediction time.

2 Sparse Reductions to FS-LSSVM

2.1 ALL L0-norm FS-LSSVM model
For this method, we first need to build a FS-LSSVM model. For this purpose, we select M prototype vectors by maximizing the quadratic Renyi entropy, which approximates the information in the large N × N kernel matrix with a smaller M × M kernel matrix. We then generate an explicit approximate feature map using the Nystrom approximation and solve the optimization problem in the primal, resulting in the model (w, b). The L0-norm problem can then be formulated as:

    min_{w,b,e}  J(w, e) = ‖w‖_0 + (γ/2) Σ_{i=1}^N e_i²
    s.t.  wᵀ φ(x_i) + b = y_i − e_i,    i = 1, . . . , N.        (1)

The weight vector w can be approximated as a linear combination of the M prototype vectors, i.e. w ≈ Σ_{j=1}^M β_j φ(x_j) [13], where the β_j ∈ R need not be Lagrange multipliers. We apply a regularization weight λ_j to each of these β_j to iteratively sparsify an approximate L0-norm solution. Based on [14], we construct the following generalized primal problem:

    min_{β,b,e}  J(β, e) = (1/2) Σ_{j=1}^M λ_j β_j² + (γ/2) Σ_{i=1}^N e_i²
    s.t.  Σ_{j=1}^M β_j K_ij + b = y_i − e_i,    i = 1, . . . , N        (2)


where K is the kernel matrix with K_ij = φ(x_i)ᵀ φ(x_j), x_i belonging to the training set and x_j to the prototype vector (PV) set. After elimination of e_i, the optimization problem can be rewritten as:

    min_{β,b}  J(β, b) = (1/2) Σ_{j=1}^M λ_j β_j² + (γ/2) Σ_{i=1}^N ( y_i − (Σ_{j=1}^M β_j K_ij + b) )².        (3)

Equation (3) allows extending this approach to large scale datasets. The solution to (3) can be obtained by solving

    [ KᵀK + (1/γ) diag(λ)    Kᵀ1_N   ] [ β ]     [ Kᵀy   ]
    [ 1_NᵀK                  1_Nᵀ1_N ] [ b ]  =  [ 1_Nᵀy ]        (4)

where KᵀK is an M × M matrix. The procedure to obtain sparseness involves iteratively solving the system (4) for decreasing values of λ as shown in [14]. To prevent the L0-norm approximation from falling into a bad local minimum, we initialize β_j = w_j, j = 1, . . . , M. In each iteration we set λ_j = 1/β_j². Once we obtain the reduced set (SV) of support vectors, we re-perform FS-LSSVM using this SV set. More details about the proposed method can be found in [15].
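A minimal sketch of this iterative sparsifying step, repeatedly solving the linear system (4) with λ_j = 1/β_j² starting from the FS-LSSVM primal weights, is given below (illustrative NumPy code with an ad hoc safeguard, iteration count and support threshold; not the authors' implementation [15]):

import numpy as np

def l0_sparsify(K, y, w0, gamma, n_iter=50, eps=1e-8):
    # K: N x M kernel matrix between training points and prototype vectors;
    # w0: initial FS-LSSVM primal weights (beta_j = w_j at the start).
    N, M = K.shape
    beta, b, ones = w0.copy(), 0.0, np.ones(N)
    for _ in range(n_iter):
        lam = 1.0 / np.maximum(beta ** 2, eps)        # lambda_j = 1 / beta_j^2
        A = np.block([[K.T @ K + np.diag(lam) / gamma, (K.T @ ones)[:, None]],
                      [(ones @ K)[None, :],            np.array([[float(N)]])]])
        sol = np.linalg.solve(A, np.concatenate([K.T @ y, [ones @ y]]))
        beta, b = sol[:M], sol[M]
    support = np.abs(beta) > 1e-6                     # reduced SV set
    return beta, b, support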

2.2 Experiments
We conducted experiments on two large scale datasets with nearly 0.5 million points each. Results are reported in Table 1. We observe that the number of support vectors is reduced without much difference in the error estimations.

                       Forest Cover                      Year Prediction
Algorithm      Error          SV    Time             Error          SV    Time
FS-LSSVM       0.2 ± 0.02     763   118500 ± 9294    0.4 ± 0.03     718   248046 ± 15330
ALL L0-norm    0.22 ± 0.03    505   118570 ± 9248    0.44 ± 0.05    595   248412 ± 15248

3 Conclusion

We proposed a sparse reduction to FS-LSSVM modelfor large scale data which results into sparser solutionthan the FS-LSSVM model without significant differ-ence in error estimations.

AcknowledgmentsThis work was supported by Research Council KUL,ERC AdG A-DATADRIVE-B, GOA/10/09MaNet,CoE EF/05/006, FWO G.0588.09, G.0377.12, SBOPOM, IUAP P6/04 DYSCO, COST intelliCIS.

References

[1] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293-300, 1999.

[2] D. Geebelen, J. A. K. Suykens and J. Vandewalle. Reducing the Number of Support Vectors of SVM Classifiers Using the Smoothed Separable Case Approximation. IEEE Transactions on Neural Networks and Learning Systems, 23(4):682-688, April 2012.

[3] C. J. C. Burges. Simplified support vector decision rules. In Proceedings of the 13th International Conference on Machine Learning, 71-77, 1996.

[4] B. Scholkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. Muller, G. Ratsch and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, September 1999.

[5] T. Downs, K. E. Gates and A. Masters. Exact Simplification of Support Vector Solutions. Journal of Machine Learning Research, 2:293-297, December 2001.

[6] Y. J. Lee and O. L. Mangasarian. RSVM: Reduced Support Vector Machines. In Proceedings of the 1st SIAM International Conference on Data Mining, 2001.

[7] J. A. K. Suykens, L. Lukas and J. Vandewalle. Sparse approximation using Least Squares Support Vector Machines. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 2000), 757-760, 2000.

[8] Y. Li, C. Lin and W. Zhang. Improved Sparse Least-Squares Support Vector Machine Classifiers. Neurocomputing, 69(13):1655-1658, 2006.

[9] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Publishing Co., Pte, Ltd. (Singapore), ISBN 981-238-151-1, 2002.

[10] C. K. I. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems, 13:682-688, 2001.

[11] K. De Brabanter, J. De Brabanter, J. A. K. Suykens and B. De Moor. Optimized Fixed-Size Kernel Models for Large Data Sets. Computational Statistics & Data Analysis, 54(6):1484-1504, 2010.

[12] E. J. Candes, M. B. Wakin and S. Boyd. Enhancing Sparsity by Reweighted l1 Minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, special issue on sparsity, 2008.

[13] G. C. Cawley and N. L. C. Talbot. Improved sparse least-squares support vector machines. Neurocomputing, 48(1-4):1025-1031, 2002.

[14] K. Huang, D. Zheng, J. Sun, Y. Hotta, K. Fujimoto and S. Naoi. Sparse Learning for Support Vector Classification. Pattern Recognition Letters, 31(13):1944-1951, 2010.

[15] R. Mall and J. A. K. Suykens. L0-reduced LSSVMs for Large Scale Data. Internal Report 12-185, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2012.

A Notations

               FS-LSSVM                 L0-norm                 ALL L0-norm FS-LSSVM
SV/Train       N/N                      M/N                     M'/N
Primal         w           (Step 1)     β          (Step 2)     w'
               φ : R^d → R^M     →      K ∈ R^{N×M}      →      φ' : R^d → R^{M'}

Tab. 2: Overview of the sparse reduction of the FS-LSSVM method. First FS-LSSVM is performed in the primal. In Step 1, an iterative sparsifying L0-norm procedure is performed in the primal, where w ≈ Σ_{i=1}^M β_i φ(x_i) and regularization weights λ_j are applied on β. The K matrix incorporates information about the entire training set (N). Finally, FS-LSSVM is re-performed in the primal using the reduced SV set (M').


Pattern Recognition for Neuroimaging Toolbox

Jessica Schrouff, University of Liege, Belgium ([email protected])
Maria J. Rosa, University College London, UK ([email protected])
Jane Rondina, University College London, UK
Andre F. Marquand, King's College London, UK
Carlton Chu, NIMH, NIH, USA
John Ashburner, University College London, UK
Jonas Richiardi, Stanford University, USA
Christophe Phillips, University of Liege
Janaina Mourao-Miranda, University College London, UK

Abstract: In recent years, mass univariate statistical analyses of neuroimaging data have been complemented by the use of multivariate pattern analyses, especially based on machine learning models. While these allow an increased sensitivity for the detection of spatially distributed effects compared to univariate techniques, they lack an established and accessible software framework. Here we introduce the “Pattern Recognition for Neuroimaging Toolbox” (PRoNTo), an open-source, cross-platform and MATLAB-based software package comprising many necessary functionalities for machine learning modelling of neuroimaging data.

Keywords: software, neuroimaging, machine learning

1 Introduction

Various imaging modalities, such as functional/structural Magnetic Resonance Imaging (fMRI/sMRI) and Positron Emission Tomography (PET), have been developed to record brain structure and activity. Until recently, such data were analysed using standard univariate statistics, for example by linking the time-series of the signal in each voxel with a regressor, as in the General Linear Model (GLM) implemented in Statistical Parametric Mapping (SPM, [2]). Although univariate analyses have proven powerful for making regionally specific inferences on brain function and structure, there are limitations to the type of research questions that they can address. More recently, these mass univariate analyses have been complemented by pattern recognition analyses, in particular using machine learning based predictive models [3]. These analyses focus on predicting a variable of interest (e.g. mental state 1 vs. mental state 2, or patients vs. controls) from the pattern of brain activation/anatomy over a set of voxels. Due to their multivariate properties, these methods can achieve relatively greater sensitivity and are able to detect subtle, spatially distributed patterns in the brain. Potentially, pattern recognition can also be used to perform computer-aided diagnosis of neurologic or psychiatric disorders. Currently, the existing implementations consist of small code snippets, or sets of packages, and lack a dedicated single, integrated, and flexible software framework. In addition, the use of existing packages often requires high-level programming skills.

2 Pattern Recognition for Neuroimaging Toolbox

The “Pattern Recognition for Neuroimaging Toolbox” (PRoNTo¹, [6]) is a user-friendly and open-source toolbox that makes machine learning modelling available to every neuroimager. In PRoNTo, brain scans are treated as spatial patterns and learning models are used to identify statistical properties of the data that can be used to discriminate between experimental conditions or groups of subjects (classification models) or to predict a continuous measure (regression models). In terms of neuroimaging modalities, PRoNTo accepts NIfTI files² and can therefore be used to analyse sMRI and fMRI, PET, SPM contrast images and potentially any other modality in NIfTI file format. Its framework allows fully flexible machine learning based analyses and, while its use requires no programming skills, advanced users can easily access technical details and expand the toolbox with their own developed methods. Each step of the analysis can also be reviewed via user-friendly displays. Figure 1 provides an overview of the toolbox framework.

¹ PRoNTo, and all its documentation, are available to download freely from: http://www.mlnl.cs.ucl.ac.uk/pronto/
² http://nifti.nimh.nih.gov/nifti-1/


Fig. 1: PRoNTo framework. PRoNTo consists of five main analysis modules (blue boxes in the centre): dataset specification, feature set selection, model specification, model estimation and weights computation. In addition, it provides two main reviewing and displaying facilities (model, kernel and cross-validation displays, as well as results display). PRoNTo receives as input any NIfTI image (comprising the data and a first-level mask, while an optional second-level mask can also be entered). The outputs of PRoNTo include: a data structure called PRT.mat, a data matrix (with all features), one or more kernels, and (optionally) images with the classifier weights.

PRoNTo can be used in three ways: through a graphical user interface requiring no programming skills, using the MATLAB batch system, or by scripting function calls. In the current version of PRoNTo, two linear kernel classification algorithms are embedded in the framework: Support Vector Machines ([1], LIBSVM implementation) and (binary and multiclass) Gaussian Process classification ([4], GPML toolbox). Regression can be performed using Kernel Ridge Regression (KRR, [5]), Relevance Vector Regression (RVR, [7]) or Gaussian Process Regression [4]. All algorithms are wrapped into what is called a “machine”, which is independent from the design definition and cross-validation procedure. This allows an easy integration of new machine learning algorithms, enhancing the exchange of newly developed methods within the community, and the possibility to develop more advanced validation frameworks (e.g. nested cross-validation).
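PRoNTo itself is MATLAB-based; purely as an illustration of the kind of analysis such a “machine” wraps, the following Python sketch treats each scan as one spatial pattern, precomputes a linear kernel and estimates accuracy with leave-one-subject-out cross-validation (all choices here, including the SVM classifier, are assumptions for the illustration and not part of the toolbox):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def linear_kernel_classification(scans, labels, subjects):
    # scans: list of 3-D images; labels: condition/group per scan;
    # subjects: subject identifier per scan, used as the CV grouping.
    X = np.asarray([s.ravel() for s in scans])   # voxels as features
    K = X @ X.T                                  # precomputed linear kernel
    clf = SVC(kernel="precomputed")
    scores = cross_val_score(clf, K, labels, groups=subjects,
                             cv=LeaveOneGroupOut())
    return scores.mean()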

Several data sets were analysed with PRoNTo, showing the breadth of questions it can address [6]. As examples: Do patterns of brain activation recorded in fMRI encode information about a mental state? Can groups of subjects be distinguished based on features derived from their sMRI? Could their age be predicted with these neuro-anatomical features?

3 Discussion and Conclusions

In this work, we presented PRoNTo, a freely available software package which addresses neuroscientific questions using machine learning based modelling. Although in its first version, PRoNTo provides both graphical interfaces for easy use and a flexible programming framework. The authors therefore hope to facilitate the interaction between the neuroscientific and machine learning communities. On the one hand, the machine learning community should be able to contribute novel machine learning models to the toolbox. On the other hand, the toolbox should provide a variety of tools for (clinical) neuroscientists, enabling them to ask new questions that cannot be easily investigated using existing statistical analysis tools.

Acknowledgments
The development of PRoNTo was mainly supported by the Pascal2 network and its Harvest Programme.

References

[1] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

[2] K. Friston, J. Ashburner, S. Kiebel, T. Nichols, and W. Penny. Statistical Parametric Mapping: the analysis of functional brain images. Elsevier Academic Press, London, 2007.

[3] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2003.

[4] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[5] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning, pages 515–521, 1998.

[6] J. Schrouff, M. Rosa, J. Rondina, A. Marquand, C. Chu, J. Ashburner, C. Phillips, J. Richiardi, and J. Mourao-Miranda. PRoNTo: Pattern Recognition for Neuroimaging Toolbox. Neuroinformatics, pages 1–19, 2013.

[7] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.


Stable LASSO for High-Dimensional Feature Selection through Proximal Optimization

Roman Zakharov, Pierre Dupont
Machine Learning Group, ICTEAM Institute, Universite catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium
[email protected], [email protected]

Abstract: The l1-norm regularization is commonly used when estimating (generalized) linear models while enforcing sparsity. The automatic feature selection embedded in such an estimation is however known to be highly unstable since, among correlated features, an l1 penalty tends to favor the selection of a single feature, essentially picked at random. This paper introduces a modified optimization objective to stabilize LASSO or similar approaches. The solution to this modified problem is constrained by a norm ball rescaled according to the variances of the predictor variables. We further describe how such problems can be efficiently solved through proximal optimization. Classification experiments conducted on several microarray datasets show the benefits of the proposed approach, both in terms of stability and predictive performances, as compared to the original LASSO, Elastic Net, Trace LASSO and a simple variance based filtering.
Keywords: feature selection, regularization, stability, LASSO, proximal optimization

1 Introduction

Feature selection aims at improving the interpretability of predictive models and at reducing the computational cost when predicting from new observations. Such a selection is also desirable when it is a priori known that the model should be sparse, or to prevent overfitting. This is especially relevant when the number p of input features, or predictor variables, largely exceeds the number n of training observations. In such contexts, feature selection can also increase the predictive performances.

The l1-norm is often used to regularize (generalized) linear models while performing an automatic feature selection by driving most model coefficients towards zero. The LASSO method [2] precisely combines such a regularization with a least squares loss for estimating a regression model.

Predictive models estimated with a LASSO penalty are however known to be highly unstable, which means that small data perturbations can imply drastic changes in the subset of automatically selected variables. The lack of stability of the LASSO is generally attributed to the fact that, among several correlated features, an l1 penalty tends to favor the selection of a single feature, essentially picked at random. In contrast, univariate filter methods, such as a t-test feature ranking, rely on general statistical characteristics of the data, which are much less sensitive to small data perturbations. Such simple selection methods are typically more stable but ignore the possible correlations between features and are not embedded into the estimation of a predictive model.

The S-LASSO method detailed in Section 2 relies on a modified optimization objective to stabilize the LASSO. The solution to this modified problem is constrained by a norm ball rescaled according to the variances of the predictor variables. In contrast to the Elastic Net [5] and Trace LASSO [1] approaches, which favor the joint selection of correlated features, S-LASSO tends to discard low variance features because they are expected to be less informative.

2 A scaled proximal method for feature selection

Let X = (x_1, . . . , x_n)^T ∈ R^{n×p} be the design matrix made of n training observations in R^p, and y ∈ R^n be the response vector. Learning the weight vector w of a simple linear model y = w^T x + ε, where ε denotes Gaussian noise with zero mean and variance σ², is commonly phrased as a convex optimization problem of the form

    min_{w ∈ R^p}  f(w) + Ω(w),        (1)

where f : R^p → R is a convex differentiable loss function and Ω : R^p → R is a convex norm, not necessarily smooth or Euclidean. The Ω regularization term aims at reducing over-fitting by penalizing large absolute weight values, for which small input changes would have a significant impact on the predicted output.


In this work we propose to modify the general optimization problem (1) by rescaling the norm penalty according to the individual feature variances:

    min_{w ∈ R^p}  f(w) + λ Σ_{j=1}^p (1/r_j) Ω(w_j),        (2)

where the vector r ∈ R^p is proportional to the feature variances. We note that those variances are estimated before centering and normalizing the data to unit variance, as is usual when estimating a LASSO model. The proposed method also offers a general framework beyond this specific choice of variance weighting. In practice, any vector r can be used to favor the selection of variables a priori believed to be more relevant. This modified objective can be straightforwardly used with any penalty that decomposes component-wise. We focus on the l1-norm as a special case of interest but we also report positive experimental results with an Elastic Net penalty. As compared to (1), the regularization ball associated with Ω(w) now takes the form of an ellipsoid elongated along the directions of higher variance. In other words, the regularization constant λ is rescaled in a component-wise fashion.
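Because the rescaled penalty still decomposes component-wise, its proximal operator with an l1 penalty is just soft-thresholding with feature-specific thresholds proportional to λ/r_j, so (2) can be handled by standard proximal-gradient (ISTA-type) iterations. A minimal sketch for a squared loss (the loss, the step size and the iteration count are assumptions made for the illustration, not the paper's exact setup):

import numpy as np

def s_lasso(X, y, r, lam, n_iter=1000):
    # Proximal gradient for min_w 0.5*||y - Xw||^2 + lam * sum_j |w_j| / r_j,
    # where r is proportional to the per-feature variances.
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        z = w - X.T @ (X @ w - y) / L          # gradient step
        thr = lam / (L * r)                    # component-wise thresholds
        w = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)   # prox (soft-thresholding)
    return w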

The proposed method is related to the Adaptive LASSO [4], which also applies component-wise adaptive weights to the l1 penalty. These weights are initially equal to the ordinary least squares estimates and are iteratively updated under the control of an additional tuning parameter. In contrast, S-LASSO does not require such an additional parameter and iterative reweighting, as it relies on the observed variances along each dimension. The proposed framework is also not restricted to the l1 penalty. Modification of the LASSO by some form of variance weighting has already been proposed in [3]. This related work describes bounds on the prediction error and oracle properties, while we focus on the improved stability of the embedded feature selection as a result of such variance weighting. We also show how the S-LASSO modified objective can be efficiently solved through proximal optimization.

The modified objective (2) promotes stability since the identity of high variance features is expected not to change much when varying the data sampling. As shown experimentally, the predictive performances may also be improved, since features with low variance across learning observations are expected to be less informative for prediction. We also show that solving (2) offers better results than simply pre-filtering features based on their variances. We further detail how this modified objective can be efficiently solved through proximal optimization.

3 Results

We report experimental performances of the proposed method with a LASSO or Elastic Net penalty, and refer to those approaches as S-LASSO and S-ENET respectively. The competing approaches are the original LASSO or Elastic Net with a logistic loss. We also report the performances obtained with Trace LASSO adapted to a classification problem. Since S-LASSO and S-ENET use the individual feature variances to modify the optimization objective, we also compare to variance ranking, a filter method keeping only a desired number of features with the largest variances. Our experiments conducted on 5 microarray datasets illustrate the benefits of the proposed approach both in terms of stability of the gene selection and classification performance, as compared to the original LASSO, Elastic Net or Trace LASSO. In contrast, the stability of variance ranking is always very high but its predictive performances drop drastically when reducing the number of selected features.

Acknowledgments
Roman Zakharov is supported by a FRIA grant (5.1.191.10.F).

References

[1] E. Grave, G. Obozinski, and F. Bach. Trace Lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems, Granada, Spain, 2011.

[2] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[3] S. van de Geer, P. Buehlmann, and S. Zhou. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 5:688–749, 2011.

[4] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

[5] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.


Regularization in topology optimization

Atsushi Kawamoto, Tadayoshi Matsumori, Daisuke Murai and Tsuguo Kondoh

Toyota Central R&D Labs., Inc. Nagakute Aichi 480-1192, Japan

[email protected]

Abstract: This paper deals with topology optimization in terms of the regularization of design spaces. Several filtering techniques have so far been proposed for regularizing design variables and restricting the minimum length scale of designs. They are categorized into two major groups, namely density filters and sensitivity filters. Both filters have their pros and cons in theory and practice. This paper is intended to revisit the theoretical background of topology optimization and discuss the long-lasting issue of regularization in topology optimization through the implementation of a PDE-based filter.

Keywords: topology optimization, regularization, filter

1 Introduction

Topology optimization is a design method that can optimize not only the sizes and dimensions but also the shapes and topologies of target structures [1]. The idea of topology optimization is to consider structural designs as material distribution problems. The structural designs are represented by a scalar function called the characteristic function that takes on values between 0 and 1: 0 stands for void and 1 for solid. With this representation one can freely design complicated layouts of target structures by distributing materials. However, this representation may cause some numerical instabilities due to the lack of smoothness in the characteristic function. To alleviate the numerical instabilities, image processing-based filtering techniques are commonly used in continuum-based topology optimization. They are categorized into two major groups: density filters [2, 3] and sensitivity filters [6]. Sensitivity filters are applied to the design sensitivities rather than the design variables themselves. Among these filters the so-called convolution filter is the most widely used for removing checkerboard patterns [1]. It is simple to implement and it is easy to understand its geometric effects. One drawback is, however, that there remain discrepancies between the filtered sensitivities and the actual sensitivities. On the other hand, PDE-based filtering techniques have also been proposed, in which Helmholtz-type partial differential equations are solved [4, 5]. From the implementation point of view the PDE-based filters are advantageous because one can utilize the existing computational frameworks of FEM. In this paper, we explain the basic concept of topology optimization and the involved methodological elements, and then discuss the regularization of design spaces by specifically focusing on the application of the PDE-based filtering technique directly to the design variables instead of the design sensitivities.

2 Topology optimization

This section gives a short introduction to the topology optimization method and some inherent issues in the methodology. Figure 1 shows the design domain with the boundary and load conditions for a benchmark design problem, in which we maximize the stiffness of the so-called MBB beam by distributing a prescribed amount of material within the design domain. This problem can be formulated as the minimization of the mean compliance under a total volume constraint:

    min_{ρ ∈ [0,1]}  f := ∫_{Γ_N} t · u dΓ    s.t.   g := ∫_D ρ dD − V ≤ 0.        (1)

Note that the design variables are relaxed so as to take intermediate values between 0 and 1. The displacement vector, u, is calculated from the following analysis problem:

    −∇ · (E : ε(u)) = 0    in D,
    u = 0                  on Γ_D,
    (E : ε(u)) · n = t     on Γ_N,        (2)

where E stands for a homogeneous isotropic elasticity tensor and ε := (∇u + ∇uᵀ)/2. The material density ρ is embedded into the elasticity tensor as

    E = ρ^P E_max + (1 − ρ^P) E_min,        (3)

Fig. 1: Design domain and boundary conditions for the 2D minimum mean compliance problem


where E_max is the elasticity tensor of the solid material, i.e., E = E_max when ρ = 1; E_min is a lower bound set to avoid singularity when ρ = 0. P (= 3, typically) is introduced to promote black and white solutions. The design domain is discretized by square elements with edge length ∆x = 0.025. In this problem we pursue a symmetric design; therefore, we only solve the right half of the domain. This problem can be solved by a continuous gradient-based optimization method such as SNOPT. The results are shown in Figure 2. In the left structure a large gray area remains (the grayscale problem). By increasing the value of P we can force the design variables toward 0 or 1. However, we then typically end up with a solution like the right illustration of Figure 2 (the checkerboard problem).

Fig. 2: Optimized structures for P = 1 (left) and P = 3 (right)

3 A PDE-based density filter

Since the density function, ρ ∈ L∞(D), can take anypoint-wise value, the raw density function may produceseverely oscillating designs like the checkerboard pat-tern shown in the right illustration of Figure 2. In or-der to regularize the density function we introduce thefollowing partial differential equation for % ∈ H1(D):

−R2∇2%+ % = ρ. (4)

Then we project % onto % as

% = H(%− 0.5;h), (5)

whereH is a regularized (differentiable) Heaviside func-tion with the half bandwidth h. By appropriately set-ting R and h, these equations function like a low-passfilter that acts on the raw density function, ρ, to pro-duce the smoothed density function, %. Then we use theprojected smooth density function, %, for the materialinterpolation in Eq. (3) instead.
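On a regular grid the filter amounts to one sparse linear solve followed by a point-wise projection; a minimal sketch (the homogeneous Neumann boundary conditions and the particular polynomial smoothed Heaviside used here are assumptions made for the illustration):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pde_filter_and_project(rho, R, dx, h):
    # Solve (I - R^2 Laplacian) rho_f = rho, i.e. equation (4), then apply a
    # regularized Heaviside projection (5) around 0.5 with half bandwidth h.
    ny, nx = rho.shape
    def lap1d(n):
        main = -2.0 * np.ones(n)
        main[0] = main[-1] = -1.0                                # Neumann ends
        off = np.ones(n - 1)
        return sp.diags([off, main, off], [-1, 0, 1]) / dx ** 2
    lap = sp.kronsum(lap1d(nx), lap1d(ny))                       # 2-D Laplacian
    rho_f = spla.spsolve((sp.eye(nx * ny) - R ** 2 * lap).tocsc(),
                         rho.ravel()).reshape(ny, nx)
    t = np.clip((rho_f - 0.5) / h, -1.0, 1.0)
    return 0.5 + 0.75 * (t - t ** 3 / 3.0)                       # projected density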

4 Numerical examples

Figure 3 illustrates the optimized results for the density in grayscale when the parameter R in the PDE (4) is set to 1.0∆x, 2.0∆x, 4.0∆x, and the half bandwidth h in the Heaviside projection (5) is set to 0.1, 0.5, 1.0. The smaller R and the tighter h (upper left) give finer details with a clear outline in the final design, while the opposite combination (lower right) makes the outline extremely blurred. Note that no checkerboard patterns appeared in any of the cases, even though linear elements are used for the discretization.

Fig. 3: Optimized results

5 Conclusions

We revisited the theoretical background of topology optimization and discussed the issues inherent in the regularization of topology optimization through the implementation of a PDE-based filter. Since the filter acts directly on the design variables, the consistency of the design sensitivities is preserved. The effectiveness of the proposed method is confirmed through numerical examples.

References

[1] M.P. Bendsøe, O. Sigmund. Topology Optimization – Theory, Methods, and Applications. Springer Verlag, Berlin Heidelberg, 2003.

[2] B. Bourdin. Filters in topology optimization. International Journal for Numerical Methods in Engineering, 50(9):2143–2158, 2001.

[3] T.E. Bruns, D.A. Tortorelli. Topology optimization of non-linear elastic structures and compliant mechanisms. Computer Methods in Applied Mechanics and Engineering, 190(26-27):3443–3459, 2001.

[4] A. Kawamoto, T. Matsumori, S. Yamasaki, T. Nomura, T. Kondoh, S. Nishiwaki. Heaviside projection based topology optimization by a PDE-filtered scalar function. Structural and Multidisciplinary Optimization, 44:19–24, 2011.

[5] B.S. Lazarov, O. Sigmund. Filters in topology optimization based on Helmholtz-type differential equations. International Journal for Numerical Methods in Engineering, 86(6):765–781, 2011.

[6] O. Sigmund. Morphology-based black and white filters for topology optimization. Structural and Multidisciplinary Optimization, 33(4-5):401–424, 2007.


Classification of MCI and AD patients combining PET data and psychological scores

F. Segovia, C. Bastin, E. Salmon and C. Phillips
Cyclotron Research Centre, University of Liege, Belgium

[email protected]

Abstract: This study's aim was to measure the advantages of using psychological test data in the automatic classification of functional brain images in order to assist the diagnosis of neurodegenerative disorders such as Alzheimer's disease (AD). Several computer-aided diagnosis systems for AD based on PET images were developed. Some of them used psychological scores beside the image data in the classification step and others did not. The results show that the ones that take the psychological scores into account achieve higher accuracy rates.

Keywords: Alzheimer’s disease, machine learning, psychological scores

1 Introduction

In recent years, many computer-aided diagnosis (CAD) systems for neurodegenerative disorders have been presented [3, 5, 6]. Based on the assumption that pathological manifestations of these disorders appear some years before subjects become symptomatic [1, 8], they try to diagnose them even before the classical diagnosis procedure based on psychological tests does. These systems take advantage of the machine learning improvements carried out in the last decades [7] and report accuracy rates over 90% when trying to distinguish between patients and healthy controls [4].

Most of the recent CAD systems for neurodegenerative disorders are based on brain imaging (including MRI, fMRI, PET and SPECT data) and report high accuracy rates. The small sample size problem [2] can be addressed by means of a feature extraction technique that reduces the huge amount of data contained in a brain image into a relatively small unidimensional vector. In this case, the structure of CAD systems based on neuroimaging and machine learning is as follows: after the preprocessing of the images (which involves spatial registration and intensity normalization), an algorithm is applied to select and summarize the relevant information. This information is rearranged in a vector and used as features for the classification step. Finally, a statistical classifier is used to separate pathological and control subjects, performing in that way the automatic diagnosis.

This paper shows that combining information from PET images and psychological tests can improve the accuracy of CAD systems for dementia. Specifically, we have developed several CAD systems to measure the benefit of using some psychological scores beside the image features in the classification step. The extraction of the image features was carried out by applying three different techniques, namely Principal Component Analysis (PCA), Partial Least Squares (PLS) and Independent Component Analysis (ICA). The results clearly show that using the psychological scores beside the image features improves the global accuracy of the systems.

2 Methods

Some dimension reduction techniques were applied to reduce the information contained in the brain images:

Principal Component Analysis (PCA) is a simple, non-parametric method for extracting relevant information from confusing data sets. Mathematically, PCA performs a linear transformation that maps the data to a new coordinate system such that the largest variance by any projection of the data comes to lie on the first dimension, the second largest variance on the second dimension, and so on.

Partial Least Squares (PLS) is a statistical method similar to PCA; however, PLS carries out the transformation by maximizing the covariance between the data and some properties of the data. For classification purposes the PLS1 variant, which considers the class membership information as the unique property of the data, was used.

Independent Component Analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents, assuming the mutual statistical independence of the non-Gaussian source signals. ICA has been successfully applied to dimension reduction problems by projecting the data onto its independent components, performing the reduction in that way.
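The three reducers can be sketched with off-the-shelf implementations, for instance as follows. The data matrix, labels and component counts are placeholders, and the FDR-based component selection used in Section 3 for PLS and ICA is not reproduced here.

    import numpy as np
    from sklearn.decomposition import PCA, FastICA
    from sklearn.cross_decomposition import PLSRegression

    # X: one row per subject, each row a flattened (preprocessed) PET image;
    # y: binary labels. Both are placeholders for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(46, 5000))
    y = rng.integers(0, 2, size=46)

    # PCA: components ordered by explained variance; keep 90% of the variance.
    X_pca = PCA(n_components=0.90).fit_transform(X)

    # PLS1: components maximise covariance with the class labels.
    X_pls = PLSRegression(n_components=10).fit(X, y).transform(X)

    # ICA: project the images onto estimated independent components.
    X_ica = FastICA(n_components=10, random_state=0).fit_transform(X)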


Fig. 1: Accuracy rates achieved by the developed CAD systems, compared with those achieved by the same systems using only the PET images as data source.

3 Experiments and results

Three different CAD systems were developed, each one implementing one of the three dimension reduction techniques mentioned in Section 2. The final number of components (features) per image was determined by the percentage of variance (90% of the total) in the case of PCA, and using the Fisher Discriminant Ratio in the case of PLS and ICA (the selected components gather 95% of the total FDR). In addition, five psychological scores were used. Three of them were derived from a verbal cued recall memory task, reflecting respectively the efficiency of memory encoding (immediate recall), long-term episodic memory (cued recall) and monitoring capacities (intrusions). Two additional scores were phonemic (letter P) and semantic (animals) verbal fluency measures, as an index of executive functioning.

In this initial work, the combination of the image features and the psychological scores was carried out by concatenating both into a feature vector whose size is the sum of the number of image features and psychological scores. The classification step was performed by means of a Support Vector Machine (SVM) classifier with a linear kernel. The accuracy of the systems was estimated using a database with 46 PET images from subjects originally diagnosed with Mild Cognitive Impairment (MCI). Those subjects whose diagnosis remained MCI during three years or more were labeled as MCI, and those whose diagnosis changed to dementia of Alzheimer type in the same period were labeled AD. In summary, the database contains 20 MCI and 26 AD images and, for each of them, the 5 psychological scores described above. Because of the reduced number of images available, we used a leave-one-out cross-validation scheme to compute the performance of the systems. Figure 1 shows the accuracy rates obtained with both imaging features and psychological scores or with only the imaging features, for the 3 different dimension reduction techniques. When the psychological scores are included, the accuracy is higher than when only the image features are provided. The difference between these pairs of accuracy values was statistically assessed through a non-parametric test: 1000 sets of random psychological scores (same range as the original ones) were generated, the classifier was trained with these random scores (together with the image features) and the accuracy was estimated. A p-value was then calculated as the number of cases where the accuracy obtained with the random scores was larger than that obtained with the true scores, divided by 1000, i.e. the probability of obtaining a better accuracy with a random score. A p-value of 0.001, 0.002 and 0.01 was obtained for the PCA-, PLS- and ICA-based CAD systems respectively.
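The evaluation protocol above can be sketched as follows, assuming placeholder feature matrices: concatenation of image features with psychological scores, a linear SVM with leave-one-out accuracy, and the permutation-style p-value. The preprocessing and feature extraction of the real PET data are not reproduced.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loo_accuracy(features, scores, labels):
        """Concatenate image features and psychological scores, then estimate
        accuracy of a linear SVM with leave-one-out cross-validation."""
        X = np.hstack([features, scores])
        return cross_val_score(SVC(kernel="linear"), X, labels,
                               cv=LeaveOneOut()).mean()

    rng = np.random.default_rng(0)
    img_feat = rng.normal(size=(46, 12))   # e.g. PCA/PLS/ICA image features
    psy = rng.normal(size=(46, 5))         # the five psychological scores
    y = np.array([0] * 20 + [1] * 26)      # 20 MCI, 26 AD (placeholder ordering)

    acc_true = loo_accuracy(img_feat, psy, y)

    # p-value: probability that random scores of the same range do better
    n_perm, count = 1000, 0                # 1000 permutations as in the text (slow)
    lo, hi = psy.min(axis=0), psy.max(axis=0)
    for _ in range(n_perm):
        rand_scores = rng.uniform(lo, hi, size=psy.shape)
        if loo_accuracy(img_feat, rand_scores, y) > acc_true:
            count += 1
    p_value = count / n_perm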

In light of these results we can conclude that adding information from psychological tests to automatic diagnosis systems for neurodegenerative disorders does improve the accuracy of the systems.

References

[1] Elisa Canu et al. Microstructural diffusion changes are independent of macrostructural volume loss in moderate to severe Alzheimer’s disease. The Journal of Alzheimer’s Disease, 19(3):963–976, 2010.

[2] R. P. W. Duin. Classifiers in almost empty spaces. In Proceedings 15th International Conference on Pattern Recognition, volume 2, pages 1–7. IEEE, 2000.

[3] Y. Fan et al. Spatial patterns of brain atrophy in MCI patients, identified via high-dimensional pattern classification, predict subsequent cognitive decline. NeuroImage, 39(4):1731–1743, 2008. PMID: 18053747.

[4] Manhua Liu et al. Hierarchical fusion of features and classifier decisions for Alzheimer’s disease diagnosis. Human Brain Mapping, February 2013. PMID: 23417832.

[5] Janaina Mourão-Miranda et al. Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data. NeuroImage, 28(4):980–995, December 2005. PMID: 16275139.

[6] F. Segovia et al. A comparative study of feature extraction methods for the diagnosis of Alzheimer’s disease using the ADNI database. Neurocomputing, 75(1):64–71, January 2012.

[7] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, November 1999.

[8] Jennifer L. Whitwell et al. 3D maps from multiple MRI illustrate changing atrophy patterns as subjects progress from mild cognitive impairment to Alzheimer’s disease. Brain, 130(7):1777–1786, 2007.


Kernels design for Internet traffic classification

Emmanuel Herbert, Stephane Senecal
Orange Labs
38-40 rue du General Leclerc, 92130 Issy-les-Moulineaux
[email protected]

Stephane Canu
LITIS EA 4108, INSA Rouen
Avenue de l’Universite, 76800 Saint-Etienne-du-Rouvray, France
[email protected]

Abstract: In the telecommunications industry, Internet Service Providers (ISP) are willing to provide secure access to the Internet. Thus, developing solutions for detecting malicious activities conducted on the deployed networks is a task of primary importance. Implementing these monitoring features requires handling a large volume of traffic data. Machine learning based techniques are well suited to perform this task automatically. In this work, we take into account the non-numerical characteristics and the heterogeneousness of the data for this problem. We thus focus on kernel methods, and particularly on multiple kernel learning techniques. The main challenge is then to design ad hoc kernels adapted to perform traffic classification operations. Furthermore, such a multiple kernel method allows to achieve a comprehensive representation of the network data for easing malicious activities detection.

Keywords: Telecommunications, Internet, Machine Learning, Classification, Kernel Methods, Multiple Kernel Learning.

Important industrial challenges faced nowadays by telecommunications and especially Internet operators deal with ensuring security for their networks and their users. In this respect, being able to detect and to monitor malicious activities on the networks has become a crucial task for operational network deployment. Concerning the Internet, the Domain Name System (DNS) is the service mapping a Fully Qualified Domain Name (FQDN, e.g. “www.google.com”) to Internet Protocol (IP) addresses (e.g. “173.194.34.16”). The IP address is the machine-readable identity number for a computer connected to the Internet network, a host. A FQDN is the human-readable identity name for a host. DNS plays a central role in the Internet as network communications are performed using the FQDN. Therefore, considering the DNS traffic data is relevant for implementing malicious activities detection solutions on an Internet Service Provider’s (ISP) network.
The volume of data to process in practice is very large: the DNS traffic can reach up to 40,000 queries per second on an Internet resolving server. To monitor such a volume, ISPs need to deploy methods to automatically aggregate and classify the data. The goal we pursue is to highlight suspicious DNS traffic, i.e. the small amount of traffic which announces malicious activities, such as a botnet initiating a communication with its command and control server for instance. In this respect, we focus on FQDN queries on Internet servers, which are data of heterogeneous nature. Indeed, a FQDN query can be viewed as a string (i.e. “www.google.com”) indicating a location in a labeled tree data structure. The FQDN is also characterized by the lists of its related IP addresses and of the caching durations associated with the DNS mapping. We then seek temporal patterns in the sets of FQDN queries in order to detect abnormal traffic activities. These are the variables of interest for our classification problem. Current solutions to perform malicious activities detection tasks, but with slightly different variables, are based on decision tree classification methods [1].
Because the data are heterogeneous, as they are composed of strings, positions in a labeled tree, numerical values and lists of numerical values (from one to several dozens of items for each FQDN), we naturally consider kernel based methods for implementing the classification task for the DNS traffic. Kernel methods are a set of powerful regression and classification techniques which are useful for pattern analysis of non-numerical data, see [2] for instance. We are motivated by the richness of the data representation provided by kernel methods compared to the classical representation based on decision trees. Moreover, it is easier to obtain a visual representation of data with tools provided by kernel methods than with decision trees. For the data aggregation and classification problem under consideration here, an interesting approach is to use multiple kernel learning techniques [3]. The core idea is to deal with a kernel composed of a linear combination of several kernels, each of them being applied to a homogeneous part of the heterogeneous dataset. An important aspect of implementing this technique is to tune the coefficients of this linear combination. This aspect is not investigated further in this abstract; we focus instead on the design of the kernels for the combination.


One of the design challenges to tackle is to define a kernel enabling the representation of the FQDN proximities, i.e. the representation of the FQDN closeness on the labeled tree and with respect to their associated parameters. For instance, how close are “www.google.com” and “www.orange.com” as strings, as elements of a labeled tree of domains, from their query timestamps, and with respect to their related lists of IP addresses and associated caching durations? To ensure a relevant representation for the FQDN, we recall that the application of a kernel provides a distance metric in a different representation space, namely the feature space, thanks to the derived scalar product in this space. Proximity in the feature space should reflect the closeness of FQDN data in their initial representation space.
The approach we investigate for strings is based on the kernel design techniques for document classification [4]. For instance, for strings it is possible to consider all-spectrum kernels [2], which measure the similarity between two strings based on their common n-grams. Thanks to these techniques, it is possible to design the kernel for strings representing FQDN.
Then, for the location of a FQDN in the domain name tree, we can measure the length of the common suffix between two FQDN, considering each level domain as a character [2]. The related kernel is then based on the length of this common suffix. Instead of viewing the FQDN only as a string, i.e. a sequence of characters, it can also be handled as a sequence of level domains. It is thus possible to locate FQDN on the labeled tree.
Also, in order to design kernels for the lists of numerical values characterizing a FQDN, it is interesting to consider these lists as sets with appearance frequencies associated to each element, as documents in information retrieval are handled as sets of words (“bag-of-words”). We then take into account similarities between items of the lists. For instance, semantic relations between words are considered for document processing. The method used to grasp similarities between items of the lists is based on the distance induced by the application of the kernel to the list elements.
For the IP addresses, the network proximity is characterized by the membership to a subnetwork and by the localization of this subnetwork on the Internet. The subnetwork membership is provided by the first digits of the IP address in its binary representation. The kernel for the IP addresses is then based on the length of the common prefix in this binary representation. For this kernel design, the longer the common prefix is, the closer the considered IP addresses are. As for the bag-of-words representation technique for documents involving the term proximity matrix, we consider and use here the Gram matrix of all possible IP addresses.
For the caching durations, as they are usual numerical values, classical kernels like Gaussian or polynomial ones are relevant to represent the data.
For query timestamps, we design the kernel similarly as in [5]. For a given FQDN, a first aggregation step consists in splitting the observation period into timeslots; the number of queries is then counted in each timeslot. Finally, we compute the autocorrelation vector of this time series. This vector highlights periodical behaviors of queries. It is used to build the related kernel, as an element of the feature space. Moreover, we also consider query frequencies instead of their number for a given timeslot. It is thus possible to detect unusual amounts of queries for a specific FQDN.
We ran numerical experiments on a 10 min capture of raw traffic. The FQDN whitelist (resp. blacklist) consists of 500 (resp. 125) randomly chosen FQDN in the Alexa top 1000 (resp. in an internal blacklist) and of the queried FQDN from the traffic capture. As a first step, we tested kernel based unsupervised (kernel PCA [6]) and supervised (kernel FDA) methods on this dataset.
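For illustration, a minimal sketch of two of the base kernels discussed above (an n-gram spectrum-style kernel on FQDN strings and a common-prefix kernel on binary IP representations) and of their linear combination is given below. The kernel forms, the weights and the second IP address are our own illustrative choices, not the exact kernels used in the experiments.

    import numpy as np
    from collections import Counter

    def ngram_kernel(s, t, n=3):
        # spectrum-style string kernel: inner product of n-gram count vectors
        cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
        ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
        return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

    def ip_prefix_kernel(ip1, ip2):
        # length of the common prefix of the 32-bit binary representations
        b1 = "".join(f"{int(o):08b}" for o in ip1.split("."))
        b2 = "".join(f"{int(o):08b}" for o in ip2.split("."))
        k = 0
        while k < 32 and b1[k] == b2[k]:
            k += 1
        return k

    def combined_kernel(x1, x2, mu=(0.5, 0.5)):
        # linear combination of the single kernels, as in multiple kernel learning;
        # in practice each base kernel would first be normalised
        return mu[0] * ngram_kernel(x1[0], x2[0]) + mu[1] * ip_prefix_kernel(x1[1], x2[1])

    # (FQDN, IP) pairs; the second IP address is a documentation placeholder
    data = [("www.google.com", "173.194.34.16"),
            ("www.orange.com", "192.0.2.1")]
    K = np.array([[combined_kernel(a, b) for b in data] for a in data])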

The work presented in this extended abstract takes into account the non-numerical features and the heterogeneousness of the Internet DNS traffic data, in comparison to existing implementations of Internet network malicious activities detection [1]. The proposed approach is based on the multiple kernel learning method [3]. One of the challenges tackled by the investigated solution is to design efficiently the kernels for handling jointly strings, IP address lists and timestamp lists representing the FQDN queries. Once such kernels are built, it is possible to come up with a comprehensive representation of the Internet DNS traffic data. A next step of this work consists in fully exploiting the richness of this data representation by developing ad hoc and efficient classification and data aggregation methods dedicated to these kernels.

References

[1] L. Bilge, E. Kirda, C. Kruegel and M. Balduzzi. Exposure: Finding malicious domains using passive DNS analysis. Proc. of NDSS, 2011.

[2] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[3] A. Rakotomamonjy, F. Bach, S. Canu and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

[4] D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.

[5] Z. Li, B. Ding, J. Han, R. Kays and P. Nye. Mining periodic behaviors for moving objects. Proc. of ACM SIGKDD, 2010.

[6] B. Scholkopf, A. Smola and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.


Kernel Adaptive Filtering: Which Technique to Choose in Practice

Steven Van Vaerenbergh
Dept. of Communications Engineering
University of Cantabria
[email protected]

Ignacio Santamaría
Dept. of Communications Engineering
University of Cantabria
[email protected]

Abstract: The field of kernel adaptive filtering has produced a myriad of techniques throughout the past decade. While each algorithm provides some advantages over others in certain scenarios, it is often not clear which technique should be used on a practical problem in which specific restrictions are in place. We propose to assess the quality of the solution as a function of the required cost, which can be measured as computational complexity, required memory or convergence speed. The obtained figures of merit allow us to decide which algorithm to use in practice. We include results on several benchmark data sets.

Keywords: kernel adaptive filtering, online kernel methods, comparison, benchmarks

1 Machine learning meets adaptive filtering

Kernel adaptive filtering is the subfield of online kernel-based learning that deals with the problem of regression. Specifically, given an input-output stream of data pairs (xt, yt), the task consists in estimating the function ft(·) that relates input and output,

yt = ft(xt), (1)

and to update this estimate efficiently every time a new data pair is received. Notice that the unknown function ft(·) may be changing over time, as opposed to standard regression problems that assume a static underlying model f(·). This problem is encountered for instance in adaptive signal processing theory, which classically deals with linear techniques, i.e. assuming a solution of the form ft(xt) = wt⊤xt [1]. Kernel adaptive filtering applies machine learning techniques in order to obtain general nonlinear solutions to the online regression problem. Apart from the online scenario in which data arrive sequentially, these techniques are also often applied to static regression problems when the amount of data is too large to fit in memory.
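As a concrete illustration of such an online update, the following is a minimal kernel least-mean-squares-style filter in the spirit of [6], without any of the budget or sparsification mechanisms discussed below; the kernel, step size and toy data stream are illustrative assumptions.

    import numpy as np

    class SimpleKLMS:
        """Naive kernel LMS: f_t(x) = sum_i alpha_i k(c_i, x), growing dictionary."""
        def __init__(self, eta=0.5, sigma=1.0):
            self.eta, self.sigma = eta, sigma
            self.centers, self.alphas = [], []

        def _k(self, a, b):
            return np.exp(-np.sum((a - b) ** 2) / (2 * self.sigma ** 2))

        def predict(self, x):
            return sum(a * self._k(c, x) for c, a in zip(self.centers, self.alphas))

        def update(self, x, y):
            err = y - self.predict(x)           # prediction error on the new pair
            self.centers.append(np.asarray(x))  # every sample becomes a kernel centre
            self.alphas.append(self.eta * err)  # LMS-style coefficient update
            return err

    # toy nonlinear, slowly drifting stream
    rng = np.random.default_rng(0)
    filt = SimpleKLMS()
    for t in range(500):
        x = rng.uniform(-1, 1, size=2)
        y = np.sin(3 * x[0]) * x[1] + 0.01 * t * x[0]   # time-varying target
        filt.update(x, y)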

2 Existing methods

A comprehensive introduction to kernel adaptive filtering can be found in [2]. Though kernel adaptive filtering techniques draw upon several interesting properties of kernel methods, they also present some bottlenecks that are typically encountered in online kernel-based learning. Specifically, they suffer from growing complexity, since the number of kernels required to represent their solution can grow linearly or faster with the number of processed data. Different techniques have been proposed throughout the past decade to deal with this problem. A second bottleneck is parameter selection, which is generally still considered to be an open problem. Some optimization techniques have been proposed to this end, though most methods rely on standard cross-validation or even rules of thumb.

The majority of techniques can be categorized into either least-mean-squares techniques, which have linear complexity in terms of the stored data, or recursive least-squares based techniques, which have quadratic complexity. Two of the pioneering methods stemmed from research in Gaussian processes [3] and least-squares support vector machines [4]. A closer relationship to classical adaptive filtering was established on the one hand by the Naïve Online regularized Risk Minimization Algorithm (NORMA) [5], Kernel Least-Mean Squares [6], and KNLMS/KAP [7], and on the other hand by the Kernel Recursive Least Squares (KRLS) algorithm [8]. Up to this point, however, all algorithms except NORMA assumed a stationary model.

A second generation of techniques dealt more explicitly with non-stationarity in order to perform tracking. Several techniques were proposed, ranging from the more straightforward Sliding-Window KRLS (SW-KRLS) [9] to the more sophisticated Quantized KLMS (QKLMS) [10], Kernel Adaptive Projection Subgradient Method (KAPSM) [11] and KRLS Tracker (KRLS-T) [12].

3 Benchmark testing framework

The performance of kernel adaptive filtering algorithms can be measured according to several different criteria, such as regression error, convergence rate or complexity.


Fig. 1: Profile for memory usage (maximum bytes stored) versus prediction MSE (dB), on the switching nonlinear channel data set, for QKLMS, NORMA, KRLS-T and SW-KRLS. Each dot represents a single run of one of the algorithms, and a different parameter set was used for each dot.

In the literature it is common practice to compare different algorithms only according to one measure at a time. Nevertheless, an algorithm that scores well according to one criterion typically scores badly according to at least one other criterion. It is therefore often difficult for the practitioner to decide which method to use on a particular problem and data set.

We propose to represent the solution of an algorithm as a function of its cost. Depending on the application, cost measures of interest could be chosen out of computational complexity, required memory and speed of convergence. Two examples of trade-off figures are shown in Figs. 1 and 2. By fixing a performance goal, such as available memory or maximum allowable error, these figures allow us to determine which algorithm shows the most favorable remaining properties.
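The profiling idea can be sketched generically as follows: sweep a parameter grid, run the algorithm once per parameter set, and record one (cost, MSE) point per run. The profile helper and the windowed-mean stand-in below are our own illustrations, not the authors' profiler.

    import itertools
    import numpy as np

    def profile(run_algorithm, param_grid, data):
        """Sweep a parameter grid and collect one (memory, MSE) point per run,
        mirroring the dots in Fig. 1. run_algorithm is any callable returning
        a dict with 'mse' and 'bytes_stored' for a single run."""
        keys = sorted(param_grid)
        points = []
        for values in itertools.product(*(param_grid[k] for k in keys)):
            res = run_algorithm(data, **dict(zip(keys, values)))
            points.append((res["bytes_stored"], 10 * np.log10(res["mse"])))
        return np.array(points)

    # Stand-in "algorithm": a sliding-window mean predictor, used only to show
    # how memory (window length) trades off against prediction error.
    def windowed_mean(data, window):
        x, preds = list(data), []
        for t in range(1, len(x)):
            preds.append(np.mean(x[max(0, t - window):t]))
        mse = np.mean((np.array(preds) - np.array(x[1:])) ** 2)
        return {"mse": mse, "bytes_stored": 8 * window}

    stream = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.default_rng(0).normal(size=400)
    dots = profile(windowed_mean, {"window": [2, 5, 10, 20, 50]}, stream)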

In order to make our results reproducible and extendable towards new algorithms, we have developed an open-source toolbox (available at http://sourceforge.net/projects/kafbox/) that includes Matlab implementations of kernel adaptive filtering algorithms, and a profiler tool capable of generating trade-off figures.

References

[1] A. H. Sayed, Fundamentals of Adaptive Filtering. Wiley-IEEE Press, 2003.

[2] W. Liu, J. C. Príncipe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010.

[3] L. Csató and M. Opper, “Sparse representation for Gaussian process models,” in Advances in Neural Information Processing Systems 13, pp. 444–450, MIT Press, 2001.

Fig. 2: Profile for average number of floating point operations (FLOPS) versus prediction MSE (dB), on the Mackey-Glass 30 set, for QKLMS, NORMA, KRLS-T, ALD-KRLS and SW-KRLS.

[4] A. Kuh, “Adaptive kernel methods for CDMA systems,” in International Joint Conference on Neural Networks (IJCNN 2001), vol. 4, pp. 2404–2409, IEEE, 2001.

[5] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Transactions on Signal Processing, vol. 52, pp. 2165–2176, Aug. 2004.

[6] W. Liu, P. P. Pokharel, and J. C. Príncipe, “The kernel least-mean-square algorithm,” IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.

[7] C. Richard, J. C. M. Bermudez, and P. Honeine, “Online prediction of time series data with kernels,” IEEE Transactions on Signal Processing, vol. 57, pp. 1058–1067, Mar. 2009.

[8] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least squares algorithm,” IEEE Transactions on Signal Processing, vol. 52, pp. 2275–2285, Aug. 2004.

[9] S. Van Vaerenbergh, J. Vía, and I. Santamaría, “A sliding-window kernel RLS algorithm and its application to nonlinear channel identification,” in 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, (Toulouse, France), pp. 789–792, May 2006.

[10] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, “Quantized kernel least mean square algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 22–32, Jan. 2012.

[11] S. Theodoridis, K. Slavakis, and I. Yamada, “Adaptive learning in a world of projections,” IEEE Signal Processing Magazine, vol. 28, pp. 97–123, Jan. 2011.

[12] S. Van Vaerenbergh, M. Lázaro-Gredilla, and I. Santamaría, “Kernel recursive least-squares tracker for time-varying regression,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 1313–1326, Aug. 2012.


Structured Machine Learning for Mapping Natural Language to Spatial Ontologies

Parisa [email protected]

Marie-Francine [email protected]

Abstract: We propose a novel structured learning framework for mapping natural language to spatial ontologies. The applied spatial ontology contains spatial concepts, relations and their multiple semantic types based on qualitative spatial calculi models. To achieve a tractable structured learning model, we propose an efficient inference approach based on constraint optimization techniques. Particularly, we decompose the inference into subproblems, each of which is solved using LP-relaxation. This is done for both training-time and prediction-time inference. In this framework ontology components are learnt while taking into account the ontological constraints and linguistic dependencies among components. Particularly, we formulate complex relationships such as composed-of, is-a and mutual exclusivity during learning, while previous structured learning models in similar tasks do not go beyond hierarchical relationships. Our experimental results show that jointly learning the output components considering the above mentioned constraints and relationships improves the results compared to ignoring these. The application of the proposed learning model for mapping to ontologies is not limited to the extraction of spatial semantics; it could be used to populate any ontology. We argue therefore that this work is an important step towards automatically describing text with semantic labels that form a structured ontological representation of the content.

Keywords: Spatial information extraction, Natural language semantics, Structured learning, Decomposed inference

1 Problem

Motivation. Extraction of spatial information from natural language is a challenging problem in applications such as robotics, geographical information systems and human-machine interaction. Here, we formulate a new machine learning task for the extraction of spatial semantics, considering spatial concepts and relations in sentences. We represent the semantics of the spatial relations based on qualitative spatial calculi models. The target ontology is represented in Figure 1. To deal with the computational as well as semantic complexity of this task we assume two semantic layers in the proposed spatial ontology: a) the spatial role labeling (SpRL) layer, b) the spatial qualitative labeling (SpQL) layer.
SpRL. The SpRL layer considers cognitive-linguistic spatial roles including spatial indicators (ind), trajectors (tr) and landmarks (lm) and their relationships [1]. Spatial indicators indicate the existence of spatial information in a sentence. A trajector is an entity whose location is described and a landmark is a reference object for describing the location of a trajector. Figure 2 shows an example sentence, “There is a white large statue with spread arms on a hill.”; in this example, the preposition on is a spatial indicator, statue is a trajector and its landmark is hill. A spatial relation is a triplet of spatial roles. There is one spatial relation in the mentioned sentence: <on_ind, statue_tr, hill_lm>. In general, there can be a number of spatial relations in each sentence. The spatial ontology is represented as a number of single labels that refer to one independent concept and linked labels that show the connection between the concepts in the ontology. For example, the spatial relation is a linked label that shows a composed-of relationship with the composing single labels of spatial roles.
SpQL. In the SpQL layer, the goal is to map the spatial triplets to a formal semantic representation. Spatial language often conveys multiple meanings within one expression. Hence, our representation of the spatial semantics is based on multiple spatial calculi. Figure 1 shows the semantics that are considered in the SpQL layer. In fact, three general categories of regional (i.e. topological), directional and distal relationships are considered and we allow multiple semantics to be assigned to each spatial triplet. The fine-grained topological relationships are based on the RCC-8 model, including EC (externally connected), DC (disconnected), EQ (equal), PO (partially overlapping), PP (proper part). The directional relationships include the relative directional relations LEFT, RIGHT, FRONT, BEHIND, ABOVE, BELOW. For the distal information, a general class DISTANCE is used in the applied spatial ontology. In Figure 2, the relation <on, statue, hill> is labeled with regional and EC together with directional and ABOVE. For this sentence no distal information is annotated.


Fig. 1: The ontology populated by spatial roles and relations from the example sentence

2 Structured learning model

Fig. 2: Example of the CLEF-IAPR TC-12 benchmark

We formulate the aforementioned problem as a supervised structured learning problem and solve it in the framework of structured support vector machines and structured perceptrons. In the supervised setting we learn a mapping f : X → Y between the input space X and the discrete output space Y given a set of examples E = {(X(i), Y(i)) ∈ X × Y : i = 1 . . . N}. In structured learning, given the complex inputs and outputs, we learn a function F : X × Y → R over input-output pairs. The prediction then requires an inference task over F to find the best Y for a given input X. Therefore the general form of f is: f(X; W) = arg max_{Y ∈ Y} F(X, Y; W). F is assumed to be a linear function over a combination of input and output features Ψ(X, Y), i.e., F(X, Y; W) = <W, Ψ(X, Y)>. The X's in our model are the natural language sentences, which are composed of a number of single components such as words and composed components such as pairs and triplets of words. The Y's are the single and linked labels of the ontology populated with the relevant segments of the input sentence (see Figure 1). For a structured learning model, we need to design three main components [2]. The first component is the joint feature mapping Ψ(X, Y). The input features are defined based on the linguistically motivated features of the words, in addition to the relations between adjacent and long-distance words [1]. These features are joined with the spatial roles that each word plays in the output space and also with the semantic labels. The second component is the inference algorithm. Since the number of possible Y's for each input X is very large in structured output prediction, the most violated Y is considered in formulating the constraints for the structured learning optimization. However, an efficient inference algorithm is required for finding the most violated Y for a given X. To this aim we use the LP-relaxation technique, in which we formulate the structural characteristics of the Y space in the form of hard constraints. We formulate the is-a and composed-of relationships in the form of linear constraints. Moreover, mutual exclusivity should also hold among the RCC and among the directional relations. By formulating the constraints we take into account the structure of the ontology during training. We apply the same technique during prediction to find the best ontological assignments for the segments of the input. However, solving a global LP-relaxation problem considering all variables in both layers is not tractable with off-the-shelf solvers. Therefore we decompose the inference into subproblems based on the two semantic layers, each of which is solved using LP-relaxation. The subproblems should communicate with each other to achieve a global optimum. To apply this idea we use an approach similar to alternating optimization and we call it communicative inference. The third component is the loss function. We use a component-based Hamming loss as part of the objective function during training. Such a loss function is linear; it is decomposed in terms of the output labels, hence easy to integrate in our model; moreover it is compatible with the semantics of our problem.
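A minimal sketch of such a constrained inference step for a single candidate triplet is given below, using an LP relaxation over a reduced label set. The label scores are hypothetical, only a subset of the ontology is encoded, and the decomposed/communicative inference over the two layers is not reproduced.

    import numpy as np
    from scipy.optimize import linprog

    labels = ["REGIONAL", "EC", "DC", "DIRECTIONAL", "ABOVE", "BELOW"]
    idx = {lab: i for i, lab in enumerate(labels)}
    scores = np.array([1.2, 0.7, 0.4, 0.9, 0.8, -0.3])  # hypothetical scores for one triplet

    A_ub, b_ub = [], []

    def add_leq(coeffs, bound):
        # encode  sum_i coeffs[label] * x[label] <= bound  as one LP row
        row = np.zeros(len(labels))
        for lab, c in coeffs.items():
            row[idx[lab]] = c
        A_ub.append(row)
        b_ub.append(bound)

    # is-a: a fine-grained label implies its general category (x_child <= x_parent);
    # composed-of constraints (relation <= each of its spatial roles) take the same form
    for child, parent in [("EC", "REGIONAL"), ("DC", "REGIONAL"),
                          ("ABOVE", "DIRECTIONAL"), ("BELOW", "DIRECTIONAL")]:
        add_leq({child: 1.0, parent: -1.0}, 0.0)

    # mutual exclusivity within the RCC labels and within the directional labels
    add_leq({"EC": 1.0, "DC": 1.0}, 1.0)
    add_leq({"ABOVE": 1.0, "BELOW": 1.0}, 1.0)

    # maximise the total score of the selected labels over the relaxed cube [0, 1]^n
    res = linprog(-scores, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * len(labels), method="highs")
    assignment = {lab: int(round(x)) for lab, x in zip(labels, res.x)}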

3 Experimental results

In the experiments we use the texts of the CLEF IAPR TC-12 benchmark, containing 1213 sentences and 1706 annotated spatial relations. We implement a number of model variations using structured support vector machines and structured perceptrons. The experimental results show that considering ontological relationships and constraints during training and prediction sharply improves the results in each layer compared to training local classifiers, in terms of F1-measure using 10-fold cross validation. We apply a number of relevant decomposed inference approaches as well as our proposed communicative inference during training and prediction for connecting the two layers. In our best results, applying the proposed communicative inference during training in our unified model improved the results of the SpRL layer but not the SpQL layer, and applying it during prediction improved the results of both layers slightly compared to pipelining the two layers.

References

[1] P. Kordjamshidi, M. Van Otterlo, and M.F. Moens. Spatial role labeling: Towards extraction of spatial relations from natural language. ACM Trans. Speech Lang. Process., 8:1–36, December 2011.

[2] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453–1484, 2006.


Windowing strategies for on-line multiple kernel regression

Manuel Herrera
BATir - Universite libre de Bruxelles
[email protected]

Rajan Filomeno Coelho
BATir - Universite libre de Bruxelles
[email protected]

Abstract: This work proposes two on-line learning multiple kernel regression (MKr) versions to update the current model to a more accurate one, avoiding the computational effort associated with re-calculating the whole process each time new data are available. The first approach uses sliding windows, a strategy which maintains the size of the kernel matrix under study. The other one uses the so-called “worm” windows: it shrinks the kernel matrix as sliding windows do, but not at every arrival of new data, attempting to lose a minimum of information.

Keywords: Multiple kernel regression, on-line learning, windowing methods

1 Introduction

Most kernel-based algorithms cannot be used to operate on-line due to a number of difficulties, such as time and memory complexity (caused by the growing kernel matrix) and the need to avoid over-fitting. However, some results in this direction have been obtained during the last years [1, 6, 7].

A kernel-based recursive least-squares algorithm that implements a fixed size “sliding-window” technique [5] has been proposed by Van Vaerenbergh et al., 2006 [7]. We propose a similar methodology for resizing the kernel matrix to assist the on-line process of multiple kernel regression (MKr) for mixed variables. The MKr process is summarized in the following two equations, Eq. (1) and Eq. (2), where the kernel matrix used in the regression (Eq. 2) is calculated as a combination of kernels (Eq. 1):

K(x_i, x_j) = ∑_{s=1}^{M} µ_s K_s(x_i, x_j)   (1)

f(x) = b + ∑_{i=1}^{n} (α_i^+ − α_i^−) K(x_i, x)   (2)

The aim of the windowing strategies for on-line MKr (see Figure 1) is to improve the performance of the process without increasing the original algorithm’s computation time. The new windowing strategy of “worm” windows is proposed in this work. This method has two “expand-shrink” phases: firstly, it allows the kernel matrix to grow as long as its size remains adequate to work with and there is no over-fitting issue. Shrinking the kernel matrix back to the original size is proposed when its size is no longer computationally efficient.

Fig. 1: Online multiple kernel regression.

2 Windowing for on-line MKr

2.1 On-line MKr by sliding windows

The sliding window approach consists in only taking the last N pairs of the stream to perform the multiple kernel regression. When we obtain a new observed pair (x_{n+1}, y_{n+1}), we first down-size the kernel matrix K_j^{(n)} by extracting the contribution from x_{n−N} (see Eq. 3),

K_j^{(n)} = [ K_j^{(n)}(2,2) ··· K_j^{(n)}(2,N) ; … ; K_j^{(n)}(N,2) ··· K_j^{(n)}(N,N) ]   (3)

and then we augment the dimension of K_j^{(n)} again by importing the data input x_{n+1}, obtaining the kernel expressed in Equation 4.


K_j^{(n+1)} = [ K_j^{(n)}              K_j(X_n, x_{n+1}) ;
                K_j(x_{n+1}, X_n)      K_j(x_{n+1}, x_{n+1}) + λ ]   (4)

where X_n = (x_{n−N+1}, . . . , x_n)^T.

Next, the kernel matrices are summed again (see Figure 1) and their weights µ should be updated too. As this is a particular case of the calculation of the weights corresponding to the batch phase of the overall process, the proposal is to follow a Stochastic Gradient Descent (SGD) [2, 3] algorithm.
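The down-size/augment step of Eqs. (3)-(4) for one base kernel can be sketched as follows; the Gaussian kernel and the regularization value λ are illustrative choices.

    import numpy as np

    def gauss(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

    def slide_kernel(K, X_win, x_new, lam=1e-3, kernel=gauss):
        """One sliding-window step: drop the oldest point's row/column (Eq. 3),
        then append the new point's row/column with regularisation lam (Eq. 4)."""
        K = K[1:, 1:]                                   # down-size
        X_win = X_win[1:]
        k_new = np.array([kernel(x, x_new) for x in X_win])
        K = np.block([[K, k_new[:, None]],
                      [k_new[None, :], np.array([[kernel(x_new, x_new) + lam]])]])
        return K, np.vstack([X_win, x_new])

    # initialise a window of N points and update it with a new sample
    rng = np.random.default_rng(0)
    N = 5
    X_win = rng.normal(size=(N, 3))
    K = np.array([[gauss(a, b) for b in X_win] for a in X_win]) + 1e-3 * np.eye(N)
    K, X_win = slide_kernel(K, X_win, rng.normal(size=3))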

2.2 On-line MKr by “worm” windows

The so-called worm window approach consists in augmenting the kernel matrix size when new data become available. A shrink to the original size is proposed when its performance falls below a certain tolerance limit; only the last n data are then taken into account. The performance of the first, growing phase of the algorithm should be checked after the first iteration, simulating its computational efficiency with random data and establishing a maximum size. Besides this, over-fitting issues should be considered in order to shrink the kernel matrix.

The worm window alternative should offer greater stability in its predictions as a consequence of always considering a number of data equal to or greater than that of sliding windows. On the other hand, the sliding alternative requires less computational effort and will take in a larger proportion of new data. Thus, depending on the nature of the database, its variability, and the targets of the analysis, we can choose one of these two options for the on-line learning.

3 Numerical results

To validate the on-line MKr approaches introduced in this work, a series of analytical benchmarks have been used, along with a structural design test case.

3.1 Analytical test cases

The proposed on-line MKr methods are first validated on a set of three artificial mixed-variable benchmark functions of 5 continuous and 5 discrete variables adapted from [4]. In all cases, we test 20 updates of 5 elements each time. While the sliding window strategy multiplied its RMSE error by six along the learning process, the worm window error remained nearly constant.

3.2 Structural design instance

A structural design example based on a 3D rigid frame is also introduced to illustrate the performance of these on-line MKr methods. The quantity of interest is the total mass of the structure, which is characterized by ten design variables, 5 continuous and 5 discrete. Figure 2 shows a comparison between both windowing strategies introduced in this work through the boxplots of their RMSEs.

Fig. 2: RMSE errors of the windowing strategies. Structural design case study.

These results support the worm window strategy for on-line learning of MKr models as a more accurate methodology than sliding windows. In addition, the computational effort related to worm windows can be controlled by a previous tuning phase.

References

[1] S. C.H. Hoi, Rong Jin, Peilin Zhao, and Tianbao Yang. Online multiple kernel classification. Machine Learning, pages 1–27, 2012. In press.

[2] A. Karatzoglou. Kernel methods software, algorithms and applications. PhD thesis, 2006.

[3] J. Kivinen, A. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[4] T. Liao. Improved ant colony optimization algorithms for continuous and mixed discrete-continuous optimization problems. Technical report, CoDE-IRIDIA Dpt., Universite libre de Bruxelles, Belgium, 2011.

[5] H. Lin, D. Chiu, Y. Wu, and A. Chen. Mining frequent itemsets from data streams with a time-sensitive sliding window. In SIAM International Conference on Data Mining 2005, 2005.

[6] M. Martin. On-line support vector machine regression. In 13th European Conference on Machine Learning - ECML 2002, pages 282–294, 2002.

[7] S. Van Vaerenbergh, J. Vía, and I. Santamaría. A sliding-window kernel RLS algorithm and its application to nonlinear channel identification. In IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 2006, volume V, pages 789–792, 2006.


Non-parallel semi-supervised classification

Siamak Mehrkanoon
KU Leuven, ESAT-SCD, Kasteelpark Arenberg 10,
B-3001 Leuven (Heverlee)
[email protected]

Johan A. K. Suykens
KU Leuven, ESAT-SCD, Kasteelpark Arenberg 10,
B-3001 Leuven (Heverlee)
[email protected]

Abstract: A non-parallel semi-supervised algorithm based on kernel spectral clustering is formulated. The prior knowledge about the labels is incorporated into the kernel spectral clustering formulation via adding regularization terms. In contrast with existing multi-plane classifiers such as the Multisurface Proximal Support Vector Machine (GEPSVM), Twin Support Vector Machines (TWSVM) and its least squares version (LSTSVM), we will not use a kernel-generated surface. Instead we apply the kernel trick in the dual. Therefore, as opposed to conventional non-parallel classifiers, one does not need to formulate two different primal problems for the linear and nonlinear case separately. Experimental results demonstrate the efficiency of the proposed method over existing methods.

Keywords: kernel spectral clustering, semi-supervised learning, classification

1 Introduction

In the last few years there has been a growing interest in semi-supervised learning in the scientific community. Generally speaking, machine learning can be categorized into two main paradigms, i.e., supervised versus unsupervised learning. Spectral clustering methods, in a completely unsupervised fashion, make use of the eigenspectrum of the Laplacian matrix of the data to divide a dataset into natural groups such that points within the same group are similar and points in different groups are dissimilar to each other. However, it has been observed that classical spectral clustering methods suffer from the lack of an underlying model and therefore do not naturally possess an out-of-sample extension. Kernel spectral clustering (KSC), introduced in [1], aims at overcoming these drawbacks. The primal problem of kernel spectral clustering is formulated as a weighted kernel PCA. Recently the authors in [2] have extended kernel spectral clustering to semi-supervised learning by incorporating the information of labeled data points in the learning process. The concept of having two non-parallel hyperplanes for binary classification was first introduced in [3], where two non-parallel hyperplanes are determined via solving two generalized eigenvalue problems and the method is termed GEPSVM. In this case one obtains two non-parallel hyperplanes, where each one is as close as possible to the data points of one class and as far as possible from the data points of the other class. Some efforts have been made to improve the performance of GEPSVM by providing different formulations, such as in [4–6]. It is the purpose here to formulate a non-parallel semi-supervised algorithm based on kernel spectral clustering for which we can directly apply the kernel trick and thus its formulation enjoys the primal and dual properties as in a support vector machine classifier [7].

2 Unsupervised and semi-supervised KSC

2.1 Primal formulation of binary KSC

The method corresponds to a weighted kernel PCA formulation providing a natural extension to out-of-sample data. Given training data {x_i}_{i=1}^{M}, x_i ∈ R^d, and adopting a model of the following form:

e = w^T φ(x) + b,

the binary kernel spectral clustering in the primal is formulated as follows:

min_{w,b,e}  (1/2) w^T w − (γ/2) e^T V e
subject to   Φ w + b 1_M = e.

Here Φ = [φ(x_1), . . . , φ(x_M)]^T and a vector of all ones with size M is denoted by 1_M. φ(·) : R^d → R^h is the feature map and h is the dimension of the feature space. V = diag(v_1, ..., v_M) with v_i ∈ R^+ is a user-defined weighting matrix. It is shown that if V = D^{−1} = diag(1/d_1, ..., 1/d_M), where d_i = ∑_{j=1}^{M} K(x_i, x_j) is the degree of the i-th data point, the dual problem is related to the random walk algorithm for spectral clustering.

2.2 Primal formulation of semi-supervised KSC

KSC is an unsupervised algorithm by nature, but it has shown its ability to also deal with both labeled and unlabeled data at the same time by incorporating the information of the labeled data into the objective function. Consider training data points x_1, ..., x_N, x_{N+1}, .., x_M where {x_i}_{i=1}^{M} ∈ R^d. The first N data points do not have labels, whereas the last N_L = M − N points have been labeled with y_{N+1}, .., y_M in a binary fashion. The information of the labeled samples is incorporated into the binary kernel spectral clustering by means of a regularization term which aims at minimizing the squared distance between the projections of the labeled samples and their corresponding labels [1]:


min_{w,b,e}  (1/2) w^T w − (γ/2) e^T V e + (ρ/2) ∑_{m=N+1}^{M} (e_m − y_m)^2
subject to   Φ w + b 1_M = e,

where V = D^{−1} is defined as previously. Using the KKT optimality conditions one can show that the solution in the dual can be obtained by solving a linear system of equations [2]. The model selection is done by using an affine combination (where the weight coefficients are positive) of a Fisher criterion and the classification accuracy on the labeled data.
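Purely as an illustration of the resulting linear system, the sketch below eliminates e and solves the primal stationarity conditions directly for an explicit linear feature map φ(x) = x on synthetic data; the authors instead solve the corresponding dual system, and all parameter values here are arbitrary.

    import numpy as np

    def semi_ksc_linear(X, y_lab, lab_idx, gamma=0.1, rho=1.0):
        # stationarity conditions of the semi-supervised KSC primal for phi(x) = x,
        # after eliminating e = X w + b 1 (the dual solver of [2] is not reproduced)
        M, d = X.shape
        K = X @ X.T                                   # linear kernel matrix
        Dinv = np.diag(1.0 / K.sum(axis=1))           # V = D^{-1}
        XL, yL = X[lab_idx], y_lab
        one = np.ones((M, 1))
        sL = XL.sum(axis=0, keepdims=True)            # 1_L^T X_L
        A = np.block([
            [np.eye(d) - gamma * X.T @ Dinv @ X + rho * XL.T @ XL,
             -gamma * X.T @ Dinv @ one + rho * sL.T],
            [-gamma * one.T @ Dinv @ X + rho * sL,
             -gamma * one.T @ Dinv @ one + rho * len(lab_idx)]])
        rhs = np.concatenate([rho * XL.T @ yL, [rho * yL.sum()]])
        sol = np.linalg.solve(A, rhs)
        return sol[:d], sol[d]                        # w, b

    # synthetic positive data so the linear-kernel degrees d_i stay positive
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(1.0, 0.3, (50, 2)), rng.normal(4.0, 0.3, (50, 2))])
    lab_idx = np.array([0, 1, 2, 50, 51, 52])         # a few labelled points
    y_lab = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
    w, b = semi_ksc_linear(X, y_lab, lab_idx)
    labels = np.sign(X @ w + b)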

3 Non-parallel semi-KSC

Suppose the training data set X consists of M data points and is defined as follows:

X = { x_1, ..., x_N (unlabeled, X_U),  x_{N+1}, .., x_{N+ℓ1} (labeled with +1, X_{L1}),  x_{N+ℓ1+1}, .., x_{N+ℓ1+ℓ2} (labeled with −1, X_{L2}) },

where {x_i}_{i=1}^{M} ∈ R^d. The target values are denoted by the set Y, which consists of binary labels:

Y = { +1, . . . , +1 (y^1),  −1, . . . , −1 (y^2) }.

We seek two non-parallel hyperplanes:

f_1(x) = w_1^T φ(x) + b_1 = 0,   f_2(x) = w_2^T φ(x) + b_2 = 0,

where each one is as close as possible to the points of its own class and as far as possible from the data of the other class. We formulate a non-parallel semi-supervised KSC, in the primal, as the following two optimization problems [8]:

min_{w_1,b_1,e,η,ξ}  (1/2) w_1^T w_1 + (γ_1/2) η^T η + (γ_2/2) ξ^T ξ − (γ_3/2) e^T D^{−1} e
subject to  w_1^T φ(x_i) + b_1 = η_i,  ∀ x_i ∈ I,
            y^2_i [ w_1^T φ(x_i) + b_1 ] + ξ_i = 1,  ∀ x_i ∈ II,
            w_1^T φ(x_i) + b_1 = e_i,  ∀ x_i ∈ X,   (1)

where γ_1, γ_2 and γ_3 ∈ R^+, b_1 ∈ R, η ∈ R^{N_{L1}}, ξ ∈ R^{N_{L2}}, e ∈ R^M, w_1 ∈ R^h. φ(·) : R^d → R^h is the feature map and h is the dimension of the feature space.

min_{w_2,b_2,e,ρ,ν}  (1/2) w_2^T w_2 + (γ_4/2) ρ^T ρ + (γ_5/2) ν^T ν − (γ_6/2) e^T D^{−1} e
subject to  w_2^T φ(x_i) + b_2 = ρ_i,  ∀ x_i ∈ II,
            y^1_i [ w_2^T φ(x_i) + b_2 ] + ν_i = 1,  ∀ x_i ∈ I,
            w_2^T φ(x_i) + b_2 = e_i,  ∀ x_i ∈ X,   (2)

where γ_4, γ_5 and γ_6 ∈ R^+, b_2 ∈ R, ρ ∈ R^{N_{L2}}, ν ∈ R^{N_{L1}}, e ∈ R^M, w_2 ∈ R^h. φ(·) is defined as previously.

The performance of Semi-KSC [2] and of the method proposed in this paper when a linear kernel is used is shown in Fig. 1. The result of the proposed method (NP-Semi-KSC) is compared with that of Semi-KSC, Laplacian SVM (LapSVM) [9] and its recent version LapSVMp [10], recorded in [2], over some benchmark datasets for semi-supervised learning. When few labeled data points are available the proposed method shows a comparable result with respect to the other methods. But as the number of labeled data points increases, NP-Semi-KSC outperforms the other methods in most cases.


Fig. 1: Toy problem - four Gaussians with some overlap. The training and validation parts consist of N_tr = 100 and N_val = 100 unlabeled data points respectively. The labeled data points of the two classes are depicted by the blue squares and green circles. (a): Result of semi-supervised kernel spectral clustering when a linear kernel is used. The separating hyperplane is shown by a blue dashed line. (b): Result of the proposed non-parallel semi-supervised KSC when a linear kernel is used. Two non-parallel hyperplanes are depicted by blue and green dashed lines.

Tab. 1: Average misclassification test error ×100%. The test error is calculated by evaluating the methods on the full data sets. Two cases for the labeled data (LD) size are considered (i.e. # labeled data points = 10 and 100).

# of LD   Method        g241c         g241d         BCI
10        LapSVM        0.48 ± 0.02   0.42 ± 0.03   0.48 ± 0.03
          LapSVMp       0.49 ± 0.01   0.43 ± 0.03   0.48 ± 0.02
          Semi-KSC      0.42 ± 0.03   0.43 ± 0.04   0.46 ± 0.03
          NP-Semi-KSC   0.44 ± 0.03   0.41 ± 0.02   0.47 ± 0.03
100       LapSVM        0.40 ± 0.06   0.31 ± 0.03   0.37 ± 0.04
          LapSVMp       0.36 ± 0.07   0.31 ± 0.02   0.32 ± 0.02
          Semi-KSC      0.29 ± 0.05   0.28 ± 0.05   0.28 ± 0.02
          NP-Semi-KSC   0.23 ± 0.01   0.26 ± 0.02   0.26 ± 0.01

Acknowledgments
This work was supported by: • Research Council KUL: GOA/10/09 MaNet, PFV/10/002 (OPTEC), several PhD/postdoc & fellow grants • Flemish Government: IOF: IOF/KP/SCORES4CHEM; FWO: PhD/postdoc grants, projects: G.0320.08 (convex MPC), G.0558.08 (Robust MHE), G.0557.08 (Glycemia2), G.0588.09 (Brain-machine), G.0377.09 (Mechatronics MPC), G.0377.12 (Structured systems), research community (WOG: MLDM); IWT: PhD Grants, projects: Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare • Belgian Federal Science Policy Office: IUAP P7/ (DYSCO, Dynamical systems, control and optimization, 2012-2017) • IBBT • EU: ERNSI, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735), ERC ST HIGHWIND (259 166), ERC AdG A-DATADRIVE-B • COST: Action ICO806: IntelliCIS • Contract Research: AMINAL • Other: ACCM. Johan Suykens is a professor at the KU Leuven, Belgium.

References

[1] C. Alzate and J. A. K. Suykens, “Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 335-347, 2010.

[2] C. Alzate and J. A. K. Suykens, “A Semi-Supervised Formulation to Binary Kernel Spectral Clustering,” in Proc. of the 2012 IEEE World Congress on Computational Intelligence, Brisbane, 1992-1999, 2012.

[3] O.L. Mangasarian and E.W. Wild, “Multisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 69-74, 2006.

[4] Jayadeva, R. Khemchandani and S. Chandra, “Twin Support Vector Machines for Pattern Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 905-910, 2007.

[5] Y.H. Shao, C.H. Zhang, X.B. Wang, and N.Y. Deng, “Improvements on Twin Support Vector Machines,” IEEE Transactions on Neural Networks, vol. 22, no. 6, pp. 962-968, 2011.

[6] M.A. Kumar, M. Gopal, “Least squares twin support vector machines for pattern classification,” Expert Systems with Applications, 36, 7535-7543, 2009.

[7] J.A.K. Suykens, C. Alzate, K. Pelckmans, “Primal and dual model representations in kernel-based learning,” Statistics Surveys, vol. 4, pp. 148-183, 2010.

[8] S. Mehrkanoon, J.A.K. Suykens, “Non-parallel semi-supervised classification based on kernel spectral clustering,” accepted for publication in IJCNN 2013.

[9] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399-2434, 2006.

[10] S. Melacci and M. Belkin, “Laplacian support vector machines trained in the primal,” Journal of Machine Learning Research, 12, 1149-1184, 2011.


Visualisation of neural networks for model reduction

Tamas Kenesei
University of Pannonia, Department of Process Engineering,
P.O. Box 158, H-8201
[email protected]

Janos Abonyi
University of Pannonia, Department of Process Engineering,
P.O. Box 158
[email protected]

Abstract: Neural networks are difficult to interpret. Extraction and visualisation of fuzzy if-then rules from neural networks can support the analysis of these black-box models. We overview existing rule-extraction and model reduction methods and propose a technique based on the structural equality between sigmoid transfer function based neural networks and fuzzy systems. Visualisation is based on the similarity of membership functions represented by sigmoid transfer functions. Orthogonal least squares (OLS) and merging of fuzzy sets are used for model pruning. An illustrative example is shown to demonstrate the effectiveness of the method. This simple example shows that visualisation and similarity based ranking of neurons help the user to find redundant parts of the model, while OLS based ranking is useful to decide which elements can be removed without significant performance loss.

Keywords: model reduction, model visualisation, regularisation

1 Introduction

Neural networks (NNs) are efficient in nonlinear regression and classification. A main disadvantage of NNs is that they are not interpretable. A further problem is how a priori knowledge can be utilised and integrated into this black-box modelling approach, and how a human expert can validate the identified NNs. To overcome these problems, the following strategies can be used:

1. Transformation of the NN. Convert the NN into a more interpretable form. A good approach is to extract fuzzy if-then rules from the NN [1, 2].

2. Model reduction. Overcome complexity problems by determining the 'importance' of hidden neurons and weights, removing the insignificant ones, and/or merging the similar ones. Regularised training can also be considered part of this approach.

3. Visualisation of the neural network. Generate a two-dimensional map of the neurons and rely on a human expert to evaluate the structure.

We combine these methods into an effective tool for iterative and interactive modelling. The neural network is transformed into a fuzzy rule base. The fuzzy sets in the antecedent part of the rules are merged based on a similarity measure. Similarities of the rules are also calculated. These values are used as a distance measure for the visualisation of the rule base. This technique can be used as an unsupervised approach to model reduction. For performance-based model reduction, the orthogonal least squares technique is applied.

2 Model Transformation

Logistic activation function based NNs are structurally identical to a special type of fuzzy rule based model, called a fuzzy additive system (FAS) [2]. In order to keep the equality relationship between the NN and a corresponding fuzzy rule-based system, a new logical operator has been presented. Using this ∗ operator, fuzzy rules extracted from the trained NN can be represented as follows:

R_j: If x_1 is A_j^1 ∗ . . . ∗ x_n is A_j^n then y = b_j    (1)

where ∗ represents the interactive-or (i-or) operator, a ∗ b = (ab) / ((1 − a)(1 − b) + ab), which is used to decompose the multivariate logistic function into univariate membership functions of the rule antecedents.
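As a quick illustration of why this decomposition is exact (a sketch of ours, not code from the authors), the snippet below checks numerically that the logistic function σ satisfies σ(z1 + z2) = σ(z1) ∗ σ(z2) under the i-or operator, which is what allows a multivariate logistic neuron to be split into univariate sigmoid membership functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def i_or(a, b):
    # interactive-or: a * b = ab / ((1 - a)(1 - b) + ab)
    return (a * b) / ((1.0 - a) * (1.0 - b) + a * b)

# sigmoid(z1 + z2) == i_or(sigmoid(z1), sigmoid(z2)) for all z1, z2, so a
# neuron sigmoid(w1*x1 + ... + wn*xn + b) decomposes into univariate
# sigmoids combined with the i-or operator.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=1000), rng.normal(size=1000)
assert np.allclose(sigmoid(z1 + z2), i_or(sigmoid(z1), sigmoid(z2)))
```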

3 Visualisation and model reduction

The antecedent part of the rules can be analysed based on the similarity of the membership functions, S_{j,l}^i:

S_{j,l}^i = (A_j^i ∩ A_l^i) / (A_j^i ∪ A_l^i)    (2)

With this measure, pairwise similarities of hidden neurons in the range [0, 1] can be obtained:

S_{j,l} = ∏_i S_{j,l}^i,   i = 1, . . . , n.    (3)

We apply multidimensional scaling (MDS) to generate a map of neurons that preserves the distances d_{j,l} = 1 − S_{j,l} among the neurons. These maps can be effectively used to extract the hidden structure of the network and to control the merging of similar neurons (membership functions). Among the most similar pairs, unnecessary rules (and neurons) with a low error reduction ratio should be removed. Since the rule consequents are linear in the parameters, orthogonal least squares techniques can evaluate the individual contribution of the rules [3].
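As an illustration of this step (our sketch, not the authors' code), the following assumes the membership functions have been sampled on a grid into a hypothetical array `mu` of shape (n_rules, n_inputs, n_grid), interprets fuzzy intersection and union as pointwise min and max, and embeds the neurons in two dimensions with metric MDS:

```python
import numpy as np
from sklearn.manifold import MDS

def neuron_similarity(mu):
    """mu[j, i, :] samples membership function A_j^i on a grid.
    Returns S with S[j, l] = prod_i |A_j^i ∩ A_l^i| / |A_j^i ∪ A_l^i|."""
    n_rules = mu.shape[0]
    S = np.ones((n_rules, n_rules))
    for j in range(n_rules):
        for l in range(n_rules):
            inter = np.minimum(mu[j], mu[l]).sum(axis=-1)  # per-input |A ∩ B|
            union = np.maximum(mu[j], mu[l]).sum(axis=-1)  # per-input |A ∪ B|
            S[j, l] = np.prod(inter / union)
    return S

def neuron_map(mu, seed=0):
    """2-D MDS embedding preserving the distances d_{j,l} = 1 - S_{j,l}."""
    D = 1.0 - neuron_similarity(mu)
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)
```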

4 Application example

Neural networks are widely used in the modelling of dynamical systems, mostly in the NARX model form, y(k + 1) = f(y(k), . . . , y(k − n_y), u(k), . . . , u(k − n_u)), where y represents the output, u the input of the system, k stands for the discrete time instant, and n_y and n_u are the output and input orders of the dynamical system. In these problems the identification of the proper model structure f(·) is extremely difficult, since the performance of the model is sensitive to overparametrisation. The proposed visualisation and reduction techniques are applied to model a continuous stirred tank reactor [4], where the actual output (the pH) y(k + 1) depends on the state of the reactor (the previous pH value y(k)) and the NaOH feed u(k) at the k-th sample time:

y(k + 1) = f(y(k), u(k))    (4)

Parameters of the neural network were identified by the back-propagation algorithm on uniformly distributed training data with F_NaOH in the range 515-525 l/min. Our experience shows that 7 neurons are sufficient in the hidden layer of the NN. Figure 1 shows the membership functions extracted from this neural network. Analysis of the mapping of the neurons (see Fig. 2) shows that it is possible to remove one neuron, since the 2nd and the 7th neurons are close to each other. OLS based ranking indicates that the 2nd rule is the more important of the two; therefore the 7th neuron can be removed from the model without a significant decrease in modelling performance. The results indicate outstanding performance of the reduced model even in free-run simulation (the mean square error is 3.5 · 10^-3 for 7 neurons and 3.824 · 10^-3 for 6 neurons). This simple example shows that visualisation and similarity based ranking of neurons help the user to find redundant parts of the model, while OLS based ranking is useful to decide which elements of these pairs can be removed without significant loss of performance.
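For concreteness, a minimal sketch of this identification step (ours, not the authors'): build the NARX regression pairs (y(k), u(k)) → y(k+1) and fit a small logistic-activation network; `y_series` and `u_series` stand in for the recorded pH and NaOH-feed signals and are replaced here by placeholder data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def narx_dataset(y_series, u_series):
    """Regression pairs for y(k+1) = f(y(k), u(k))."""
    X = np.column_stack([y_series[:-1], u_series[:-1]])
    t = y_series[1:]
    return X, t

# placeholder signals standing in for the CSTR measurements
rng = np.random.default_rng(0)
u_series = 515.0 + 10.0 * rng.random(500)                # NaOH feed, 515-525 l/min
y_series = 7.0 + 0.01 * np.cumsum(rng.normal(size=500))  # stand-in pH trajectory

X, t = narx_dataset(y_series, u_series)
nn = MLPRegressor(hidden_layer_sizes=(7,), activation="logistic",
                  solver="lbfgs", max_iter=5000, random_state=0).fit(X, t)
print("one-step-ahead MSE:", np.mean((nn.predict(X) - t) ** 2))
```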

5 Conclusion

Visualisation of the hidden structure of neural networks can support iterative and interactive model reduction. The similarity of neurons is calculated based on the similarity of the antecedent fuzzy sets extracted from the neural network model.

Fig. 1: Decomposed univariate membership functions

Fig. 2: Distances between neurons (1-7) mapped into two dimensions with MDS. Note that the axes have no physical meaning; the plot represents the distances d_{j,l} = 1 − S_{j,l} between the neurons.

The orthogonal least squares technique is used to evaluate the contribution of the individual rules, supporting the user in deciding which neurons can be removed from the model without a significant decrease in prediction performance.
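A sketch of the OLS-based ranking referred to here, under our own assumptions rather than the authors' exact implementation: the columns of a hypothetical regressor matrix P hold the (normalised) firing strengths of the rules on the training data, y is the target, and the classical Gram-Schmidt error reduction ratio (ERR) scores each rule's contribution.

```python
import numpy as np

def error_reduction_ratios(P, y):
    """ERR of each regressor (column of P), in the given column order,
    via classical Gram-Schmidt orthogonalisation."""
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.zeros_like(P)
    err = np.zeros(P.shape[1])
    yty = y @ y
    for k in range(P.shape[1]):
        w = P[:, k].copy()
        for i in range(k):  # remove components along previously orthogonalised columns
            w -= (W[:, i] @ P[:, k]) / (W[:, i] @ W[:, i]) * W[:, i]
        W[:, k] = w
        g = (w @ y) / (w @ w)
        err[k] = g ** 2 * (w @ w) / yty  # share of output energy explained by rule k
    return err
```

Rules (neurons) with the smallest ERR within a similar pair are then the natural candidates for removal.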

Acknowledgments
This publication/research has been supported by the projects TAMOP-4.2.2.A-11/1/KONV-2012-0071 and TAMOP-4.2.2/B-10/1-2010-0025.

References

[1] T. Q. Huynh and J. Reggia, "Guiding hidden layer representations for improved rule extraction from neural networks," IEEE Transactions on Neural Networks, 22(2), 264-275, 2011.

[2] J. L. Castro, C. J. Mantas, and J. M. Benitez, "Interpretation of artificial neural networks by means of fuzzy rules," IEEE Transactions on Neural Networks, 13(1), 101-116, 2002.

[3] O. Nelles, Nonlinear System Identification, Springer-Verlag, 2001.

[4] J. Abonyi, Fuzzy Model Identification for Control, Birkhauser, 2002.


Convergence analysis of stochastic gradient descent on strongly convex objective functions

Cheng Tang
Department of Computer Science, The George Washington University
[email protected]

Claire Monteleoni
Department of Computer Science, The George Washington University
[email protected]

Abstract: Recently, [3] posed an open problem on whether the Stochastic Gradient Descent (SGD) algorithm, without averaging, can achieve the optimal O(1/t) error convergence rate on strongly convex (possibly non-smooth) functions. Our work gives an affirmative answer to this question for a subclass of functions. In high dimension, challenges arise in uniting the analyses of different optimality conditions for general strongly convex functions. We provide a subclassification scheme for these functions, using a refined definition of strong convexity and a relaxed definition of strong smoothness. We show how each definition captures the class of strongly convex functions and how we can describe the lower and upper bounds on the norm of the subgradients of the function, which characterize the difficulty of optimizing it. Our approach provides a roadmap for future work.

Keywords: stochastic optimization, gradient descent, strong convexity

Introduction

Stochastic Gradient Descent (SGD) is a simple optimization method for solving a convex program. It is the prototypical algorithm to solve online and large scale batch machine learning problems due to its simplicity and efficiency. Strongly convex functions are often adopted in learning problems to formalize a regularized objective, such as in the SVM [4]. While most of the earlier work assumes twice differentiability of the function F, our setting follows a more general framework studied in [1, 2]: we consider the convergence rate of the last iterate returned by SGD on F.

Our assumptions are: 1. F is λ-strongly convex and has a bounded convex domain W. 2. We have oracle access to an unbiased estimator of a subgradient of F at any point w, denoted g, i.e., E[g] ∈ ∂F(w), with ‖g‖_2 ≤ G. That is, the norms of the subgradients are all bounded by some number G.

We apply SGD with the update rule w_{t+1} = Π_W(w_t − η_t g_t), where Π_W is the projection operator onto W. We set the learning rate to η_t = c/(λt), with c ≥ 2.
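A generic sketch of this update (not the authors' code); `subgradient_oracle` is a hypothetical noisy oracle and W is taken, purely for illustration, to be a Euclidean ball:

```python
import numpy as np

def projected_sgd(subgradient_oracle, w0, lam, c=2.0, T=10000, radius=1.0):
    """SGD with step size eta_t = c / (lam * t), projected onto the ball of
    the given radius (one simple choice of bounded convex domain W)."""
    w = np.array(w0, dtype=float)
    for t in range(1, T + 1):
        g = subgradient_oracle(w)        # noisy subgradient, E[g] in dF(w)
        w -= (c / (lam * t)) * g         # descent step
        norm = np.linalg.norm(w)
        if norm > radius:                # projection Pi_W
            w *= radius / norm
    return w
```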

Our goal is to determine whether SGD without averaging has an optimal convergence rate on strongly convex functions in terms of the expected error, E[F(w_t) − F(w_opt)]. Since SGD is often used to solve problems with large datasets, this memory-free version of SGD has an obvious advantage and should be adopted if it can be shown to simultaneously match the optimal performance of its computationally more demanding counterparts.

Results

We introduce a scheme to subclassify the strongly convex functions according to their regularity conditions at the optimum, w_opt. Previously, [1] pointed out that if F is strongly smooth, or very "flat", at w_opt, SGD has an optimal convergence rate. On the other hand, [3] observes that if F is very "pointy" at w_opt, SGD should have a fast descent, and conjectures that in this case SGD should also achieve the optimal rate. Our approach generalizes these observations: the regularity condition at w_opt determines the "shape" of F. Strong convexity at w_opt implies that F is more "pointy" around the optimum, while strong smoothness implies that F is "flatter". Thus, convexity and smoothness characterize two opposite properties of F. Due to our restriction that F is strongly convex, it cannot become any "flatter" than a quadratic function at w_opt, while at the extreme end of strong convexity, F can be linear at w_opt in each direction. Based on this insight, we first relax the notion of local strong smoothness to local weak smoothness.

Definition 1. Let ν ∈ [0, 1]. F is (ν, L_ν)-local weakly smooth if ∃ L_ν > 0 such that ∀ w ∈ W, ∀ g(w) ∈ ∂F(w) and ∀ g(w_opt) ∈ ∂F(w_opt), we have ‖g(w) − g(w_opt)‖ ≤ L_ν ‖w − w_opt‖^ν.

Definition 2. F ∈ H^ν if F is (ν, L_ν)-local weakly smooth.

The nested family of local weakly smooth functions is shown in Fig. 1. Since all strongly convex functions with finite subgradients form a subset of H^0, each F in our framework is contained in multiple H^ν. But for each F, ∃ ν_max > 0 such that H^{ν_max} is the largest H^ν containing F, which characterizes the local smoothness, or "flatness", of F. ν_max ∈ [0, 1] provides an upper bound on ‖g(w)‖ as a polynomial function of ‖w − w_opt‖. Similarly, we refine the notion of strong convexity to subclassify local strongly convex functions; this refinement was first defined by [5] in work with a different goal.

Fig. 1: Subclassification by local weak smoothness.

Fig. 2: Subclassification by refined local strong convexity.

Definition 3. Let µ ∈ [0, 1]. F is (µ, λ_µ)-local strongly convex if ∃ λ_µ > 0 such that ∀ w ∈ W, ∀ g(w) ∈ ∂F(w) and ∀ g(w_opt) ∈ ∂F(w_opt), we have ‖g(w) − g(w_opt)‖ ≥ λ_µ ‖w − w_opt‖^µ.

Definition 4. F ∈ S^µ if F is (µ, λ_µ)-local strongly convex.

The strongly convex functions form another nested family, as shown in Fig. 2. µ characterizes the degree of local strong convexity, or "pointiness", of the function at w_opt in a similar way. We emphasize that even though we use the parameters ν and µ in our analysis, we do not assume the SGD algorithm knows these parameters. Thus, our analysis does not impose additional assumptions on the strongly convex functions being studied.

Our next result confirms the intuition that if F is non-differentiable at w_opt from every direction, i.e., F ∈ S^0, then SGD attains the optimal convergence rate due to fast descent.

Theorem 1. Suppose ∃ L_0 > 0 such that ∀ w_t ≠ w_opt, F(w_t) − F(w_opt) ≥ L_0 ‖w_t − w_opt‖. Consider SGD with step sizes η_t = c/(λt), with c ≥ 2. For any t > 1, it holds that

E[F(w_t) − F(w_opt)] ≤ G √M · c/(λt),

where M = max{ G^4/(4 L_0^2), 16 (G + L_0)^2 }.
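As a purely illustrative check of the regime covered by Theorem 1 (our example, not an experiment from the abstract), F(w) = (λ/2)‖w‖² + ‖w‖₁ is λ-strongly convex, has w_opt = 0, and satisfies F(w) − F(w_opt) ≥ ‖w − w_opt‖, so it falls in S^0; running SGD with η_t = c/(λt) on noisy subgradients then drives the suboptimality down quickly:

```python
import numpy as np

lam, c, d, T = 1.0, 2.0, 5, 10000
rng = np.random.default_rng(0)

def F(w):
    return 0.5 * lam * w @ w + np.abs(w).sum()

w = rng.uniform(-0.5, 0.5, size=d)                  # start inside the unit ball W
for t in range(1, T + 1):
    noise = rng.uniform(-0.1, 0.1, size=d)          # bounded zero-mean noise
    g = lam * w + np.sign(w) + noise                # unbiased noisy subgradient of F
    w -= (c / (lam * t)) * g
    w /= max(1.0, np.linalg.norm(w))                # project onto the unit ball W

print("suboptimality F(w_T) - F(w_opt):", F(w))     # F(w_opt) = F(0) = 0
```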

Future work

Our subclassification scheme generalizes the idea that smoothness at w_opt provides a bound on the variance of the error rate. Previous results require local strong smoothness, which corresponds to F ∈ S^1 ∩ H^1. For future work, we propose assuming only local weak smoothness, i.e., F ∈ S^1 ∩ ∪_{ν>0} H^ν, which subsumes local strong smoothness and varies continuously according to the parameter ν. Based on this scheme, future work on proving optimal convergence on the subclass of local weakly smooth functions might be developed.

Acknowledgements: We thank the anonymous reviewers of earlier versions for their comments, which were used in revision.

Fig. 3: A visualization of the superposition of the two subclassification schemes.

In Fig. 3, the green ellipse represents the class of all strongly convex functions. We proved the optimal convergence of SGD on the subset labeled S^0. Previous analysis showed optimal convergence on the intersection of the green ellipse and the red circle. For future work, our proposed subclassification scheme can be used towards extending the result to all of the light green region.

References

[1] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. CoRR, abs/1109.5647, 2011.

[2] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19, 2011.

[3] Ohad Shamir. Is averaging needed for Strongly Convex Stochastic Gradient Descent? COLT open problem, 2012.

[4] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML, 2007.

[5] Aaditya Ramdas and Aarti Singh. Optimal Stochastic Convex Optimization Through The Lens Of Active Learning. CoRR, abs/1207.3012, 2012.


List of participants


Prof. Janos Abonyi University of Pannonia POB. 158 8200 Veszprem - Hungary [email protected]
Masoud Ahookhosh University of Vienna Nordbergstrasse 15 1090 Vienna - Austria [email protected]
Reema Al-Aifari Vrije Universiteit Brussel Pleinlaan 2 Vakgroep Wiskunde 1050 Brussels - Belgium [email protected]
Dr. Carlos Alzate IBM Research Smarter Cities Technology Centre Damastown Industrial Estate Dublin 15 Mulhuddart - Ireland [email protected]
Prof. Andreas Argyriou Ecole Centrale Paris 2, Avenue Sully Prudhomme 92290 Chatenay-Malabry - France [email protected]
Prof. Francis Bach Ecole Normale Superieure Paris 23, Avenue d’Italie CS 81321 75214 Paris Cedex 13 - France [email protected]
Dr. Alper Bilge Anadolu University 2 Eylul Kampusu M.M.F. Bilgisayar 26470 Eskisehir - Turkey [email protected]
Prof. Stephen Boyd Stanford University Packard 264 CA 94305 Stanford - United States [email protected]
Dr. Kris De Brabanter KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Christine De Mol Universite libre de Bruxelles Department of Mathematics Campus Plaine CPI 217 Boulevard du Triomphe 1050 Brussels - Belgium [email protected]
Dr. Francesco Dinuzzo MPI for Intelligent Systems Spemannstrasse 38 72076 Tuebingen - Germany [email protected]
Prof. Jose Dorronsoro ADIC EPS-UAM C Tomas y Valiente 11 28049 Madrid - Spain [email protected]
Dr. Orla Doyle King’s College London Centre for Neuroimaging Sciences Box 089, Institute of Psychiatry De Crespigny Park London SE58AF - United Kingdom [email protected]
Prof. Pierre Dupont Catholic University of Louvain - UCL Machine Learning Group - ICTEAM Place Sainte-Barbe 2 1348 Louvain-la-Neuve - Belgium [email protected]


Dr. Tillmann Falck KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Basura Fernando KU Leuven, ESAT-PSI/VISICS Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Alessandro Ghio University of Genoa Via Opera Pia 11A 16145 Genoa - Italy [email protected]
Amir Ghodrati KU Leuven, ESAT-PSI Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Nicolas Gillis Universite catholique de Louvain - UCL Avenue G. Lemaitre, 4 - bte L4.05.01 1348 Louvain-la-Neuve - Belgium [email protected]
Prof. Francois Glineur Universite catholique de Louvain - UCL CORE Voie du Roman Pays 34 bte L1.03.01 1348 Louvain-la-Neuve - Belgium [email protected]
Paul Grigas Massachusetts Institute of Technology 77 Massachusetts Ave. E40-149 02139 Cambridge - United States [email protected]
Emmanuel Herbert Orange Labs 173, rue lafayette 75010 Paris - France [email protected]
Dr. Manuel Herrera Universite libre de Bruxelles Avenue F. Roosevelt, 50 (CP 194/2) 1050 Bruxelles - Belgium [email protected]
Pablo Hess TRYCON GCM AG Rheinstrasse 13 60325 Frankfurt - Germany [email protected]
Dr. Xiaolin Huang KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Mariya Ishteva Vrije Universiteit Brussel Pleinlaan 2, Building K 1050 Brussels - Belgium [email protected]
Prof. Martin Jaggi Ecole Polytechnique, Paris 37, Boulevard Jourdan 75014 Paris - France [email protected]
Prof. Kurt Jetter Universitaet Hohenheim Obere Schneeburgstr. 38 79111 Freiburg - Germany [email protected]
Vilen Jumutc KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]


Dr. Cihan Kaleli Anadolu University 2 Eylul Kampusu M.M.F. Bilgisayar 26470 Eskisehir - Turkey [email protected]
Dr. Atsushi Kawamoto Toyota Central R&D Labs., Inc. Nagakute 480-1192 Aichi - Japan [email protected]
Dr. Hyon-Jung Kim Aalto University Otakaari 5 A 02150 Espoo - Finland [email protected]
Parisa Kordjamshidi KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Anna Krause Gottfried Wilhelm Leibniz Universitaet Hannover Institut fuer Mikroelektronische Systeme Appelstrasse 4 30167 Hannover - Germany [email protected]
Prof. James Kwok Hong Kong University of Science and Technology Clear Water Bay Hong Kong - China [email protected]
Prof. Gert Lanckriet University of California, San Diego 9500 Gilman Drive 92093-0407 San Diego - United States [email protected]
Rocco Langone KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Augustin Lefevre Universite catholique de Louvain - UCL Avenue Georges Lemaitre 4 1348 Louvain-la-Neuve - Belgium [email protected]
Prof. Ignace Loris Universite libre de Bruxelles Departement de Mathematique Boulevard du Triomphe 1050 Bruxelles - Belgium [email protected]
Raghvendra Mall KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Ivan Markovsky Vrije Universiteit Brussel Pleinlaan 2, Building K 1050 Brussels - Belgium [email protected]
Dr. Andre Marquand King’s College London Institute of Psychiatry, Box P089 De Crespigny Park London SE173RT - United Kingdom [email protected]
Vladimir Matic KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]


Siamak Mehrkanoon KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Marie-Francine Moens KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Claire Monteleoni George Washington University Computer Science Department 801 22nd St. NW, Suite 703 20052 Washington, DC - United States [email protected]
Vahid Nassiri Vrije Universiteit Brussel 15, rue de la Jonchaie 1040 Brussels - Belgium [email protected]
Dr. Valeriya Naumova Johann Radon Institute for Computational and Applied Mathematics Altenbergerstrasse 69 4040 Linz - Austria [email protected]
Prof. Yurii Nesterov Catholic University of Louvain - UCL CORE, 34 voie du Roman Pays 1348 Louvain-la-Neuve - Belgium [email protected]
Dr. Esa Ollila Aalto University Otakaari 5 A 02150 Espoo - Finland [email protected]
Francesco Orsini KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Marco Pedersoli KU Leuven, ESAT-PSI/VISICS Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Kristiaan Pelckmans Uppsala University Department of Information Technology Box 337 751 05 Uppsala - Sweden [email protected]
Prof. Christophe Phillips University of Liege Cyclotron Research Centre, B30 Sart Tilman 4000 Liege - Belgium [email protected]
Prof. Massimiliano Pontil University College London Malet Place London WC1E 6BT - United Kingdom [email protected]
Thomas Provoost KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Jose Gervasio Puertas KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]


Prof. Jan Ramon KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Wojciech Rejchel Nicolaus Copernicus University Faculty of Mathematics and Computer Science Chopina 12/18 87-100 Torun - Poland [email protected]
Flamary Remi Universite de Nice Sophia-Antipolis Laboratoire Lagrange, UMR 7293 Parc Valrose 06108 Nice - France [email protected]
Prof. Justin Romberg Georgia Institute of Technology 777 Atlantic Drive NW 30332 Atlanta - United States [email protected]
Dr. Maria Rosa University College London 66a Kellett Road London SW2 1ED - United Kingdom [email protected]
Alessandro Rudi Istituto Italiano di Tecnologia Via Morego 30 16163 Genova - Italy [email protected]
Prof. Tobias Ryden Lynx Asset Management Norrmalmstorg 12 Box 7060 10386 Stockholm - Sweden [email protected]
Jose Sanchez Chalmers University of Technology Matematiska Vetenskaper Chalmers tvargata 3 412 96 Gothenburg - Sweden [email protected]
Prof. Bernhard Schoelkopf Max Planck Institute Tuebingen Spemannstrasse 38 72076 Tuebingen - Germany [email protected]
Dr. Jessica Schrouff Cyclotron Research Centre University of Liege 8, Allee du 6-Aout, B30 4000 Liege - Belgium [email protected]
Dr. Fermin Segovia-Roman University of Liege Centre de recherches du cyclotron (Bat. B30) 8, Allee du 6 Aout 4000 Liege - Belgium [email protected]
Dr. Marko Seslija KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. John Shawe-Taylor Catholic University of Louvain - UCL 5 Chapel Square GU25 4SZ Virginia Water - United Kingdom [email protected]
Dr. Lei Shi KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]


Dr. Marco Signoretto KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Pascal Spincemaille Weill Cornell Medical College 416 E55th St 10022 New York - United States [email protected]
Santosh Srivastava Jaypee University of Engineering and Technology Department of Mathematics Guna, M.P, India 473226 Raghogarh - India [email protected]
Prof. Johan Suykens KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Tatiana Tommasi KU Leuven, ESAT-PSI/VISICS Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Prof. Joel Tropp Caltech 1200 E. California Blvd. MC 305-16 91125-5000 Pasadena - United States [email protected]
Dr. Konstantin Usevich Vrije Universiteit Brussel Pleinlaan 2, Building K Department ELEC 1050 Brussels - Belgium [email protected]
Dr. Vanya Van Belle KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Dr. Laurens van der Maaten Delft University of Technology Mekelweg 4 2628 CD Delft - The Netherlands [email protected]
Dr. Steven Van Vaerenbergh University of Cantabria Departamento de Ingenieria de Comunicaciones Edificio Ingenieria de Telecomunicacion Plaza de la Ciencia s/n 39005 Santander - Spain [email protected]
Prof. Joos Vandewalle KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Carolina Varon KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Yuyi Wang KU Leuven, Dep. Computer Science Celestijnenlaan 200A 3001 Leuven-Heverlee - Belgium [email protected]
Zi Wang Imperial College London 2 Gainsborough Road London E15 3AF - United Kingdom [email protected]


Dr. Marco Wiering University of Groningen Department of Artificial Intelligence Nijenborgh 9 9700AK Groningen - The Netherlands [email protected]
Dr. Gunnar Wilken Okinawa Institute of Science and Technology 1919-1 Tancha, Lab 1, SCB Unit 904-0495 Onna-son - Japan [email protected]
Ines Wilms KU Leuven Onderzoeksgroep Operations Research and Business Statistics Naamsestraat 69 3000 Leuven - Belgium [email protected]
Bahman Yari Saeed Khanloo CWI 123 Science Park 1098XG Amsterdam - The Netherlands [email protected]
Pooya Zakeri KU Leuven, ESAT-SCD Kasteelpark Arenberg 10 3001 Leuven-Heverlee - Belgium [email protected]
Roman Zakharov Universite catholique de Louvain - UCL 2, Place Sainte-Barbe A158 1348 Louvain-la-Neuve - Belgium [email protected]
Dr. Rafal Zdunek Wroclaw University of Technology Wybrzeze Wyspianskiego 27 50-370 Wroclaw - Poland [email protected]
Prof. Ding-Xuan Zhou City University of Hong Kong Dep. of Mathematics 83 Tat Chee Avenue, Kowloon Hong Kong - China [email protected]
