Machine Learning 10-701 - Carnegie Mellon University

Machine Learning 10-701 Tom M. Mitchell

Machine Learning Department Carnegie Mellon University

April 12, 2011

Today:

•  Support Vector Machines •  Margin-based learning

Readings:

Required: SVMs: Bishop Ch. 7, through 7.1.2

Optional: Remainder of Bishop Ch. 7

Thanks to Aarti Singh for several slides

SVM: Maximize the margin

margin = γ = a/‖w‖

Margin = Distance of closest examples from the decision line/hyperplane

Maximizing the margin

margin = γ = a/‖w‖

γ γ max γ = a/‖w‖

s.t. (wTxj+b) yj ≥ a ∀j

Note: ‘a’ is arbitrary (can normalize equations by a)

Margin = Distance of closest examples from the decision line/hyperplane

Support Vector Machine (primal form)

Solve efficiently by quadratic programming (QP) –  Well-studied solution

algorithms

max γ = 1/‖w‖ w,b

s.t. (wTxj+b) yj ≥ 1 ∀j

min wTw

s.t. (wTxj+b) yj ≥ 1 ∀j w,b

Primal form:

Primal form: solve for w, b

both are QP problems with a single local optimum!

We can solve either primal or dual forms

Dual form: solve for

Classification test for new x :

Support Vectors

Linear hyperplane defined by “support vectors”

Moving other points a little doesn’t effect the decision boundary

only need to store the support vectors to predict labels of new points

How many support vectors in linearly separable case, given d dimensions?

wTx + b > 0 wTx + b < 0

≤ d+1

Kernel SVM

And because the dual form depends only on inner products, we can apply the kernel trick to work in a (virtual) projected space

Primal form: solve for w, b in the projected higher dim. space

Dual form: solve for in the original low dim. space

SVM Decision Surface using Gaussian Kernel

Circled points are the support vectors: training examples with non-zero

Points plotted in original 2-D space.

Contour lines show constant

[from Bishop, figure 7.2]

What if data is not linearly separable?

Use features of features of features of features….

But run risk of overfitting!

x12, x2

2, x1x2, …., exp(x1)

What if data is still not linearly separable?

min wTw + C #mistakes w,b s.t. (wTxj+b) yj ≥ 1 ∀j

Allow “error” in classification

Maximize margin and minimize # mistakes on training data

C - tradeoff parameter

Not QP 0/1 loss (doesn’t distinguish between near miss and bad mistake)

Support Vector Machine with soft margins

Allow “error” in classification

ξj - “slack” variables = (>1 if xj misclassifed) pay linear penalty if mistake

C - tradeoff parameter (chosen by cross-validation)

Still QP Soft margin approach

min wTw + C Σ ξj w,b

s.t. (wTxj+b) yj ≥ 1-ξj ∀j

ξj ≥ 0 ∀j

Primal and Dual Forms for Soft Margin SVM

Primal form: solve for w, b in the projected higher dim. space

Dual form: solve for in the original low dim. space

both are QP problems with a single local optimum

SVM Soft Margin Decision Surface using Gaussian Kernel

Circled points are the support vectors: training examples with non-zero

Points plotted in original 2-D space.

Contour lines show constant

[from Bishop, figure 7.4]

SVM Summary

•  Objective: maximize margin between decision surface and data

•  Primal and dual formulations –  dual represents classifier decision in terms of support vectors

•  Kernel SVM’s –  learn linear decision surface in high dimension space, working in

original low dimension space

•  Handling noisy data: soft margin “slack variables” –  again primal and dual forms

•  SVM algorithm: Quadratic Program optimization –  single global minimum

SVM: PAC Results?

VC dimension: examples What is VC dimension of •  H2 = { ((w0 + w1x1 + w2x2)>0 y=1) }

–  VC(H2)=3 •  For Hn = linear separating hyperplanes in n dimensions, VC(Hn)=n+1

Structural Risk Minimization

Which hypothesis space should we choose? •  Bias / variance tradeoff

H1 H2 H3 H4

[Vapnik]

SRM: choose H to minimize bound on expected true error!

* unfortunately a somewhat loose bound...

Margin-based PAC Results

recall:

[Shawe-Taylor, Langford, McCallester]

Maximizing Margin as an Objective Function

•  We’ve talked about many learning algorithms, with different objective functions

•  0-1 loss •  sum sq error •  maximum log data likelihood •  MAP •  maximum margin

How are these all related?

Slack variables – Hinge loss

min wTw + C Σ ξj w,b

s.t. (wTxj+b) yj ≥ 1-ξj ∀j

ξj ≥ 0 ∀j

Complexity penalization

Hinge loss

0-1 loss

0 -1 1

SVM vs. Logistic Regression

SVM : Hinge loss

0-1 loss

0 -1 1

Logistic Regression : Log loss ( -ve log conditional likelihood)

Hinge loss Log loss

What you need to know

Primal and Dual optimization problems Kernel functions Support Vector Machines •  Maximizing margin •  Derivation of SVM formulation •  Slack variables and hinge loss •  Relationship between SVMs and logistic regression

–  0/1 loss –  Hinge loss –  Log loss

Machine Learning 10-701 - Carnegie Mellon University

Documents

New York University Carnegie Mellon University arXiv:2010

9 September 2004The Straw Tube Chamber1 The CDC Curtis A. Meyer Carnegie Mellon University Physics Requirements and Specifications Prototype Construction

DAML: ATLAS Project Carnegie Mellon University · Carnegie Mellon University Katia Sycara Anupriya Ankolekar, Massimo Paolucci, Naveen Srinivasan November 2004. 1 CMU: Katia Sycara

Carnegie Mellon Music Understanding and the Future of Music Performance Roger B. Dannenberg Professor of Computer Science, Art, and Music Carnegie Mellon

융합공학부 School of Integrative Engineering나노재료공학 1 이동현 2 Carnegie Mellon University 공학박사 3 의료공학 4 바이오소재공학, 신소재결정학,

Embedded System Lab. 김해천 haecheon100@gmail.com Memory Scaling: A System Architecture Perspective Onur Mutlu Carnegie Mellon University onur@cmu.edu

Visualization - Carnegie Mellon School of Computer Sciencejkh/462_s07/24_visualization.pdf · Scientific Visualization ... • Visualizing an explicit function ... (like the visible

CMUMultiple Integrals under Di erential Constraints: Two-Scale Convergence and Homogenization Irene onsecaF Carnegie Mellon Universit,y fonseca@andrew.cmu.edu Stefan Krömer Carnegie

Perception - Carnegie Mellon School of Computer Scienceillah/16-467/handouts/PerceptionSlideSet2016.pdf · Social Perception •What features do ... • Cultural modeling, expert

© 2015 Carnegie Mellon University Algorithmic Logic-Based Verification with SeaHorn Software Engineering Institute Carnegie Mellon University Pittsburgh,

18 - 322 Fall 2002 Lecture 20 - Carnegie Mellon University

Carnegie Mellon June 5, 2015 Dipayan Bhattacharya 15-213 Recitation: Bomb Lab

Software Development Processes - Carnegie Mellon …koopman/lectures/ece642/04_software... · Software Development Processes ... Individuals and Interactionsover processes and tools

{ Kevin T. Kelly, Hanti Lin } Carnegie Mellon University Responsive -ness CMU

Interior-point methods - Carnegie Mellon University

TANGENT TANGENT A Novel, “Surprise-me”, Recommendation Algorithm Kensuke Onuma : Sony Corporation Hanghang Tong : Carnegie Mellon Univ. Christos Faloutsos

Measurement-based Change in Libraries Case Studies From an Academic Library Joan Stein, Head, Access Services Carnegie Mellon University Libraries

Kazuya Kawakamik-kawakami.com/static/pdf/cmu_160429.pdf · LTI @ Carnegie Mellon University Multilingual and Multimodal word representation Kazuya Kawakami

Carnegie Mellon UniversityCarnegie Mellon University

Nick Sahinidis Center for Advanced Process Decision-making National Energy Technology Laboratory Department of Chemical Engineering Carnegie Mellon University