
Page 1

Support Vector Machines

Nicholas Ruozzi

University of Texas at Dallas

Slides adapted from David Sontag and Vibhav Gogate

Page 2

Announcements

• Homework 1 is now available online

• Join the Piazza discussion group

• Reminder: my office hours are 11am-12pm on Tuesdays

Page 3

Binary Classification

• Input: (𝑥⁽¹⁾, 𝑦₁), … , (𝑥⁽ⁿ⁾, 𝑦ₙ) with 𝑥⁽ⁱ⁾ ∈ ℝᵐ and 𝑦ᵢ ∈ {−1, +1}

• We can think of the observations as points in ℝᵐ, each with an associated sign (+ or −, corresponding to the two class labels)

• An example with 𝑚 = 2

[Figure: + points on one side and − points on the other side of a line; the line is 𝑤ᵀ𝑥 + 𝑏 = 0, the + side is 𝑤ᵀ𝑥 + 𝑏 > 0, and the − side is 𝑤ᵀ𝑥 + 𝑏 < 0]

Page 4

Binary Classification

[Same figure as the previous slide]

𝑤 is called the vector of weights and 𝑏 is called the bias
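To make the decision rule concrete, here is a minimal sketch in Python: a point is classified by the sign of 𝑤ᵀ𝑥 + 𝑏. The weights and bias below are made up for illustration; they are not from the slides.

```python
import numpy as np

def predict(w, b, x):
    # classify x by which side of the hyperplane w^T x + b = 0 it lies on
    return 1 if w @ x + b > 0 else -1

# hypothetical weights and bias for the m = 2 example
w, b = np.array([1.0, -1.0]), 0.5
print(predict(w, b, np.array([2.0, 0.0])))   # 1  (w^T x + b = 2.5 > 0)
print(predict(w, b, np.array([-2.0, 0.0])))  # -1 (w^T x + b = -1.5 < 0)
```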

Page 5

What If the Data Isn't Separable?

• Input: (𝑥⁽¹⁾, 𝑦₁), … , (𝑥⁽ⁿ⁾, 𝑦ₙ) with 𝑥⁽ⁱ⁾ ∈ ℝᵐ and 𝑦ᵢ ∈ {−1, +1}

• We can think of the observations as points in ℝᵐ, each with an associated sign (+ or −, corresponding to the two class labels)

• An example with 𝑚 = 2

[Figure: + and − points interleaved so that no single line separates the two classes]


Page 7

Adding Features

• The idea:

– Given the observations 𝑥⁽¹⁾, … , 𝑥⁽ⁿ⁾, construct a feature vector 𝜙(𝑥)

– Use 𝜙(𝑥⁽¹⁾), … , 𝜙(𝑥⁽ⁿ⁾) instead of 𝑥⁽¹⁾, … , 𝑥⁽ⁿ⁾ in the learning algorithm

– Goal is to choose 𝜙 so that 𝜙(𝑥⁽¹⁾), … , 𝜙(𝑥⁽ⁿ⁾) are linearly separable

– Learn linear separators of the form 𝑤ᵀ𝜙(𝑥) (instead of 𝑤ᵀ𝑥); a small example follows
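A concrete, made-up illustration of the idea: 1-D data with the − points on the outside and the + points near the origin cannot be separated by a threshold on 𝑥 alone, but the feature map 𝜙(𝑥) = (𝑥, 𝑥²) makes the classes linearly separable.

```python
import numpy as np

# 1-D data: negatives at the extremes, positives near the origin,
# so no threshold on x alone separates the classes
x = np.array([-2.0, -1.5, 1.5, 2.0, -0.3, 0.0, 0.4])
y = np.array([-1, -1, -1, -1, 1, 1, 1])

def phi(x):
    # feature map phi(x) = (x, x^2)
    return np.stack([x, x**2], axis=1)

# in feature space, w = (0, -1), b = 1 separates the data:
# sign(w^T phi(x) + b) = sign(1 - x^2)
w, b = np.array([0.0, -1.0]), 1.0
print(np.all(np.sign(phi(x) @ w + b) == y))  # True
```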

Page 8

Adding Features

• Sometimes it is convenient to group the bias together with the weights

• To do this

– Let 𝜙(𝑥₁, 𝑥₂) = (𝑥₁, 𝑥₂, 1) and 𝑤 = (𝑤₁, 𝑤₂, 𝑏)

– This gives

𝑤ᵀ𝜙(𝑥₁, 𝑥₂) = 𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑏 = 𝑤ᵀ𝑥 + 𝑏
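A quick numerical check of this identity (a sketch; the data and weights are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                 # two points in R^2
w, b = np.array([0.5, -1.0]), 2.0          # arbitrary weights and bias

# append a constant-1 feature: phi(x1, x2) = (x1, x2, 1)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)                    # w = (w1, w2, b)

print(np.allclose(X_aug @ w_aug, X @ w + b))  # True
```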

Page 9

Support Vector Machines

[Figure: linearly separable + and − points with several different lines that all classify the data perfectly]

• How can we decide between perfect classifiers?


Page 11

Support Vector Machines

[Figure: the same separable + and − points with a single separating line]

• Define the margin to be the distance of the closest data point to the classifier
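In formulas: the distance from a point 𝑥 to the hyperplane is |𝑤ᵀ𝑥 + 𝑏|/‖𝑤‖ (the slides derive this for the closest points a few pages later), so the margin is the minimum of this quantity over the data. A small sketch, with illustrative names:

```python
import numpy as np

def margin(w, b, X):
    # distance from the hyperplane w^T x + b = 0 to the closest row of X;
    # each point x lies |w^T x + b| / ||w|| away from the hyperplane
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)
```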

Page 12

Support Vector Machines

• Support vector machines (SVMs)

• Choose the classifier with the largest margin

– Has good practical and theoretical performance

[Figure: the maximum-margin separating line between the + and − points]

Page 13

Some Geometry

• In 𝑛 dimensions, a hyperplane is the set of solutions to the equation

𝑤ᵀ𝑥 + 𝑏 = 0

with 𝑤 ∈ ℝⁿ, 𝑏 ∈ ℝ

• The vector 𝑤 is sometimes called the normal vector of the hyperplane

[Figure: a hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0 with its normal vector 𝑤]

Page 14

Some Geometry

• In 𝑛 dimensions, a hyperplane is the set of solutions to the equation

𝑤ᵀ𝑥 + 𝑏 = 0

• Note that this equation is scale invariant: for any scalar 𝑐 ≠ 0,

𝑐 ⋅ (𝑤ᵀ𝑥 + 𝑏) = 0

defines the same hyperplane

[Figure: the same hyperplane with its normal vector 𝑤]

Page 15

Some Geometry

• The distance between a point 𝑦 and the hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0 is the length of the vector from 𝑦 perpendicular to the hyperplane; if 𝑧 is the point where this perpendicular meets the hyperplane, then

𝑦 − 𝑧 = ‖𝑦 − 𝑧‖ ⋅ 𝑤/‖𝑤‖

[Figure: the hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0, a point 𝑦 off the hyperplane, and its perpendicular projection 𝑧 onto it]

Page 16

Scale Invariance

• The maximum margin is always attained by choosing 𝑤ᵀ𝑥 + 𝑏 = 0 so that it is equidistant from the closest data point classified as +1 and the closest data point classified as −1

• If the closest points then lie on the parallel hyperplanes 𝑤ᵀ𝑥 + 𝑏 = ±𝑐, by scale invariance we can assume that 𝑐 = 1

[Figure: the hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0 and the parallel hyperplane 𝑤ᵀ𝑥 + 𝑏 = 𝑐 through the closest point 𝑦, with 𝑧 the projection of 𝑦 onto the hyperplane]

Page 17

Scale Invariance

• We want to maximize the margin subject to the constraints

𝑦ᵢ(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏) ≥ 1, for all 𝑖

• But how do we compute the size of the margin?

[Figure: the hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0 with the parallel hyperplanes 𝑤ᵀ𝑥 + 𝑏 = 𝑐 and 𝑤ᵀ𝑥 + 𝑏 = −𝑐 through the closest points on each side]

Page 18

Some Geometry

• Putting it all together: we have

𝑦 − 𝑧 = ‖𝑦 − 𝑧‖ ⋅ 𝑤/‖𝑤‖

and

𝑤ᵀ𝑦 + 𝑏 = 1, 𝑤ᵀ𝑧 + 𝑏 = 0

• Subtracting the last two equations gives

𝑤ᵀ(𝑦 − 𝑧) = 1

and multiplying the first equation by 𝑤ᵀ gives

𝑤ᵀ(𝑦 − 𝑧) = ‖𝑦 − 𝑧‖ ⋅ ‖𝑤‖

which gives

‖𝑦 − 𝑧‖ = 1/‖𝑤‖

[Figure: the hyperplane 𝑤ᵀ𝑥 + 𝑏 = 0 with the margin hyperplanes 𝑤ᵀ𝑥 + 𝑏 = 1 and 𝑤ᵀ𝑥 + 𝑏 = −1]
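A quick numerical sanity check of the derivation (the numbers are arbitrary, chosen so the arithmetic is easy to follow):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -2.0     # a hyperplane with ||w|| = 5
z = np.array([2.0, -1.0])             # on the hyperplane: w^T z + b = 0
y = z + w / (w @ w)                   # stepping along w gives w^T y + b = 1

print(w @ z + b, w @ y + b)                          # 0.0 1.0
print(np.linalg.norm(y - z), 1 / np.linalg.norm(w))  # 0.2 0.2
```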

Page 19

SVMs

• This analysis yields the following optimization problem:

max_𝑤  1/‖𝑤‖

such that

𝑦ᵢ(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏) ≥ 1, for all 𝑖

• Or, equivalently,

min_𝑤  ‖𝑤‖²

such that

𝑦ᵢ(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏) ≥ 1, for all 𝑖

Page 20

SVMs

min_𝑤  ‖𝑤‖²

such that

𝑦ᵢ(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏) ≥ 1, for all 𝑖

• This is a standard quadratic programming problem

– Falls into the class of convex optimization problems

– Can be solved with many specialized optimization tools (e.g., quadprog() in MATLAB); a sketch follows
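For instance, here is a minimal sketch of this quadratic program in Python using the cvxpy library (my choice of tool; the slides only mention MATLAB's quadprog). The toy data is made up and linearly separable.

```python
import cvxpy as cp
import numpy as np

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize ||w||^2 subject to y_i (w^T x^(i) + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)  # the maximum-margin separator
```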

Page 21

SVMs

• Where does the name come from?

– The data points for which 𝑦ᵢ(𝑤ᵀ𝑥⁽ⁱ⁾ + 𝑏) = 1 are called support vectors

[Figure: the margin hyperplanes 𝑤ᵀ𝑥 + 𝑏 = 1 and 𝑤ᵀ𝑥 + 𝑏 = −1 passing through the support vectors on either side of 𝑤ᵀ𝑥 + 𝑏 = 0]

Page 22

SVMs

• What if the data isn't linearly separable?

– Use feature vectors

• What if we want to do more than just binary classification (e.g., 𝑦 ∈ {1, 2, 3})?

– One versus all: for each class, compute a linear separator between this class and all other classes

– All versus all: for each pair of classes, compute a linear separator between the two classes; a sketch of one versus all follows
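A minimal one-versus-all sketch (train_binary_svm is a hypothetical stand-in for any binary SVM trainer returning (𝑤, 𝑏), e.g., the quadratic program above):

```python
import numpy as np

def train_one_vs_all(X, y, classes, train_binary_svm):
    # one separator per class: this class (+1) vs. all others (-1)
    models = {}
    for c in classes:
        y_pm = np.where(y == c, 1.0, -1.0)
        models[c] = train_binary_svm(X, y_pm)  # returns (w, b)
    return models

def predict_one_vs_all(models, x):
    # predict the class whose separator scores x the highest
    scores = {c: w @ x + b for c, (w, b) in models.items()}
    return max(scores, key=scores.get)
```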