Mathematical Programming and Research
Methods (Part II)
4. Convexity and Optimization
Massimiliano Pontil
(based on previous lecture by Andreas Argyriou)
Today’s Plan
• Convex sets and functions
• Types of convex programs
• Algorithms
• Convex learning problems
Convexity
• Simple intuition, originates from simple geometric shapes (e.g. polygons)
(Figure: two convex shapes and one non-convex shape)
• Convexity plays a very important role in optimization
Convex Sets
Definition 1. A set C ⊆ IRd is called convex if
λx + (1 − λ)y ∈ C for all x, y ∈ C, λ ∈ [0, 1]
• I.e. if x and y are in the set C, then the whole line segment
{λx + (1 − λ)y : λ ∈ [0, 1]} also lies in C
Convex Sets (contd.)
• We call λx + (1 − λ)y a convex combination of x and y whenever λ ∈ [0, 1]
• Generally, for any k ∈ IN, the sum

∑_{i=1}^k λi xi

is called a convex combination of the points x1, . . . , xk ∈ IRd whenever
λ1, . . . , λk ≥ 0 and ∑_{i=1}^k λi = 1
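The definition above is easy to spot-check numerically. A minimal sketch (the points and weights are arbitrary illustrative data):

```python
import numpy as np

# Illustrative sketch: a convex combination of points x1, ..., xk in IR^d
# is sum_i lambda_i * xi with lambda_i >= 0 and sum_i lambda_i = 1.
rng = np.random.default_rng(0)
points = rng.normal(size=(4, 2))   # k = 4 points in IR^2
lam = rng.random(4)
lam /= lam.sum()                   # normalise the weights to sum to 1

combo = lam @ points               # the convex combination, a point in IR^2

assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
```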
Convex Sets (contd.)
• Clearly, if a set C is convex, all convex combinations of points in C (for
all k ∈ IN) belong to C
• This set of all convex combinations is called the convex hull of C
• In general, given a set S ⊆ IRd (S need not be convex), the convex hull
of S is the set

conv(S) := { ∑_{i=1}^k λi xi : xi ∈ S, λ1, . . . , λk ≥ 0, ∑_{i=1}^k λi = 1, k ∈ IN }
• The convex hull is the smallest convex set containing S
Convex Sets (contd.)
• S = conv(S) if and only if S is convex
Examples of Convex Sets
• Affine sets, i.e. sets of solutions of linear equations {x : Ax = b}
• Convex cones, i.e. sets containing any nonnegative combination
∑_{i=1}^k θi xi , θ1, . . . , θk ≥ 0, of their points
Examples of Convex Sets (contd.)
• Hyperplanes, i.e. sets of the form {x : a⊤x = b}, where
a ∈ IRd, a ≠ 0, b ∈ IR (since they are special cases of affine sets)
• Halfspaces, i.e. sets of the form {x : a⊤x ≤ b}, where
a ∈ IRd, a ≠ 0, b ∈ IR
Polyhedra
• A polyhedron is a set defined by a finite number of affine equalities and
inequalities

P = {x : ai⊤x ≤ bi, i = 1, . . . , m, cj⊤x = dj, j = 1, . . . , p}

where a1, . . . , am, c1, . . . , cp ∈ IRd, b1, . . . , bm, d1, . . . , dp ∈ IR
• Polyhedra are convex sets
Polyhedra (contd.)
• A bounded polyhedron is called a polytope
• A set is a polytope if and only if it is the convex hull of a finite set of
points
The Positive Semidefinite Cone
• We use the notations

X ⪰ 0    X ≻ 0

to denote that a d × d matrix X is positive semidefinite and positive
definite, respectively

• The sets

S^d_+ := {X ∈ IRd×d : X ⪰ 0}

and

S^d_++ := {X ∈ IRd×d : X ≻ 0}

are called the positive semidefinite cone and positive definite cone,
respectively
The Positive Semidefinite Cone (contd.)
• S^d_+ and S^d_++ are convex cones

Proof. For any A, B ∈ S^d_+, θ1, θ2 ≥ 0, the matrix θ1A + θ2B is psd. ⊓⊔
• E.g. in IR2×2, the positive semidefinite cone consists of the matrices of
the form

( x  y )
( y  z )

such that x, z ≥ 0, xz ≥ y²
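The 2×2 criterion can be cross-checked against an eigenvalue test; a small sketch (the helper names `psd_2x2` and `psd_eig` are ours):

```python
import numpy as np

# Sketch: for a symmetric 2x2 matrix [[x, y], [y, z]], positive
# semidefiniteness is equivalent to x >= 0, z >= 0 and x*z >= y**2.
def psd_2x2(x, y, z):
    return x >= 0 and z >= 0 and x * z >= y * y

def psd_eig(x, y, z):
    # reference check via eigenvalues (with a tiny numerical slack)
    return bool(np.all(np.linalg.eigvalsh([[x, y], [y, z]]) >= -1e-12))

# the two criteria agree on a handful of examples
for x, y, z in [(1, 0, 1), (1, 2, 1), (0, 0, 0), (2, 1, 1), (1, -3, 2)]:
    assert psd_2x2(x, y, z) == psd_eig(x, y, z)
```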
Norms
• A norm, denoted by ‖ · ‖, is a function from IRd to IR+ such that
1. ‖w‖ ≥ 0, for all w ∈ IRd
2. ‖w‖ = 0 if and only if w = 0
3. ‖aw‖ = |a|‖w‖, for all a ∈ IR, w ∈ IRd (homogeneity)
4. ‖w + z‖ ≤ ‖w‖ + ‖z‖, for all w, z ∈ IRd (triangle inequality)
• Important example: the Lp norm

‖w‖p := ( ∑_{i=1}^d |wi|^p )^{1/p}

where p ∈ [1, +∞)
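A short sketch of the Lp norm as defined above, checked against numpy's built-in norm (the function name `lp_norm` is ours):

```python
import numpy as np

# The Lp norm from the slide: ( sum_i |wi|^p )^(1/p), for p >= 1.
def lp_norm(w, p):
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, -4.0, 0.0])
assert np.isclose(lp_norm(w, 2), 5.0)                   # L2 of (3, -4, 0)
assert np.isclose(lp_norm(w, 1), np.linalg.norm(w, 1))  # agrees with numpy
assert np.isclose(lp_norm(w, 3), np.linalg.norm(w, 3))
```

For large p the value approaches the L∞ norm max|wi|, matching the limit mentioned on the next slide.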
Norms (contd.)
• We have already seen the L2 norm – it is the regularizer in ridge
regression, SVM etc.

‖w‖2 = ( ∑_{i=1}^d wi² )^{1/2} = (w⊤w)^{1/2}

• The L1 norm

‖w‖1 = ∑_{i=1}^d |wi|

• Letting p → +∞, we get the L∞ norm

‖w‖∞ = max_{i=1,...,d} |wi|
Norm Balls
• The unit ball for a norm is the set {w : ‖w‖ ≤ 1}
(Figure: the L1, L2 and L∞ unit balls)
Norm Balls (contd.)
• In general, any norm ball of the form

{w : ‖w − c‖ ≤ r}

where c ∈ IRd and r ≥ 0 are the center and radius of the ball,
respectively, is a convex set
Convex Functions
• A function f : IRd → IR is called convex if
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1]
• Intuition: the line segment connecting any two points on the graph of f
lies above the graph
(Figures: a convex and a non-convex function)
Convex Functions (contd.)
• Similarly, a function f is called concave if
f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, λ ∈ [0, 1], or, equivalently, if −f is convex
Strict Convexity
• A function f : IRd → IR is called strictly convex if
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)
for all x, y ∈ IRd, x ≠ y, λ ∈ (0, 1)
• I.e. the line segment connecting any two points on the graph of f lies
strictly above the graph
• Equivalently, the graph of a strictly convex function does not contain
any line segments
Strict Convexity (contd.)
(Figures: a convex but not strictly convex function, and a strictly convex function)
Jensen’s Inequality
• If f is convex, it follows by induction that

f( ∑_{i=1}^k λi xi ) ≤ ∑_{i=1}^k λi f(xi)

for all k ∈ IN, x1, . . . , xk ∈ IRd, λ1, . . . , λk ≥ 0, such that
∑_{i=1}^k λi = 1

• It can be generalised to integrals and expected values
(it is used e.g. to derive the EM algorithm)
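A numeric spot-check of Jensen's inequality for the convex function f(x) = x², on random illustrative data:

```python
import numpy as np

# Jensen: f(sum_i lam_i * x_i) <= sum_i lam_i * f(x_i) for convex f.
rng = np.random.default_rng(1)
x = rng.normal(size=10)
lam = rng.random(10)
lam /= lam.sum()                # weights sum to 1

lhs = (lam @ x) ** 2            # f applied to the convex combination
rhs = lam @ (x ** 2)            # convex combination of the values f(x_i)
assert lhs <= rhs + 1e-12
```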
Continuity / Differentiability
Theorem 1. If a function f is convex on IRd then it is also continuous
on IRd.
Proof. Not easy. ⊓⊔
• There are convex functions which are not differentiable everywhere
(and others which are)
Second Order Condition
Theorem 2. Assume that a function f is twice differentiable on IRd.
Then f is convex if and only if its Hessian is psd:

∇2f(w) ⪰ 0 for all w ∈ IRd

• Recall that the Hessian is the matrix formed by the second partial
derivatives

∇2f(w) := ( ∂²f/∂wi∂wj (w) )_{i,j=1}^d

• Note: the condition ∇2f ≻ 0 implies strict convexity, but the converse
is not true – see [Boyd & Vandenberghe]
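The second-order test can be applied numerically; a minimal sketch for a quadratic, whose Hessian is constant (the matrix A is illustrative):

```python
import numpy as np

# For f(w) = w^T A w + a^T w with symmetric A, the Hessian is 2A;
# psd Hessian  =>  f is convex (Theorem 2).
A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric, eigenvalues 1 and 3
hessian = 2 * A                          # constant Hessian of the quadratic
assert np.all(np.linalg.eigvalsh(hessian) >= 0)   # psd, so f is convex
```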
Examples of Convex Functions
• Affine functions w ↦ a⊤w + b are both convex and concave
• Exponentials, powers, log-sum-exp
(Figures: plots of the functions below)

w ∈ IR ↦ e^{aw} for any a ∈ IR,   w ∈ IR ↦ |w|^p for any p ≥ 1,   w ∈ IRd ↦ log( ∑_{i=1}^d e^{wi} )
Examples of Convex Functions (contd.)
• Psd. quadratic functions
f(w) = w⊤Aw + a⊤w + b for all w ∈ IRd

where A ∈ S^d_+, a ∈ IRd, b ∈ IR
Examples of Convex Functions (contd.)
• Max function w ↦ max{w1, . . . , wd}
Norms Are Convex Functions
• Every norm is a convex function
Proof. Let w, z ∈ IRd and λ ∈ (0, 1). Then
‖λw + (1−λ)z‖ ≤ ‖λw‖ + ‖(1−λ)z‖ (triangle inequality)
             = λ‖w‖ + (1−λ)‖z‖ (homogeneity) ⊓⊔

• No norm is strictly convex (to see this, select z = aw with a > 0, a ≠ 1)

• The square of every norm, w ↦ ‖w‖², is a convex function
Proof. Do it as an exercise. ⊓⊔
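The parenthetical remark about z = aw can be checked directly: along that ray the convexity inequality holds with equality, which is why no norm is strictly convex. A small numeric sketch:

```python
import numpy as np

# For z = a*w with a > 0, the convexity inequality for a norm is an
# equality, so the graph of the norm contains a line segment.
w = np.array([1.0, 2.0])
z = 3.0 * w                     # a = 3
lam = 0.4
lhs = np.linalg.norm(lam * w + (1 - lam) * z)
rhs = lam * np.linalg.norm(w) + (1 - lam) * np.linalg.norm(z)
assert np.isclose(lhs, rhs)     # equality, not strict inequality
```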
Operations that Preserve Convexity
Question: If f1, . . . , fq : IRd → IR are convex functions, which operations
F can we apply so that f := F(f1, . . . , fq) is also convex?

• Nonnegative weighted sums

f = ∑_{i=1}^q θi fi

where θ1, . . . , θq ≥ 0
Proof. Easy from the definition. ⊓⊔
Operations that Preserve Convexity (contd.)
• Composition of a convex function with an affine map
f(x) = g(Ax + b) for all x ∈ IRd
where g : IRn → IR is convex, A ∈ IRn×d is a matrix and b ∈ IRn
Proof. Let x, y ∈ IRd, λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = g(λAx + (1 − λ)Ay + b)
= g (λ(Ax + b) + (1 − λ)(Ay + b))
≤ λg(Ax + b) + (1 − λ)g(Ay + b) = λf(x) + (1 − λ)f(y)
⊓⊔
Operations that Preserve Convexity (contd.)
• Maximum of convex functions
f = max{f1, . . . , fq}
Proof (for q = 2, can be easily generalised to any q). Let x, y ∈ IRd,
λ ∈ (0, 1). Then
f(λx + (1 − λ)y) = max{f1(λx + (1 − λ)y), f2(λx + (1 − λ)y)}
≤ max{λf1(x) + (1 − λ)f1(y), λf2(x) + (1 − λ)f2(y)}
≤ max{λf1(x), λf2(x)} + max{(1 − λ)f1(y), (1 − λ)f2(y)}
= λf(x) + (1 − λ)f(y) ⊓⊔
• Extends also to infinite sets of convex functions
Proving Convexity
• Thus, to prove convexity of a function f , there are several approaches
– From the definition of convexity
– Compute the Hessian of f and show that it is psd.
– Decompose f as a nonnegative weighted sum of convex functions
– Decompose f as the composition of a convex and an affine function
– Decompose f as a maximum of convex functions
Examples
• Show that the function

f(w) = w⊤w = ‖w‖²

is strictly convex
Proof. The Hessian of f equals ∇2f(w) = 2Id, which is positive
definite. ⊓⊔
Examples (contd.)
• Show that the quadratic function
f(w) = w⊤Aw + a⊤w + b
where A ∈ S^d_+, a ∈ IRd, b ∈ IR, is convex

Proof. Write A = R⊤R for some matrix R ∈ IRd×d. Then
w⊤Aw = (Rw)⊤(Rw), which is a composition of the convex function
w ↦ w⊤w and a linear map. The term w ↦ a⊤w + b is affine and
hence convex. Thus f is convex, as the sum of convex functions. ⊓⊔

Alternatively, we may compute the Hessian, which equals 2A, which is
psd.
Proving Strict Convexity
• To prove strict convexity of a function f
– Use the definition (with strict inequality, x ≠ y and λ ∈ (0, 1))
– Compute the Hessian of f and show that it is positive definite
– Decompose f as a sum of a convex and a strictly convex function
(easy to prove this property)
Note: When does the convex-affine composition operation apply?
• Example: the quadratic function f(w) = w⊤Aw + a⊤w + b is strictly
convex if and only if A ≻ 0 (since the Hessian equals 2A)
Convex Optimization
• The problem

min_{w∈IRd} f(w)
subject to fi(w) ≤ 0,   i = 1, . . . , M      (1)
           aj⊤w = bj,   j = 1, . . . , P

where f, f1, . . . , fM are convex functions, is called a convex program or
convex optimization problem

• The function f whose value we wish to minimise is called the objective
function
Remarks
• The set of points w satisfying the constraints fi(w) ≤ 0, aj⊤w = bj is
called the feasible set

• The feasible set is convex: for any feasible points x, y,
fi(λx + (1 − λ)y) ≤ λfi(x) + (1 − λ)fi(y) ≤ 0 and
aj⊤(λx + (1 − λ)y) = λbj + (1 − λ)bj = bj
• In general, if we minimize a convex objective function over a convex set,
the problem can be rewritten in form (1) (in principle at least;
sometimes not practically possible)

• Many problems of interest can be rewritten in the form (1); they do not
necessarily appear in that form however
Remarks (contd.)
• The minimum in (1) does not always exist! (it could be an infimum or
could be −∞)
• The set of solutions (minimisers) of problem (1) is convex (easy to
show)

• In particular, if the function f is strictly convex, then there is a unique
minimiser (if any exists)
Remarks (contd.)
• There are no local minima outside the set of minimisers
• This is important because it implies that algorithms will not get stuck
away from the solution(s)

• Thus, the great appeal of convex programs is that they can be solved!
(many of them in polynomial time)
Examples
min_{w∈IRd} w⊤Aw + a⊤w
subject to w⊤Bw + b⊤w + c ≤ 0
           d⊤w = e

where A, B ⪰ 0

min_{w∈IRd} a⊤w
subject to b1⊤w ≤ c1
           b2⊤w ≤ c2
           d1⊤w = e1
Regularization
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖²      (R)
• Assume that E(·, y) is a convex function for every y ∈ IR
• Then, problem (R) is a convex program; this program is unconstrained
• Indeed, the objective function is convex, as a sum of convex functions:
E(w⊤xi , yi) is convex as a convex-affine composition and ‖w‖² is also
convex, as we have already seen

• It can be shown that the minimum exists (under mild assumptions on
E)
Regularization (contd.)
• Example 1: ridge regression
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)² + γ ‖w‖²

• Convex program, since the function z ↦ (z − y)² is convex for every
y ∈ IR
Regularization (contd.)
• Example 2: we have seen (in Lecture 1) that SVM is equivalent to the
regularization problem

min_{w∈IRd} ∑_{i=1}^m max{1 − yi(w⊤xi) , 0} + γ ‖w‖²

with γ = 1/(2C)

• This is a convex program, since the function z ↦ max{1 − yz , 0} (the
hinge loss) is convex for every y ∈ {−1, 1}; indeed, it is a maximum of
convex (in particular, affine) functions
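The convexity of the hinge loss can also be spot-checked directly against the definition, on random illustrative inputs:

```python
import numpy as np

# Hinge loss z -> max(1 - y*z, 0): a maximum of two affine functions,
# so it should satisfy the convexity inequality for every lambda.
def hinge(z, y):
    return max(1 - y * z, 0.0)

rng = np.random.default_rng(3)
y = 1.0
for _ in range(100):
    z1, z2 = rng.normal(size=2)
    lam = rng.random()
    lhs = hinge(lam * z1 + (1 - lam) * z2, y)
    rhs = lam * hinge(z1, y) + (1 - lam) * hinge(z2, y)
    assert lhs <= rhs + 1e-12
```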
Regularization (contd.)
• How about the SVM primal

min_{w∈IRd} (1/2)‖w‖² + C ∑_{i=1}^m ξi
subject to yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m
• It is also easy to see that this is a convex program, but now the
variables include the ξi as well as w

• The objective function is convex (quadratic in w, linear in ξi); the
functions 1 − ξi − yi(w⊤xi) and −ξi in the inequality constraints are
also convex
Regularization (contd.)
• Similarly, the SVM and ridge regression dual problems can be seen to be
convex problems

• In general, the dual of regularization

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc      (C)

is a convex problem (assuming as before that the loss function is
convex)

• Indeed, the quadratic form c⊤Gc is a convex function of c since the
Gram matrix G is positive semidefinite
Regularization (contd.)
min_{w∈IRd} ∑_{i=1}^m E( w⊤xi , yi ) + γ ‖w‖²      (R)

min_{c∈IRm} ∑_{i=1}^m E( c⊤gi , yi ) + γ c⊤Gc      (C)
• Problem (R) has a unique solution; indeed, the term ‖w‖² is strictly
convex; hence the objective function is also strictly convex

• However, problem (C) has a unique solution only if G ≻ 0, i.e. only if
the feature vectors φ(xi) are linearly independent;
otherwise, there are infinitely many optimal c, but the corresponding
w = ∑_{i=1}^m ci φ(xi) is unique
Convex Programs with Linear Equality Constraints
• The following special type of convex program can be solved using
Lagrange multipliers

min_{w∈IRd} f(w)
subject to aj⊤w = bj,   j = 1, . . . , P      (2)

where f is a convex and differentiable function

• Set the gradient of the Lagrangian to zero: ∇f(w) = ∑_{j=1}^P cj aj, for
some cj ∈ IR; the set of feasible solutions of this equation is the same as
the set of minimisers of (2) (by a theorem); a dual problem can also be
obtained
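A minimal sketch of this recipe for a toy instance of (2): minimise f(w) = w⊤w subject to a single constraint a⊤w = b (the data a, b are illustrative). The stationarity condition and the constraint form one linear system:

```python
import numpy as np

# KKT system for min w^T w s.t. a^T w = b:
#   2w - c*a = 0   (gradient of the Lagrangian)
#   a^T w    = b   (constraint)
a = np.array([1.0, 2.0])
b = 3.0

K = np.block([[2 * np.eye(2), -a.reshape(2, 1)],
              [a.reshape(1, 2), np.zeros((1, 1))]])
rhs = np.array([0.0, 0.0, b])
sol = np.linalg.solve(K, rhs)
w = sol[:2]                       # minimiser; sol[2] is the multiplier c

assert np.isclose(a @ w, b)                  # constraint satisfied
assert np.allclose(w, (b / (a @ a)) * a)     # matches the closed form
```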
Important Convex Optimization Problems
• Linear Programming
• Quadratic Programming
• Semidefinite Programming
• Dedicated off-the-shelf algorithms exist for each of the above categories
• In machine learning, algorithms have been developed for special
subtypes of such problems
Linear Programming (LP)
min_{w∈IRd} c⊤w
subject to di⊤w ≤ ei,   i = 1, . . . , M      (3)
           aj⊤w = bj,   j = 1, . . . , P
• The feasible set is a polyhedron (bounded or not)
• Problem (3) may have one, none, or infinitely many solutions
• Interesting fact: the dual problem is also a linear program
Linear Programming (contd.)
• It can be shown that the solution (if unique) will be one of the vertices
• The simplex algorithm is one of the oldest optimization algorithms(Dantzig in the 40s)
Linear Programming (contd.)
• Intuition of simplex: find a vertex to start from; from each vertex, move
to a neighbour so that the objective function decreases; terminate if
there is no such neighbour

• Time complexity: very good in almost all cases, but very bad
(exponential) in the worst case

• In practice, very fast for typical problems and can be applied to large
data sets

• Methods developed in the 80s (interior-point methods) have been
applied to linear programming and are of polynomial-time complexity
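The vertex intuition can be illustrated by brute force on a tiny LP (the constraint data are ours): enumerate the vertices where pairs of constraint boundaries meet, keep the feasible ones, and take the best. This is not how simplex works internally, just a sketch of why the optimum sits at a vertex.

```python
import numpy as np
from itertools import combinations

# Tiny LP: minimise c^T w  s.t.  D w <= e  (the triangle x+y<=1, x,y>=0).
c = np.array([-1.0, -1.0])
D = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
e = np.array([1.0, 0.0, 0.0])

best, best_val = None, np.inf
for i, j in combinations(range(3), 2):
    M = D[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                              # parallel boundaries, no vertex
    v = np.linalg.solve(M, e[[i, j]])         # intersection of two boundaries
    if np.all(D @ v <= e + 1e-9):             # keep only feasible vertices
        if c @ v < best_val:
            best, best_val = v, c @ v

assert np.isclose(best_val, -1.0)             # optimum at a vertex of the triangle
```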
Quadratic Programming (QP)
min_{w∈IRd} w⊤Aw + c⊤w
subject to di⊤w ≤ ei,   i = 1, . . . , M      (4)
           aj⊤w = bj,   j = 1, . . . , P

where A ⪰ 0
• If A ≻ 0, the solution (if any) is unique (due to strict convexity)
• The dual is also a quadratic program
Quadratic Programming (contd.)
• The idea behind simplex does not apply here; in fact, the minimiser
could be anywhere in the feasible set (on the boundary or in the interior)

• The difficulty in solving QP is due to the fact that the solution may lie
on the boundary of the feasible set
Interior-Point Methods
• Idea: change the objective function by adding to it a barrier function
• The barrier depends on the constraints and is parameterised by a
parameter t

• Unconstrained minimisation of the barrier function gives a solution in
the interior of the feasible set

• Changing t appropriately, the algorithm converges to the solution of (4)
in polynomial time
• These methods can handle problems of reasonably large size
Ridge Regression as QP
min_{w∈IRd} ∑_{i=1}^m (yi − w⊤xi)² + γ ‖w‖²
• Ridge regression is an unconstrained QP
• Just need to solve a linear system using standard methods
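The linear system in question comes from setting the gradient of the ridge objective to zero: (X⊤X + γI)w = X⊤y. A minimal sketch on synthetic illustrative data:

```python
import numpy as np

# Ridge regression via its normal equations (X^T X + gamma*I) w = X^T y.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                      # rows are the x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)
gamma = 0.1

w = np.linalg.solve(X.T @ X + gamma * np.eye(3), X.T @ y)

# gradient of sum (y_i - w^T x_i)^2 + gamma*||w||^2 vanishes at the solution
grad = 2 * (X.T @ (X @ w - y)) + 2 * gamma * w
assert np.allclose(grad, 0.0, atol=1e-8)
```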
SVM as QP
Primal:

min_{w∈IRd} (1/2)‖w‖² + C ∑_{i=1}^m ξi
s.t. yi(w⊤xi) ≥ 1 − ξi, ξi ≥ 0, for i = 1, . . . , m

Dual:

max_{α∈IRm} −(1/2) α⊤Aα + ∑_{i=1}^m αi
s.t. 0 ≤ αi ≤ C, for i = 1, . . . , m
• SVM is a QP with inequality constraints
• The SVM dual is a QP with “box” constraints
Algorithms for SVM
• One approach to solve SVMs is with interior-point methods
• For large datasets (say m > 10³) it is practically impossible to solve the
dual problem with such methods (matrix A is dense!)

• A typical approach is to iteratively optimize wrt. an 'active set' A of
dual variables, fixing the rest. Set α = 0, choose q ≤ m and an active
set A of q variables. We repeat until convergence the steps
– Solve the problem wrt. the variables in A
– Remove one variable from A which satisfies the KKT conditions and
add one variable, if any, which violates the KKT conditions. If no
such variable exists, stop
QCQP
min_{w∈IRd} w⊤Aw + c⊤w
subject to w⊤Bw + di⊤w ≤ ei,   i = 1, . . . , M
           aj⊤w = bj,   j = 1, . . . , P

where both A, B are psd.
• It is called a quadratically constrained quadratic program (QCQP)
• Larger family; contains the family of QP
• The dual problem is not a QCQP, in general
QCQP (contd.)
• The feasible set is the intersection of ellipsoids and/or a polyhedron
• It is faster to solve a QP with a dedicated method than to use a QCQP
solver
SDP
min_{w∈IRd} c⊤w
subject to w1F1 + · · · + wnFn + G ⪯ 0
           aj⊤w = bj,   j = 1, . . . , P
• There is a linear matrix inequality (LMI) constraint
• Multiple LMIs reduce to an equivalent problem with just one LMI
• The dual problem of an SDP is also an SDP
• LP ⊆ QP ⊆ QCQP ⊆ · · · ⊆ SDP (LPs, QPs, QCQPs can be rewritten
as SDPs)
Bibliography
Lectures available at:
http://www.cs.ucl.ac.uk/staff/M.Pontil/courses/index-ATML10.htm
See also Boyd and Vandenberghe, Convex Optimization, 2004,
http://www.stanford.edu/boyd/cvxbook/
Secs. 2.1.4-2.2.5, 3.1.1, 3.1.5, 3.1.8, 3.2.1-3.2.3, 4.1.1, 4.2.1, 4.2.2, 4.3, 4.4, 4.6.2