Lecture 10: Subgradient Methods

April 29 - May 6, 2020
1 Convex optimization problem and ε-optimal solution

Consider the problem

$$\min_x \; f(x) \quad \text{subject to} \quad x \in C,$$

where $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$ is convex, with $f(x) = +\infty$ for $x \notin \operatorname{dom} f$, and $C$ is a closed convex set.

An ε-optimal solution $\widehat{x}$ satisfies

$$f(\widehat{x}) - f(x_\star) \le \varepsilon \quad \text{and} \quad \operatorname{dist}(\widehat{x}, C) \le \varepsilon,$$

where $x_\star \in C$ is an optimal solution.

Newton's method and interior point methods can be prohibitive due to their per-iteration complexity. The subgradient method is preferable when per-iteration cost matters: it achieves low-accuracy solutions very quickly.
1.1 Gradient descent and Newton's method

Linear approximation: $f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle$.

Quadratic approximations:

$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{L}{2}\|x - x^k\|_2^2,$$

$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{1}{2}(x - x^k)^T \nabla^2 f(x^k)(x - x^k).$$
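Minimizing each quadratic model over $x$ (a step the slides leave implicit) recovers the corresponding update rules:

$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k) \ \ \text{(gradient descent)}, \qquad x^{k+1} = x^k - \nabla^2 f(x^k)^{-1}\nabla f(x^k) \ \ \text{(Newton's method)}.$$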
2 Subgradient methods for C = R^n

For $k = 1, 2, \ldots$, choose any $g^k \in \partial f(x^k)$ and set

$$x^{k+1} = x^k - \alpha_k g^k = \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{2\alpha_k}\|x - x^k\|_2^2 \right\}.$$

The subgradient method is not in general a descent method; that is, $f(x^k)$ may not decrease. For example, let $f(x) = \|x\|_1$ and $x^1 = e_1$. We have

$$\partial f(x^1) = \left\{ e_1 + \sum_{i=2}^n t_i e_i : t_i \in [-1, 1] \right\}.$$

Any $g^1 \in \partial f(x^1)$ with $\sum_{i=2}^n |t_i| > 1$ is an ascent direction, i.e.,

$$f(x^1 - \alpha g^1) > f(x^1) \quad \forall \alpha > 0,$$

as the numerical sketch below illustrates.
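A minimal sketch (not from the slides) of the unconstrained subgradient iteration on $f(x) = \|x\|_1$. At coordinates where $x_j = 0$, any $t_j \in [-1, 1]$ gives a valid subgradient; choosing $t_j = 1$ at $x^1 = e_1$ makes $\sum_{i \ge 2}|t_i| = n - 1 > 1$, so the first step increases $f$:

```python
import numpy as np

def subgradient_l1(x):
    """A subgradient of f(x) = ||x||_1: sign(x_j) where x_j != 0,
    and an (arbitrary) choice t_j = 1 where x_j == 0."""
    g = np.sign(x)
    g[x == 0] = 1.0
    return g

f = lambda x: np.abs(x).sum()
x = np.zeros(5)
x[0] = 1.0                          # x^1 = e_1, so f(x^1) = 1
for k in range(1, 6):
    g = subgradient_l1(x)
    x = x - (0.1 / np.sqrt(k)) * g  # diminishing stepsize alpha_k = 0.1 / sqrt(k)
    print(k, f(x))                  # f increases on the first step: not a descent method
```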
[Figure: contour plot of a convex function. Subdifferential = blue zone + red zone; gray zone (negative of the blue zone): ascent directions; green zone (negative of the red zone): descent directions.]
Theorem 1 (Distance to a minimizer is decreasing)

If $0 \notin \partial f(x^k)$, then for any $g^k \in \partial f(x^k)$ and any $x_\star \in \operatorname*{argmin}_x f(x)$, there is a stepsize $\alpha > 0$ such that

$$\|x^k - \alpha g^k - x_\star\|_2 < \|x^k - x_\star\|_2.$$

Proof. Note that

$$\frac{1}{2}\|x^k - \alpha g^k - x_\star\|_2^2 = \frac{1}{2}\|x^k - x_\star\|_2^2 + \alpha\langle g^k, x_\star - x^k\rangle + \frac{\alpha^2}{2}\|g^k\|_2^2$$

and

$$f(x_\star) \ge f(x^k) + \langle g^k, x_\star - x^k\rangle.$$

Any $\alpha$ satisfying

$$0 < \alpha < \frac{2(f(x^k) - f(x_\star))}{\|g^k\|_2^2}$$

is as desired.
2.1 Assumptions and convergence analysis

- The function $f$ is convex.
- There is at least one (possibly non-unique) minimizing point $x_\star \in \operatorname*{argmin}_x f(x)$, with $f(x_\star) = \inf_x f(x) > -\infty$.
- The subgradients are bounded: for all $x$ and all $g \in \partial f(x)$, we have the subgradient bound $\|g\|_2 \le M < \infty$ (independently of $x$).

Theorem 2

Let $\alpha_k \ge 0$ be any non-negative sequence of stepsizes and let the above assumptions hold. The subgradient iteration generates a sequence $x^k$ that satisfies, for all $K \ge 1$,

$$\sum_{k=1}^K \alpha_k [f(x^k) - f(x_\star)] \le \frac{1}{2}\|x^1 - x_\star\|_2^2 + \frac{1}{2}\sum_{k=1}^K \alpha_k^2 M^2.$$
Corollary 3

Let $A_K = \sum_{k=1}^K \alpha_k$ and define

$$\bar{x}^K = \frac{1}{A_K}\sum_{k=1}^K \alpha_k x^k, \qquad x^K_{\mathrm{best}} = \operatorname*{argmin}_{x^k : k \le K} f(x^k).$$

Then for all $K \ge 1$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{\|x^1 - x_\star\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2 A_K}$$

and

$$f(x^K_{\mathrm{best}}) - f(x_\star) \le \frac{\|x^1 - x_\star\|_2^2 + \sum_{k=1}^K \alpha_k^2 M^2}{2 A_K}.$$
Whenever $\alpha_k \to 0$ and $\sum_{k=1}^\infty \alpha_k = \infty$, we have

$$\sum_{k=1}^K \alpha_k^2 \Bigg/ \sum_{k=1}^K \alpha_k \to 0,$$

and so $f(\bar{x}^K) - f(x_\star) \to 0$ as $K \to \infty$. Taking

$$\alpha_k = \frac{\|x^1 - x_\star\|_2}{M\sqrt{k}}$$

yields

$$f(\bar{x}^K) - f(x_\star) \le \frac{M\|x^1 - x_\star\|_2}{\sqrt{K}}.$$
Example: robust regression in robust statistics

Suppose we have data vectors $a_i \in \mathbb{R}^n$ and target responses $b_i \in \mathbb{R}$, $i = 1, \ldots, m$, and we would like to predict $b_i$ via the inner product $\langle a_i, x\rangle$ for some vector $x$. If there are outliers or other data corruptions in the targets $b_i$, a natural objective for this task is

$$f(x) = \frac{1}{m}\|Ax - b\|_1 = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x\rangle - b_i|,$$

where $A = \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}^T \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$. The subgradient is

$$g = \frac{1}{m} A^T \operatorname{sign}(Ax - b) = \frac{1}{m}\sum_{i=1}^m a_i \operatorname{sign}(\langle a_i, x\rangle - b_i) \in \partial f(x).$$

The accuracy of the subgradient method with different stepsizes varies greatly: the smaller the stepsize, the better the (final) performance of the iterates $x^k$, but initial progress is much slower.
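A sketch of the subgradient method on this objective with synthetic data (the seed, problem sizes, and stepsize below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + rng.standard_normal(m)

f = lambda x: np.abs(A @ x - b).mean()

def subgrad(x):
    # g = (1/m) A^T sign(Ax - b), an element of the subdifferential of f
    return A.T @ np.sign(A @ x - b) / m

x = np.zeros(n)
alpha = 0.1  # fixed stepsize: smaller alpha -> better final accuracy, slower start
for k in range(1000):
    x = x - alpha * subgrad(x)
print(f(x))
```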
[Figure: $f(x^k) - f(x_\star)$ for the subgradient method with fixed stepsizes $\alpha$.]

[Figure: $f(x^k_{\mathrm{best}}) - f(x_\star)$ for the subgradient method with fixed stepsizes $\alpha$.]
3 Projected subgradient methods for the constrained case

For $k = 1, 2, \ldots$, choose any $g^k \in \partial f(x^k)$ and set

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k),$$

where

$$\pi_C(x) = \operatorname*{argmin}_{y \in C} \|x - y\|_2.$$

The update is equivalent to (why? Exercise)

$$x^{k+1} = \operatorname*{argmin}_{x \in C} \left\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{2\alpha_k}\|x - x^k\|_2^2 \right\}.$$

It is very important in the projected subgradient method that the projection mapping $\pi_C$ be efficiently computable; the method is effective essentially only in problems where this is true.
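A generic sketch of the iteration, parameterized by a subgradient oracle and a projection (the function names are conventions of this sketch, not from the slides):

```python
import numpy as np

def projected_subgradient(subgrad, project, x0, stepsize, iters):
    """x_{k+1} = pi_C(x_k - alpha_k g_k), with g_k in the subdifferential at x_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        x = project(x - stepsize(k) * subgrad(x))
    return x
```

The examples that follow amount to supplying an efficient `project` for the particular constraint set $C$.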
Example: LASSO or compressed sensing applications,

$$\min_x \; \|Ax - b\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le 1.$$
Example: Suppose that $C = \{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$.

(1) $p = \infty$:

$$[\pi_C(x)]_j = \min\{1, \max\{x_j, -1\}\},$$

that is, we simply truncate the coordinates of $x$ to lie in the range $[-1, 1]$.

(2) $p = 2$:

$$\pi_C(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ x/\|x\|_2 & \text{otherwise.} \end{cases}$$

(3) $p = 1$: If $\|x\|_1 \le 1$, then $\pi_C(x) = x$. If $\|x\|_1 > 1$, then

$$[\pi_C(x)]_j = \operatorname{sign}(x_j)\,[\,|x_j| - t\,]_+,$$

where $t$ is the unique $t \ge 0$ satisfying

$$\sum_{j=1}^n [\,|x_j| - t\,]_+ = 1.$$
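A sketch of the three projections; the sorting-based computation of the threshold $t$ for the $\ell_1$ ball is a standard construction that the slides do not spell out:

```python
import numpy as np

def project_linf_ball(x):
    # Coordinate-wise truncation to [-1, 1]
    return np.clip(x, -1.0, 1.0)

def project_l2_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def project_l1_ball(x):
    # Soft-threshold at the level t solving sum_j [|x_j| - t]_+ = 1
    if np.abs(x).sum() <= 1.0:
        return x
    u = np.sort(np.abs(x))[::-1]            # |x| sorted in decreasing order
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, x.size + 1) > cssv)[0][-1]
    t = cssv[rho] / (rho + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```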
Example: Suppose that $C$ is an affine set represented by

$$C = \{x \in \mathbb{R}^n : Ax = b\},$$

where $A \in \mathbb{R}^{m\times n}$ with $m \le n$ is full rank (so that $A$ is a short and fat matrix and $AA^T \succ 0$). Then the projection of $x$ onto $C$ is

$$\pi_C(x) = (I - A^T(AA^T)^{-1}A)\,x + A^T(AA^T)^{-1}b.$$

If we begin the iterates from a point $x^k \in C$, i.e., with $Ax^k = b$, then

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k) = x^k - \alpha_k (I - A^T(AA^T)^{-1}A)\,g^k;$$

that is, we simply project $g^k$ onto the nullspace of $A$ and iterate.

For more examples and proofs, see FOMO §6.4.
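A short sketch of the affine projection; rewriting $\pi_C(x) = x - A^T(AA^T)^{-1}(Ax - b)$ lets us solve a linear system instead of forming the inverse:

```python
import numpy as np

def project_affine(A, b, x):
    # pi_C(x) = x - A^T (A A^T)^{-1} (A x - b); assumes A (m x n, m <= n) has full row rank
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)
```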
3.1 Assumptions and convergence analysis

- The function $f$ is convex.
- The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and

  $$\|x - x_\star\|_2 \le R < \infty$$

  for all $x \in C$, where $x_\star = \operatorname*{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.
- There exists $M < \infty$ such that $\|g\|_2 \le M$ for all $g \in \partial f(x)$ and all $x \in C$.

Theorem 4

Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and let the above assumptions hold. The projected subgradient iteration generates a sequence $x^k$ that satisfies, for all $K \ge 1$,

$$\sum_{k=1}^K [f(x^k) - f(x_\star)] \le \frac{R^2}{2\alpha_K} + \frac{1}{2}\sum_{k=1}^K \alpha_k M^2.$$
Corollary 5

Let $\alpha_k = \frac{\alpha}{\sqrt{k}}$ and define $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$. Then for all $K \ge 1$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2\alpha\sqrt{K}} + \frac{M^2\alpha}{\sqrt{K}}.$$

We see that convergence is guaranteed at the "best" rate $\frac{1}{\sqrt{K}}$ for all iterations. Here we say "best" because this rate is unimprovable: there are worst-case functions for which no method can achieve a rate of convergence faster than $\frac{RM}{\sqrt{K}}$.
4.1 Stochastic subgradient

Definition: A stochastic subgradient oracle for the function $f$ consists of a triple $(g, \mathcal{S}, P)$, where $\mathcal{S}$ is a sample space, $P$ is a probability distribution, and $g: \mathbb{R}^n \times \mathcal{S} \mapsto \mathbb{R}^n$ is a mapping that, for each fixed $x \in \operatorname{dom} f$, satisfies

$$\mathbb{E}[g(x, S)] = \int g(x, s)\,dP(s) \in \partial f(x),$$

where $S \in \mathcal{S}$ is a random variable with distribution $P$.

With some abuse of notation, we will use $g$ or $g(x)$ as shorthand for the random vector $g(x, S)$ when this does not cause confusion.

Definition: Let $f: \mathbb{R}^n \mapsto \mathbb{R} \cup \{+\infty\}$ be a convex function and fix $x \in \operatorname{dom} f$. Then a random vector $g$ is a stochastic subgradient for $f$ at the point $x$ if $\mathbb{E}[g] \in \partial f(x)$, i.e.,

$$f(y) \ge f(x) + \langle \mathbb{E}[g], y - x\rangle \quad \text{for all } y.$$
Example: Given a collection of functions $F: \mathbb{R}^n \times \mathcal{S} \mapsto \mathbb{R}$, where $\mathcal{S}$ is a sample space and for each $s \in \mathcal{S}$ the function $F(\cdot, s)$ is convex, the function

$$f(x) = \mathbb{E}[F(x, S)]$$

is convex when we take expectations over the random variable $S$, and taking

$$g(x, s) \in \partial F(x, s)$$

gives a stochastic subgradient with the property that

$$\mathbb{E}[g(x, S)] \in \partial f(x).$$
4.2 Stochastic programming

Consider the convex optimization problem

$$\min_{x \in C} \; f(x) = \mathbb{E}[F(x, S)],$$

where $C$ is a convex set, $S$ is a random variable on the space $\mathcal{S}$ with distribution $P$ (so the expectation $\mathbb{E}[F(x, S)]$ is taken according to $P$), and for each $s \in \mathcal{S}$ the function $x \mapsto F(x, s)$ is convex (therefore $f(x)$ is convex).

If $g(x, s) \in \partial_x F(x, s)$ and $S \sim P$, then $g = g(x, S)$ is a stochastic subgradient, because for all $y$,

$$f(y) = \mathbb{E}[F(y, S)] \ge \mathbb{E}[F(x, S) + \langle g(x, S), y - x\rangle] = f(x) + \langle \mathbb{E}[g(x, S)], y - x\rangle.$$
Example: Robust regression,

$$f(x) = \frac{1}{m}\|Ax - b\|_1 = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x\rangle - b_i|.$$

A natural stochastic subgradient is

$$g(x, i) = a_i \operatorname{sign}(\langle a_i, x\rangle - b_i),$$

where $i$ is drawn uniformly at random from $[m]$.

Advantage: computing $g(x, i)$ requires only $O(n)$ time (as opposed to $O(mn)$ time to compute $Ax - b$).

Generalization: given any problem with a large dataset $\{s_i\}_{i=1}^m$,

$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m F(x, s_i),$$

draw $i \in [m]$ uniformly at random and select $g \in \partial F(x, s_i)$.
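The $O(n)$ cost is visible in a sketch of the single-row draw (the function name is a convention of this sketch):

```python
import numpy as np

def stochastic_subgradient(A, b, x, rng):
    # g(x, i) = a_i * sign(<a_i, x> - b_i) for i ~ Uniform([m]); touches one row of A only
    i = rng.integers(A.shape[0])
    return A[i] * np.sign(A[i] @ x - b[i])
```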
4.3 Projected stochastic subgradient methods

Sometimes computing a stochastic subgradient is much easier than computing a subgradient: the expectation $\mathbb{E}[F(x, S)]$ is generally intractable to compute in many statistical and machine learning applications, and then it may be impossible to find a subgradient $g \in \partial f(x)$.

For $k = 1, 2, \ldots$, compute a stochastic subgradient $g^k$ at the point $x^k$, where

$$\mathbb{E}[g^k \mid x^k] \in \partial f(x^k),$$

and set

$$x^{k+1} = \pi_C(x^k - \alpha_k g^k).$$

This is essentially identical to the projected subgradient method, except that we replace the true subgradient with a stochastic subgradient.
Example (Robust regression): We consider

$$\min_x \; f(x) = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x\rangle - b_i| \quad \text{s.t.} \quad \|x\|_2 \le R,$$

using the random sample

$$g = a_i \operatorname{sign}(\langle a_i, x\rangle - b_i)$$

as our stochastic gradient. Set

$$A = \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix}^T, \qquad a_i \sim \mathcal{N}(0, I_{n\times n}) \ \text{(i.i.d.)},$$

and

$$b_i = \langle a_i, u\rangle + \varepsilon_i |\varepsilon_i|^3, \qquad \varepsilon_i \sim \mathcal{N}(0, 1) \ \text{(i.i.d.)},$$

where $u \sim \mathcal{N}(0, I_{n\times n})$. Set $n = 50$, $m = 100$, $R = 4$, and

$$\alpha_k = \frac{R}{M\sqrt{k}}, \qquad M^2 = \frac{1}{m}\|A\|_F^2.$$
[Figure: $f(x^k) - f(x_\star)$ versus $k$.]

Typical performance: the initial decrease is quite fast, but the method eventually stops making progress once it achieves some low accuracy (in this case $10^{-1}$). Each iteration costs $O(n)$, while each projected subgradient method iteration costs $O(mn)$.
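A sketch of this experiment under the stated parameters (the seed and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, R = 50, 100, 4.0
A = rng.standard_normal((m, n))
u = rng.standard_normal(n)
eps = rng.standard_normal(m)
b = A @ u + eps * np.abs(eps) ** 3          # heavy-tailed noise b_i = <a_i,u> + eps_i|eps_i|^3

M = np.linalg.norm(A, "fro") / np.sqrt(m)   # M^2 = (1/m) ||A||_F^2
f = lambda x: np.abs(A @ x - b).mean()

def project_l2(x, R):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else R * x / nrm

x = np.zeros(n)
for k in range(1, 5001):
    i = rng.integers(m)
    g = A[i] * np.sign(A[i] @ x - b[i])     # stochastic subgradient, O(n) per step
    x = project_l2(x - (R / (M * np.sqrt(k))) * g, R)
print(f(x))
```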
Example (Multiclass support vector machine):

In a general $m$-class classification problem, we represent the multiclass classifier using the matrix

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \in \mathbb{R}^{n\times m}.$$

The predicted class for a data vector $a \in \mathbb{R}^n$ is then

$$\operatorname*{argmax}_{l \in [m]} \langle a, x_l\rangle = \operatorname*{argmax}_{l \in [m]} [X^T a]_l,$$

where $\langle a, x_l\rangle$ is the "score" associated with class $l$.
Given training examples as pairs

$$(a_i, b_i) \in \mathbb{R}^n \times \{1, \ldots, m\}, \quad i = 1, \ldots, N,$$

the multiclass classifier $X$ can be determined by

$$\min_X \; f(X) = \frac{1}{N}\sum_{i=1}^N F(X, (a_i, b_i)) \quad \text{s.t.} \quad \|X\|_F \le R,$$

where the multiclass hinge loss function is

$$F(X, (a, b)) = \max_{l \ne b}\,[1 + \langle a, x_l - x_b\rangle]_+,$$

with $[t]_+ = \max\{t, 0\}$ denoting the positive part.
Set

$$\alpha_k = \frac{\alpha_1}{\sqrt{k}}, \qquad M^2 = \frac{1}{N}\sum_{i=1}^N \|a_i\|_2^2.$$

Stochastic subgradient method: draw $i \in [N]$ uniformly at random, then take

$$g^k \in \partial F(X^k, (a_i, b_i)).$$

Subgradient method:

$$g^k = \frac{1}{N}\sum_{i=1}^N g_i^k \in \partial f(X^k), \qquad g_i^k \in \partial F(X^k, (a_i, b_i)).$$
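A sketch of one subgradient of the multiclass hinge loss (the function name and 0-indexed labels are conventions of this sketch): if the maximizing class $l$ has positive margin, the subgradient places $a$ in column $l$ and $-a$ in column $b$; otherwise $F = 0$ and $0$ is a valid subgradient.

```python
import numpy as np

def hinge_subgradient(X, a, b):
    """A subgradient of F(X, (a, b)) = max_{l != b} [1 + <a, x_l - x_b>]_+
    with respect to X (n x m); the label b is 0-indexed."""
    scores = X.T @ a                  # per-class scores [X^T a]_l
    margins = 1.0 + scores - scores[b]
    margins[b] = -np.inf              # exclude l = b from the max
    l = int(np.argmax(margins))
    G = np.zeros_like(X)
    if margins[l] > 0:                # otherwise the positive part is 0
        G[:, l] += a
        G[:, b] -= a
    return G
```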
[Figure: $f(X^k) - f(X_\star)$ versus "effective passes through $A$".]
4.3.1 Assumptions and convergence analysis

- The function $f$ is convex.
- The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and

  $$\|x - x_\star\|_2 \le R < \infty$$

  for all $x \in C$, where $x_\star = \operatorname*{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.
- There exists $M < \infty$ such that $\mathbb{E}[\|g(x, S)\|_2^2] \le M^2$ for all $x \in C$ and all $g$ satisfying $\mathbb{E}[g(x, S)] \in \partial f(x)$.

Theorem 6

Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and let the above assumptions hold. The stochastic projected subgradient iteration generates a sequence $x^k$ that, with $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$, satisfies for all $K \ge 1$

$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2.$$
Corollary 7

Let the conditions of Theorem 6 hold and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then for all $K \ge 1$,

$$\mathbb{E}[f(\bar{x}^K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.$$

Corollary 8

Let $\alpha_k$ be non-summable but convergent to zero, that is,

$$\alpha_k \to 0, \qquad \sum_{k=1}^\infty \alpha_k = \infty.$$

Then $f(\bar{x}^K) - f(x_\star) \to 0$ in probability as $K \to \infty$; that is, for all $\varepsilon > 0$ we have

$$\limsup_{K\to\infty} \mathbb{P}[f(\bar{x}^K) - f(x_\star) \ge \varepsilon] = 0.$$
Theorem 9

Let the conditions of Theorem 6 hold and assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\varepsilon > 0$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\varepsilon$$

with probability at least $1 - e^{-\frac{1}{2}\varepsilon^2}$.

Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\frac{1}{2}\varepsilon^2}$; then

$$f(\bar{x}^K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}$$

with probability at least $1 - \delta$. That is, we have convergence of $O(MR/\sqrt{K})$ with high probability.
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Theorem 1 (Distance to a minimizer is decreasing)
If 0 isin partf(xk) then for any gk isin partf(xk) and any x983183 isin argminx f(x)there is a stepsize α gt 0 such that
983042xk minus αgk minus x9831839830422 lt 983042xk minus x9831839830422
Proof Note that
1
2983042xk minus αgk minus x98318398304222 =
1
2983042xk minus x98318398304222 + α〈gkx983183 minus xk〉+ α2
2983042gk98304222
andf(x983183) ge f(xk) + 〈gkx983183 minus xk〉
Any α satisfying
0 lt α lt2(f(xk)minus f(x983183))
983042gk98304222is desired
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 6 32
21 Assumptions and convergence analysis
The function f is convex
There is at least one (possibly non-unique) minimizing pointx983183 isin argmin
xf(x) with f(x983183) = inf
xf(x) gt minusinfin
The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)
Theorem 2
Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1
K983131
k=1
αk[f(xk)minus f(x983183)] le
1
2983042x1 minus x98318398304222 +
1
2
K983131
k=1
α2kM
2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32
Corollary 3
Let AK =
K983131
k=1
αk and define xK =1
AK
K983131
k=1
αkxk xK
best = argminxkkleK
f(xk)
Then for all K ge 1
f(xK)minus f(x983183) le983042x1 minus x98318398304222 +
K983131
k=1
α2kM
2
2AK
and
f(xKbest)minus f(x983183) le
983042x1 minus xlowast98304222 +K983131
k=1
α2kM
2
2AK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32
Whenever αk rarr 0 and
infin983131
k=1
αk = infin we have
K983131
k=1
α2k
983089 K983131
k=1
αk rarr 0
and sof(xK)minus f(x983183) rarr 0 as K rarr infin
Taking
αk =983042x1 minus x9831839830422
Mradick
yields
f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic
K
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32
Example robust regression in robust statistics
Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
where A =983045a1 middot middot middot am
983046T isin Rmtimesn and b isin Rm The subgradient
g =1
mATsign(Axminus b) =
1
m
m983131
i=1
aisign(〈aix〉 minus bi) isin partf(x)
The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32
f(xk)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
21 Assumptions and convergence analysis
The function f is convex
There is at least one (possibly non-unique) minimizing pointx983183 isin argmin
xf(x) with f(x983183) = inf
xf(x) gt minusinfin
The subgradients are bounded for all x and all g isin partf(x) wehave the subgradient bound 983042g9830422 le M lt infin (independently of x)
Theorem 2
Let αk ge 0 be any non-negative sequence of stepsizes and the aboveassumptions hold The subgradient iteration generates the sequencexk that satisfies for all K ge 1
K983131
k=1
αk[f(xk)minus f(x983183)] le
1
2983042x1 minus x98318398304222 +
1
2
K983131
k=1
α2kM
2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 7 32
Corollary 3
Let AK =
K983131
k=1
αk and define xK =1
AK
K983131
k=1
αkxk xK
best = argminxkkleK
f(xk)
Then for all K ge 1
f(xK)minus f(x983183) le983042x1 minus x98318398304222 +
K983131
k=1
α2kM
2
2AK
and
f(xKbest)minus f(x983183) le
983042x1 minus xlowast98304222 +K983131
k=1
α2kM
2
2AK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32
Whenever αk rarr 0 and
infin983131
k=1
αk = infin we have
K983131
k=1
α2k
983089 K983131
k=1
αk rarr 0
and sof(xK)minus f(x983183) rarr 0 as K rarr infin
Taking
αk =983042x1 minus x9831839830422
Mradick
yields
f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic
K
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32
Example robust regression in robust statistics
Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
where A =983045a1 middot middot middot am
983046T isin Rmtimesn and b isin Rm The subgradient
g =1
mATsign(Axminus b) =
1
m
m983131
i=1
aisign(〈aix〉 minus bi) isin partf(x)
The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32
f(xk)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Corollary 3
Let AK =
K983131
k=1
αk and define xK =1
AK
K983131
k=1
αkxk xK
best = argminxkkleK
f(xk)
Then for all K ge 1
f(xK)minus f(x983183) le983042x1 minus x98318398304222 +
K983131
k=1
α2kM
2
2AK
and
f(xKbest)minus f(x983183) le
983042x1 minus xlowast98304222 +K983131
k=1
α2kM
2
2AK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 8 32
Whenever αk rarr 0 and
infin983131
k=1
αk = infin we have
K983131
k=1
α2k
983089 K983131
k=1
αk rarr 0
and sof(xK)minus f(x983183) rarr 0 as K rarr infin
Taking
αk =983042x1 minus x9831839830422
Mradick
yields
f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic
K
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32
Example robust regression in robust statistics
Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
where A =983045a1 middot middot middot am
983046T isin Rmtimesn and b isin Rm The subgradient
g =1
mATsign(Axminus b) =
1
m
m983131
i=1
aisign(〈aix〉 minus bi) isin partf(x)
The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32
f(xk)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Whenever αk rarr 0 and
infin983131
k=1
αk = infin we have
K983131
k=1
α2k
983089 K983131
k=1
αk rarr 0
and sof(xK)minus f(x983183) rarr 0 as K rarr infin
Taking
αk =983042x1 minus x9831839830422
Mradick
yields
f(xK)minus f(x983183) leM983042x1 minus x9831839830422radic
K
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 9 32
Example robust regression in robust statistics
Suppose we have a sequence of data vectors ai isin Rn and targetresponses bi isin R i = 1 m and we would like to predict bi via theinner product 〈aix〉 for some vector x If there are outliers orother data corruptions in the targets bi a natural objective forthis task is
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
where A =983045a1 middot middot middot am
983046T isin Rmtimesn and b isin Rm The subgradient
g =1
mATsign(Axminus b) =
1
m
m983131
i=1
aisign(〈aix〉 minus bi) isin partf(x)
The accuracy of the subgradient methods with different stepsizesvaries greatly the smaller the stepsize the better the (final)performance of the iterates xk but initial progress is much slower
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 10 32
f(xk)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function $f$ is convex.
The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and
$$\|x - x_\star\|_2 \le R < \infty$$
for all $x \in C$, where $x_\star = \operatorname*{argmin}_{x\in C} f(x)$ and $f(x_\star) > -\infty$.
There exists $M < \infty$ such that $\mathbb{E}[\|g(x,S)\|_2^2] \le M^2$ for all $x \in C$ and all $g$ satisfying $\mathbb{E}[g(x,S)] \in \partial f(x)$.
Theorem 6
Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and the above assumptions hold. The stochastic projected subgradient iteration generates a sequence $\{x^k\}$ that satisfies, for all $K \ge 1$,
$$\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k, \qquad \mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2.$$
Corollary 7
Let the conditions of Theorem 6 hold and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then for all $K \ge 1$,
$$\mathbb{E}[f(\bar{x}^K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.$$
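The constant follows from Theorem 6 by a short computation (not spelled out on the slide): with $\alpha_k = R/(M\sqrt{k})$,
$$\frac{R^2}{2K\alpha_K} = \frac{R^2 M\sqrt{K}}{2KR} = \frac{RM}{2\sqrt{K}}, \qquad \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 = \frac{RM}{2K}\sum_{k=1}^K \frac{1}{\sqrt{k}} \le \frac{RM}{2K}\cdot 2\sqrt{K} = \frac{RM}{\sqrt{K}},$$
using $\sum_{k=1}^K k^{-1/2} \le \int_0^K t^{-1/2}\,dt = 2\sqrt{K}$; adding the two terms gives $\frac{3RM}{2\sqrt{K}}$.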
Corollary 8
Let $\{\alpha_k\}$ be non-summable but convergent to zero, that is,
$$\alpha_k \to 0, \qquad \sum_{k=1}^{\infty} \alpha_k = \infty.$$
Then $f(\bar{x}^K) - f(x_\star) \to 0$ (in probability) as $K \to \infty$; that is, for all $\epsilon > 0$ we have
$$\limsup_{K\to\infty} \; \mathbb{P}[f(\bar{x}^K) - f(x_\star) \ge \epsilon] = 0.$$
Theorem 9
Let the conditions of Theorem 6 hold and assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,
$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}}\,\epsilon$$
with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$.
Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\frac{1}{2}\epsilon^2}$; we have
$$f(\bar{x}^K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}$$
with probability at least $1 - \delta$. That is, we have convergence of $O(MR/\sqrt{K})$ with high probability.
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
f(xk)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 11 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
f(xkbest)minus f(x983183) for subgradient method with fixed stepsizes α
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 12 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
3 Projected subgradient methods for constrained case
For k = 1 2 choose any gk isin partf(xk) set
xk+1 = πC(xk minus αkg
k)
whereπC(x) = argmin
yisinC983042xminus y9830422
The update is equivalent to (why Exercise)
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
2αk983042xminus xk98304222
983070
It is very important in the projected subgradient method that theprojection mapping πC be efficiently computable ndash the method iseffective essentially only in problems where this is true
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 13 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Example LASSO or Compressed sensing applications
minx
983042Axminus b98304222 subject to 983042x9830421 le 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 14 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Example Suppose that C = x 983042x983042p le 1 for p = 1 2infin
(1) p = infin[πC(x)]j = min1maxxj minus1
that is we simply truncate the coordinates of x to be in the range[minus1 1]
(2) p = 2
πC(x) =
983069x if 983042x9830422 le 1x983042x9830422 otherwise
(3) p = 1 If 983042x9830421 le 1 then πC(x) = x If 983042x9830421 gt 1 then
[πC(x)]j = sign(xj)[|xj |minus t]+
where t is the unique t ge 0 satisfying
n983131
j=1
[|xj |minus t]+ = 1
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 15 32
Example Suppose that C is an affine set represented by
C = x isin Rn Ax = b
where A isin Rmtimesn with m le n is full rank (So that A is a shortand fat matrix and AAT ≻ 0) Then the projection of x onto C is
πC(x) = (IminusAT(AAT)minus1A)x+AT(AAT)minus1b
If we begin the iterates from a point xk isin C ie with Axk = bthen
xk+1 = πC(xk minus αkg
k) = xk minus αk(IminusAT(AAT)minus1A)gk
that is we simply project gk onto the nullspace of A and iterate
For more examples and proofs see FOMO sect64
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 16 32
31 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that 983042g9830422 le M forall g isin partf(x) forall x isin C
Theorem 4
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The projected subgradient iteration generates thesequence xk that satisfies for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
2αK+
1
2
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 17 32
Corollary 5
Let αk =αradick
and define xK =1
K
K983131
k=1
xk Then for all K ge 1
f(xK)minus f(x983183) leR2
2αradicK
+M2αradic
K
We see that convergence is guaranteed at the ldquobestrdquo rate1radicK
for
all iterations Here we say ldquobestrdquo because this rate isunimprovable ndash there are worst case functions for which no
method can achieve a rate of convergence faster thanRMradicK
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 18 32
41 Stochastic subgradient
Definition A stochastic subgradient oracle for the function fconsists of a triple (g SP) where S is a sample space P is aprobability distribution and g Rn times S 983041rarr Rn is a mapping thatfor each fixed x isin domf satisfies
E[g(xS)] =983133
g(x s)dP(s) isin partf(x)
where S isin S is a random variable with distribution P
With some abuse of notation we will use g or g(x) for shorthandof the random vector g(xS) when this does not cause confusion
Definition Let f Rn 983041rarr R cup +infin be a convex function and fixx isin domf Then a random vector g is a stochastic subgradient forf at the point x if E[g] isin partf(x) or
f(y) ge f(x) + 〈E[g]y minus x〉 for all y
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 19 32
Example Given a collection of functions F Rn times S 983041rarr R where Sis a sample space and for each s isin S the function F (middot s) is convexthen
f(x) = E[F (xS)]
is convex when we take expectations over random variable S andtaking
g(x s) isin partF (x s)
gives a stochastic subgradient with the property that
E[g(xS)] isin partf(x)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 20 32
42 Stochastic programming
Consider the convex optimization problem
minxisinC
f(x) = E[F (xS)]
where C is a convex set S is a random variable on the space Swith distribution P (so the expectation E[F (xS)] is takenaccording to P) and for each s isin S the function x 983041rarr F (x s) isconvex (therefore f(x) is convex)
If g(x s) isin partxF (x s) and S sim P then g = g(xS) is a stochasticsubgradient because for all y
f(y) = E[F (yS)]
ge E[F (xS) + 〈g(xS)y minus x〉]= f(x) + 〈E[g(xS)]y minus x〉
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 21 32
Example Robust regression
f(x) =1
m983042Axminus b9830421 =
1
m
m983131
i=1
|〈aix〉 minus bi|
A natural stochastic subgradient is
g(x i) = aisign(〈aix〉 minus bi)
where i is uniformly at random draw from [m]
Advantage Note that we requires time only O(n) to computeg(x i) (as opposed to O(mn) to compute Axminus b)
Generalization Given any problem with large dataset simi=1
minx
f(x) =1
m
m983131
i=1
F (x si)
Drawing i isin [m] uniformly at random and selecting g isin partF (x si)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 22 32
43 Projected stochastic subgradient methods
Sometimes computing stochastic subgradient is much easier thancomputing subgradient
The expectation E[F (xS)] is generally intractable to compute inmany statistical and machine learning applications Then it maybe impossible to find a subgradient g isin partf(x)
For k = 1 2 compute a stochastic subgradient gk at the pointxk where
E[gk|xk] isin partf(xk)
Setxk+1 = πC(x
k minus αkgk)
This is essentially identical to the projected subgradient methodexcept that we replace the true subgradient with a stochasticsubgradient
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 23 32
Example (Robust regression) We consider
minx
f(x) =1
m
m983131
i=1
|〈aix〉 minus bi| st 983042x9830422 le R
using the random sample
g = aisign(〈aix〉 minus bi)
as our stochastic gradient Set
A =983045a1 middot middot middot am
983046T ai sim N (0 Intimesn) (iid)
andbi = 〈aiu〉+ εi|εi|3 εi sim N (0 1) (iid)
where u sim N (0 Intimesn) Set n = 50 m = 100 R = 4 and
α =R
Mradick M2 =
1
m983042A9830422F
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 24 32
f(xk)minus f(x983183) versus k
Typical performance the initial decrease is quite fast but themethod eventually stops making progress once it achieves somelow accuracy (in this case 10minus1) Each iteration O(n) while eachprojected subgradient method iteration of O(mn)
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 25 32
Example (Multiclass support vector machine)
In a general m-class classification problem we represent themulticlass classifier using the matrix
X =983045x1 x2 middot middot middot xm
983046isin Rntimesm
The predicted class for a data vector a isin Rn is then
argmaxlisin[m]
〈axl〉 = argmaxlisin[m]
[XTa]l
where 〈axl〉 is the ldquoscorerdquo associated with class l
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 26 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set

$$\alpha_k = \frac{\alpha_1}{\sqrt{k}}, \quad M^2 = \frac{1}{N} \sum_{i=1}^N \|a_i\|_2^2.$$

Stochastic subgradient method: draw $i \in [N]$ uniformly at random, then take

$$g^k \in \partial F(X^k, (a_i, b_i)).$$

Subgradient method: take

$$g^k = \frac{1}{N} \sum_{i=1}^N g_i^k \in \partial f(X^k), \qquad g_i^k \in \partial F(X^k, (a_i, b_i)).$$
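As a sketch of the two updates (illustrative names, not the lecture's code): the hinge-loss subgradient places $a$ on the column of the worst violating class and $-a$ on the true class's column when the loss is active; the stochastic step touches one example per iteration, the batch step all $N$.

```python
# Subgradient of F(X, (a, b)) = max_{l != b} [1 + <a, x_l - x_b>]_+,
# plus one projected step of each method (Frobenius-ball constraint).
import numpy as np

def hinge_subgradient(X, a, b):
    scores = X.T @ a                     # [X^T a]_l: score of each class l
    margins = 1.0 + scores - scores[b]   # 1 + <a, x_l - x_b>
    margins[b] = -np.inf                 # exclude l = b from the max
    j = int(np.argmax(margins))
    G = np.zeros_like(X)
    if margins[j] > 0:                   # loss active: a on col j, -a on col b
        G[:, j] = a
        G[:, b] -= a
    return G                             # otherwise F = 0 and 0 is a subgradient

def frob_project(X, R):
    nrm = np.linalg.norm(X)              # Frobenius norm for 2-D arrays
    return X if nrm <= R else (R / nrm) * X

def stochastic_step(X, A, labels, alpha, R, rng):
    i = rng.integers(len(labels))        # one sampled example per iteration
    return frob_project(X - alpha * hinge_subgradient(X, A[i], labels[i]), R)

def batch_step(X, A, labels, alpha, R):
    G = sum(hinge_subgradient(X, a, b) for a, b in zip(A, labels)) / len(labels)
    return frob_project(X - alpha * G, R)   # full subgradient: N times the work
```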
[Figure: $f(X^k) - f(X^\star)$ versus "effective passes through $A$".]
4.3.1 Assumptions and convergence analysis

The function $f$ is convex.

The set $C \subseteq \operatorname{int} \operatorname{dom} f$ is compact and convex, and

$$\|x - x_\star\|_2 \le R < \infty$$

for all $x \in C$, where $x_\star = \operatorname{argmin}_{x \in C} f(x)$ and $f(x_\star) > -\infty$.

There exists $M < \infty$ such that $\mathbb{E}[\|g(x, S)\|_2^2] \le M^2$ for all $x \in C$ and all $g$ satisfying $\mathbb{E}[g(x, S)] \in \partial f(x)$.

Theorem 6. Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes and let the above assumptions hold. The stochastic projected subgradient iteration generates a sequence $\{x^k\}$ such that, for all $K \ge 1$, the average $\bar{x}^K = \frac{1}{K} \sum_{k=1}^K x^k$ satisfies

$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K} \sum_{k=1}^K \alpha_k M^2.$$
Corollary 7. Let the conditions of Theorem 6 hold, and let $\alpha_k = \frac{R}{M\sqrt{k}}$ for each $k$. Then for all $K \ge 1$,

$$\mathbb{E}[f(\bar{x}^K)] - f(x_\star) \le \frac{3RM}{2\sqrt{K}}.$$
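The constant follows directly from Theorem 6: substituting $\alpha_k = R/(M\sqrt{k})$ and using $\sum_{k=1}^K 1/\sqrt{k} \le 2\sqrt{K}$ gives

$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{R^2}{2K\alpha_K} + \frac{M^2}{2K} \sum_{k=1}^K \alpha_k = \frac{RM}{2\sqrt{K}} + \frac{RM}{2K} \sum_{k=1}^K \frac{1}{\sqrt{k}} \le \frac{RM}{2\sqrt{K}} + \frac{RM}{\sqrt{K}} = \frac{3RM}{2\sqrt{K}}.$$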
Corollary 8. Let $\alpha_k$ be non-summable but convergent to zero, that is,

$$\alpha_k \to 0, \qquad \sum_{k=1}^\infty \alpha_k = \infty.$$

Then $f(\bar{x}^K) - f(x_\star) \to 0$ in probability as $K \to \infty$; that is, for all $\epsilon > 0$ we have

$$\limsup_{K \to \infty} \mathbb{P}\big[f(\bar{x}^K) - f(x_\star) \ge \epsilon\big] = 0.$$
Theorem 9. Let the conditions of Theorem 6 hold, and assume that $\|g\|_2 \le M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,

$$f(\bar{x}^K) - f(x_\star) \le \frac{R^2}{2K\alpha_K} + \frac{1}{2K} \sum_{k=1}^K \alpha_k M^2 + \frac{RM}{\sqrt{K}} \epsilon$$

with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$.

Let $\alpha_k = \frac{R}{M\sqrt{k}}$ and set $\delta = e^{-\frac{1}{2}\epsilon^2}$; then

$$f(\bar{x}^K) - f(x_\star) \le \frac{3RM}{2\sqrt{K}} + \frac{MR\sqrt{-2\log\delta}}{\sqrt{K}}$$

with probability at least $1 - \delta$. That is, we have convergence of order $O(MR/\sqrt{K})$ with high probability.
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Given training examples as pairs
(ai bi) isin Rn times 1 m i = 1 N
The multiclass classifier X can be determined by
minX
f(X) =1
N
N983131
i=1
F (X (ai bi)) st 983042X983042F le R
where the multiclass hinge loss function
F (X (a b)) = maxl ∕=b
[1 + 〈axl minus xb〉]+
with[t]+ = maxt 0
denotes the positive part
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 27 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Set
αk =α1radick M2 =
1
N
N983131
i=1
983042ai98304222
Stochastic subgradient method
Set i isin [N ] uniformly at random then take
gk isin partF (Xk (ai bi))
Subgradient method
gk =1
N
N983131
i=1
gki isin partf(Xk) gk
i isin partF(Xk (ai bi))
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 28 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
f(Xk)minus f(Xlowast) versus ldquoeffective passes through Ardquo
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 29 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
431 Assumptions and convergence analysis
The function f is convex
The set C sube intdomf is compact and convex and
983042xminus x9831839830422 le R lt infin
for all x isin C where x983183 = argminxisinC f(x) and f(x983183) gt minusinfin
There exists M lt infin such that E[983042g(xS)98304222] le M2 for all x isin Cand all g satisfying E[g(xS)] isin partf(x)
Theorem 6
Let αk gt 0 be any non-increasing sequence of stepsizes and the aboveassumptions hold The stochastic projected subgradient iterationgenerates the sequence xk that satisfies for all K ge 1
xK =1
K
K983131
k=1
xk E[f(xK)minus f(x983183)] leR2
2KαK+
1
2K
K983131
k=1
αkM2
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 30 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Corollary 7
Let the conditions of Theorem 6 hold and let αk =R
Mradick
for each k
Then for all K ge 1
E[f(xK)]minus f(x983183) le3RM
2radicK
Corollary 8
Let αk be non-summable but convergent to zero that is
αk rarr 0983131infin
k=1αk = infin
Then f(xK)minus f(x983183) rarr 0 (in probability) as K rarr infin that is for all983171 gt 0 we have
lim supKrarrinfin
P[f(xK)minus f(x983183) ge 983171] = 0
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 31 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32
Theorem 9
Let the conditions of Theorem 6 hold and assume that 983042g9830422 le M forall stochastic subgradients g Then for any 983171 gt 0
f(xK)minus f(xlowast) le R2
2KαK+
1
2K
K983131
k=1
αkM2 +
RMradicK
983171
with probability at least 1minus eminus129831712
Let αk =R
Mradickand set δ = eminus
129831712 we have
f(xK)minus f(xlowast) le 3RM
2radicK
+MR
radicminus2 log δradicK
with probability at least 1minus δ That is we have convergence ofO(MR
radicK) with high probability
Subgradient Methods DAMC Lecture 10 April 29 - May 6 2020 32 32