Proximal methods for minimizing convex compositions
Courtney Paquette, joint work with D. Drusvyatskiy
Department of Mathematics, University of Washington (Seattle)
WCOM Fall 2016, October 1, 2016
A function f is α-convex and β-smooth if

q_x ≤ f ≤ Q_x

where

q_x(y) = f(x) + 〈∇f(x), y − x〉 + (α/2)‖y − x‖²
Q_x(y) = f(x) + 〈∇f(x), y − x〉 + (β/2)‖y − x‖²

Condition number: κ = β/α
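As a quick illustration (not from the talk), the following sketch numerically checks the sandwich q_x ≤ f ≤ Q_x for a quadratic f(x) = ½ xᵀAx, where α and β are the smallest and largest eigenvalues of A; the particular matrix is an arbitrary choice.

```python
import numpy as np

# Check q_x(y) <= f(y) <= Q_x(y) for f(x) = 0.5 x^T A x.
# For this f, alpha = lambda_min(A) and beta = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])            # symmetric positive definite (illustrative)
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lin = f(x) + grad(x) @ (y - x)                 # first-order model at x, evaluated at y
    q = lin + 0.5 * alpha * np.sum((y - x) ** 2)   # lower model q_x(y)
    Q = lin + 0.5 * beta * np.sum((y - x) ** 2)    # upper model Q_x(y)
    assert q - 1e-9 <= f(y) <= Q + 1e-9

print("condition number kappa =", beta / alpha)
```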
Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)
Majorization view: x_{k+1} = argmin_y Q_{x_k}(y)

Table: Iterations until f(x_k) − f* < ε

                     β-smooth     α-convex
Gradient descent     β/ε          κ · log(1/ε)
Optimal methods      √(β/ε)       √κ · log(1/ε)

(Nesterov '83, Yudin-Nemirovsky '83)
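A minimal sketch of the gradient-descent/majorization step above; the quadratic test function and the choice β = λ_max(A) are illustrative assumptions, not from the talk.

```python
import numpy as np

# Gradient descent as repeated minimization of the upper model Q_{x_k}:
# x_{k+1} = argmin_y Q_{x_k}(y) = x_k - (1/beta) * grad f(x_k).

def gradient_descent(grad, x0, beta, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - g / beta             # exact minimizer of the quadratic upper model
    return x

# Illustrative beta-smooth, alpha-convex example: f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
beta = np.linalg.eigvalsh(A)[-1]     # f is beta-smooth with beta = lambda_max(A)

x_gd = gradient_descent(lambda x: A @ x - b, np.zeros(2), beta)
print(x_gd, np.linalg.solve(A, b))   # the two should agree to high accuracy
```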
General set-up

The convex-composite problem is

min_x F(x) := h(c(x)) + g(x)

where
c : R^n → R^m is C^1-smooth with β-Lipschitz Jacobian,
h : R^m → R is closed, convex, and L-Lipschitz,
g : R^n → R ∪ {∞} is convex.

For convenience, set µ := Lβ.

Examples:
Additive composite minimization: min_x c(x) + g(x)
Nonlinear least squares: min_x { ‖c(x)‖ : ℓ_i ≤ x_i ≤ u_i, i = 1, ..., n }
Exact penalty subproblem: min_x g(x) + dist_K(c(x))
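To make the set-up concrete, here is a small sketch (with a placeholder map c, chosen only for illustration) of the nonlinear least-squares example written in the composite form F(x) = h(c(x)) + g(x), with h = ‖·‖ and g the indicator of a box.

```python
import numpy as np

# Composite form F(x) = h(c(x)) + g(x) for the nonlinear least-squares example.
# The inner map c below is a made-up placeholder; h and g follow the slide.

def c(x):                        # smooth map c : R^2 -> R^2 (illustrative)
    return np.array([x[0] ** 2 + x[1] - 1.0, x[0] - x[1] ** 2])

def h(z):                        # closed, convex, 1-Lipschitz outer function
    return np.linalg.norm(z)

def g(x, lo=-1.0, hi=1.0):       # indicator of the box {lo <= x_i <= hi}
    return 0.0 if np.all((x >= lo) & (x <= hi)) else np.inf

def F(x):
    return h(c(x)) + g(x)

print(F(np.array([0.5, 0.5])))   # finite: the point satisfies the box constraint
print(F(np.array([2.0, 0.0])))   # +inf: the box constraint is violated
```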
Prox-Linear algorithm: base case

Seek points x that are first-order stationary: F′(x; v) ≥ 0 for all v. Equivalently,

0 ∈ ∂g(x) + ∇c(x)*∂h(c(x)).

Idea: majorization. For all y,

F(y) ≤ h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Prox-linear mapping:

x⁺ := argmin_y { h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y) }

Prox-linear method: x_{k+1} = x_k⁺

(Burke '85, '91, Fletcher '82, Powell '84, Wright '90, Yuan '83)
Examples: proximal gradient, Levenberg-Marquardt.

The prox-gradient: G(x) = µ(x − x⁺)
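For intuition, here is a sketch of one prox-linear step in the classical smooth special case h = ½‖·‖², g = 0, where the subproblem has a closed form and the method reduces to Levenberg-Marquardt; the inner map c below is an illustrative placeholder. For nonsmooth h and nonzero g, the subproblem is instead a structured convex program.

```python
import numpy as np

# Prox-linear step for h = 0.5||.||^2, g = 0 (Levenberg-Marquardt):
#   x+ = argmin_y 0.5||c(x) + J(x)(y - x)||^2 + (mu/2)||y - x||^2
#      = x - (J^T J + mu I)^{-1} J^T c(x).

def prox_linear_step_lm(c, jac, x, mu):
    J, r = jac(x), c(x)
    return x - np.linalg.solve(J.T @ J + mu * np.eye(x.size), J.T @ r)

# Illustrative system c(x) = 0 (not from the talk).
c = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])
jac = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

x = np.array([2.0, 0.5])
for _ in range(30):
    x = prox_linear_step_lm(c, jac, x, mu=1.0)
print(x, np.linalg.norm(c(x)))   # x should approach (1/sqrt(2), 1/sqrt(2))
```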
Prox-Linear algorithm

Convergence rate: ‖G(x_k)‖² < ε after O(µ²/ε) iterations.

What does ‖G(x_k)‖² < ε mean?

dist(0, ∂F(u_k)) ≤ 5‖G(x_k)‖ with ‖u_k − x_k‖ ≈ ‖G(x_k)‖
Proof: Ekeland's variational principle (Lewis-Drusvyatskiy '16)

For nonconvex problems, the rate of O(µ²/ε) iterations until ‖G(x_k)‖² < ε is "essentially" the best possible.
Solving the Sub-problem: Inexact Prox-Linear

The prox-linear method requires solving

min_y φ(y, x) := h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Suppose we cannot solve it exactly:

φ(x⁺, x) ≤ min_y φ(y, x) + ε,

where x⁺ is an ε-approximate optimal solution.

Question: How accurately do we need to solve the subproblem to guarantee the same overall rate for the prox-linear method?
Inexact Prox-Linear Algorithm

We want to bound G(x_k) = µ(x_k − x*_{k+1}), where x*_{k+1} is the true optimal point of the sub-problem.

Thm (Drusvyatskiy-P. '16): Suppose x_{i+1} is an ε_{i+1}-approximate optimal solution. Then

min_{i=1,...,k} ‖G(x_i)‖² ≤ O( (µ + Σ_{i=1}^k ε_i) / k ).

This generalizes (Schmidt-Le Roux-Bach '11).

Question: Design an acceleration scheme with
1. the optimal rate for convex problems,
2. a rate no worse than the prox-gradient method for nonconvex problems,
3. the ability to detect convexity of the function.
Acceleration

Measuring non-convexity: write

h ◦ c(x) = sup_w { 〈w, c(x)〉 − h*(w) }.

Fact 1: h ◦ c is convex if x ↦ 〈w, c(x)〉 is convex for all w ∈ dom h*.
Fact 2: x ↦ 〈w, c(x)〉 + (µ/2)‖x‖² is convex for all w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that x ↦ 〈w, c(x)〉 + ρ·(µ/2)‖x‖² is convex for all w ∈ dom h*.
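For completeness, here is a short standard argument behind Fact 2, using only the assumptions from the set-up that h is L-Lipschitz and ∇c is β-Lipschitz; it is a sketch, not a statement from the talk.

```latex
% Since h is L-Lipschitz, every w in dom h* satisfies ||w|| <= L.
% For such w, set phi_w(x) = <w, c(x)>, so that nabla phi_w(x) = nabla c(x)^* w and
\[
  \|\nabla\varphi_w(x) - \nabla\varphi_w(y)\|
    \le \|\nabla c(x) - \nabla c(y)\|_{\mathrm{op}}\,\|w\|
    \le \beta L \,\|x - y\| = \mu \,\|x - y\|.
\]
% A function with mu-Lipschitz gradient is mu-weakly convex, i.e.
\[
  x \mapsto \langle w, c(x)\rangle + \tfrac{\mu}{2}\|x\|^2
  \quad \text{is convex for all } w \in \operatorname{dom} h^*,
\]
% which is Fact 2; the parameter rho records how much of this correction is actually needed.
```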
Acceleration

Algorithm 1: Accelerated prox-linear method
Initialize: Fix two points x_0, v_0 ∈ dom g.
1: while ‖G(y_{k−1})‖ > ε do
2:   a_k ← 2/(k + 1)
3:   y_k ← a_k v_{k−1} + (1 − a_k) x_{k−1}
4:   x_k ← y_k⁺
5:   v_k ← argmin_z g(z) + (1/a_k)·h(c(y_k) + a_k∇c(y_k)(z − v_{k−1})) + (a_k/(2t))‖z − v_{k−1}‖²
6:   k ← k + 1
7: end

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} ‖G(x_i)‖² ≤ O(µ²/k³) + ρ·O(µ²R²/k), where R = diam(dom g).

This generalizes (Ghadimi-Lan '16) for additive composite problems.
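Below is a minimal sketch of Algorithm 1 for one concrete instance, assuming h = ‖·‖₂, g the indicator of the box [−1, 1]ⁿ (so dom g is bounded), and cvxpy to solve the two convex subproblems; the maps c and ∇c, the constants µ and t, and the fixed iteration count (used in place of the ‖G(y_{k−1})‖ > ε test) are all illustrative placeholders.

```python
import numpy as np
import cvxpy as cp

# Sketch of the accelerated prox-linear method for F(x) = ||c(x)|| + g(x),
# with g the indicator of the box [-1, 1]^n. Each subproblem is a small
# convex program, solved here with cvxpy.

def solve_box_subproblem(objective, z):
    cp.Problem(cp.Minimize(objective), [cp.abs(z) <= 1]).solve()
    return z.value

def accelerated_prox_linear(c, jac, x0, mu, t, iters=50):
    x, v = x0.copy(), x0.copy()
    n = x0.size
    for k in range(1, iters + 1):
        a = 2.0 / (k + 1)
        y = a * v + (1 - a) * x                          # step 3: extrapolation point
        # Step 4: x_k <- y_k^+ (prox-linear step at y_k)
        z = cp.Variable(n)
        obj = cp.norm(c(y) + jac(y) @ (z - y), 2) + (mu / 2) * cp.sum_squares(z - y)
        x = solve_box_subproblem(obj, z)
        # Step 5: v_k update with the scaled linearization around v_{k-1}
        z = cp.Variable(n)
        obj = (1 / a) * cp.norm(c(y) + a * jac(y) @ (z - v), 2) \
              + (a / (2 * t)) * cp.sum_squares(z - v)
        v = solve_box_subproblem(obj, z)
    return x

# Illustrative instance (placeholder data, not from the talk).
c = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 0.5, x[0] - x[1]])
jac = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
x_final = accelerated_prox_linear(c, jac, np.array([0.9, -0.9]), mu=2.0, t=0.5, iters=30)
print(x_final, np.linalg.norm(c(x_final)))
```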
Inexact Accelerated Prox-Linear

Two sub-problems to solve:

x_k is an ε_k-approximate optimal solution of
min_z g(z) + h(c(y_k) + ∇c(y_k)(z − y_k)) + (1/(2t))‖z − y_k‖²

v_k is a δ_k-approximate optimal solution of
min_z g(z) + (1/a_k)·h(c(y_k) + a_k∇c(y_k)(z − v_{k−1})) + (a_k/(2t))‖z − v_{k−1}‖²

Near-stationarity: dist(0, ∂F(u_k)) ≤ C(‖x_k − y_k‖ + √ε_k), with ‖u_k − x_k‖ ≈ ‖x_k − y_k‖ + √ε_k.

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} { ‖x_i − y_i‖² + ε_i } ≤ ρ·O(µ²R²/k) + O(µ²/k³) + (1/k³)·Σ_{i=1}^k [ O(i²ε_i) + O(i²δ_i) + O(i√δ_i) ],

where R = diam(dom g).

Need: ε_i ~ 1/i^{3+r} and δ_i ~ 1/i^{4+r}.
Thank you!
References

Drusvyatskiy, D. and Kempton, C. (2016). An accelerated algorithm for minimizing convex compositions. Preprint, arXiv:1605.00125.

Drusvyatskiy, D. and Lewis, A. (2016). Error bounds, quadratic growth, and linear convergence of proximal methods. Preprint, arXiv:1602.06661.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2, Ser. A):59-99.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems.