ORIE 6326: Convex Optimization
Unconstrained minimization
Professor Udell
Operations Research and Information Engineering, Cornell
April 30, 2017
Slides adapted from Stanford EE364a
Outline
◮ terminology and assumptions
◮ gradient descent method
◮ steepest descent method
◮ Newton’s method
◮ self-concordant functions
◮ implementation
Unconstrained minimization
minimize f (x)
◮ f : Rⁿ → R convex, continuously differentiable (hence dom f open)
◮ we assume the optimal value p⋆ = inf_x f(x) is attained (and finite)
◮ we assume a starting point x(0) ∈ dom f is known
unconstrained minimization methods
◮ produce a sequence of points x(k) ∈ dom f, k = 0, 1, . . ., with f(x(k)) → p⋆
◮ can be interpreted as iterative methods for solving the optimality condition ∇f(x⋆) = 0
Descent methods
x(k+1) = x(k) + t(k)∆x(k)   with f(x(k+1)) < f(x(k))
◮ other notations: x⁺ = x + t∆x, x := x + t∆x, x ← x + t∆x
◮ ∆x is the step, or search direction
◮ t is the step size, or step length
◮ from convexity, f(x⁺) ≥ f(x) + t∇f(x)ᵀ∆x, so f(x⁺) < f(x) implies ∇f(x)ᵀ∆x < 0 (i.e., ∆x is a descent direction)
Algorithm 1 General descent method.
given a starting point x ∈ dom f.
repeat
1. Determine a descent direction ∆x.
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.
Step size choices
◮ constant step size t(k) = t
◮ decreasing step size t(k) = 1/k
◮ line search
Line search types
exact line search: t = argmin_{t>0} f(x + t∆x)
◮ how? bisection to find a zero of g′(t), where g(t) = f(x + t∆x)
◮ takes O(log(1/ε)) steps to find t with |g′(t)| ≤ ε
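A minimal sketch of this bisection in Python; the derivative oracle gprime and the initial bracket [0, t_hi] with g′(t_hi) > 0 are illustrative assumptions, not part of the slides.

def exact_linesearch_bisection(gprime, t_hi, eps=1e-8):
    # bisection for a zero of g'(t), where g(t) = f(x + t*dx);
    # assumes g'(0) < 0 (dx is a descent direction) and g'(t_hi) > 0,
    # maintaining the invariant g'(lo) <= 0 <= g'(hi)
    lo, hi = 0.0, t_hi
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if gprime(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)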
backtracking line search (with parameters a ∈ (0, 1/2], b ∈ (0, 1))
◮ starting at t = 1, repeat t := bt until
f(x + t∆x) < f(x) + at∇f(x)ᵀ∆x
◮ graphical interpretation: backtrack until t ≤ t₀
[figure: f(x + t∆x) as a function of t, together with the lower bound f(x) + t∇f(x)ᵀ∆x and the acceptance line f(x) + at∇f(x)ᵀ∆x; t₀ marks where the acceptance line crosses f(x + t∆x)]
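To make the backtracking rule concrete, here is a minimal sketch in Python (NumPy); the callables f and grad and the default parameters a = 0.5, b = 0.5 are illustrative assumptions.

import numpy as np

def backtracking(f, grad, x, dx, a=0.5, b=0.5):
    # shrink t until f(x + t*dx) < f(x) + a*t*grad(x)^T dx
    t = 1.0
    g_dx = grad(x) @ dx   # directional derivative; negative for a descent direction
    while f(x + t * dx) >= f(x) + a * t * g_dx:
        t *= b
    return t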
Gradient descent method
general descent method with ∆x = −∇f(x)
Algorithm 2 Gradient descent method.
given a starting point x ∈ dom f.
repeat
1. ∆x := −∇f(x).
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.
◮ stopping criterion usually of the form ‖∇f(x)‖₂ ≤ ε
◮ very simple, but often very slow
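A sketch of the full loop, reusing the backtracking helper above; the tolerance and iteration cap are illustrative choices.

import numpy as np

def gradient_descent(f, grad, x0, tol=1e-6, max_iter=1000):
    # gradient descent with backtracking line search;
    # stops when the gradient norm falls below tol
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stopping criterion ||grad f(x)||_2 <= tol
            break
        t = backtracking(f, grad, x, -g)
        x = x - t * g
    return x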
quadratic problem in R²
f(x) = (1/2)(x₁² + γx₂²)   (γ > 0)
with exact line search, starting at x(0) = (γ, 1):
x₁(k) = γ((γ − 1)/(γ + 1))^k ,   x₂(k) = (−(γ − 1)/(γ + 1))^k
◮ very slow if γ ≫ 1 or γ ≪ 1
◮ example for γ = 10:
[figure: level curves of f with the zig-zagging iterates x(0), x(1), . . .; x₁ ranges over [−10, 10] and x₂ over [−4, 4]]
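A quick numerical check of the closed form (a sketch; γ = 10 as in the figure). For a quadratic, exact line search has the explicit solution t = ∇f(x)ᵀ∇f(x) / (∇f(x)ᵀA∇f(x)).

import numpy as np

gamma = 10.0
A = np.diag([1.0, gamma])        # f(x) = 0.5 * x^T A x
x = np.array([gamma, 1.0])       # x(0) = (gamma, 1)
for k in range(5):
    g = A @ x                    # gradient of f
    t = (g @ g) / (g @ A @ g)    # exact line search step for a quadratic
    x = x - t * g
r = (gamma - 1) / (gamma + 1)
print(x)                         # matches [gamma * r**5, (-r)**5]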
nonquadratic example
f(x₁, x₂) = e^(x₁ + 3x₂ − 0.1) + e^(x₁ − 3x₂ − 0.1) + e^(−x₁ − 0.1)
[figure: contour plots of f with the gradient descent iterates x(0), x(1), x(2), . . . shown for backtracking line search (left) and exact line search (right)]
a problem in R¹⁰⁰
f(x) = cᵀx − Σ_{i=1}^{500} log(bᵢ − aᵢᵀx)
[figure: f(x(k)) − p⋆ versus iteration k on a semilog scale, for exact and backtracking line search; the error falls from about 10⁴ to below 10⁻⁴ within roughly 200 iterations]
‘linear’ convergence, i.e., a straight line on a semilog plot
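A smaller-scale sketch of this experiment (hypothetical data; sizes shrunk from 500 × 100). The line search must treat points outside dom f, i.e. with some bᵢ − aᵢᵀx ≤ 0, as having value +∞; we also assume the random instance has bounded sublevel sets so p⋆ is attained.

import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 20                      # illustrative sizes, smaller than the slide's
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, m)        # x(0) = 0 is feasible: b - A @ 0 > 0
c = rng.standard_normal(n)

def f(x):
    s = b - A @ x
    return np.inf if np.any(s <= 0) else c @ x - np.log(s).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x = np.zeros(n)
for k in range(100):
    g = grad(x)
    t, fx = 1.0, f(x)
    while f(x - t * g) >= fx - 0.5 * t * (g @ g):   # backtracking with a = b = 1/2
        t *= 0.5
    x = x - t * g
    # printing f(x) - p* here gives a roughly straight line on a semilog plot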
Quadratic upper and lower bounds
what problem does the gradient descent step x := x − t∇f(x) solve?
answer:
minimize_y   f(x) + ∇f(x)ᵀ(y − x) + (1/2t)‖x − y‖²
so gradient descent works best if the Hessian is “almost” a scaled multiple of the identity.
Formally, find α ∈ R and β ∈ R so that for all x, y ∈ dom f,
f(x) + ∇f(x)ᵀ(y − x) + (α/2)‖x − y‖² ≤ f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (β/2)‖x − y‖²
◮ lower bound is called strong convexity (with parameter α)
◮ upper bound is called smoothness (with parameter β)
◮ clearly α ≤ β
Example I: quadratic
suppose A ∈ Sⁿ₊, and consider f(x) = (1/2)xᵀAx. is f smooth and strongly convex?
◮ strongly convex, with parameter α = λmin(A) (if λmin(A) > 0)
◮ smooth, with parameter β = λmax(A)
proof:
(1/2)yᵀAy = (1/2)(x + (y − x))ᵀA(x + (y − x))
= (1/2)xᵀAx + (Ax)ᵀ(y − x) + (1/2)(y − x)ᵀA(y − x)
and
(λmin(A)/2)‖y − x‖² ≤ (1/2)(y − x)ᵀA(y − x) ≤ (λmax(A)/2)‖y − x‖²
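A sketch verifying these constants numerically on a random positive semidefinite A (the sampling-based check is illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                                   # random A in S^n_+
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]  # lambda_min(A), lambda_max(A)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lin = f(x) + grad(x) @ (y - x)            # linearization at x
    d2 = np.sum((y - x) ** 2)
    assert lin + 0.5 * alpha * d2 <= f(y) + 1e-9   # strong convexity lower bound
    assert f(y) <= lin + 0.5 * beta * d2 + 1e-9    # smoothness upper bound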
Example II: smoothed absolute value
consider
huber(x) = { (1/2)x²   if x² ≤ 1;   |x| − 1/2   otherwise }
is huber smooth and strongly convex?
◮ not strongly convex
◮ smooth, with parameter β = 1
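A minimal numeric illustration (a sketch using finite differences): the curvature of huber never exceeds β = 1, but vanishes on the tails, which rules out strong convexity.

import numpy as np

huber = lambda x: np.where(x**2 <= 1, 0.5 * x**2, np.abs(x) - 0.5)
xs = np.linspace(-3, 3, 10001)
h = xs[1] - xs[0]
curv = np.diff(huber(xs), 2) / h**2   # finite-difference second derivative
print(curv.max())                     # ~1 -> smooth with beta = 1
print(curv.min())                     # ~0 -> no alpha > 0, not strongly convex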
Example III: regularized absolute value
consider f(x) = |x| + x². is f smooth and strongly convex?
◮ strongly convex, with parameter α = 1
◮ not smooth
Roadmap
we’ll analyze gradient descent for a few cases:
◮ is f α-strongly convex, or not?
◮ is f β-smooth, or not?
◮ is f Lipschitz differentiable, or not?
◮ do we use a fixed step size, or line-search?
a question: are these rates “the best possible”? how could we tell? compared to what?
we’ll do the β-smooth case first, then work up to the others
Monotone gradient
first, another convexity condition:
◮ a differentiable function f : Rⁿ → R is convex iff for all x, y ∈ dom f,
(∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0
◮ ∇f is called a monotone mapping
◮ strict inequality =⇒ strictly monotone mapping
Monotone gradient: proof
◮ if f is differentiable and convex, then
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),   f(x) ≥ f(y) + ∇f(y)ᵀ(x − y)
add these to get
0 ≥ (∇f(x) − ∇f(y))ᵀ(y − x)
◮ if ∇f is monotone, then for any x, y ∈ dom f, let g(t) = f(x + t(y − x)). monotonicity gives, for t ≥ 0,
g′(t) = ∇f(x + t(y − x))ᵀ(y − x) ≥ g′(0)
so
f(y) = g(1) = g(0) + ∫_0^1 g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)ᵀ(y − x)
Smoothness: equivalent definitions
for convex f : Rⁿ → R, the following properties are all equivalent:
1. (β/2)xᵀx − f(x) is convex
2. f is β-smooth: for all x, y ∈ dom f,
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (β/2)‖x − y‖²
3. (if f is twice differentiable) ∇²f(x) ⪯ βI
4. ∇f is Lipschitz continuous with parameter β: for all x, y ∈ dom f,
‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖
5. ∇f is co-coercive with parameter 1/β: for all x, y ∈ dom f,
(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/β)‖∇f(x) − ∇f(y)‖²
proof: 2) is the first order condition for convexity of 1); 3) is the second order condition for convexity of 1); 4) =⇒ 2) by integration; 5) =⇒ 4) using Cauchy–Schwarz; 2) =⇒ 5) using the first order condition (Lemma 3.5 of Bubeck). only 2) =⇒ 5) requires convexity.
Smoothness: proofs of equivalence
1. f is β-smooth: for all x, y ∈ dom f,
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (β/2)‖x − y‖²
2. ∇f is Lipschitz continuous with parameter β: for all x, y ∈ dom f,
‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖
proof. integrate, then use Lipschitz continuity of ∇f (last line) to prove smoothness:
f(y) − f(x) − ∇f(x)ᵀ(y − x)
= ∫_0^1 ∇f(x + t(y − x))ᵀ(y − x) dt − ∇f(x)ᵀ(y − x)
= ∫_0^1 (∇f(x + t(y − x)) − ∇f(x))ᵀ(y − x) dt
≤ ∫_0^1 ‖∇f(x + t(y − x)) − ∇f(x)‖₂ ‖y − x‖₂ dt
≤ ∫_0^1 βt‖y − x‖₂² dt = (β/2)‖y − x‖₂²
Smoothness: proofs of equivalence
1. ∇f is Lipschitz continuous with parameter β: for all x, y ∈ dom f,
‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖
2. ∇f is co-coercive with parameter 1/β: for all x, y ∈ dom f,
(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/β)‖∇f(x) − ∇f(y)‖²
proof. use Cauchy–Schwarz (first inequality) and co-coercivity (second):
‖∇f(x) − ∇f(y)‖ ‖x − y‖ ≥ (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/β)‖∇f(x) − ∇f(y)‖²
divide through by ‖∇f(x) − ∇f(y)‖:
‖x − y‖ ≥ (1/β)‖∇f(x) − ∇f(y)‖
which is Lipschitz continuity
Analysis of gradient descent for smooth functions
three assumptions for analysis:
◮ f : Rⁿ → R convex and differentiable with dom f = Rⁿ
◮ f is smooth with parameter β > 0
◮ optimal value p⋆ = infx f (x) finite and attained at x⋆
Algorithm: GD with constant step size.
pick a constant step size 0 < t ≤ 1/β, and repeat
x(k+1) = x(k) − t∇f(x(k))
(nb: not implementable without a guess for β)
Analysis of gradient descent for smooth functions
use the quadratic upper bound with y = x⁺ = x − t∇f(x):
f(x⁺) ≤ f(x) + ∇f(x)ᵀ(x⁺ − x) + (β/2)‖x⁺ − x‖²
= f(x) − t‖∇f(x)‖² + (t²β/2)‖∇f(x)‖²
if the constant step size satisfies 0 < t ≤ 1/β,
f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖²
≤ p⋆ + ∇f(x)ᵀ(x − x⋆) − (t/2)‖∇f(x)‖²
= p⋆ + (1/2t)(‖x − x⋆‖² − ‖x − x⋆ − t∇f(x)‖²)
= p⋆ + (1/2t)(‖x − x⋆‖² − ‖x⁺ − x⋆‖²)
(second inequality uses the first order convexity condition f(x) ≤ p⋆ + ∇f(x)ᵀ(x − x⋆))
Analysis of gradient descent for smooth functions
take the average of the per-step bound over iterations i = 1, . . . , k:
(1/k) Σ_{i=1}^k (f(x(i)) − p⋆) ≤ (1/k) Σ_{i=1}^k (1/2t)(‖x(i−1) − x⋆‖² − ‖x(i) − x⋆‖²)
= (1/(2tk))(‖x(0) − x⋆‖² − ‖x(k) − x⋆‖²)
≤ (1/(2tk))‖x(0) − x⋆‖²
since f(x(k)) is non-increasing,
f(x(k)) − p⋆ ≤ (1/(2tk))‖x(0) − x⋆‖²
so the number of iterations k to reach f(x(k)) − p⋆ ≤ ε is O(1/ε)
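A sketch checking this 1/k bound on a small least-squares instance (illustrative sizes and data); here β = λmax(AᵀA) and t = 1/β.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

beta = np.linalg.eigvalsh(A.T @ A)[-1]        # smoothness constant
t = 1.0 / beta
xstar = np.linalg.lstsq(A, b, rcond=None)[0]  # minimizer
pstar = f(xstar)
d0 = np.sum(xstar ** 2)                       # ||x(0) - x*||^2 with x(0) = 0

x = np.zeros(20)
for k in range(1, 201):
    x = x - t * grad(x)
    assert f(x) - pstar <= d0 / (2 * t * k) + 1e-9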
Analysis of gradient descent for smooth functions
now, with line search!
◮ t chosen by line search with parameters (a, b) = (1/2, 1/2) (to simplify proofs), so x⁺ = x − t∇f(x) satisfies
f(x⁺) < f(x) − (t/2)‖∇f(x)‖²
◮ from smoothness of f, we know t = 1/β would work
◮ so line search returns t ≥ b/β = 1/(2β)
Algorithm: GD with line search.
pick line search parameters (a, b) = (1/2, 1/2) and x(0) ∈ Rⁿ, and repeat
1. compute ∇f(x(k))
2. find t(k) by line search
3. update x(k+1) = x(k) − t(k)∇f(x(k))
Analysis of gradient descent for smooth functions
using the line search condition,
f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖²
≤ p⋆ + ∇f(x)ᵀ(x − x⋆) − (t/2)‖∇f(x)‖²
= p⋆ + (1/2t)(‖x − x⋆‖² − ‖x − x⋆ − t∇f(x)‖²)
= p⋆ + (1/2t)(‖x − x⋆‖² − ‖x⁺ − x⋆‖²)
≤ p⋆ + β(‖x − x⋆‖² − ‖x⁺ − x⋆‖²)
second inequality uses the first order convexity condition; last line uses 1/t ≤ β/b = 2β
Analysis of gradient descent for smooth functions
take the average of the per-step bound over iterations i = 1, . . . , k:
(1/k) Σ_{i=1}^k (f(x(i)) − p⋆) ≤ (1/k) Σ_{i=1}^k β(‖x(i−1) − x⋆‖² − ‖x(i) − x⋆‖²)
= (β/k)(‖x(0) − x⋆‖² − ‖x(k) − x⋆‖²)
≤ (β/k)‖x(0) − x⋆‖²
since f(x(k)) is non-increasing,
f(x(k)) − p⋆ ≤ (β/k)‖x(0) − x⋆‖²
so the number of iterations k to reach f(x(k)) − p⋆ ≤ ε is O(1/ε)
Strong convexity: equivalent definitions
for convex f : Rⁿ → R, the following properties are all equivalent:
1. f(x) − (α/2)xᵀx is convex
2. f is α-strongly convex: for all x, y ∈ dom f,
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (α/2)‖x − y‖²
3. (if f is twice differentiable) ∇²f(x) ⪰ αI
4. ∇f is coercive with parameter α: for all x, y ∈ dom f,
(∇f(x) − ∇f(y))ᵀ(x − y) ≥ α‖x − y‖²
proof: 2) is the first order condition for convexity of 1); 3) is the second order condition for convexity of 1); 4) is the monotone gradient condition for 1)
Strong convexity + smoothness
if f is α-strongly convex and β-smooth,
then h(x) = f(x) − (α/2)‖x‖² is convex and (β − α)-smooth, so
(∇h(x) − ∇h(y))ᵀ(x − y) ≥ (1/(β − α))‖∇h(x) − ∇h(y)‖²
expand h to show
(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (αβ/(α + β))‖x − y‖² + (1/(α + β))‖∇f(x) − ∇f(y)‖²
Analysis of gradient descent for SSC functions
assumptions for analysis:
◮ f : Rⁿ → R convex and differentiable with dom f = Rⁿ
◮ f is smooth with parameter β > 0
◮ f is strongly convex with parameter α > 0
◮ optimal value p⋆ = infx f (x) finite and attained at x⋆
Algorithm: GD with constant step size.
pick a constant step size 0 < t ≤ 2/(α + β), and repeat
x(k+1) = x(k) − t∇f(x(k))
note α < β =⇒ 1/β < 2/(α + β),
so SSC allows larger step sizes than smoothness alone
Analysis of gradient descent for SSC functions
‖x⁺ − x⋆‖² = ‖x − t∇f(x) − x⋆‖²
= ‖x − x⋆‖² + t²‖∇f(x)‖² − 2t∇f(x)ᵀ(x − x⋆)
≤ ‖x − x⋆‖² + t²‖∇f(x)‖² − 2t((αβ/(α + β))‖x − x⋆‖² + (1/(α + β))‖∇f(x)‖²)
= (1 − t(2αβ/(α + β)))‖x − x⋆‖² + t(t − 2/(α + β))‖∇f(x)‖²
≤ (1 − t(2αβ/(α + β)))‖x − x⋆‖²
(first inequality uses coercivity + co-coercivity, last uses t ≤ 2/(α + β))
Analysis of gradient descent for SSC functions
◮ distance to optimum decreases by the factor c = 1 − t(2αβ/(α + β)) every iteration:
‖x(k) − x⋆‖² ≤ c^k ‖x(0) − x⋆‖²
i.e., “linear convergence”
◮ if t = 2/(α + β), then c = ((κ − 1)/(κ + 1))², where κ = β/α ≥ 1 is the condition number
◮ using the quadratic upper bound, get a bound on the function value:
f(x(k)) − p⋆ ≤ (β/2)‖x(k) − x⋆‖² ≤ (βc^k/2)‖x(0) − x⋆‖²
◮ so the number of iterations k to reach f(x(k)) − p⋆ ≤ ε is O(log(1/ε))
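A sketch of the linear rate on a strongly convex quadratic (illustrative instance): with f(x) = (1/2)xᵀAx we have x⋆ = 0, and with t = 2/(α + β) the squared distance should contract by c = ((κ − 1)/(κ + 1))² per iteration.

import numpy as np

rng = np.random.default_rng(2)
Q = rng.standard_normal((20, 20))
A = Q.T @ Q + np.eye(20)                      # positive definite Hessian
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
t = 2.0 / (alpha + beta)
kappa = beta / alpha
c = ((kappa - 1) / (kappa + 1)) ** 2

x = rng.standard_normal(20)
d0 = np.sum(x ** 2)                           # ||x(0) - x*||^2
for k in range(1, 51):
    x = x - t * (A @ x)                       # gradient step, grad f(x) = A x
    assert np.sum(x ** 2) <= c**k * d0 * (1 + 1e-9)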
Conclusion
we showed that the gradient method with appropriate step sizes converges, and guarantees
◮ for f convex and β-smooth (constant step size t = 1/β),
f(x(k)) − p⋆ ≤ (β/(2k))‖x(0) − x⋆‖²
◮ for f convex, β-smooth, and α-strongly convex,
f(x(k)) − p⋆ ≤ (βc^k/2)‖x(0) − x⋆‖²,
where c = ((κ − 1)/(κ + 1))² and κ = β/α ≥ 1 is the condition number
References
◮ Lieven Vandenberghe, UCLA EE236C: Gradient Methods
◮ Sébastien Bubeck, Convex Optimization: Algorithms and Complexity