
ITERATIVE METHODS AND REGULARIZATION

IN THE DESIGN OF FAST ALGORITHMS

Lorenzo Orecchia, MIT Math

A unified framework for optimization and online learning

beyond Multiplicative Weight Updates


Talk Outline: A Tale of Two Halves

PART 1: REGULARIZATION AND ITERATIVE TECHNIQUES FOR ONLINE LEARNING

• Online Linear Optimization

• Online Linear Optimization over Simplex and Multiplicative Weight Updates (MWUs)

• A Regularization Framework to generalize MWUs: Follow the Regularized Leader

MESSAGE: REGULARIZATION IS A POWERFUL ALGORITHMIC TECHNIQUE

PART 2: NON-SMOOTH OPTIMIZATION AND FAST ALGORITHMS FOR MAXFLOW

• Non-smooth vs Smooth Convex Optimization

•Non-smooth Convex Optimization reduces to Online Linear Optimization

• Application: Understanding Undirected Maxflow algorithms based on MWUs

MESSAGE: FASTEST ALGORITHMS REQUIRE PRIMAL-DUAL APPROACH

TOC Applications of MWUs

Fast algorithms for solving specific LPs and SDPs:

• Maximum flow problems [PST], [GK], [F], [CKMST]

• Covering-packing problems [PST]

• Oblivious routing [R], [M]

Fast approximation algorithms based on LP and SDP relaxations:

• Maxcut [AK]

• Graph partitioning problems [AK], [S], [OSV]

Proof techniques:

• Hardcore Lemma [BHK]

• QIP = PSPACE [W]

• Derandomization [Y]

… and more

Machine Learning meets Optimization meets TCS

These techniques have been rediscovered multiple times in different fields: Machine Learning, Convex Optimization, TCS.

Three surveys emphasizing the different viewpoints and literatures:

1) ML: Prediction, Learning, and Games by Cesa-Bianchi and Lugosi

2) Optimization: Lectures on Modern Convex Optimization by Ben-Tal and Nemirovski

3) TCS: The Multiplicative Weights Update Method: a Meta-Algorithm and Applications by Arora, Hazan and Kale

REGULARIZATION 101

What is Regularization?

Regularization is a fundamental technique in optimization: it turns an OPTIMIZATION PROBLEM into a WELL-BEHAVED OPTIMIZATION PROBLEM, with:

• Stable optimum

• Unique optimal solution

• Smoothness conditions

Benefits of regularization in learning and statistics:

• Prevents overfitting

• Increases stability

• Decreases sensitivity to random noise

Ingredients: a regularizer F and a parameter λ > 0.

Example: Regularization Helps Stability

Consider a convex set S ⊂ R^n and a linear optimization problem:

f(c) = argmin_{x ∈ S} c^T x

The optimal solution f(c) may be very unstable under perturbation of c:

‖c′ − c‖ ≤ δ and yet ‖f(c′) − f(c)‖ ≫ δ

Now consider instead the regularized linear optimization problem

f(c) = argmin_{x ∈ S} c^T x + F(x)

where F is σ-strongly convex. Then:

‖c′ − c‖ ≤ δ implies ‖f(c′) − f(c)‖ ≤ δ/σ
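A small numeric sketch of this stability phenomenon (not from the talk, just an illustration): take S to be the two-dimensional simplex, where the unregularized argmin is a vertex that jumps under a tiny perturbation of c, while the entropy-regularized optimum (which has the softmax closed form) moves by O(δ/σ).

```python
import numpy as np

def f_linear(c):
    # argmin over the simplex of c^T x: puts all mass on the smallest coordinate
    x = np.zeros_like(c)
    x[np.argmin(c)] = 1.0
    return x

def f_regularized(c, sigma=1.0):
    # argmin over the simplex of c^T x + sigma * sum_i x_i log x_i
    # (negative entropy is sigma-strongly convex w.r.t. ||.||_1 on the simplex);
    # the minimizer is the softmax of -c/sigma
    w = np.exp(-c / sigma)
    return w / w.sum()

c  = np.array([1.0, 1.0 + 1e-6])
c2 = np.array([1.0, 1.0 - 1e-6])   # perturbation of size delta ~ 2e-6

# Unregularized: the argmin jumps between two vertices, ||f(c2) - f(c)||_1 = 2
jump = np.abs(f_linear(c2) - f_linear(c)).sum()

# Regularized: the optimum moves by only O(delta / sigma)
shift = np.abs(f_regularized(c2) - f_regularized(c)).sum()
```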

ONLINE LINEAR OPTIMIZATION

AND

MULTIPLICATIVE WEIGHT UPDATES

Online Linear Minimization

SETUP: Convex set X ⊆ R^n, generic norm ‖·‖, repeated game over T rounds.

At round t:

• ALGORITHM plays the current solution x(t) ∈ X

• ADVERSARY reveals the current linear objective, i.e. a loss vector ℓ(t) ∈ R^n with ‖ℓ(t)‖_* ≤ ρ

• The algorithm suffers loss ℓ(t)^T x(t)

• ALGORITHM plays the updated solution x(t+1) ∈ X, ADVERSARY reveals a new loss vector ℓ(t+1), and so on

GOAL: update x(t) to minimize regret (average algorithm's loss minus the a posteriori optimum):

Regret = (1/T) Σ_{t=1}^T ℓ(t)^T x(t) − min_{x ∈ X} (1/T) Σ_{t=1}^T ℓ(t)^T x

Simplex Case: Learning with Experts

SETUP: Simplex X ⊆ R^n under the ℓ1 norm. At round t:

• ALGORITHM plays p(t), a distribution over the n dimensions, i.e. over experts

• ADVERSARY reveals the experts' losses ℓ(t), with ‖ℓ(t)‖_∞ ≤ ρ

• The algorithm's loss is E_{i ∼ p(t)}[ℓ(t)_i] = p(t)^T ℓ(t)

• ALGORITHM updates the distribution to p(t+1)

Simplex Case: Multiplicative Weight Updates

At each round, the ALGORITHM plays p(t) and the ADVERSARY reveals ℓ(t).

MULTIPLICATIVE WEIGHT UPDATE:

Weights: w(t+1)_i = (1 − ε)^{ℓ(t)_i} · w(t)_i, with w(1) = the all-ones vector

Distribution: p(t+1)_i = w(t+1)_i / Σ_{j=1}^n w(t+1)_j

The step parameter ε ∈ (0, 1) interpolates between CONSERVATIVE (ε near 0) and AGGRESSIVE (ε near 1) updates.

MWUs: Unraveling the Update

Update: p(t+1)_i ∝ w(t+1)_i = (1 − ε)^{ℓ(t)_i} · w(t)_i

Unraveling the recursion, the WEIGHT is an exponential of the CUMULATIVE LOSS:

w(t+1)_i = (1 − ε)^{Σ_{s=1}^t ℓ(s)_i}

MWUs: Regret Bound

For ‖ℓ(t)‖_∞ ≤ ρ and ε < 1/2, the algorithm's regret satisfies:

L̂ − L* ≤ (ρ log n)/(εT) + ρε

The first term is a start-up penalty; the second is a penalty for being greedy.
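The update and bound above can be sketched in a few lines; this is an illustrative implementation (with losses in [0, 1], so ρ = 1) that checks the average regret against the bound log n/(εT) + ε:

```python
import numpy as np

def mwu(losses, eps):
    """Multiplicative Weight Updates over the simplex.

    losses: T x n array of per-round expert losses in [0, 1].
    Returns (total expected algorithm loss, best expert's total loss).
    """
    T, n = losses.shape
    w = np.ones(n)                      # w(1) = all-ones vector
    alg_loss = 0.0
    for l in losses:
        p = w / w.sum()                 # p(t)_i = w(t)_i / sum_j w(t)_j
        alg_loss += p @ l               # expected loss p(t)^T l(t)
        w *= (1.0 - eps) ** l           # w(t+1)_i = (1-eps)^{l(t)_i} w(t)_i
    return alg_loss, losses.sum(axis=0).min()

rng = np.random.default_rng(0)
T, n, eps = 4000, 10, 0.05
losses = rng.random((T, n))
alg, best = mwu(losses, eps)

# Regret bound from the slides, with rho = 1
avg_regret = (alg - best) / T
bound = np.log(n) / (eps * T) + eps
```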

ONLINE LINEAR OPTIMIZATION BEYOND MWUs

A REGULARIZATION FRAMEWORK

MWUs: Proof Sketch of Regret Bound

Update: p(t+1)_i ∝ w(t+1)_i = (1 − ε)^{Σ_{s=1}^t ℓ(s)_i}

Potential function: Φ(t+1) = log_{1−ε} Σ_{i=1}^n w(t+1)_i

• The proof is a potential function argument.

• The potential function bounds the loss of the best expert (note log_{1−ε} is decreasing, so summing the weights lower-bounds the minimum cumulative loss):

Φ(t+1) ≤ log_{1−ε} min_{i=1}^n w(t+1)_i = min_{i=1}^n (Σ_{s=1}^t ℓ(s)_i)

• The potential function is related to the algorithm's performance:

Φ(t+1) − Φ(t) ≥ ℓ(t)^T p(t) − ε

DOES THIS PROOF TECHNIQUE GENERALIZE BEYOND THE SIMPLEX CASE?

MWUs AND APPLICATIONS

Designing a Regularized Update

GOAL: Design an update and its potential function analysis.

QUESTION: Choice of potential function?

DESIDERATA: 1) lower bounds the best expert's loss; 2) tracks the algorithm's performance.

Attempt 1 – FOLLOW THE LEADER: use the cumulative loss L(t) = Σ_{s=1}^t ℓ(s):

x(t+1) = argmin_{x ∈ X} x^T L(t)    (pick the best current solution)

Φ(t+1) = min_{x ∈ X} x^T L(t)    (potential is the current best loss)

This fails if the best expert changes drastically. How can we make the update more stable?
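The failure mode of Follow the Leader can be made concrete with the classic two-expert instance with alternating losses (an illustrative example, not from the slides): FTL chases yesterday's leader, pays roughly 1 every round, and suffers regret linear in T.

```python
import numpy as np

def ftl(losses):
    """Follow the Leader over the simplex: put all mass on the expert
    with smallest cumulative loss so far (ties -> lowest index)."""
    T, n = losses.shape
    L = np.zeros(n)            # cumulative loss L(t)
    total = 0.0
    for l in losses:
        x = np.zeros(n)
        x[np.argmin(L)] = 1.0  # argmin_x x^T L(t): a vertex of the simplex
        total += x @ l
        L += l
    return total

# Bad instance: two experts with alternating losses.
T = 1000
losses = np.zeros((T, 2))
losses[0] = [0.5, 0.0]
losses[1::2] = [0.0, 1.0]
losses[2::2] = [1.0, 0.0]

total = ftl(losses)                    # FTL pays ~1 per round: ~T
best = losses.sum(axis=0).min()        # best expert pays ~T/2
regret = total - best                  # linear in T
```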

Regularized Update: Definition

Attempt 2 – FOLLOW THE REGULARIZED LEADER:

x(t+1) = argmin_{x ∈ X} x^T L(t) + η · F(x)

Φ(t+1) = min_{x ∈ X} x^T L(t) + η · F(x)

Properties of the regularizer F(x):

1. Convex, differentiable

2. σ-strongly convex w.r.t. the norm ‖·‖

Parameter η ≥ 0, to be determined. These properties are actually sufficient to get a regret bound.

Regularized Update: Analysis

The potential still lower-bounds the best loss, up to a regularization error:

Φ(t+1) ≤ min_{x ∈ X} L(t)^T x + η · max_{x ∈ X} F(x)

Tracking the Algorithm: Proof by Picture

Define: f(t+1)(x) = L(t)^T x + η · F(x), so that Φ(t+1) = f(t+1)(x(t+1)) and Φ(t) = f(t)(x(t)).

Notice: f(t+1)(x) − f(t)(x) = ℓ(t)^T x, the latest loss vector.

Compare Φ(t+1) − Φ(t) with ℓ(t)^T x(t). We want Φ(t+1) − Φ(t) ≈ ℓ(t)^T x(t), which holds when f(t+1)(x(t)) ≈ f(t+1)(x(t+1)), since:

Φ(t+1) − Φ(t) = f(t+1)(x(t+1)) − f(t+1)(x(t)) + ℓ(t)^T x(t)

Regularization in Action

f(t+1)(x) = L(t)^T x + η · F(x)

REGULARIZATION: f(t) is (η · σ)-strongly convex, which gives a quadratic lower bound to f(t+1) around its minimizer.

STABILITY: the consecutive objectives differ by a linear term, with ‖∇f(t+1) − ∇f(t)‖_* = ‖ℓ(t)‖_*, so their minimizers are close:

‖x(t+1) − x(t)‖ ≤ ‖ℓ(t)‖_* / (η · σ)
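The stability bound can be checked numerically (an illustrative check, not from the talk): for entropic FTRL on the simplex, F = negative entropy is σ = 1 strongly convex w.r.t. ‖·‖_1, the dual norm is ‖·‖_∞, and consecutive minimizers should satisfy ‖x(t+1) − x(t)‖_1 ≤ ‖ℓ(t)‖_∞ / η.

```python
import numpy as np

def ftrl_step(L, eta):
    # Entropic FTRL minimizer: softmax of -L/eta (numerically stabilized)
    w = np.exp(-L / eta - (-L / eta).max())
    return w / w.sum()

rng = np.random.default_rng(1)
eta, n, T = 5.0, 8, 200

L = np.zeros(n)
ok = True
for _ in range(T):
    l = rng.uniform(-1, 1, size=n)      # loss with ||l||_inf <= 1
    x_old = ftrl_step(L, eta)
    L += l
    x_new = ftrl_step(L, eta)
    # stability bound: ||x(t+1) - x(t)||_1 <= ||l(t)||_inf / (eta * sigma)
    ok &= np.abs(x_new - x_old).sum() <= np.abs(l).max() / eta + 1e-12
```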

Analysis: Progress in One Iteration

At the old minimizer, ∇f(t+1)(x(t)) = ℓ(t), and by stability ‖x(t+1) − x(t)‖ ≤ ‖ℓ(t)‖_* / (η · σ).

Since f(t+1) is (η · σ)-strongly convex:

f(t+1)(x(t+1)) − f(t+1)(x(t)) ≥ ℓ(t)^T (x(t+1) − x(t)) + (η σ / 2) ‖x(t+1) − x(t)‖²

Therefore:

Φ(t+1) − Φ(t) = f(t+1)(x(t+1)) − f(t+1)(x(t)) + ℓ(t)^T x(t)

≥ ℓ(t)^T x(t) − ‖ℓ(t)‖_* ‖x(t+1) − x(t)‖ + (η σ / 2) ‖x(t+1) − x(t)‖² ≥ ℓ(t)^T x(t) − ‖ℓ(t)‖²_* / (2 η σ)

MWUs AND APPLICATIONS

Completing the Analysis

Progress in one iteration (the regret at iteration t):

Φ(t+1) − Φ(t) ≥ ℓ(t)^T x(t) − ‖ℓ(t)‖²_* / (2 σ η)

Telescoping sum:

Φ(T+1) ≥ Σ_{t=1}^T ℓ(t)^T x(t) + Φ(1) − T · max_t ‖ℓ(t)‖²_* / (2 η σ)

Final regret bound, with regularizer F and ‖ℓ(t)‖_* ≤ ρ:

(1/T) (Σ_{t=1}^T ℓ(t)^T x(t) − min_{x ∈ X} Σ_{t=1}^T ℓ(t)^T x) ≤ (η/T) · (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) + ρ² / (2 σ η)

A start-up penalty plus a penalty for being greedy: SAME TYPE OF BOUND AS FOR MWUs.

MWUs AND APPLICATIONS

Reinterpreting MWUs

Regularizer: F(p) = Σ_{i=1}^n p_i log p_i is the negative entropy; F is 1-strongly convex w.r.t. ‖·‖_1 over the simplex.

Potential function:

Φ(t+1) = min_{p ≥ 0, Σ p_i = 1} p^T L(t) + η · Σ_{i=1}^n p_i log p_i

Update:

p(t+1) = argmin_{p ≥ 0, Σ p_i = 1} p^T L(t) + η · Σ_{i=1}^n p_i log p_i

This has the closed-form SOFT-MAX solution, which for (1 − ε) = e^{−1/η} is exactly the MWU:

p(t+1)_i = e^{−L(t)_i / η} / Σ_{j=1}^n e^{−L(t)_j / η} = (1 − ε)^{L(t)_i} / Σ_{j=1}^n (1 − ε)^{L(t)_j}

MWUs AND APPLICATIONS

Beyond MWUs: Which Regularizer?

Regret bound, optimizing over η:

(1/T) (Σ_{t=1}^T ℓ(t)^T x(t) − min_{x ∈ X} Σ_{t=1}^T ℓ(t)^T x) ≤ ρ √(2 · (max_{x ∈ X} F(x) − min_{x ∈ X} F(x))) / √(σ T)

The best choice of regularizer and norm minimizes

max_t ‖ℓ(t)‖²_* · (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) / σ

Negative entropy with the ℓ1-norm is approximately optimal for the simplex.

QUESTION: are other regularizers ever useful?

Different Regularizers in Algorithm Design

QUESTION 1: Are other regularizers, besides entropy, ever useful?

YES! Applications:

Graph Partitioning and Random Walks

Spectral algorithms for balanced separator running in time Õ(m). Uses the random-walk framework and SDP MWUs. Different walks correspond to different regularizers for the eigenvector problem:

• F(X) = Tr(X log X) (SDP MWU) — Heat Kernel Random Walk

• F(X) = Tr(X^p) (p-norm, 1 ≤ p ≤ ∞) — Lazy Random Walk

• F(X) = Tr(X^{1/2}) (NEW REGULARIZER) — Personalized PageRank

[Mahoney, Orecchia, Vishnoi 2011], [Orecchia, Sachdeva, Vishnoi 2012]

Sparsification

• ε-spectral-sparsifiers with O(n log n / ε²) edges [Spielman, Srivastava 2008]; uses a matrix concentration bound equivalent to SDP MWUs.

• ε-spectral-sparsifiers with O(n / ε²) edges [Batson, Spielman, Srivastava 2009]; can be interpreted as a different regularizer: F(X) = Tr(X^{1/2}).

Many more in Online Learning: Bandit Online Learning [AHR], …

NON-SMOOTH CONVEX OPTIMIZATION REDUCES TO ONLINE LINEAR OPTIMIZATION

Convex Optimization Setup

min_{x ∈ X} f(x), with f convex and differentiable, X ⊆ R^n a closed, convex set.

NON-SMOOTH: f is ρ-Lipschitz continuous: ∀x ∈ X, ‖∇f(x)‖_* ≤ ρ

SMOOTH: f has an L-Lipschitz continuous gradient: ∀x, y ∈ X, ‖∇f(y) − ∇f(x)‖_* ≤ L ‖y − x‖

In the smooth case, a gradient step is guaranteed to decrease the function value:

f(x(t+1)) ≤ f(x(t)) − ‖∇f(x(t))‖²_* / (2L)

In the non-smooth case there is NO GRADIENT STEP GUARANTEE — ONLY A DUAL GUARANTEE.

Non-Smooth Setup: Dual Approach

f convex, differentiable, ρ-Lipschitz continuous: ∀x ∈ X, ‖∇f(x)‖_* ≤ ρ; X ⊆ R^n closed, convex; min_{x ∈ X} f(x).

APPROACH: each iterate x(t) provides both an upper bound and a lower bound on the optimum f(x*):

UPPER: f(x(t)) ≥ f(x*)

LOWER: f(x*) ≥ f(x(t)) + ∇f(x(t))^T (x* − x(t))    (by convexity)

CAN WEAKEN THE DIFFERENTIABILITY ASSUMPTION: SUBGRADIENTS SUFFICE.

Take convex combinations of the upper bounds and of the lower bounds, with weights γ_t:

UPPER: (1 / Σ_{t=1}^T γ_t) (Σ_{t=1}^T γ_t f(x(t))) ≥ f(x*)

LOWER: f(x*) ≥ (1 / Σ_{t=1}^T γ_t) [Σ_{t=1}^T γ_t (f(x(t)) + ∇f(x(t))^T (x* − x(t)))]

HOW TO UPDATE THE ITERATES? HOW TO CHOOSE THE WEIGHTS?

Reduction to Online Linear Minimization

Fix the weights γ_t to be uniform for simplicity. Subtracting the lower bound from the upper bound:

DUALITY GAP: [Σ_{t=1}^T (1/T) f(x(t))] − f(x*) ≤ (1/T) · Σ_{t=1}^T −∇f(x(t))^T (x* − x(t))

The right-hand side is a LINEAR FUNCTION of the iterates, which sets up an online game:

ONLINE SETUP: the ALGORITHM plays x(t) ∈ X; the ADVERSARY answers with the loss vector ℓ(t) = ∇f(x(t)), the gradient at the current iterate. Recall that by assumption:

‖ℓ(t)‖_* = ‖∇f(x(t))‖_* ≤ ρ

And the duality gap is controlled by the regret:

(1/T) · Σ_{t=1}^T −∇f(x(t))^T (x* − x(t)) = (1/T) · Σ_{t=1}^T ℓ(t)^T (x(t) − x*) ≤ REGRET

Final Bound

RESULTING ALGORITHM: MIRROR DESCENT. Error bound with a σ-strongly-convex regularizer F:

ε_MD ≤ ρ √(2 · (max_{x ∈ X} F(x) − min_{x ∈ X} F(x)) / σ) / √T

ASYMPTOTICALLY OPTIMAL BY THE INFORMATION COMPLEXITY LOWER BOUND.

Non-Smooth Optimization over Simplex

With regularizer F the negative entropy and ‖∇f(x(t))‖_∞ ≤ ρ:

ε_MD ≤ ρ √(2 · log n) / √T

RESULTING ALGORITHM: MIRROR DESCENT OVER SIMPLEX = MWU
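A minimal sketch of entropic mirror descent over the simplex, in the lazy/FTRL form derived above (the toy objective f(p) = ‖p − q‖_1 and the target q are illustrative assumptions, not from the talk); the averaged iterate should satisfy f(p̄) − f* ≤ ρ √(2 log n / T).

```python
import numpy as np

def mirror_descent_simplex(subgrad, n, T, rho):
    """Entropic mirror descent (lazy/FTRL form) over the simplex.

    subgrad(p) returns a subgradient of the objective at p, with
    ||subgrad(p)||_inf <= rho. Returns the averaged iterate.
    """
    eta = rho * np.sqrt(T / (2 * np.log(n)))   # optimized FTRL parameter
    L = np.zeros(n)                            # cumulative loss vector L(t)
    avg = np.zeros(n)
    for _ in range(T):
        w = np.exp(-L / eta - (-L / eta).max())  # softmax, stabilized
        p = w / w.sum()
        avg += p / T
        L += subgrad(p)                        # feed the gradient as loss l(t)
    return avg

# Toy non-smooth objective: f(p) = ||p - q||_1, minimized at p = q.
q = np.array([0.5, 0.3, 0.2])
f = lambda p: np.abs(p - q).sum()
g = lambda p: np.sign(p - q)                   # subgradient, ||.||_inf <= 1
p_avg = mirror_descent_simplex(g, n=3, T=4000, rho=1.0)
```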

APPLICATIONS IN ALGORITHM DESIGN

Warm-up Example: Linear Programming

LP feasibility problem: given A ∈ R^{m×n}, decide ∃? x ∈ X : b − Ax ≥ 0.

Easy constraints (x ∈ X): maintain feasible. Hard constraints (b − Ax ≥ 0): require fixing.

Convert into a non-smooth optimization problem over the simplex:

min_{p ∈ Δ_m} max_{x ∈ X} p^T (b − Ax)

Non-differentiable objective: f(p) = max_{x ∈ X} p^T (b − Ax), whose inner maximizer x_p is the best response to the dual solution p.

The objective admits subgradients: for all p, the best response x_p satisfies p^T (b − Ax_p) ≥ 0 (when the problem is feasible), and the subgradient is the slack in the constraints:

(b − Ax_p) ∈ ∂f(p)

If we can pick x_p such that ‖b − Ax_p‖_∞ ≤ ρ, then the mirror descent bound ε_MD ≤ ρ √(2 · log n) / √T yields the iteration count

T ≤ 2 ρ² log n / ε²
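The LP feasibility scheme can be sketched as follows. This is a hypothetical toy implementation, not the talk's oracle: the easy set X is taken to be the box [0,1]^n, the greedy coordinate-wise best response and the tiny test instance are illustrative assumptions.

```python
import numpy as np

def lp_feasibility_mwu(A, b, T=2000, eps=0.05):
    """Sketch of MWU-based LP feasibility: ?exists x in [0,1]^n with
    b - A x >= 0. The dual p is a distribution over the m constraints,
    updated multiplicatively on the slacks; the primal answer is the
    average of the best responses x_p."""
    m, n = A.shape
    w = np.ones(m)
    x_avg = np.zeros(n)
    for _ in range(T):
        p = w / w.sum()
        # Best response over the box: maximize p^T(b - Ax) coordinate-wise
        x = (p @ A < 0).astype(float)
        x_avg += x / T
        slack = b - A @ x            # subgradient of f at p
        w *= (1 - eps) ** slack      # MWU on the dual: up-weight violated rows
    return x_avg

# Tiny feasible instance: x1 + x2 <= 1.5 and x1 >= 0.2
A = np.array([[1.0, 1.0], [-1.0, 0.0]])
b = np.array([1.5, -0.2])
x = lp_feasibility_mwu(A, b)
violation = np.max(A @ x - b)        # small for a feasible instance
```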

MWU and s-t Maxflow

Maximum flow feasibility for value F over an undirected graph G with incidence matrix B:

∀e ∈ E: F · |f_e| / c_e ≤ 1    (hard constraints)

B^T f = e_s − e_t    (will enforce this)

Turn into a non-smooth minimization problem over the simplex:

f(p) = min_{B^T f = e_s − e_t} Σ_{e ∈ E} p_e · (F · |f_e| / c_e − 1)

The best response f_p is a shortest s-t path with lengths p_e / c_e. For any p, if f_p has length > 1, the problem is infeasible. Otherwise, the following is a subgradient:

∂f(p)_e = F · |(f_p)_e| / c_e − 1

Unfortunately, the width can be large:

‖∂f(p)‖_∞ ≤ F / c_min

[PST 91]: T = O(F log n / (ε² c_min))

Width Reduction: Make the Primal Nicer

PROBLEM: [PST 91] is optimal for this specific formulation.

SOLUTION: Regularize the primal to reduce the width ‖∂f(p)‖_∞ ≤ F / c_min. Regularized objective:

f(p) = min_{B^T f = e_s − e_t} F · Σ_{e ∈ E} (f_e / c_e) · (p_e + ε/m) − 1

This changes the analysis: a PRIMAL ARGUMENT is now needed.

REGULARIZATION ERROR: ε F

NEW WIDTH: ‖∂f(p)‖_∞ ≤ m / ε

ITERATION BOUND: [GK 98] T = O(m log n / ε²)

Electrical Flow Approach [CKMST]

A different formulation yields the basis for the CKMST algorithm:

∀e ∈ E: F · f_e² / c_e² ≤ 1    (hard constraints)

B^T f = e_s − e_t    (will enforce this)

Non-smooth optimization problem:

f(p) = min_{B^T f = e_s − e_t} Σ_{e ∈ E} p_e · (F · f_e² / c_e² − 1)

The best response is an electrical flow f_p. Original width:

‖∂f(p)‖_∞ ≤ m

Regularize the primal:

f(p) = min_{B^T f = e_s − e_t} F · Σ_{e ∈ E} (f_e² / c_e²) · (p_e + ε/m) − 1

NEW WIDTH: ‖∂f(p)‖_∞ ≤ √(m / ε)

Conclusion: Take-away Messages

• Regularization is a powerful tool for the design of fast algorithms.

• Most iterative algorithms can be understood as regularized updates: MWUs, width reduction, interior point methods, gradient descent, …

• They perform well in practice. Regularization also helps eliminate noise.

• ULTIMATE GOAL: development of a library of iterative methods for fast graph algorithms. Regularization plays a fundamental role in this effort.

THE END – THANK YOU
