

Three strategies to derive a dual problem

Ryota Tomioka

May 18, 2010

There are three strategies for deriving a dual problem, namely, (i) equality constraints, (ii) conic constraints, and (iii) Fenchel's duality.

Using a group-lasso regularized support vector machine (=MKL) problem as an example, we see how these strategies can be used to derive dual problems that look different but are actually equivalent.

More specifically, we are interested in a dual of the following problem:

\[
\text{(P)}\quad \operatorname*{minimize}_{w \in \mathbb{R}^n}\ \sum_{i=1}^{m} \ell_H\bigl(x^{(i)\top} w,\, y^{(i)}\bigr) + \lambda \sum_{g \in G} \|w_g\|_2,
\]

where {x^(i), y^(i)}_{i=1}^m (x^(i) ∈ R^n) are training examples, ℓ_H(z_i, y_i) := (1 − y_i z_i)_+ is the hinge loss function, and G is a disjoint partition of {1, 2, …, n}; i.e., ∪_{g∈G} g = {1, 2, …, n}, and g_1, g_2 ∈ G with g_1 ≠ g_2 imply g_1 ∩ g_2 = ∅.

1 Using equality constraints

The most basic technique for deriving a dual problem can be summarized as follows:

1. Find an equality constraint.

2. If you cannot find an equality constraint, introduce auxiliary variables to create an equality constraint.

3. Form a Lagrangian. Introduce Lagrangian multipliers for every equality constraint.

4. Try to minimize the Lagrangian with respect to the primal variables.

5. If the minimization is too hard, introduce more auxiliary variables, and go back to 3.

6. If you can minimize the Lagrangian, check when it takes a finite value and when it becomes −∞. This will give you the dual constraints.
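The recipe can be exercised end-to-end on a toy problem. The following sketch (a problem of our own choosing, purely for illustration) applies steps 3–6 numerically to minimize (1/2)x² subject to x = 1: the Lagrangian is L(x, a) = (1/2)x² + a(1 − x), its minimizer over x is x = a, and the resulting dual function is d(a) = a − (1/2)a².

```python
import numpy as np

# Toy problem (our own illustration, not from the note): following the recipe for
#   minimize (1/2) x^2  subject to  x = 1,
# step 3 gives the Lagrangian L(x, a) = (1/2) x^2 + a (1 - x), and step 4 minimizes
# it over x (minimum at x = a), giving the dual function d(a) = a - (1/2) a^2.

def dual(a):
    x = np.linspace(-5.0, 5.0, 20001)  # step 4: minimize the Lagrangian over x on a grid
    return np.min(0.5 * x**2 + a * (1.0 - x))

a_grid = np.linspace(-3.0, 3.0, 601)
d_vals = np.array([dual(a) for a in a_grid])

# The numeric dual matches the closed form d(a) = a - a^2/2 ...
assert np.allclose(d_vals, a_grid - 0.5 * a_grid**2, atol=1e-4)
# ... and max_a d(a) = 1/2 equals the primal optimum (attained at a = 1).
assert abs(d_vals.max() - 0.5) < 1e-4
```

Here the dual function is finite everywhere because the constraint set is a single point; in the problems below, step 6 is where the dual constraints appear.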

1

Page 2: Three strategies to derive a dual problem

Following the above recipe, we first notice that there is no equality constraint in the primal problem (P). Thus we introduce an auxiliary variable z ∈ R^m and rewrite (P) as follows:

\[
\begin{aligned}
\text{(P1)}\quad \operatorname*{minimize}_{w \in \mathbb{R}^n,\, z \in \mathbb{R}^m}\ & \sum_{i=1}^{m} (1 - z_i)_+ + \lambda \sum_{g \in G} \|w_g\|_2, \\
\text{subject to}\ & y^{(i)} x^{(i)\top} w = z_i \quad (i = 1, \dots, m).
\end{aligned}
\]

Note that the way we introduce equality constraints is not unique. For example, we could have (1 − y_i z_i)_+ in the objective subject to x^(i)⊤w = z_i. Nevertheless, as long as the mapping is one-to-one, this choice is not important. The current choice is made to mimic the most common representation of the SVM dual.

Now we are ready to form the Lagrangian L(w, z, α), where α = (α_i)_{i=1}^m are the Lagrangian multipliers associated with the m equality constraints in (P1). The Lagrangian can be written as follows:

\[
L(w, z, \alpha) = \sum_{i=1}^{m} (1 - z_i)_+ + \lambda \sum_{g \in G} \|w_g\|_2 + \sum_{i=1}^{m} \alpha_i \bigl( z_i - y^{(i)} x^{(i)\top} w \bigr).
\]

The dual function d(α) is obtained by minimizing the Lagrangian L(w, z, α) with respect to the primal variables w and z as follows:

\[
\begin{aligned}
d(\alpha) &= \inf_{w, z} \left( \sum_{i=1}^{m} (1 - z_i)_+ + \lambda \sum_{g \in G} \|w_g\|_2 + \sum_{i=1}^{m} \alpha_i \bigl( z_i - y^{(i)} x^{(i)\top} w \bigr) \right) \\
&= \sum_{i=1}^{m} \inf_{z_i} \bigl( \max(0, 1 - z_i) + \alpha_i z_i \bigr) + \sum_{g \in G} \inf_{w_g \in \mathbb{R}^{|g|}} \left( \lambda \|w_g\|_2 - \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)\top} w_g \right) \\
&= \sum_{i=1}^{m} \inf_{z_i} \max\bigl( \alpha_i z_i,\ (\alpha_i - 1) z_i + 1 \bigr) + \sum_{g \in G} \inf_{w_g \in \mathbb{R}^{|g|}} \left( \lambda \|w_g\|_2 - \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)\top} w_g \right),
\end{aligned}
\]

which takes the finite value Σ_{i=1}^m α_i if the following conditions are satisfied:

\[
\alpha_i \geq 0, \qquad \alpha_i - 1 \leq 0, \qquad \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \right\|_2 \leq \lambda.
\]

Otherwise d(α) = −∞ (a trivial lower bound). Accordingly, we obtain the following dual problem:

\[
\begin{aligned}
\text{(D1)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m}\ & \sum_{i=1}^{m} \alpha_i, \\
\text{subject to}\ & 0 \leq \alpha_i \leq 1 \quad (i = 1, \dots, m), \\
& \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \right\|_2 \leq \lambda \quad (g \in G).
\end{aligned}
\]
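The pair (P)/(D1) can be sanity-checked numerically via weak duality: any dual-feasible α yields a value Σᵢ αᵢ no larger than the primal objective at any w. The following sketch uses randomly generated data of our own (the seed, λ, and the group structure are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 6, 1.0
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # disjoint partition of {0,...,n-1}
X = rng.normal(size=(m, n))                            # rows are the x^(i)
y = rng.choice([-1.0, 1.0], size=m)

def primal(w):
    """Objective of (P): hinge loss plus group-lasso penalty."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()
    return hinge + lam * sum(np.linalg.norm(w[g]) for g in groups)

def dual_feasible(alpha):
    """Feasibility in (D1): box constraints and per-group norm constraints."""
    if np.any(alpha < 0) or np.any(alpha > 1):
        return False
    v = X.T @ (alpha * y)                              # sum_i alpha_i y^(i) x^(i)
    return all(np.linalg.norm(v[g]) <= lam for g in groups)

# Shrink a random alpha in [0,1]^m until the group-norm constraints hold.
alpha = rng.uniform(size=m)
while not dual_feasible(alpha):
    alpha *= 0.5

# Weak duality: the dual objective bounds the primal objective at any w from below.
for _ in range(100):
    assert alpha.sum() <= primal(rng.normal(size=n)) + 1e-9
```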


2 Using conic constraints

The second strategy to derive a dual problem is based on finding a conic structure in the primal problem. A cone K is a subset of a vector space such that if x ∈ K, then tx ∈ K for any nonnegative scalar t.

The most common cone we encounter is the positive orthant; i.e.,

\[
K = \{ x \in \mathbb{R}^n : x \geq 0 \}.
\]

Another commonly used cone is the second-order cone; i.e.,

\[
K = \{ (x_0, x^\top)^\top \in \mathbb{R}^{n+1} : x_0 \geq \|x\|_2 \}.
\]

The dual cone K* of a cone K is defined as follows:

\[
K^* = \{ y \in \mathbb{R}^n : y^\top x \geq 0\ (\forall x \in K) \}.
\]

In other words, the dual cone is the collection of vectors that have nonnegative inner products with all the vectors in K. Note that both the positive orthant and the second-order cone are self-dual; i.e., K* = K.
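Self-duality of the second-order cone can be checked numerically. The sketch below (random sampling, so it is supporting evidence rather than a proof) verifies that any two cone points have a nonnegative inner product, and exhibits a point outside the cone that has a negative inner product with a cone point:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_soc(v):
    """Membership in the second-order cone in R^{n+1}: v0 >= ||v[1:]||_2."""
    return v[0] >= np.linalg.norm(v[1:]) - 1e-12

def random_soc_point(n):
    x = rng.normal(size=n)
    return np.concatenate(([np.linalg.norm(x) + rng.uniform()], x))

# K subset of K*: any two cone points have a nonnegative inner product
# (by Cauchy-Schwarz: x0*z0 >= ||x_bar|| * ||z_bar|| >= -x_bar^T z_bar).
for _ in range(1000):
    x, z = random_soc_point(5), random_soc_point(5)
    assert x @ z >= -1e-9

# K* subset of K: a point outside K fails the dual-cone test for some cone point.
v = np.array([0.5, 1.0, 0.0, 0.0, 0.0, 0.0])   # 0.5 < 1 = tail norm, so v is not in K
w = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])  # in K, and v @ w = -0.5 < 0
assert in_soc(w) and not in_soc(v) and v @ w < 0
```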

Why is a cone useful? Because, when we minimize a Lagrangian and see a term like f(α)⊤x with x ∈ K, we know that the minimum is zero if f(α) ∈ K* and −∞ otherwise: if f(α) ∉ K*, we can find a vector x ∈ K such that f(α)⊤x < 0, and even if f(α)⊤x is only slightly negative, we can take a large scalar t > 0 and drive f(α)⊤(tx) to −∞.

Let us consider a conic programming problem

\[
\begin{aligned}
\text{(PC)}\quad \operatorname*{minimize}_{x \in \mathbb{R}^n}\ & c^\top x, \\
\text{subject to}\ & Ax = b, \quad x \in K,
\end{aligned}
\]

where K is a cone. The dual problem of (PC) can be written as follows:

\[
\begin{aligned}
\text{(DC)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m}\ & b^\top \alpha, \\
\text{subject to}\ & c - A^\top \alpha \in K^*,
\end{aligned}
\]

where K* is the dual cone of K. The derivation of (DC) (and some generalization) is given in Appendix A.
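Weak duality for the pair (PC)/(DC) can be illustrated with K taken as the positive orthant (so K* = K). The sketch below builds a random primal-feasible x and a dual-feasible α of our own construction and checks b⊤α ≤ c⊤x:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 6

# Primal (PC): min c^T x  s.t.  Ax = b, x >= 0  (K = positive orthant, K* = K).
A = rng.normal(size=(m, n))
x = rng.uniform(size=n)              # a primal-feasible point: x >= 0 ...
b = A @ x                            # ... with b chosen so that Ax = b holds
c = rng.uniform(size=n)

# Build a dual-feasible alpha: shrink until c - A^T alpha lies in K* (i.e., >= 0).
alpha = rng.normal(size=m)
while np.any(c - A.T @ alpha < 0):
    alpha *= 0.5

# Weak duality: b^T alpha = (Ax)^T alpha = x^T (A^T alpha) <= x^T c,
# because x >= 0 and c - A^T alpha >= 0.
assert b @ alpha <= c @ x + 1e-9
```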

Now we rewrite the primal problem (P) as a conic programming problem as follows:

\[
\begin{aligned}
\text{(P2)}\quad \operatorname*{minimize}_{w \in \mathbb{R}^n,\ \xi \in \mathbb{R}^m,\ \tilde{\xi} \in \mathbb{R}^m,\ u_g \in \mathbb{R}\ (g \in G)}\ & \sum_{i=1}^{m} \xi_i + \lambda \sum_{g \in G} u_g, \\
\text{subject to}\ & y^{(i)} x^{(i)\top} w + \xi_i - \tilde{\xi}_i = 1, \quad \xi_i, \tilde{\xi}_i \geq 0 \quad (i = 1, \dots, m), \\
& \|w_g\|_2 \leq u_g \quad (g \in G).
\end{aligned}
\]


By defining

\[
\begin{aligned}
x &= (\xi^\top, \tilde{\xi}^\top, u^\top, w^\top)^\top, \\
c &= (\mathbf{1}_m^\top, \mathbf{0}_m^\top, \lambda \mathbf{1}_{|G|}^\top, \mathbf{0}_n^\top)^\top, \\
A &= \begin{pmatrix}
1 & & & -1 & & & & y^{(1)} x^{(1)\top} \\
& \ddots & & & \ddots & & 0 & \vdots \\
& & 1 & & & -1 & & y^{(m)} x^{(m)\top}
\end{pmatrix}, \\
b &= \mathbf{1}_m,
\end{aligned}
\]

we notice that (P2) is a conic programming problem. In fact, the cone K can be written as

\[
K = \bigl\{ (\xi^\top, \tilde{\xi}^\top, u^\top, w^\top)^\top \in \mathbb{R}^{2m+n+|G|} : \xi \geq 0,\ \tilde{\xi} \geq 0,\ u_g \geq \|w_g\|_2\ (\forall g \in G) \bigr\}.
\]

Note that K is self-dual; i.e., K* = K. Accordingly, the dual of (P2) can be written as follows:

\[
\begin{aligned}
\text{(D2)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m}\ & \sum_{i=1}^{m} \alpha_i, \\
\text{subject to}\ & \mathbf{1}_m - \alpha \geq \mathbf{0}_m, \qquad \mathbf{0}_m + \alpha \geq \mathbf{0}_m, \\
& \lambda \geq \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \right\|_2 \quad (\forall g \in G).
\end{aligned}
\]

The dual problem (D2) is clearly equivalent to (D1).
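The block definitions of x, c, A, and b can be verified mechanically: c⊤x must reproduce the (P2) objective, and each row of Ax = b must encode one equality constraint of (P2). A small sketch (random data and group structure are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 4, 6, 1.0
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # assumed partition of {0,...,n-1}
Xd = rng.normal(size=(m, n))                          # rows are the x^(i)
y = rng.choice([-1.0, 1.0], size=m)

# Pick any point of the cone K: xi, xi_tilde >= 0 and u_g >= ||w_g||_2.
w = rng.normal(size=n)
xi, xi_t = rng.uniform(size=m), rng.uniform(size=m)
u = np.array([np.linalg.norm(w[g]) + rng.uniform() for g in groups])

# Stack x = (xi, xi_tilde, u, w) and build c, A, b as in the text
# (the last block of c is 0_n, matching w in R^n).
xvec = np.concatenate([xi, xi_t, u, w])
c = np.concatenate([np.ones(m), np.zeros(m), lam * np.ones(len(groups)), np.zeros(n)])
A = np.hstack([np.eye(m), -np.eye(m), np.zeros((m, len(groups))), y[:, None] * Xd])
b = np.ones(m)

# c^T x reproduces the (P2) objective: sum_i xi_i + lam * sum_g u_g.
assert np.isclose(c @ xvec, xi.sum() + lam * u.sum())
# Row i of Ax encodes y^(i) x^(i)T w + xi_i - xi_tilde_i (set equal to 1 by b).
assert np.allclose(A @ xvec, y * (Xd @ w) + xi - xi_t)
```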

3 Using Fenchel’s duality

Fenchel's duality theorem [Rockafellar, 1970, Theorem 31.1] states that for two proper closed convex functions f and g, we have

\[
\inf_{x \in \mathbb{R}^n} \bigl( f(Ax) + g(x) \bigr) = \sup_{\alpha \in \mathbb{R}^m} \bigl( -f^*(-\alpha) - g^*(A^\top \alpha) \bigr), \tag{1}
\]

where f* and g* are the convex conjugate functions¹ of f and g, respectively. The derivation of Eq. (1) is given in Appendix B.

The problem (P) can be rewritten as follows:

\[
\text{(P3)}\quad \operatorname*{minimize}_{w \in \mathbb{R}^n}\ f(Aw) + g(w),
\]

¹The convex conjugate function f* of a function f is defined as f*(y) = sup_x (y⊤x − f(x)). If f is a proper closed convex function, then f** = f.


where

\[
f(z) = \sum_{i=1}^{m} (1 - z_i)_+, \qquad g(w) = \lambda \sum_{g \in G} \|w_g\|_2, \qquad A = \begin{pmatrix} y^{(1)} x^{(1)\top} \\ \vdots \\ y^{(m)} x^{(m)\top} \end{pmatrix}.
\]

Using Fenchel's duality theorem, the dual problem of (P3) can be written as follows:

\[
\text{(D′3)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m}\ -f^*(-\alpha) - g^*(A^\top \alpha).
\]

The remaining task, therefore, is to compute the convex conjugate functions f* and g*. First we compute f*. By definition,

\[
\begin{aligned}
f^*(-\alpha) &= \sup_{z \in \mathbb{R}^m} \left( -z^\top \alpha - \sum_{i=1}^{m} \max(0, 1 - z_i) \right) \\
&= \sum_{i=1}^{m} \sup_{z_i} \min\bigl( -\alpha_i z_i,\ (1 - \alpha_i) z_i - 1 \bigr) \\
&= \sum_{i=1}^{m} \begin{cases} +\infty & (\text{if } \alpha_i < 0), \\ -\alpha_i & (\text{if } 0 \leq \alpha_i \leq 1), \\ +\infty & (\text{if } \alpha_i > 1). \end{cases}
\end{aligned}
\]

Next, we compute g*. First we show a lower bound on g* as follows:

\[
\begin{aligned}
g^*(y) &= \sup_{w \in \mathbb{R}^n} \left( y^\top w - \lambda \sum_{g \in G} \|w_g\|_2 \right) \\
&= \sum_{g \in G} \sup_{w_g \in \mathbb{R}^{|g|}} \bigl( y_g^\top w_g - \lambda \|w_g\|_2 \bigr) \\
&= \sum_{g \in G} \sup_{t \geq 0} \sup_{w_g : \|w_g\|_2 \leq t} \bigl( y_g^\top w_g - \lambda \|w_g\|_2 \bigr) \\
&\geq \sum_{g \in G} \sup_{t \geq 0} (\|y_g\|_2 - \lambda) t \\
&= \sum_{g \in G} \begin{cases} 0 & (\text{if } \|y_g\|_2 \leq \lambda), \\ +\infty & (\text{otherwise}), \end{cases}
\end{aligned}
\]

where the inequality follows by choosing w_g = t y_g / ‖y_g‖₂.
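The two cases of the conjugate regularizer can also be probed numerically. The sketch below (our own illustration; a crude random search over directions and radii, so the unbounded case is only evidenced, not proved) checks that the conjugate term is exactly 0 when ‖y_g‖₂ ≤ λ and grows without bound otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0

def g_star_term(y_g, trials=10000, radius=100.0):
    """Random-search estimate of sup over w_g of y_g^T w_g - lam * ||w_g||_2."""
    w = rng.normal(size=(trials, y_g.size))
    w = w / np.linalg.norm(w, axis=1, keepdims=True)   # random directions
    w = w * rng.uniform(0.0, radius, size=(trials, 1)) # random radii
    vals = w @ y_g - lam * np.linalg.norm(w, axis=1)
    return max(0.0, vals.max())   # w_g = 0 is always admissible and gives 0

# ||y_g||_2 <= lam: every value is <= 0 by Cauchy-Schwarz, so the sup is exactly 0.
y_in = np.array([0.6, 0.0, 0.8]) * 0.9    # norm 0.9 <= lam
assert g_star_term(y_in) == 0.0

# ||y_g||_2 > lam: unbounded above -- the estimate keeps growing with the radius.
y_out = np.array([0.6, 0.0, 0.8]) * 2.0   # norm 2.0 > lam
assert g_star_term(y_out, radius=1000.0) > g_star_term(y_out, radius=10.0)
```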


Figure 1: The shapes of the convex conjugate functions f*(−α) and g*(y) in 1D: (a) the conjugate hinge loss; (b) the conjugate regularizer.

Next we show that the above lower bound is tight. In fact, if ‖y_g‖₂ ≤ λ, the Cauchy–Schwarz inequality gives y_g⊤w_g ≤ ‖y_g‖₂‖w_g‖₂ ≤ λ‖w_g‖₂, which implies y_g⊤w_g − λ‖w_g‖₂ ≤ 0; the supremum 0 is attained at w_g = 0.

Finally, substituting the above f* and g* into (D′3), we obtain the following dual problem.

\[
\text{(D3)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m}\ \begin{cases} \displaystyle \sum_{i=1}^{m} \alpha_i & \left( \text{if } 0 \leq \alpha_i \leq 1\ (i = 1, \dots, m) \text{ and } \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x_g^{(i)} \right\|_2 \leq \lambda\ (\forall g \in G) \right), \\[2ex] -\infty & (\text{otherwise}). \end{cases}
\]

Note that the dual problem (D3) with the above f* and g* is equivalent to both (D1) and (D2).

Figure 1 shows the rough shapes of the conjugate functions f*(−α) and g*(y).

References

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

A Derivation of (DC) and generalization to arbitrary loss functions

Similarly to the derivation in Sec. 1, the Lagrangian of (PC) is written as follows:

\[
L(x, \alpha) = c^\top x + \alpha^\top (b - Ax).
\]


The dual function d(α) is obtained by minimizing the Lagrangian L(x, α) with respect to x as follows:

\[
d(\alpha) = \inf_{x \in K} \bigl( c^\top x + (b - Ax)^\top \alpha \bigr) = b^\top \alpha + \inf_{x \in K} (c - A^\top \alpha)^\top x.
\]

Note that the minimization with respect to x is constrained to the cone K. Thus, for the infimum to be finite, it is necessary and sufficient that c − A⊤α ∈ K* (recall that, by definition, x⊤y ≥ 0 for any x ∈ K and y ∈ K*, and if y ∉ K* there exists x ∈ K such that x⊤y < 0). If c − A⊤α ∈ K*, the minimum is zero, attained at x = 0. Accordingly, we obtain the dual problem (DC). □

The above conic duality can be generalized to an arbitrary convex loss function f(x) instead of c⊤x. Let us consider the following primal problem:

\[
\begin{aligned}
\text{(P′C)}\quad \operatorname*{minimize}_{x \in \mathbb{R}^n}\ & f(x), \\
\text{subject to}\ & Ax = b, \quad x \in K,
\end{aligned}
\]

where K is a cone. Introducing an auxiliary variable z ∈ R^n, we can rewrite (P′C) as follows:

\[
\begin{aligned}
\text{(P′′C)}\quad \operatorname*{minimize}_{x, z \in \mathbb{R}^n}\ & f(z), \\
\text{subject to}\ & Ax = b, \quad z = x, \quad x \in K.
\end{aligned}
\]

The Lagrangian L(x, z, α, β) of (P′′C) can be written as follows:

\[
L(x, z, \alpha, \beta) = f(z) + \alpha^\top (b - Ax) + \beta^\top (x - z),
\]

where α ∈ R^m and β ∈ R^n are Lagrangian multipliers. The dual function d(α, β) is obtained by minimizing L(x, z, α, β) with respect to x ∈ K and z as follows:

\[
\begin{aligned}
d(\alpha, \beta) &= \inf_{x \in K,\, z} \bigl( f(z) + \alpha^\top (b - Ax) + \beta^\top (x - z) \bigr) \\
&= b^\top \alpha + \inf_{z \in \mathbb{R}^n} \bigl( f(z) - \beta^\top z \bigr) + \inf_{x \in K} (\beta - A^\top \alpha)^\top x \\
&= b^\top \alpha - \sup_{z \in \mathbb{R}^n} \bigl( \beta^\top z - f(z) \bigr) + \inf_{x \in K} (\beta - A^\top \alpha)^\top x \\
&= b^\top \alpha - f^*(\beta) + \inf_{x \in K} (\beta - A^\top \alpha)^\top x,
\end{aligned}
\]

where f* is the convex conjugate of f. Note that the minimization with respect to x takes the finite value zero if and only if β − A⊤α ∈ K* (otherwise d(α, β) = −∞).


Accordingly, the dual problem is written as follows:

\[
\begin{aligned}
\text{(D′′C)}\quad \operatorname*{maximize}_{\alpha \in \mathbb{R}^m,\, \beta \in \mathbb{R}^n}\ & b^\top \alpha - f^*(\beta), \\
\text{subject to}\ & \beta - A^\top \alpha \in K^*.
\end{aligned}
\]

Note that if f(x) = c⊤x, then f*(β) = 0 if β = c, and f*(β) = +∞ otherwise. Therefore, (D′′C) reduces to (DC).

B Derivation of Fenchel’s duality theorem

First, we introduce an equality constraint and rewrite the left-hand side of Eq. (1) as follows:

\[
\begin{aligned}
\text{(PF)}\quad \operatorname*{minimize}_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m}\ & f(z) + g(x), \\
\text{subject to}\ & Ax = z.
\end{aligned}
\]

The Lagrangian L(x, z, α) of the equality-constrained problem (PF) can be written as follows:

\[
L(x, z, \alpha) = f(z) + g(x) + \alpha^\top (z - Ax).
\]

Minimizing the Lagrangian L(x, z, α) with respect to x and z, we obtain the dual function d(α) as follows:

\[
\begin{aligned}
d(\alpha) &= \inf_{x, z} \bigl( f(z) + g(x) + \alpha^\top (z - Ax) \bigr) \\
&= \inf_{z} \bigl( f(z) + \alpha^\top z \bigr) + \inf_{x} \bigl( g(x) - \alpha^\top Ax \bigr) \\
&= -\sup_{z} \bigl( (-\alpha)^\top z - f(z) \bigr) - \sup_{x} \bigl( (A^\top \alpha)^\top x - g(x) \bigr) \\
&= -f^*(-\alpha) - g^*(A^\top \alpha).
\end{aligned}
\]

If both f and g are convex, (PF) satisfies Slater's condition [Boyd and Vandenberghe, 2004] and strong duality holds. Therefore, we have Eq. (1). □
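Eq. (1) can be verified numerically on a quadratic pair where both sides have closed forms. The example below is our own illustration (ridge regression): f(z) = (1/2)‖z − b‖² and g(x) = (1/2)‖x‖², so f*(y) = (1/2)‖y‖² + b⊤y and g*(y) = (1/2)‖y‖², and both sides of Eq. (1) can be computed by solving a linear system:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# f(z) = (1/2)||z - b||^2  and  g(x) = (1/2)||x||^2, so
# f*(y) = (1/2)||y||^2 + b^T y  and  g*(y) = (1/2)||y||^2.

# Left-hand side of Eq. (1): ridge regression, minimized at x = (A^T A + I)^{-1} A^T b.
x = np.linalg.solve(A.T @ A + np.eye(n), A.T @ b)
lhs = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * np.linalg.norm(x) ** 2

# Right-hand side: sup_alpha -f*(-alpha) - g*(A^T alpha); setting the gradient
# b - alpha - A A^T alpha = 0 gives alpha = (I + A A^T)^{-1} b.
alpha = np.linalg.solve(np.eye(m) + A @ A.T, b)
rhs = -(0.5 * np.linalg.norm(-alpha) ** 2 + b @ (-alpha)) \
      - 0.5 * np.linalg.norm(A.T @ alpha) ** 2

assert np.isclose(lhs, rhs)   # the two sides of Eq. (1) agree
```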

Fenchel's duality theorem can be generalized to the following problem:

\[
\text{(P′F)}\quad \operatorname*{minimize}_{x}\ f(Ax) + g(Bx),
\]

whose dual problem can be written as

\[
\begin{aligned}
\text{(D′F)}\quad \operatorname*{maximize}_{\alpha, \beta}\ & -f^*(-\alpha) - g^*(\beta), \\
\text{subject to}\ & A^\top \alpha = B^\top \beta.
\end{aligned}
\]

If B = I_n (the identity matrix), the above duality is equivalent to Fenchel's duality theorem.
