
Page 1: Advanced Methods for Sequence Analysis

G. Rätsch 1, C.S. Ong 1,2 and P. Philips 1

1 Friedrich Miescher Laboratory, Tübingen
2 Max Planck Institute for Biological Cybernetics, Tübingen

Lecture, winter semester 2006/2007
Eberhard Karls Universität Tübingen

10 January 2007

http://www.fml.mpg.de/raetsch/lectures/amsa

Page 2: Convex Optimization

Machine Learning as Numerical Optimization

Unconstrained Optimization

Gradient Descent
Newton Method

Constrained Optimization

Some History
Convex Functions and Convex Sets
Common Problem Formulations
KKT conditions

http://www.stanford.edu/~boyd/cvxbook/

Page 3: Recall the SVM

1. How are examples represented?

2. How are labels represented?

3. What are the inputs to the SVM?

4. What does SVM training output?

Page 4: Recall the SVM

1. How are examples represented?

By the kernel matrix K.

2. How are labels represented?

By the label vector y.

3. What are the inputs to the SVM?

K, y

4. What does SVM training output?

α

Recall: Representer Theorem

Kα = y

Page 5: Linear Equations

Motivation
Kα = y

Linear Algebra [Golub and van Loan, 1996]

Gaussian Elimination
LU Factorization
QR Factorization
Eigenvalue methods
Lanczos methods
Conjugate gradient methods
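To make this concrete, here is a minimal sketch (the grid data, labels and RBF kernel are assumed toy choices, not from the lecture) that forms a kernel matrix and solves Kα = y with a dense linear-algebra routine, as in the Gaussian-elimination/LU family above:

import numpy as np

# Toy setup (assumed): 20 points on a grid, binary labels, RBF kernel matrix K.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sign(X[:, 0])
K = np.exp(-(X - X.T) ** 2 / (2 * 0.25 ** 2))   # kernel matrix, 20 x 20

# "Training" here amounts to solving the linear system K alpha = y.
alpha = np.linalg.solve(K, y)
print(np.linalg.norm(K @ alpha - y))            # residual is essentially zero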

Page 6: Variational Formulation

Motivation
Minimize a scalar function instead of solving a matrix equation.

Objective

min_α  ½ αᵀKα − yᵀα

The gradient at optimality

Kα − y = 0

gives the original problem.

Observations

The second derivative of the objective is K. Since K is positive semidefinite, the problem is convex.
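A hedged numerical check of this equivalence, reusing the assumed toy kernel matrix from the previous sketch: minimizing ½ αᵀKα − yᵀα with a generic conjugate-gradient optimizer recovers the solution of Kα = y.

import numpy as np
from scipy.optimize import minimize

# Same assumed toy setup as before.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sign(X[:, 0])
K = np.exp(-(X - X.T) ** 2 / (2 * 0.25 ** 2))

obj = lambda a: 0.5 * a @ K @ a - y @ a     # ½ αᵀKα − yᵀα
grad = lambda a: K @ a - y                  # its gradient, Kα − y

res = minimize(obj, x0=np.zeros(20), jac=grad, method="CG")
print(np.linalg.norm(K @ res.x - y))        # small: the minimizer solves Kα = y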

Page 7: Gradient Descent

Gradient
For a multivariate function f : Rn → R, define the gradient of f to be

∇f(x) = ( ∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn ).

Algorithm [Schölkopf and Smola, 2002]
For an initial value x0 and precision ε:
k = 0
while ‖∇f(xk)‖ > ε
    compute g = ∇f(xk)
    perform a line search on f(xk − γg) for the optimal step size γ
    xk+1 = xk − γg
    k = k + 1
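A minimal Python sketch of this loop (the quadratic test objective and the use of scipy's 1-D minimizer for the line search are assumptions for illustration):

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad, x0, eps=1e-6, max_iter=1000):
    # Gradient descent with a 1-D line search, following the loop above.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        gamma = minimize_scalar(lambda t: f(x - t * g)).x   # line search on γ
        x = x - gamma * g
    return x

# Assumed toy quadratic: the variational objective ½ αᵀKα − yᵀα from page 6.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, -1.0])
alpha = gradient_descent(lambda a: 0.5 * a @ K @ a - y @ a,
                         lambda a: K @ a - y,
                         x0=np.zeros(2))
print(alpha, np.linalg.solve(K, y))   # the two should agree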

Page 8: Newton Method (1)

Hessian
For a multivariate function f : Rn → R,

∇²f(x)ij = ∂²f(x)/∂xi∂xj,   i, j = 1, . . . , n

Motivation
The second-order Taylor approximation f̂ of f at x is

f̂(x + v) = f(x) + ∇f(x)ᵀv + ½ vᵀ∇²f(x)v,

which is a convex quadratic function of v, with minimizer

v* = −∇²f(x)⁻¹∇f(x).

Newton step
The vector v* above is called the Newton step for f at x.

Page 9: Newton Method (2)

Motivation

Algorithm
For an initial value x0 and precision ε:
k = 0
while ‖∇f(xk)‖ > ε
    xk+1 = xk − ∇²f(xk)⁻¹∇f(xk)
    k = k + 1
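A corresponding sketch of the Newton iteration (the quartic test function is an assumed example; the Newton step is computed with a linear solve rather than an explicit matrix inverse):

import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=50):
    # Newton's method as above: step = -hess(x)^{-1} grad(x).
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# Assumed test function: f(x) = x1^4 + x2^2, minimized at the origin.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, x0=[1.0, 1.0]))   # close to (0, 0)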

Page 10: Non-smooth functions

Motivation
Non-differentiable functions f have to be treated carefully. For example, the hinge loss

ℓ(f(xi), yi) := max{0, 1 − yi f(xi)}.

Piecewise differentiable
For piecewise differentiable functions, we can treat each piece individually, and then we can apply the above methods.
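As a small illustration of treating the pieces separately (a sketch; picking 0 at the kink is one convention, and any value in [−1, 0] would be a valid subgradient there):

import numpy as np

# Hinge loss as a function of the margin m = y * f(x), and one subgradient.
def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def hinge_subgrad(m):
    # -1 on the piece m < 1, 0 on the piece m > 1; at m = 1 we pick 0.
    return np.where(m < 1.0, -1.0, 0.0)

m = np.array([-0.5, 0.0, 0.5, 1.0, 2.0])
print(hinge(m))          # [1.5 1.  0.5 0.  0. ]
print(hinge_subgrad(m))  # [-1. -1. -1.  0.  0.]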

Page 11: Constrained Optimization

min_x  f0(x)
subject to  fi(x) ≤ 0 for all i
            gj(x) = 0 for all j

x ∈ Rn is the optimization variable

f0 : Rn → R is the objective or cost function

fi : Rn → R are inequality constraint functions

gj : Rn → R are equality constraint functions

Page 12: Some History

Theory (Convex Analysis) ca. 1900–1970

Algorithms

1947: simplex algorithm for linear programming (Dantzig)
1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )
1970s: ellipsoid method and other subgradient methods
1980s: polynomial-time interior-point methods for linear programming (Karmarkar 1984)
late 1980s–now: polynomial-time interior-point methods for nonlinear convex optimization (Nesterov & Nemirovski 1994)

Page 13: Convex Set

line segment between x1 and x2: all points

x = θx1 + (1− θ)x2

with 0 ≤ θ ≤ 1.

convex set: contains the line segment between any two points in the set

x1, x2 ∈ C, 0 ≤ θ ≤ 1 ⟹ x = θx1 + (1 − θ)x2 ∈ C.

Examples (one convex, two non-convex sets)
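A quick numerical illustration of the definition (the unit disc and unit circle are assumed toy examples): convex combinations of points in the disc stay in the disc, while combinations of points on the circle generally leave it.

import numpy as np

rng = np.random.default_rng(0)

def in_disc(x):
    return np.linalg.norm(x) <= 1.0              # the disc {x : ‖x‖ ≤ 1} is convex

def in_circle(x):
    return np.isclose(np.linalg.norm(x), 1.0)    # the circle ‖x‖ = 1 is not

def sample_disc():
    v = rng.uniform(-1.0, 1.0, size=2)
    return 0.99 * v / max(1.0, np.linalg.norm(v))

def sample_circle():
    v = rng.normal(size=2)
    return v / np.linalg.norm(v)

def looks_convex(contains, sample, trials=1000):
    for _ in range(trials):
        x1, x2, theta = sample(), sample(), rng.uniform()
        if not contains(theta * x1 + (1 - theta) * x2):
            return False
    return True

print(looks_convex(in_disc, sample_disc))      # True
print(looks_convex(in_circle, sample_circle))  # False (almost surely)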

Page 14: How to check

Use definition
To show that a set C is convex, show that C is obtained from simple convex sets (where we establish convexity by the definition).

Applying operations that preserve convexity

intersection
affine functions
perspective functions
linear-fractional functions

Page 15: Convex Function

f : Rn → R is convex if the domain of f is a convex set and

f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2)

for all x1, x2 in the domain of f, 0 ≤ θ ≤ 1.

affine: ax + b on R, for any a, b ∈ R.

affine: aᵀx + b on Rn, for any a ∈ Rn, b ∈ R.

exponential: exp(ax), for any a ∈ R.

powers: x^a on R++, for a ≥ 1 or a ≤ 0.

powers of absolute value: |x|^a on R, for a ≥ 1.

negative entropy: x log x on R++.

norms: ‖x‖p = (∑_{i=1}^n |xi|^p)^(1/p) for p ≥ 1.

Page 16: How to check (1)

Restrict to a line
f : Rn → R is convex if and only if the function g : R → R,

g(t) = f(x + tv),   dom g = {t | x + tv ∈ dom f},

is convex (in t) for any x ∈ dom f, v ∈ Rn.

First-order condition
The first-order approximation of f is a global underestimator. With the gradient

∇f(x) = ( ∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn ),

f with convex domain is convex if and only if

f(x2) ≥ f(x1) + ∇f(x1)ᵀ(x2 − x1)

for all x1, x2 ∈ dom f.
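A spot check of the first-order condition for an assumed convex example, f(x) = ‖x‖² with gradient 2x:

import numpy as np

# f(x2) ≥ f(x1) + ∇f(x1)ᵀ(x2 − x1) should hold for all pairs of points.
f = lambda x: x @ x
grad = lambda x: 2 * x

rng = np.random.default_rng(0)
ok = all(
    f(x2) >= f(x1) + grad(x1) @ (x2 - x1)
    for x1, x2 in (rng.normal(size=(2, 3)) for _ in range(1000))
)
print(ok)   # True: the tangent plane is a global underestimator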

Page 17: How to check (2)

Second-order condition
The Hessian is positive semidefinite:

∇²f(x)ij = ∂²f(x)/∂xi∂xj,   i, j = 1, . . . , n

f is convex if and only if

∇²f(x) ⪰ 0 for all x ∈ dom f.
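In practice one can check this numerically via the eigenvalues of the Hessian; for instance, for the quadratic objective ½ αᵀKα − yᵀα from page 6 the Hessian is K itself (toy K assumed):

import numpy as np

# The Hessian of ½ αᵀKα − yᵀα is K, so convexity ⇔ K is positive semidefinite.
K = np.array([[2.0, 0.5], [0.5, 1.0]])            # assumed toy kernel matrix
eigvals = np.linalg.eigvalsh(K)                   # eigenvalues of the symmetric K
print(eigvals, bool(np.all(eigvals >= -1e-10)))   # all ≥ 0 ⇒ ∇²f ⪰ 0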

Page 18: How to check (3)

Show that f is obtained from simple convex functions by operations that preserve convexity:

nonnegative weighted sum

composition with affine function

pointwise maximum and supremum

composition

minimization

perspective

Page 19: Convex Optimization

Constrained Optimization (generally hard)

min_x  f0(x)
subject to  fi(x) ≤ 0 for all i
            gj(x) = 0 for all j

Convex Optimization (generally easy)

min_x  f0(x)
subject to  fi(x) ≤ 0 for all i
            ajᵀx = bj for all j

f0, f1, . . . , fm are convex, and the equality constraints are affine [Boyd and Vandenberghe, 2004].

Page 20: Linear Program

min_x  cᵀx + d
subject to  Gx ≤ h
            Ax = b
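Such LPs can be handed to standard solvers; a minimal sketch with scipy's linprog (the numbers are an assumed toy instance, and the constant d is dropped since it does not affect the minimizer):

import numpy as np
from scipy.optimize import linprog

# Toy LP (assumed):  minimize cᵀx  subject to  Gx ≤ h,  Ax = b.
c = np.array([1.0, 2.0])
G = -np.eye(2)                # encodes x ≥ 0 as −x ≤ 0
h = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_ub=G, b_ub=h, A_eq=A, b_eq=b)
print(res.x, res.fun)         # expect x = [1, 0] with objective value 1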

Page 21: Quadratic Program

min_x  ½ xᵀPx + qᵀx + r
subject to  Gx ≤ h
            Ax = b
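A corresponding sketch with a modelling tool (cvxpy, with assumed toy data; the constant r is dropped). Any QP solver accepting the (P, q, G, h, A, b) data would serve equally well:

import numpy as np
import cvxpy as cp

# Toy QP (assumed):  minimize ½ xᵀPx + qᵀx  subject to  Gx ≤ h,  Ax = b.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([-1.0, 1.0])
G = -np.eye(2)
h = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = cp.Variable(2)
problem = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x),
                     [G @ x <= h, A @ x == b])
problem.solve()
print(x.value, problem.value)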

Page 22: Some other programs

Quadratically constrained quadratic program

min_x  ½ xᵀP0x + q0ᵀx + r0
subject to  ½ xᵀPix + qiᵀx + ri ≤ 0 for all i
            Ax = b

Second order cone program

min_x  fᵀx
subject to  ‖Aix + bi‖2 ≤ ciᵀx + di for all i
            Fx = g

Semidefinite program

min_x  cᵀx
subject to  x1F1 + x2F2 + . . . + xnFn + G ⪯ 0
            Ax = b

Page 23: Problem reformulations

equivalent formulations of a problem can lead to very different duals

reformulating the primal problem can be useful when the dual is difficult to derive, or uninteresting

Common reformulations

introduce new variables and equality constraints
make explicit constraints implicit and vice versa
transform objective or constraint functions

Page 24: Equivalent convex problems (1)

Two problems are (informally) equivalent if the solution of one is readily obtained from the solution of the other, and vice versa.

Eliminating equality constraints

min_x  f0(x)
subject to  fi(x) ≤ 0 for all i
            Ax = b

is equivalent to

min_z  f0(Fz + x0)
subject to  fi(Fz + x0) ≤ 0 for all i

where F and x0 are such that Ax = b ⇔ x = Fz + x0 for some z.

Page 25: Equivalent convex problems (2)

Introducing Equality Constraints

min_x  f0(A0x + b0)
subject to  fi(Aix + bi) ≤ 0 for all i

is equivalent to

min_{x,yi}  f0(y0)
subject to  fi(yi) ≤ 0 for all i
            yi = Aix + bi

Page 26: Equivalent convex problems (3)

Introducing slack variables

min_x  f0(x)
subject to  aiᵀx ≤ bi for all i

is equivalent to

min_{x,s}  f0(x)
subject to  aiᵀx + si = bi for all i
            si ≥ 0

Page 27: Equivalent convex problems (4)

Epigraph form

min_x  f0(x)
subject to  fi(x) ≤ 0 for all i
            Ax = b

is equivalent to

min_{x,t}  t
subject to  f0(x) − t ≤ 0
            fi(x) ≤ 0 for all i
            Ax = b

Page 28: Equivalent convex problems (5)

Minimizing over some variables

min_{x1,x2}  f0(x1, x2)
subject to  fi(x1) ≤ 0 for all i

is equivalent to

min_{x1}  f̃0(x1)
subject to  fi(x1) ≤ 0 for all i

where f̃0(x1) = inf_{x2} f0(x1, x2).

Page 29: Lagrange Duality

Lagrangian
L : Rn × Rm × Rp → R with

L(x, λ, ν) = f0(x) + ∑_{i=1}^m λi fi(x) + ∑_{j=1}^p νj gj(x).

weighted sum of objective and constraint functions
λi is the Lagrange multiplier associated with fi(x) ≤ 0.
νj is the Lagrange multiplier associated with gj(x) = 0.

Page 30: Dual function

Lagrange dual function
h : Rm × Rp → R,

h(λ, ν) = inf_x L(x, λ, ν).

h is concave.

Lagrange dual problem

max_{λ,ν}  h(λ, ν)
subject to  λ ≥ 0
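A tiny worked example (assumed, not from the slides): for the primal problem min x² subject to 1 − x ≤ 0, the Lagrangian is L(x, λ) = x² + λ(1 − x); minimizing over x gives x = λ/2 and hence h(λ) = λ − λ²/4. The sketch below maximizes h over λ ≥ 0 and recovers the primal optimum (strong duality holds here).

from scipy.optimize import minimize_scalar

# Dual function of the toy problem  min x²  s.t.  1 − x ≤ 0.
h = lambda lam: lam - lam ** 2 / 4.0

# Dual problem: maximize h(λ) subject to λ ≥ 0.
res = minimize_scalar(lambda lam: -h(lam), bounds=(0.0, 10.0), method="bounded")
print(res.x, h(res.x))   # λ* = 2, dual value 1 = primal optimum attained at x* = 1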

Page 31: Checking for Optimality

If x, λ, ν satisfy the KKT conditions for a convex problem, then they are optimal. The following four conditions are called the Karush-Kuhn-Tucker (KKT) conditions:

primal constraints: fi(x) ≤ 0, gj(x) = 0

dual constraints: λ ≥ 0

complementary slackness: λi fi(x) = 0

gradient of Lagrangian with respect to x vanishes:

∇f0(x) + ∑_i λi ∇fi(x) + ∑_j νj ∇gj(x) = 0
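Continuing the assumed toy example from page 30 (min x² s.t. 1 − x ≤ 0), the candidate x = 1, λ = 2 satisfies all four conditions, confirming optimality; a quick check:

import numpy as np

# Toy problem:  f0(x) = x²,  f1(x) = 1 − x ≤ 0,  no equality constraints.
x, lam = 1.0, 2.0
print(1 - x <= 0)                       # primal feasibility
print(lam >= 0)                         # dual feasibility
print(np.isclose(lam * (1 - x), 0.0))   # complementary slackness
print(np.isclose(2 * x - lam, 0.0))     # stationarity: ∇f0 + λ∇f1 = 2x − λ = 0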

Page 32: Summary

Machine Learning as Numerical Optimization

Unconstrained Optimization

Gradient Descent
Newton Method

Constrained Optimization

Convex Functions and Convex Sets
Common Problem Formulations
KKT conditions

http://www.stanford.edu/~boyd/cvxbook/

Page 33: References

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns Hopkins, 3rd edition, 1996.

B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.