
The Foundations of (Machine) Learning

Stefan Kühn

codecentric AG

CSMLS Meetup Hamburg - February 26th, 2015


Contents

1 Supervised Machine Learning

2 Optimization Theory

3 Concrete Optimization Methods


1 Supervised Machine Learning

Setting

Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to learn from the given data.

Typical Problems:

Classification - discrete output
  - Logistic Regression
  - Neural Networks
  - Support Vector Machines

Regression - continuous output
  - Linear Regression
  - Support Vector Regression
  - Generalized Linear/Additive Models


Training and Learning

Ingredients:
  - Training Data Set
  - Model, e.g. Logistic Regression
  - Error Measure, e.g. Mean Squared Error

Learning Procedure:
  - Derive objective function from Model and Error Measure
  - Initialize Model parameters
  - Find a good fit!
  - Iterate with other initial parameters

What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained optimization to the given objective function.
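To make this concrete, here is a minimal Python/NumPy sketch of that recipe (illustrative only, not code from the talk), assuming a linear model and mean squared error as the error measure; learning then reduces to handing the objective to an off-the-shelf unconstrained optimizer:

import numpy as np
from scipy.optimize import minimize

# Labeled training data (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Model: linear with parameters w; Error measure: mean squared error
def objective(w):
    residual = X @ w - y
    return np.mean(residual ** 2)

# Learning = unconstrained minimization of the objective
w0 = np.zeros(3)                  # initialize model parameters
result = minimize(objective, w0)  # find a good fit
print(result.x)                   # learned parameters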


2 Optimization Theory

Unconstrained Optimization

Higher-order methods
  - Newton's method (fast local convergence)

Gradient-based methods
  - Gradient Descent / Steepest Descent (globally convergent)
  - Conjugate Gradient (globally convergent)
  - Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
  - Krylov subspace methods

Derivative-free methods, direct search
  - Secant method (locally convergent)
  - Regula Falsi and successors (global convergence, typically slow)
  - Nelder-Mead / Downhill-Simplex
      - unconventional method, creates a moving simplex
      - driven by reflection/contraction/expansion of the corner points
      - globally convergent for differentiable functions f ∈ C1
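As an aside (not on the slides), most of these families are available through SciPy's generic minimize interface; a small sketch on the Rosenbrock test function:

from scipy.optimize import minimize, rosen, rosen_der

x0 = [1.3, 0.7, 0.8, 1.9, 1.2]

# Derivative-free direct search (moving simplex)
print(minimize(rosen, x0, method="Nelder-Mead").x)

# Gradient-based: nonlinear Conjugate Gradient and quasi-Newton (BFGS)
print(minimize(rosen, x0, method="CG", jac=rosen_der).x)
print(minimize(rosen, x0, method="BFGS", jac=rosen_der).x)

# Higher-order: Newton-type method
print(minimize(rosen, x0, method="Newton-CG", jac=rosen_der).x)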


General Iterative Algorithmic Scheme

Goal: Minimize a given function f:

min f(x), x ∈ R^n

Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.

Preparation: k = 0
  - Initialization: Choose initial points and parameters

Iterate until convergence: k = 1, 2, 3, . . .
  - Termination criterion: Check optimality of the current iterate
  - Descent Direction: Find a reasonable search direction
  - Stepsize: Determine the length of the step in the given direction
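A schematic rendering of this scheme in Python (an illustrative sketch; direction_fn and stepsize_fn are placeholders for the concrete choices discussed on the following slides):

import numpy as np

def iterative_minimize(f, grad, x0, direction_fn, stepsize_fn,
                       tol=1e-6, maxiter=1000):
    """Generic scheme: check optimality, pick a descent direction, pick a stepsize."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # termination criterion
            break
        d = direction_fn(x, g)           # descent direction
        t = stepsize_fn(f, grad, x, d)   # stepsize along d
        x = x + t * d
    return x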


Termination criteria

Critical points x∗:

∇f (x∗) = 0

Gradient: Should converge to zero

‖∇f(x_k)‖ < tol

Iterates: Distance between x_k and x_{k+1} should converge to zero

‖x_k − x_{k+1}‖ < tol

Function Values: Difference between f(x_k) and f(x_{k+1}) should converge to zero

|f(x_k) − f(x_{k+1})| < tol

Number of iterations: Terminate after maxiter iterations
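Combined into one helper (an illustrative sketch; the function name and arguments are made up for this example):

import numpy as np

def converged(f, x_old, x_new, g_new, tol=1e-8):
    """Combine the standard termination criteria for the current iterate."""
    return (np.linalg.norm(g_new) < tol                 # small gradient
            or np.linalg.norm(x_new - x_old) < tol      # iterates stagnate
            or abs(f(x_new) - f(x_old)) < tol)          # function values stagnate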



Descent direction

Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:

π/2 = 90° < α < 270° = 3π/2

Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:

〈a, b〉 = a^T b = ‖a‖‖b‖ cos α(a, b)

d is a descent direction if and only if:

d^T ∇f(x) < 0
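In code, the algebraic test is a single sign check on the inner product (illustrative sketch):

import numpy as np

def is_descent_direction(grad_x, d):
    """d is a descent direction at x iff d^T grad f(x) < 0."""
    return float(np.dot(d, grad_x)) < 0.0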



Stepsize

Armijo's rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.

For ℓ = 0, 1, 2, . . . test the Armijo condition:

f(p + σ^ℓ d) < f(p) + ρ σ^ℓ d^T ∇f(p)

Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize

t = σ^ℓ

Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, which makes it only semi-efficient.

Technical detail: Widening
Testing whether some t > 1 satisfies the Armijo condition, i.e. checking ℓ = −1, −2, . . . as well, ensures efficiency.
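A sketch of Armijo backtracking including the widening step (illustrative; the parameter names follow the slide, the helper name armijo_stepsize is made up):

import numpy as np

def armijo_stepsize(f, grad, p, d, sigma=0.5, rho=1e-2,
                    max_widen=30, max_backtrack=60):
    """Return t = sigma**l satisfying f(p + t*d) < f(p) + rho*t*d^T grad f(p)."""
    fp = f(p)
    slope = rho * float(np.dot(d, grad(p)))   # rho * d^T grad f(p), negative for a descent direction
    def ok(t):
        return f(p + t * d) < fp + t * slope
    # Widening: try l = -1, -2, ... (stepsizes > 1) as long as the condition still holds
    l = 0
    while l > -max_widen and ok(sigma ** (l - 1)):
        l -= 1
    if l < 0:
        return sigma ** l
    # Standard backtracking: l = 0, 1, 2, ...
    while l < max_backtrack and not ok(sigma ** l):
        l += 1
    return sigma ** l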


3 Concrete Optimization Methods

Gradient Descent

Descent direction
Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Motivation:
  - corresponds to α = 180° = π
  - obvious choice, always a descent direction, no test needed
  - guarantees the quickest win locally
  - works with inexact line search, e.g. Armijo's rule
  - works for functions f ∈ C1
  - always solves the auxiliary optimization problem

min s^T ∇f(x), s ∈ R^n, ‖s‖ = 1
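Putting the pieces together (illustrative sketch; it reuses the armijo_stepsize helper sketched above):

import numpy as np

def gradient_descent(f, grad, x0, tol=1e-6, maxiter=10_000):
    """Steepest descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                  # steepest descent direction
        t = armijo_stepsize(f, grad, x, d)      # inexact line search
        x = x + t * d
    return x

# Example: minimize f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(f, grad, [3.0, -2.0]))   # converges to approximately [0, 0]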


Conjugate Gradient

Motivation: Quadratic Model Problem, minimize

f(x) = ‖Ax − b‖^2

Optimality condition:

∇f(x∗) = 2 A^T (Ax∗ − b) = 0

Obvious approach: Solve the system of linear equations

A^T A x = A^T b

Descent direction
Consecutive directions d_i, . . . , d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = A^T A symmetric positive definite:

d_i^T M d_j = 0, i ≠ j
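For the quadratic model problem this leads to the classical linear CG iteration; a minimal dense NumPy sketch (illustrative, no preconditioning):

import numpy as np

def conjugate_gradient(M, rhs, x0, tol=1e-10, maxiter=None):
    """Solve M x = rhs for symmetric positive definite M (here M = A^T A, rhs = A^T b)."""
    x = np.asarray(x0, dtype=float)
    r = rhs - M @ x            # residual = negative gradient of the quadratic
    d = r.copy()               # first search direction
    maxiter = maxiter or len(rhs)
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Md = M @ d
        t = (r @ r) / (d @ Md)          # exact stepsize along d
        x = x + t * d
        r_new = r - t * Md
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # M-conjugate to the previous directions
        r = r_new
    return x

# Usage for the least-squares problem min ‖Ax − b‖^2:
# x = conjugate_gradient(A.T @ A, A.T @ b, np.zeros(A.shape[1]))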


Nonlinear Conjugate Gradient

Initial Steps:
  - start at point x_0 with d_0 = −∇f(x_0)
  - perform exact line search, find

    t_0 = argmin f(x_0 + t d_0), t > 0

  - set x_1 = x_0 + t_0 d_0

Iteration:
  - set Δ_k = −∇f(x_k)
  - compute β_k via one of the available formulas (next slide)
  - update the conjugate search direction d_k = Δ_k + β_k d_{k−1}
  - perform exact line search, find

    t_k = argmin f(x_k + t d_k), t > 0

  - set x_{k+1} = x_k + t_k d_k
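An illustrative sketch of this iteration; the exact line search is approximated numerically with scipy.optimize.minimize_scalar, and beta_fn stands for one of the formulas on the next slide:

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, beta_fn, tol=1e-6, maxiter=1000):
    """Nonlinear CG with a pluggable beta formula."""
    x = np.asarray(x0, dtype=float)
    delta = -grad(x)           # Delta_k = -grad f(x_k)
    d = delta.copy()           # d_0 = -grad f(x_0)
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # line search: t_k = argmin_{t>0} f(x_k + t d_k)
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, 1e3), method="bounded").x
        x = x + t * d
        delta_new = -grad(x)
        beta = beta_fn(delta_new, delta, d)
        d = delta_new + beta * d
        delta = delta_new
    return x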


Nonlinear Conjugate Gradient

Formulas for β_k, with s_{k−1} the previous search direction:

Fletcher-Reeves:

β^FR_k = (Δ_k^T Δ_k) / (Δ_{k−1}^T Δ_{k−1})

Polak-Ribière:

β^PR_k = (Δ_k^T (Δ_k − Δ_{k−1})) / (Δ_{k−1}^T Δ_{k−1})

Hestenes-Stiefel:

β^HS_k = − (Δ_k^T (Δ_k − Δ_{k−1})) / (s_{k−1}^T (Δ_k − Δ_{k−1}))

Dai-Yuan:

β^DY_k = − (Δ_k^T Δ_k) / (s_{k−1}^T (Δ_k − Δ_{k−1}))

Reasonable choice with automatic direction reset:

β = max{0, β^PR_k}
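The same formulas as small functions (illustrative; arguments are Δ_k, Δ_{k−1} and the previous search direction s_{k−1}), matching the beta_fn slot in the nonlinear CG sketch above:

import numpy as np

def beta_fletcher_reeves(delta_k, delta_prev, s_prev):
    return (delta_k @ delta_k) / (delta_prev @ delta_prev)

def beta_polak_ribiere(delta_k, delta_prev, s_prev):
    return (delta_k @ (delta_k - delta_prev)) / (delta_prev @ delta_prev)

def beta_hestenes_stiefel(delta_k, delta_prev, s_prev):
    return -(delta_k @ (delta_k - delta_prev)) / (s_prev @ (delta_k - delta_prev))

def beta_dai_yuan(delta_k, delta_prev, s_prev):
    return -(delta_k @ delta_k) / (s_prev @ (delta_k - delta_prev))

def beta_pr_plus(delta_k, delta_prev, s_prev):
    # automatic direction reset: beta = max{0, beta^PR}
    return max(0.0, beta_polak_ribiere(delta_k, delta_prev, s_prev))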
