
The Foundations of (Machine) Learning

Stefan Kühn

codecentric AG

CSMLS Meetup Hamburg - February 26th, 2015


Contents

1 Supervised Machine Learning

2 Optimization Theory

3 Concrete Optimization Methods


1 Supervised Machine Learning

Setting

Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to learn from the given data.

Typical Problems:

Classification - discrete output
  - Logistic Regression
  - Neural Networks
  - Support Vector Machines

Regression - continuous output
  - Linear Regression
  - Support Vector Regression
  - Generalized Linear/Additive Models


Training and Learning

Ingredients:
  - Training Data Set
  - Model, e.g. Logistic Regression
  - Error Measure, e.g. Mean Squared Error

Learning Procedure:
  - Derive objective function from Model and Error Measure
  - Initialize Model parameters
  - Find a good fit!
  - Iterate with other initial parameters

What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained optimization to the given objective function.
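To make this concrete, here is a minimal Python/NumPy sketch of that recipe (illustrative only, not code from the talk), assuming a linear model and mean squared error as the error measure; learning then reduces to handing the objective to an off-the-shelf unconstrained optimizer:

import numpy as np
from scipy.optimize import minimize

# Labeled training data (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Model: linear with parameters w; Error measure: mean squared error
def objective(w):
    residual = X @ w - y
    return np.mean(residual ** 2)

# Learning = unconstrained minimization of the objective
w0 = np.zeros(3)                  # initialize model parameters
result = minimize(objective, w0)  # find a good fit
print(result.x)                   # learned parameters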


2 Optimization Theory

Unconstrained Optimization

Higher-order methods
  - Newton's method (fast local convergence)

Gradient-based methods
  - Gradient Descent / Steepest Descent (globally convergent)
  - Conjugate Gradient (globally convergent)
  - Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
  - Krylov subspace methods

Derivative-free methods, direct search
  - Secant method (locally convergent)
  - Regula Falsi and successors (global convergence, typically slow)
  - Nelder-Mead / Downhill-Simplex
      - unconventional method, creates a moving simplex
      - driven by reflection/contraction/expansion of the corner points
      - globally convergent for differentiable functions f ∈ C1
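As an aside (not on the slides), most of these families are available through SciPy's generic minimize interface; a small sketch on the Rosenbrock test function:

from scipy.optimize import minimize, rosen, rosen_der

x0 = [1.3, 0.7, 0.8, 1.9, 1.2]

# Derivative-free direct search (moving simplex)
print(minimize(rosen, x0, method="Nelder-Mead").x)

# Gradient-based: nonlinear Conjugate Gradient and quasi-Newton (BFGS)
print(minimize(rosen, x0, method="CG", jac=rosen_der).x)
print(minimize(rosen, x0, method="BFGS", jac=rosen_der).x)

# Higher-order: Newton-type method
print(minimize(rosen, x0, method="Newton-CG", jac=rosen_der).x)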


General Iterative Algorithmic Scheme

Goal: Minimize a given function f:

min f(x), x ∈ R^n

Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.

Preparation: k = 0
  - Initialization: Choose initial points and parameters

Iterate until convergence: k = 1, 2, 3, . . .
  - Termination criterion: Check optimality of the current iterate
  - Descent Direction: Find a reasonable search direction
  - Stepsize: Determine the length of the step in the given direction
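A schematic rendering of this scheme in Python (an illustrative sketch; direction_fn and stepsize_fn are placeholders for the concrete choices discussed on the following slides):

import numpy as np

def iterative_minimize(f, grad, x0, direction_fn, stepsize_fn,
                       tol=1e-6, maxiter=1000):
    """Generic scheme: check optimality, pick a descent direction, pick a stepsize."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # termination criterion
            break
        d = direction_fn(x, g)           # descent direction
        t = stepsize_fn(f, grad, x, d)   # stepsize along d
        x = x + t * d
    return x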


Termination criteria

Critical points x∗:

∇f (x∗) = 0

Gradient: Should converge to zero

‖∇f(x_k)‖ < tol

Iterates: Distance between x_k and x_{k+1} should converge to zero

‖x_k − x_{k+1}‖ < tol

Function Values: Difference between f(x_k) and f(x_{k+1}) should converge to zero

|f(x_k) − f(x_{k+1})| < tol

Number of iterations: Terminate after maxiter iterations
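Combined into one helper (an illustrative sketch; the function name and arguments are made up for this example):

import numpy as np

def converged(f, x_old, x_new, g_new, tol=1e-8):
    """Combine the standard termination criteria for the current iterate."""
    return (np.linalg.norm(g_new) < tol                 # small gradient
            or np.linalg.norm(x_new - x_old) < tol      # iterates stagnate
            or abs(f(x_new) - f(x_old)) < tol)          # function values stagnate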



Descent direction

Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:

π/2 = 90° < α < 270° = 3π/2

Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:

〈a, b〉 = a^T b = ‖a‖‖b‖ cos α(a, b)

d is a descent direction if and only if:

d^T ∇f(x) < 0
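In code, the algebraic test is a single sign check on the inner product (illustrative sketch):

import numpy as np

def is_descent_direction(grad_x, d):
    """d is a descent direction at x iff d^T grad f(x) < 0."""
    return float(np.dot(d, grad_x)) < 0.0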



Stepsize

Armijo's rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.

For ℓ = 0, 1, 2, . . . test the Armijo condition:

f(p + σ^ℓ d) < f(p) + ρ σ^ℓ d^T ∇f(p)

Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize

t = σ^ℓ

Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, which makes it only semi-efficient.

Technical detail: Widening
Testing whether some t > 1 satisfies the Armijo condition, i.e. checking ℓ = −1, −2, . . . as well, ensures efficiency.
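A sketch of Armijo backtracking including the widening step (illustrative; the parameter names follow the slide, the helper name armijo_stepsize is made up):

import numpy as np

def armijo_stepsize(f, grad, p, d, sigma=0.5, rho=1e-2,
                    max_widen=30, max_backtrack=60):
    """Return t = sigma**l satisfying f(p + t*d) < f(p) + rho*t*d^T grad f(p)."""
    fp = f(p)
    slope = rho * float(np.dot(d, grad(p)))   # rho * d^T grad f(p), negative for a descent direction
    def ok(t):
        return f(p + t * d) < fp + t * slope
    # Widening: try l = -1, -2, ... (stepsizes > 1) as long as the condition still holds
    l = 0
    while l > -max_widen and ok(sigma ** (l - 1)):
        l -= 1
    if l < 0:
        return sigma ** l
    # Standard backtracking: l = 0, 1, 2, ...
    while l < max_backtrack and not ok(sigma ** l):
        l += 1
    return sigma ** l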


3 Concrete Optimization Methods

Gradient Descent

Descent direction
Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Motivation:
  - corresponds to α = 180° = π
  - obvious choice, always a descent direction, no test needed
  - guarantees the quickest win locally
  - works with inexact line search, e.g. Armijo's rule
  - works for functions f ∈ C1
  - always solves the auxiliary optimization problem

min s^T ∇f(x), s ∈ R^n, ‖s‖ = 1
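Putting the pieces together (illustrative sketch; it reuses the armijo_stepsize helper sketched above):

import numpy as np

def gradient_descent(f, grad, x0, tol=1e-6, maxiter=10_000):
    """Steepest descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                  # steepest descent direction
        t = armijo_stepsize(f, grad, x, d)      # inexact line search
        x = x + t * d
    return x

# Example: minimize f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(f, grad, [3.0, -2.0]))   # converges to approximately [0, 0]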


Conjugate Gradient

Motivation: Quadratic Model Problem, minimize

f(x) = ‖Ax − b‖^2

Optimality condition:

∇f(x∗) = 2 A^T (Ax∗ − b) = 0

Obvious approach: Solve the system of linear equations

A^T A x = A^T b

Descent direction
Consecutive directions d_i, . . . , d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = A^T A symmetric positive definite:

d_i^T M d_j = 0, i ≠ j
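For the quadratic model problem this leads to the classical linear CG iteration; a minimal dense NumPy sketch (illustrative, no preconditioning):

import numpy as np

def conjugate_gradient(M, rhs, x0, tol=1e-10, maxiter=None):
    """Solve M x = rhs for symmetric positive definite M (here M = A^T A, rhs = A^T b)."""
    x = np.asarray(x0, dtype=float)
    r = rhs - M @ x            # residual = negative gradient of the quadratic
    d = r.copy()               # first search direction
    maxiter = maxiter or len(rhs)
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Md = M @ d
        t = (r @ r) / (d @ Md)          # exact stepsize along d
        x = x + t * d
        r_new = r - t * Md
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # M-conjugate to the previous directions
        r = r_new
    return x

# Usage for the least-squares problem min ‖Ax − b‖^2:
# x = conjugate_gradient(A.T @ A, A.T @ b, np.zeros(A.shape[1]))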


Nonlinear Conjugate Gradient

Initial Steps:
  - start at point x_0 with d_0 = −∇f(x_0)
  - perform exact line search, find

    t_0 = argmin f(x_0 + t d_0), t > 0

  - set x_1 = x_0 + t_0 d_0

Iteration:
  - set Δ_k = −∇f(x_k)
  - compute β_k via one of the available formulas (next slide)
  - update the conjugate search direction d_k = Δ_k + β_k d_{k−1}
  - perform exact line search, find

    t_k = argmin f(x_k + t d_k), t > 0

  - set x_{k+1} = x_k + t_k d_k
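An illustrative sketch of this iteration; the exact line search is approximated numerically with scipy.optimize.minimize_scalar, and beta_fn stands for one of the formulas on the next slide:

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, beta_fn, tol=1e-6, maxiter=1000):
    """Nonlinear CG with a pluggable beta formula."""
    x = np.asarray(x0, dtype=float)
    delta = -grad(x)           # Delta_k = -grad f(x_k)
    d = delta.copy()           # d_0 = -grad f(x_0)
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # line search: t_k = argmin_{t>0} f(x_k + t d_k)
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, 1e3), method="bounded").x
        x = x + t * d
        delta_new = -grad(x)
        beta = beta_fn(delta_new, delta, d)
        d = delta_new + beta * d
        delta = delta_new
    return x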


Nonlinear Conjugate Gradient

Formulas for β_k, with s_{k−1} the previous search direction:

Fletcher-Reeves:

β^FR_k = (Δ_k^T Δ_k) / (Δ_{k−1}^T Δ_{k−1})

Polak-Ribière:

β^PR_k = (Δ_k^T (Δ_k − Δ_{k−1})) / (Δ_{k−1}^T Δ_{k−1})

Hestenes-Stiefel:

β^HS_k = − (Δ_k^T (Δ_k − Δ_{k−1})) / (s_{k−1}^T (Δ_k − Δ_{k−1}))

Dai-Yuan:

β^DY_k = − (Δ_k^T Δ_k) / (s_{k−1}^T (Δ_k − Δ_{k−1}))

Reasonable choice with automatic direction reset:

β = max{0, β^PR_k}
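The same formulas as small functions (illustrative; arguments are Δ_k, Δ_{k−1} and the previous search direction s_{k−1}), matching the beta_fn slot in the nonlinear CG sketch above:

import numpy as np

def beta_fletcher_reeves(delta_k, delta_prev, s_prev):
    return (delta_k @ delta_k) / (delta_prev @ delta_prev)

def beta_polak_ribiere(delta_k, delta_prev, s_prev):
    return (delta_k @ (delta_k - delta_prev)) / (delta_prev @ delta_prev)

def beta_hestenes_stiefel(delta_k, delta_prev, s_prev):
    return -(delta_k @ (delta_k - delta_prev)) / (s_prev @ (delta_k - delta_prev))

def beta_dai_yuan(delta_k, delta_prev, s_prev):
    return -(delta_k @ delta_k) / (s_prev @ (delta_k - delta_prev))

def beta_pr_plus(delta_k, delta_prev, s_prev):
    # automatic direction reset: beta = max{0, beta^PR}
    return max(0.0, beta_polak_ribiere(delta_k, delta_prev, s_prev))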
