
Page 1: Exploring Optimization in Vowpal Wabbit

Exploring Optimization in Vowpal Wabbit

-Shiladitya Sen

Page 2: Exploring Optimization in Vowpal Wabbit

Vowpal Wabbit

• Online
• Open Source
• Machine Learning Library

• Has achieved record-breaking speed by implementation of:
• Parallel Processing
• Caching
• Hashing, etc.

• A "true" library: offers a wide range of machine learning and optimization algorithms

Page 3: Exploring Optimization in Vowpal Wabbit

Machine Learning Models

• Linear Regressor ( --loss_function squared )
• Logistic Regressor ( --loss_function logistic )
• SVM ( --loss_function hinge )
• Neural Networks ( --nn <arg> )
• Matrix Factorization
• Latent Dirichlet Allocation ( --lda <arg> )
• Active Learning ( --active_learning )

Page 4: Exploring Optimization in Vowpal Wabbit

Regularization

• L1 Regularization ( --l1 <arg> )

• L2 Regularization ( --l2 <arg> )

Page 5: Exploring Optimization in Vowpal Wabbit

Optimization Algorithms

• Online Gradient Descent ( default )

• Conjugate Gradient ( --conjugate_gradient )

• L-BFGS ( --bfgs )

Page 6: Exploring Optimization in Vowpal Wabbit

Optimization

The Convex Definition

Page 7: Exploring Optimization in Vowpal Wabbit

Convex Sets

Definition:

A subset $C$ of $\mathbb{R}^n$ is said to be a convex set if $(1-\theta)x + \theta y \in C$ whenever $x, y \in C$ and $\theta \in (0,1)$.

Page 8: Exploring Optimization in Vowpal Wabbit

Convex Functions:

A real-valued function $f : X \to \mathbb{R}$, where $X \subseteq \mathbb{R}^n$ is a convex set, is said to be a convex function if its epigraph, defined as $\{(x, \mu) \mid x \in X,\ \mu \ge f(x)\}$, is a convex set.

Page 9: Exploring Optimization in Vowpal Wabbit

It can be proved from the definition of convex functions that such a function has no strict local maxima.

In other words…

It has at most one minimum value, i.e. every local minimum is a global minimum.

Loss functions which are convex therefore make optimization for machine learning tractable.

Page 10: Exploring Optimization in Vowpal Wabbit

Optimization

Algorithm I : Online Gradient Descent

Page 11: Exploring Optimization in Vowpal Wabbit

What the batch implementation of Gradient Descent (GD) does

Page 12: Exploring Optimization in Vowpal Wabbit

How does the batch version of GD work?

• Expresses the total loss J as a function of a set of parameters x

• Calculates $\nabla J(x)$ as the direction of steepest ascent; so $-\nabla J(x)$ is the direction of steepest descent

• Takes a calculated step α in that direction to reach a new point, with new co-ordinate values of x

• This continues until the required tolerance is achieved

Algorithm:

$$x_{t+1} = x_t - \alpha_t \nabla J(x_t)$$
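To make the update concrete, here is a minimal NumPy sketch of batch gradient descent for squared loss on a linear model. It is an illustration only, not VW's implementation; the data, the step size alpha and the tolerance tol are made-up placeholders.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, tol=1e-6, max_iters=1000):
    """Minimize J(x) = (1/2m) * ||X x - y||^2 with full-batch gradient steps."""
    m, n = X.shape
    x = np.zeros(n)                       # parameter vector
    for _ in range(max_iters):
        grad = X.T @ (X @ x - y) / m      # gradient of J at x (steepest ascent)
        if np.linalg.norm(grad) < tol:    # stop once the required tolerance is reached
            break
        x -= alpha * grad                 # step of size alpha along -grad (steepest descent)
    return x

# Tiny made-up example: fit y = 2*x1 + 1
X = np.c_[np.linspace(0, 1, 20), np.ones(20)]
y = 2 * X[:, 0] + 1
print(batch_gradient_descent(X, y))       # -> approximately [2., 1.]
```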

Page 13: Exploring Optimization in Vowpal Wabbit

What is the online implementation of GD?

Page 14: Exploring Optimization in Vowpal Wabbit

How does online GD work?

1. Takes a point $p_t$ from the dataset
2. Using the existing hypothesis, predicts a value for $p_t$
3. True value is revealed
4. Calculates error J as a function of the parameters x for that point
5. Evaluates $\nabla J(x_t)$
6. Takes a step in the direction of steepest descent: $-\nabla J(x_t)$
7. Updates the parameters as: $x_{t+1} = x_t - \eta_t \nabla J(x_t)$
8. Moves on to the next point $p_{t+1}$
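A toy NumPy sketch of this online loop for squared loss on a linear model follows. The stream of (features, label) pairs and the fixed step size eta are placeholders; VW's actual update is more elaborate (see the learning-rate schedule later).

```python
import numpy as np

def online_gradient_descent(stream, n_features, eta=0.1):
    """One pass of online GD for squared loss on a linear model y_hat = features . w."""
    w = np.zeros(n_features)
    for features, true_value in stream:          # 1. take the next point
        prediction = features @ w                # 2. predict with the current hypothesis
        error = prediction - true_value          # 3. true value revealed, 4. per-point error
        grad = error * features                  # 5. gradient of J_t(w) for this point alone
        w -= eta * grad                          # 6.-7. step in the steepest-descent direction
    return w                                     # 8. (loop) move on to the next point

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(online_gradient_descent(zip(X, y), n_features=3))   # -> approximately [1., -2., 0.5]
```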

Page 15: Exploring Optimization in Vowpal Wabbit

Looking Deeper into Online GD

• Essentially calculates the error function J(x) independently for each point, as opposed to calculating J(x) as the sum of all per-point errors, as in the batch (offline) implementation of GD

• To achieve accuracy, Online GD takes multiple passes through the dataset

(Continued…)

Page 16: Exploring Optimization in Vowpal Wabbit

Still deeper…

• So that convergence is reached, the step size η is reduced in each pass. In VW, this is implemented as:

$$\eta = \lambda \, d^{\,e} \left( \frac{i}{i + n} \right)^{p}$$

where e is the pass (epoch) number and n is the (weighted) count of examples seen so far, with:

• λ : -l [--learning_rate] (default 10)
• p : --power_t (default 0.5)
• i : --initial_t (default 1)
• d : --decay_learning_rate (default 1)

• Cache file used for multiple passes (-c)
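A small Python sketch of this kind of decayed schedule, using the symbols and defaults from the slide. Treat it as an illustration of the shape of the schedule, not a line-for-line copy of VW's internals; the function name and counters are ours.

```python
def learning_rate(example_count, pass_number,
                  lam=10.0,    # -l  [--learning_rate], slide default 10
                  p=0.5,       # --power_t,             slide default 0.5
                  i=1.0,       # --initial_t,           slide default 1
                  d=1.0):      # --decay_learning_rate, slide default 1
    """Decayed step size: shrinks with examples seen and, if d < 1, with each extra pass."""
    return lam * (d ** pass_number) * (i / (i + example_count)) ** p

# The step size decays as more examples are seen within a pass:
for t in (0, 10, 100, 1000):
    print(t, round(learning_rate(t, pass_number=0), 4))
```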

Page 17: Exploring Optimization in Vowpal Wabbit

So why Online GD?

• It takes less space…

• And my system needs its space!

Page 18: Exploring Optimization in Vowpal Wabbit

Optimization

Algorithm II: Method of Conjugate Gradients

Page 19: Exploring Optimization in Vowpal Wabbit

What is wrong with Gradient Descent?

• Often takes steps in the same direction
• Convergence issues

Page 20: Exploring Optimization in Vowpal Wabbit

Convergence Problems:

Page 21: Exploring Optimization in Vowpal Wabbit

The need for Conjugate Gradients:

Wouldn't it be wonderful if we did not need to take steps in the same direction to minimize the error in that direction?

This is where Conjugate Gradient comes in…

Page 22: Exploring Optimization in Vowpal Wabbit

Method of Orthogonal Directions

• In an (n+1) dimensional vector space where J is defined with n parameters, at most n linearly independent directions for parameters exist

• Error function may have a component in at most n linearly independent (orthogonal) directions

• Intended: A step in each of these directions i.e. at most n steps to minimize the error

• Not solvable for orthogonal directions: the step size along each orthogonal direction depends on the error term, which would require already knowing the solution

Page 23: Exploring Optimization in Vowpal Wabbit

Conjugate Directions:

$d_i, d_j$ are search directions.

Orthogonal: $d_i^{T} d_j = 0$

Conjugate (with respect to A): $d_i^{T} A\, d_j = 0$
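A quick numeric illustration of the two conditions, with a made-up symmetric positive-definite matrix A and hand-picked directions:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive-definite

d0 = np.array([1.0, 0.0])
d1 = np.array([1.0, -3.0])          # chosen so that d0^T A d1 = 0

print(d0 @ d1)        # -> 1.0 : d0, d1 are NOT orthogonal
print(d0 @ A @ d1)    # -> 0.0 : but they ARE conjugate with respect to A
```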

Page 24: Exploring Optimization in Vowpal Wabbit

How do we get the conjugate directions?

• We first choose n mutually orthogonal directions: $u_1, u_2, \ldots, u_n$

• We calculate $d_i$ as:

$$d_i = u_i - \sum_{k=1}^{i-1} \frac{u_i^{T} A\, d_k}{d_k^{T} A\, d_k}\, d_k$$

• This subtracts out any components of $u_i$ which are not A-orthogonal to $d_1, \ldots, d_{i-1}$, in order to calculate $d_i$.
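A short sketch of this Gram-Schmidt conjugation step in NumPy; the matrix A and the starting directions U are made up for illustration.

```python
import numpy as np

def conjugate_directions(A, U):
    """Gram-Schmidt conjugation: turn directions u_1..u_n (rows of U)
    into mutually A-conjugate directions d_1..d_n."""
    D = []
    for u in U:
        d = u.copy()
        for dk in D:
            # subtract the component of u that is not A-orthogonal to dk
            d -= (u @ A @ dk) / (dk @ A @ dk) * dk
        D.append(d)
    return np.array(D)

A = np.array([[3.0, 1.0], [1.0, 2.0]])
U = np.eye(2)                       # start from the orthogonal coordinate axes
D = conjugate_directions(A, U)
print(D[0] @ A @ D[1])              # -> 0.0 (the new directions are A-conjugate)
```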

Page 25: Exploring Optimization in Vowpal Wabbit

So what is Method of Conjugate Gradients?

• If we set $u_i$ to $r_i$, the gradient (residual) at the i-th step, we have the Method of Conjugate Gradients.

• $r_i, r_j$ are linearly independent for $i \neq j$.

• The step size in the direction $d_i$ is found by an exact line search.

Page 26: Exploring Optimization in Vowpal Wabbit

The Algorithm for Conjugate Gradient:
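The slide's algorithm figure does not survive in this transcript. As a reference point, here is a minimal sketch of the standard (linear) conjugate gradient iteration for solving $Ax = b$ with A symmetric positive-definite; the test problem is made up.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iters=None):
    """Standard CG for A x = b, with A symmetric positive-definite."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                        # residual = negative gradient of 1/2 x^T A x - b^T x
    d = r.copy()                         # first search direction = steepest descent
    for _ in range(max_iters or n):
        alpha = (r @ r) / (d @ A @ d)    # exact line search along d
        x += alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r) # makes the next direction A-conjugate to d
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([5.0, 5.0])
print(conjugate_gradient(A, b))          # -> [1., 2.], reached in at most 2 steps
```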

Page 27: Exploring Optimization in Vowpal Wabbit

Requirement for Preconditioning:

• Round-off errors lead to slight deviations from the conjugate directions

• As a result, Conjugate Gradient is implemented iteratively

• To minimize number of iterations, preconditioning is done on the vector space

Page 28: Exploring Optimization in Vowpal Wabbit

What is Pre-conditioning?

• The vector space is modified by multiplying by a matrix $M^{-1}$, where M is a symmetric, positive-definite matrix.

• This leads to a better clustering of the eigenvalues and a faster convergence.
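A rough illustration of the idea using a simple diagonal (Jacobi) preconditioner M = diag(A). This is a generic example, not VW's preconditioner; the matrix A is made up to be badly scaled.

```python
import numpy as np

# Jacobi preconditioning: M = diag(A), so M^{-1} is cheap to apply.
A = np.array([[100.0, 1.0],
              [  1.0, 1.0]])             # badly scaled -> widely spread eigenvalues
M_inv = np.diag(1.0 / np.diag(A))        # M^{-1}, with M symmetric positive-definite

print(np.linalg.cond(A))                 # roughly 101
print(np.linalg.cond(M_inv @ A))         # roughly 2.7 after preconditioning
```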

Page 29: Exploring Optimization in Vowpal Wabbit

Optimization

Algorithm III: L-BFGS

Page 30: Exploring Optimization in Vowpal Wabbit

Why think linearly?

Newton's Method proposes a step along a non-linear path, as opposed to the linear steps taken in GD and CG.

Leads to a faster convergence…

Page 31: Exploring Optimization in Vowpal Wabbit

Newton’s Method:

2nd order Taylor's series expansion of $J(x)$:

$$J(x + \Delta x) \approx J(x) + \nabla J(x)^{T} \Delta x + \tfrac{1}{2}\, \Delta x^{T}\, \nabla^{2} J(x)\, \Delta x$$

Minimizing with respect to $\Delta x$, we get:

$$\Delta x = -\,[\nabla^{2} J(x)]^{-1}\, \nabla J(x)$$

In iterative form:

$$x_{n+1} = x_n - [\nabla^{2} J(x_n)]^{-1}\, \nabla J(x_n)$$
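A minimal sketch of this iteration in NumPy; the convex objective, its gradient and its Hessian are made up for illustration.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iters=50):
    """Newton's method: x_{n+1} = x_n - [Hessian J(x_n)]^{-1} gradient J(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x -= np.linalg.solve(hess(x), g)   # solve H dx = g rather than forming H^{-1}
    return x

# Made-up smooth convex objective: J(x) = x1^4 + x2^2 - 2*x1 - 4*x2
grad = lambda x: np.array([4 * x[0]**3 - 2, 2 * x[1] - 4])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, x0=[1.0, 0.0]))   # -> [(1/2)**(1/3), 2.0]
```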

Page 32: Exploring Optimization in Vowpal Wabbit

What is this BFGS Algorithm?

• Complications in calculating $H^{-1} = [\nabla^{2} J(x)]^{-1}$ led to Quasi-Newton Methods, the most popular among which is BFGS.

• Named after Broyden-Fletcher-Goldfarb-Shanno

• Maintains an approximate matrix B and updates B upon each iteration

Page 33: Exploring Optimization in Vowpal Wabbit

BFGS Algorithm:
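The algorithm figure is not reproduced in this transcript. As a rough sketch, one BFGS iteration with the usual inverse-Hessian approximation H (so the step direction is $-H\nabla J$) looks as follows; the backtracking line search and the quadratic test objective are our own simplifications, not VW's code.

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-8, max_iters=100):
    """BFGS: maintain an approximation H to the inverse Hessian, updated each iteration."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                                 # inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                                # quasi-Newton search direction
        eta = 1.0                                 # simple backtracking line search
        while f(x + eta * p) > f(x) + 1e-4 * eta * (g @ p):
            eta *= 0.5
        x_new = x + eta * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)
        I = np.eye(n)
        # the BFGS update of the inverse-Hessian approximation
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Made-up quadratic test problem: J(x) = 1/2 x^T A x - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([5.0, 5.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
print(bfgs_minimize(f, lambda x: A @ x - b, x0=[0.0, 0.0]))   # -> [1., 2.]
```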

Page 34: Exploring Optimization in Vowpal Wabbit

Memory is a limited asset

• In Vowpal Wabbit, the version of BFGS implemented is L-BFGS

• In L-BFGS, all the previous updates to B are not stored in memory

• At a particular iteration i, only the last m updates are stored and used to make the new update

• Also, the step size η in each step is calculated by an inexact line search following Wolfe's conditions.
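A compact sketch of the standard L-BFGS two-loop recursion, which computes the search direction from only the last m (s, y) pairs without ever storing a matrix. The Wolfe-condition line search is omitted for brevity, and the memory size m is arbitrary; this is a generic illustration, not VW's source.

```python
import numpy as np
from collections import deque

def lbfgs_direction(g, memory):
    """Two-loop recursion: apply the implicit inverse-Hessian built from the
    last m (s, y) pairs to the gradient g, without storing any matrix."""
    q = g.copy()
    alphas = []
    for s, y in reversed(memory):          # first loop: newest pair to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    if memory:                             # initial scaling H0 = gamma * I
        s, y = memory[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(memory, reversed(alphas)):   # second loop: oldest to newest
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return -q                              # quasi-Newton descent direction

# Usage sketch: keep only the last m updates, as L-BFGS does.
m = 5
memory = deque(maxlen=m)                   # pairs (s_k, y_k) from recent iterations
print(lbfgs_direction(np.array([1.0, 2.0]), memory))   # empty memory -> plain steepest descent
# Inside an optimization loop one would compute p = lbfgs_direction(grad(x), memory),
# pick a step eta by an inexact (Wolfe-condition) line search, and then append
# (x_new - x, grad(x_new) - grad(x)) to the memory.
```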