Iterative Reweighted Least Squares
Sargur N. Srihari
University at Buffalo, State University of New York, USA
Topics in Linear Classification using Probabilistic Discriminative Models
• Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions
Topics in IRLS
• Linear and Logistic Regression
• Newton-Raphson Method
• Hessian
• IRLS for Linear Regression
• IRLS for Logistic Regression
• Numerical Example
• Hessian in Deep Learning
Recall Linear Regression
• Assuming Gaussian noise: p(t|x,w,β) = N(t | y(x,w), β⁻¹)
• And a linear model:
  y(x,w) = Σ_{j=0}^{M−1} wⱼφⱼ(x) = wᵀφ(x)
• The likelihood has the form
  p(t | X,w,β) = Π_{n=1}^{N} N(tₙ | wᵀφ(xₙ), β⁻¹)
• The log-likelihood becomes
  ln p(t | w,β) = Σ_{n=1}^{N} ln N(tₙ | wᵀφ(xₙ), β⁻¹) = (N/2) ln β − (N/2) ln 2π − β E_D(w)
  where E_D(w) = ½ Σ_{n=1}^{N} {tₙ − wᵀφ(xₙ)}²
• which is quadratic in w
• The derivative has the form
  ∇ ln p(t | w,β) = β Σ_{n=1}^{N} {tₙ − wᵀφ(xₙ)} φ(xₙ)ᵀ
• Setting it equal to zero and solving, we get
  w_ML = (ΦᵀΦ)⁻¹ Φᵀ t
  where Φ is the N × M design matrix with elements Φₙⱼ = φⱼ(xₙ):
  Φ = ⎡ φ₀(x₁)  …  φ_{M−1}(x₁) ⎤
      ⎢    ⋮            ⋮       ⎥
      ⎣ φ₀(x_N) …  φ_{M−1}(x_N) ⎦
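To make the closed-form solution concrete, here is a minimal numpy sketch; the polynomial basis functions and the synthetic data are illustrative assumptions, not from the slides:

```python
import numpy as np

# Synthetic data: noisy samples of a line (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.5 * x - 0.5 + rng.normal(scale=0.1, size=50)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j
M = 3
Phi = np.stack([x**j for j in range(M)], axis=1)   # shape (N, M)

# Maximum-likelihood solution w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)   # close to [-0.5, 1.5, 0]
```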
Linear and Logistic Regression
• In linear regression there is a closed-form maximum likelihood solution for determining w
  – on the assumption of a Gaussian noise model
  – due to the quadratic dependence of the log-likelihood on w
• For logistic regression there is no closed-form maximum likelihood solution
  – due to the nonlinearity of the logistic sigmoid
• But the departure from a quadratic form is not substantial
  – The error function is convex, i.e., it has a unique minimum
Logistic regression (cross-entropy error):
  E(w) = −ln p(t|w) = −Σ_{n=1}^{N} {tₙ ln yₙ + (1 − tₙ) ln(1 − yₙ)},  where yₙ = σ(wᵀɸₙ)
Linear regression (sum-of-squares error):
  E_D(w) = ½ Σ_{n=1}^{N} {tₙ − wᵀφ(xₙ)}²,  with closed-form solution w_ML = (ΦᵀΦ)⁻¹ Φᵀ t
What is IRLS?
• An iterative method to find the solution w*
  – for linear regression and logistic regression
  – assuming a least squares objective
• While simple gradient descent has the form
  w(new) = w(old) − η∇E(w)
• IRLS uses the second derivative and has the form
  w(new) = w(old) − H⁻¹∇E(w)
• It is derived from the Newton-Raphson method
  – where ∇E(w) ≡ [∂E/∂w₀, ∂E/∂w₁, …, ∂E/∂wₙ] is the gradient, and H is the Hessian matrix whose elements are the second derivatives of E(w) with respect to w:
  H = ∇∇E(w), with elements Hᵢⱼ = ∂²E/∂wᵢ∂wⱼ, e.g., for two parameters
  H = ⎡ ∂²E/∂w₁²     ∂²E/∂w₁∂w₂ ⎤
      ⎣ ∂²E/∂w₂∂w₁   ∂²E/∂w₂²   ⎦
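A minimal sketch contrasting the two updates on a quadratic error E(w) = ½wᵀAw − bᵀw, where ∇E(w) = Aw − b and H = A (the matrix A, vector b, and learning rate are illustrative assumptions):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of the quadratic (positive definite)
b = np.array([1.0, -1.0])

def grad_E(w):
    return A @ w - b                     # gradient of E(w) = 0.5 w^T A w - b^T w

w_gd, w_nr, eta = np.zeros(2), np.zeros(2), 0.1
for _ in range(5):
    w_gd = w_gd - eta * grad_E(w_gd)                 # simple gradient descent
    w_nr = w_nr - np.linalg.solve(A, grad_E(w_nr))   # Newton-Raphson step: H^{-1} grad

print(w_gd)                            # approaches the minimum gradually
print(w_nr, np.linalg.solve(A, b))     # Newton lands on A^{-1} b after one step
```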
Why second derivative?
• The derivative of a function E(w) at w is the slope of the tangent at w
• Newton-Raphson is an iterative method to find the zeros of a function
  – It repeatedly computes the derivative
• We need to find the value of w at which ∇E(w) = 0
  – So we need to repeatedly compute the derivative of ∇E(w), i.e., the second derivative
[Figure: Newton-Raphson solving f(x) = 0, showing the current solution stepping toward the critical point. Since we are solving for the derivative of E(w), we need its second derivative.]
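A 1-D sketch of the Newton-Raphson iteration for solving f(x) = 0; when f is itself the derivative E′(w), the update divides by E″(w), which is where the second derivative enters (the example function is an illustrative assumption):

```python
def newton_raphson(f, f_prime, x0, n_iter=10):
    """Repeatedly step along the tangent line to its zero crossing."""
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / f_prime(x)
    return x

# Example: root of f(x) = x**2 - 2, i.e., sqrt(2)
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)   # ~1.4142135
```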
Higher order derivatives example (1-D)
• Illustration of 1st, 2nd and 3rd order derivatives
  – Derivatives of a Gaussian p(x) ~ N(0, σ)
[Figure: three panels showing the first order, second order, and third order derivatives of the Gaussian density]
Second derivative measures curvature
• It is the derivative of a derivative
• Quadratic functions with different curvatures illustrate its effect
[Figure: three quadratics; the dashed line is the value of the cost function predicted by the gradient alone. Negative curvature: the decrease is faster than predicted by gradient descent. Zero curvature: the gradient predicts the decrease correctly. Positive curvature: the decrease is slower than expected, and the cost may actually increase.]
Beyond Gradient: Jacobian and Hessian matrices
• Sometimes we need to find all the derivatives of a function whose input and output are both vectors
• If we have a function f: ℝᵐ → ℝⁿ
  – Then the matrix of partial derivatives is known as the Jacobian matrix J, defined as
  Jᵢⱼ = ∂f(x)ᵢ / ∂xⱼ
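A central-difference sketch of this definition (the test function and step size are illustrative assumptions):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate J[i, j] = d f(x)_i / d x_j by central differences."""
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

# f(x) = [x0*x1, sin(x0)] has Jacobian [[x1, x0], [cos(x0), 0]]
x = np.array([1.0, 2.0])
print(jacobian(lambda v: np.array([v[0] * v[1], np.sin(v[0])]), x))
```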
Second derivative
• It is the derivative of a derivative
• For a function f: ℝⁿ → ℝ, the derivative with respect to xᵢ of the derivative of f with respect to xⱼ is denoted ∂²f / ∂xᵢ∂xⱼ
• In a single dimension we can denote ∂²f/∂x² by f″(x)
• It tells us how the first derivative will change as we vary the input
• This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone
Hessian
• The second derivative with many dimensions
• H(f)(x) is defined as
  H(f)(x)ᵢⱼ = ∂²f(x) / ∂xᵢ∂xⱼ
• The Hessian is the Jacobian of the gradient
• The Hessian matrix is symmetric, i.e., Hᵢⱼ = Hⱼᵢ
  – anywhere that the second partial derivatives are continuous
  – So the Hessian matrix can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors
• Eigenvalues of H are useful for determining the learning rate, as seen in the next two slides
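A matching sketch of the Hessian by central differences, with a check of the symmetry property just described (the test function is an illustrative assumption):

```python
import numpy as np

def hessian(f, x, eps=1e-5):
    """Approximate H[i, j] = d^2 f / (dx_i dx_j) by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

f = lambda v: v[0]**2 * v[1] + v[1]**3       # smooth, so partials are continuous
H = hessian(f, np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-4))        # True: Hessian is symmetric
```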
Role of eigenvalues of Hessian
• The second derivative in direction d is dᵀHd
  – If d is a (unit) eigenvector, the second derivative in that direction is given by the corresponding eigenvalue
  – For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at smaller angles to d receiving larger weight)
• The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative
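A short check of these facts with numpy (the symmetric matrix is an illustrative assumption):

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])      # symmetric "Hessian" (illustrative)
eigvals, eigvecs = np.linalg.eigh(H)

d = eigvecs[:, 0]                           # unit eigenvector
print(np.isclose(d @ H @ d, eigvals[0]))    # True: equals its eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2)       # arbitrary unit direction
val = d @ H @ d
print(eigvals[0] <= val <= eigvals[-1])     # True: bounded by min/max eigenvalues
```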
Learning rate from Hessian
• Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀg + ½(x − x⁽⁰⁾)ᵀH(x − x⁽⁰⁾)
  where g is the gradient and H is the Hessian at x⁽⁰⁾
• If we use learning rate ε, the new point x is given by x⁽⁰⁾ − εg. Thus we get
  f(x⁽⁰⁾ − εg) ≈ f(x⁽⁰⁾) − εgᵀg + ½ε²gᵀHg
• There are three terms:
  – the original value of f,
  – the expected improvement due to the slope, and
  – the correction to be applied due to curvature
• Solving for the step size that minimizes this approximation (when gᵀHg > 0) gives
  ε* = gᵀg / (gᵀHg)
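A sketch of this step size on a simple quadratic, where the Taylor expansion is exact, comparing ε* against an arbitrary small fixed rate (the quadratic and the fixed rate are illustrative assumptions):

```python
import numpy as np

A = np.array([[5.0, 0.0], [0.0, 1.0]])          # Hessian of f(x) = 0.5 x^T A x
f = lambda x: 0.5 * x @ A @ x
x0 = np.array([1.0, 1.0])

g = A @ x0                                      # gradient at x0
eps_star = (g @ g) / (g @ A @ g)                # optimal step size from the Hessian
print(eps_star)                                 # ~0.206
print(f(x0 - eps_star * g) < f(x0 - 0.05 * g))  # True: beats the fixed rate
```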
Another 2nd Derivative Method
• Using the Taylor series of f(x) around the current point x⁽⁰⁾:
  f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ∇ₓf(x⁽⁰⁾) + ½(x − x⁽⁰⁾)ᵀH(f)(x⁽⁰⁾)(x − x⁽⁰⁾)
  – Solving for the critical point of this function gives
  x* = x⁽⁰⁾ − H(f)(x⁽⁰⁾)⁻¹ ∇ₓf(x⁽⁰⁾)
  – When f is a quadratic (positive definite) function, use this solution to jump to the minimum directly
  – When f is not quadratic, apply the solution iteratively
• This can reach a critical point much faster than gradient descent
  – But it is useful only when the nearby critical point is a minimum
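A 1-D sketch of applying this step iteratively to a non-quadratic function; f(x) = x² + eˣ is an illustrative assumption, chosen because its second derivative is positive everywhere, so the critical point found is a minimum:

```python
import numpy as np

x = 2.0                              # starting point
for _ in range(6):
    grad = 2 * x + np.exp(x)         # f'(x) for f(x) = x**2 + exp(x)
    hess = 2 + np.exp(x)             # f''(x) > 0 everywhere
    x = x - grad / hess              # Newton step toward the critical point
print(x)                             # ~ -0.3517, where f'(x) = 0
```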
Two applications of IRLS
• IRLS is applicable to both Linear Regression and Logistic Regression
• We discuss both; for each we need:
  1. Model
     – Linear regression: y(x,w) = Σ_{j=0}^{M−1} wⱼφⱼ(x) = wᵀφ(x)
     – Logistic regression: p(C₁|ϕ) = y(ϕ) = σ(wᵀϕ)
  2. Objective function E(w)
     – Linear regression (sum of squared errors): E(w) = ½ Σ_{n=1}^{N} {tₙ − wᵀφ(xₙ)}²
     – Logistic regression (Bernoulli likelihood): p(t|w) = Π_{n=1}^{N} yₙ^{tₙ} {1 − yₙ}^{1−tₙ}
  3. Gradient ∇E(w)
  4. Hessian H = ∇∇E(w)
  5. Newton-Raphson update: w(new) = w(old) − H⁻¹∇E(w)
IRLS for Linear Regression
1. Model (for data set X = {xₙ, tₙ}, n = 1,…,N):
   y(x,w) = Σ_{j=0}^{M−1} wⱼφⱼ(x) = wᵀφ(x)
2. Error function (sum of squares):
   E(w) = ½ Σ_{n=1}^{N} {tₙ − wᵀφ(xₙ)}²
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (wᵀφₙ − tₙ)φₙ = ΦᵀΦw − Φᵀt
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} φₙφₙᵀ = ΦᵀΦ
   where Φ is the N × M design matrix whose nth row is given by φₙᵀ
5. Newton-Raphson update: w(new) = w(old) − H⁻¹∇E(w)
   Substituting H = ΦᵀΦ and ∇E(w) = ΦᵀΦw − Φᵀt:
   w(new) = w(old) − (ΦᵀΦ)⁻¹{ΦᵀΦw(old) − Φᵀt} = (ΦᵀΦ)⁻¹Φᵀt
   which is the standard least squares solution
   Since the Hessian is independent of w, Newton-Raphson gives the exact solution in a single step
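A numpy sketch confirming the single-step behavior: one Newton-Raphson update from an arbitrary starting point lands exactly on the least squares solution (the random design matrix and targets are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 4))                 # N x M design matrix (illustrative)
t = rng.normal(size=100)
w_old = rng.normal(size=4)                      # arbitrary starting point

grad = Phi.T @ Phi @ w_old - Phi.T @ t          # gradient of the sum-of-squares error
H = Phi.T @ Phi                                 # Hessian, independent of w
w_new = w_old - np.linalg.solve(H, grad)        # one Newton-Raphson step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]   # standard least squares solution
print(np.allclose(w_new, w_ls))                 # True: exact in a single step
```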
IRLS for Logistic Regression
1. Model: p(C₁|ϕ) = y(ϕ) = σ(wᵀϕ), for data set {ϕₙ, tₙ}, tₙ ∈ {0,1}, where ϕₙ = ϕ(xₙ) and yₙ = σ(wᵀϕₙ)
   Likelihood: p(t|w) = Π_{n=1}^{N} yₙ^{tₙ} {1 − yₙ}^{1−tₙ}
2. Objective function: negative log-likelihood, the cross-entropy error function:
   E(w) = −Σ_{n=1}^{N} {tₙ ln yₙ + (1 − tₙ) ln(1 − yₙ)}
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (yₙ − tₙ)φₙ = Φᵀ(y − t)
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} yₙ(1 − yₙ)φₙφₙᵀ = ΦᵀRΦ
   where Φ is the N × M design matrix whose nth row is given by ɸₙᵀ, and R is the N × N diagonal matrix with elements
   Rₙₙ = yₙ(1 − yₙ) = σ(wᵀϕₙ){1 − σ(wᵀϕₙ)}
   The Hessian is not constant: it depends on w through R. Since H is positive definite (i.e., uᵀHu > 0 for arbitrary u ≠ 0), the error function is a convex function of w and so has a unique minimum
5. Newton-Raphson update: w(new) = w(old) − H⁻¹∇E(w)
   Substituting H = ΦᵀRΦ and ∇E(w) = Φᵀ(y − t):
   w(new) = w(old) − (ΦᵀRΦ)⁻¹Φᵀ(y − t)
          = (ΦᵀRΦ)⁻¹{ΦᵀRΦw(old) − Φᵀ(y − t)}
          = (ΦᵀRΦ)⁻¹ΦᵀRz
   where z is the N-dimensional vector z = Φw(old) − R⁻¹(y − t)
   The update formula is a set of weighted least-squares normal equations. Since the Hessian depends on w, we apply the update iteratively, each time using the new weight vector to recompute R; hence the name iterative reweighted least squares
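A compact numpy sketch of the full IRLS loop for two-class logistic regression (the synthetic data, iteration count, and choice of basis functions are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Newton-Raphson updates w <- (Phi^T R Phi)^{-1} Phi^T R z."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1 - y)                          # diagonal elements of R
        z = Phi @ w - (y - t) / R                # z = Phi w - R^{-1}(y - t)
        # Solve the weighted least-squares normal equations
        w = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * z))
    return w

# Two overlapping Gaussian classes (illustrative)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])          # basis: bias plus raw inputs
print(irls_logistic(Phi, t))
```

In practice a small ridge term or damping is often added to keep ΦᵀRΦ well conditioned when the classes are nearly separable and Rₙₙ approaches zero.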
Numerical Example (2-class)
• Initial weight vector, gradient and Hessian
  [Figure: initial values of the weight vector, gradient, and Hessian]
• Final weight vector, gradient and Hessian
  [Figure: final values of the weight vector, gradient, and Hessian after convergence]
• Number of iterations: 10
• Error (initial and final): 15.0642, 1.0000e-09
Hessian in Deep Learning
• The second derivative can be used to determine whether a critical point is a local minimum, a local maximum, or a saddle point
  – At a critical point, f′(x) = 0
  – When f″(x) > 0, the first derivative f′(x) increases as we move to the right and decreases as we move left
    • We conclude that x is a local minimum
  – For a local maximum, f′(x) = 0 and f″(x) < 0
  – When f″(x) = 0 the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second derivative test
• In multiple dimensions, we need to examine the second derivatives in all dimensions
• Eigendecomposition generalizes the test
  – Test the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point
• When H is positive definite (all eigenvalues are positive) the point is a local minimum
• Similarly, negative definite (all eigenvalues negative) implies a local maximum
Saddle point
• Contains both positive and negative curvature
• Example: the function f(x) = x₁² − x₂²
  – Along axis x₁, the function curves upward: this axis is an eigenvector of H with a positive eigenvalue
  – Along x₂, the function curves downward: this direction is an eigenvector of H with a negative eigenvalue
  – At a saddle point, the Hessian has both positive and negative eigenvalues
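A short sketch of the eigenvalue test on the slide's example f(x) = x₁² − x₂², whose Hessian is the constant matrix diag(2, −2):

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of f(x) = x1**2 - x2**2
eigvals = np.linalg.eigvalsh(H)

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif (eigvals > 0).any() and (eigvals < 0).any():
    print("saddle point")                 # printed here: eigenvalues are -2 and +2
else:
    print("test inconclusive")
```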