SB2b Statistical Machine Learning, Hilary Term 2017
Seth Flaxman
New website: http://www.stats.ox.ac.uk/~flaxman/course_ml.html

Announcements:
First problem sheet available! (due: Friday 27 Jan. at 5pm for Part B)
First problem class for MSc students: Thursday (26 Jan) at 3pm in LG.01
Recap from last time: Validation and Cross-Validation

Validation
For each combination of tuning parameters γ: train the model on the training set, fitting parameters θ = θ(γ) and obtaining the decision function f_θ(γ). Evaluate R_val(f_θ(γ)), the average loss on a validation set.
Pick the γ* with the best performance on the validation set.
Using γ*, train on both the training and validation sets to obtain the optimal θ*.
R_val(f_θ(γ*)) is now a biased estimate of R(f_θ(γ*)) and can be overly optimistic!
Evaluate the model with γ*, θ* on the test set, reporting generalization performance.
[Figure: the data is split into training, validation, and test sets; parameters θ are fit on the training set, model complexity is chosen on the validation set, and generalization performance is reported on the test set.]
Cross-Validation
[Figure: K-fold cross-validation; the data is divided into folds, the validation fold rotates among the training folds, and the test set is held out throughout.]
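A minimal R sketch of this recapped procedure (not from the slides): K-fold cross-validation is used to choose a tuning parameter, here the polynomial degree of a simple regression; the simulated data, the number of folds, and the squared-error loss are all arbitrary illustrative choices.

# K-fold cross-validation to choose a tuning parameter (illustrative only)
set.seed(0)
n <- 120
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
d <- data.frame(x, y)

K <- 5
folds <- sample(rep(1:K, length.out = n))     # random fold assignment
degrees <- 1:8                                # candidate tuning parameters (gamma)

cv_error <- sapply(degrees, function(deg) {
  errs <- sapply(1:K, function(k) {
    train <- d[folds != k, ]
    valid <- d[folds == k, ]
    fit <- lm(y ~ poly(x, deg), data = train)             # fit theta(gamma) on training folds
    mean((valid$y - predict(fit, newdata = valid))^2)     # validation loss
  })
  mean(errs)                                  # average over folds
})

best_degree <- degrees[which.min(cv_error)]
best_degree
# Finally, refit on all the training data with the chosen degree and assess on a
# held-out test set (not shown) to report generalization performance.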
Logistic regression
One of the most popular methods for classification.
A linear model for the log-odds of the class probabilities.
Dates back to work on population growth curves by Verhulst [1838, 1845, 1847].
Statistical use for classification dates to Cox [1960s].
Independently discovered as the perceptron in machine learning [Rosenblatt 1957].
Main example of "discriminative" as opposed to "generative" learning.
Naïve approach to classification: linear regression.
Statistical perspective: consider Y = {0, 1}. Generalised linear model with Bernoulli likelihood and logit link:
$$Y \mid X = x, a, b \sim \mathrm{Bernoulli}\big(s(a + b^\top x)\big), \qquad s(a + b^\top x) = \frac{1}{1 + \exp(-(a + b^\top x))}.$$
[Figure: the logistic (sigmoid) function s(z), increasing from 0 to 1 with s(0) = 0.5.]
ML perspective: a discriminative classifier. Consider binary classification with Y = {-1, +1}. Logistic regression uses a parametric model on the conditional Y | X, not the joint distribution of (X, Y):
$$p(Y = y \mid X = x; a, b) = \frac{1}{1 + \exp(-y(a + b^\top x))}.$$
Linearity of log-odds and logistic function
a + b^⊤x models the log-odds ratio:
$$\log \frac{p(Y = +1 \mid X = x; a, b)}{p(Y = -1 \mid X = x; a, b)} = a + b^\top x.$$
Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x; a, b) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x; a, b) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
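As a quick sanity check of these formulas, a minimal R sketch (not from the slides); the parameter values a, b and the point x are arbitrary illustrations.

# Conditional class probabilities for labels y in {-1, +1}
sigmoid <- function(z) 1 / (1 + exp(-z))

# p(Y = y | X = x) = s(y * (a + b'x))
class_prob <- function(x, y, a, b) sigmoid(y * (a + sum(b * x)))

a <- -0.5
b <- c(1.0, 2.0)
x <- c(0.3, -0.1)

p_plus  <- class_prob(x, +1, a, b)
p_minus <- class_prob(x, -1, a, b)
p_plus + p_minus   # probabilities sum to 1, since s(-z) = 1 - s(z)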
Fitting the parameters of the hyperplane
How to learn a and b?
Consider maximizing the conditional log-likelihood for Y = {-1, +1}:
$$\ell(a, b) = \sum_{i=1}^{n} \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)).$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$\hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} -\log s(y_i(a + b^\top x_i)) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i(a + b^\top x_i)))$$
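A small R sketch (not from the slides) of this log-loss empirical risk; the matrix X, labels y, and parameters a, b are illustrative placeholders, with y in {-1, +1}.

# (1/n) * sum_i log(1 + exp(-y_i (a + b'x_i)))
log_loss_risk <- function(a, b, X, y) {
  margins <- y * (a + as.vector(X %*% b))
  mean(log1p(exp(-margins)))
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- sample(c(-1, 1), 10, replace = TRUE)
log_loss_risk(0, c(0, 0), X, y)   # equals log(2) when a = 0, b = 0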
It is not possible to find the optimal a, b analytically. For simplicity, absorb a as an entry in b by appending a '1' to the x vector. Objective function:
$$\hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} -\log s(y_i x_i^\top b)$$
Properties of the logistic function:
$$s(-z) = 1 - s(z), \qquad \nabla_z s(z) = s(z)s(-z), \qquad \nabla_z \log s(z) = s(-z), \qquad \nabla_z^2 \log s(z) = -s(z)s(-z)$$
Differentiate with respect to b:
$$\nabla_b \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla_b^2 \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top \succeq 0.$$
The second derivative is positive-definite: the objective function is convex and there is a unique global minimum. Many different algorithms can find the optimal b, e.g.:
Gradient descent:
$$b^{\text{new}} = b + \varepsilon \frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b)\, y_i x_i$$
Stochastic gradient descent:
$$b^{\text{new}} = b + \varepsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where I(t) is a subset of the data at iteration t, and ε_t → 0 slowly (∑_t ε_t = ∞, ∑_t ε_t² < ∞).
Newton-Raphson:
$$b^{\text{new}} = b - \left(\nabla_b^2 \hat{R}_{\log}\right)^{-1} \nabla_b \hat{R}_{\log}$$
This is also called iteratively reweighted least squares.
Conjugate gradient, L-BFGS, and other methods from numerical analysis.
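A minimal R sketch (not from the slides) of the first and last of these updates, gradient descent and Newton-Raphson; the simulated data, step size eps, and iteration counts are arbitrary choices, and the final line compares against R's built-in glm fit.

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))   # first column absorbs the intercept a
b_true <- c(-0.5, 1, -2)
y <- ifelse(runif(n) < sigmoid(X %*% b_true), 1, -1)

# Gradient and Hessian of the log-loss empirical risk (formulas above)
grad <- function(b) -colMeans(as.vector(sigmoid(-y * (X %*% b)) * y) * X)
hess <- function(b) {
  w <- as.vector(sigmoid(y * (X %*% b)) * sigmoid(-y * (X %*% b)))
  crossprod(X * w, X) / n
}

# Gradient descent
b_gd <- rep(0, 3); eps <- 1
for (it in 1:2000) b_gd <- b_gd - eps * grad(b_gd)

# Newton-Raphson (iteratively reweighted least squares)
b_nr <- rep(0, 3)
for (it in 1:20) b_nr <- b_nr - solve(hess(b_nr), grad(b_nr))

cbind(b_gd, b_nr, glm_fit = coef(glm((y + 1) / 2 ~ X - 1, family = binomial)))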
Linearly separable data
Assume that the data is linearly separable, i.e. there is a scalar α and a vector β such that y_i(α + β^⊤x_i) > 0 for i = 1, ..., n. Let c > 0. The empirical risk for a = cα, b = cβ is
$$\hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-c\, y_i(\alpha + \beta^\top x_i)))$$
which can be made arbitrarily close to zero as c → ∞, i.e. the soft classification rule becomes ±∞ (overconfidence) → overfitting.
Multi-class logistic regression
Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities p(Y = k | X = x; θ), for K classes k = 1, ..., K, i.e.,
$$p(Y = k \mid X = x; \theta) = \frac{\exp(w_k^\top x + b_k)}{\sum_{\ell=1}^{K} \exp(w_\ell^\top x + b_\ell)}.$$
The parameters are θ = (b, W), where W = (w_kj) is a K × p matrix of weights and b ∈ R^K is a vector of bias terms.
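For concreteness, a minimal R sketch (not from the slides) of the softmax probabilities; K, p, and the parameter values W, b, x are arbitrary illustrations.

softmax_probs <- function(x, W, b) {
  scores <- as.vector(W %*% x + b)    # one score w_k'x + b_k per class
  scores <- scores - max(scores)      # subtract the max for numerical stability
  exp(scores) / sum(exp(scores))
}

K <- 3; p <- 2
W <- matrix(c(1, 0, -1, 0.5, 2, -0.5), nrow = K, ncol = p)
b <- c(0, 0.1, -0.1)
x <- c(0.4, -1.2)

probs <- softmax_probs(x, W, b)
probs
sum(probs)   # the K probabilities sum to 1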
Logistic Regression: Summary
Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
Diverging optimal parameters for linearly separable data: need to regularise / pull them towards zero.
A simple example of a generalised linear model (GLM), for which there is a well-established statistical theory:
Assessment of fit via deviance and plots.
Well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).
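A small R sketch (not from the slides) of where this GLM machinery lives in base R; the simulated data, variable names, and the particular tests shown are illustrative.

set.seed(2)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1))   # x2 is irrelevant by construction

fit <- glm(y ~ x1 + x2, family = binomial)

summary(fit)                      # Wald tests for individual coefficients
fit$deviance                      # residual deviance (assessment of fit)
drop1(fit, test = "Chisq")        # drop-in-deviance (likelihood ratio) tests
anova(glm(y ~ x1, family = binomial), fit, test = "Chisq")   # nested model comparison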
Overfitting and Regularization
Overfitting in Logistic Regression

dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 1, 0, 1, 1, 1, 1)               # labels are not perfectly separable in dx
lr <- glm(y ~ ., data = d, family = binomial)
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)
[Plot: fitted probabilities p against x, with the data points (dx, y) overlaid; the fitted curve is a smooth S-shape.]
Overfitting in Logistic Regression (linearly separable case)

dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1)               # labels now perfectly separated by the sign of dx
lr <- glm(y ~ ., data = d, family = binomial)   # perfect separation: glm typically warns that
                                                # fitted probabilities are numerically 0 or 1
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)
[Plot: fitted probabilities p against x for the separable data, with the data points overlaid; the fitted curve is nearly a step function.]
Overfitting in Linear Regression
http://www.stats.ox.ac.uk/~flaxman/sml17/overfitting.html
Regularization
Flexible models for high-dimensional problems require many parameters.
With many parameters, learners can easily overfit to the noise in the training data.
Regularization: limit the flexibility of the model to prevent overfitting.
Deep connections with the Bayesian perspective: prior distributions provide regularization.
Typically: add a term penalizing large values of the parameters θ, e.g. for logistic regression:
$$R_{\text{emp}}(\theta) + \lambda \|\theta\|_\rho^\rho = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i(a + b^\top x_i))) + \lambda \|b\|_\rho^\rho$$
where ρ ∈ [1, 2], and ‖z‖_ρ = (∑_{j=1}^{p} |z_j|^ρ)^{1/ρ} is the L_ρ norm of b (also of interest when ρ ∈ [0, 1), but then it is no longer a norm).
Also known as shrinkage methods: parameters are shrunk towards 0.
Typical cases are ρ = 2 (Euclidean norm, ridge regression) and ρ = 1 (LASSO). When ρ ≤ 1 it is called a sparsity-inducing regularization.
λ is a tuning parameter (or hyperparameter) that controls the amount of regularization, and hence the complexity of the resulting model.
Bach, Sparse Methods in Machine learning.
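As a pointer to practice (the glmnet tutorial is linked in the resources at the end), a hedged R sketch of fitting L1- and L2-regularized logistic regression with the glmnet package; the simulated data and the choice of λ are illustrative only, not from the slides.

library(glmnet)

set.seed(3)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -2, 1, rep(0, p - 3))            # only the first 3 features matter
y <- rbinom(n, 1, plogis(X %*% beta))

fit_ridge <- glmnet(X, y, family = "binomial", alpha = 0)   # alpha = 0: L2 penalty (ridge)
fit_lasso <- glmnet(X, y, family = "binomial", alpha = 1)   # alpha = 1: L1 penalty (LASSO)

# lambda is the tuning parameter; cross-validation (recapped above) chooses it
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv, s = "lambda.min")        # many coefficients are exactly zero (sparsity)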
[Figure: L_ρ regularization profile |z|^ρ for different values of ρ (.01, .10, .50, 1.0, 1.5, 2.0).]
[Figure: L1 and L2 norm contours.]
L2 regularization
Dates to Tikhonov [1943] (a more general framework).
Rediscovered as ridge regression [Hoerl and Kennard 1970].
Derivation of the ridge regression estimator [on board]; a sketch of the standard result is given below.
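The derivation itself is done on the board; as a reminder of where it lands, here is a brief sketch of the standard result (my reconstruction, not the board notes) for linear regression with design matrix X ∈ R^{n×p} and response y ∈ R^n:
$$\hat{b}_{\text{ridge}} = \arg\min_b \; \|y - Xb\|_2^2 + \lambda \|b\|_2^2.$$
Setting the gradient to zero, $-2X^\top(y - Xb) + 2\lambda b = 0$, gives
$$\hat{b}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$
which exists for every λ > 0 since $X^\top X + \lambda I$ is positive definite.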
Sparsity Inducing Regularization
Consider the constrained optimization problem
$$\min_\theta R_{\text{emp}}(\theta) \quad \text{s.t.} \quad \|\theta\|_1 \le \gamma$$
Introduce a Lagrange multiplier λ > 0 to enforce the constraint:
$$\min_\theta R_{\text{emp}}(\theta) + \lambda(\|\theta\|_1 - \gamma)$$
At the optimal value of λ, the parameter θ is the one minimizing the regularized empirical risk objective.
Conversely, given λ, there is a value of γ such that the corresponding optimal Lagrange multiplier is λ.
Generally: L1 regularization leads to optimal solutions with many zeros, i.e. the regression function depends only on the (small) number of features with non-zero parameters (illustrated in the sketch below).
[Figure: contours of R_emp(θ) together with the constraint region ‖θ‖₁ ≤ γ.]
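A small R sketch (not from the slides) comparing the sparsity of L1 and L2 solutions; it reuses glmnet on made-up data, and the single value of λ is an arbitrary choice.

library(glmnet)

set.seed(4)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

b_l1 <- coef(glmnet(X, y, family = "binomial", alpha = 1, lambda = 0.05))   # LASSO
b_l2 <- coef(glmnet(X, y, family = "binomial", alpha = 0, lambda = 0.05))   # ridge

sum(b_l1 != 0)   # few non-zero coefficients: L1 sets most of them to exactly zero
sum(b_l2 != 0)   # typically all p + 1 coefficients are non-zero under L2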
Illustration
[Figure omitted. Source: Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright.]
Bias/variance tradeoff
Why does regularization prevent overfitting? Regularization introduces bias with the goal of reducing variance:
[Figure: training and test prediction error against model complexity/flexibility; underfitting (high bias, low variance) on the left, overfitting (low bias, high variance) on the right, and "just right" in between, where test error is minimised.]
http://www.stats.ox.ac.uk/~flaxman/sml17/regularization.html
Extra
Useful Resources and Pointers
Compendium of common loss functions for classification and regression
Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright
glmnet (R package) tutorial
Matrix Cookbook