
Page 1:

SB2b Statistical Machine Learning, Hilary Term 2017
Seth Flaxman

New website:

http://www.stats.ox.ac.uk/~flaxman/course_ml.html

Announcements: First problem sheet available! (due: Friday 27 Jan. at 5pm for Part B)

First problem class for MSc students: Thursday (26 Jan) at 3pm in LG.01

Page 2: Recap from last time · Validation and Cross-Validation

Validation and Cross-Validation

Page 3: Recap from last time · Validation and Cross-Validation

Validation

For each combination of tuning parameters γ:
Train our model on the training set, fit parameters θ = θ(γ), obtaining the decision function f_{θ(γ)}.
Evaluate R_val(f_{θ(γ)}), the average loss on a validation set.

Pick γ* with the best performance on the validation set.
Using γ*, train on both the training and validation sets to obtain the optimal θ*.
R_val(f_{θ(γ*)}) is now a biased estimate of R(f_{θ(γ*)}) and can be overly optimistic!
Evaluate the model with γ*, θ* on the test set, reporting generalization performance.

[Diagram: the training set is used to fit θ, the validation set to choose the model complexity (tuning parameters), and the test set to report generalization performance.]
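To make the procedure concrete, here is a minimal R sketch (not from the slides) of validation-based selection of a tuning parameter; the simulated data and the use of polynomial degree as a stand-in for γ are illustrative assumptions.

## Illustrative sketch of the validation procedure: gamma = polynomial degree.
set.seed(1)
n <- 100
d <- data.frame(x = runif(n, -2, 2))
d$y <- sin(2 * d$x) + rnorm(n, sd = 0.3)
d$split <- sample(rep(c("train", "val", "test"), length.out = n))

val_risk <- sapply(1:8, function(gamma) {
  fit  <- lm(y ~ poly(x, gamma), data = d, subset = split == "train")
  pred <- predict(fit, newdata = d[d$split == "val", ])
  mean((d$y[d$split == "val"] - pred)^2)        # R_val: average loss on the validation set
})
gamma_star <- which.min(val_risk)               # gamma* with best validation performance

fit_star  <- lm(y ~ poly(x, gamma_star), data = d, subset = split != "test")  # refit on train + val
pred_test <- predict(fit_star, newdata = d[d$split == "test", ])
mean((d$y[d$split == "test"] - pred_test)^2)    # generalization performance on the test set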

Page 4: Recap from last time · Validation and Cross-Validation

Cross-Validation

[Diagram: K-fold cross-validation. The non-test data are split into folds; each fold in turn plays the role of the validation set while the remaining folds form the training set, with a separate test set held out throughout.]
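As a companion to the diagram, a minimal R sketch (not from the slides) of K-fold cross-validation with a held-out test set; the data, K = 5, and polynomial degree as the tuning parameter are illustrative assumptions.

set.seed(1)
n <- 100
d <- data.frame(x = runif(n, -2, 2))
d$y <- sin(2 * d$x) + rnorm(n, sd = 0.3)

test_idx <- sample(n, 20)                       # held-out test set
rest     <- setdiff(seq_len(n), test_idx)
fold     <- sample(rep(1:5, length.out = length(rest)))   # assign remaining data to 5 folds

cv_risk <- sapply(1:8, function(gamma) {        # gamma = polynomial degree
  mean(sapply(1:5, function(k) {
    tr  <- rest[fold != k]                      # training folds
    va  <- rest[fold == k]                      # validation fold (rotates with k)
    fit <- lm(y ~ poly(x, gamma), data = d[tr, ])
    mean((d$y[va] - predict(fit, newdata = d[va, ]))^2)
  }))
})
which.min(cv_risk)                              # tuning parameter chosen by cross-validation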

Page 5: Logistic regression · Logistic Regression

Logistic regression

One of the most popular methods for classification
Linear model on the probabilities
Dates back to work on population growth curves by Verhulst [1838, 1845, 1847]
Statistical use for classification dates to Cox [1960s]
Independently discovered as the perceptron in machine learning [Rosenblatt 1957]
Main example of “discriminative” as opposed to “generative” learning
Naïve approach to classification: linear regression

Page 6: Logistic regression · Logistic Regression

Logistic regression

Statistical perspective: consider Y = {0, 1}. Generalised linear model with Bernoulli likelihood and logit link:

Y | X = x, a, b ∼ Bernoulli(s(a + bᵀx)),   where s(a + bᵀx) = 1 / (1 + exp(−(a + bᵀx))).

[Plot: the logistic function s(z) for z from −8 to 8, rising from 0 to 1 with s(0) = 0.5.]

ML perspective: a discriminative classifier. Consider binary classification with Y = {−1, +1}. Logistic regression uses a parametric model on the conditional Y | X, not the joint distribution of (X, Y):

p(Y = y | X = x; a, b) = 1 / (1 + exp(−y(a + bᵀx))).
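A small R sketch of the logistic function s and the resulting conditional class probabilities for labels in {−1, +1}; the parameter values a, b and the input x are illustrative, not from the slides.

s <- function(z) 1 / (1 + exp(-z))              # logistic (sigmoid) function

a <- -0.5
b <- c(1.2, -0.8)
x <- c(0.3, 1.0)

p_plus  <- s(a + sum(b * x))                    # p(Y = +1 | X = x; a, b)
p_minus <- s(-(a + sum(b * x)))                 # p(Y = -1 | X = x; a, b)
c(p_plus, p_minus, p_plus + p_minus)            # the two probabilities sum to one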

Page 7: Logistic regression · Logistic Regression

Linearity of log-odds and logistic function

a + bᵀx models the log-odds ratio:

log [ p(Y = +1 | X = x; a, b) / p(Y = −1 | X = x; a, b) ] = a + bᵀx.

Solve explicitly for the conditional class probabilities:

p(Y = +1 | X = x; a, b) = 1 / (1 + exp(−(a + bᵀx))) =: s(a + bᵀx)
p(Y = −1 | X = x; a, b) = 1 / (1 + exp(+(a + bᵀx))) = s(−a − bᵀx)

Page 8: Logistic regression · Logistic Regression

Fitting the parameters of the hyperplane

How to learn a and b?

Consider maximizing the conditional log likelihood for Y = {−1,+1}:

ℓ(a, b) = ∑_{i=1}^n log p(Y = yᵢ | X = xᵢ) = ∑_{i=1}^n log s(yᵢ(a + bᵀxᵢ)).

Equivalent to minimizing the empirical risk associated with the log loss:

R̂_log(f_{a,b}) = (1/n) ∑_{i=1}^n −log s(yᵢ(a + bᵀxᵢ)) = (1/n) ∑_{i=1}^n log(1 + exp(−yᵢ(a + bᵀxᵢ)))
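A quick numerical check in R (with made-up data and parameters) that maximizing the conditional log-likelihood is the same as minimizing the empirical log loss, i.e. ℓ(a, b) = −n · R̂_log(f_{a,b}).

s <- function(z) 1 / (1 + exp(-z))

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)                # illustrative inputs
y <- sample(c(-1, 1), 10, replace = TRUE)       # illustrative labels in {-1, +1}
a <- 0.2; b <- c(0.5, -0.5)                     # illustrative parameters

m <- y * (a + X %*% b)                          # margins y_i (a + b' x_i)
loglik <- sum(log(s(m)))                        # conditional log-likelihood l(a, b)
risk   <- mean(log1p(exp(-m)))                  # empirical log loss R_hat_log
c(loglik, -nrow(X) * risk)                      # equal up to floating-point rounding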

Page 9: Logistic regression · Logistic Regression

Logistic Regression

Not possible to find the optimal a, b analytically. For simplicity, absorb a as an entry in b by appending a '1' to the x vector. Objective function:

R̂_log = (1/n) ∑_{i=1}^n −log s(yᵢ xᵢᵀ b)

Logistic Function

s(−z) = 1 − s(z)
∇_z s(z) = s(z) s(−z)
∇_z log s(z) = s(−z)
∇²_z log s(z) = −s(z) s(−z)

Differentiate wrt b:

∇_b R̂_log = (1/n) ∑_{i=1}^n −s(−yᵢ xᵢᵀ b) yᵢ xᵢ

∇²_b R̂_log = (1/n) ∑_{i=1}^n s(yᵢ xᵢᵀ b) s(−yᵢ xᵢᵀ b) xᵢ xᵢᵀ ⪰ 0.
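A quick finite-difference check in R (illustrative data, intercept absorbed into b) that the gradient formula above is correct; this is a sketch, not part of the original slides.

s    <- function(z) 1 / (1 + exp(-z))
risk <- function(b, X, y) mean(log1p(exp(-y * (X %*% b))))
grad <- function(b, X, y) -colMeans(as.vector(s(-y * (X %*% b)) * y) * X)

set.seed(1)
X <- cbind(1, matrix(rnorm(20), ncol = 2))      # column of 1s absorbs the intercept a
y <- sample(c(-1, 1), 10, replace = TRUE)
b <- rnorm(3)

h <- 1e-6
numeric_grad <- sapply(1:3, function(j) {
  e <- replace(rep(0, 3), j, h)
  (risk(b + e, X, y) - risk(b - e, X, y)) / (2 * h)
})
rbind(analytic = grad(b, X, y), numeric = numeric_grad)   # should agree very closely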

Page 10: Logistic regression · Logistic Regression

Logistic Regression

The second derivative is positive semi-definite (and typically positive definite): the objective function is convex, with a unique global minimum whenever the Hessian is positive definite. Many different algorithms can find the optimal b, e.g.:

Gradient descent:

b_new = b + ε (1/n) ∑_{i=1}^n s(−yᵢ xᵢᵀ b) yᵢ xᵢ

Stochastic gradient descent:

b_new = b + ε_t (1/|I(t)|) ∑_{i ∈ I(t)} s(−yᵢ xᵢᵀ b) yᵢ xᵢ

where I(t) is a subset of the data at iteration t, and ε_t → 0 slowly (∑_t ε_t = ∞, ∑_t ε_t² < ∞).

Newton-Raphson:

b_new = b − (∇²_b R̂_log)⁻¹ ∇_b R̂_log

This is also called iteratively reweighted least squares.
Conjugate gradient, L-BFGS and other methods from numerical analysis.
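A minimal gradient-descent sketch in R for logistic regression with labels in {−1, +1}, using the gradient from the previous slide; the simulated data, the fixed step size ε and the number of iterations are illustrative assumptions, not a production implementation.

s <- function(z) 1 / (1 + exp(-z))

set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))   # first column absorbs the intercept
b_true <- c(-0.5, 1.5, -1.0)
y <- ifelse(runif(n) < as.vector(s(X %*% b_true)), 1, -1)

b   <- rep(0, ncol(X))                          # start at zero
eps <- 0.5                                      # fixed step size (illustrative)
for (t in 1:1000) {
  grad <- -colMeans(as.vector(s(-y * (X %*% b)) * y) * X)   # gradient of R_hat_log
  b <- b - eps * grad                           # b_new = b + eps (1/n) sum s(-y_i x_i' b) y_i x_i
}
round(b, 2)                                     # should be roughly close to b_true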

Page 11: Logistic regression · Logistic Regression

Linearly separable data

Assume that the data is linearly separable, i.e. there is a scalar α and a vector β such that yᵢ(α + βᵀxᵢ) > 0, i = 1, . . . , n. Let c > 0. The empirical risk for a = cα, b = cβ is

R̂_log(f_{a,b}) = (1/n) ∑_{i=1}^n log(1 + exp(−c yᵢ(α + βᵀxᵢ)))

which can be made arbitrarily close to zero as c → ∞, i.e. the soft classification rule becomes ±∞ (overconfidence) → overfitting.

Page 12: Logistic regression · Logistic Regression

Multi-class logistic regression

Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities p(Y = k | X = x; θ) for the K classes k = 1, . . . , K, i.e.,

p(Y = k | X = x; θ) = exp(w_kᵀ x + b_k) / ∑_{ℓ=1}^K exp(w_ℓᵀ x + b_ℓ).

Parameters are θ = (b, W), where W = (w_kj) is a K × p matrix of weights and b ∈ ℝ^K is a vector of bias terms.
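A small R sketch of the softmax map used above; W, b, and x are illustrative values, and a constant is subtracted before exponentiating purely for numerical stability (it does not change the probabilities).

softmax <- function(z) { z <- z - max(z); exp(z) / sum(exp(z)) }

set.seed(1)
K <- 3; p <- 2
W <- matrix(rnorm(K * p), nrow = K)             # K x p weight matrix
b <- rnorm(K)                                   # bias vector in R^K
x <- c(0.5, -1.0)

probs <- softmax(as.vector(W %*% x + b))        # p(Y = k | X = x; theta), k = 1..K
probs
sum(probs)                                      # equals 1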

Page 13: Logistic regression · Logistic Regression

Logistic Regression: Summary

Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
Diverging optimal parameters for linearly separable data: need to regularise / pull them towards zero.
A simple example of a generalised linear model (GLM), for which there is a well-established statistical theory:

Assessment of fit via deviance and plots.
Well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).

Page 14: Regularization · Overfitting and Regularization

Overfitting and Regularization

Page 15: Regularization · Overfitting and Regularization

Overfitting in Logistic Regression

# Classes overlap (note y = 1 at dx = -2): glm fits a smooth logistic curve.
dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 1, 0, 1, 1, 1, 1)
lr <- glm(y ~ ., data = d, family = binomial)
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)

[Plot: fitted probability curve p(x) with the data points; because the classes overlap, the fitted logistic curve rises gradually from 0 to 1.]

Page 16: Regularization · Overfitting and Regularization

Overfitting in Logistic Regression

# Classes are perfectly separable: the fitted coefficients diverge, and glm
# typically warns that fitted probabilities numerically 0 or 1 occurred.
dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1)
lr <- glm(y ~ ., data = d, family = binomial)
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)

[Plot: fitted probability curve p(x) with the data points; with perfectly separable classes the fitted curve approaches a sharp step at the decision boundary, i.e. overconfident probabilities near 0 and 1.]

Page 17: Regularization · Overfitting and Regularization

Overfitting in Linear Regression

http://www.stats.ox.ac.uk/~flaxman/sml17/overfitting.html

Page 18: Regularization · Overfitting and Regularization

Regularization

Flexible models for high-dimensional problems require many parameters.
With many parameters, learners can easily overfit to the noise in the training data.
Regularization: limit the flexibility of the model to prevent overfitting.
Deep connections with the Bayesian perspective: prior distributions provide regularization.
Typically: add a term penalizing large values of the parameters θ.

R_emp(θ) + λ‖θ‖_ρ^ρ = (1/n) ∑_{i=1}^n log(1 + exp(−yᵢ(a + bᵀxᵢ))) + λ‖b‖_ρ^ρ

where ρ ∈ [1, 2] and ‖z‖_ρ = (∑_{j=1}^p |z_j|^ρ)^{1/ρ} is the L_ρ norm of b (also of interest when ρ ∈ [0, 1), but then no longer a norm).
Also known as shrinkage methods: parameters are shrunk towards 0.
Typical cases are ρ = 2 (Euclidean norm, ridge regression) and ρ = 1 (LASSO). When ρ ≤ 1 it is called a sparsity-inducing regularization.
λ is a tuning parameter (or hyperparameter) and controls the amount of regularization, and hence the resulting complexity of the model.

Bach, Sparse Methods in Machine learning.
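In practice the penalized objective is rarely coded by hand; below is a minimal sketch using the glmnet package listed in the resources (ridge via alpha = 0, lasso via alpha = 1), on simulated data chosen purely for illustration.

library(glmnet)

set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), ncol = p)
beta <- c(2, -2, rep(0, p - 2))                 # only the first two features matter
y <- rbinom(n, 1, plogis(X %*% beta))

ridge <- glmnet(X, y, family = "binomial", alpha = 0)    # rho = 2 penalty
lasso <- glmnet(X, y, family = "binomial", alpha = 1)    # rho = 1 penalty

## lambda is the tuning parameter; cv.glmnet chooses it by cross-validation,
## tying back to the validation slides at the start of the lecture.
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv, s = "lambda.min")                      # lasso: many coefficients exactly zero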

Page 19: Regularization · Overfitting and Regularization

Regularization

[Figure: L_ρ regularization profile for different values of ρ (0.01, 0.10, 0.50, 1.0, 1.5, 2.0).]

Page 20: Regularization · Overfitting and Regularization

Regularization

[Figure: L1 and L2 norm contours.]

Page 21: Regularization · Overfitting and Regularization

L2 regularization

Dates to Tikhonov [1943] (more general framework).
Rediscovered as ridge regression [Hoerl and Kennard 1970].
Derivation of the ridge regression estimator [on board].
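As a companion to the board derivation, a minimal R sketch of the ridge estimator β̂ = (XᵀX + λI)⁻¹Xᵀy for L2-penalized least squares (intercept omitted); the simulated data and the value of λ are illustrative.

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), ncol = p)
beta <- c(1, -1, 0.5, 0, 0)
y <- drop(X %*% beta + rnorm(n))

lambda <- 2
ridge_beta <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))   # (X'X + lambda I)^{-1} X'y
cbind(ols = coef(lm(y ~ X - 1)), ridge = ridge_beta)                    # ridge shrinks towards zero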

Page 22: Regularization · Overfitting and Regularization

Sparsity Inducing Regularization

Consider the constrained optimization problem

min_θ R_emp(θ)   s.t.   ‖θ‖₁ ≤ γ

Introduce a Lagrange multiplier λ > 0 to enforce the constraint:

min_θ R_emp(θ) + λ(‖θ‖₁ − γ)

At the optimal value of λ, the parameter θ is the one minimizing the regularized empirical risk objective. Conversely, given λ, there is a value of γ such that the corresponding optimal Lagrange multiplier is λ.

Generally: L1 regularization leads to optimal solutions with many zeros, i.e. the regression function depends only on the (small) number of features with non-zero parameters.

[Figure: contours of R_emp(θ) and the L1 constraint set ‖θ‖₁ ≤ γ.]
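A one-dimensional illustration in R of why the L1 penalty yields exact zeros while the L2 penalty only shrinks: for a single coefficient, the minimizer of (z − θ)²/2 + λ|θ| over θ is the soft-thresholding of z, whereas the minimizer of (z − θ)²/2 + λθ²/2 is z/(1 + λ). The grid of z values and λ are illustrative.

soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

z <- seq(-3, 3, by = 0.5)                       # unpenalized one-dimensional estimates
lambda <- 1
cbind(z,
      l1 = soft_threshold(z, lambda),           # exactly zero whenever |z| <= lambda
      l2 = z / (1 + lambda))                    # L2 penalty: shrinks but never exactly zero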

Page 23: Regularization · Overfitting and Regularization

Illustration

Source: Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright.

Page 24: Regularization · Overfitting and Regularization

Bias/variance tradeoff

Why does regularization prevent overfitting? Regularization introduces bias with the goal of reducing variance:

[Figure: prediction error vs. model complexity/flexibility. Training error decreases as complexity grows, while test error is U-shaped: underfitting (high bias, low variance) on the left, overfitting (low bias, high variance) on the right, with a 'just right' level of complexity in between.]

http://www.stats.ox.ac.uk/~flaxman/sml17/regularization.html

Page 25: Extra

Useful Resources and Pointers

[Links below are clickable in the original slides.]
Compendium of common loss functions for classification and regression
Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright
glmnet (R package) tutorial
Matrix Cookbook