SB2b Statistical Machine Learning, Hilary Term 2017
Seth Flaxman
New website: http://www.stats.ox.ac.uk/~flaxman/course_ml.html

Announcements:
First problem sheet available! (due: Friday 27 Jan. at 5pm for Part B)
First problem class for MSc students: Thursday (26 Jan) at 3pm in LG.01
Recap from last time: Validation and Cross-Validation

Validation
For each combination of tuning parameters γ: train the model on the training set, fitting parameters θ = θ(γ) and obtaining the decision function f_θ(γ). Evaluate R_val(f_θ(γ)), the average loss on a validation set.
Pick the γ* with the best performance on the validation set.
Using γ*, train on both the training and validation sets to obtain the optimal θ*.
R_val(f_θ(γ*)) is now a biased estimate of R(f_θ(γ*)) and can be overly optimistic!
Evaluate the model with γ*, θ* on the test set, reporting generalization performance.
[Figure: the data is split into training, validation, and test sets; parameters θ are fit on the training set, model complexity is chosen on the validation set, and generalization performance is reported on the test set.]
Cross-Validation
[Figure: K-fold cross-validation; the data is divided into folds, the validation fold rotates among the training folds, and the test set is held out throughout.]
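A minimal R sketch of this recapped procedure (not from the slides): K-fold cross-validation is used to choose a tuning parameter, here the polynomial degree of a simple regression; the simulated data, the number of folds, and the squared-error loss are all arbitrary illustrative choices.

# K-fold cross-validation to choose a tuning parameter (illustrative only)
set.seed(0)
n <- 120
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
d <- data.frame(x, y)

K <- 5
folds <- sample(rep(1:K, length.out = n))     # random fold assignment
degrees <- 1:8                                # candidate tuning parameters (gamma)

cv_error <- sapply(degrees, function(deg) {
  errs <- sapply(1:K, function(k) {
    train <- d[folds != k, ]
    valid <- d[folds == k, ]
    fit <- lm(y ~ poly(x, deg), data = train)             # fit theta(gamma) on training folds
    mean((valid$y - predict(fit, newdata = valid))^2)     # validation loss
  })
  mean(errs)                                  # average over folds
})

best_degree <- degrees[which.min(cv_error)]
best_degree
# Finally, refit on all the training data with the chosen degree and assess on a
# held-out test set (not shown) to report generalization performance.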
Logistic regression
One of the most popular methods for classification.
A linear model for the log-odds of the class probabilities.
Dates back to work on population growth curves by Verhulst [1838, 1845, 1847].
Statistical use for classification dates to Cox [1960s].
Independently discovered as the perceptron in machine learning [Rosenblatt 1957].
Main example of "discriminative" as opposed to "generative" learning.
Naïve approach to classification: linear regression.
Statistical perspective: consider Y = {0, 1}. Generalised linear model with Bernoulli likelihood and logit link:
$$Y \mid X = x, a, b \sim \mathrm{Bernoulli}\big(s(a + b^\top x)\big), \qquad s(a + b^\top x) = \frac{1}{1 + \exp(-(a + b^\top x))}.$$
[Figure: the logistic (sigmoid) function s(z), increasing from 0 to 1 with s(0) = 0.5.]
ML perspective: a discriminative classifier. Consider binary classification with Y = {-1, +1}. Logistic regression uses a parametric model on the conditional Y | X, not the joint distribution of (X, Y):
$$p(Y = y \mid X = x; a, b) = \frac{1}{1 + \exp(-y(a + b^\top x))}.$$
Linearity of log-odds and logistic function
a + b^⊤x models the log-odds ratio:
$$\log \frac{p(Y = +1 \mid X = x; a, b)}{p(Y = -1 \mid X = x; a, b)} = a + b^\top x.$$
Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x; a, b) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x; a, b) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
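As a quick sanity check of these formulas, a minimal R sketch (not from the slides); the parameter values a, b and the point x are arbitrary illustrations.

# Conditional class probabilities for labels y in {-1, +1}
sigmoid <- function(z) 1 / (1 + exp(-z))

# p(Y = y | X = x) = s(y * (a + b'x))
class_prob <- function(x, y, a, b) sigmoid(y * (a + sum(b * x)))

a <- -0.5
b <- c(1.0, 2.0)
x <- c(0.3, -0.1)

p_plus  <- class_prob(x, +1, a, b)
p_minus <- class_prob(x, -1, a, b)
p_plus + p_minus   # probabilities sum to 1, since s(-z) = 1 - s(z)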
Fitting the parameters of the hyperplane
How to learn a and b?
Consider maximizing the conditional log-likelihood for Y = {-1, +1}:
$$\ell(a, b) = \sum_{i=1}^{n} \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)).$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$\hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} -\log s(y_i(a + b^\top x_i)) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i(a + b^\top x_i)))$$
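A small R sketch (not from the slides) of this log-loss empirical risk; the matrix X, labels y, and parameters a, b are illustrative placeholders, with y in {-1, +1}.

# (1/n) * sum_i log(1 + exp(-y_i (a + b'x_i)))
log_loss_risk <- function(a, b, X, y) {
  margins <- y * (a + as.vector(X %*% b))
  mean(log1p(exp(-margins)))
}

set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- sample(c(-1, 1), 10, replace = TRUE)
log_loss_risk(0, c(0, 0), X, y)   # equals log(2) when a = 0, b = 0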
It is not possible to find the optimal a, b analytically. For simplicity, absorb a as an entry in b by appending a '1' to the x vector. Objective function:
$$\hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} -\log s(y_i x_i^\top b)$$
Properties of the logistic function:
$$s(-z) = 1 - s(z), \qquad \nabla_z s(z) = s(z)s(-z), \qquad \nabla_z \log s(z) = s(-z), \qquad \nabla_z^2 \log s(z) = -s(z)s(-z)$$
Differentiate with respect to b:
$$\nabla_b \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla_b^2 \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top \succeq 0.$$
The second derivative is positive-definite: the objective function is convex and there is a unique global minimum. Many different algorithms can find the optimal b, e.g.:
Gradient descent:
$$b^{\text{new}} = b + \varepsilon \frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b)\, y_i x_i$$
Stochastic gradient descent:
$$b^{\text{new}} = b + \varepsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where I(t) is a subset of the data at iteration t, and ε_t → 0 slowly (∑_t ε_t = ∞, ∑_t ε_t² < ∞).
Newton-Raphson:
$$b^{\text{new}} = b - \left(\nabla_b^2 \hat{R}_{\log}\right)^{-1} \nabla_b \hat{R}_{\log}$$
This is also called iteratively reweighted least squares.
Conjugate gradient, L-BFGS, and other methods from numerical analysis.
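A minimal R sketch (not from the slides) of the first and last of these updates, gradient descent and Newton-Raphson; the simulated data, step size eps, and iteration counts are arbitrary choices, and the final line compares against R's built-in glm fit.

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))   # first column absorbs the intercept a
b_true <- c(-0.5, 1, -2)
y <- ifelse(runif(n) < sigmoid(X %*% b_true), 1, -1)

# Gradient and Hessian of the log-loss empirical risk (formulas above)
grad <- function(b) -colMeans(as.vector(sigmoid(-y * (X %*% b)) * y) * X)
hess <- function(b) {
  w <- as.vector(sigmoid(y * (X %*% b)) * sigmoid(-y * (X %*% b)))
  crossprod(X * w, X) / n
}

# Gradient descent
b_gd <- rep(0, 3); eps <- 1
for (it in 1:2000) b_gd <- b_gd - eps * grad(b_gd)

# Newton-Raphson (iteratively reweighted least squares)
b_nr <- rep(0, 3)
for (it in 1:20) b_nr <- b_nr - solve(hess(b_nr), grad(b_nr))

cbind(b_gd, b_nr, glm_fit = coef(glm((y + 1) / 2 ~ X - 1, family = binomial)))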
Linearly separable data
Assume that the data is linearly separable, i.e. there is a scalar α and a vector β such that y_i(α + β^⊤x_i) > 0 for i = 1, ..., n. Let c > 0. The empirical risk for a = cα, b = cβ is
$$\hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-c\, y_i(\alpha + \beta^\top x_i)))$$
which can be made arbitrarily close to zero as c → ∞, i.e. the soft classification rule becomes ±∞ (overconfidence) → overfitting.
Multi-class logistic regression
Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities p(Y = k | X = x; θ), for K classes k = 1, ..., K, i.e.,
$$p(Y = k \mid X = x; \theta) = \frac{\exp(w_k^\top x + b_k)}{\sum_{\ell=1}^{K} \exp(w_\ell^\top x + b_\ell)}.$$
The parameters are θ = (b, W), where W = (w_kj) is a K × p matrix of weights and b ∈ R^K is a vector of bias terms.
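For concreteness, a minimal R sketch (not from the slides) of the softmax probabilities; K, p, and the parameter values W, b, x are arbitrary illustrations.

softmax_probs <- function(x, W, b) {
  scores <- as.vector(W %*% x + b)    # one score w_k'x + b_k per class
  scores <- scores - max(scores)      # subtract the max for numerical stability
  exp(scores) / sum(exp(scores))
}

K <- 3; p <- 2
W <- matrix(c(1, 0, -1, 0.5, 2, -0.5), nrow = K, ncol = p)
b <- c(0, 0.1, -0.1)
x <- c(0.4, -1.2)

probs <- softmax_probs(x, W, b)
probs
sum(probs)   # the K probabilities sum to 1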
Logistic Regression: Summary
Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
Diverging optimal parameters for linearly separable data: need to regularise / pull them towards zero.
A simple example of a generalised linear model (GLM), for which there is a well-established statistical theory:
Assessment of fit via deviance and plots.
Well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).
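A small R sketch (not from the slides) of where this GLM machinery lives in base R; the simulated data, variable names, and the particular tests shown are illustrative.

set.seed(2)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1))   # x2 is irrelevant by construction

fit <- glm(y ~ x1 + x2, family = binomial)

summary(fit)                      # Wald tests for individual coefficients
fit$deviance                      # residual deviance (assessment of fit)
drop1(fit, test = "Chisq")        # drop-in-deviance (likelihood ratio) tests
anova(glm(y ~ x1, family = binomial), fit, test = "Chisq")   # nested model comparison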
Overfitting and Regularization
Overfitting in Logistic Regression

dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 1, 0, 1, 1, 1, 1)               # labels are not perfectly separable in dx
lr <- glm(y ~ ., data = d, family = binomial)
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)
[Plot: fitted probabilities p against x, with the data points (dx, y) overlaid; the fitted curve is a smooth S-shape.]
Overfitting in Logistic Regression (linearly separable case)

dx <- c(-5, -4, -3, -2, -1, 1, 2, 3, 5)
d <- data.frame(dx)
x <- seq(-6, 6, .1)
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1)               # labels now perfectly separated by the sign of dx
lr <- glm(y ~ ., data = d, family = binomial)   # perfect separation: glm typically warns that
                                                # fitted probabilities are numerically 0 or 1
p <- predict(lr, newdata = data.frame(dx = x), type = "response")
plot(x, p, type = "l")
points(dx, y)
[Plot: fitted probabilities p against x for the separable data, with the data points overlaid; the fitted curve is nearly a step function.]
Overfitting in Linear Regression
http://www.stats.ox.ac.uk/~flaxman/sml17/overfitting.html
Regularization
Flexible models for high-dimensional problems require many parameters.
With many parameters, learners can easily overfit to the noise in the training data.
Regularization: limit the flexibility of the model to prevent overfitting.
Deep connections with the Bayesian perspective: prior distributions provide regularization.
Typically: add a term penalizing large values of the parameters θ, e.g. for logistic regression:
$$R_{\text{emp}}(\theta) + \lambda \|\theta\|_\rho^\rho = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i(a + b^\top x_i))) + \lambda \|b\|_\rho^\rho$$
where ρ ∈ [1, 2], and ‖z‖_ρ = (∑_{j=1}^{p} |z_j|^ρ)^{1/ρ} is the L_ρ norm of b (also of interest when ρ ∈ [0, 1), but then it is no longer a norm).
Also known as shrinkage methods: parameters are shrunk towards 0.
Typical cases are ρ = 2 (Euclidean norm, ridge regression) and ρ = 1 (LASSO). When ρ ≤ 1 it is called a sparsity-inducing regularization.
λ is a tuning parameter (or hyperparameter) that controls the amount of regularization, and hence the complexity of the resulting model.
Bach, Sparse Methods in Machine learning.
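As a pointer to practice (the glmnet tutorial is linked in the resources at the end), a hedged R sketch of fitting L1- and L2-regularized logistic regression with the glmnet package; the simulated data and the choice of λ are illustrative only, not from the slides.

library(glmnet)

set.seed(3)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -2, 1, rep(0, p - 3))            # only the first 3 features matter
y <- rbinom(n, 1, plogis(X %*% beta))

fit_ridge <- glmnet(X, y, family = "binomial", alpha = 0)   # alpha = 0: L2 penalty (ridge)
fit_lasso <- glmnet(X, y, family = "binomial", alpha = 1)   # alpha = 1: L1 penalty (LASSO)

# lambda is the tuning parameter; cross-validation (recapped above) chooses it
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv, s = "lambda.min")        # many coefficients are exactly zero (sparsity)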
[Figure: L_ρ regularization profile |z|^ρ for different values of ρ (.01, .10, .50, 1.0, 1.5, 2.0).]
[Figure: L1 and L2 norm contours.]
L2 regularization
Dates to Tikhonov [1943] (a more general framework).
Rediscovered as ridge regression [Hoerl and Kennard 1970].
Derivation of the ridge regression estimator [on board]; a sketch of the standard result is given below.
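The derivation itself is done on the board; as a reminder of where it lands, here is a brief sketch of the standard result (my reconstruction, not the board notes) for linear regression with design matrix X ∈ R^{n×p} and response y ∈ R^n:
$$\hat{b}_{\text{ridge}} = \arg\min_b \; \|y - Xb\|_2^2 + \lambda \|b\|_2^2.$$
Setting the gradient to zero, $-2X^\top(y - Xb) + 2\lambda b = 0$, gives
$$\hat{b}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$
which exists for every λ > 0 since $X^\top X + \lambda I$ is positive definite.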
Sparsity Inducing Regularization
Consider the constrained optimization problem
$$\min_\theta R_{\text{emp}}(\theta) \quad \text{s.t.} \quad \|\theta\|_1 \le \gamma$$
Introduce a Lagrange multiplier λ > 0 to enforce the constraint:
$$\min_\theta R_{\text{emp}}(\theta) + \lambda(\|\theta\|_1 - \gamma)$$
At the optimal value of λ, the parameter θ is the one minimizing the regularized empirical risk objective.
Conversely, given λ, there is a value of γ such that the corresponding optimal Lagrange multiplier is λ.
Generally: L1 regularization leads to optimal solutions with many zeros, i.e. the regression function depends only on the (small) number of features with non-zero parameters (illustrated in the sketch below).
[Figure: contours of R_emp(θ) together with the constraint region ‖θ‖₁ ≤ γ.]
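A small R sketch (not from the slides) comparing the sparsity of L1 and L2 solutions; it reuses glmnet on made-up data, and the single value of λ is an arbitrary choice.

library(glmnet)

set.seed(4)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

b_l1 <- coef(glmnet(X, y, family = "binomial", alpha = 1, lambda = 0.05))   # LASSO
b_l2 <- coef(glmnet(X, y, family = "binomial", alpha = 0, lambda = 0.05))   # ridge

sum(b_l1 != 0)   # few non-zero coefficients: L1 sets most of them to exactly zero
sum(b_l2 != 0)   # typically all p + 1 coefficients are non-zero under L2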
Illustration
[Figure omitted. Source: Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright.]
Bias/variance tradeoff
Why does regularization prevent overfitting? Regularization introduces bias with the goal of reducing variance:
[Figure: training and test prediction error against model complexity/flexibility; underfitting (high bias, low variance) on the left, overfitting (low bias, high variance) on the right, and "just right" in between, where test error is minimised.]
http://www.stats.ox.ac.uk/~flaxman/sml17/regularization.html
Extra
Useful Resources and Pointers
Compendium of common loss functions for classification and regression
Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani, and Wainwright
glmnet (R package) tutorial
Matrix Cookbook