
Statistical Learning

Saharon Rosset

Special thanks: Trevor Hastie

Outline

• Part 1: Introduction to Statistical Learning

Roughly chapters 1-3 of “Elements of Statistical Learning” by Hastie, Tibshirani and Friedman (2001)

– Motivation and problem examples
– Introduction of fundamental concepts:
  • Supervised learning: regression and classification
  • Local models (k-NN, kernel smoothing)
  • Linear models
  • Bias-variance tradeoff(s)
  • Examples
– Illustration through discussion of some simple regression methods: linear regression and k-NN

Outline

• Part 2: Regularization and Boosting
– Regularized optimization: introduction and examples
– Boosting: introduction and examples
– Boosting as approximate L1 regularization

• Part 3: L1 Regularization: statistical and computational properties
– Piecewise linear regularized solution paths
– L1 regularization in infinite dimensional feature spaces

ESL Chap1 - Introduction

Statistical Learning Problems

• Identify the risk factors for prostate cancer (lcavol), based on clinical and demographic variables.

• Classify a recorded phoneme, based on a log-periodogram.

• Predict whether someone will have a heart attack, on the basis of demographic, diet and clinical measurements.

• Customize an email spam detection system.

• Identify the numbers in a handwritten zip code, from a digitized image.

• Classify a tissue sample into one of several cancer classes, based on a gene expression profile.

• Classify the pixels in a LANDSAT image, according to usage: {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}.

The Supervised Learning Problem

• Outcome measurement Y (also called dependent variable, response, target)

• Vector of p predictor measurements X (also called independent variables, inputs, regressors, covariates, features)

• In regression problems, Y is quantitative (price, blood pressure)

• In classification problems, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample)

We often use G for classification labels (e.g. G ∈ {survived, died})

• We have training data (x1, y1), …, (xN, yN). These are observations (examples, instances) of these measurements.

Objectives

On the basis of the training data we would like to:

• Accurately predict unseen test cases

• Understand which inputs affect the outcome, and how

• Assess the quality of our predictions and inferences

Philosophy

• It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

• One has to understand the simpler methods first, in order to grasp the more sophisticated ones.

• It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!]

• This is an exciting research area, having important applications in science, industry and finance.

200 points generated in R² from an unknown distribution; 100 in each of two classes G = {GREEN, RED}. Can we build a rule to predict the color of future points?

Linear Regression

The decision boundary is the set of points where the prediction is exactly 0.5.

It is linear by construction, and it appears to make many prediction errors in this case.
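A minimal sketch of this rule, assuming the classes are coded GREEN = 0 and RED = 1 and a linear model in (1, x1, x2) is fit by least squares. The data here are a hypothetical stand-in (two Gaussian clouds), not the data from the slide, and the function names are illustrative only.

```python
import numpy as np

def simulate_two_classes(n_per_class=100, seed=0):
    """Hypothetical stand-in data: two Gaussian clouds in the plane."""
    rng = np.random.default_rng(seed)
    green = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
    red = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([green, red])
    # Assumed coding: GREEN = 0, RED = 1.
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

def fit_linear(X, y):
    """Least-squares fit of y on (1, x1, x2); returns the 3 coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_class(beta, X):
    """Classify as RED (1) when the fitted value exceeds 0.5."""
    A = np.column_stack([np.ones(len(X)), X])
    return (A @ beta > 0.5).astype(int)

X, y = simulate_two_classes()
beta = fit_linear(X, y)
train_error = float(np.mean(predict_class(beta, X) != y))
print("coefficients (intercept, x1, x2):", beta)
print("training misclassification rate:", train_error)
```

The boundary is the line where beta[0] + beta[1]*x1 + beta[2]*x2 = 0.5, which is why no amount of data can make it bend around local clusters.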

Possible Scenarios

K-Nearest Neighbors

15-nearest neighbor classification. Fewer training data are misclassified, and the decision boundary adapts to the local densities of the classes.

1-nearest neighbor classification. None of the training data are misclassified.
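The two k-NN fits can be sketched as a majority vote among the k closest training points. This is an illustrative implementation on hypothetical simulated data (two Gaussian clouds, coded GREEN = 0 and RED = 1), not the code or data behind the slide.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)  # fraction of RED (1) neighbors
    return (votes > 0.5).astype(int)

# Hypothetical stand-in data: 100 points per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

err_15 = float(np.mean(knn_predict(X, y, X, k=15) != y))
err_1 = float(np.mean(knn_predict(X, y, X, k=1) != y))
print("15-NN training error:", err_15)
print("1-NN training error:", err_1)
```

With k = 1 each training point is its own nearest neighbor, so the training error is zero whenever no two points coincide. That says nothing about error on unseen test cases.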

Discussion

• Linear regression uses 3 parameters to describe its fit.

• Does K-nearest neighbors use only 1, the value of k?

• More realistically, k-nearest neighbors uses roughly N/k effective parameters.
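One way to make the N/k figure concrete, offered as a sketch: take the trace of the smoother matrix S, where the k-NN fitted values on the training set are S y. Using the trace as the "effective number of parameters" is a common convention assumed here, not something stated on the slide; the data are hypothetical.

```python
import numpy as np

def knn_smoother_matrix(X, k):
    """Row i puts weight 1/k on the k training points nearest to X[i] (itself included)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    S = np.zeros((n, n))
    S[np.arange(n)[:, None], nearest] = 1.0 / k
    return S

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # hypothetical inputs, N = 200
for k in (1, 5, 15):
    S = knn_smoother_matrix(X, k)
    print(f"k={k}: trace(S)={np.trace(S):.3f}, N/k={len(X) / k:.3f}")
```

Each point is among its own k nearest neighbors, so every diagonal entry of S equals 1/k and the trace is N/k.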

Many modern procedures are variants of linear regression and K-nearest neighbors:

• Kernel smoothers (or viewed as RKHS regression)
• Local linear regression
• Linear basis expansions
• Projection pursuit and neural networks
• Support vector machines and logistic regression

See page 17 for more details, or the book website for the actual data.

http://www-stat.stanford.edu/ElemStatLearn

The Bayes error is the best performance possible: the decision boundary shown in the image attains it.

How should we choose the right modeling approach?

• We want to minimize the expected prediction error (EPE)
• What kind of considerations do we need to keep in mind?
– Data in high dimension is sparse: Curse of Dimensionality
  ⇒ Makes estimation hard, affects some methods more than others
– If the models we use are too complex, they will be overfitted
  ⇒ Have high variance, be unstable
– If the models are too simple, they will be too poor to represent f(x)
  ⇒ Have high bias, predict poorly

In the next few slides we give a little more detail and some examples; we will revisit these concepts later.

The bias-variance decomposition

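• In a regression setting, using squared loss
• Assume we are building a model which predicts $\hat{f}(x)$
• What makes up our expected risk?

\[
\begin{aligned}
\mathrm{EPE}(\hat{f}(X)) &= E\,(Y - \hat{f}(X))^2 \\
&= E\,(Y - EY + EY - E\hat{f}(X) + E\hat{f}(X) - \hat{f}(X))^2 \\
&= \mathrm{Var}(Y) + (EY - E\hat{f}(X))^2 + E\,(E\hat{f}(X) - \hat{f}(X))^2
\end{aligned}
\]

– $\mathrm{Var}(Y)$: irreducible error of the best possible estimator, $\hat{f}(X) = E(Y \mid X)$
– $(EY - E\hat{f}(X))^2$: squared bias, measuring our model’s lack of expressiveness
– $E\,(E\hat{f}(X) - \hat{f}(X))^2$: variance of our model’s prediction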
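The split of EPE into irreducible error, squared bias and variance can be checked numerically. The sketch below does this at a single point x0 for k-NN regression; the true function, noise level, sample size and k are all hypothetical choices made for illustration, and the expectations are approximated by Monte Carlo over repeated training sets.

```python
import numpy as np

def f_true(x):
    """Hypothetical true regression function E(Y | X = x)."""
    return np.sin(2 * np.pi * x)

def knn_fit_at(x0, x_train, y_train, k):
    """k-NN regression estimate at x0: mean response of the k nearest points."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

def decompose(x0=0.25, n=50, k=10, sigma=0.5, n_rep=5000, seed=0):
    """Monte Carlo estimates of EPE, irreducible error, squared bias, variance at x0."""
    rng = np.random.default_rng(seed)
    fits = np.empty(n_rep)
    for r in range(n_rep):
        # Draw a fresh training set and record the model's prediction at x0.
        x_train = rng.uniform(0.0, 1.0, size=n)
        y_train = f_true(x_train) + rng.normal(0.0, sigma, size=n)
        fits[r] = knn_fit_at(x0, x_train, y_train, k)
    # Independent test responses at x0, one per simulated training set.
    y_new = f_true(x0) + rng.normal(0.0, sigma, size=n_rep)
    epe = float(np.mean((y_new - fits) ** 2))
    irreducible = sigma ** 2                         # Var(Y) at x0
    bias_sq = float((f_true(x0) - fits.mean()) ** 2) # (EY - E f_hat)^2
    variance = float(fits.var())                     # E(E f_hat - f_hat)^2
    return epe, irreducible, bias_sq, variance

epe, irreducible, bias_sq, variance = decompose()
print("EPE (simulated):          ", epe)
print("irreducible + bias^2 + var:", irreducible + bias_sq + variance)
```

The two printed numbers should agree up to simulation noise. Rerunning with a smaller k should lower the squared bias and raise the variance, and a larger k should do the opposite.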

Effect as dimension p increases
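As a rough illustration of why local methods suffer as p grows, the sketch below looks at uniform data in the unit cube [0, 1]^p (an assumed setup, not taken from the slide). It computes the edge length a sub-cube needs in order to contain a given fraction of the data, and simulates the typical distance to the nearest neighbor.

```python
import numpy as np

def edge_length(fraction, p):
    """Edge of a sub-cube of [0, 1]^p that holds `fraction` of uniformly spread data."""
    return fraction ** (1.0 / p)

def median_nn_distance(n, p, seed=0):
    """Median distance from each of n uniform points in [0, 1]^p to its nearest neighbor."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # ignore each point's zero distance to itself
    return float(np.median(np.sqrt(d2.min(axis=1))))

for p in (1, 2, 10, 50):
    print(f"p={p:2d}: edge for 1% of data = {edge_length(0.01, p):.3f}, "
          f"median NN distance (n=200) = {median_nn_distance(200, p):.3f}")
```

In 10 dimensions a sub-cube must span about 63% of the range of every input to capture just 1% of the data, so a "local" neighborhood is no longer local.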
