
Lecture 1: Course logistics, introduction, bias-variance tradeoff

STATS 202: Data Mining and Analysis

Linh Tran
[email protected]

Department of Statistics
Stanford University

June 21, 2021

Course Information

- Class website: stats-202.github.io
- Videos: Zoom lectures will be recorded and uploaded to Canvas.
- Textbook: An Introduction to Statistical Learning
- Supplemental textbook: The Elements of Statistical Learning
- Email policy: Please use Piazza for most questions. Homeworks and exams should be submitted via Gradescope.
- Office hours: Please refer to this Google calendar.

Syllabus

- Topics: Intro to statistical learning and methods for analyzing large amounts of data
- Prereqs: STATS 60, MATH 51, CS 105
- Grades: 3 components
  - 4 homework assignments (50 pts each)
    - Due by 3:59pm PDT on the due date
    - Accepted up to 2 days late w/ 20% penalty (2 free late days)
  - Take-home midterm on Wednesday, July 21 (100 pts)
  - Final project (200 pts)
    - Submissions due on Monday, August 23, 2021
    - Write-up due on Friday, August 27, 2021

Class material


Programming

While the course textbook uses R, you are free to choose between R and Python. Some thoughts:

- R (style guide)
  - Good visualizations
  - More detailed result outputs
  - Embraced by the statistics community
- Python (style guide)
  - Good scalability
  - More detailed debugging logs
  - Embraced by the ML community

10% of your assignment grade is based upon the organization + readability of your code.

Motivation

Companies are paying lots of money for statistical models.

- Netflix
- Heritage Provider Network
- Department of Homeland Security
- Zillow
- Etc.

Netflix

Popularized prediction challenges by organizing an open, blind contest to improve its recommendation system.

- Prize: $1 million
- Features: User ratings (1 to 5 stars) on previously watched films
- Outcome: User ratings (1 to 5 stars) for unwatched films

Heritage Provider Network

Ran for two years, with six milestone prizes during that span.

- Prize: $3 million (with $500K in milestone prizes)
- Features: Anonymized patient data over a 48-month period
- Outcome: How many days a patient will spend in a hospital in the next year

Department of Homeland Security

Improving the algorithms used by the TSA to detect potential threats from body scans.

- Prize: $1 million
- Features: Body scan images
- Outcome: Whether a given body zone has a threat present

Zillow

Improve the algorithm that "changed the world of real estate."

- Prize: $1.2 million
- Features: Features related to a given house (e.g. sqft, # beds, # baths, etc.)
- Outcome: The sales price for a given house

Empirical vs true distributions

Common scenario: I have a data set. What do I do? Common approaches:

- Fit a linear model and look at p-values
- Fit a non-parametric model and get predictions
- Calculate summary statistics and form a story around the answers

These can result in significantly different answers!

Empirical vs true distributions

Ideally, we want Ψ(P0), i.e. the parameter of interest evaluated at the true distribution P0 rather than at the empirical distribution Pn.

Empirical vs true distributions

Example: P0 is Gaussian, while Pn is Laplace

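To make the distinction concrete, here is a minimal R sketch of Ψ(Pn) vs Ψ(P0); the choice of Ψ as the 90th percentile is purely illustrative:

```r
# A minimal sketch of Psi(Pn) vs Psi(P0), with Psi chosen (illustratively)
# to be the 90th percentile of the distribution.
set.seed(1)
psi_p0 <- qnorm(0.9)                  # Psi(P0) for a standard Gaussian
psi_pn <- quantile(rnorm(100), 0.9)   # Psi(Pn) from n = 100 observed draws
c(truth = psi_p0, estimate = unname(psi_pn))
```

Even with a correctly specified model, the plug-in estimate from a finite sample generally differs from the truth.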

Applied example

The financial crisis of 2007-2008.

- Let O = (X1, X2, ..., Xp, Y) be our data
  - e.g. the X's are measures of the financial & real estate markets; Y is the CDO value.
- We generally want to estimate E0[Y | X1, X2, ..., Xp]
- Two big issues:
  1. Pn wasn't representative of P0
  2. Most statistical models assume the Oi are i.i.d. draws from P0

Result: Ψ(P0) was nowhere close to Ψ(Pn).

Supervised vs unsupervised learning

Supervised: We have a clearly defined outcome of interest.

- Pro: The problem is more clearly defined
- Con: The data may take more resources to gather

Unsupervised: We don't have a clearly defined outcome of interest.

- Pro: The data are typically readily available
- Con: The problem may be more abstract

Unsupervised learning

Typically we start with a data matrix, e.g.

[Figure: example data matrix.]

n.b. The data may also be unstructured (e.g. text, pixels, etc.).

Unsupervised learning

Two primary categories of variables:

1. Quantitative:
   - Numerical
   - Ordinal
2. Qualitative:
   - Categorical
   - Free form

Unsupervised learning

Goal: Learn the overall structure of our data, e.g.

- Clustering: learn meaningful groupings of the data
  - e.g. k-means, Expectation Maximization, etc.
- Correlation: learn meaningful relationships between variables or units
  - e.g. concordance, Pearson's, etc.
- Dimension reduction: learn a compression of the data for downstream tasks
  - e.g. PCA, LDA, auto-encoding, etc.

We learn these using our data.
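A minimal R sketch of all three tasks; the dataset (the built-in iris measurements) and the settings are illustrative choices:

```r
# Sketch of the three unsupervised tasks on a built-in data matrix.
X <- scale(iris[, 1:4])        # n x p data matrix, standardized

km  <- kmeans(X, centers = 3)  # clustering: meaningful groupings
rho <- cor(X)                  # correlation: relationships between variables
pc  <- prcomp(X)               # dimension reduction: PCA compression

table(km$cluster)              # sizes of the learned groupings
```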

Supervised learning

Typically we start with a data matrix together with an outcome, e.g.

[Figure: example data matrix with an outcome column.]

The outcome can be quantitative or qualitative.

Supervised learning

Goal: We learn a mapping from input variables to output variables.

- If the outcome is quantitative, we refer to this as regression
  - e.g. E0[Y | X1, X2, ..., Xp]
- If the outcome is qualitative, we refer to this as classification
  - e.g. P0[Y = y | X1, X2, ..., Xp]

In both cases, we're interested in learning some function,

f0(X1, X2, ..., Xp)    (1)

We estimate f0 using our data.
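As a sketch of both tasks in R (the built-in mtcars data and the chosen predictors are illustrative):

```r
# Regression: estimate E[Y | X] with a linear model.
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Classification: estimate P[Y = 1 | X] with logistic regression.
cls_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

predict(reg_fit, newdata = mtcars[1, ])                     # estimate of y
predict(cls_fit, newdata = mtcars[1, ], type = "response")  # estimated probability
```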

Supervised learning

Motivation: Why learn f0?

Prediction

- Useful when we can readily get X1, X2, ..., Xp, but not Y.
- Allows us to predict what Y likely is.
- Example: Predict stock prices next month using data from last year.

Inference

- Allows us to understand how differences in X1, X2, ..., Xp might affect Y.
- Example: What is the influence of genetic variations on the incidence of heart disease?

Learning f0

How do we estimate f0? Two classes of methods:

- Parametric models: We assume that f0 takes a specific form. For example, a linear form:

  f0(X1, X2, ..., Xp) = X1β1 + X2β2 + ... + Xpβp    (2)

- Non-parametric models: We don't make any assumptions on the form of f0, but we restrict how "wiggly" or "rough" the function can be. For example, using loess.
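A minimal R sketch of both approaches on simulated data; the data-generating function below is an illustrative assumption:

```r
# Fit a parametric (linear) and a non-parametric (loess) model
# to the same simulated data; the true curve here is an illustrative choice.
set.seed(1)
x <- runif(100, 0, 100)
y <- 10 + sin(x / 10) + rnorm(100, sd = 0.5)
dat <- data.frame(x = x, y = y)

fit_lm    <- lm(y ~ x, data = dat)     # assumes f0 is linear
fit_loess <- loess(y ~ x, data = dat)  # only restricts smoothness

predict(fit_lm,    newdata = data.frame(x = 50))
predict(fit_loess, newdata = data.frame(x = 50))
```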

Parametric vs non-parametric models

[Figure: visualization of parametric vs non-parametric models.]

Non-parametric models tend to be larger than parametric models.

Recall: A statistical model is simply a set of probability distributions that you allow your data to follow.

Parametric vs non-parametric fit

Non-parametric models tend to be more flexible.

Parametric vs non-parametric models

Question: Why don't we just always use non-parametric models?

1. Interpretability: parametric models are simpler to interpret
2. Convenience: less computation, more reproducibility, better behavior
3. Overfitting: non-parametric models tend to overfit (i.e. high variance)

Prediction error

Training data: (xi, yi) : i = 1, 2, ..., n
Goal: Estimate f0 with our data, resulting in fn.

Typically, we get fn by minimizing a prediction error.

- Assumes the (xi, yi) are i.i.d. draws from P0

Standard prediction error functions:

- Classification: Cross-entropy

  CE(fn) = E0[−yi · log fn(xi)]    (3)

- Regression: Mean squared error

  MSE(fn) = E0[yi − fn(xi)]²    (4)

Prediction error

Cross entropy:

CE(fn) = E0[−yi · log fn(xi)]    (5)

n.b. We can't directly calculate this, since P0 is unknown.

But: We do have Pn, i.e. our training data (xi, yi) : i = 1, 2, ..., n.

Estimating cross entropy:

CE(fn) = En[−yi · log fn(xi)]    (6)
       = (1/n) Σ_{i=1}^n −yi · log fn(xi)    (7)

Prediction error

Similarly, we estimate the mean squared error using our data.

Estimating mean squared error:

MSE(fn) = En[yi − fn(xi)]²    (8)
        = (1/n) Σ_{i=1}^n (yi − fn(xi))²    (9)
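A sketch of equations (6)-(9) as R functions. The argument names are assumptions for illustration: `p_hat` is a matrix of predicted class probabilities (one column per class), `y_onehot` the matching one-hot outcome matrix, and `y_hat` a vector of regression predictions.

```r
# Empirical (plug-in) prediction errors, following equations (6)-(9).
# All argument names here are illustrative assumptions.
cross_entropy <- function(y_onehot, p_hat) {
  mean(rowSums(-y_onehot * log(p_hat)))  # (1/n) sum_i -y_i . log fn(x_i)
}
mse <- function(y, y_hat) {
  mean((y - y_hat)^2)                    # (1/n) sum_i (y_i - fn(x_i))^2
}
```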

Prediction error

There are two common problems with prediction errors:

1. A high prediction error could mean underfitting.
   - e.g. You could have the wrong functional form
2. A low prediction error could mean overfitting.
   - e.g. You made your model too flexible

Underfitting

1. A high prediction error could mean underfitting.

[Figure: the true function f0 (left); observed data and the estimated function fn (right).]

Overfitting

2. A low prediction error could mean overfitting.

[Figure: the true function f0 (left); observed data and the estimated function fn (right).]

Prediction error

How to tell if we've under/overfit:

- Evaluate on data not used in training (i.e. from your test set).

Given our test data (xi′, yi′) : i = 1, 2, ..., m, we can calculate a more accurate prediction error, e.g.:

MSE(fn) = En_test[yi′ − fn(xi′)]²    (10)
        = (1/m) Σ_{i=1}^m (yi′ − fn(xi′))²    (11)

Prediction error

How to tell if we've under/overfit:

- So, now we have two prediction error estimates:
  1. MSE_train(fn) from our training data
  2. MSE_test(fn) from our test data

If MSE_train(fn) << MSE_test(fn), then we've likely overfit on our training data (see the sketch below).
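The following R sketch mimics the figures that follow: polynomial degree stands in for "flexibility", and the data are simulated (both are illustrative choices).

```r
# Compare train vs test MSE as model flexibility grows.
set.seed(2)
n <- 200
x <- runif(n, 0, 100)
dat <- data.frame(x = x, y = sin(x / 10) + rnorm(n, sd = 0.3))
train <- sample(n, n / 2)

for (d in c(1, 3, 10)) {   # increasing flexibility
  fit <- lm(y ~ poly(x, d), data = dat[train, ])
  pred <- function(idx) predict(fit, newdata = dat[idx, ])
  cat("degree", d,
      "train MSE", mean((dat$y[train] - pred(train))^2),
      "test MSE",  mean((dat$y[-train] - pred(-train))^2), "\n")
}
```

Train MSE keeps falling with flexibility, while test MSE eventually rises again.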

How to tell if we've under/overfit:

[Figure: estimates fn of f0 at several flexibility levels (left); train and test MSE for each fn as a function of flexibility (right).]

How to tell if we've under/overfit (with an almost linear f0):

[Figure: estimates fn of f0 at several flexibility levels (left); train and test MSE for each fn (right).]

Low-flexibility models work well.

How to tell if we've under/overfit (with low noise):

[Figure: estimates fn of f0 at several flexibility levels (left); train and test MSE for each fn (right).]

High-flexibility models work well.

Bias variance decomposition

Let x0 be a fixed point, y0 = f0(x0) + ε0, and fn be an estimate of f0 from (xi, yi) : i = 1, 2, ..., n.

The MSE at x0 can be decomposed as

MSE(x0) = E0[y0 − fn(x0)]²    (12)
        = Var(fn(x0)) + Bias(fn(x0))² + Var(ε0)    (13)

Bias variance decomposition

MSE(x0) = Var(fn(x0)) + Bias(fn(x0))² + Var(ε0)

Var(ε0): noise from the data distribution, i.e. the irreducible error.

[Figure: the true function f0 and observed data.]

Bias variance decomposition

MSE(x0) = Var(fn(x0)) + Bias(fn(x0))² + Var(ε0)

Var(fn(x0)): the variance of fn(x0) (i.e. of the estimate of y); how much the estimate fn at x0 changes with new data.

[Figure: observed data and the estimate fn; then estimates fn for 8 different data samples (n = 50), shown panel by panel and overlaid.]

Bias variance decomposition

MSE(x0) = Var(fn(x0)) + Bias(fn(x0))² + Var(ε0)

Bias(fn(x0))²: the square of the expected difference, (E0[fn(x0) − f0(x0)])²; how far the average estimate fn is from f0 at x0.

[Figure: observed data, the true f0, and the estimate fn.]

Bias variance decomposition

MSE(x0) = Var(fn(x0)) + Bias(fn(x0))² + Var(ε0)

Implications:

- The MSE is always non-negative.
- Each element on the right side is always non-negative.
- Consequently, lowering one element (beyond some point) typically increases another.

Bias variance trade-off:

More flexibility ⇐⇒ Higher variance ⇐⇒ Lower bias

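A Monte Carlo sketch of the decomposition at a single point x0: repeatedly redraw training data from the same P0, refit, and compare the three terms to the right side of equation (13). The true function, noise level, and cubic fit below are all illustrative assumptions.

```r
# Estimate Var(fn(x0)), Bias(fn(x0))^2, and Var(eps0) by simulation.
set.seed(3)
f0 <- function(x) sin(x / 10)   # illustrative true function
x0 <- 50
sigma <- 0.5                    # sd of eps0

preds <- replicate(1000, {
  x <- runif(50, 0, 100)                      # a fresh training sample
  y <- f0(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ poly(x, 3))
  predict(fit, newdata = data.frame(x = x0))  # fn(x0) for this sample
})

var_fn <- var(preds)                 # Var(fn(x0))
bias2  <- (mean(preds) - f0(x0))^2   # Bias(fn(x0))^2
var_fn + bias2 + sigma^2             # right side of equation (13)
```

Refitting with a higher-degree polynomial raises var_fn and lowers bias2, which is the trade-off above in miniature.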

Classification

In classification, the output takes values in a discrete set (c.f. continuous values in regression).

Example: If we're trying to predict the brand of a car (based on input features), the function f0 outputs the (conditional) probabilities of each car brand (e.g. Ford, Toyota, Mercedes, etc.), e.g.

P0[Y = y | X1, X2, ..., Xp] : y ∈ {Ford, Toyota, Mercedes, etc.}    (14)

Comparisons

Regression: f0 = E0[Y | X1, X2, ..., Xp]

- A scalar value, i.e. f0 ∈ R
- fn therefore gives us estimates of y

Classification: f0 = P0[Y = y | X1, X2, ..., Xp]

- A vector value, i.e. f0 = [p1, p2, ..., pK] : pj ∈ [0, 1], Σ_{j=1}^K pj = 1
- n.b. In a binary setting this simplifies to a scalar, i.e. f0 = p1, where p1 = P0[Y = 1 | X1, X2, ..., Xp] ∈ [0, 1]
- fn therefore gives us predicted probabilities for each class

Bayes classifier

- f0 gives us a probability of the observation belonging to each class.
- To select a class, we can just pick the element of f0 = [p1, p2, ..., pK] that's the largest.
  - Called the Bayes classifier.
  - As a classifier, it produces the lowest possible error rate.

Bayes error rate:

1 − E0[max_y P0[Y = y | X1, X2, ..., Xp]]    (15)

Analogous to the irreducible error described previously.
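A sketch of the Bayes classifier in R for a binary problem where the conditional probabilities are known; the logistic form of p0 below is an illustrative assumption.

```r
# Bayes classifier: pick the class with the largest conditional probability.
p0 <- function(x) {
  p1 <- plogis(0.1 * (x - 50))   # assumed known P0[Y = 1 | x]
  cbind("0" = 1 - p1, "1" = p1)  # one column of probabilities per class
}

bayes_classify <- function(x) colnames(p0(x))[max.col(p0(x))]
bayes_error    <- function(x) mean(1 - apply(p0(x), 1, max))  # eq. (15), averaged over x

bayes_classify(c(30, 70))        # "0" "1"
bayes_error(runif(1e4, 0, 100))  # Monte Carlo estimate of the Bayes error rate
```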

Bayes classifier

Example: Classifying into 2 classes with 2 features.

The Bayes error rate is 0.1304.

Bayes classifier

Note: C(x) = argmax_y f0(y) may seem easier to estimate than f0 itself.

- It can still be hard, depending on the distribution f0, e.g.

[Figure: two two-class examples over features X1 and X2, one with Bayes error = 0.0 and one with Bayes error = 0.3.]

References

[1] ISL. Chapters 1-2.

[2] ESL. Chapters 1-2.
