
Lecture 1(i): Introduction & Roadmap

Foundations of Data Science: Algorithms and Mathematical Foundations

Mihai Cucuringu

CDT in Mathematics of Random Systems, University of Oxford

September 25, 2019

Course overview

• combines both theoretical and practical approaches
• Goal 1: understand the mathematical, statistical and algorithmic foundations behind some of the state-of-the-art algorithms for tasks in machine learning/data mining
  - organization and visualization of data clouds
  - dimensionality reduction
  - clustering
  - network analysis
  - classification
  - regression
  - ranking
• Goal 2: exposure to practical examples drawn from a wide range of topics including social network analysis, finance, statistics, etc.

Books

Textbook
• An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani, freely available at
  http://faculty.marshall.usc.edu/gareth-james/ISL/

Slides will also be made available.

• More advanced + good reference book: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
  http://www-stat.stanford.edu/ElemStatLearn

• Popular reference in ML/data mining: Pattern Recognition and Machine Learning, by Christopher Bishop.

• Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science, by Afonso S. Bandeira
  https://people.math.ethz.ch/~abandeira/TenLecturesFortyTwoProblems.pdf

Prerequisites

Probability:
• event, random variable, indicator variable
• probability mass function, probability density function, cumulative distribution function
• joint distribution, marginal distribution
• conditional probability, Bayes's rule
• independence
• expectation, variance
• uniform, exponential, binomial, Poisson, Gaussian distributions

Statistics:
• sampling from a population; mean, variance, standard deviation, median, covariance, correlation, and their sample versions; histogram
• linear regression, response and predictor variables, coefficients, residuals

Prerequisites

Linear algebra:
• vector and matrix arithmetic
• eigenvalues and eigenvectors of a matrix

Programming:
• arithmetic (scalar, vector, and matrix operations)
• writing functions
• reading in data sets from csv files
• using and manipulating data structures (subset a data frame)
• installing, loading, and using packages
• plotting

Wilson et al., Good enough practices in scientific computing, PLoS Computational Biology, 2017 Jun 22; 13(6)

A list of tentative topics

1. Introduction & syllabus
2. Statistical learning
3. Measures of correlation and dependency
4. Linear regression 1 (Simple regression)
5. Linear regression 2 (Multiple regression)
6. Singular Value Decomposition (SVD), rank-k approximation, Principal Component Analysis (PCA)
7. PCA in high dimensions and random matrix theory (Marcenko-Pastur); applications to finance
8. Basics of spectral graph theory

Nonlinear dimensionality reduction methods:
9. Diffusion Maps and Vector Diffusion Maps
10. Multidimensional scaling and ISOMAP, Locally Linear Embedding (LLE)
11. Kernel PCA

Topics

Clustering:
12. Clustering: K-means, K-medoids, hierarchical clustering
13. Spectral clustering and Cheeger's inequality
14. Constrained clustering, clustering of signed graphs and correlation clustering

Modern regression:
15. Ridge regression
16. LASSO

Estimation from pairwise measurements:
17. The PageRank algorithm
18. Ranking from pairwise incomplete noisy information
19. Group synchronization

Additional Topics (if we had more time)

1. Classification: logistic regression, support vector machines, linear discriminant analysis
2. Tree-based methods for classification and regression
3. Bagging, Boosting, Random Forests
4. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications
   • Sanjeev Arora, Elad Hazan, Satyen Kale; Theory of Computing, 2012
5. Johnson-Lindenstrauss Lemma; approximate nearest neighbors

Additional Topics (if we had more time)

Other potential topics:
7. electrical networks and random walks
8. graph sparsification and Laplacian linear system solvers
9. graph embeddings from noisy distances
10. low-rank matrix completion
11. non-negative matrix factorization
12. matrix algorithms using sampling; sketch of a large matrix

What is data science/data mining?

Given (usually very large) data sets, how do you discover structural properties and make predictions?

Two very broad categories of problems:
• Unsupervised learning: discover structure. E.g., given measurements X1, ..., Xn, learn some underlying group structure based on the pattern of similarity between pairs of points ("working blind")
• Supervised learning: make predictions. E.g., given measurements (X1, Y1), ..., (Xn, Yn), learn a model to predict Yi from Xi
• Semi-supervised learning: only for m (with m << n) of the observations do we have both predictor and response measurements

This course

• Combines both applied and theoretical perspectives, though for some of the topics the emphasis will be on the algorithms/methodology
• Aim to understand what it is that we are trying to do
• Often, it is not enough to load your data into R/Matlab/Python, use whatever packages are available, and expect to get an answer that makes sense (is reasonable)
• Understand your data! Data is often very messy, noisy and incomplete
• Be able to identify what the end goal is, and based on that (and the given data) identify what tools are available

Tradeoffs: exact versus approximate solutions

• Many problems are computationally hard to solve exactly
• In order to come up with tractable algorithms (that run in polynomial time), aim for an approximate solution
• Approximation algorithms can often perform well (sometimes they even find the exact solution!) and scale well computationally when applied to very large problems
• Polynomial-time algorithms are often not enough; some applications demand close to linear-time complexity (sometimes even sublinear!)

Bias-variance tradeoff

In supervised learning, when moving beyond the training set:
• If the model is too simple, the solution is biased and does not fit the data
• If the model is too complex, the solution is very sensitive to small changes in the data

The problem is one of simultaneously minimizing two sources of error (a small simulation sketch follows below):
• bias: the difference between the truth and what you expect to learn
  - high bias leads to missing the relevant relations between features and target outputs (underfitting)
  - bias decreases with more complex models
• variance: the difference between what you learn from a particular dataset and what you expect to learn; arises from sensitivity to small fluctuations in the training set
  - high variance leads to overfitting: modeling the random noise in the training data, rather than the intended outputs
  - variance decreases with simpler models
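
To make the tradeoff concrete, here is a minimal simulation sketch (not from the original slides): polynomials of increasing degree are fit to many noisy resamples of an assumed ground truth sin(2πx), and the squared bias and variance of the fitted curves are estimated at a grid of test points. The ground truth, noise level, and polynomial degrees are illustrative assumptions.

```python
# Minimal sketch of the bias-variance tradeoff: fit polynomials of increasing degree
# to many noisy training sets drawn from the same (assumed) ground truth, then
# estimate the squared bias and the variance of the fitted curves at test points.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # assumed ground truth for this illustration
x_test = np.linspace(0, 1, 50)
n_train, n_datasets, sigma = 30, 200, 0.3

for degree in (1, 3, 9):
    preds = np.empty((n_datasets, x_test.size))
    for b in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coef = np.polyfit(x, y, degree)          # least-squares polynomial fit
        preds[b] = np.polyval(coef, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # squared bias
    var = preds.var(axis=0).mean()                           # variance across training sets
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={var:.3f}")
```

Typically the low-degree fit shows high bias and low variance, while the high-degree fit shows the reverse.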

Interpretability versus forecasting power

• trade-off between a model that is interpretable and one that predicts well under general circumstances
• essay on the distinction between explanatory and predictive modeling:
  To Explain or to Predict?, by Galit Shmueli, Statistical Science, 2010, Vol. 25, No. 3, 289-310
  https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf

Occam's razor

• a problem-solving principle known as the 'law of parsimony'
• when faced with different competing hypotheses that predict equally well, choose the one with the fewest assumptions
• more complex models may provide better predictions, but in the absence of differences in predictive power, the fewer assumptions the better

Statistics vs. Machine Learning

• Brian D. Ripley: "machine learning is statistics minus any checking of models and assumptions"
• "Statistical Modeling: The Two Cultures", Leo Breiman, Statistical Science, 16 (3), 2001, argued that
  - statisticians rely too heavily on data modeling and assumptions
  - machine learning techniques are making progress by instead relying on the predictive accuracy of models

Statistics vs. Machine Learning

More recently, statisticians have focused more on finite-sample properties, algorithms and massive data sets (big data).

Some, still ongoing, differences between the two communities:
• Statistics papers are more formal and often come with proofs, while machine learning papers are more open to new methodologies even if the theory is lacking (for now)
• The machine learning community primarily publishes in conferences and proceedings, while statisticians use journal papers (a much slower process)
• Some statisticians still focus on areas which are well outside the scope of ML (survey design, sampling, industrial statistics, survival analysis, etc.)

Final thoughts: No universal data mining recipe book

• hard to say which methods will work best in which situations
• sometimes, it is crystal clear which method one should follow
• most often, we have little a priori intuition on which approach or set of tools we should use; we need to
  - understand the data and the task at hand first
  - understand the methods and their assumptions
  - make an educated guess on how to proceed
• often, customized tools are required to handle problem particularities (e.g., the response variable is highly imbalanced, as in default rate prediction)
• sometimes (if enough resources are available) one tries many different methods, and chooses the one which gives the best (out-of-sample) results
  - most competitions are won by ensemble methods: techniques that create multiple models and then combine them to produce improved results
  - stacking: considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model


Alzheimer’s Markers Predict Start of Mental Decline


Classification of handwritten digits

Figure: Automatic detection of handwritten postal codes.


Facebook: friend suggestions

Figure: People you may know.


Forecasting the stock market

Figure: Price of Google stock.

Source: http://businessforecastblog.com/forecasting-googles-stock-price-goog-on-20-trading-day-horizons/


Netflix: the $1,000,000 Prize

Figure: Movies you might enjoy.


eHarmony: predict whom you might fall in love with

Figure: Love is in the air (statistically speaking)


Identifying patterns in migration networks

Figure: Eigenvector colourings for the similarity matrix W_ij = M_ij^2 / (P_i P_j).

Financial time series clustering

• Clustering of the empirical correlation matrix of 1500 time series (stocks contained in the S&P 1500 index)
• Compute the bottom k = 10 eigenvectors of L, and run a standard machine learning clustering algorithm (k-means++) to recover k clusters (a generic sketch of this recipe follows below)

Figure: Left: the adjacency matrix A with rows/columns sorted in accordance with cluster membership. Right: sector decomposition of the recovered clusters (based on a standard classification of the US economy into sectors). See link for details: GICS link.

M. Cucuringu, P. Davies, A. Glielmo, H. Tyagi, SPONGE: A generalized eigenproblem for clustering signed networks, AISTATS 2019 (Python code available)
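
A minimal sketch of the generic spectral-clustering recipe mentioned above (bottom Laplacian eigenvectors followed by k-means++), run on a random stand-in for the empirical correlation matrix rather than actual S&P 1500 data. The 0.1 correlation threshold and the use of the unnormalized Laplacian are assumptions made for illustration; this is not the SPONGE method of the cited paper.

```python
# Minimal sketch, not the SPONGE algorithm from the cited paper: build a similarity
# graph from a (random, stand-in) correlation matrix, form the unnormalized Laplacian,
# take its bottom k eigenvectors and cluster the rows with k-means++.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.standard_normal((250, 60))           # 60 fake "daily returns" for 250 "stocks"
C = np.corrcoef(X)                           # empirical correlation matrix
A = np.where(C > 0.1, C, 0.0)                # keep only sufficiently positive correlations (assumed threshold)
np.fill_diagonal(A, 0.0)

D = np.diag(A.sum(axis=1))
L = D - A                                    # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)         # eigenvalues returned in ascending order
k = 10
V = eigvecs[:, :k]                           # bottom k eigenvectors as an embedding

labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(V)
print(np.bincount(labels))                   # cluster sizes
```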


Yogi Berra:

It’s tough to make predictions, especially about the future


Lecture 1(ii): Statistical Learning

Chapter 2 in "An Introduction to Statistical Learning" by James, Witten, Hastie, and Tibshirani, freely available at

http://faculty.marshall.usc.edu/gareth-james/ISL/

Advertising data set

• sales of a product in 200 different markets
• plus budgets for the product in each of those markets for three different media: TV, radio, and newspaper
• Goal: predict sales given the three media budgets
• input variables (denoted by X1, X2, ...)
  - X1: TV budget
  - X2: radio budget
  - X3: newspaper budget
• inputs are also known as predictors, independent variables, features, or simply variables
• the output variable (sales) is the response or dependent variable (denoted by Y)


Advertising data set

• observe a quantitative response Y and p different predictors, X = (X1, X2, ..., Xp)
• assume there is some relationship between Y and X = (X1, X2, ..., Xp):

  Y = f(X) + ε

• f is a fixed but unknown function
• ε is a random error term, independent of X, with mean zero

Why Estimate f?

Task I: Prediction
• Ŷ = f̂(X)
• f̂ is our estimate for f
• Ŷ is the resulting prediction
• it's OK if f̂ is a black-box algorithm
• we only care about forecasting Y

Reducible error vs. irreducible error

Reducible error:
• f̂ is an imperfect estimate for f; this leads to reducible error
• we can potentially improve the accuracy of f̂ by using more complex models to estimate f

Irreducible error:
• even if we estimate f perfectly, so that Ŷ = f(X), our prediction is still not perfect
• recall Y is also a function of ε, and ε ⊥ X
• the variability associated with ε also affects the accuracy of our predictions
• the irreducible error is most often positive (ε may contain unmeasured variables that impact Y; unmeasurable variation)

Prediction

Fix a given estimate f̂ and a set of predictors X, so that our prediction is Ŷ = f̂(X).
Consider the average/expected value of the squared error:

  E[Y − Ŷ]² = E[f(X) + ε − f̂(X)]²
            = E[(f(X) − f̂(X)) + ε]²
            = E[(f(X) − f̂(X))²] + 2 E[(f(X) − f̂(X)) ε] + E[ε²]
            = [f(X) − f̂(X)]²   +   Var[ε]
              (reducible error)    (irreducible error)

(treating f̂ and X as fixed; the cross term vanishes because ε has mean zero and is independent of X)

Our goal: develop techniques for estimating f that minimize the reducible error (a numerical sanity check follows below).
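
As a sanity check (not part of the slides), the decomposition can be verified numerically: simulate Y = f(X) + ε at a fixed query point for a fixed, deliberately imperfect estimate f̂, and compare the Monte Carlo average of (Y − Ŷ)² with [f(X) − f̂(X)]² + Var(ε). The particular f, f̂, and noise level below are assumptions made up for the example.

```python
# Numerical sanity check of E[(Y - Yhat)^2] = [f(X) - fhat(X)]^2 + Var(eps)
# at a fixed query point X and for a fixed (imperfect) estimate fhat.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                    # true regression function (assumed)
fhat = lambda x: 1.1 * x ** 2 + 0.05    # a deliberately imperfect estimate (assumed)
sigma = 0.5                             # noise standard deviation, so Var(eps) = 0.25

x0 = 1.3                                # fixed query point
eps = rng.normal(0.0, sigma, 1_000_000)
y = f(x0) + eps                         # draws of Y at X = x0

mc_error = np.mean((y - fhat(x0)) ** 2)                   # Monte Carlo E[(Y - Yhat)^2]
decomposition = (f(x0) - fhat(x0)) ** 2 + sigma ** 2      # reducible + irreducible
print(mc_error, decomposition)          # the two numbers should agree closely
```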

Task II: Inference

Want to understand how Y changes as a function of X1, ..., Xp.
Cannot treat f̂ as a black box; need to know its exact form.
• Which predictors are associated with the response?
  - most often, only a small fraction of the predictors are relevant for Y
• What is the relationship between the response and each predictor?
  - positive/negative relationship; it may also depend on the values of the other predictors
• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is this not enough?
  - most data out there is non-linear; need non-linear methods

Inference for the Advertising data set

• Which media contribute to sales?
• Which media generate the biggest boost in sales?
• How much increase in sales is associated with a given increase in TV advertising?

Prediction + inference

Real estate setting: relate the values of homes to
• crime rate, zoning, distance from a river, air quality, schools, income level of the community, size of houses

Inference:
• How do individual input variables affect the prices?
• How much extra will a house be worth if it has a view of the river?

Prediction:
• Predict the value of a home given its characteristics

Linear vs Nonlinear

• linear methods: allow for relatively simple and interpretable inference, but typically give worse predictions
• non-linear methods: can give very accurate predictions for Y, but a less interpretable model, making inference more challenging

Estimating f

• Parametric methods: Linear Regression, Model Selection, Generalized Linear Models, Mixture Models, Classification (linear, logistic, support vector machines), Graphical Models, Structured Prediction, Hidden Markov Models
• Nonparametric methods: Nonparametric Regression and Density Estimation, Nonparametric Classification, Boosting, Clustering and Dimension Reduction, PCA, Manifold Methods, Principal Curves, Spectral Methods, the Bootstrap and Subsampling, Nonparametric Bayes

Parametric Methods

1. Make an assumption about the functional form or shape of f; e.g., in a linear model, f is assumed to be linear:

   f(X) = β0 + β1 X1 + β2 X2 + ... + βp Xp

2. Fit or train the model; we only need to estimate the p + 1 coefficients βj such that

   Y ≈ β0 + β1 X1 + β2 X2 + ... + βp Xp

   - use (ordinary) least squares or generalized linear models
   - we have reduced the problem of finding f to finding p + 1 coefficients (a minimal least-squares sketch follows below)

Downside: the chosen linear model might not match the (unknown) ground-truth form of f.
If we aim for more flexible and complex models:
• we need to estimate a larger number of parameters
• this may lead to overfitting (fitting the noise too closely)
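
As a hedged illustration of step 2 (not from the slides), here is a minimal ordinary-least-squares fit on simulated data; the true coefficients and noise level are made up for the example.

```python
# Minimal sketch: fit the p+1 coefficients of a linear model Y ≈ β0 + β1 X1 + ... + βp Xp
# by ordinary least squares on simulated data (true coefficients are assumed/made up).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -0.5, 0.0])          # [β0, β1, β2, β3], assumed for the demo
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

X_design = np.column_stack([np.ones(n), X])          # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # least-squares solution
print(beta_hat)                                      # should be close to beta_true
```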


Income data set

Figure: The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.


Linear model

income ≈ β0 + β1 × education + β2 × seniority

Linear model for the Income data

income ≈ β0 + β1 × education + β2 × seniority

• True f has some curvature not captured by the linear fit
• OK for capturing the positive relationship between years of education/seniority and income

Non-parametric Methods

• no explicit assumptions about the functional form of f (assumption-free approach)
• search for an estimate of f that best fits the data, subject to some smoothness assumptions
• +: can fit a much wider range of possible functions f
• −: a much larger number of observations is needed (compared to parametric approaches)

Example: splines with varying levels of smoothness (allowing for a rougher/tighter fit); a minimal smoothing-spline sketch follows below.
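
To make the example concrete, here is a minimal sketch (not from the slides) using scipy's UnivariateSpline on simulated data; the sin(x) ground truth, noise level, and smoothing factors are assumptions for illustration.

```python
# Minimal sketch of splines with varying smoothness on simulated 1-D data.
# The smoothing factor s is the knob: s = 0 interpolates the points exactly
# (a "rough" spline that overfits), larger s gives a smoother fit.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 40))
y = np.sin(x) + rng.normal(0, 0.3, x.size)            # noisy observations of an assumed truth

rough = UnivariateSpline(x, y, s=0)                   # passes through every data point
smooth = UnivariateSpline(x, y, s=len(x) * 0.3 ** 2)  # smoothing roughly matched to the noise level

x_grid = np.linspace(0, 10, 200)
print(np.mean((rough(x_grid) - np.sin(x_grid)) ** 2),
      np.mean((smooth(x_grid) - np.sin(x_grid)) ** 2))  # the smooth fit is usually closer to the truth
```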


Smooth splines


Rough splines

Figure: Perfect fit on the given data!


Rough splines (cont.)

Figure: ... but far from the ground truth f (right plot)!

This is a good example of overfitting: the fit does not produce accurate estimates of the response on new observations that were not part of the original training data set.

The Trade-Off Between Prediction Accuracy and Model Interpretability

• linear regression: fairly inflexible
• splines: considerably more flexible (can fit a much wider range of possible shapes to estimate f)

Inference:
• restrictive models are much more interpretable
• linear model: easy to understand the relationship between Y and X1, X2, ..., Xp

Very flexible approaches (splines, boosting, etc.):
• can lead to very complicated estimates of f
• hard to understand how any individual predictor is associated with the response

LASSO:
• less flexible
• a linear model plus sparsity of [β0, β1, ..., βp]
• more interpretable; only a small subset of predictors matter (a small sketch follows below)
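
A minimal sketch (not from the slides) of the sparsity property just mentioned: on simulated data with only a few truly relevant predictors, the LASSO drives most coefficients to exactly zero. The data-generating coefficients and penalty strength are assumptions for illustration.

```python
# Minimal sketch: LASSO keeps only a small subset of predictors, which aids interpretability.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]          # only 3 truly relevant predictors (assumed)
y = X @ beta + rng.normal(0, 0.5, n)

lasso = Lasso(alpha=0.1).fit(X, y)          # alpha controls the strength of the sparsity penalty
print(np.flatnonzero(lasso.coef_))          # indices of the predictors kept by the fit
```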

Flexibility vs. Interpretability

Generalized additive models (GAMs):
• extend the linear model to allow for certain non-linear relationships:

  yi = β0 + f1(xi,1) + f2(xi,2) + ... + fp(xi,p) + εi

• more flexible than linear regression
• less interpretable than linear regression (the relationship between each predictor and the response is modeled by a curve)
• main limitation: the model is restricted to be additive (with many variables, important interactions can be missed)

Fully non-linear methods: bagging, boosting, and support vector machines with non-linear kernels
• highly flexible
• harder to interpret

(a small additive-model fitting sketch follows below)
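
One classical way to fit an additive model of this form is backfitting: cycle through the predictors, smoothing the partial residuals against each one in turn. Below is a toy sketch of that idea (not a polished GAM fit and not from the slides), using smoothing splines as the one-dimensional smoothers; the data-generating functions, noise level, and smoothing factors are assumptions.

```python
# Toy backfitting sketch for an additive model y ≈ β0 + f1(x1) + f2(x2),
# with smoothing splines as the univariate smoothers (illustrative only).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = 2.0 + np.sin(2 * np.pi * x1) + 4 * (x2 - 0.5) ** 2 + rng.normal(0, 0.2, n)  # assumed truth

beta0 = y.mean()                       # intercept; the fitted f's are kept centred
f1_vals = np.zeros(n)
f2_vals = np.zeros(n)
for _ in range(20):                    # backfitting iterations
    # update f1 by smoothing the partial residuals against x1, then centre it
    r1 = y - beta0 - f2_vals
    o1 = np.argsort(x1)
    s1 = UnivariateSpline(x1[o1], r1[o1], s=n * 0.2 ** 2)
    f1_vals = s1(x1) - s1(x1).mean()
    # update f2 the same way against x2
    r2 = y - beta0 - f1_vals
    o2 = np.argsort(x2)
    s2 = UnivariateSpline(x2[o2], r2[o2], s=n * 0.2 ** 2)
    f2_vals = s2(x2) - s2(x2).mean()

resid = y - (beta0 + f1_vals + f2_vals)
print(np.mean(resid ** 2))             # in-sample MSE of the additive fit
```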


Flexibility vs. Interpretability

For inference:
• better off using simple and relatively inflexible statistical learning methods

For prediction:
• interpretability is not an issue
• shall we use the most flexible model available?
• surprisingly, this is not always the case!
• sometimes we get more accurate (out-of-sample) predictions using a less flexible method
• overfitting in highly flexible methods is a real concern/danger

Supervised Versus Unsupervised Learning

• supervised: there exists a response Y
• unsupervised: there is NO response Y (e.g., clustering)
• semi-supervised learning: some of the observations have an associated response Y, but some do not

Personal note:
• unsupervised learning: a means to an end!
• in many/most applications, the ultimate goal is prediction
• augment the unsupervised learning results (clustering, non-linear dimensionality reduction) with a prediction step (even a simple linear regression)

Performance evaluation

• mean squared error (MSE):

  MSE = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))²

• f̂(xi) is the prediction of f̂ at observation i
• training MSE: computed on the data used to fit the model
• test data: we care most/only about out-of-sample performance
• test MSE:

  Ave (y0 − f̂(x0))²

  averaged over test observations (x0, y0) we have not seen yet

(a small train-vs-test MSE sketch follows below)
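
A minimal sketch (not from the slides) comparing training and test MSE for fits of increasing flexibility on simulated data; the ground truth, noise level, and polynomial degrees are assumptions for illustration. As flexibility grows, the training MSE keeps dropping while the test MSE typically traces out the U-shape discussed next.

```python
# Minimal sketch: training MSE vs. test MSE for polynomial fits of increasing flexibility.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                  # assumed ground truth
x_tr, x_te = rng.uniform(0, 1, 40), rng.uniform(0, 1, 200)
y_tr = f(x_tr) + rng.normal(0, 0.3, x_tr.size)
y_te = f(x_te) + rng.normal(0, 0.3, x_te.size)

for degree in (1, 3, 6, 10):
    coef = np.polyfit(x_tr, y_tr, degree)            # fit on the training data only
    mse_train = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    mse_test = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree={degree}: train MSE={mse_train:.3f}, test MSE={mse_test:.3f}")
```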


Flexibility vs. MSE (non-linear)

Horizontal dashed line indicates Var(ε), the irreducible error.


Flexibility vs. MSE (linear)

Remarks

• test MSE >> train MSE
• U-shape in the test MSE as a function of model flexibility
• as model flexibility increases, the training MSE will decrease, but the test MSE may not
• overfitting: training MSE <<< test MSE
• overfitting: fitting the noise rather than properties of f
• overfitting: when a less flexible model would have yielded a smaller test MSE
• cross-validation: estimating the test MSE using the training data (a minimal sketch follows below)
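
A minimal sketch (not from the slides) of estimating the test MSE by 5-fold cross-validation, using only the training data; the simulated data and the choice of a plain linear model are assumptions for illustration.

```python
# Minimal sketch: 5-fold cross-validated estimate of the test MSE from training data alone.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 150)   # assumed simulated data

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(-scores.mean())          # cross-validated estimate of the test MSE
```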

R data sets (textbook)

• The UC Irvine Machine Learning Data Set Repository: http://archive.ics.uci.edu/ml/
• S&P 500/1500: daily open and close prices for 2003-2019
• Networks data sets: directed, signed, temporal, multilayer