Lecture #6: Dimension Reduction
CS 109A, STAT 121A, AC 209A: Data Science
Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave



Lecture Outline

Regularization Review

More on Interaction Terms

High Dimensionality

Principal Components Analysis (PCA)

PCA for Regression (PCR)

PCA vs Variable Selection


Regularization Review


Overfitting and Regularization


Regularization: An Overview

The idea of regularization revolves around modifying the loss function L; in particular, we add a regularization term that penalizes some specified properties of the model parameters:

L_reg(β) = L(β) + λR(β),

where λ is a scalar that gives the weight (or importance) of the regularization term.

Fitting the model using the modified loss function L_reg results in model parameters with desirable properties (specified by R).
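As a concrete illustration, here is a minimal NumPy sketch of a loss of this form, with an MSE data-fit term and a user-supplied penalty R; the function and variable names are our own, not part of the lecture:

```python
import numpy as np

def regularized_loss(beta, X, y, lam, penalty):
    """Generic regularized loss: MSE data-fit term plus lam * penalty(beta)."""
    residuals = y - X @ beta
    return np.mean(residuals ** 2) + lam * penalty(beta)

# Two penalties R(beta) used in the next slides:
l1_penalty = lambda b: np.sum(np.abs(b))   # LASSO
l2_penalty = lambda b: np.sum(b ** 2)      # ridge
```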


Bias-Variance Trade-off and Regularization


LASSO Regression

Since we wish to discourage extreme values in the model parameters, we need to choose a regularization term that penalizes parameter magnitudes. For our loss function, we will again use the MSE.

Together, our regularized loss function is

L_LASSO(β) = (1/n) Σ_{i=1}^{n} |y_i − β⊤x_i|² + λ Σ_{j=1}^{J} |β_j|.

Note that Σ_{j=1}^{J} |β_j| is the ℓ1 norm of the vector β:

Σ_{j=1}^{J} |β_j| = ∥β∥₁.

Hence, we often say that L_LASSO is the loss function for ℓ1 regularization.

Finding the model parameters β_LASSO that minimize the ℓ1-regularized loss function is called LASSO regression.
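To make this concrete, here is a minimal sketch of LASSO regression with scikit-learn on synthetic data (the value of alpha, sklearn's name for λ, is purely illustrative; note that sklearn's Lasso scales the data-fit term by 1/(2n) rather than 1/n):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # 100 observations, 10 predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)    # only two predictors truly matter

lasso = Lasso(alpha=0.1)   # alpha plays the role of the regularization weight lambda
lasso.fit(X, y)

print(lasso.coef_)         # many coefficients are driven exactly to zero
```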


Ridge Regression

Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes.

Then, our regularized loss function is

L_Ridge(β) = (1/n) Σ_{i=1}^{n} |y_i − β⊤x_i|² + λ Σ_{j=1}^{J} β_j².

Note that Σ_{j=1}^{J} β_j² is related to the ℓ2 norm of β:

Σ_{j=1}^{J} β_j² = ∥β∥₂².

Hence, we often say that L_Ridge is the loss function for ℓ2 regularization.

Finding the model parameters β_Ridge that minimize the ℓ2-regularized loss function is called ridge regression.
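Analogously, a minimal ridge regression sketch with scikit-learn (again on synthetic data with an illustrative alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

ridge = Ridge(alpha=1.0)   # alpha plays the role of lambda
ridge.fit(X, y)

print(ridge.coef_)         # coefficients are shrunk toward zero, but rarely exactly zero
```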


Choosing λ

In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β:

1. If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression are just ordinary regression.

2. If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β_Ridge and β_LASSO to be close to zero.

To avoid ad-hoc choices, we should select λ using cross-validation.
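In practice, scikit-learn's cross-validated estimators can do this selection for us; the grid of candidate λ values below is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

alphas = np.logspace(-3, 2, 50)   # candidate values of lambda (alpha)

lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("best lambda for LASSO:", lasso_cv.alpha_)
print("best lambda for ridge:", ridge_cv.alpha_)
```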


Ridge - Computational complexity

Solution to ridge regression:

β = (X⊤X + λI)⁻¹ X⊤Y

The solution of the ridge/LASSO regression involves three steps:

1. Select λ.

2. Find the minimum of the ridge/LASSO regression loss function (using linear algebra, as with multiple regression) and record the R² on the test set.

3. Find the λ that gives the largest R².
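As a sketch (not the course's code), the closed-form ridge solution above can be computed directly with NumPy; here we use centered data and no intercept so the formula applies as written:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

# Center so that no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 1.0
p = Xc.shape[1]

# beta = (X^T X + lambda * I)^(-1) X^T y; np.linalg.solve avoids forming the inverse explicitly
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
print(beta_ridge)
```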


Variable Selection as Regularization

Since LASSO regression tends to produce zero estimates for a number of model parameters (we say that LASSO solutions are sparse), we consider LASSO to be a method for variable selection.

Many prefer using LASSO for variable selection (as well as for suppressing extreme parameter values) rather than stepwise selection, as LASSO avoids the statistical problems that arise in stepwise selection.
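A quick illustration of this sparsity on synthetic data (the alpha value is again an arbitrary choice for the sketch):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 of the 30 predictors matter

coef = Lasso(alpha=0.1).fit(X, y).coef_
print("predictors selected by LASSO:", np.flatnonzero(coef))   # most coefficients are exactly zero
```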


Lasso vs. Ridge


Ridge Regularization


Lasso Regularization


More on Interaction Terms


Interaction Terms: A Review

Recall that an interaction term between predictors X1 and X2 can be incorporated into a regression model by including the multiplicative (i.e. cross) term in the model, for example

Y = β0 + β1X1 + β2X2 + β3(X1 · X2) + ϵ.

Example

Suppose X1 is a binary predictor indicating whether a NYC ride pickup is a taxi or an Uber, X2 is the time of day of the pickup, and Y is the length of the ride.

What is the interpretation of β3?
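As a sketch of how such a cross term can be built in practice, here is one way with scikit-learn's PolynomialFeatures on synthetic data (the variables and coefficients are hypothetical stand-ins for the taxi/Uber example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=500)     # hypothetical: 0 = taxi, 1 = Uber
x2 = rng.uniform(0, 24, size=500)     # hypothetical: pickup time of day (hours)
y = 10 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=500)   # synthetic ride length

X = np.column_stack([x1, x2])
# interaction_only=True adds the X1*X2 cross term but not X1^2 or X2^2
X_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)

model = LinearRegression().fit(X_int, y)
print(model.coef_)   # [beta1, beta2, beta3]; beta3 is the coefficient of X1*X2
```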


Including Interaction Terms in Models

Recall that to avoid overfitting, we sometimes elect to exclude a number of terms in a linear model.

It is standard practice to always include the main effects in the model. That is, we always include the terms involving only one predictor, β1X1, β2X2, etc.

Question: Why are the main effects important?

Question: In what type of model would it make sense to include the interaction term without one of the main effects?


How Many Interaction Terms?

Example

Our NYC taxi and Uber dataset has 1.1 billion taxi trips and 19 million Uber trips. Each trip is described by p = 23 predictors (and 1 response variable). How many interaction terms are there?

▶ Two-way interactions: (p choose 2) = ?

▶ Three-way interactions: (p choose 3) = ?

▶ Etc.

The total number of interaction terms (including the main effects) is

Σ_{k=1}^{p} (p choose k) = 2^p − 1 ≈ 8.3 million.

What is the problem with building a model that includes all possible interaction terms?
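These counts are easy to verify (a quick sketch using scipy's binomial coefficient function):

```python
from scipy.special import comb

p = 23
print(comb(p, 2, exact=True))   # two-way interactions: 253
print(comb(p, 3, exact=True))   # three-way interactions: 1771
print(2 ** p - 1)               # all main effects and interactions: 8,388,607 (about 8.3 million)
```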


Model Unidentifiability

Previously, we had been using samples of 100k observations from the dataset to build our models. If we include all possible interaction terms, our model will have 8.3 million parameters. We will not be able to uniquely determine 8.3 million parameters with only 100k observations. In this case, we call the model unidentifiable.

In practice, we can:

▶ increase the number of observations

▶ consider only scientifically important interaction terms

▶ perform variable selection

▶ perform another dimensionality reduction technique, like PCA


High Dimensionality


When Does High Dimensionality Occur?

The problem of high dimensionality can occur when the number of parameters exceeds or is close to the number of observations. This can occur when we consider lots of interaction terms, like in our previous example. But this can also happen when the number of main effects is high.

For example:

▶ When we are performing polynomial regression with a high degree and a large number of predictors.

▶ When the predictors are genomic markers in a computational biology problem.

▶ When the predictors are the counts of all English words appearing in a text.


A Framework For Dimensionality Reduction

One way to reduce the dimensions of the feature space is to create a new, smaller set of predictors by taking linear combinations of the original predictors.

We choose Z1, Z2, . . . , Zm, where m < p and where each Zi is a linear combination of the original p predictors:

Z_i = Σ_{j=1}^{p} ϕ_{ji} X_j

for fixed constants ϕ_{ji}. Then we can build a linear regression model using the new predictors

Y = β0 + β1Z1 + . . . + βmZm + ϵ.

Notice that this model has a smaller number (m < p) of parameters.
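A minimal sketch of this framework, using an arbitrary (here random) weight matrix for the constants ϕ_ji; PCA, introduced next, is one principled way to choose these weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, m = 200, 10, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

Phi = rng.normal(size=(p, m))   # fixed constants phi_ji (arbitrary here, just for illustration)
Z = X @ Phi                     # new predictors Z_1, ..., Z_m

model = LinearRegression().fit(Z, y)   # the regression now has only m slope parameters
print(model.coef_.shape)               # (3,)
```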


A Framework For Dimensionality Reduction

A method of dimensionality reduction includes two steps:

1. Determine an optimal set of new predictors Z1, . . . , Zm, for m < p.

2. Express each observation in the data in terms of these new predictors. The transformed data will have m columns rather than p.

Thereafter, we can fit a model using the new predictors.

The method for determining the set of new predictors (what we mean by an 'optimal' set of predictors) can differ according to the application. We will explore a way to create new predictors that captures the variation in the observed data.


Principal Components Analysis (PCA)


Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the 'maximum amount' of variance in the observed data.


Principal Components Analysis (PCA)

Definition

Principal Components Analysis (PCA) produces a list of p principal components (Z1, . . . , Zp) such that

▶ Each Zi is a linear combination of the original predictors, and its vector norm is 1.

▶ The Zi's are pairwise orthogonal.

▶ The Zi's are ordered in decreasing order of the amount of observed variance they capture. That is, the observed data show more variance in the direction of Z1 than in the direction of Z2.

To perform dimensionality reduction, we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.


The Intuition Behind PCA

The top PCA components capture the largest amount of variation (the interesting features) of the data.

Each component is a linear combination of the original predictors; we visualize them as vectors in the feature space.


The Intuition Behind PCA

Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.


The Math behind PCA

PCA is a well-known result from linear algebra. Let Z be the n × p matrix consisting of columns Z1, ..., Zp (the resulting PCA vectors), let X be the n × p matrix of the original data variables X1, ..., Xp (each re-centered to have mean zero, and without the intercept), and let W be the p × p matrix whose columns are the eigenvectors of the square matrix X⊤X. Then

Z_{n×p} = X_{n×p} W_{p×p}


Implementation of PCA using linear algebra

To implement PCA yourself using this linear algebra result, you can perform the following steps:

1. Subtract off the mean of each of your predictors (so they each have mean zero).

2. Calculate the eigenvectors of the X⊤X matrix and form the matrix W whose columns are those eigenvectors, ordered from largest to smallest eigenvalue.

3. Use matrix multiplication to determine Z = XW.

Note: this is not efficient from a computational perspective; it can be sped up using the Cholesky decomposition. However, PCA is easy to perform in Python using the PCA class in sklearn.decomposition.
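Here is a minimal sketch of both routes on a small synthetic matrix; note that eigenvectors are only defined up to sign, so the manual and sklearn components can differ by factors of −1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Manual PCA via the eigenvectors of X^T X
Xc = X - X.mean(axis=0)                   # step 1: center each predictor
eigvals, W = np.linalg.eigh(Xc.T @ Xc)    # step 2: eigendecomposition (eigh returns ascending order)
W = W[:, np.argsort(eigvals)[::-1]]       #         reorder columns from largest to smallest eigenvalue
Z_manual = Xc @ W                         # step 3: Z = XW

# The same transformation with sklearn
Z_sklearn = PCA(n_components=5).fit_transform(X)

print(np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))   # columns agree up to sign
```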


PCA for Regression (PCR)


Using PCA for Regression

PCA is easy to use in Python, so how do we then use it for regression modeling in a real-life problem?

If we use all p of the new Zj, then we have not improved the dimensionality. Instead, we select the first M PCA variables, Z1, ..., ZM, to use as predictors in a regression model.

The choice of M is important and can vary from application to application. It depends on various things, like how collinear the predictors are, how truly related they are to the response, etc.

What would be the best way to check for a specified problem? Train and test!!!
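A hedged sketch of principal components regression with a train/test split on synthetic data (the choice M = 3 is purely illustrative and would normally be compared against other values):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCR: standardize, keep the top M principal components, then fit ordinary regression on them
M = 3
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X_train, y_train)

print("test R^2:", pcr.score(X_test, y_test))   # repeat for different M and compare
```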


PCA vs Variable Selection

