
Lecture #6: Dimension Reduction
CS 109A, STAT 121A, AC 209A: Data Science

Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave

Lecture Outline

Regularization Review

More on Interaction Terms

High Dimensionality

Principal Components Analysis (PCA)

PCA for Regression (PCR)

PCA vs Variable Selection


Regularization Review


Overfitting and Regularization


Regularization: An Overview

The idea of regularization revolves around modifying the loss function L; in particular, we add a regularization term that penalizes some specified properties of the model parameters:

$$L_{\text{reg}}(\beta) = L(\beta) + \lambda R(\beta),$$

where λ is a scalar that gives the weight (or importance) of the regularization term.

Fitting the model using the modified loss function $L_{\text{reg}}$ would result in model parameters with desirable properties (specified by R).


Bias-Variance Trade-off and Regularization


LASSO Regression

Since we wish to discourage extreme values in the model parameters, we need to choose a regularization term that penalizes parameter magnitudes. For our loss function, we will again use the MSE.

Together our regularized loss function is

$$L_{\text{LASSO}}(\beta) = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \beta^{\top} x_i \right|^2 + \lambda \sum_{j=1}^{J} |\beta_j|.$$

Note that $\sum_{j=1}^{J} |\beta_j|$ is the $\ell_1$ norm of the vector β:

$$\sum_{j=1}^{J} |\beta_j| = \|\beta\|_1.$$

Hence, we often say that $L_{\text{LASSO}}$ is the loss function for $\ell_1$ regularization.

Finding the model parameters $\beta_{\text{LASSO}}$ that minimize the $\ell_1$-regularized loss function is called LASSO regression.
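To make this concrete, here is a minimal sketch of LASSO regression in scikit-learn on synthetic data (the data, the value of λ, and the variable names are illustrative; scikit-learn calls the regularization weight `alpha`):

```python
# A minimal LASSO sketch on synthetic data; alpha plays the role of lambda.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first three predictors actually drive the response.
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print("estimated coefficients:", np.round(lasso.coef_, 2))
print("coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
```

Note how several coefficients are driven exactly to zero; this sparsity is revisited below when LASSO is used for variable selection.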


Ridge Regression

Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes.

Then, our regularized loss function is

$$L_{\text{Ridge}}(\beta) = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \beta^{\top} x_i \right|^2 + \lambda \sum_{j=1}^{J} \beta_j^2.$$

Note that $\sum_{j=1}^{J} \beta_j^2$ is related to the $\ell_2$ norm of β:

$$\sum_{j=1}^{J} \beta_j^2 = \|\beta\|_2^2.$$

Hence, we often say that $L_{\text{Ridge}}$ is the loss function for $\ell_2$ regularization.

Finding the model parameters $\beta_{\text{Ridge}}$ that minimize the $\ell_2$-regularized loss function is called ridge regression.
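For comparison, a ridge fit on the same kind of synthetic data (again, the data and the value of λ are illustrative) shrinks the coefficients toward zero but typically leaves none of them exactly zero:

```python
# A minimal ridge sketch; compare the coefficients to the LASSO sketch above.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)   # alpha is the regularization weight lambda
print("estimated coefficients:", np.round(ridge.coef_, 2))
print("coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)))
```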


Choosing λ

In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β:

1. If λ is close to zero, we recover the MSE; i.e., ridge and LASSO regression are just ordinary regression.

2. If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant, and the regularization term will force $\beta_{\text{Ridge}}$ and $\beta_{\text{LASSO}}$ to be close to zero.

To avoid ad hoc choices, we should select λ using cross-validation.
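As a sketch of how this might look in practice (synthetic data; the grid of candidate λ values is arbitrary), scikit-learn's LassoCV and RidgeCV run this cross-validation search directly:

```python
# Choosing lambda (alpha) by k-fold cross-validation over a grid of candidates.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

alphas = np.logspace(-4, 2, 50)                    # candidate lambda values
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("lambda chosen for LASSO:", lasso_cv.alpha_)
print("lambda chosen for ridge:", ridge_cv.alpha_)
```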


Ridge - Computational complexity

Solution to ridge regression:

$$\beta = (X^{\top}X + \lambda I)^{-1} X^{\top} Y$$

Solving the ridge/LASSO regression involves three steps:

1. Select λ.

2. Find the minimum of the ridge/LASSO regression loss function (using linear algebra), as with multiple regression, and record the R² on the test set.

3. Find the λ that gives the largest R².
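Here is a rough sketch of these three steps in NumPy, using the closed-form ridge solution above and a held-out test set to compare λ values (the data are synthetic and the λ grid is arbitrary):

```python
# Closed-form ridge solution beta = (X^T X + lambda I)^{-1} X^T y,
# evaluated over a grid of lambda values on a held-out test set.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=n)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

def ridge_beta(X, y, lam):
    """Closed-form ridge estimate for a given lambda."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def r2(y_true, y_pred):
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

lambdas = np.logspace(-3, 3, 20)
scores = [r2(y_test, X_test @ ridge_beta(X_train, y_train, lam)) for lam in lambdas]
print("lambda with the largest test R^2:", lambdas[int(np.argmax(scores))])
```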


Variable Selection as Regularization

Since LASSO regression tends to produce zero estimates for a number of model parameters (we say that LASSO solutions are sparse), we consider LASSO to be a method for variable selection.

Many prefer using LASSO for variable selection (as well as for suppressing extreme parameter values) rather than stepwise selection, as LASSO avoids the statistical problems that arise in stepwise selection.


Lasso vs. Ridge


Ridge Regularization


Lasso Regularization


More on Interaction Terms


Interaction Terms: A Review

Recall that an interaction term between predictors X1 and X2 can be incorporated into a regression model by including the multiplicative (i.e., cross) term in the model, for example

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \epsilon.$$

Example

Suppose X1 is a binary predictor indicating whether a NYC ride pickup is a taxi or an Uber, X2 is the time of day of the pickup, and Y is the length of the ride.

What is the interpretation of β3?
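As a hedged illustration (the variable names, coefficients, and data below are invented, not the actual NYC dataset), an interaction term like this can be fit with a statsmodels formula, where `x1 * x2` expands to both main effects plus their interaction:

```python
# Fitting Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2) + eps on synthetic ride data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "is_uber": rng.integers(0, 2, size=n),        # binary predictor X1
    "pickup_hour": rng.uniform(0, 24, size=n),    # continuous predictor X2
})
# The effect of pickup hour on ride length differs for Uber vs. taxi rides.
df["ride_length"] = (10 + 2 * df["is_uber"] + 0.5 * df["pickup_hour"]
                     + 0.8 * df["is_uber"] * df["pickup_hour"]
                     + rng.normal(scale=2.0, size=n))

model = smf.ols("ride_length ~ is_uber * pickup_hour", data=df).fit()
print(model.params)   # the is_uber:pickup_hour coefficient estimates beta_3
```

Here β3 is the difference in the per-hour effect on ride length between Uber and taxi pickups.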


Including Interaction Terms in Models

Recall that to avoid overfitting, we sometimes elect to exclude a number of terms in a linear model.

It is standard practice to always include the main effects in the model. That is, we always include the terms involving only one predictor: $\beta_1 X_1$, $\beta_2 X_2$, etc.

Question: Why are the main effects important?

Question: In what type of model would it make sense to include the interaction term without one of the main effects?


How Many Interaction Terms?

Example

Our NYC taxi and Uber dataset has 1.1 billion taxi trips and 19 million Uber trips. Each trip is described by p = 23 predictors (and 1 response variable). How many interaction terms are there?

▶ Two-way interactions: $\binom{p}{2} =$

▶ Three-way interactions: $\binom{p}{3} =$

▶ Etc.

The total number of interaction terms (including main effects) is

$$\sum_{k=1}^{p} \binom{p}{k} = 2^p - 1 \approx 8.3 \text{ million}.$$
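For reference, these counts are easy to compute directly (using p = 23 as in the example above):

```python
# Counting interaction terms for p = 23 predictors.
from math import comb

p = 23
print("two-way interactions:  ", comb(p, 2))
print("three-way interactions:", comb(p, 3))
print("all terms up to order p:", sum(comb(p, k) for k in range(1, p + 1)))  # 2**p - 1
```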

What is the problem with building a model that includes all possible interaction terms?


Model Unidentifiability

Previously, we had been using samples of 100k observations from the dataset to build our models. If we include all possible interaction terms, our model will have 8.3 million parameters. We will not be able to uniquely determine 8.3 million parameters with only 100k observations. In this case, we call the model unidentifiable.

In practice, we can:

▶ increase the number of observations

▶ consider only scientifically important interaction terms

▶ perform variable selection

▶ perform another dimensionality reduction technique, like PCA


High Dimensionality


When Does High Dimensionality Occur?

The problem of high dimensionality can occur when the number of parameters exceeds or is close to the number of observations. This can occur when we consider lots of interaction terms, like in our previous example. But this can also happen when the number of main effects is high.

For example:

▶ When we are performing polynomial regression with a high degree and a large number of predictors.

▶ When the predictors are genomic markers in a computational biology problem.

▶ When the predictors are the counts of all English words appearing in a text.


A Framework For Dimensionality Reduction

One way to reduce the dimensions of the feature space is to create a new, smaller set of predictors by taking linear combinations of the original predictors.

We choose $Z_1, Z_2, \ldots, Z_m$, where m < p and where each $Z_i$ is a linear combination of the original p predictors:

$$Z_i = \sum_{j=1}^{p} \phi_{ji} X_j$$

for fixed constants $\phi_{ji}$. Then we can build a linear regression model using the new predictors:

$$Y = \beta_0 + \beta_1 Z_1 + \cdots + \beta_m Z_m + \epsilon.$$

Notice that this model has a smaller number (m < p) of parameters.
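A bare-bones sketch of this framework is below; here the coefficients φ_ji are drawn at random purely to show the shape of the computation (PCA, introduced next, chooses them in a principled way):

```python
# Build m new predictors as fixed linear combinations of the p originals,
# then fit a regression on the reduced design matrix Z.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p, m = 100, 20, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

Phi = rng.normal(size=(p, m))       # columns hold the coefficients phi_ji
Z = X @ Phi                         # n x m matrix of new predictors Z_1, ..., Z_m
fit = LinearRegression().fit(Z, y)
print("number of fitted slope parameters:", fit.coef_.size)   # m instead of p
```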


A Framework For Dimensionality Reduction

A method of dimensionality reduction includes 2 steps:

1. Determine an optimal set of new predictors $Z_1, \ldots, Z_m$, for m < p.

2. Express each observation in the data in terms of these new predictors. The transformed data will have m columns rather than p.

Thereafter, we can fit a model using the new predictors.

The method for determining the set of new predictors (what do we mean by an optimal predictor set?) can differ according to application. We will explore a way to create new predictors that capture the variation in the observed data.


Principal Components Analysis (PCA)


Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the ‘maximum amount’ of variance in the observed data.


Principal Components Analysis (PCA)

Definition: Principal Components Analysis (PCA) produces a list of p principal components $(Z_1, \ldots, Z_p)$ such that

▶ Each $Z_i$ is a linear combination of the original predictors, and its vector norm is 1.

▶ The $Z_i$'s are pairwise orthogonal.

▶ The $Z_i$'s are ordered by decreasing amount of captured observed variance. That is, the observed data shows more variance in the direction of $Z_1$ than in the direction of $Z_2$.

To perform dimensionality reduction, we select the top m principal components as our new predictors and express our observed data in terms of these predictors.
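A minimal sketch of PCA in scikit-learn on synthetic data (the data are invented; in practice you would also standardize predictors measured on different scales):

```python
# PCA: components are unit-norm, pairwise orthogonal, and ordered by captured variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]     # make two predictors strongly correlated

pca = PCA()                                  # keep all p components for inspection
Z = pca.fit_transform(X)                     # the data expressed in the new basis
print("variance captured per component:", np.round(pca.explained_variance_ratio_, 3))
print("components orthonormal:",
      np.allclose(pca.components_ @ pca.components_.T, np.eye(p)))
```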


The Intuition Behind PCA

The top PCA components capture the largest amount of variation (the interesting features) of the data.

Each component is a linear combination of the original predictors; we visualize them as vectors in the feature space.


The Intuition Behind PCA

Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.


The Math behind PCA

PCA is a well-known result from linear algebra. Let Z be the n × p matrix whose columns are $Z_1, \ldots, Z_p$ (the resulting PCA vectors), let X be the n × p matrix of the original data variables $X_1, \ldots, X_p$ (each re-centered to have mean zero, and without the intercept), and let W be the p × p matrix whose columns are the eigenvectors of the square matrix $X^{\top}X$. Then

$$Z_{n \times p} = X_{n \times p} W_{p \times p}.$$


Implementation of PCA using linear algebra

To implement PCA yourself using this linear algebra result, you can perform the following steps:

1. Subtract off the mean for each of your predictors (so they each have mean zero).

2. Calculate the eigenvectors of the $X^{\top}X$ matrix and form the matrix W whose columns are those eigenvectors, ordered from largest to smallest eigenvalue.

3. Use matrix multiplication to determine Z = XW.

Note: this is not efficient from a computational perspective; it can be sped up using a Cholesky decomposition. However, PCA is easy to perform in Python using the decomposition.PCA class in the sklearn package.
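Below is a rough sketch of the three steps above in NumPy, checked against scikit-learn (production code typically uses the SVD of X rather than an eigendecomposition of $X^{\top}X$):

```python
# Manual PCA: center X, eigendecompose X^T X, project with Z = X W.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4))

Xc = X - X.mean(axis=0)                       # step 1: center each predictor
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # step 2: eigenvectors of X^T X
order = np.argsort(eigvals)[::-1]             #         sort by decreasing eigenvalue
W = eigvecs[:, order]
Z = Xc @ W                                    # step 3: Z = X W

# Agrees with sklearn's PCA up to the sign of each component.
Z_sklearn = PCA().fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))
```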


PCA for Regression (PCR)


Using PCA for Regression

PCA is easy to use in Python, so how do we then use it for regression modeling in a real-life problem?

If we use all p of the new $Z_j$, then we have not improved the dimensionality. Instead, we select the first M PCA variables, $Z_1, \ldots, Z_M$, to use as predictors in a regression model.

The choice of M is important and can vary from application to application. It depends on various things, like how collinear the predictors are, how truly related they are to the response, etc.

What would be the best way to check for a specified problem? Train and test!
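A hedged sketch of this train/test comparison (synthetic data; the candidate values of M are arbitrary):

```python
# Principal components regression: PCA followed by linear regression,
# with M chosen by comparing test-set R^2 across candidate values.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
n, p = 300, 15
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for M in (1, 3, 5, 10, 15):
    pcr = make_pipeline(PCA(n_components=M), LinearRegression())
    pcr.fit(X_train, y_train)
    print(f"M = {M:2d}  test R^2 = {pcr.score(X_test, y_test):.3f}")
```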


PCA vs Variable Selection

