data mining lectures lectures 6/7: regression padhraic smyth, uc irvine ics 278: data mining...

38
Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department of Information and Computer Science University of California, Irvine

Post on 21-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

ICS 278: Data Mining

Lectures 6, 7: Regression Algorithms

Padhraic SmythDepartment of Information and Computer

ScienceUniversity of California, Irvine

Page 2: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Notation

• Variables X, Y….. with values x, y (lower case)• Vectors indicated by X

• Components of X indicated by Xj with values xj

• “Matrix” data set D with n rows and p columns– jth column contains values for variable Xj – ith row contains a vector of measurements on object i, indicated

by x(i)– The jth measurement value for the ith object is xj(i)

• Unknown parameter for a model = – Can also use other Greek letters, like ew

– Vector of parameters =

Page 3: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Example: Multivariate Linear Regression

• Task: predict real-valued Y, given real-valued vector X

• Score function, e.g.,

S() = i [y(i) – f(x(i) ; ) ]2

• Model structure: f(x ; ) = 0 + j xj

• Model parameters = = {0, 1, …… p }

Page 4: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

S = e2 = e’ e = (y – X a)’ (y – X a)

= y’ y – a’ X’ y – y’ X a + a’ X’ X a = y’ y – 2 a’ X’ y + a’ X’ X a

Taking derivative of S with respect to the components of a gives…. dS/da = -2X’y + 2 X’ X a

Set this to 0 to find the extremum (minimum) of S as a function of a…

- 2X’y + 2 X’ X a = 0 X’Xa = X’ y

Letting X’X = C, and X’y = b, we have C a = b, i.e., a set of linear equations

We could solve this directly by matrix inversion, i.e., a = C-1 b = ( X’ X )-1 X’ y

…. but there are better ways to do this (e.g., QR-decomposition)

Page 5: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Comments on Multivariate Linear Regression

• prediction is a linear function of the parameters• Score function: quadratic in predictions and

parameters Derivative of score is linear in the parameters Leads to a linear algebra optimization problem, i.e., Ca = b

• Model structure is simple….– p-1 dimensional hyperplane in p-dimensions– Linear weights => interpretability

• Useful as a baseline model – to compare more complex models to

Page 6: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Limitations of Linear Regression

• True relationship of X and Y might be non-linear– Suggests generalizations to non-linear models

• Complexity:– O(p3) - could be a problem for large p

• Correlation/Collinearity among the X variables– Can cause numerical instability (C may be ill-

conditioned)– Problems in interpretability (identifiability)

• Includes all variables in the model…– But what if p=100 and only 3 variables are related to Y?

Page 7: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Finding the k best variables

• Find the subset of k variables that predicts best:– This is a generic problem when p is large

(arises with all types of models, not just linear regression)

• Now we have models with different complexity..– E.g., p models with a single variable– p(p-1)/2 models with 2 variables, etc…– 2p possible models in total– Note that when we add or delete a variable, the optimal

weights on the other variables will change in general• k best is not the same as the best k individual variables

• What does “best” mean here?– Return to this later

Page 8: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Search Problem

• How can we search over all 2p possible models?– exhaustive search is clearly infeasible

– Heuristic search is used to search over model space:• Forward search (greedy)• Backward search (greedy)• Generalizations (add or delete)

– Think of operators in search space• Branch and bound techniques

– This type of variable selection problem is common to many data mining algorithms

• Outer loop that searches over variable combinations• Inner loop that evaluates each combination

Page 9: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Defining what “best” means

• How do we measure “best”?– Best performance on the training data?

• K = p will be best (i.e., use all variables)• So this is not useful

• Note:– Performance on the training data will in general be optimistic

• Alternatives:– Measure performance on a single validation set

– Measure performance using multiple validation sets• Cross-validation

– Add a penalty term to the score function that “corrects” for optimism

• E.g., “regularized” regression: SSE + sum of weights squared

Page 10: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Empirical Learning

• Squared Error score (as an example: we could use other scores)

S() = i [y(i) – f(x(i) ; ) ]2

where S() is defined on the training data D

• We are really interested in finding the f(x; ) that best predicts y on future data, i.e., minimizing E [S] = E [y – f(x ; ) ]2

• Empirical learning

– Minimize S() on the training data Dtrain

– If Dtrain is large and model is simple we are assuming that the best f

on training data is also the best predictor f on future test data Dtest

Page 11: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Complexity versus Goodness of Fit

x

yTraining data

Page 12: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Complexity versus Goodness of Fit

x

y

x

yToo simple?Training data

Page 13: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Complexity versus Goodness of Fit

x

y

x

y

x

y

Too simple?

Too complex ?

Training data

Page 14: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Complexity versus Goodness of Fit

x

y

x

y

x

y

x

y

Too simple?

Too complex ? About right ?

Training data

Page 15: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Complexity and Generalization

Strain()

Stest()

Complexity = degreesof freedom in the model(e.g., number of variables)

ScoreFunction

Optimal modelcomplexity

Page 16: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Using Validation Data

Training Data

Validation Data

Test Data

Use this data to find the best for each model fk(x ; )

Use this data to (1) calculate an estimate of Sk() for

each fk(x ; ) and (2) select k* = arg mink Sk()

Use this data to calculate an unbiased estimate of Sk*() for the selected model

Page 17: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

2 different (but related) issues here

• 1. Finding the function f that minimizes S() for future data

• 2. Getting a good estimate of S(), using the chosen function, on future data,– e.g., we might have selected the best function f, but our

estimate of its performance will be optimistically biased if our estimate of the score uses any of the same data used to fit and select the model.

Page 18: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Non-linear models, linear in parameters

• We can add additional polynomial terms in our equations, e.g., all “2nd order” terms

f(x ; ) = 0 + j xj + ij xi xj

• Note that it is a non-linear functional form, but it is linear in the parameters (so still referred to as “linear regression”)

– We can just treat the xi xj terms as additional fixed inputs

– In fact we can add in any non-linear input functions!, e.g.

f(x ; ) = 0 + j fj(x)

Comments:- Exact same linear algebra for optimization (same math)- size of model space has now exploded -> very large search

problem

Page 19: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Non-linear (both model and parameters)

• We can generalize further to models that are nonlinear in all aspects

f(x ; ) = 0 + k gk(k0 kj xj )where the g’s are non-linear functions with fixed functional forms.

In machine learning this is called a neural networkIn statistics this might be referred to as a generalized linear

model or projection-pursuit regression

For almost any score function of interest, e.g., squared error, the score function is a non-linear function of the parameters.

Closed form (analytical) solutions are rare.

Thus, we have a multivariate non-linear optimization problem(which may be quite difficult!)

Page 20: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Optimization of a non-linear score function

• We seek the minimum of a function in d dimensions, where d is the number of parameters (d could be large!)

• There are a multitude of heuristic search techniques (see chapter 8)– Steepest descent (follow the gradient)– Newton methods (use 2nd derivative information)– Conjugate gradient– Line search– Stochastic search– Genetic algorithms

• Two cases:– Convex (nice -> means a single global optimum)– Non-convex (multiple local optima => need multiple starts)

Page 21: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Other non-linear models

• Splines– “patch” together different low-order polynomials over

different parts of the x-space– Works well in 1 dimension, less well in higher dimensions

• Memory-based models

y’ = w(x’,x) y, where y’s are from the training data

w(x’,x) = function of distance of x from x’

• Local linear regression

y’ = 0 + j xj , where the alpha’s are fit at prediction time just to the (y,x) pairs that are close to x’

Page 22: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Time-series prediction as regression

• Measurements over time x1,…… xt

• We want to predict xt+1 given x1,…… xt

• Autoregressive model f( x1,…… xt ; ) = k xt-k

– Number of coefficients K = memory of the model– Can take advantage of regression techniques in general to solve

this problem (e.g., linear in parameters, score function = squared error, etc)

• Generalizations– Vector x– Non-linear function instead of linear– Add in terms for time-trend (linear, seasonal), for “jumps”, etc

Page 23: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Other aspects of regression

• Diagnostics– Useful in low dimensions

• Weighted regression– Useful when rows have different weights

• Different score functions– E.g. absolute error, or additive noise varies as a function of x

• Predicting y values constrained to a certain range

• Predicting binary y values– Regression as a generalization of classification

Page 24: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Classification as a special case of Regression

• Let y take values 0 or 1

• Regression (in general) tries to learn an f function that matches E[y|x]

• For binary y we have E[y|x] = 1. p(y=1|x) + 0 p(y=0|x) = p(y=1|x)

• Thus, if regression drives f towards E[y|x], then for binary classification problems a regression model will try to approximate p(y=1|x) (posterior class probabilities)– e.g., this is what neural networks do

Page 25: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Predicting an output between 0 and 1

• We often have a problem where y lies between 0 and 1– probability that a person with attributes X will survive 10 years– proportion of people in Zip code X who will buy a product

• We could use linear regression, but…..

• Instead we can use the logistic function

log p(y=1|x)/log p(y=0|x) = 0 + j xj

Equivalently, p(y=1|x) = 1/[1 + exp(- 0 - j xj ) ]

We model the log-odds as a linear function of the input variables. This is known as logistic regression.

Could be viewed as a simple neural network with a single hidden unit

Page 26: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Logistic Regression• How do we estimate the alpha parameters?

– Problem: linear algebra solution no longer applies since our function form (and score) is now non-linear in the parameters

• Common approach– Score function = likelihood = probability of observed data– Select parameters to maximize the log-likelihood (“maximum

likelihood”)

– S() = i log p( y(i) | x(i) ; )

= i y(i) log p( y(i)=1| x(i) ; ) + [1-y(i)] log(1- p( y(i)=1| x(i) ; ))

– Can compute the 2nd derivative directly as weighted matrix• Forms the basis for an iterative 2nd order Newton scheme• Known as iteratively reweighted least-squares• Log-likelihood here is convex: so it is quite stable (only one global

maximum!).

Page 27: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Tree-Structured Regression

• Functional form of model is a “regression tree”– Univariate thresholds at internal nodes– Constant or linear surfaces at the leaf nodes– Yields piecewise constant (or linear) surface– (like classification tree, but for regression)

• Very crude functional form…. But– Can be very useful in high-dimensional problems– Can be very interpretable

• Search problem– Finding the optimal tree is NP-hard (for any reasonable score)– Practice: greedy algorithms

Page 28: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

More on regression trees

• Greedy search algorithm– For each variable, find the best split point such the mean of

Y either side of the split minimizes the mean-squared error– Select the variable with the minimum average error

• Partition the data using the threshold– Recursively apply this selection procedure to each

“branch”

• What size tree?– A full tree will likely overfit the data– Common methods for tree selection

• Grow a large tree and select an appropriate subtree by Xvalidation

• Grow a number of small fixed-sized trees and average their predictions

Page 29: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Model Averaging

• Can average over parameters and models– Why? Any point estimate of parameters or a single model has

only a small chance of the being the best– Averaging makes our predictions more stable and less

sensitive to random variations in a particular data set (good for less stable models like trees)

• Model averaging flavors– Fully Bayesian: parameters and models– “empirical Bayesian”: learn weights over multiple models

• E.g., stacking and bagging

– Build multiple simple models in a systematic way and combine them, e.g., boosting

– Will say more about this in later lectures

Page 30: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Components of Data Mining Algorithms

• Representation:– Determining the nature and structure of the

representation to be used;

• Score function– quantifying and comparing how well different

representations fit the data

• Search/Optimization method– Choosing an algorithmic process to optimize the score

function; and

• Data Management– Deciding what principles of data management are

required to implement the algorithms efficiently.

Page 31: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Task

What’s in a Data Mining Algorithm?

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Page 32: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Task

Multivariate Linear Regression

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Regression

Y = Weighted linear sum of X’s

Least-squares

Linear algebra

None specified

Regression coefficients

Page 33: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Task

Autoregressive Time Series Models

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Time Series Regression

X = Weighted linear sum of earlier X’s

Least-squares

Linear algebra

None specified

Regression coefficients

Page 34: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Task

Neural Networks

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Regression

Y = nonlin function of X’s

Least-squares

Gradient descent

None specified

Network weights

Page 35: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Task

Logistic Regression

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Regression

Log-odds(Y) = linear function of X’s

Log-likelihood

Iterative (Newton) method

None specified

Logistic weights

Page 36: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Software

• MATLAB– Many free “toolboxes” on the Web for regression and

prediction– e.g., see http://lib.stat.cmu.edu/matlab/

and in particular the CompStats toolbox

• R– General purpose statistical computing environment (successor

to S)– Free (!)– Widely used by statisticians, has a huge library of functions

and visualization tools

• Commercial tools– SAS, other statistical packages– Data mining packages– Often are not progammable: offer a fixed menu of items

Page 37: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Suggested Reading in Text

• Chapter 4:– General statistical aspects of model fitting– Pages 93 to 116, plus Section 4.7 on sampling

• Chapter 5:– “reductionist” view of learning algorithms (can skim this)

• Chapter 6:– Different forms of functional forms for modeling– Pages 165 to 183

• Chapter 8:– Section 8.3 on multivariate optimization

• Chapter 9:– linear regression and related methods– Can skip Section 11.3

Page 38: Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine

Useful References

N. R. Draper and H. Smith, Applied Regression Analysis, 2nd edition,Wiley, 1981(the “bible” for classical regression methods in statistics)

T. Hastie, R. Tibshirani, and J. Friedman, Elements of Statistical Learning,Springer Verlag, 2001(statistically-oriented overview of modern ideas in regression and classificatio, mixes machine learning and statistics)