
Page 1: On prediction

On prediction

Jussi Hakanen

Post-doctoral researcher [email protected]

April 22, 2014 TIES445 Data mining (guest lecture)

Page 2: On prediction

Learning outcomes

To understand the basic principles of prediction

To understand linear regression in prediction

To be aware of the connection between least squares and optimization


Page 3: On prediction

Exercise

Find out issues related to prediction in data mining

Work in pairs or small groups

Time: 10 minutes


Page 4: On prediction

Summary of the exercise

How to measure quality of prediction?

How to avoid overlearning/overfitting?

Supervised learning

How to predict for missing data?

Different applications (stock markets, biology, sports prediction, income in finance, energy/water consumption, …)

Underfitting

Bias of the model

Concept drift (inaccuracy of the prediction)

Obtaining new knowledge from given data


Page 5: On prediction

Motivation

How to estimate data:
– Within the range of the dataset?
– Outside of the dataset?

Handling missing data, outliers

Predictive vs. descriptive models

Prediction for numerical values (cf. classification)


Page 6: On prediction

Concepts

Predictor (or independent or input) variables

π‘₯ = π‘₯1, … , π‘₯𝑁𝑇 (𝑁 β‰₯ 1)

Response (or dependent or output) variables

$y = (y_1, \dots, y_M)^T$ ($M \ge 1$)

Regression model/function

– A model/function describing the prediction used

Linear regression

– Regression model/function is linear


Page 7: On prediction

Prediction

A set of data for which the values of predictor and response variables are known

– $P$ data points $(x^j, y^j)$, $j = 1, \dots, P$
– $x^j = (x_1^j, \dots, x_N^j)^T$
– $y^j = (y_1^j, \dots, y_M^j)^T$

The idea is to use a prediction model to predict the response for values of the predictor variables whose response is unknown

Note: $P$ should be greater than or equal to $N$!

Interpolation vs. extrapolation

Can give misleading results if not interpreted carefully!

Accuracy is important
– Different measures of accuracy exist (a minimal sketch of one follows below)

– Can be used e.g. to choose between different models and/or to choose values for model parameters

– Can sometimes be sacrificed for a simpler model
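A minimal sketch of measuring prediction accuracy on held-out data; the data, the model, and all numbers here are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x - 0.7 plus noise.
x = rng.uniform(0, 5, size=100)
y = 3.0 * x - 0.7 + rng.normal(scale=0.5, size=100)

# Hold out part of the data to measure predictive accuracy.
train, test = slice(0, 80), slice(80, 100)
a1, a0 = np.polyfit(x[train], y[train], deg=1)   # least squares line
y_pred = a0 + a1 * x[test]

mse = np.mean((y[test] - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                      # same units as y
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Comparing such error measures across candidate models on the held-out points is one way to choose between models without rewarding overfitting.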


Page 8: On prediction

Regression analysis

Used for prediction and forecasting

Parametric/non-parametric regression
– Parametric: the regression function depends on a finite number of unknown parameters
– Non-parametric: the regression function belongs to a set of functions (possibly infinite-dimensional)

Linear and nonlinear regression

– With respect to the parameters


Page 9: On prediction

Linear regression

Model is linear w.r.t. the parameters
– Not necessarily linear w.r.t. the predictor variables

𝑦 = π‘Ž0 + π‘Žπ‘–π‘₯𝑖𝑁𝑖=1

– 𝑦 is a predicted estimate of the mean value at π‘₯

– π‘Ž0, … , π‘Žπ‘ are parameters

The oldest and most widely used regression approach, owing to its simplicity

Typically, the model used is not exact → an error exists
– $y^j = \hat{y}^j + e^j$ for each data point $x^j$, $j = 1, \dots, P$

– In matrix terms: $y = Xa + e$
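A minimal sketch of the matrix form: the design matrix $X$ gets a leading column of ones so that the intercept $a_0$ is handled by the same product $Xa$ as the other coefficients (the data and parameter values below are made up for illustration):

```python
import numpy as np

# Hypothetical predictor data: P = 4 points, N = 2 variables.
X_data = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [3.0, 1.5],
                   [4.0, 3.0]])

# Design matrix X: prepend a column of ones for the intercept a0.
X = np.column_stack([np.ones(len(X_data)), X_data])

a = np.array([0.5, 2.0, -1.0])   # (a0, a1, a2), illustrative values
y_hat = X @ a                     # predictions for all data points at once
print(y_hat)
```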

How to select values for the parameters π‘Ž?


Page 10: On prediction

Least squares

Determining orbits of bodies around the Sun from astronomical observations (Legendre, 1805; Gauss, 1809)

Idea: minimize the sum of the squared errors

For problems with $P > N$:

$\sum_{j=1}^{P} (e^j)^2 = \sum_{j=1}^{P} \Big( y^j - \sum_{i=0}^{N} a_i x_i^j \Big)^2$

$\min_a \sum_{j=1}^{P} \Big( y^j - \sum_{i=0}^{N} a_i x_i^j \Big)^2$

– An optimization problem!

The parameter values minimizing the above can be shown to be $a^* = (X^T X)^{-1} X^T y$
– The direct solution requires $X^T X$ to be invertible (problems arise if $P$ is small or there are linear dependences between the $x_i$)

– Typically $a^*$ is computed with numerical linear algebra
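A sketch of both solution routes with NumPy, on synthetic data (the data and true coefficients are made up). As the slide notes, the normal-equations route fails when $X^T X$ is (nearly) singular, so a library least-squares routine is usually preferable:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated from y = 1 + 2*x1 - x2 plus noise.
P, N = 50, 2
X_data = rng.normal(size=(P, N))
y = 1.0 + 2.0 * X_data[:, 0] - X_data[:, 1] + rng.normal(scale=0.1, size=P)

X = np.column_stack([np.ones(P), X_data])   # design matrix with intercept

# Normal equations: a* = (X^T X)^{-1} X^T y. Fine for small,
# well-conditioned problems.
a_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable route (QR/SVD based).
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(a_normal)   # both should be close to (1, 2, -1)
print(a_lstsq)
```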


Page 11: On prediction

Example

[Figure: data points with the fitted linear model]

$\hat{y} = -0.6777 + 3.0166x$

Page 12: On prediction

Example (cont.)

[Figure: data points with the fitted quadratic model]

$\hat{y} = 4.1579 - 0.0057x + 0.3053x^2$
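This quadratic fit shows why "linear regression" covers models that are nonlinear in $x$: the model $a_0 + a_1 x + a_2 x^2$ is still linear in the parameters, so ordinary least squares applies with $[1, x, x^2]$ as the design-matrix columns. A sketch on synthetic data generated near the slide's fitted coefficients (the data itself is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data roughly following the quadratic on the slide.
x = np.linspace(-3, 3, 40)
y = 4.1579 - 0.0057 * x + 0.3053 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with columns [1, x, x^2]: nonlinear in x,
# linear in the parameters, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)   # approximately (4.1579, -0.0057, 0.3053)
```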

Page 13: On prediction

Notes

The parameter values in linear regression can be interpreted as follows:

– If the value of predictor variable $x_i$ is increased by one unit and the values of the other predictor variables remain the same, then $a_i$ gives the change in the prediction

– $\hat{y} = a_0 + \sum_{i=1}^{N} a_i x_i$


Page 14: On prediction

Connection to optimization

Least squares → an unconstrained optimization problem

minπ‘Ž

1

2 (𝑓𝑗(π‘Ž))2𝑃

𝑗=1 = minπ‘Ž

1

2𝑓 π‘Ž 2

– Function 𝑓𝑗(π‘Ž) = 𝑦𝑗 βˆ’ β„Ž(π‘Ž, π‘₯𝑗) where β„Ž(π‘Ž, π‘₯) is the model used

– E.g. β„Ž π‘Ž, π‘₯ = π‘Ž0 + π‘Žπ‘–π‘₯𝑖𝑁𝑖=1

Gauss-Newton method

– Taylor (1st order): $f(a; a^h) \approx f(a^h) + \nabla f(a^h)^T (a - a^h)$

– → $a^{h+1} = a^h - \big( \nabla f(a^h) \nabla f(a^h)^T \big)^{-1} \nabla f(a^h) f(a^h)$

Connection to Newton's method

– Hessian of $\frac{1}{2}\|f(a)\|^2$: $\nabla f(a^h) \nabla f(a^h)^T + \sum_{j=1}^{P} \nabla^2 f_j(a^h) f_j(a^h)$

– Gauss-Newton is equivalent to Newton except for the second-order term!
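A minimal Gauss-Newton sketch for a hypothetical nonlinear model $h(a, x) = a_0 e^{a_1 x}$; the model choice, data, and starting point are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 2, 30)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.05, size=x.size)

def residuals(a):
    # f_j(a) = y_j - h(a, x_j)
    return y - a[0] * np.exp(a[1] * x)

def jacobian(a):
    # J[j, i] = d f_j / d a_i
    e = np.exp(a[1] * x)
    return np.column_stack([-e, -a[0] * x * e])

a = np.array([1.0, 1.0])   # starting point
for _ in range(20):
    f, J = residuals(a), jacobian(a)
    # Gauss-Newton step: solve (J^T J) d = -J^T f, then a <- a + d.
    # J^T J is the Hessian approximation that drops the second-order term.
    d = np.linalg.solve(J.T @ J, -J.T @ f)
    a = a + d

print(a)   # should approach (1.5, 0.8)
```

Dropping the second-order term $\sum_j \nabla^2 f_j \, f_j$ is cheap and works well when the residuals at the solution are small, which is exactly the least-squares setting.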


Page 15: On prediction

Function approximation

Prediction can be used in optimization for approximating the objective function

Typically used when evaluating the objective function is time-consuming
– E.g. if the model is a partial differential equation that takes a significant amount of time to solve numerically

– Reduces optimization time, since typically a large number of function evaluations is required

Examples of approximation models are polynomial approximation, radial basis functions (RBFs), Kriging, support vector regression
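A small surrogate-modeling sketch using SciPy's RBF interpolator, with a cheap stand-in for an expensive objective (the function, sample counts, and bounds are made-up assumptions):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Pretend this function is expensive to evaluate (e.g. it hides a
# PDE solve); here it is a cheap stand-in for illustration.
def expensive_objective(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

rng = np.random.default_rng(4)
samples = rng.uniform(-2, 2, size=(40, 2))   # a few expensive evaluations
values = expensive_objective(samples)

# Fit a radial basis function surrogate; an optimizer can then query
# the cheap surrogate instead of the true objective.
surrogate = RBFInterpolator(samples, values)

query = np.array([[0.3, -1.0]])
print(surrogate(query), expensive_objective(query))   # should be close
```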


Page 16: On prediction

Regularization

Previously, no requirements were made on the parameter values
– An unconstrained optimization problem

E.g. there may be a need to constrain the size of the parameters

Tikhonov regularization (ridge regression)
– Add a constraint that $\|a\|_2$, the $L_2$ norm of the parameter vector, is not greater than a given value

– Can be treated as an unconstrained optimization problem by adding a penalty term $\beta \|a\|_2^2$ to the objective function

Lasso (least absolute shrinkage and selection operator)
– Add a constraint that $\|a\|_1$, the $L_1$ norm of the parameter vector, is not greater than a given value

– Prefers solutions with fewer non-zeros
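A minimal ridge-regression sketch on a deliberately ill-conditioned problem, using the closed form $a^* = (X^T X + \beta I)^{-1} X^T y$; the data and the value of $\beta$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ill-conditioned problem: two nearly collinear predictors.
P = 30
x1 = rng.normal(size=P)
x2 = x1 + rng.normal(scale=1e-3, size=P)   # almost a copy of x1
y = 2.0 * x1 + rng.normal(scale=0.1, size=P)

X = np.column_stack([x1, x2])
beta = 1.0                                  # regularization weight

# Ridge closed form: the beta*I term keeps X^T X + beta*I invertible
# and shrinks the parameter vector toward zero.
a_ridge = np.linalg.solve(X.T @ X + beta * np.eye(2), X.T @ y)
print(a_ridge)

# The lasso has no such closed form (the L1 term is non-smooth);
# it is solved iteratively, e.g. by coordinate descent.
```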


Page 17: On prediction

Conclusions

What were the key points from your perspective? What do you remember best?

Extrapolation and interpolation can be dangerous

Is regularization important?


Page 18: On prediction


Thank You!

Dr. Jussi Hakanen

Industrial Optimization Group

http://www.mit.jyu.fi/optgroup/

Department of Mathematical Information Technology

P.O. Box 35 (Agora)

FI-40014 University of JyvΓ€skylΓ€

[email protected]

http://users.jyu.fi/~jhaka/en/
