
Page 1: On prediction

On prediction

Jussi Hakanen

Post-doctoral researcher [email protected]

April 22, 2014 TIES445 Data mining (guest lecture)

Page 2: On prediction

Learning outcomes

To understand the basic principles of prediction

To understand linear regression in prediction

To be aware of the connection between least squares and optimization


Page 3: On prediction

Exercise

Find out issues related to prediction in data mining

Work in pairs or small groups

Time: 10 minutes


Page 4: On prediction

Summary of the exercise

How to measure quality of prediction?

How to avoid overlearning/overfitting?

Supervised learning

How to predict for missing data?

Different applications (stock markets, biology, sports prediction, income in finance, energy/water consumption, …)

Underfitting

Bias of the model

Concept drift (inaccuracy of the prediction)

Obtaining new knowledge from given data


Page 5: On prediction

Motivation

How to estimate data:
– Within the range of the dataset?
– Outside of the dataset?

Handling missing data, outliers

Predictive vs. descriptive models

Prediction for numerical values (cf. classification)


Page 6: On prediction

Concepts

Predictor (or independent or input) variables

π‘₯ = π‘₯1, … , π‘₯𝑁𝑇 (𝑁 β‰₯ 1)

Response (or dependent or output) variables

$y = (y_1, \dots, y_M)^T$ ($M \ge 1$)

Regression model/function

– A model/function describing the prediction used

Linear regression

– Regression model/function is linear


Page 7: On prediction

Prediction

A set of data for which the values of predictor and response variables are known

– $P$ data points $(x^j, y^j)$, $j = 1, \dots, P$
– $x^j = (x_1^j, \dots, x_N^j)^T$
– $y^j = (y_1^j, \dots, y_M^j)^T$

The idea is to use a prediction model to predict the response for values of the predictor variables whose response is unknown

Note: $P$ should be greater than or equal to $N$!

Interpolation vs. extrapolation

Can give misleading results if not interpreted carefully!

Accuracy is important
– Different measures of accuracy exist (a minimal sketch of one follows below)

– Can be used e.g. to choose between different models and/or to choose values for model parameters

– Can sometimes be sacrificed for a simpler model
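A minimal sketch of measuring prediction accuracy on held-out data; the data, the model, and all numbers here are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x - 0.7 plus noise.
x = rng.uniform(0, 5, size=100)
y = 3.0 * x - 0.7 + rng.normal(scale=0.5, size=100)

# Hold out part of the data to measure predictive accuracy.
train, test = slice(0, 80), slice(80, 100)
a1, a0 = np.polyfit(x[train], y[train], deg=1)   # least squares line
y_pred = a0 + a1 * x[test]

mse = np.mean((y[test] - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                      # same units as y
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Comparing such error measures across candidate models on the held-out points is one way to choose between models without rewarding overfitting.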


Page 8: On prediction

Regression analysis

Used for prediction and forecasting

Parametric/non-parametric regression
– Parametric: the regression function depends on a finite number of unknown parameters
– Non-parametric: the regression function belongs to a set of functions (possibly infinite-dimensional)

Linear and nonlinear regression

– With respect to the parameters


Page 9: On prediction

Linear regression

Model is linear w.r.t. the parameters
– Not necessarily linear w.r.t. the predictor variables

𝑦 = π‘Ž0 + π‘Žπ‘–π‘₯𝑖𝑁𝑖=1

– 𝑦 is a predicted estimate of the mean value at π‘₯

– π‘Ž0, … , π‘Žπ‘ are parameters

The oldest and most widely used regression approach, owing to its simplicity

Typically, the model used is not exact → an error exists
– $y^j = \hat{y}^j + e^j$ for each data point $x^j$, $j = 1, \dots, P$

– In matrix terms: $y = Xa + e$
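A minimal sketch of the matrix form: the design matrix $X$ gets a leading column of ones so that the intercept $a_0$ is handled by the same product $Xa$ as the other coefficients (the data and parameter values below are made up for illustration):

```python
import numpy as np

# Hypothetical predictor data: P = 4 points, N = 2 variables.
X_data = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [3.0, 1.5],
                   [4.0, 3.0]])

# Design matrix X: prepend a column of ones for the intercept a0.
X = np.column_stack([np.ones(len(X_data)), X_data])

a = np.array([0.5, 2.0, -1.0])   # (a0, a1, a2), illustrative values
y_hat = X @ a                     # predictions for all data points at once
print(y_hat)
```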

How to select values for the parameters π‘Ž?


Page 10: On prediction

Least squares

Determining orbits of bodies around the Sun from astronomical observations (Legendre, 1805; Gauss, 1809)

Idea: minimize the sum of the squared errors

For problems with $P > N$:

$\sum_{j=1}^{P} (e^j)^2 = \sum_{j=1}^{P} \Big( y^j - \sum_{i=0}^{N} a_i x_i^j \Big)^2$

$\min_a \sum_{j=1}^{P} \Big( y^j - \sum_{i=0}^{N} a_i x_i^j \Big)^2$

– An optimization problem!

The parameter values minimizing the above can be shown to be $a^* = (X^T X)^{-1} X^T y$
– The direct solution requires $X^T X$ to be invertible (problems arise if $P$ is small or there are linear dependences between the $x_i$)

– Typically $a^*$ is computed with numerical linear algebra
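A sketch of both solution routes with NumPy, on synthetic data (the data and true coefficients are made up). As the slide notes, the normal-equations route fails when $X^T X$ is (nearly) singular, so a library least-squares routine is usually preferable:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated from y = 1 + 2*x1 - x2 plus noise.
P, N = 50, 2
X_data = rng.normal(size=(P, N))
y = 1.0 + 2.0 * X_data[:, 0] - X_data[:, 1] + rng.normal(scale=0.1, size=P)

X = np.column_stack([np.ones(P), X_data])   # design matrix with intercept

# Normal equations: a* = (X^T X)^{-1} X^T y. Fine for small,
# well-conditioned problems.
a_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable route (QR/SVD based).
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(a_normal)   # both should be close to (1, 2, -1)
print(a_lstsq)
```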


Page 11: On prediction

Example

[Figure: data points with the fitted linear model]

$\hat{y} = -0.6777 + 3.0166x$

Page 12: On prediction

Example (cont.)

[Figure: data points with the fitted quadratic model]

$\hat{y} = 4.1579 - 0.0057x + 0.3053x^2$
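This quadratic fit shows why "linear regression" covers models that are nonlinear in $x$: the model $a_0 + a_1 x + a_2 x^2$ is still linear in the parameters, so ordinary least squares applies with $[1, x, x^2]$ as the design-matrix columns. A sketch on synthetic data generated near the slide's fitted coefficients (the data itself is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data roughly following the quadratic on the slide.
x = np.linspace(-3, 3, 40)
y = 4.1579 - 0.0057 * x + 0.3053 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with columns [1, x, x^2]: nonlinear in x,
# linear in the parameters, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)   # approximately (4.1579, -0.0057, 0.3053)
```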

Page 13: On prediction

Notes

The parameter values in linear regression can be interpreted as follows:

– If the value of predictor variable $x_i$ is increased by one unit and the values of the other predictor variables remain the same, then $a_i$ gives the change in the prediction

– $\hat{y} = a_0 + \sum_{i=1}^{N} a_i x_i$


Page 14: On prediction

Connection to optimization

Least squares → an unconstrained optimization problem

minπ‘Ž

1

2 (𝑓𝑗(π‘Ž))2𝑃

𝑗=1 = minπ‘Ž

1

2𝑓 π‘Ž 2

– Function 𝑓𝑗(π‘Ž) = 𝑦𝑗 βˆ’ β„Ž(π‘Ž, π‘₯𝑗) where β„Ž(π‘Ž, π‘₯) is the model used

– E.g. β„Ž π‘Ž, π‘₯ = π‘Ž0 + π‘Žπ‘–π‘₯𝑖𝑁𝑖=1

Gauss-Newton method

– Taylor (1st order): $f(a; a^h) \approx f(a^h) + \nabla f(a^h)^T (a - a^h)$

– → $a^{h+1} = a^h - \big( \nabla f(a^h) \nabla f(a^h)^T \big)^{-1} \nabla f(a^h) f(a^h)$

Connection to Newton's method

– Hessian of $\frac{1}{2}\|f(a)\|^2$: $\nabla f(a^h) \nabla f(a^h)^T + \sum_{j=1}^{P} \nabla^2 f_j(a^h) f_j(a^h)$

– Gauss-Newton is equivalent to Newton except for the second-order term!
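A minimal Gauss-Newton sketch for a hypothetical nonlinear model $h(a, x) = a_0 e^{a_1 x}$; the model choice, data, and starting point are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 2, 30)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.05, size=x.size)

def residuals(a):
    # f_j(a) = y_j - h(a, x_j)
    return y - a[0] * np.exp(a[1] * x)

def jacobian(a):
    # J[j, i] = d f_j / d a_i
    e = np.exp(a[1] * x)
    return np.column_stack([-e, -a[0] * x * e])

a = np.array([1.0, 1.0])   # starting point
for _ in range(20):
    f, J = residuals(a), jacobian(a)
    # Gauss-Newton step: solve (J^T J) d = -J^T f, then a <- a + d.
    # J^T J is the Hessian approximation that drops the second-order term.
    d = np.linalg.solve(J.T @ J, -J.T @ f)
    a = a + d

print(a)   # should approach (1.5, 0.8)
```

Dropping the second-order term $\sum_j \nabla^2 f_j \, f_j$ is cheap and works well when the residuals at the solution are small, which is exactly the least-squares setting.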


Page 15: On prediction

Function approximation

Prediction can be used in optimization for approximating the objective function

Typically used when evaluating the objective function is time-consuming
– E.g. if the model is a partial differential equation that takes a significant amount of time to solve numerically

– Reduces optimization time, since typically a large number of function evaluations is required

Examples of approximation models are polynomial approximation, radial basis functions (RBFs), Kriging, support vector regression
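A small surrogate-modeling sketch using SciPy's RBF interpolator, with a cheap stand-in for an expensive objective (the function, sample counts, and bounds are made-up assumptions):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Pretend this function is expensive to evaluate (e.g. it hides a
# PDE solve); here it is a cheap stand-in for illustration.
def expensive_objective(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

rng = np.random.default_rng(4)
samples = rng.uniform(-2, 2, size=(40, 2))   # a few expensive evaluations
values = expensive_objective(samples)

# Fit a radial basis function surrogate; an optimizer can then query
# the cheap surrogate instead of the true objective.
surrogate = RBFInterpolator(samples, values)

query = np.array([[0.3, -1.0]])
print(surrogate(query), expensive_objective(query))   # should be close
```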


Page 16: On prediction

Regularization

Previously, no requirements were made on the parameter values
– An unconstrained optimization problem

E.g. there may be a need to constrain the size of the parameters

Tikhonov regularization (ridge regression)
– Add a constraint that $\|a\|_2$, the $L_2$ norm of the parameter vector, is not greater than a given value

– Can be treated as an unconstrained optimization problem by adding a penalty term $\beta \|a\|_2^2$ to the objective function

Lasso (least absolute shrinkage and selection operator)
– Add a constraint that $\|a\|_1$, the $L_1$ norm of the parameter vector, is not greater than a given value

– Prefers solutions with fewer non-zeros
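A minimal ridge-regression sketch on a deliberately ill-conditioned problem, using the closed form $a^* = (X^T X + \beta I)^{-1} X^T y$; the data and the value of $\beta$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ill-conditioned problem: two nearly collinear predictors.
P = 30
x1 = rng.normal(size=P)
x2 = x1 + rng.normal(scale=1e-3, size=P)   # almost a copy of x1
y = 2.0 * x1 + rng.normal(scale=0.1, size=P)

X = np.column_stack([x1, x2])
beta = 1.0                                  # regularization weight

# Ridge closed form: the beta*I term keeps X^T X + beta*I invertible
# and shrinks the parameter vector toward zero.
a_ridge = np.linalg.solve(X.T @ X + beta * np.eye(2), X.T @ y)
print(a_ridge)

# The lasso has no such closed form (the L1 term is non-smooth);
# it is solved iteratively, e.g. by coordinate descent.
```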


Page 17: On prediction

Conclusions

What were the key points from your perspective? What do you remember best?

Extrapolation and interpolation can be dangerous

Is regularization important?


Page 18: On prediction


Thank You!

Dr. Jussi Hakanen

Industrial Optimization Group

http://www.mit.jyu.fi/optgroup/

Department of Mathematical Information Technology

P.O. Box 35 (Agora)

FI-40014 University of JyvΓ€skylΓ€

[email protected]

http://users.jyu.fi/~jhaka/en/
