Regression
Albert Bifet
May 2012
COMP423A/COMP523A Data Stream Mining
Outline
1. Introduction
2. Stream Algorithmics
3. Concept drift
4. Evaluation
5. Classification
6. Ensemble Methods
7. Regression
8. Clustering
9. Frequent Pattern Mining
10. Distributed Streaming
Data Streams
Big Data & Real Time
Regression
Definition
Given a numeric class attribute, a regression algorithm builds a model that predicts for every unlabelled instance I a numeric value with accuracy:
y = f(x)

Example
Stock-market price prediction

Example
Airplane delays
Evaluation
1. Error estimation: Hold-out or Prequential (see the sketch after this list)
2. Evaluation performance measures: MSE or MAE
3. Statistical significance validation: Nemenyi test
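A minimal sketch of prequential (test-then-train) error estimation, assuming a model object with hypothetical predict and learn methods; each instance is used first for testing and then for training:

def prequential_mse(model, stream):
    """Prequential MSE: test on each instance, then train on it."""
    total, n = 0.0, 0
    for x, y in stream:
        y_hat = model.predict(x)      # test first ...
        total += (y_hat - y) ** 2
        n += 1
        model.learn(x, y)             # ... then train
    return total / n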
Evaluation Framework
2. Performance Measures
Regression mean measures
I Mean square error:
    MSE = \sum_i (f(x_i) - y_i)^2 / N
I Root mean square error:
    RMSE = \sqrt{MSE} = \sqrt{\sum_i (f(x_i) - y_i)^2 / N}
Forgetting mechanism for estimating measures
Sliding window of size w with the most recent observations
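A minimal sketch of the sliding-window forgetting mechanism for the mean measures above (class and method names are illustrative):

import math
from collections import deque

class WindowedError:
    """MSE and RMSE over the squared errors of the w most recent observations."""
    def __init__(self, w):
        self.errors = deque(maxlen=w)   # oldest errors are forgotten automatically

    def update(self, y_pred, y_true):
        self.errors.append((y_pred - y_true) ** 2)

    def mse(self):
        return sum(self.errors) / len(self.errors)

    def rmse(self):
        return math.sqrt(self.mse())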
2. Performance Measures
Regression relative measures
I Relative square error:
    RSE = \sum_i (f(x_i) - y_i)^2 / \sum_i (\bar{y} - y_i)^2
I Root relative square error:
    RRSE = \sqrt{RSE} = \sqrt{\sum_i (f(x_i) - y_i)^2 / \sum_i (\bar{y} - y_i)^2}
where \bar{y} is the mean of the observed values y_i.
Forgetting mechanism for estimating measures
Sliding window of size w with the most recent observations
2. Performance Measures
Regression absolute measures
I Mean absolute error:
    MAE = \sum_i |f(x_i) - y_i| / N
I Relative absolute error:
    RAE = \sum_i |f(x_i) - y_i| / \sum_i |\bar{y} - y_i|
Forgetting mechanism for estimating measures
Sliding window of size w with the most recent observations
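The relative and absolute measures can be estimated over the same kind of sliding window; a sketch that stores the (prediction, true value) pairs so the window mean \bar{y} can be recomputed:

from collections import deque

class WindowedRelativeError:
    """RSE and RAE over the w most recent (prediction, true value) pairs."""
    def __init__(self, w):
        self.pairs = deque(maxlen=w)

    def update(self, y_pred, y_true):
        self.pairs.append((y_pred, y_true))

    def _y_bar(self):
        return sum(y for _, y in self.pairs) / len(self.pairs)

    def rse(self):
        y_bar = self._y_bar()
        num = sum((p - y) ** 2 for p, y in self.pairs)
        return num / sum((y_bar - y) ** 2 for _, y in self.pairs)

    def rae(self):
        y_bar = self._y_bar()
        num = sum(abs(p - y) for p, y in self.pairs)
        return num / sum(abs(y_bar - y) for _, y in self.pairs)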
Linear Methods for Regression
Linear Least Squares fitting
I Linear Regression Model
    f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j = X\beta
I Minimize the residual sum of squares
    RSS(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = (y - X\beta)'(y - X\beta)
I Solution:
    \hat{\beta} = (X'X)^{-1} X'y
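A minimal numpy sketch of the closed-form solution above; a linear solve replaces the explicit inverse for numerical stability, and a column of ones is prepended so that the intercept \beta_0 is learned too:

import numpy as np

def least_squares(X, y):
    """Least-squares fit: solve (X'X) beta = X'y for beta."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column for beta_0
    return np.linalg.solve(X.T @ X, X.T @ y)      # avoids forming the inverse

# Usage sketch: beta = least_squares(X, y); y_hat = np.r_[1, x] @ beta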
Perceptron
[Diagram: a linear perceptron with inputs Attribute 1 ... Attribute 5, weights w_1 ... w_5, and output h_{\vec{w}}(\vec{x}_i)]
I Data stream: \langle \vec{x}_i, y_i \rangle
I Classical perceptron: h_{\vec{w}}(\vec{x}_i) = \vec{w}^T \vec{x}_i
I Minimize mean-square error: J(\vec{w}) = \frac{1}{2} \sum_i (y_i - h_{\vec{w}}(\vec{x}_i))^2
Perceptron
I Minimize mean-square error: J(\vec{w}) = \frac{1}{2} \sum_i (y_i - h_{\vec{w}}(\vec{x}_i))^2
I Stochastic Gradient Descent: \vec{w} = \vec{w} - \eta \nabla J_{\vec{x}_i}
I Gradient of the error function:
    \nabla J = -\sum_i (y_i - h_{\vec{w}}(\vec{x}_i)) \vec{x}_i
I Weight update rule:
    \vec{w} = \vec{w} + \eta \sum_i (y_i - h_{\vec{w}}(\vec{x}_i)) \vec{x}_i
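A minimal sketch of a streaming perceptron regressor implementing the update rule above, one instance at a time (the learning rate value is an arbitrary choice):

import numpy as np

class StreamingPerceptron:
    """Linear model h_w(x) = w . x trained by stochastic gradient descent."""
    def __init__(self, n_features, eta=0.01):
        self.w = np.zeros(n_features)
        self.eta = eta

    def predict(self, x):
        return self.w @ x

    def learn(self, x, y):
        # One SGD step on instance (x, y): w <- w + eta * (y - h_w(x)) * x
        self.w += self.eta * (y - self.predict(x)) * x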
Fast Incremental Model Tree with Drift Detection (FIMT-DD)
Differences between FIMT-DD and the Hoeffding Tree (HT):
1. Splitting criterion
2. Numeric attribute handling using BINTREE
3. Linear model at the leaves
4. Concept drift handling: Page-Hinckley test
5. Alternate-tree adaptation strategy
Splitting Criterion
Standard Deviation Reduction (SDR) measure
I Classification
    Information Gain = Entropy(before split) - Entropy(after split)
    Entropy = -\sum_{i=1}^{c} p_i \log p_i
    Gini Index = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} p_i^2
I Regression
    Gain = SD(before split) - SD(after split)
    Standard Deviation (SD) = \sqrt{\sum_i (y_i - \bar{y})^2 / N}
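A minimal sketch of the standard deviation reduction for one candidate binary split; weighting each branch by its fraction of the examples is assumed, as in the usual batch definition:

import math

def sd(ys):
    """Standard deviation of a list of target values."""
    y_bar = sum(ys) / len(ys)
    return math.sqrt(sum((y - y_bar) ** 2 for y in ys) / len(ys))

def sd_reduction(left, right):
    """Gain = SD(before split) - weighted SD(after split)."""
    ys = left + right
    n = len(ys)
    return sd(ys) - (len(left) / n * sd(left) + len(right) / n * sd(right))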
Numeric Handling Methods
Exhaustive Binary Tree (BINTREE – Gama et al, 2003)
I Closest implementation of a batch method
I Incrementally update a binary tree as data is observed
I Issues: high memory cost, high cost of split search, sensitivity to data order
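A minimal sketch of the exhaustive binary tree idea: every distinct observed value becomes a node, and each node keeps the target statistics needed to score the candidate split "attribute <= value". This is an illustration of the approach, not the exact BINTREE structure; evaluating a split at a node also requires accumulating the statistics of its ancestors on the path from the root.

class BinTreeNode:
    """One observed numeric value, with target statistics for split search."""
    def __init__(self, value):
        self.value = value
        self.left = self.right = None
        self.n = 0          # instances reaching this node with x <= value
        self.sum_y = 0.0    # sum of their targets (for SD computation)
        self.sum_y2 = 0.0   # sum of their squared targets

    def insert(self, x, y):
        if x <= self.value:
            self.n += 1
            self.sum_y += y
            self.sum_y2 += y * y
            if x < self.value:
                if self.left is None:
                    self.left = BinTreeNode(x)
                else:
                    self.left.insert(x, y)
        else:
            if self.right is None:
                self.right = BinTreeNode(x)
            else:
                self.right.insert(x, y)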
Page Hinckley Test
I The CUSUM test
    g_0 = 0, \quad g_t = \max(0, g_{t-1} + \epsilon_t - \upsilon)
    if g_t > h then alarm and g_t = 0
I The Page-Hinckley test
    g_0 = 0, \quad g_t = g_{t-1} + (\epsilon_t - \upsilon)
    G_t = \min_{s \le t} g_s
    if g_t - G_t > h then alarm and g_t = 0
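A minimal sketch of the Page-Hinckley test as defined above, where epsilon_t is the monitored quantity at time t (e.g. the model's absolute error), upsilon the magnitude of allowed changes, and h the detection threshold (the default values here are arbitrary):

class PageHinckley:
    """Alarms when the cumulative sum g_t rises h above its running minimum G_t."""
    def __init__(self, upsilon=0.005, h=50.0):
        self.upsilon = upsilon
        self.h = h
        self.g = 0.0        # cumulative sum g_t
        self.g_min = 0.0    # running minimum G_t

    def update(self, epsilon):
        self.g += epsilon - self.upsilon
        self.g_min = min(self.g_min, self.g)
        if self.g - self.g_min > self.h:
            self.g = self.g_min = 0.0   # reset after the alarm, as in the slide
            return True                 # change detected
        return False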
Lazy Methods
kNN Nearest Neighbours:
1. Mean value of the k nearest neighbours
    f(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)
2. Depends on the distance function
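A minimal sketch of kNN regression over a sliding window of stored instances; Euclidean distance and the window size are assumptions, since the slide leaves the distance function open:

import heapq
import math
from collections import deque

class KNNRegressor:
    """Predicts the mean target of the k nearest stored neighbours."""
    def __init__(self, k=5, window=1000):
        self.k = k
        self.store = deque(maxlen=window)   # forget the oldest instances

    def learn(self, x, y):
        self.store.append((x, y))

    def predict(self, xq):
        # the k stored instances closest to the query point xq
        nearest = heapq.nsmallest(self.k, self.store,
                                  key=lambda xy: math.dist(xq, xy[0]))
        return sum(y for _, y in nearest) / len(nearest)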