Online temporally adaptive parameter estimation with applications to streaming data analysis


• Slide 1/34

Online temporally adaptive parameter estimation with applications to streaming data analysis

Christoforos Anagnostopoulos, Research Fellow, Statistical Laboratory, Cambridge

Professor David J. Hand (Imperial College London)
Dr Niall M. Adams (Imperial College London)
Dr David Leslie (Bristol University)

• Slide 2/34

    Streaming data analysis

Specifications: discrete-time, regular sampling $(x_i)_{i=1,\ldots}$

    online (operate in real time)

    drift-tolerant (handle unforeseen disturbances / data shifts)

    Examples:

internet services (e.g., spam filtering, recommender systems)

    retail banking (e.g., credit card fraud detection)

    financial data (e.g., online portfolio optimisation)

    audio-video (e.g., data mining in video sequences)

sensor networks (e.g., adaptive sensor querying, environmental monitoring, situational awareness, multiple target tracking)

• Slide 3/34

    Adapting offline tools to streaming contexts

    sliding windows

Fit your model to the $w$ most recent observations (a minimal sketch follows below). Pros/cons:

    online, drift-tolerant, universally applicable, BUT

    ! inefficient in terms of computation and information

    ! unnatural sharp cut-off

! how to choose $w$?
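As a concrete baseline, a minimal sketch of the window approach for a running mean (the running-mean choice and names are mine, for illustration): memory is O(w) per model, and the estimate jumps whenever an old point leaves the window.

```python
from collections import deque

class SlidingWindowMean:
    """Mean of the w most recent observations."""

    def __init__(self, w):
        self.w = w                    # window length: the key tuning choice
        self.buffer = deque(maxlen=w)
        self.total = 0.0

    def update(self, x):
        if len(self.buffer) == self.w:
            self.total -= self.buffer[0]  # drop the oldest point (sharp cut-off)
        self.buffer.append(x)
        self.total += x
        return self.total / len(self.buffer)
```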

• Slide 4/34

    Adapting offline tools to streaming contexts

    learning rates

Offline learning theory perspective:

let $L(x, \theta)$ be a loss function

target $\theta^*$ that minimises $E_X[L(x; \theta)]$

use $\sum_{i=1}^{t} L(x_i, \theta)$ as a proxy

Gradient descent:

$\theta_k := \theta_{k-1} + c \, \nabla_\theta \sum_{i=1}^{t} L(x_i, \theta)$  (offline)

$\theta_t := \theta_{t-1} + \gamma \, \nabla_\theta L(x_t, \theta)$  (online)

Fairly generic. The learning rate $\gamma$ controls the relative importance of past and novel data. A vanishing $\gamma_t$ yields convergence; holding $\gamma > 0$ yields temporal adaptivity. How to choose $\gamma$?
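A worked special case, added here for concreteness (not on the slide): take the squared loss and step against its gradient, so the link between a fixed rate and exponential weighting becomes explicit:

$L(x; \theta) = \tfrac{1}{2}(x - \theta)^2, \qquad \theta_t := \theta_{t-1} + \gamma \, (x_t - \theta_{t-1}),$

which unrolls to

$\theta_t = (1 - \gamma)\,\theta_{t-1} + \gamma\, x_t = \gamma \sum_{i=1}^{t} (1-\gamma)^{t-i} x_i + (1-\gamma)^t \theta_0.$

A vanishing rate $\gamma_t = 1/t$ recovers the running mean (convergence), while a fixed $\gamma > 0$ gives an exponentially weighted average with effective memory of order $1/\gamma$ (temporal adaptivity).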

• Slide 5/34

    Adapting offline tools to streaming contexts

    forgetting factors

Consider again the learning theory perspective, and replace the log-additive loss

$L(x_{1:(t+1)}; \theta) := L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$

with an exponentially weighted version:

$L(x_{1:(t+1)}; \theta) := \lambda \, L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$

$\lambda = 1$ yields the offline solution, $\lambda = 0$ discards the data history

there exist relationships with learning rates in certain cases (one such relationship is worked out below)

how to choose $\lambda$?
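One such relationship, worked out here for the squared loss (a derivation added for clarity; the quantities $n_t$ and $\gamma_t$ reappear on the online-EM slide below): minimising the exponentially weighted loss

$\sum_{i=1}^{t} \lambda^{t-i} \, \tfrac{1}{2}(x_i - \theta)^2$

gives $\theta_t = \frac{1}{n_t} \sum_{i=1}^{t} \lambda^{t-i} x_i$ with $n_t = \sum_{i=1}^{t} \lambda^{t-i} = \lambda n_{t-1} + 1$, or, recursively,

$\theta_t = \theta_{t-1} + \gamma_t (x_t - \theta_{t-1}), \qquad \gamma_t = \frac{1}{n_t}.$

Thus $\lambda = 1$ gives the vanishing rate $\gamma_t = 1/t$ (the running mean), while $\lambda < 1$ makes $\gamma_t \to 1 - \lambda > 0$, i.e. a non-vanishing learning rate.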

• Slide 6/34

How to choose $w$ / $\gamma$ / $\lambda$?

Intuition: bias-variance tradeoff.

in practice we need online, data-dependent methods

common: optimise the one-step-ahead loss $L(x_{t+1}; \theta_t)$

Example:

at time $t+1$, run three SGD updates, one for each of $\gamma_t + c$, $\gamma_t$, $\gamma_t - c$

select among them using the one-step-ahead loss

set $\gamma_{t+1}$ to the best-performing value (a minimal sketch of this heuristic follows below)

    There are far more sophisticated choices... See later.
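A minimal sketch of the example above, assuming a scalar mean tracked by SGD under squared one-step-ahead loss; the class and parameter names are illustrative, not from the talk.

```python
import numpy as np

class BestOfThreeRate:
    """Track a mean by SGD; at each step compare three candidate rates
    (gamma - c, gamma, gamma + c) on their one-step-ahead squared loss."""

    def __init__(self, gamma=0.1, c=0.01, lo=1e-3, hi=1.0):
        self.gamma, self.c, self.lo, self.hi = gamma, c, lo, hi
        self.theta = 0.0        # current estimate
        self.candidates = None  # (rate, estimate) pairs awaiting evaluation

    def update(self, x):
        # Score yesterday's three candidates on today's observation ...
        if self.candidates is not None:
            losses = [(x - th) ** 2 for _, th in self.candidates]
            self.gamma, self.theta = self.candidates[int(np.argmin(losses))]
        # ... then propose three new candidates around the chosen rate.
        rates = [float(np.clip(self.gamma + d, self.lo, self.hi))
                 for d in (-self.c, 0.0, self.c)]
        self.candidates = [(g, self.theta + g * (x - self.theta)) for g in rates]
        return self.theta
```

The same pattern applies verbatim to a window length $w$ or a forgetting factor $\lambda$; only the candidate values and the underlying update change.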

• Slide 7/34

    Research agenda: what is really going on?

Parametric estimation, but relax the i.i.d. assumption:

$x_t \sim f(x; \theta_t)$

Assumptions about the $\theta_t$-process?

(Bayesian) dynamic modelling: joint distribution on $(x_t, \theta_t)_t$; Markov assumptions (e.g., state-space model); $\theta_t$ is produced via inferential machinery

temporally adaptive estimation: produce a recursive estimator, e.g., hack a static estimator into shape as described earlier (analogy with WML?); study it under weak assumptions about the dynamics; always reasonable, never optimal performance

• Slide 8/34

    Online EM for exponential family models

Focus first on a simple average:

$\theta_t = \frac{1}{t} \sum_{i=1}^{t} x_i = \Big(1 - \frac{1}{t}\Big)\theta_{t-1} + \frac{1}{t}\, x_t$

For exponential family models, switch to sufficient statistics:

$\theta_t = \frac{1}{t} \sum_{i=1}^{t} s(x_i) = \Big(1 - \frac{1}{t}\Big)\theta_{t-1} + \frac{1}{t}\, s(x_t)$

With exponential forgetting, $\theta_t = \frac{1}{n_t} \sum_{i=1}^{t} \lambda^{t-i} s(x_i)$, we have

$\theta_t = \theta_{t-1} + \gamma_t \big(s(x_t) - \theta_{t-1}\big), \qquad \gamma_t = \frac{1}{n_t}, \qquad n_t = \lambda n_{t-1} + 1$

If only a part (or a deterministic function) $v_t$ of $x_t$ is observed:

$\theta_t = \theta_{t-1} + \gamma_t \big(E_{\theta_{t-1}}[\, s(x_t) \mid v_t \,] - \theta_{t-1}\big)$

The algorithm is a Robbins-Monro scheme, and still converges.
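To make the partially observed case concrete, here is a minimal sketch of an exponentially weighted online EM for a two-component Poisson mixture, in the spirit of the Streaming PMM application later in the deck but written from scratch for illustration; the class name, initial values and the choice of lam are assumptions.

```python
import numpy as np
from scipy.stats import poisson

class OnlinePoissonMixtureEM:
    """Online EM with forgetting factor lam for a K-component Poisson mixture.
    Exponentially weighted sufficient statistics per component:
    responsibility mass and responsibility-weighted sum of x."""

    def __init__(self, mus, weights, lam=0.99):
        self.mu = np.asarray(mus, dtype=float)     # component rates
        self.w = np.asarray(weights, dtype=float)  # mixing proportions
        self.lam = lam
        self.n = 1.0                               # effective sample size (one pseudo-observation)
        self.S_count = self.w.copy()               # weighted average of E[1{z=k} | x]
        self.S_x = self.w * self.mu                # weighted average of E[1{z=k} x | x]

    def update(self, x):
        # E-step: responsibilities under the current parameters (tiny guard for underflow).
        r = self.w * poisson.pmf(x, self.mu) + 1e-300
        r /= r.sum()
        # Stochastic-approximation update of the sufficient statistics,
        # with gamma_t = 1/n_t and n_t = lam * n_{t-1} + 1.
        self.n = self.lam * self.n + 1.0
        gamma = 1.0 / self.n
        self.S_count += gamma * (r - self.S_count)
        self.S_x += gamma * (r * x - self.S_x)
        # M-step: map sufficient statistics back to parameters.
        self.w = self.S_count / self.S_count.sum()
        self.mu = self.S_x / np.maximum(self.S_count, 1e-12)
        return self.w, self.mu
```

With lam = 1 this reduces to the standard 1/t online EM; lam < 1 keeps the update temporally adaptive.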

• Slide 9/34

    Structure of talk hereon

1. The Benveniste-Sutton gradient approach

    2. Theoretical work in progress

    3. Applications

• Slide 10/34

    The Benveniste-Sutton gradient approach

    Approaches:

    vanishing learning rates

    ! unsuitable for drifting contexts

small, fixed learning rates
! choice of value is crucial for performance
! time-varying learning rates may be preferable
! theoretical determination of the optimal value begs the question

self-tuning, data-dependent methods

systematic study in least-squares contexts; various heuristics in streaming data analysis

• Slide 11/34

    The Benveniste-Sutton gradient approach

tune $\gamma_t$ in accordance with an approximate SGD step:

$\gamma_t = \gamma_{t-1} + G(\theta_{t-1}, \gamma_{t-1}, x_t)$

where $G$ is an online approximation to the gradient of the one-step-ahead loss with respect to $\gamma$

in the fixed-$\gamma$ case, recursive computation of the gradient is possible. In the time-varying $\gamma_t$ case:

treat all of $\gamma_1, \gamma_2, \ldots, \gamma_t$ as instances of the same formal variable, and differentiate $L(x_{t+1}; \theta_t)$ with respect to that variable.

Sutton's interpretation of this construction:

"derivative with respect to an infinitesimal change in [the learning rate] at all time-steps"
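A minimal sketch of this construction for the simplest case, a forgetting-factor mean under squared one-step-ahead loss (my own illustration of the idea, not the speaker's code; alpha and the clipping range are assumptions): the sensitivity of the estimator with respect to lambda is itself propagated recursively, treating every past lambda as the same formal variable.

```python
import numpy as np

class AdaptiveForgettingMean:
    """Forgetting-factor mean with a gradient-tuned lambda (Benveniste/Sutton style).
    theta_t = theta_{t-1} + (x_t - theta_{t-1}) / n_t,  n_t = lam * n_{t-1} + 1.
    The sensitivities d(theta)/d(lam) and d(n)/d(lam) are carried forward and used
    for an SGD step on the one-step-ahead squared loss."""

    def __init__(self, lam=0.95, alpha=1e-3, lam_min=0.5, lam_max=1.0):
        self.lam, self.alpha = lam, alpha
        self.lam_min, self.lam_max = lam_min, lam_max
        self.theta, self.n = 0.0, 1.0       # estimate and effective sample size
        self.dtheta, self.dn = 0.0, 0.0     # d(theta)/d(lam), d(n)/d(lam)

    def update(self, x):
        # Gradient of the one-step-ahead loss 0.5 * (x - theta_{t-1})^2 w.r.t. lambda.
        err = x - self.theta
        grad = -err * self.dtheta
        # Descent step on lambda, clipped to a sensible range.
        self.lam = float(np.clip(self.lam - self.alpha * grad,
                                 self.lam_min, self.lam_max))
        # Propagate sensitivities, treating all past lambdas as one variable:
        #   n_t = lam * n_{t-1} + 1   =>  dn_t = n_{t-1} + lam * dn_{t-1}
        #   theta_t = theta_{t-1} + err / n_t
        #     =>  dtheta_t = (1 - 1/n_t) * dtheta_{t-1} - (dn_t / n_t**2) * err
        self.dn = self.n + self.lam * self.dn
        n_new = self.lam * self.n + 1.0
        self.dtheta = (1.0 - 1.0 / n_new) * self.dtheta - (self.dn / n_new ** 2) * err
        # Update the running statistics themselves.
        self.n = n_new
        self.theta += err / n_new
        return self.theta, self.lam
```

The memory length thus responds to the prediction errors actually observed, which is the behaviour examined in the drifting-Gaussian figures that follow.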

• Slide 12/34

    The Benveniste-Sutton gradient approach

    Reaction of gradient to abrupt jump

• Slide 13/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: STATIC scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 14/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT SLOW scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 15/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT FAST scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 16/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH SLOW scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 17/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH FAST scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 18/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: STATIC scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 19/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT SLOW scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 20/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT FAST scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 21/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH SLOW scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 22/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH FAST scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 23/34

    Theoretical work

    sketch

use the theory of weak convergence of stochastic approximations

key idea: apply it to the learning-rate update $\gamma_t = \gamma_{t-1} + G(\cdot)$, rather than to the parameter update

show that the adaptive forgetting module picks the best possible estimator from the given class

proof requires stationarity of the gradient process: $\nabla_\gamma \, \mathrm{KL}(\theta^{(\gamma)}_t, x_t)$

    characterisation of applicability of the BS approximation?

    proof technique allows relaxation to near-stationarity

• Slide 24/34

    Applications

    Streaming classification for credit card fraud detection

Streaming Poisson Mixture Models for internet traffic monitoring

Streaming variable selection for short-term local wind forecasting

• Slide 25/34

    Streaming Classification

    Problem Formulation

Consider a stream of labelled data $(c_i, x_i)_{i=1,\ldots}$. Assumptions:

$c_t$ arrives after $x_t$ (but before $x_{t+1}$)

    feature vectors arrive at regular intervals

    ! Relaxing these assumptions requires arguments about

    sampling frequency and/or information fusion approach

    Objectives:

    track process and predict current class label

don't store past data

deal with drift / jumps (a minimal classifier sketch follows below)
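A minimal sketch of a streaming classifier meeting these objectives, assuming Gaussian class-conditionals with diagonal covariances and a shared forgetting factor (a simplified cousin of the QDA on the next slide; all names and default values are illustrative).

```python
import numpy as np

class StreamingGaussianClassifier:
    """Per-class running means, variances and priors with exponential forgetting.
    Predict x_t from the current estimates; update once the label c_t arrives."""

    def __init__(self, n_classes, n_features, lam=0.99):
        self.lam = lam
        self.n = np.ones(n_classes)                  # per-class effective counts
        self.mean = np.zeros((n_classes, n_features))
        self.var = np.ones((n_classes, n_features))  # diagonal covariances

    def predict(self, x):
        # Gaussian log-density per class plus a log-prior from the effective counts.
        log_prior = np.log(self.n / self.n.sum())
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.var)
                                + (x - self.mean) ** 2 / self.var, axis=1)
        return int(np.argmax(log_prior + log_lik))

    def update(self, x, c):
        # Discount every class's effective count, then credit the observed class.
        self.n *= self.lam
        self.n[c] += 1.0
        gamma = 1.0 / self.n[c]
        delta = x - self.mean[c]
        self.mean[c] += gamma * delta
        # Exponentially weighted update of the (diagonal) variance.
        self.var[c] = (1 - gamma) * self.var[c] + gamma * delta * (x - self.mean[c])
        self.var[c] = np.maximum(self.var[c], 1e-6)  # keep variances positive
```

The label lag in the formulation above fits naturally: call predict(x_t) when x_t arrives, and update(x_t, c_t) once c_t does. Nothing beyond the per-class moments is stored.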

• Slide 26/34

    Streaming Classification: some results

    Quadratic Discriminant Analysis, Credit Card Fraud Detection

fraudsters change tactics often (legitimate users not so often)

19 features, 624,440 instances

    assume that label is available with lag 1 (unrealistic)

class imbalance: use Area Under the ROC Curve (AUC)

heavy pre-processing stage (collaborative work)

• Slide 27/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 500, against the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 28/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 29/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against a randomly permuted version of the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 30/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against a randomly permuted version of the IPTRACE dataset. Panels: learnt component parameter estimates, learnt prior estimates, and the learnt forgetting factor over time.]

• Slide 31/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against the IPTRACE dataset. Panels: learnt component parameter estimates, learnt prior estimates, and the learnt forgetting factor over time.]

• Slide 32/34

    Streaming variable selection: some results

    short-term wind forecasting

Decentralised, streaming version of the graphical Lasso for covariance selection (i.e., a sparse precision matrix)

Use this as the basis of an adaptive querying algorithm

Tune both the forgetting factor and the sparsity parameter (a rough sketch of the covariance-tracking step follows below)
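A rough sketch of the covariance-tracking half of this idea (the decentralisation and adaptive querying are beyond a few lines): maintain an exponentially weighted mean and covariance, and periodically hand the covariance to a graphical-lasso solver. Here scikit-learn's graphical_lasso is assumed available, and lam, alpha, refit_every are illustrative choices.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

class StreamingGraphicalLasso:
    """Exponentially weighted mean/covariance with forgetting factor lam;
    a sparse precision matrix is re-estimated every refit_every steps."""

    def __init__(self, n_features, lam=0.99, alpha=0.1, refit_every=50):
        self.lam, self.alpha, self.refit_every = lam, alpha, refit_every
        self.n = 1.0
        self.mean = np.zeros(n_features)
        self.cov = np.eye(n_features)
        self.precision = np.eye(n_features)
        self.t = 0

    def update(self, x):
        # Forgetting-factor updates of the mean and covariance.
        self.n = self.lam * self.n + 1.0
        gamma = 1.0 / self.n
        delta = x - self.mean
        self.mean += gamma * delta
        self.cov = (1 - gamma) * self.cov + gamma * np.outer(delta, x - self.mean)
        self.t += 1
        # Periodically re-fit the sparse precision to the weighted covariance.
        if self.t % self.refit_every == 0:
            _, self.precision = graphical_lasso(self.cov, alpha=self.alpha)
        return self.precision
```

Tuning lam and alpha online, as the slide describes, would wrap this update in the gradient machinery discussed earlier.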


• Slide 33/34

    Streaming variable selection: some results

    short-term wind forecasting

Figure: Forgetting factors, sparsity parameters, and predictive performance of RLASSO-AF.


• Slide 34/34

    Conclusions

Stochastic approximation algorithms with forgetting factors / learning rates are becoming increasingly popular in the streaming data literature as a means to readily obtain streaming implementations of standard offline techniques.

Adaptive forgetting (online, self-tuning) is of interest:

one-size-fits-all approach vs bespoke approach

usually far more computationally efficient

requires a novel type of theoretical reasoning

Adaptive forgetting is not easy to get right.

    Speculation:

could forgetting be understood as robustness to model mis-specification with respect to dynamics?

    what about particle filtering?

    For more information, visit www.canagnos.co.uk
