online temporally adaptive parameter estimation with applications to streaming data analysis
TRANSCRIPT
-
8/12/2019 Online temporally adaptive parameter estimation with applications to streaming data analysis
1/34
Online temporally adaptive parameter estimation with applications to streaming data analysis
Christoforos Anagnostopoulos
Research Fellow, Statistical Laboratory, Cambridge
Professor David J. Hand (Imperial College London)
Dr Niall M. Adams (Imperial College London)
Dr David Leslie (Bristol University)
-
Streaming data analysis
Specifications: discrete-time, regular sampling $(x_i)_{i=1,\ldots}$
online (operate in real time)
drift-tolerant (handle unforeseen disturbances / data shifts)
Examples:
internet services (e.g., spam filtering, recommender systems)
retail banking (e.g., credit card fraud detection)
financial data (e.g., online portfolio optimisation)
audio-video (e.g., data mining in video sequences)
sensor networks (e.g., adaptive sensor querying, environmental monitoring, situational awareness, multiple target tracking)
-
Adapting offline tools to streaming contexts
sliding windows
Fit your model to the $w$ most recent observations. Pros/cons:
online, drift-tolerant, universally applicable, BUT
! inefficient in terms of computation and information
! unnatural sharp cut-off
! how to choose $w$?
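As a minimal sketch of the sliding-window idea, the following Python class keeps the w most recent observations and reports their mean. The class name `SlidingWindowMean` and the choice of the mean as the fitted "model" are illustrative assumptions, not part of the talk.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the w most recent observations, updated in O(1) time per step."""

    def __init__(self, w):
        if w < 1:
            raise ValueError("window length w must be at least 1")
        self.window = deque(maxlen=w)
        self.total = 0.0

    def update(self, x):
        # If the window is full, the oldest observation is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self.total / len(self.window)
```

Even in this simple case the window has to store w observations, and an observation's weight drops from 1/w to zero in a single step, which is the sharp cut-off listed among the drawbacks.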
-
Adapting offline tools to streaming contexts
learning rates
Offline learning theory perspective:
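let $L(x, \theta)$ be a loss function
target: the $\theta$ that minimises $E_X[L(x; \theta)]$
use $\sum_{i=1}^{t} L(x_i, \theta)$ as a proxy
Gradient descent:
$\theta_k := \theta_{k-1} - c \sum_{i=1}^{t} \nabla L(x_i, \theta)$ (offline)
$\theta_t := \theta_{t-1} - \gamma \nabla L(x_t, \theta)$ (online)
Fairly generic. The learning rate $\gamma$ controls the relative importance of past and novel data. Vanishing $\gamma$ yields convergence. Holding $\gamma > 0$ yields temporal adaptivity. How to choose $\gamma$?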
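To make the contrast between vanishing and fixed learning rates concrete, here is a small sketch assuming the squared loss $L(x, \theta) = \frac{1}{2}(x - \theta)^2$ and a synthetic stream whose level jumps halfway through. The function name `sgd_mean`, the stream and the rate values are hypothetical choices for illustration.

```python
def sgd_mean(stream, rate):
    """Online SGD on the squared loss L(x, theta) = 0.5 * (x - theta)**2.

    `rate` maps the time index t (starting at 1) to the learning rate.
    """
    theta = 0.0
    path = []
    for t, x in enumerate(stream, start=1):
        grad = theta - x                 # dL/dtheta at the current estimate
        theta = theta - rate(t) * grad   # descent step
        path.append(theta)
    return path


# Hypothetical stream whose level jumps from 0 to 5 halfway through.
stream = [0.0] * 200 + [5.0] * 200
vanishing = sgd_mean(stream, lambda t: 1.0 / t)   # rate 1/t: running average
fixed = sgd_mean(stream, lambda t: 0.1)           # constant rate
print(vanishing[-1], fixed[-1])
```

With the rate 1/t the estimate is the running average of all the data and ends halfway between the two levels; with the fixed rate it forgets the old level and ends close to the new one.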
-
Adapting offline tools to streaming contexts
forgetting factors
Consider again the learning theory perspective; replace the log-additive loss
$L(x_{1:(t+1)}, \theta) := L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$
with an exponentially weighted version, with forgetting factor $\lambda$:
$L(x_{1:(t+1)}, \theta) = \lambda L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$
$\lambda = 1$ yields the offline solution, $\lambda = 0$ discards the data history
there exist relationships with learning rates in certain cases
how to choose $\lambda$?
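Unrolling the exponentially weighted recursion makes the implied weighting explicit. The following is a short derivation sketch, writing $\lambda$ for the forgetting factor (the symbol is an assumption, since it is not legible in the source), starting the recursion from an empty history with zero loss, and taking $0 \le \lambda < 1$ for the geometric sum:

```latex
L_\lambda(x_{1:t}; \theta) = \sum_{i=1}^{t} \lambda^{t-i} L(x_i; \theta),
\qquad
\sum_{i=1}^{t} \lambda^{t-i} = \frac{1 - \lambda^{t}}{1 - \lambda}
\;\longrightarrow\; \frac{1}{1 - \lambda} \quad (t \to \infty).
```

The total weight therefore behaves like an effective sample size of about $1/(1-\lambda)$: for instance, $\lambda = 0.99$ corresponds to roughly 100 observations, which gives a rough way of relating a forgetting factor to a window length.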
-
How to choose $w$ / $\gamma$ / $\lambda$ (window length / learning rate / forgetting factor)?
Intuition: bias-variance tradeoff.
in practice need online, data-dependent methods
common: optimise one-step-ahead loss $L(x_{t+1}; \theta)$
Example:
at $t+1$, run three SGD updates, one for each of the learning rates $\gamma_t + c$, $\gamma_t$, $\gamma_t - c$
select among them using one-step-ahead loss
set $\gamma_{t+1}$ to the best-performing one
There are far more sophisticated choices... See later.
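The three-candidate example above can be sketched as follows. This is one possible reading, not the speaker's implementation: it assumes the squared loss, scores each candidate update on the next observation once it arrives, and clips the rate to a hypothetical range `[lo, hi]`. All function and parameter names are illustrative.

```python
def squared_loss(x, theta):
    return 0.5 * (x - theta) ** 2


def sgd_step(theta, x, rate):
    # One SGD step on the squared loss, whose gradient in theta is (theta - x).
    return theta - rate * (theta - x)


def track_with_rate_selection(stream, rate0=0.1, c=0.01, lo=0.01, hi=1.0):
    """Track a level while choosing among three learning rates at each step."""
    theta, rate, prev = 0.0, rate0, None
    thetas, rates = [], []
    for x in stream:
        if prev is not None:
            # Candidate rates: current rate plus c, unchanged, minus c (clipped).
            trial_rates = [min(hi, rate + c), rate, max(lo, rate - c)]
            # Update from the previous observation with each rate, then score
            # each candidate on the newly arrived x (one-step-ahead loss).
            scored = [(squared_loss(x, sgd_step(theta, prev, r)), r)
                      for r in trial_rates]
            # min() on (loss, rate) pairs: exact ties go to the smaller rate.
            _, rate = min(scored)
            theta = sgd_step(theta, prev, rate)
        prev = x
        thetas.append(theta)
        rates.append(rate)
    return thetas, rates
```

In this sketch the estimate lags one observation behind, because a candidate can only be scored once the following observation is available. On a stream that jumps from one level to another, the selected rate shrinks while the level is constant and grows again after the jump.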
-
Research agenda: what is really going on?
Parametric estimation, but relax the i.i.d. assumption:
$x_t \sim f(X; \theta_t)$
Assumptions about the $\theta_t$-process?
(Bayesian) dynamic modelling:
  joint distribution on $(x_t, \theta_t)_t$
  Markov assumptions (e.g., state-space model)
  $\theta_t$ is produced via inferential machinery
temporally adaptive estimation:
  produce a recursive estimator
  e.g., hack a static estimator into shape as described earlier (analogy with WML?)
  study it under weak assumptions about dynamics
  always reasonable, never optimal performance
-
Online EM for exponential family models
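Focus first on a simple average:
$\theta_t = \frac{1}{t} \sum_{i=1}^{t} x_i = \left(1 - \frac{1}{t}\right) \theta_{t-1} + \frac{1}{t} x_t$
For exponential family models, switch to sufficient statistics:
$\theta_t = \frac{1}{t} \sum_{i=1}^{t} s(x_i) = \left(1 - \frac{1}{t}\right) \theta_{t-1} + \frac{1}{t} s(x_t)$
With exponential forgetting (forgetting factor $\lambda$), $\theta_t = \frac{1}{n_t} \sum_{i=1}^{t} \lambda^{t-i} s(x_i)$, we have
$\theta_t = \theta_{t-1} + \gamma_t (s(x_t) - \theta_{t-1}), \quad \gamma_t = \frac{1}{n_t}, \quad n_t = \lambda_t n_{t-1} + 1$
(with $\lambda_t = \lambda$ when the forgetting factor is fixed)
If only a part (or a deterministic function) $v_t$ of $x_t$ is observed:
$\theta_t = \theta_{t-1} + \gamma_t (E_{\theta_{t-1}}[s(x_t) \mid v_t] - \theta_{t-1})$
The algorithm is a Robbins-Monro scheme, and still converges.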
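The exponential-forgetting recursion can be checked numerically. The sketch below assumes the fully observed case with $s(x) = x$ and a fixed forgetting factor `lam`; the function name `forgetting_mean` is a hypothetical choice.

```python
def forgetting_mean(stream, lam):
    """Exponentially weighted running mean with a fixed forgetting factor lam."""
    n, theta = 0.0, 0.0
    path = []
    for x in stream:
        n = lam * n + 1.0              # effective sample size n_t
        gamma = 1.0 / n                # implied learning rate gamma_t
        theta += gamma * (x - theta)   # recursive update of the estimate
        path.append(theta)
    return path


xs = [1.0, 4.0, 2.0, 8.0]
lam = 0.9
# Direct evaluation of the weighted average, for comparison with the recursion.
weights = [lam ** (len(xs) - 1 - i) for i in range(len(xs))]
direct = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
recursive = forgetting_mean(xs, lam)[-1]
```

Here `direct` and `recursive` agree up to floating-point error. Setting `lam = 1.0` recovers the plain running average, and `lam = 0.0` returns the most recent observation.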
-
Structure of talk hereon
1. The Benveniste-Sutton gradient approach
2. Theoretical work in progress
3. Applications
-
The Benveniste-Sutton gradient approach
Approaches:
vanishing learning rates
! unsuitable for drifting contexts
small, fixed learning rates
! choice of value is crucial for performance
! time-varying learning rates may be preferable
! theoretical determination of optimal value begs the question
self-tuning data-dependent methods
  systematic study in least squares contexts
  various heuristics in streaming data analysis
-
The Benveniste-Sutton gradient approach
tune the tuning parameter $\lambda_t$ (forgetting factor or learning rate) in accordance with an approximate SGD step:
$\lambda_t = \lambda_{t-1} - G(\cdot)$
where $G(\cdot)$, which depends on the current estimates and on $x_t$, is an online approximation to the gradient of the one-step-ahead loss with respect to $\lambda$
in the fixed-$\lambda$ case, recursive computation of the gradient is possible. In the time-varying $\lambda_t$ case:
"treat all of $\lambda_1, \lambda_2, \ldots, \lambda_t$ as instances of the same formal variable, and differentiate $L(x_{t+1}; \theta_t)$ with respect to that variable."
Sutton's interpretation of this construction:
"derivative with respect to an infinitesimal change in [the learning rate] at all time-steps"
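A minimal sketch in the spirit of this construction is given below, under strong simplifying assumptions: a scalar mean tracked with exponential forgetting, the squared one-step-ahead loss, a hypothetical step size `alpha`, and clipping of the forgetting factor to `[lam_min, lam_max]`. The derivative recursions treat all past forgetting factors as one formal variable. This is an illustration, not the exact scheme used in the talk.

```python
def adaptive_forgetting_mean(stream, alpha=0.001, lam0=0.95,
                             lam_min=0.6, lam_max=1.0):
    """Track a mean while tuning the forgetting factor by a gradient step."""
    lam = lam0
    n = m = 0.0           # effective sample size and weighted sum
    dn = dm = 0.0         # their formal derivatives with respect to lambda
    theta = dtheta = 0.0  # estimate m / n and its derivative
    thetas, lams = [], []
    for x in stream:
        if n > 0.0:
            # Gradient of the one-step-ahead loss 0.5 * (x - theta)**2
            # with respect to lambda, via the chain rule through theta.
            grad = -(x - theta) * dtheta
            lam = min(lam_max, max(lam_min, lam - alpha * grad))
        # Differentiate n_t = lam * n_{t-1} + 1 and m_t = lam * m_{t-1} + x_t;
        # the derivative updates must use the pre-update values of n and m.
        dn = n + lam * dn
        dm = m + lam * dm
        n = lam * n + 1.0
        m = lam * m + x
        theta = m / n
        dtheta = (dm - theta * dn) / n   # quotient rule for theta = m / n
        thetas.append(theta)
        lams.append(lam)
    return thetas, lams
```

On a noise-free stream that jumps from 0 to 5, the forgetting factor in this sketch stays at its initial value before the jump and is pushed downwards afterwards, so that the stale history is discarded more quickly.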
-
The Benveniste-Sutton gradient approach
Reaction of gradient to abrupt jump
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, STATIC scenario. Top panel: forgetting factor against time (0 to 1000). Bottom panel: KL(estimated, true) against time. Legend: ensemble average +/- 1 standard deviation path; typical path.]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, ABRUPT SLOW scenario. Top panel: forgetting factor against time (0 to 1000). Bottom panel: KL(estimated, true) against time. Legend: ensemble average +/- 1 standard deviation path; typical path.]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, ABRUPT FAST scenario. Top panel: forgetting factor against time (0 to 1000). Bottom panel: KL(estimated, true) against time. Legend: ensemble average +/- 1 standard deviation path; typical path.]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, SMOOTH SLOW scenario. Top panel: forgetting factor against time (0 to 1000). Bottom panel: KL(estimated, true) against time. Legend: ensemble average +/- 1 standard deviation path; typical path.]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, SMOOTH FAST scenario. Top panel: forgetting factor against time (0 to 1000). Bottom panel: KL(estimated, true) against time. Legend: ensemble average +/- 1 standard deviation path; typical path.]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, STATIC scenario. KL(estimated, true) for AF(0.0001), FF(1), FF(0.99), FF(0.98), FF(0.96), FF(0.94), FF(0.92), FF(0.9), FF(0.88), FF(0.86), FF(0.84), FF(0.82) and FF(0.8). Legend: mean; +/- 1 standard deviation (ensemble); +/- 1 standard deviation (temporal).]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, ABRUPT SLOW scenario. KL(estimated, true) for AF(0.0001), FF(1), FF(0.99), FF(0.98), FF(0.96), FF(0.94), FF(0.92), FF(0.9), FF(0.88), FF(0.86), FF(0.84), FF(0.82) and FF(0.8). Legend: mean; +/- 1 standard deviation (ensemble); +/- 1 standard deviation (temporal).]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, ABRUPT FAST scenario. KL(estimated, true) for AF(0.0001), FF(1), FF(0.99), FF(0.98), FF(0.96), FF(0.94), FF(0.92), FF(0.9), FF(0.88), FF(0.86), FF(0.84), FF(0.82) and FF(0.8). Legend: mean; +/- 1 standard deviation (ensemble); +/- 1 standard deviation (temporal).]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, SMOOTH SLOW scenario. KL(estimated, true) for AF(0.0001), FF(1), FF(0.99), FF(0.98), FF(0.96), FF(0.94), FF(0.92), FF(0.9), FF(0.88), FF(0.86), FF(0.84), FF(0.82) and FF(0.8). Legend: mean; +/- 1 standard deviation (ensemble); +/- 1 standard deviation (temporal).]
-
The Benveniste-Sutton gradient approach
Drifting Gaussians
[Figure, SMOOTH FAST scenario. KL(estimated, true) for AF(0.0001), FF(1), FF(0.99), FF(0.98), FF(0.96), FF(0.94), FF(0.92), FF(0.9), FF(0.88), FF(0.86), FF(0.84), FF(0.82) and FF(0.8). Legend: mean; +/- 1 standard deviation (ensemble); +/- 1 standard deviation (temporal).]
-
Theoretical work
sketch
use weak convergence of stochastic approximations theory
key idea: apply it on the learning rate update $\lambda_t = \lambda_{t-1} - G(\cdot)$ rather than the parameter update
show that the adaptive forgetting module picks the best possible estimator from the given class
proof requires stationarity of the gradient process: $\nabla_\lambda \mathrm{KL}(\theta_t^{(\lambda)}, x_t)$
characterisation of applicability of the BS approximation?
proof technique allows relaxation to near-stationarity
-
Applications
Streaming classification for credit card fraud detection
Streaming Poisson Mixture Models for internet traffic monitoring
Streaming variable selection for short-term local wind
forecasting
-
Streaming Classification
Problem Formulation
Consider a stream of labelled data $(c_i, x_i)_{i=1,\ldots}$. Assumptions:
$c_t$ arrives after $x_t$ (but before $x_{t+1}$)
feature vectors arrive at regular intervals
! Relaxing these assumptions requires arguments about sampling frequency and/or information fusion approach
Objectives:
track process and predict current class label
don't store past data
deal with drift / jumps
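The predict-then-update protocol implied by these assumptions can be sketched with a deliberately simple classifier. The example below is an assumption-laden illustration: it uses a nearest-class-mean rule with a fixed forgetting factor rather than the discriminant analysis discussed in the talk, and the class name `StreamingCentroidClassifier` is hypothetical.

```python
class StreamingCentroidClassifier:
    """Nearest-class-mean classifier whose class means forget old data."""

    def __init__(self, lam=0.95):
        self.lam = lam
        self.stats = {}   # label -> (effective sample size, mean vector)

    def predict(self, x):
        # Called when the feature vector x_t arrives, before its label.
        if not self.stats:
            return None

        def sq_dist(mean):
            return sum((a - b) ** 2 for a, b in zip(x, mean))

        return min(self.stats, key=lambda c: sq_dist(self.stats[c][1]))

    def update(self, c, x):
        # Called once the label c_t for x_t has arrived; no past data is kept.
        n, mean = self.stats.get(c, (0.0, [0.0] * len(x)))
        n = self.lam * n + 1.0
        mean = [m + (xi - m) / n for m, xi in zip(mean, x)]
        self.stats[c] = (n, mean)
```

A typical loop calls `predict(x)` when a feature vector arrives and `update(c, x)` when its label follows. In this sketch a class only forgets when a new observation of that class is seen, which is one of several possible design choices.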
-
Streaming Classification: some results
Quadratic Discriminant Analysis, Credit Card Fraud Detection
fraudsters change tactics often (legal users not so often)
19 features, 624,440 instances
assume that label is available with lag 1 (unrealistic)
class imbalance: use Area Under the ROC Curve (AUC)
heavy pre-processing stage (collaborative work)
-
Streaming PMM: some results
internet traffic data
[Figure. Top panel: learnt component parameter estimates for components 1 and 2 against time (0 to 10 x 10^4), offline and online. Bottom panel: learnt prior estimates for components 1 and 2 against time, offline and online.]
Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 500, against the IPTRACE dataset.
-
Streaming PMM: some results
internet traffic data
[Figure. Top panel: learnt component parameter estimates for components 1 and 2 against time (0 to 10 x 10^4), offline and online. Bottom panel: learnt prior estimates for components 1 and 2 against time, offline and online.]
Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against the IPTRACE dataset.
-
Streaming PMM: some results
internet traffic data
[Figure. Top panel: learnt component parameter estimates for components 1 and 2 against time (0 to 10 x 10^4), offline and online. Bottom panel: learnt prior estimates for components 1 and 2 against time, offline and online.]
Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against a randomly permuted version of the IPTRACE dataset.
-
Streaming PMM: some results
internet traffic data
[Figure. Top panel: learnt component parameter estimates for components 1 and 2 against time (0 to 10 x 10^4), offline and online. Middle panel: learnt prior estimates for components 1 and 2 against time, offline and online. Bottom panel: learnt forgetting factor against time.]
Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against a randomly permuted version of the IPTRACE dataset.
-
Streaming PMM: some results
internet traffic data
[Figure. Top panel: learnt component parameter estimates for components 1 and 2 against time (0 to 10 x 10^4), offline and online. Middle panel: learnt prior estimates for components 1 and 2 against time, offline and online. Bottom panel: learnt forgetting factor against time.]
Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against the IPTRACE dataset.
-
Streaming variable selection: some results
short-term wind forecasting
Decentralised, streaming version of graphical Lasso for covariance selection (i.e., sparse precision matrix)
Use this as the basis of an adaptive querying algorithm
Tune both the forgetting factor and the sparsity parameter
-
Streaming variable selection: some results
short-term wind forecasting
Figure: Forgetting factors, sparsity parameters, and predictive performance of RLASSO-AF.
-
Conclusions
Stochastic approximation algorithms with forgetting factors / learning rates are becoming increasingly popular in the streaming data literature as a means to readily obtain streaming implementations of standard offline techniques
Adaptive forgetting (online, self-tuning) is of interest:
  one-size-fits-all approach vs bespoke approach
  usually far more computationally efficient
  requires novel type of theoretical reasoning
Adaptive forgetting is not easy to get right.
Speculation:
could forgetting be understood as robustness to model
mis-specification with respect to dynamics?
what about particle filtering?
For more information, visit www.canagnos.co.uk