Online temporally adaptive parameter estimation with applications to streaming data analysis


• Slide 1/34

Online temporally adaptive parameter estimation with applications to streaming data analysis

Christoforos Anagnostopoulos, Research Fellow, Statistical Laboratory, Cambridge

Professor David J. Hand (Imperial College London)
Dr Niall M. Adams (Imperial College London)
Dr David Leslie (Bristol University)

• Slide 2/34

    Streaming data analysis

Specifications: discrete-time, regular sampling $(x_i)_{i=1,\ldots}$

    online (operate in real time)

    drift-tolerant (handle unforeseen disturbances / data shifts)

    Examples:

internet services (e.g., spam filtering, recommender systems)

    retail banking (e.g., credit card fraud detection)

    financial data (e.g., online portfolio optimisation)

    audio-video (e.g., data mining in video sequences)

sensor networks (e.g., adaptive sensor querying, environmental monitoring, situational awareness, multiple target tracking)

• Slide 3/34

    Adapting offline tools to streaming contexts

    sliding windows

Fit your model to the $w$ most recent observations (a minimal sketch follows below). Pros/cons:

    online, drift-tolerant, universally applicable, BUT

    ! inefficient in terms of computation and information

    ! unnatural sharp cut-off

! how to choose $w$?
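As a concrete baseline, a minimal sketch of the window approach for a running mean (the running-mean choice and names are mine, for illustration): memory is O(w) per model, and the estimate jumps whenever an old point leaves the window.

```python
from collections import deque

class SlidingWindowMean:
    """Mean of the w most recent observations."""

    def __init__(self, w):
        self.w = w                    # window length: the key tuning choice
        self.buffer = deque(maxlen=w)
        self.total = 0.0

    def update(self, x):
        if len(self.buffer) == self.w:
            self.total -= self.buffer[0]  # drop the oldest point (sharp cut-off)
        self.buffer.append(x)
        self.total += x
        return self.total / len(self.buffer)
```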

• Slide 4/34

    Adapting offline tools to streaming contexts

    learning rates

Offline learning theory perspective:

let $L(x, \theta)$ be a loss function

target $\theta^*$ that minimises $E_X[L(x; \theta)]$

use $\sum_{i=1}^{t} L(x_i, \theta)$ as a proxy

Gradient descent:

$\theta_k := \theta_{k-1} + c \, \nabla_\theta \sum_{i=1}^{t} L(x_i, \theta)$  (offline)

$\theta_t := \theta_{t-1} + \gamma \, \nabla_\theta L(x_t, \theta)$  (online)

Fairly generic. The learning rate $\gamma$ controls the relative importance of past and novel data. A vanishing $\gamma_t$ yields convergence; holding $\gamma > 0$ yields temporal adaptivity. How to choose $\gamma$?
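A worked special case, added here for concreteness (not on the slide): take the squared loss and step against its gradient, so the link between a fixed rate and exponential weighting becomes explicit:

$L(x; \theta) = \tfrac{1}{2}(x - \theta)^2, \qquad \theta_t := \theta_{t-1} + \gamma \, (x_t - \theta_{t-1}),$

which unrolls to

$\theta_t = (1 - \gamma)\,\theta_{t-1} + \gamma\, x_t = \gamma \sum_{i=1}^{t} (1-\gamma)^{t-i} x_i + (1-\gamma)^t \theta_0.$

A vanishing rate $\gamma_t = 1/t$ recovers the running mean (convergence), while a fixed $\gamma > 0$ gives an exponentially weighted average with effective memory of order $1/\gamma$ (temporal adaptivity).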

• Slide 5/34

    Adapting offline tools to streaming contexts

    forgetting factors

Consider again the learning theory perspective, and replace the log-additive loss

$L(x_{1:(t+1)}; \theta) := L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$

with an exponentially weighted version:

$L(x_{1:(t+1)}; \theta) := \lambda \, L(x_{1:t}; \theta) + L(x_{t+1}; \theta)$

$\lambda = 1$ yields the offline solution, $\lambda = 0$ discards the data history

there exist relationships with learning rates in certain cases (one such relationship is worked out below)

how to choose $\lambda$?
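One such relationship, worked out here for the squared loss (a derivation added for clarity; the quantities $n_t$ and $\gamma_t$ reappear on the online-EM slide below): minimising the exponentially weighted loss

$\sum_{i=1}^{t} \lambda^{t-i} \, \tfrac{1}{2}(x_i - \theta)^2$

gives $\theta_t = \frac{1}{n_t} \sum_{i=1}^{t} \lambda^{t-i} x_i$ with $n_t = \sum_{i=1}^{t} \lambda^{t-i} = \lambda n_{t-1} + 1$, or, recursively,

$\theta_t = \theta_{t-1} + \gamma_t (x_t - \theta_{t-1}), \qquad \gamma_t = \frac{1}{n_t}.$

Thus $\lambda = 1$ gives the vanishing rate $\gamma_t = 1/t$ (the running mean), while $\lambda < 1$ makes $\gamma_t \to 1 - \lambda > 0$, i.e. a non-vanishing learning rate.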

• Slide 6/34

How to choose $w$ / $\gamma$ / $\lambda$?

Intuition: bias-variance tradeoff.

in practice we need online, data-dependent methods

common: optimise the one-step-ahead loss $L(x_{t+1}; \theta_t)$

Example:

at time $t+1$, run three SGD updates, one for each of $\gamma_t + c$, $\gamma_t$, $\gamma_t - c$

select among them using the one-step-ahead loss

set $\gamma_{t+1}$ to the best-performing value (a minimal sketch of this heuristic follows below)

    There are far more sophisticated choices... See later.
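A minimal sketch of the example above, assuming a scalar mean tracked by SGD under squared one-step-ahead loss; the class and parameter names are illustrative, not from the talk.

```python
import numpy as np

class BestOfThreeRate:
    """Track a mean by SGD; at each step compare three candidate rates
    (gamma - c, gamma, gamma + c) on their one-step-ahead squared loss."""

    def __init__(self, gamma=0.1, c=0.01, lo=1e-3, hi=1.0):
        self.gamma, self.c, self.lo, self.hi = gamma, c, lo, hi
        self.theta = 0.0        # current estimate
        self.candidates = None  # (rate, estimate) pairs awaiting evaluation

    def update(self, x):
        # Score yesterday's three candidates on today's observation ...
        if self.candidates is not None:
            losses = [(x - th) ** 2 for _, th in self.candidates]
            self.gamma, self.theta = self.candidates[int(np.argmin(losses))]
        # ... then propose three new candidates around the chosen rate.
        rates = [float(np.clip(self.gamma + d, self.lo, self.hi))
                 for d in (-self.c, 0.0, self.c)]
        self.candidates = [(g, self.theta + g * (x - self.theta)) for g in rates]
        return self.theta
```

The same pattern applies verbatim to a window length $w$ or a forgetting factor $\lambda$; only the candidate values and the underlying update change.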

• Slide 7/34

    Research agenda: what is really going on?

Parametric estimation, but relax the i.i.d. assumption:

$x_t \sim f(x; \theta_t)$

Assumptions about the $\theta_t$-process?

(Bayesian) dynamic modelling: joint distribution on $(x_t, \theta_t)_t$; Markov assumptions (e.g., state-space model); $\theta_t$ is produced via inferential machinery

temporally adaptive estimation: produce a recursive estimator, e.g., hack a static estimator into shape as described earlier (analogy with WML?); study it under weak assumptions about the dynamics; always reasonable, never optimal performance

• Slide 8/34

    Online EM for exponential family models

Focus first on a simple average:

$\theta_t = \frac{1}{t} \sum_{i=1}^{t} x_i = \Big(1 - \frac{1}{t}\Big)\theta_{t-1} + \frac{1}{t}\, x_t$

For exponential family models, switch to sufficient statistics:

$\theta_t = \frac{1}{t} \sum_{i=1}^{t} s(x_i) = \Big(1 - \frac{1}{t}\Big)\theta_{t-1} + \frac{1}{t}\, s(x_t)$

With exponential forgetting, $\theta_t = \frac{1}{n_t} \sum_{i=1}^{t} \lambda^{t-i} s(x_i)$, we have

$\theta_t = \theta_{t-1} + \gamma_t \big(s(x_t) - \theta_{t-1}\big), \qquad \gamma_t = \frac{1}{n_t}, \qquad n_t = \lambda n_{t-1} + 1$

If only a part (or a deterministic function) $v_t$ of $x_t$ is observed:

$\theta_t = \theta_{t-1} + \gamma_t \big(E_{\theta_{t-1}}[\, s(x_t) \mid v_t \,] - \theta_{t-1}\big)$

The algorithm is a Robbins-Monro scheme, and still converges.
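To make the partially observed case concrete, here is a minimal sketch of an exponentially weighted online EM for a two-component Poisson mixture, in the spirit of the Streaming PMM application later in the deck but written from scratch for illustration; the class name, initial values and the choice of lam are assumptions.

```python
import numpy as np
from scipy.stats import poisson

class OnlinePoissonMixtureEM:
    """Online EM with forgetting factor lam for a K-component Poisson mixture.
    Exponentially weighted sufficient statistics per component:
    responsibility mass and responsibility-weighted sum of x."""

    def __init__(self, mus, weights, lam=0.99):
        self.mu = np.asarray(mus, dtype=float)     # component rates
        self.w = np.asarray(weights, dtype=float)  # mixing proportions
        self.lam = lam
        self.n = 1.0                               # effective sample size (one pseudo-observation)
        self.S_count = self.w.copy()               # weighted average of E[1{z=k} | x]
        self.S_x = self.w * self.mu                # weighted average of E[1{z=k} x | x]

    def update(self, x):
        # E-step: responsibilities under the current parameters (tiny guard for underflow).
        r = self.w * poisson.pmf(x, self.mu) + 1e-300
        r /= r.sum()
        # Stochastic-approximation update of the sufficient statistics,
        # with gamma_t = 1/n_t and n_t = lam * n_{t-1} + 1.
        self.n = self.lam * self.n + 1.0
        gamma = 1.0 / self.n
        self.S_count += gamma * (r - self.S_count)
        self.S_x += gamma * (r * x - self.S_x)
        # M-step: map sufficient statistics back to parameters.
        self.w = self.S_count / self.S_count.sum()
        self.mu = self.S_x / np.maximum(self.S_count, 1e-12)
        return self.w, self.mu
```

With lam = 1 this reduces to the standard 1/t online EM; lam < 1 keeps the update temporally adaptive.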

• Slide 9/34

    Structure of talk hereon

1. The Benveniste-Sutton gradient approach

    2. Theoretical work in progress

    3. Applications

• Slide 10/34

    The Benveniste-Sutton gradient approach

    Approaches:

    vanishing learning rates

    ! unsuitable for drifting contexts

small, fixed learning rates
! choice of value is crucial for performance
! time-varying learning rates may be preferable
! theoretical determination of the optimal value begs the question

self-tuning, data-dependent methods

systematic study in least-squares contexts; various heuristics in streaming data analysis

• Slide 11/34

    The Benveniste-Sutton gradient approach

tune $\gamma_t$ in accordance with an approximate SGD step:

$\gamma_t = \gamma_{t-1} + G(\theta_{t-1}, \gamma_{t-1}, x_t)$

where $G$ is an online approximation to the gradient of the one-step-ahead loss with respect to $\gamma$

in the fixed-$\gamma$ case, recursive computation of the gradient is possible. In the time-varying $\gamma_t$ case:

treat all of $\gamma_1, \gamma_2, \ldots, \gamma_t$ as instances of the same formal variable, and differentiate $L(x_{t+1}; \theta_t)$ with respect to that variable.

Sutton's interpretation of this construction:

"derivative with respect to an infinitesimal change in [the learning rate] at all time-steps"
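A minimal sketch of this construction for the simplest case, a forgetting-factor mean under squared one-step-ahead loss (my own illustration of the idea, not the speaker's code; alpha and the clipping range are assumptions): the sensitivity of the estimator with respect to lambda is itself propagated recursively, treating every past lambda as the same formal variable.

```python
import numpy as np

class AdaptiveForgettingMean:
    """Forgetting-factor mean with a gradient-tuned lambda (Benveniste/Sutton style).
    theta_t = theta_{t-1} + (x_t - theta_{t-1}) / n_t,  n_t = lam * n_{t-1} + 1.
    The sensitivities d(theta)/d(lam) and d(n)/d(lam) are carried forward and used
    for an SGD step on the one-step-ahead squared loss."""

    def __init__(self, lam=0.95, alpha=1e-3, lam_min=0.5, lam_max=1.0):
        self.lam, self.alpha = lam, alpha
        self.lam_min, self.lam_max = lam_min, lam_max
        self.theta, self.n = 0.0, 1.0       # estimate and effective sample size
        self.dtheta, self.dn = 0.0, 0.0     # d(theta)/d(lam), d(n)/d(lam)

    def update(self, x):
        # Gradient of the one-step-ahead loss 0.5 * (x - theta_{t-1})^2 w.r.t. lambda.
        err = x - self.theta
        grad = -err * self.dtheta
        # Descent step on lambda, clipped to a sensible range.
        self.lam = float(np.clip(self.lam - self.alpha * grad,
                                 self.lam_min, self.lam_max))
        # Propagate sensitivities, treating all past lambdas as one variable:
        #   n_t = lam * n_{t-1} + 1   =>  dn_t = n_{t-1} + lam * dn_{t-1}
        #   theta_t = theta_{t-1} + err / n_t
        #     =>  dtheta_t = (1 - 1/n_t) * dtheta_{t-1} - (dn_t / n_t**2) * err
        self.dn = self.n + self.lam * self.dn
        n_new = self.lam * self.n + 1.0
        self.dtheta = (1.0 - 1.0 / n_new) * self.dtheta - (self.dn / n_new ** 2) * err
        # Update the running statistics themselves.
        self.n = n_new
        self.theta += err / n_new
        return self.theta, self.lam
```

The memory length thus responds to the prediction errors actually observed, which is the behaviour examined in the drifting-Gaussian figures that follow.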

• Slide 12/34

    The Benveniste-Sutton gradient approach

    Reaction of gradient to abrupt jump

• Slide 13/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: STATIC scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 14/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT SLOW scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 15/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT FAST scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 16/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH SLOW scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 17/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH FAST scenario. Top panel: forgetting factor over time; bottom panel: KL(estimated, true) over time. Shown: ensemble average +/- 1 standard deviation path, and a typical path.]

• Slide 18/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: STATIC scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 19/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT SLOW scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 20/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: ABRUPT FAST scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 21/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH SLOW scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 22/34

    The Benveniste-Sutton gradient approach

    Drifting Gaussians

[Figure: SMOOTH FAST scenario. KL(estimated, true) for AF(0.0001) and fixed forgetting factors FF(1) through FF(0.8): mean, +/- 1 standard deviation (ensemble), and +/- 1 standard deviation (temporal).]

• Slide 23/34

    Theoretical work

    sketch

use the theory of weak convergence of stochastic approximations

key idea: apply it to the learning-rate update $\gamma_t = \gamma_{t-1} + G(\cdot)$, rather than to the parameter update

show that the adaptive forgetting module picks the best possible estimator from the given class

proof requires stationarity of the gradient process: $\nabla_\gamma \, \mathrm{KL}(\theta^{(\gamma)}_t, x_t)$

    characterisation of applicability of the BS approximation?

    proof technique allows relaxation to near-stationarity

• Slide 24/34

    Applications

    Streaming classification for credit card fraud detection

Streaming Poisson Mixture Models for internet traffic monitoring

Streaming variable selection for short-term local wind forecasting

• Slide 25/34

    Streaming Classification

    Problem Formulation

Consider a stream of labelled data $(c_i, x_i)_{i=1,\ldots}$. Assumptions:

$c_t$ arrives after $x_t$ (but before $x_{t+1}$)

    feature vectors arrive at regular intervals

    ! Relaxing these assumptions requires arguments about

    sampling frequency and/or information fusion approach

    Objectives:

    track process and predict current class label

don't store past data

deal with drift / jumps (a minimal classifier sketch follows below)
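A minimal sketch of a streaming classifier meeting these objectives, assuming Gaussian class-conditionals with diagonal covariances and a shared forgetting factor (a simplified cousin of the QDA on the next slide; all names and default values are illustrative).

```python
import numpy as np

class StreamingGaussianClassifier:
    """Per-class running means, variances and priors with exponential forgetting.
    Predict x_t from the current estimates; update once the label c_t arrives."""

    def __init__(self, n_classes, n_features, lam=0.99):
        self.lam = lam
        self.n = np.ones(n_classes)                  # per-class effective counts
        self.mean = np.zeros((n_classes, n_features))
        self.var = np.ones((n_classes, n_features))  # diagonal covariances

    def predict(self, x):
        # Gaussian log-density per class plus a log-prior from the effective counts.
        log_prior = np.log(self.n / self.n.sum())
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.var)
                                + (x - self.mean) ** 2 / self.var, axis=1)
        return int(np.argmax(log_prior + log_lik))

    def update(self, x, c):
        # Discount every class's effective count, then credit the observed class.
        self.n *= self.lam
        self.n[c] += 1.0
        gamma = 1.0 / self.n[c]
        delta = x - self.mean[c]
        self.mean[c] += gamma * delta
        # Exponentially weighted update of the (diagonal) variance.
        self.var[c] = (1 - gamma) * self.var[c] + gamma * delta * (x - self.mean[c])
        self.var[c] = np.maximum(self.var[c], 1e-6)  # keep variances positive
```

The label lag in the formulation above fits naturally: call predict(x_t) when x_t arrives, and update(x_t, c_t) once c_t does. Nothing beyond the per-class moments is stored.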

• Slide 26/34

    Streaming Classification: some results

    Quadratic Discriminant Analysis, Credit Card Fraud Detection

fraudsters change tactics often (legitimate users not so often)

19 features, 624,440 instances

    assume that label is available with lag 1 (unrealistic)

class imbalance: use Area Under the ROC Curve (AUC)

heavy pre-processing stage (collaborative work)

• Slide 27/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 500, against the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 28/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 29/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-F1.0 (dotted lines) with block size m = 10, against a randomly permuted version of the IPTRACE dataset. Top panel: learnt component parameter estimates over time; bottom panel: learnt prior estimates over time.]

• Slide 30/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against a randomly permuted version of the IPTRACE dataset. Panels: learnt component parameter estimates, learnt prior estimates, and the learnt forgetting factor over time.]

• Slide 31/34

    Streaming PMM: some results

    internet traffic data

[Figure: Offline EM (solid lines) and SPMM-AF (dotted lines) with block size 10, against the IPTRACE dataset. Panels: learnt component parameter estimates, learnt prior estimates, and the learnt forgetting factor over time.]

• Slide 32/34

    Streaming variable selection: some results

    short-term wind forecasting

Decentralised, streaming version of the graphical Lasso for covariance selection (i.e., a sparse precision matrix)

Use this as the basis of an adaptive querying algorithm

Tune both the forgetting factor and the sparsity parameter (a rough sketch of the covariance-tracking step follows below)
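A rough sketch of the covariance-tracking half of this idea (the decentralisation and adaptive querying are beyond a few lines): maintain an exponentially weighted mean and covariance, and periodically hand the covariance to a graphical-lasso solver. Here scikit-learn's graphical_lasso is assumed available, and lam, alpha, refit_every are illustrative choices.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

class StreamingGraphicalLasso:
    """Exponentially weighted mean/covariance with forgetting factor lam;
    a sparse precision matrix is re-estimated every refit_every steps."""

    def __init__(self, n_features, lam=0.99, alpha=0.1, refit_every=50):
        self.lam, self.alpha, self.refit_every = lam, alpha, refit_every
        self.n = 1.0
        self.mean = np.zeros(n_features)
        self.cov = np.eye(n_features)
        self.precision = np.eye(n_features)
        self.t = 0

    def update(self, x):
        # Forgetting-factor updates of the mean and covariance.
        self.n = self.lam * self.n + 1.0
        gamma = 1.0 / self.n
        delta = x - self.mean
        self.mean += gamma * delta
        self.cov = (1 - gamma) * self.cov + gamma * np.outer(delta, x - self.mean)
        self.t += 1
        # Periodically re-fit the sparse precision to the weighted covariance.
        if self.t % self.refit_every == 0:
            _, self.precision = graphical_lasso(self.cov, alpha=self.alpha)
        return self.precision
```

Tuning lam and alpha online, as the slide describes, would wrap this update in the gradient machinery discussed earlier.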


• Slide 33/34

    Streaming variable selection: some results

    short-term wind forecasting

Figure: Forgetting factors, sparsity parameters, and predictive performance of RLASSO-AF.


• Slide 34/34

    Conclusions

Stochastic approximation algorithms with forgetting factors / learning rates are becoming increasingly popular in the streaming data literature as a means to readily obtain streaming implementations of standard offline techniques.

Adaptive forgetting (online, self-tuning) is of interest:

one-size-fits-all approach vs bespoke approach

usually far more computationally efficient

requires a novel type of theoretical reasoning

Adaptive forgetting is not easy to get right.

    Speculation:

could forgetting be understood as robustness to model mis-specification with respect to dynamics?

    what about particle filtering?

    For more information, visit www.canagnos.co.uk
