
Gradient boosting for multi-step forecasting

Souhaib Ben Taieb, Machine Learning Group

Rob J Hyndman, Business & Economic Forecasting Unit

Multi-step time series forecasting

• Goal

$$\{y_1, \dots, y_T\} \xrightarrow{\text{multi-step forecasting}} \{y_{T+1}, \dots, y_{T+H}\}$$

• Error measure

$$\mathrm{MSE}_h = E\left[(y_{t+h} - \hat{y}_{t+h})^2\right]$$

• Autoregressive process

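$$y_t = f(y_{t-1}, \dots, y_{t-d+1}) + \varepsilon_t$$

where $f$ is the function, $d$ the embedding dimension and $\varepsilon_t$ the error, with $E[\varepsilon_t] = 0$ and $E[\varepsilon_t^2] = \sigma^2$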

• Task

Estimate $\mu_{t+h|t} = E[y_{t+h} \mid y_t, \dots, y_{t-d}]$ for $h \in \{1, \dots, H\}$
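The estimation task works on pairs of a lagged input vector and a target that lies h steps ahead. A minimal sketch of how such pairs could be built, assuming the input at time t holds the d most recent values (most recent first); the function name `embed` and the toy series are illustrative and not taken from the slides:

```python
def embed(series, d, h):
    """Return (inputs, targets) for horizon h with embedding dimension d.

    Each input holds the d most recent values up to time t (most recent
    first); the matching target is the value observed h steps later.
    """
    inputs, targets = [], []
    for t in range(d - 1, len(series) - h):
        inputs.append([series[t - j] for j in range(d)])
        targets.append(series[t + h])
    return inputs, targets


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, y = embed(series, d=2, h=2)
print(X)  # [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
print(y)  # [4.0, 5.0, 6.0]
```

Any regression method fitted on these pairs gives an estimate of the conditional mean at horizon h.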

Two main forecasting strategies

Recursive strategy

$$y_t = m(y_{t-1}, \dots, y_{t-p}; \theta) + e_t$$

• Learning step

$\{y_1, \dots, y_T\} \to p$ and $\theta$

• Forecasting steps

– $\hat{y}_{T+1} = m^{(1)}(y_T, \dots, y_{T-p+1}; \theta)$

– ...

– $\hat{y}_{T+h} = m^{(h)}(y_T, \dots, y_{T-p+1}; \theta)$

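Direct strategy

$$y_t = m_h(y_{t-h}, \dots, y_{t-h-p_h+1}; \theta_h) + e_{t,h}$$

• Learning step

$\{y_1, \dots, y_T\} \to \{\{p_1, \theta_1\}, \dots, \{p_H, \theta_H\}\}$

• Forecasting steps

– $\hat{y}_{T+1} = m_1(y_T, \dots, y_{T-p_1+1}; \theta_1)$

– ...

– $\hat{y}_{T+h} = m_h(y_T, \dots, y_{T-p_h+1}; \theta_h)$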
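The difference between the two strategies can be sketched in a few lines. This is an illustration only, assuming a single lag (p = 1) and a linear model fitted by least squares; the helper names `fit_line`, `recursive_forecasts` and `direct_forecasts` are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def recursive_forecasts(series, H):
    """Fit ONE one-step model and iterate it H times."""
    a, b = fit_line(series[:-1], series[1:])
    forecasts, last = [], series[-1]
    for _ in range(H):
        last = a + b * last  # the previous forecast becomes the next input
        forecasts.append(last)
    return forecasts


def direct_forecasts(series, H):
    """Fit a SEPARATE model for every horizon h = 1, ..., H."""
    forecasts = []
    for h in range(1, H + 1):
        a, b = fit_line(series[:-h], series[h:])
        forecasts.append(a + b * series[-1])
    return forecasts


# Noise-free toy series y_t = 0.5 * y_{t-1}
series = [64.0 * 0.5 ** t for t in range(8)]
print(recursive_forecasts(series, 3))
print(direct_forecasts(series, 3))
```

On this noise-free linear series both strategies agree; they generally differ once the one-step model is misspecified or the sample is short, because the recursive strategy reuses its own forecasts as inputs while the direct strategy estimates H separate models.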

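The rectify strategy

$$y_t = \underbrace{g^{(h)}(y_{t-h}, \dots, y_{t-h-p+1}; \theta)}_{\text{recursive (linear)}} + \underbrace{r_h(y_{t-h}, \dots, y_{t-h-p_h+1}; \gamma_h)}_{\text{direct}}$$

• Learning steps

$\{y_1, \dots, y_T\} \to \{\{p, \theta\}, \{p_1, \gamma_1\}, \dots, \{p_H, \gamma_H\}\}$

• Forecasting steps

– $\hat{y}_{T+1} = \underbrace{g^{(1)}(y_T, \dots, y_{T-p+1}; \theta)}_{\text{recursive (linear)}} + \underbrace{r_1(y_T, \dots, y_{T-p_1+1}; \gamma_1)}_{\text{direct}}$

– ...

– $\hat{y}_{T+h} = g^{(h)}(y_T, \dots, y_{T-p+1}; \theta) + r_h(y_T, \dots, y_{T-p_h+1}; \gamma_h)$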
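The two-stage structure can be illustrated as follows. This is a sketch under simplifying assumptions, not the authors' implementation: one lag for both models, a least-squares linear base model iterated recursively, and a k-nearest-neighbour average as the direct rectification model. All function names and the toy series are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def knn_predict(train_x, train_z, x, k):
    """Average the targets of the k training inputs closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_z[i] for i in nearest) / len(nearest)


def rectify_forecasts(series, H, k=3):
    a, b = fit_line(series[:-1], series[1:])  # linear base model (one step)

    def base(x, h):
        """Iterate the linear base model h times (recursive forecast)."""
        for _ in range(h):
            x = a + b * x
        return x

    forecasts = []
    for h in range(1, H + 1):
        xs, ys = series[:-h], series[h:]
        # Residuals of the base forecasts at horizon h are the direct targets
        residuals = [y - base(x, h) for x, y in zip(xs, ys)]
        correction = knn_predict(xs, residuals, series[-1], k)
        forecasts.append(base(series[-1], h) + correction)
    return forecasts


series = [0.0, 1.0, 0.5, 1.2, 0.4, 1.1, 0.6, 1.3, 0.5, 1.0, 0.7, 1.2]
print(rectify_forecasts(series, H=3))
```

When the base model already fits the series, the residuals are close to zero and the rectification leaves the recursive forecasts essentially unchanged.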

[Figure: RFY−KNN and RFY−MLP forecasts (horizontal axis ticks 5 to 25). Panels show the base forecasts $g^{(h)}(x_t)$ and the rectified forecasts $g^{(h)}(x_t) + m_h(x_t)$.]

The rectify strategy

• One unifying base model links all the rectification models
→ decreases the forecast variance

• If the linear base model already produces good forecasts
→ complex rectification models will increase the variance

• Apply several small rectification steps

– several: how many steps?

– small: what does that mean?

The BOOST strategy

• Apply a Boosting procedure for each rectification model

• Boosting

– Weak learning algorithm (small)

– Use the weak learner several times to build the final model
→ Improve the performance at each iteration
→ Early stopping to avoid overfitting (several)

– Effective Machine Learning algorithm

The BOOST strategy

$$\underbrace{\sum_{i=0}^{I_h} \nu\, L_i(x_{t-h}; \gamma_i)}_{\text{boosting steps}}$$

• $I_h$: number of rectification steps at horizon $h$

• $\nu$: shrinkage factor with $0 < \nu \le 1$

• $L_i(x_{t-h}; \gamma_i)$: weak learner estimate

→ high bias and low variance but non-trivial

→ regression trees, P-splines, etc.

[Figure: BOOST (RFY−BST1) forecasts (horizontal axis ticks 5 to 25), shown over four successive frames. Panels show the base forecasts $g^{(h)}(x_t)$ and the boosted forecasts $g^{(h)}(x_t) + \sum_{i=0}^{I_h} \nu L_i(x_t; \gamma_i)$.]

Number of boosting iterations $I_1, \dots, I_{20}$ displayed in each frame:

Frame 1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Frame 2: 8 1 5 1 1 1 23 32 59 59 67 48 74 42 99 99 99 99 99 1

Frame 3: 8 1 5 1 1 1 23 32 59 59 67 48 74 42 199 199 199 199 199 1

Frame 4: 8 1 5 1 1 1 23 32 59 59 67 48 74 42 989 998 1000 1000 1000 1

Algorithm 1 Component-wise gradient boosting

1: $\{y_1, \dots, y_T\} \to \{(x_{t-h}, y_t)\}_{t=1}^{T} \to z_t = y_t - m^{(h)}(x_{t-h}) \to \{(x_{t-h}, z_t)\}_{t=1}^{T}$

2: $I_h$: number of boosting iterations; $0 < \nu \le 1$: shrinkage parameter

3: $F^{(0)}(x_t) = L_0(x_t) = \bar{z} = \frac{1}{T} \sum_{t=1}^{T} z_t$

4: for $i \leftarrow 1, \dots, I_h$ do

5: $z_t^i = -\frac{1}{2} \left. \frac{\partial (z_t - F(x_t))^2}{\partial F(x_t)} \right|_{F(x_t) = F^{(i-1)}(x_t)} = z_t - F^{(i-1)}(x_t)$

6: for $k \leftarrow 1, \dots, K$ do

7: $\{(x_t, z_t^i)\}_{t=1}^{T} \xrightarrow{\text{regression with weak learner } L_k} L_k(x_t; \gamma_k)$

8: end for

9: $k_i = \arg\min_{1 \le k \le K} \sum_{t=1}^{T} [z_t^i - L_k(x_t; \gamma_k)]^2$

10: $F^{(i)}(x_t) = F^{(i-1)}(x_t) + \nu L_{k_i}(x_t; \gamma_{k_i})$

11: end for

12: $F(x_t) = L_0(x_t) + \sum_{i=1}^{I_h} \nu L_{k_i}(x_t; \gamma_{k_i})$
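The loop structure of Algorithm 1 can be sketched in plain Python. To keep the example short, the weak learner for each input component is a simple least-squares line rather than a P-spline; that substitution, the function names and the toy data are assumptions made for illustration only:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y ~ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def componentwise_boost(X, z, n_iter=100, nu=0.1):
    """L2 component-wise boosting with one linear learner per input column."""
    T, K = len(X), len(X[0])
    offset = sum(z) / T                      # step 3: start from the mean of z
    fitted = [offset] * T
    steps = []                               # (k_i, intercept, slope) per iteration
    for _ in range(n_iter):                  # step 4
        resid = [zt - ft for zt, ft in zip(z, fitted)]   # step 5: negative gradient
        best = None
        for k in range(K):                   # steps 6-8: fit every component
            col = [row[k] for row in X]
            a, b = fit_line(col, resid)
            sse = sum((r - (a + b * x)) ** 2 for r, x in zip(resid, col))
            if best is None or sse < best[0]:            # step 9: keep the best
                best = (sse, k, a, b)
        _, k, a, b = best
        steps.append((k, a, b))
        # step 10: shrunken update with the selected component only
        fitted = [f + nu * (a + b * row[k]) for f, row in zip(fitted, X)]
    return offset, steps


def boost_predict(offset, steps, x, nu=0.1):
    """Step 12: offset plus the sum of shrunken weak-learner estimates."""
    return offset + sum(nu * (a + b * x[k]) for k, a, b in steps)


# Toy data: the target depends on the first input only
X = [[float(t), float((-1) ** t)] for t in range(10)]
z = [2.0 * row[0] + 1.0 for row in X]
offset, steps = componentwise_boost(X, z)
print(sorted({k for k, _, _ in steps}))
print(boost_predict(offset, steps, [3.0, -1.0]))
```

Because only the best-fitting component is updated at each iteration, inputs that never get selected drop out of the final model, which is the variable-selection property of the component-wise scheme.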

Which weak learner for $L_k$?

• P-splines

→ Least-squares criterion + $\lambda \times$ difference penalty on coefficients

→ Component-wise gradient boosting with P-splines

• How to make it weak?

→ $\nu$, degrees of freedom (which determine $\lambda$) and number of inputs

• We expect real-world functions to depend on lower-order interactions

→ univariate (BST1) and bivariate (BST2) P-splines
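The penalized criterion mentioned above can be written out explicitly. The following uses the standard P-spline formulation; the symbols $B$ (B-spline basis matrix evaluated at the inputs), $D_d$ (difference matrix of order $d$) and $z$ (vector of current residuals) are notation introduced here and do not appear on the slides:

```latex
\hat{\gamma}
  = \arg\min_{\gamma} \; \| z - B\gamma \|^2 + \lambda \, \| D_d \gamma \|^2
  = (B^\top B + \lambda D_d^\top D_d)^{-1} B^\top z ,
\qquad
\mathrm{df}(\lambda)
  = \operatorname{tr}\!\left[ B \, (B^\top B + \lambda D_d^\top D_d)^{-1} B^\top \right]
```

A larger $\lambda$ gives smoother fits and fewer effective degrees of freedom, so fixing a small value of $\mathrm{df}$ and solving for the matching $\lambda$ is one way to make the learner weak.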

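ANOVA expansion of $F(x_t)$

$$F(x_t) = \underbrace{\sum_{j} f_j(x_{jt})}_{\text{main effects}} + \underbrace{\sum_{j,k} f_{jk}(x_{jt}, x_{kt})}_{\text{second-order interactions}} + \underbrace{\sum_{j,k,l} f_{jkl}(x_{jt}, x_{kt}, x_{lt})}_{\text{third-order interactions}} + \cdots$$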

Simulation experiments

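• STAR (Smooth Transition Autoregressive) model

$$y_t = 0.3\,y_{t-1} + 0.6\,y_{t-2} + (0.1 - 0.9\,y_{t-1} + 0.8\,y_{t-2})\left[1 + e^{-10 y_{t-1}}\right]^{-1} + \varepsilon_t$$

with $E[\varepsilon_t] = 0$ and $E[\varepsilon_t^2] = 0.1^2$

[Figure: a simulated series; horizontal axis ticks from 0 to 1000.]

• Regression methods: linear model, KNN, MLP, BST1, BST2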
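A series from the STAR model above could be simulated as follows. This is a minimal sketch: the slides give only the mean and variance of the error, so the Gaussian noise, the zero starting values, the burn-in length and the function name `simulate_star` are all assumptions:

```python
import math
import random


def simulate_star(T, sigma=0.1, burn_in=100, seed=0):
    """Simulate T observations from the STAR model with N(0, sigma^2) noise."""
    rng = random.Random(seed)
    y = [0.0, 0.0]                            # assumed starting values
    for _ in range(T + burn_in):
        y1, y2 = y[-1], y[-2]
        # Logistic transition between the two autoregressive regimes
        transition = 1.0 / (1.0 + math.exp(-10.0 * y1))
        y.append(0.3 * y1 + 0.6 * y2
                 + (0.1 - 0.9 * y1 + 0.8 * y2) * transition
                 + rng.gauss(0.0, sigma))
    return y[-T:]                             # drop start values and burn-in


sample = simulate_star(200)
print(len(sample), min(sample), max(sample))
```

Repeating the simulation with different seeds gives the independent training sets needed to estimate the bias and variance of each forecasting strategy.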

MSE decomposition at horizon h

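$$\mathrm{MSE}_h(x_t) = E\left[(y_{t+h} - \hat{m}^{(h)}(x_t))^2\right]$$

$$= \underbrace{E\left[(y_{t+h} - \mu_{t+h|t})^2\right]}_{\text{Noise}} + \underbrace{(\mu_{t+h|t} - \bar{m}^{(h)}(x_t))^2}_{\text{Bias}} + \underbrace{E\left[(\hat{m}^{(h)}(x_t) - \bar{m}^{(h)}(x_t))^2\right]}_{\text{Variance}}$$

$y_{t+h}$: true value

$\hat{m}^{(h)}(x_t)$: forecast

$\mu_{t+h|t}$: conditional mean

$\bar{m}^{(h)}(x_t) = E[\hat{m}^{(h)}(x_t)]$: average forecast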
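In a simulation study, the bias and variance terms can be estimated from forecasts of the same target obtained on independent training sets. A small sketch of that computation, with expectations replaced by sample averages; the function name `decompose` and the numbers are illustrative:

```python
def decompose(forecasts, mu):
    """Return (squared bias, variance) of forecasts around the conditional mean.

    forecasts: forecasts of one target, each from a different training set
    mu:        conditional mean of the target
    """
    n = len(forecasts)
    mean_forecast = sum(forecasts) / n        # sample version of the average forecast
    bias_sq = (mu - mean_forecast) ** 2
    variance = sum((f - mean_forecast) ** 2 for f in forecasts) / n
    return bias_sq, variance


forecasts = [1.0, 2.0, 3.0]
bias_sq, variance = decompose(forecasts, mu=2.5)
print(bias_sq, variance)
```

The two returned terms add up to the average squared distance between the forecasts and the conditional mean, which is the part of the MSE that a forecasting strategy can influence.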

Simulation results - REC VS DIR

[Figure: eight panels plotted against Horizon (ticks 5 to 20), for T = 50 and T = 100: REC−KNN, REC−MLP, DIR−KNN and DIR−MLP.]

Simulation results - RFY VS DIR

[Figure: Bias+Var, Bias and Variance plotted against Horizon (ticks 5 to 20) for T = 50 and T = 100; legend: DIR−KNN, RFY−KNN.]

Simulation results - RFY VS DIR

[Figure: panels plotted against Horizon (ticks 5 to 20) for T = 200 and T = 400, including Bias and Variance; legend: DIR−KNN, RFY−KNN.]

Simulation results - RFY VS BOOST

[Figure: eight panels plotted against Horizon (ticks 5 to 20), for T = 50 and T = 100: RFY−KNN, RFY−MLP, RFY−BST1 and RFY−BST2.]

Simulation results - RFY VS BOOST

[Figure: panels plotted against Horizon (ticks 5 to 20) for T = 200 and T = 400, including Bias; legend: RFY−KNN, RFY−BST1, RFY−BST2.]

Simulation results - DIR VS BOOST

[Figure: panels plotted against Horizon (ticks 5 to 20) for T = 50 and T = 100, including Bias+Var, Bias and Variance; legend: DIR−BST1, RFY−BST1, DIR−BST2, RFY−BST2.]

Simulation results - DIR VS BOOST

[Figure: panels plotted against Horizon (ticks 5 to 20) for T = 200 and T = 400, including Bias; legend: DIR−BST1, RFY−BST1, DIR−BST2, RFY−BST2.]

Conclusion

• Strategies for multi-step forecasting

– Recursive and Direct strategies
– Rectify takes advantage of both strategies
→ linear forecasts (Recursive) + one rectification step (Direct)

• The BOOST strategy

– Several small rectification steps (Direct)
– Low-order interactions: univariate and bivariate P-splines
– Component-wise gradient boosting: variable selection

• Future work

– Linear simulated time series + real-world time series
– Other weak learners
– Comparison with AdaBoost.R2 and AdaBoost.RT

http://souhaib-bentaieb.com and http://robjhyndman.com
