automatic forecasting at scale
TRANSCRIPT
![Page 1: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/1.jpg)
Automatic Forecasting at Scale
Sean J. Taylor 12 Aug 2015
Joint Statistical Meetings
![Page 2: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/2.jpg)
![Page 3: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/3.jpg)
Many Forecasting Problems at Facebook
• capacity planning: servers, switches, people, even food
• user / advertiser growth
• revenue
• goal setting for teams (with respect to forecast)
• detecting anomalies
• “trending” units
![Page 4: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/4.jpg)
Business Time Series Have Similar Attributes
• composed of multiple “units” (e.g. countries, users, advertisers, hardware units)
• units are “born” at different times, can exit the sample
• growth curves are common (e.g. saturating a market)
• complex, human-scale seasonality, holidays and events
• structural breaks as exogenous changes happen (e.g. new products, redesigns, site outages)
• missing data
![Page 5: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/5.jpg)
Thousands or millions of forecasts?
![Page 6: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/6.jpg)
Mo’ Data, Mo’ Problems
![Page 7: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/7.jpg)
A second (and third) kind of scale: many people and problems
![Page 8: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/8.jpg)
Goal is to create technology: people who are not experts can use it easily with few decisions and trust the output
![Page 9: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/9.jpg)
Technology?
![Page 10: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/10.jpg)
Results of my search for forecasting advice
▪ carefully clean, scale, and fix missingness in data
▪ try many kinds of models
▪ use model selection procedures based on (penalized) goodness-of-fit or just ocular goodness-of-fit
▪ lots of tacit knowledge involved — experienced forecasters have earned a lot of credibility
![Page 11: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/11.jpg)
Why is building a forecaster harder than building a classifier?
![Page 12: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/12.jpg)
How most people build a classifier:
1. Choose a loss function.
2. Gather as much data as possible and construct potentially useful features.
3. Train models using different amounts of regularization.
4. Choose the one that predicts the best out-of-sample using some cross-validation procedure.
With a flexible enough learner, the only time a human needs to intervene is during feature construction!
![Page 13: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/13.jpg)
Forecasting as (special) supervised learning
Features
▪ state-features constructed from historical data
▪ time-based features for seasonality, events, etc.
Training
▪ off-the-shelf regularized regression (glmnet, VW)
Model selection
▪ use simulated forecasts to estimate expected loss
![Page 14: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/14.jpg)
When you have a really awesome hammer, make everything look like a regularized regression.
$$\arg\min_{\beta} \; \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|^2$$
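A minimal sketch of the objective above, assuming the squared-norm form for both the fit term and the second penalty. The solver is a plain proximal-gradient loop chosen for illustration, not glmnet or VW, and all function names are hypothetical.

```python
import numpy as np

def elastic_net_objective(beta, X, y, lam1, lam2):
    """||y - X beta||^2 + lam1 * ||beta||_1 + lam2 * ||beta||^2."""
    resid = y - X @ beta
    return resid @ resid + lam1 * np.abs(beta).sum() + lam2 * (beta @ beta)

def soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_1 (shrinks each entry toward zero)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def fit_elastic_net(X, y, lam1, lam2, n_iter=500):
    """Proximal gradient descent on the objective above."""
    beta = np.zeros(X.shape[1])
    # Lipschitz constant of the gradient of the smooth (squared) terms.
    lipschitz = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam2)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta) + 2.0 * lam2 * beta
        beta = soft_threshold(beta - step * grad, step * lam1)
    return beta
```

The L1 term is what produces exact zeros: once `lam1` is large enough, every coefficient stays at zero.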
![Page 15: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/15.jpg)
A flexible extrapolation model
![Page 16: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/16.jpg)
Fixed-Horizon Forecasting Regression
Regressors are generated from past state:
$$y_{t+H} = f(y_t, y_{t-1}, y_{t-2}, \ldots)$$
$$y_{t+H} = \underbrace{\alpha y_t}_{\text{Last Value}} + \underbrace{\beta \frac{1}{t} \sum_{i=1}^{t} y_i}_{\text{Mean Value}}$$
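As a rough illustration of the two-regressor model above, the sketch below builds the last-value and running-mean features and fits the two coefficients by ordinary least squares. Leaving out regularization and the helper names are assumptions made for brevity; `horizon` must be at least 1.

```python
import numpy as np

def state_features(y):
    """Row t holds [last value, running mean] computed from y[0..t]."""
    y = np.asarray(y, dtype=float)
    running_mean = np.cumsum(y) / np.arange(1, len(y) + 1)
    return np.column_stack([y, running_mean])

def fit_fixed_horizon(y, horizon):
    """Least-squares fit of y[t + horizon] on the state at time t."""
    y = np.asarray(y, dtype=float)
    feats = state_features(y)[:-horizon]   # states that have an observed target
    target = y[horizon:]                   # y[t + horizon], aligned with row t
    coef, *_ = np.linalg.lstsq(feats, target, rcond=None)
    return coef                            # (alpha, beta)

def forecast_from_latest(y, coef):
    """Apply the fitted coefficients to the most recent state."""
    return float(state_features(y)[-1] @ coef)
```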
![Page 17: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/17.jpg)
State features from one-sided kernel-weighted statistics
Can use any weighted statistic to generate features: mean, variance, quantiles, etc.
[Figure: one-sided kernel weights over past data, up to time t]
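One way to read "one-sided kernel-weighted statistics" is sketched below: only observations up to time t get weight, and the weights decay with age. The exponential kernel and the `bandwidth` parameter are assumptions; the slides do not say which kernel was used.

```python
import numpy as np

def kernel_weighted_stats(y, t, bandwidth):
    """Weighted mean and variance of y[0..t]; nothing after t is used."""
    past = np.asarray(y[: t + 1], dtype=float)
    age = t - np.arange(t + 1)            # 0 for the newest observation
    weights = np.exp(-age / bandwidth)    # one-sided: weight decays into the past
    weights /= weights.sum()
    mean = float(weights @ past)
    var = float(weights @ (past - mean) ** 2)
    return mean, var
```

Varying `bandwidth` gives a family of features, from a near last-value feature (small bandwidth) to a near overall-mean feature (large bandwidth).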
![Page 18: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/18.jpg)
Assumption: local smoothness
Assume parameters vary smoothly over forecast horizon (same as assuming forecast is locally smooth).
$$y_{t+H} = \alpha_H \cdot y_t + \beta_H \cdot \frac{1}{t} \sum_{i=1}^{t} y_i$$
Different model for each horizon
[Figure: $\alpha_H$ plotted against horizon $H$, from 0 to Max Horizon]
![Page 19: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/19.jpg)
Adding Seasonality Features
Add components to the model that represent deterministic functions of time:
▪ trend
▪ cyclic cubic splines for yearly seasonality
▪ day-of-week, day-of-year, hour-of-day dummy variables
▪ smooth curves around known holidays
$$y_{t+H} = f(y_t, y_{t-1}, y_{t-2}, \ldots) + g(t+H)$$
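A small sketch of what the time-based term g(t+H) might be built from. The day-of-week dummies follow the list above. The yearly cycle uses sine/cosine pairs as a simpler stand-in for cyclic cubic splines, which is a substitution and not the method named on the slide. The function name and `n_harmonics` are hypothetical.

```python
import math
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def time_features(target_date, n_harmonics=2):
    """Deterministic features of the target date t + H."""
    feats = {}
    # Day-of-week dummies; datetime.weekday() returns 0 for Monday.
    for d, name in enumerate(DAY_NAMES):
        feats[name] = 1.0 if target_date.weekday() == d else 0.0
    # Smooth periodic yearly curve from sine/cosine pairs.
    frac = (target_date.timetuple().tm_yday - 1) / 365.25
    for k in range(1, n_harmonics + 1):
        feats[f"year_sin{k}"] = math.sin(2 * math.pi * k * frac)
        feats[f"year_cos{k}"] = math.cos(2 * math.pi * k * frac)
    return feats
```

Because these features depend only on the calendar, they are known for any future date at forecast time.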
![Page 20: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/20.jpg)
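Input Data for Training

Series

| t | y |
|---|---|
| 1/1 | 5 |
| 1/2 | 9 |
| 1/3 | 16 |

State Features

| t | last | mean |
|---|---|---|
| 1/1 | - | - |
| 1/2 | 5 | 5 |
| 1/3 | 9 | 7 |

Target + Time Features

| t+H | y | Mon | Tue |
|---|---|---|---|
| 1/1 | 5 | 1 | 0 |
| 1/2 | 9 | 0 | 1 |
| 1/3 | 14 | 0 | 0 |

| t+H | t | H | y | last | mean | Mon | Tue |
|---|---|---|---|---|---|---|---|
| 1/2 | 1/1 | 1 | 5 | - | - | 0 | 1 |
| 1/3 | 1/1 | 2 | 9 | - | - | 0 | 0 |
| 1/3 | 1/2 | 1 | 14 | 5 | 5 | 0 | 0 |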
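The join behind this slide can be sketched as follows: each training row pairs the state known at time t with the target observed at t + H, for every horizon up to a maximum. The state here uses only observations strictly before t, which is one reading of the State Features table, where the first date has no state. Time features are left out for brevity, and the function and key names are hypothetical.

```python
def build_training_rows(y, max_horizon):
    """One row per (t, H) pair: state known at t, target observed at t + H."""
    rows = []
    n = len(y)
    for t in range(n):
        past = y[:t]                        # observations strictly before t
        last = past[-1] if past else None   # no state yet at the first date
        mean = sum(past) / len(past) if past else None
        for h in range(1, max_horizon + 1):
            if t + h >= n:                  # target not observed yet
                break
            rows.append({"target_t": t + h, "t": t, "H": h,
                         "y": y[t + h], "last": last, "mean": mean})
    return rows
```

A three-point series with `max_horizon=2` gives three rows, with the same (t+H, t, H) layout as the combined table.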
![Page 21: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/21.jpg)
Making it hierarchical
We want to borrow information about processes across units. Huge opportunity because:
1. We know more about “new” time series than we think if we are willing to assume they are generated from a similar process.
2. The more examples from a family of time series processes we have, the better we are able to learn about its structure. Example: stock market.
3. Precision gains from borrowing information.
![Page 22: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/22.jpg)
One weird trick for hierarchical models
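[Figure: feature blocks labeled Common Features, United States, Canada, Mexico]
$$y_{i,t+H} = \underbrace{\alpha y_{i,t} + \beta \frac{1}{t} \sum_{j=1}^{t} y_{i,j}}_{\text{Global parameters}} + \underbrace{\alpha_i y_{i,t} + \beta_i \frac{1}{t} \sum_{j=1}^{t} y_{i,j}}_{\text{Unit-specific}}$$
Here i indexes the unit and j indexes time.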
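One plausible reading of the trick is sketched below: keep one shared copy of the features, whose coefficients are the global parameters, and add one copy per unit that is zeroed out for every other unit, whose coefficients are the unit-specific deviations. A regularized regression on the widened matrix then shrinks the deviations toward zero. The function name and the unit labels in the usage are illustrative.

```python
import numpy as np

def add_unit_specific_copies(X, unit_ids, units):
    """Return [X | X * 1(unit == u) for each u in units]."""
    X = np.asarray(X, dtype=float)
    unit_ids = np.asarray(unit_ids)
    blocks = [X]                                  # shared block: global parameters
    for u in units:
        mask = (unit_ids == u).astype(float)[:, None]
        blocks.append(X * mask)                   # nonzero only on rows of unit u
    return np.hstack(blocks)
```

A unit with little history contributes few rows to its own block, so its forecast leans mostly on the shared coefficients.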
![Page 23: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/23.jpg)
Training
▪ BIG DATA: optimization-based techniques are difficult to use here because
▪ Online learning using SGD/Adagrad/Adadelta works well here AND we can update parameters for different loss functions and regularization parameters at the same time.
▪ Other bonus for online learning: incremental learning on data sorted by time!
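A minimal sketch of the online setup described above, assuming squared loss with an L2 penalty and the standard Adagrad update. The class and function names are hypothetical, and the hyper-parameter defaults are arbitrary.

```python
import numpy as np

class AdagradRegressor:
    """Linear model trained one example at a time with Adagrad."""

    def __init__(self, n_features, lr=0.5, l2=0.0, eps=1e-8):
        self.w = np.zeros(n_features)
        self.g2 = np.zeros(n_features)   # running sum of squared gradients
        self.lr, self.l2, self.eps = lr, l2, eps

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        # Gradient of (w.x - y)^2 + l2 * ||w||^2 for a single example.
        grad = 2.0 * (self.predict(x) - y) * x + 2.0 * self.l2 * self.w
        self.g2 += grad ** 2
        self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)

def train_on_stream(stream, models):
    """Feed each (x, y) pair, in time order, to every model in one pass."""
    for x, y in stream:
        for model in models:
            model.update(x, y)
    return models
```

Passing several models with different `l2` values to `train_on_stream` fits them all in a single pass over the time-sorted data.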
![Page 24: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/24.jpg)
Model Selection via Forward Cross-Validation
We have two sets of hyper-parameters:
1. regularization of the model coefficients.
2. amount of differencing we do before fitting.
Just like in the classification version of the problem, we choose the model that empirically forecasts the best by selecting K simulated forecast dates.
[Figure: a training stream and a testing stream, with the model checkpointed at simulated forecast dates 1, 2, and 3]
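The selection procedure can be sketched as below: for each of K simulated forecast dates, a candidate sees only the history before that date, and its forecast is scored against what happened H steps later. Mean absolute error and the two toy candidates are assumptions for illustration; any loss and any set of hyper-parameter settings could take their place.

```python
def forward_cv_error(y, forecaster, cutoffs, horizon):
    """Mean absolute error over simulated forecast dates.

    Each cutoff c means "pretend only y[:c] has been observed".
    """
    errors = []
    for c in cutoffs:
        history = y[:c]
        actual = y[c + horizon - 1]     # value `horizon` steps after the last seen point
        errors.append(abs(forecaster(history, horizon) - actual))
    return sum(errors) / len(errors)

def last_value(history, horizon):
    return history[-1]

def mean_value(history, horizon):
    return sum(history) / len(history)

def select_model(y, candidates, cutoffs, horizon):
    """Name of the candidate with the lowest simulated forecast error."""
    return min(candidates,
               key=lambda name: forward_cv_error(y, candidates[name], cutoffs, horizon))
```

Each cutoff `c` must satisfy `c >= 1` and `c + horizon - 1 < len(y)`, so that there is some history to use and an observed value to score against.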
![Page 25: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/25.jpg)
Predictive Intervals with Quantile Regression
Very important to quantify uncertainty about a forecast. Often we’d prefer that people not even look at the point estimates.
Once you’re in the land of regularized linear regression, you can get predictive intervals simply by changing the loss function to quantile loss.
Directly optimizing the model for the correct amount of empirical coverage!
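For reference, a short sketch of the quantile (pinball) loss and its subgradient with respect to a single prediction. Fitting one model at q = 0.05 and another at q = 0.95 would give a nominal 90% interval. The function names are hypothetical.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Average quantile loss at level q (0 < q < 1)."""
    diff = np.asarray(y, dtype=float) - pred
    # Under-prediction costs q per unit, over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def pinball_subgradient(y, pred, q):
    """Subgradient of the loss with respect to one prediction."""
    return -q if y > pred else 1.0 - q
```

Minimizing this loss over a constant prediction recovers the empirical q-quantile of the data, which is why changing the loss targets coverage directly.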
![Page 26: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/26.jpg)
Computational Tricks
▪ online feature scaling
▪ feature hashing
▪ stochastic gradient descent (and Adagrad, Adadelta)
▪ fitting several models simultaneously on the same data stream
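Two of these tricks can be sketched with the standard library alone. Feature hashing maps arbitrary feature names into a fixed-width vector, and online scaling keeps a running mean and variance so that no separate pass over the data is needed. The MD5-based hash, Welford's update, and all names are assumptions for illustration.

```python
import hashlib
import math

def hashed_index(name, n_buckets):
    """Stable bucket for a feature name (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets

def hash_features(named_values, n_buckets):
    """Fixed-width vector; names that collide share a bucket and are summed."""
    vec = [0.0] * n_buckets
    for name, value in named_values.items():
        vec[hashed_index(name, n_buckets)] += value
    return vec

class RunningScaler:
    """Welford's algorithm: running mean and variance of a stream of values."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def scale(self, x):
        # Fall back to unit scale until a nonzero variance has been seen.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 and self.m2 > 0 else 1.0
        return (x - self.mean) / std
```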
![Page 27: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/27.jpg)
Scaling to More People/Problems
1. Start with a single use-case and nail it.
2. Parameterize that solution: adding new problems should simply be configuration.
3. Work on model/fitting procedure, then run all previous models for diagnostics.
4. Provide easy tools for model criticism — top predictive errors, examples with under/over coverage, etc.
![Page 28: Automatic Forecasting at Scale](https://reader033.vdocument.in/reader033/viewer/2022052401/55d1e842bb61eb80548b4632/html5/thumbnails/28.jpg)
Conclusions
▪ Different kinds of “at scale”: people and problems are more important than size of data
▪ If a model/technique is hard to use, it’s worth thinking about what it would take for a non-expert to use it.
▪ Making problems look like regularized linear regression is GREAT.
▪ Forecasting can be made into a very special kind of supervised learning.
▪ Email me with comments/feedback: [email protected]