
Page 1: Masters Project Report - Minchao Lin

Master’s Project Report

Sales Prediction of 111 Weather Sensitive Products in 45

Walmart Stores using Machine Learning Techniques and

Discussion on its Implications for Inventory Policy

by

Minchao Lin

December 10, 2015


Contents

1 Motivation ................................................................................................................................ 3

2 Objectives ................................................................................................................................ 3

3 Data Description ...................................................................................................................... 4

3.1 Training Data and Test Data ................................................................................................. 4

3.2 Data Features ......................................................................................................................... 5

3.3 Feature Engineering .............................................................................................................. 6

3.4 Feature Correlation ................................................................................................................ 8

4 Models and Techniques ......................................................................................................... 10

4.1 Performance Metric ............................................................................................................. 10

4.2 Models ................................................................................................................................. 11

4.2.1 Stepwise Linear Regression .......................................................................................... 11

4.2.2 K-Nearest Neighbors Search ........................................................................................ 13

4.2.3 Ensemble Learning ....................................................................................................... 17

4.2.4 Combinations of Models .............................................................................................. 19

5 Implications............................................................................................................................ 20

5.1 Cross Validation .................................................................................................................. 20

5.2 Evaluating Forecasts ........................................................................................................... 21

5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock .................... 26

6 Conclusion ............................................................................................................................. 29

7 References .............................................................................................................................. 30

8 Appendices ............................................................................................................................. 31


1 Motivation

Demand forecasting and inventory control are two of the most important aspects of supply chain management. An accurate prediction of demand not only helps replenishment managers set the right inventory level but also helps avoid stockouts and overstock. To better forecast demand, we need to take into consideration the various factors that may contribute significantly to demand variability. For a retail store, extreme weather events such as hurricanes and blizzards can have a huge impact on sales at both the store and product level. Thus, accurately predicting the sales of potentially weather-sensitive products around the time of major weather events becomes essential to timely inventory adjustment. In addition, the difference between predicted and realized demand can provide further information for setting inventory policy, such as the level of safety stock.

2 Objectives

The objectives of this project are two-fold. The first objective is to fit an effective model to predict the sales of 111 potentially weather-sensitive products, affected by snow and rain, in 45 Walmart retail stores. For each product, the task is to predict the units sold in a window of ±3 days surrounding each storm. Model performance is evaluated with the Root Mean Squared Logarithmic Error (RMSLE) and compared with the results of the other 485 teams in the online Walmart recruiting competition. The training data used to generate the model is provided with actual product demand and actual weather data, while the actual demand in the test data used to evaluate the predictions is not provided. The only way to assess the effectiveness of a model is to submit the predicted demand online and obtain its RMSLE. Because the unknown actual demand in the test data limits further analysis of the inventory policy for these products, a second objective is introduced: to fully utilize the training data by applying the most effective model from the previous steps via cross validation, compare the predicted and actual demand for each product, and then develop an analysis of each product's safety stock.

3 Data Description

3.1 Training Data and Test Data

Sales data are provided for 111 products whose sales may be affected by the weather, such as milk, bread, and umbrellas. These 111 products are sold at 45 different Walmart store locations. Each product's id is provided, but not its name or description; the competition teams are reminded that some products are similar but carry different ids in different stores. The 45 store locations are covered by 20 weather stations, with some stores sharing a station. The full observed weather covering both the training and test periods is provided. The training data contains 4,617,600 observations and the test data contains 526,917 observations.

In the following graph, the green dots show the training-set days, the red dots show the test-set days, and the days marked event=True are the days with storms. The graph covers all 20 weather stations.


Figure 1. Training set days and test set days for 20 weather stations. [1]

3.2 Data Features

The features in the training data provided include:

date

store id

item id

number of units sold

The features in the weather data provided include:

date

weather station id

dew point temperature

wet bulb temperature

heating degree days

cooling degree days

time for sunrise

time for sunset

significant weather types

snowfall in inches

water equivalent of rainfall and melted snow

average station pressure

average sea pressure

resultant wind speed

resultant wind direction

average wind speed

[1] “Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle,” accessed December 9, 2015, https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.

3.3 Feature Engineering

In order to better describe the underlying structure in the data, new features are created based on observation and analysis of the original data. It is reasonable to assume that sales on a given day may be related to the position of that day in the month, in the year, or in the whole timeframe of the dataset, so the new features generated from the date include day of month, month, day of year, year, a numeric index for each date, weekday, and whether that day is a holiday.

In addition, it is observed from the data that sales vary significantly from month to month. Thus, the monthly average sales for each product is calculated and serves as another new feature. Based on the monthly average sales, a binary variable identifying whether the monthly average sales equals zero is created. Indicating whether the same month has zero sales in each year for a product can provide further detail for the predicted demand during that month, thus improving the accuracy of the model.
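The date-derived features can be sketched as follows. This is an illustrative Python analogue (the project itself was built in Matlab); the function and field names are hypothetical, and the holiday set is supplied by the caller:

```python
from datetime import date

def date_features(d, origin, holidays=frozenset()):
    """Calendar features for one date, mirroring the list above (sketch)."""
    return {
        "date_num": (d - origin).days,        # numeric index for each date
        "month": d.month,
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "year": d.year,
        "weekday": d.weekday(),               # 0 = Monday
        "is_holiday": int(d in holidays),
    }

f = date_features(date(2012, 12, 25), origin=date(2012, 1, 1),
                  holidays={date(2012, 12, 25)})
print(f)
```

In the full dataset this function would be applied to every row's date before joining with the weather features.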


Temperature can be another relevant feature, because temperatures that are too high or too low may influence a customer's decision to go out or stay home. Moreover, the “feels like” temperature may be a better indicator. Since the feels-like temperature is related to the moisture in the air, two new features measuring that moisture in two different ways are created. The first is the difference between the average temperature and the dew point temperature, which represents how far the amount of moisture in the air is from saturation. The second is the difference between the average temperature and the wet bulb temperature, which reflects the relative humidity of the air: the larger the difference, the lower the relative humidity.

Precipitation and average wind speed are included directly without further processing. The snowfall feature is eliminated because it includes too many undefined values (NaN or empty cells). Resultant wind speed is not included either, as it is closely correlated with average wind speed. The remaining features in the weather data are ignored either because they consist of too many different text entries that are hard to encode numerically or because they are not intuitively related to product sales. These features are heating degree days, cooling degree days, time for sunrise, time for sunset, significant weather types, average station pressure, average sea pressure, and resultant wind direction.

Because some products have many days with zero sales, I assume that the number of zero-sales days before or after each day may also influence the sales on that day. Three new features are created based on this assumption: the number of consecutive days with zero sales before today, the number of consecutive days with zero sales after today, and the minimum of the previous two features. Besides the number of zero-sales days, the average sales before or after each day may also matter. Thus, I created one variable for the average sales over the seven days before today and another for the average sales over the seven days after today. If the seven days around a date are not all included in the training data (i.e., some of those dates fall in the test data), the average is taken over only the available sales in the training data.
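The zero-sales run features can be computed with two linear scans, as in the following Python sketch (an illustrative analogue of the Matlab feature construction; names are hypothetical):

```python
def zero_runs_before_after(units):
    """For each day, count consecutive zero-sales days strictly before/after it."""
    n = len(units)
    before = [0] * n
    after = [0] * n
    for i in range(1, n):
        # extend the run if the previous day had zero sales, else reset
        before[i] = before[i - 1] + 1 if units[i - 1] == 0 else 0
    for i in range(n - 2, -1, -1):
        after[i] = after[i + 1] + 1 if units[i + 1] == 0 else 0
    min_run = [min(b, a) for b, a in zip(before, after)]
    return before, after, min_run

b, a, m = zero_runs_before_after([3, 0, 0, 5, 0])
# b = [0, 0, 1, 2, 0], a = [2, 1, 0, 1, 0], m = [0, 0, 0, 1, 0]
```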

To conclude, features that are used to build models are:

1. numeric number for the date

2. month

3. day in month

4. year

5. weekday

6. whether the day is a holiday

7. day in year

8. monthly average sales

9. whether the month has zero sales

10. precipitation

11. average wind speed

12. difference between average temperature and dew point temperature

13. difference between average temperature and wet bulb temperature

14. number of continuous days with zero sales after today

15. number of continuous days with zero sales before today

16. minimum of the number of continuous days with zero sales before or after today

17. average sales seven days before today

18. average sales seven days after today

3.4 Feature Correlation

Because multiple variables are used to generate the model, a multicollinearity problem may arise if these variables are not independent. As a first step toward model specification, it is useful to identify any possible dependencies among the predictors. The correlation matrix is a standard measure of the strength of pairwise linear relationships. In the following table, the correlation coefficient (R) between each pair of numeric variables is calculated:

Variables   1        2        3        4        5        6        7        8        9        10
1           1        0.0066   -0.038   0.035    -0.11    -0.12    -0.29    -0.19    -0.24    0.015
2           0.0066   1        0.027    0.056    -0.20    0.067    -0.35    -0.40    -0.42    0.82
3           -0.038   0.027    1        0.12     -0.37    -0.027   0.023    0.029    0.049    0.033
4           0.035    0.056    0.12     1        0.24     0.020    0.10     -0.071   0.0011   0.064
5           -0.11    -0.20    -0.37    0.24     1        0.040    0.23     0.13     0.18     -0.16
6           -0.12    0.067    -0.027   0.020    0.040    1        -0.047   -0.053   -0.047   -0.035
7           -0.29    -0.35    0.023    0.10     0.23     -0.047   1        0.17     0.58     -0.25
8           -0.19    -0.40    0.029    -0.071   0.13     -0.053   0.17     1        0.58     -0.36
9           -0.24    -0.42    0.049    0.0011   0.18     -0.047   0.58     0.58     1        -0.35
10          0.015    0.82     0.033    0.064    -0.16    -0.035   -0.25    -0.36    -0.35    1

Table 1. R value between each pair of numeric variables

Variables 1 to 10 represent the features: numeric date, monthly average sales, precipitation, average wind speed, average temperature minus dew point temperature, average temperature minus wet bulb temperature, number of continuous days with zero sales after today, number of continuous days with zero sales before today, and the minimum of the previous two features.

From the table, we observe that only the numbers of continuous zero-sales days before and after today have a moderate correlation with the minimum of those two features. This moderate correlation is handled by the ensemble methods, where only a subset of features is selected to grow each decision tree. The other R values show little correlation between the remaining pairs of features.

Besides pairwise correlation, relationships among arbitrary subsets of features may also imply a multicollinearity problem. To diagnose multicollinearity, we can calculate the variance inflation factor (VIF). The VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis and is calculated as:

VIFᵢ = 1 / (1 − Rᵢ²)

where Rᵢ² is obtained by regressing feature i on all the other features. When the variation of feature i is largely explained by a linear combination of the other features, Rᵢ² is close to 1 and the VIF for that feature is correspondingly large. A rule of thumb is that if the VIF is greater than 10 then multicollinearity is high. Again, the VIF for the previous data is calculated:

Variables   1      2      3      4      5      6      7      8      9      10
VIF         1.20   3.65   1.27   1.18   1.45   1.05   1.91   1.84   2.44   3.24

Table 2. VIF for each variable

The above values show that monthly average sales and the minimum of continuous zero-sales days before or after today have the two highest VIFs, but both are still far below the threshold of 10. We therefore conclude that no significant multicollinearity exists between the variables.
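The VIF diagnostic can be reproduced with the sketch below. It is an illustrative NumPy implementation (the report's numbers came from Matlab), regressing each feature on the others via least squares and applying VIFᵢ = 1/(1 − Rᵢ²):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the feature matrix X."""
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)  # near-duplicate column
v = vif(X)
print(v)  # columns 0 and 2 get large VIFs; column 1 stays near 1
```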

4 Models and Techniques

4.1 Performance Metric

For a regression problem, the model's performance is quantified by measuring the distance between the estimated outputs and the actual outputs. The Mean Squared Error penalizes larger differences more because of the squaring. If, on the other hand, we want to reduce the penalty on larger differences, we can log-transform the quantities first; the effect of introducing the logarithm is to balance the emphasis on small and large prediction errors. For the Walmart recruiting competition, submissions of predictions are evaluated based on the Root Mean Squared Logarithmic Error (RMSLE):

RMSLE = √[(1/n) ∑ᵢ₌₁ⁿ (log(pᵢ + 1) − log(aᵢ + 1))²]

where:

n is the number of observations in the test set

pᵢ is the predicted count

aᵢ is the actual count

log(x) is the natural logarithm
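As a sketch, the metric can be written in a few lines of Python (an illustrative analogue; the competition computed the score server-side):

```python
import numpy as np

def rmsle(pred, actual):
    """Root Mean Squared Logarithmic Error; log1p(x) = log(1 + x)."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))

print(rmsle([0, 9], [0, 9]))  # 0.0 for a perfect prediction
```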

4.2 Models

4.2.1 Stepwise Linear Regression

Stepwise linear regression creates a linear model and automatically adds or removes terms based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models using forward selection and backward elimination. Specifically, at each step the p value of an F statistic is computed to test the model with and without a candidate term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added; if the null hypothesis is rejected, the term with the smallest p value among all terms with p values below an entrance tolerance is added to the model. Conversely, for a term already in the model, the null hypothesis is that it has a zero coefficient; if there is no significant evidence to reject this hypothesis, the term with the largest p value among all terms with p values above an exit tolerance is removed from the model. [2] In this sense, stepwise models are locally optimal but may not be globally optimal.
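The forward-selection half of this procedure can be sketched as follows. This is a simplified illustrative Python analogue of Matlab's stepwiselm: instead of a p-value entrance tolerance it uses a fixed partial-F threshold (`f_enter`, a hypothetical parameter), and it omits the backward-elimination pass:

```python
import numpy as np

def forward_stepwise(X, y, f_enter=4.0):
    """Greedy forward selection: a term enters when its partial F statistic
    (the scaled SSE reduction from adding it) exceeds f_enter."""
    n, k = X.shape
    selected = []

    def sse(cols):
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return float(r @ r)

    while True:
        best, best_f = None, f_enter
        sse_now = sse(selected)
        for c in range(k):
            if c in selected:
                continue
            sse_new = sse(selected + [c])
            df = n - len(selected) - 2          # residual dof of the larger model
            f = (sse_now - sse_new) / (sse_new / df)
            if f > best_f:
                best, best_f = c, f
        if best is None:
            return selected
        selected.append(best)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)
sel = forward_stepwise(X, y)
print(sel)  # columns 0 and 3 enter first
```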

For this method, five stepwise models were built based on different combinations of variables (the numbers representing each feature correspond to the list in Section 3.3). The first four models are listed below:

[2] “Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm,” accessed December 10, 2015, http://www.mathworks.com/help/stats/stepwiselm.html.


RMSLE of each model   1 2 3 4 5 6 8 9 10 11 14 15 16 17 18

0.12995 √ √ √ √ √

0.11892 √ √ √ √ √ √ √

0.13218 √ √ √ √ √ √ √ √ √ √ √ √ √

0.19076 √ √ √ √ √ √ √ √ √ √ √ √ √ √ √

Table 3. Stepwise Linear Regression Models

The model with the best RMSLE in the table is the second one, with an RMSLE of 0.11892. From the results, we can see that having more features doesn't necessarily improve the model. Thus, instead of creating more features, the focus was shifted from the predictor variables to the response variable. Since the performance metric for the Walmart recruiting online competition applies a log transformation to the difference between the predicted and actual values in the test data, a log transformation is also applied to the response values (i.e., units sold for each item in each store) in the training data as an attempt to improve prediction performance. In order to avoid negative transformed values, log(1 + x) is applied to each response value. The best result is as follows:

RMSLE 1 2 3 4 5 6 8 9 10 11 14 15 16 17 18

0.10477 √ √ √ √ √

Table 4. Stepwise Linear Regression Models with log-transformed response variable

The above result shows that log transformation of the response values in the training data does improve performance. However, even with log-transformed response values, having more features doesn't necessarily improve the model. The final ranking of the best stepwise linear regression model is 94/485.


Figure 2. Ranking of Stepwise Linear Regression Model

4.2.2 K-Nearest Neighbors Search

K-Nearest Neighbors Search finds the k closest points in X for each query point in Y; the predicted value is typically calculated as the average of those k closest points, or as their weighted average using inverse distance weights. Two different search methods can be used. The exhaustive search method computes the distance from each query point to every point in X, ranks the distances in ascending order, and returns the k points with the smallest distances. The Kd-tree search method divides the data into nodes with a certain bucket size based on coordinates. The closest k points are first found within the node that the query point in Y belongs to; then points in all other nodes that lie within the distance between those k points and the query point are considered as well. For large data sets, a Kd-tree can be much more efficient than exhaustive search because it computes only a subset of the distances. Distances can also be measured with various metrics. The most common distance metric is Euclidean distance. The other metrics tested later in this section are correlation distance, Spearman distance, cosine distance, and Hamming distance. Correlation distance is one minus the sample linear correlation between observations, treated as sequences of values. Spearman distance is one minus the sample Spearman's rank correlation between observations, treated as sequences of values. Cosine distance is one minus the cosine of the angle between observations, treated as vectors. Hamming distance is the percentage of coordinates that differ. [3]

Thus, the parameters varied are the nearest neighbors search method, the way the predicted value is computed from the closest neighbors, the number of closest neighbors, and the distance metric. The default Matlab behavior is followed for the search method: exhaustive search is used when X has more than 10 columns, and Kd-tree search is used otherwise.

For the exhaustive search method, all 18 predictors listed in Section 3.3 are included. Different distance metrics are tested first, with the number of closest neighbors fixed at 10. The results are as follows:

Distance metric RMSLE

Euclidean distance 0.11189

Correlation distance 0.14171

Spearman distance 0.18862

Cosine distance 0.14401

Hamming distance 0.12848

Table 5. Testing Distance Metrics.

From the table, we see that Euclidean distance works significantly better than the other distance metrics. Thus, for the next step, Euclidean distance is fixed as the distance metric, and the number of closest neighbors remains 10. However, instead of taking the plain mean of the 10 closest neighbors, the weighted average of the k closest points using inverse distance weights is used.

[3] “Classification Using Nearest Neighbors - MATLAB & Simulink,” accessed December 10, 2015, http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.

Inverse distance weighting defines the prediction as

u(x) = ∑ᵢ₌₁ᴺ wᵢ(x) u(xᵢ) / ∑ᵢ₌₁ᴺ wᵢ(x)

where the weight wᵢ(x) is defined as

wᵢ(x) = 1 / d(x, xᵢ)ᵖ

The result is as follows:

Ways to calculate predicted values RMSLE

Arithmetic mean 0.11189

Weighted mean with inverse distance weights (𝑝 = 1) 0.10341

Weighted mean with inverse distance weights (𝑝 = 2) 0.10473

Weighted mean with inverse distance weights (𝑝 = 3) 0.10732

Weighted mean with inverse distance weights (𝑝 = 7) 0.11666

Table 6. Testing ways to calculate predicted values

The above table shows that the weighted mean with inverse distance weights and p = 1 gives the best RMSLE. In the next step, this way of calculating predicted values is kept, and different numbers of closest neighbors for each point in Y are tested. Let k denote the number of closest neighbors. The results are as follows:

K     RMSLE
3     0.11008
10    0.10341
40    0.10215
60    0.10193
80    0.10198
100   0.10200

Table 7. Testing K values.

For the Kd-tree search method, only predictors related to time are included. These correspond to variables 1, 2, 3, 4, 5, and 7 in Section 3.3.

K RMSLE

20 0.10182

60 0.10126

70 0.10136

Table 8. Kd-tree search method

Figure 3. Ranking of K-Nearest Neighbors Search

To conclude, the best k-nearest neighbors model uses Euclidean distance as the distance metric, predicts the response with a weighted mean using inverse distance weights with p = 1, uses only the time-related variables (numeric date, month, day in month, year, day in year, and weekday) as predictors, and sets the number of closest neighbors to 60. The best RMSLE is 0.10126, ranking 66/485 in the competition.
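The winning configuration can be sketched in a few lines; this is an illustrative NumPy analogue of Matlab's knnsearch-based prediction (Euclidean distance, inverse distance weights with p = 1), not the project's actual code:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=60, p=1):
    """Inverse-distance-weighted k-NN regression with Euclidean distance."""
    preds = []
    for q in X_query:
        d = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        idx = np.argsort(d)[:k]                   # k nearest neighbors
        w = 1.0 / np.maximum(d[idx], 1e-12) ** p  # inverse distance weights
        preds.append(np.dot(w, y_train[idx]) / w.sum())
    return np.array(preds)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
pred = knn_predict(X, y, np.array([[1.1]]), k=2)
print(pred)  # ≈ [1.1]
```

With p = 1 in one dimension, the two-neighbor weighted mean reduces to linear interpolation between the neighbors, which is why the toy query recovers 1.1.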


4.2.3 Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. [4] Common constituent learners include decision trees, neural networks, and other machine learning algorithms. A decision tree builds a regression or classification model in the form of a tree structure, where the dataset is divided into smaller subsets at each node. In a regression tree, a regression model is fit to the target variable using each of the independent variables: for each independent variable the data is split at several candidate split points, and the squared error between the predicted and actual values is calculated at each. The node splits the predictor variable at the split point that maximizes the reduction in squared error.

Regression tree ensembles work with two methods: least squares boosting and bagging. Least squares boosting fits regression ensembles to minimize mean squared error: at every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. [5] Bagging trains each model in the ensemble on a randomly drawn subset (with replacement) of the training set and obtains the predicted response of the trained ensemble by averaging the predictions of the individual trees. Furthermore, random sampling with replacement omits, on average, 37% of the observations for each decision tree, and every tree in the ensemble can randomly select predictors for its decision splits.

[4] “Ensemble Learning - Wikipedia, the Free Encyclopedia,” accessed December 9, 2015, https://en.wikipedia.org/wiki/Ensemble_learning.

[5] Jerome Friedman et al., “Discussion of Boosting Papers,” Ann. Statist 32 (2004): 102–7.


Since ensembles tend to overtrain, lasso regularization of the ensemble is applied in order to choose fewer weak learners with no loss in predictive performance.

To start training, both least squares boosting and bagging are applied with all the predictor variables listed in Section 3.3 included. The results are as follows:

Ensemble Learning Methods RMSLE

Least Squares Boosting 0.10388

Bagging 0.10142

Table 9. Ensemble Learning Methods

The results indicate that bagging works much better than least squares boosting. Thus, bagging is

chosen as the ensemble learning method.
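The two defining ingredients of bagging, bootstrap samples and random predictor subsets per tree, can be illustrated with the minimal Python sketch below. It is a deliberately simplified analogue of the Matlab ensemble (depth-1 regression stumps stand in for full regression trees; all names are hypothetical):

```python
import numpy as np

def fit_stump(X, y, feat_candidates):
    """Best single-split regression stump over a subset of features."""
    best = None
    for j in feat_candidates:
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, len(ys)):
            if xs[i] == xs[i - 1]:
                continue
            left, right = ys[:i], ys[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, (xs[i] + xs[i - 1]) / 2, left.mean(), right.mean())
    return best[1:]  # (feature, threshold, left_value, right_value)

def bagged_stumps(X, y, n_trees=50, seed=2):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                      # with replacement
        feats = rng.choice(k, size=max(1, k // 2), replace=False)
        trees.append(fit_stump(X[boot], y[boot], feats))
    return trees

def predict(trees, X):
    preds = np.zeros(len(X))
    for j, t, lv, rv in trees:
        preds += np.where(X[:, j] <= t, lv, rv)
    return preds / len(trees)                                  # average over trees

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0)      # depends only on feature 0
trees = bagged_stumps(X, y)
lo, hi = predict(trees, np.array([[0.1, 0.5], [0.9, 0.5]]))
```

Averaging over bootstrap samples reduces variance, and the random feature subsets decorrelate the trees, which is also how the moderate feature correlations noted in Section 3.4 are absorbed.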

In consideration of potential interactions between variables, two ways to include more feature terms are applied. The first method includes the products of all pairs of distinct predictors in the feature pool, so the number of features increases from 18 to 171. The other method includes only interactions between numeric terms, so the number of features increases from 18 to 52. The ensemble method is then applied to both sets of data. The results are as follows:

Number of features RMSLE

52 0.11728

171 0.09907

Table 10. Number of features

The result shows that including interaction terms between each pair of predictors significantly improves the model. Hence the best performance given by regression tree ensembles is an RMSLE of 0.09907. This result ranks 47/482 in the competition.


Figure 4. Ranking of Ensemble Learning Method

4.2.4 Combinations of Models

In this section, three different combinations of the previously generated models are tested for possible improvement in prediction performance. The first combination takes, for each entry in the test data, the median of the predicted values from all previous models. The second combination is a linear combination of the most effective models from k-nearest neighbors search and ensemble learning. The third combination is a linear combination of the three most effective ensemble learning models together with the most effective stepwise linear regression model. The coefficients of the linear combinations are generated by fitting each model's predictions on the training data to the actual values. The results are as follows:

Combinations of Models RMSLE

Median 0.09972

Linear combination of 1 k nearest neighbors and 1 ensemble learning (appendix 1) 0.10384

Linear combination of 1 stepwise linear regression and 3 ensemble learning (appendix 2) 0.09818

Table 11. Combinations of Models

The above table shows that the third combination returns the best result, with a ranking of 40/485. From the graph below, we see that the difference between the current best result and the top result is around 0.09875 − 0.09340 = 0.00535 in RMSLE. Instead of generating more models to close this 0.00535 gap, the focus of the project shifts to analyzing the obtained predictions and their implications for inventory policy. In the next section, the second objective of the project is introduced and explained in detail.

Figure 5. Ranking of Combinations of Models
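The linear combinations above amount to a simple blend fit by least squares; the Python sketch below illustrates the idea (the toy arrays m1 and m2 are hypothetical stand-ins for the constituent models' training predictions):

```python
import numpy as np

def fit_blend(preds_train, y_train):
    """Least squares coefficients (with intercept) for blending model predictions."""
    A = np.column_stack([np.ones(len(y_train))] + list(preds_train))
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

def blend(coef, preds):
    A = np.column_stack([np.ones(len(preds[0]))] + list(preds))
    return A @ coef

y = np.array([1.0, 2.0, 3.0, 4.0])
m1 = y + 0.1          # hypothetical model 1: biased upward
m2 = 2 * y            # hypothetical model 2: scaled
coef = fit_blend([m1, m2], y)
blended = blend(coef, [m1, m2])
```

Because the blend is fit on the training data, it can overfit the constituent models' training behavior, which is consistent with its weaker cross-validation ranking in Section 5.1.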

5 Implications

5.1 Cross Validation

Although for the competition, the lower the RMSLE the higher the ranking among the

participating teams, the generality of the model needs further proof. For this reason, cross

validation is applied to the training data while test data is ignored since its actual sales value are

not provided. Specifically, 5-fold cross validation is applied, which means each group of

observations for each product in each store is partitioned into 5 disjoint subsamples (or folds),

Page 21: Masters Project Report - Minchao Lin

chosen randomly but with roughly equal size. Every time, 4 folds are used for training and last

fold is used for evaluation. Predicted values for that last fold is created at the same time. This

process is repeated 5 times, leaving one different fold for evaluation each time. The models used

for training the data are the most effective ones generated in the sections 4.3.1, 4.3.2, and 4.3.3.

RMSLE of each model is ranked in order to compare the effectiveness of prediction performance

from cross validation with those that are submitted to the online competition. The results are as

follows:

Model                                           test RMSLE   ranking   train RMSLE   ranking
Stepwise Linear Regression                      0.10477      5         0.129844      5
Ensemble Learning - LS Boosting - 18 features   0.10388      4         0.122193      3
Ensemble Learning - Bagging - 18 features       0.10142      3         0.105286      2
Ensemble Learning - Bagging - 171 features      0.09907      2         0.1029        1
Linear combination of the previous 4 models     0.09818      1         0.123611      4

Table 12. Cross Validation

From the table above, we notice that the linear combination of models does not hold up well under cross validation (ranked fourth out of five). Ignoring that last row, the remaining four models share the same ranking for both the test-data RMSLE in the online competition and the cross-validation RMSLE on the training data. With these results, we are more confident in applying the best prediction model (Ensemble Learning - Bagging - 171 features) to the analysis of inventory policy.

5.2 Evaluating Forecasts

In this section, two common measures of forecast accuracy are applied to the predictions for the

training data generated with cross validation from previous section. Specifically, these two

measures are mean absolute deviation (MAD) and mean absolute percentage error (MAPE).


To calculate these two measures, denote e_i as the difference between the forecast value and the actual value for each observation in the training data, and suppose there are n observations. MAD and MAPE are calculated as:

MAD = (1/n) Σ_{i=1}^{n} |e_i|

MAPE = [(1/n) Σ_{i=1}^{n} |e_i / D_i|] × 100%

Because some products have many days with zero sales, the D_i used in MAPE is replaced with the average demand to avoid undefined values. Each of the above measures is applied to each product in each store. Since there are 255 combinations of stores and products, 255 MADs and 255 MAPEs are generated.
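The two measures for a single store and item combination can be sketched as follows (a minimal Python illustration with made-up daily numbers; the average-demand denominator mirrors the substitution described above):

```python
import numpy as np

def mad_mape(actual, forecast):
    """MAD and MAPE for one store/item combination.  Because many items
    have zero-sales days, the MAPE denominator uses the item's average
    demand instead of each day's demand D_i, as done in the report."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    err = forecast - actual
    mad = np.mean(np.abs(err))                 # mean absolute deviation
    mean_demand = actual.mean()                # replaces per-day D_i
    mape = np.mean(np.abs(err / mean_demand)) * 100
    return mad, mape

# toy daily sales series for one store/item pair (placeholder numbers)
actual   = [5, 0, 3, 4, 0, 6]
forecast = [4, 1, 3, 5, 0, 5]
mad, mape = mad_mape(actual, forecast)
print(round(mad, 3), round(mape, 2))
```

Applying this per store and item combination produces the 255 MAD and MAPE values summarized below.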

It should be noted that the original model that generates the best result includes feature 18, the average sales over the 7 days after today. When developing an inventory policy based on the predictions, however, the data for this feature is obviously not available in real life. For this reason, feature 18 and its interaction terms with other predictors are eliminated, and a new cross-validated ensemble learning model is built with this update before the MADs and MAPEs are calculated. It turns out that feature 18 contributes little to the original model, and its elimination does not significantly influence the original predicted values. To illustrate this point, the variable importance ranking for predicting sales of product 23 in store 8 is shown as an example:

rank variables importance rank variables importance rank variables importance

1 7 4.47E-04 31 102 4.43E-06 61 12 1.95E-06

2 78 5.98E-05 32 62 4.41E-06 62 37 1.75E-06

3 21 5.11E-05 33 17 4.28E-06 63 58 1.60E-06

4 3 4.66E-05 34 147 4.27E-06 64 76 1.57E-06

5 87 3.88E-05 35 98 4.15E-06 65 38 1.39E-06

6 63 2.41E-05 36 77 4.14E-06 66 35 1.08E-06


7 24 2.29E-05 37 138 4.14E-06 67 4 9.18E-07

8 5 1.98E-05 38 103 4.07E-06 68 39 9.00E-07

9 8 1.45E-05 39 42 3.92E-06 69 80 6.61E-07

10 66 1.28E-05 40 112 3.75E-06 70 55 6.43E-07

11 20 1.08E-05 41 28 3.68E-06 71 22 5.86E-07

12 29 9.34E-06 42 111 3.59E-06 72 101 5.59E-07

13 113 9.33E-06 43 43 3.58E-06 73 71 5.53E-07

14 83 9.29E-06 44 27 3.50E-06 74 132 4.33E-07

15 1 9.14E-06 45 53 3.31E-06 75 127 4.25E-07

16 2 8.70E-06 46 36 3.16E-06 76 44 4.12E-07

17 117 7.68E-06 47 70 3.14E-06 77 126 3.83E-07

18 133 5.70E-06 48 134 3.01E-06 78 89 3.67E-07

19 81 5.40E-06 49 88 2.95E-06 79 68 3.49E-07

20 108 5.38E-06 50 19 2.82E-06 80 41 3.46E-07

21 82 5.17E-06 51 11 2.75E-06 81 128 3.39E-07

22 143 5.04E-06 52 69 2.74E-06 82 10 2.48E-07

23 50 4.84E-06 53 18 2.57E-06 83 110 2.24E-07

24 33 4.76E-06 54 49 2.34E-06 84 6 2.23E-07

25 57 4.68E-06 55 99 2.25E-06 85 26 2.04E-07

26 56 4.67E-06 56 65 2.18E-06 86 64 1.73E-07

27 52 4.60E-06 57 139 2.11E-06 87 92 1.03E-07

28 48 4.52E-06 58 51 2.09E-06 88 93 8.48E-08

29 75 4.48E-06 59 104 2.07E-06 89 94 3.20E-08

30 34 4.47E-06 60 23 2.00E-06

Table 13. Variable Importance

We see that feature 18 (the average sales over the 7 days after today) ranked 53rd among all the features, and it is about half as important as feature 17 (the average sales over the 7 days before today).

Since MAD and MAPE each have 255 values, it is not convenient to show them all in the report. Instead, detailed values for the top 10 and bottom 10 store and item combinations, sorted in descending order of average daily sales, are shown in tables, while the full set of values is plotted to indicate the trends in MAD and MAPE. The tables and graphs are as follows:

Top 10 in average daily sales:

store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand

33 44 189903 914 36.219 0.115 207.771

16 25 135046 857 28.097 0.118 157.580


30 44 136473 868 26.824 0.317 157.227

17 9 135367 939 45.548 0.204 144.161

2 44 117125 875 21.016 0.120 133.857

4 9 117123 960 36.619 0.190 122.003

33 9 101586 914 36.785 0.227 111.144

25 9 98560 1011 28.217 0.157 97.488

34 45 87419 947 15.747 0.125 92.312

38 45 80068 875 15.488 0.130 91.506

Table 14. Top 10 in average daily sales

Bottom 10 in average daily sales:

store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand

16 85 67 857 0.099 0.810 0.078

40 106 78 1011 0.093 1.049 0.077

9 105 73 947 0.099 0.884 0.077

22 104 68 898 0.094 0.883 0.076

38 86 62 875 0.088 0.929 0.071

25 84 69 1011 0.087 0.906 0.068

20 106 61 896 0.085 0.968 0.068

31 104 58 947 0.070 1.025 0.061

34 84 46 947 0.065 0.883 0.049

3 102 31 896 0.045 0.936 0.035

Table 15. Bottom 10 in average daily sales

MADs for each store and item combination, plotted against average daily sales in descending order:


Figure 6. MAD

MAPEs for each store and item combination, plotted against average daily sales in descending order:

Figure 7. MAPE

[Figure 6 plot: MAD (0 to 50) against average daily sales for each store and item combination, sorted in descending order.]

[Figure 7 plot: MAPE (0 to 1.8) against average daily sales for each store and item combination, sorted in descending order.]


The above plots show that, in general, MAD decreases and MAPE increases as average daily sales decrease. For MAD, the models for some store and item combinations do not perform as well as others; this is particularly obvious for items with a large volume of sales. For those models, extra effort to fit a better model may be worthwhile as a further step. For MAPE, we can see a big jump from an average of around 0.1 to an average of around 0.4 when average daily sales drop to around five. It should be noted, however, that MAPE is scale sensitive and should not be used with low-volume data: when the average demand is very low, the denominator in the MAPE formula often makes MAPE take on extreme values.

5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock

In general, forecast error variance is higher than demand variance, since forecast error also incorporates sampling error. If a forecast is used to estimate mean demand, safety stock is kept in order to protect against the error in the forecast [6]. Thus, the standard deviation (STD) of the forecast errors, rather than the standard deviation of demand, should be used to calculate safety stock.

Because the model was built with 5-fold cross validation, each prediction group (generated by a model trained on the other four folds) accounts for only one fifth of the overall predictions. Thus, instead of calculating the standard deviation over all predictions, the average of the standard deviations of the five prediction groups is used, to remain consistent with the cross validation method. A graph of the averaged STD against mean daily demand for each of the 255 store and item combinations is shown below:

[6] Steven Nahmias, Production and Operations Analysis (New York: McGraw-Hill/Irwin, 2009).


Figure 8. Averaged STD

Assuming overnight replenishment and a 98% service level (which corresponds to a z-score of 2.05), daily safety stock is calculated as approximately 2 × averaged STD. The percentage of daily safety stock over average daily demand for each store and item combination is shown below:

[Figure 8 plot: averaged STD (0 to 70) against average daily sales for each store and item combination, sorted in descending order.]


Figure 9. Percentage of safety stock over average daily demand

The portion of the previous graph covering only combinations with average daily sales above 5 units is shown below:

Figure 10. Percentage of safety stock over average daily demand, for combinations with average daily sales above 5 units

[Figure 9 plot: percentage of safety stock over average daily demand (0 to 30) against average daily sales for each store and item combination, sorted in descending order.]

[Figure 10 plot: the same percentage (0 to 2), restricted to combinations with average daily sales above 5 units.]


From the plot, we notice that for products with average daily sales below five, the percentage of safety stock over average daily demand increases dramatically and fluctuates widely. This raises the question of whether it is profitable to keep those low-demand products in stock, since their safety stock is much larger than their daily demand. However, as with MAPE, when the average daily demand is very close to zero, its position in the denominator often makes the percentage take on very high values; this may partially account for the high spikes in the graph.
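The safety stock calculation described in this section can be sketched as follows (a minimal Python illustration with placeholder forecast errors and fold labels; the z-score of 2.05 for a 98% service level is taken from the text above):

```python
import numpy as np

def safety_stock(errors, fold_ids, z=2.05):
    """Average the forecast-error standard deviation over the CV
    prediction groups, then scale by the service-level z-score
    (2.05 for a 98% service level), as described in the report."""
    errors = np.asarray(errors, dtype=float)
    fold_ids = np.asarray(fold_ids)
    # one STD per prediction group, then averaged across groups
    stds = [errors[fold_ids == f].std(ddof=1) for f in np.unique(fold_ids)]
    avg_std = float(np.mean(stds))
    return z * avg_std

# toy forecast errors tagged with their CV fold (placeholder numbers)
errs  = [1.0, -2.0, 0.5, 1.5, -1.0, 0.0, 2.0, -0.5, 1.0, -1.5]
folds = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(round(safety_stock(errs, folds), 3))
```

Running this per store and item combination, and dividing by that combination's average daily demand, gives the percentages plotted in Figures 9 and 10.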

6 Conclusion

For the first objective, fitting an effective model to lower the RMSLE on the test data, three different methods with different model parameters were tested sequentially. Stepwise linear regression gives the highest RMSLE of the three methods. K-nearest neighbors search generates a better result, and ensemble learning provides the best prediction performance. A linear combination of models improves test-data performance even further, although this combination does not generalize, as indicated by its poor performance under cross validation on the training data alone. The variable importance results imply that weather information is not significant in predicting daily sales; instead, time-related features contribute far more and rank among the top features. Thus, although these products are assumed to be weather sensitive, weather does not influence their sales as much as originally supposed. Future research on other machine learning techniques may further improve prediction performance. However, the robustness of the model should always be kept in mind when the predictions are used in business activities such as setting inventory policy.


The second objective allows us to examine the implications of the predictions. Under cross validation, the ensemble tree model proves its robustness. It is natural that MAD decreases with average daily demand, yet products with a rather large MAD compared to others of similar average daily demand may require attention for further model improvement. In addition, the two spikes in MAPE before the aforementioned jump at around 5 average daily sales raise concern; the models behind those spikes should be tested further with other machine learning techniques. Finally, the calculated safety stock, expressed as a percentage of average daily demand, raises the question of whether some products are profitable to keep on store shelves. Although no further data is provided, inventory costs such as the holding cost of high inventory, the obsolescence cost, the ordering cost, the storage space cost, and the transportation cost should all be taken into account when more detailed information about those products becomes available.

7 References

“Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle.” Accessed December 9, 2015.

https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.

“Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm.” Accessed

December 10, 2015. http://www.mathworks.com/help/stats/stepwiselm.html.

“Classification Using Nearest Neighbors - MATLAB & Simulink.” Accessed December 10, 2015.

http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.

Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of

Boosting Papers.” Ann. Statist 32 (2004): 102–7.

“Ensemble Learning - Wikipedia, the Free Encyclopedia.” Accessed December 9, 2015.

https://en.wikipedia.org/wiki/Ensemble_learning.


Nahmias, Steven. Production and Operations Analysis. New York: McGraw-Hill/Irwin, 2009.

8 Appendices

1. Linear combination of one k-nearest neighbors model and one ensemble learning model:

y ~ 1 + x1 + x2

Estimated Coefficients:

Estimate SE tStat pValue

________ _________ ______ ___________

(Intercept) 0.24598 0.044419 5.5377 3.0688e-08

Ensemble learning 0.85715 0.0058669 146.1 0

K-nearest neighbors 0.21063 0.0055461 37.977 1.2375e-314

Root Mean Squared Error: 18.8

R-squared: 0.773, Adjusted R-Squared 0.773
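Linear combinations of this kind can in principle be reproduced by regressing actual sales on the component models' predictions. A minimal least-squares sketch in Python with synthetic stand-in predictions (all data here is made up for illustration; the report's fits were produced in MATLAB):

```python
import numpy as np

def stack_weights(y, component_preds):
    """Fit y ~ 1 + x1 + ... + xk by ordinary least squares, where each
    x_j holds one component model's predictions (a toy stand-in for the
    linear combinations shown in this appendix)."""
    X = np.column_stack([np.ones(len(y))] + list(component_preds))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                                  # [intercept, w1, ..., wk]

# toy target and two imperfect "model" predictions (placeholder data)
rng = np.random.default_rng(1)
y  = rng.uniform(0, 100, 200)
p1 = y + rng.normal(0, 5, 200)                   # stand-in: ensemble learning
p2 = y + rng.normal(0, 15, 200)                  # stand-in: k-nearest neighbors
beta = stack_weights(y, [p1, p2])
print(np.round(beta, 2))
```

As in the table above, the more accurate component tends to receive the larger weight, and the in-sample fit of the blend is never worse than either component alone.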

2. Linear combination of one stepwise linear regression model and three ensemble learning models:

y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:

Estimate SE tStat pValue

________ _________ _______ __________

(Intercept) -0.18353 0.033454 -5.486 4.1145e-08

x1 0.2025 0.0043553 46.495 0

x2 0.33965 0.0049085 69.195 0

x3 -0.36155 0.017711 -20.414 1.5038e-92

x4 0.8802 0.017254 51.014 0

Root Mean Squared Error: 14.3

R-squared: 0.868, Adjusted R-Squared 0.868