s499 kroger final report
TRANSCRIPT
* [email protected] ° [email protected]
Case Study Final Report
Inquiry on factors affecting units sold of products in specific Kroger Stores
Cheng Shi*, Xiaohan Xu°
Indiana University
April 24, 2015
Background: Dunnhumby is a UK-based Customer Service company. Its primary business is analyzing
data from nearly one billion customers worldwide with their willingness and demand in order to assist
corporations to make decisions and maximize their benefits. It is an independent company in the United
Kingdom but owned by Kroger Company in the United States. The Kroger Company is an American
retailer, which is the second-largest general retailer and has the largest revenue of supermarket chain.
Dunnhumby collected the data of customers from Kroger and use appropriate tools to analysis the data
with purpose of aiding Kroger Company better develop business.
1 Introduction
Our team was given an Excel file from Breakfast at the Frat, supported by Dunnhumby, containing sales
and promotional information from a sampling of 79 stores with 58 unique UPC (Universal Product Code)
over 156 week beginning from January 2009 to December 2011. The products were represented by UPC
and classified into 4 categories: BAG SNACKS, COLD CEREALS, FROZEN PIZZA, and ORAL
HYGIENE PRODUCTS. There are 24 variables, which are symbolic names containing some known or
unknown quantity or categorical values, including BASE_PRICE, DISPLAY, FEATURE, PRICE,
STORE_NUM, TPR_ONLY, UNITS, WEEK_END_DATE and UPC, etc. Each variable has a precise
description in the given database. For example, UNITS refers to units sold. PRICE refers to actual amount
charged for the products at shelf. UPC refers to specific identifiers of products. HHS refers to number of
purchasing households. DISPLAY refers to products that were parts of in-store promotional display.
TPR_ONLY refers to temporary price reduction only (i.e., shelf tag only, products were reduced in price
but not on display or in an advertisement). There were four sheets including Glossary, dh_Store Lookup,
dh_Products Lookup and dh_Transaction Data. In Glossary sheet, it listed names and descriptions of all
variables. In dh_ Store Lookup sheet, it included Store ID, name, address, parking space, store size and
average weekly baskets sold in each store. In dh_Products Lookup, it offered UPC, descriptions of
products, manufacturer, subcategory and product size. In dh_transaction data, there were 524,950 rows
and 13 columns. The sheet contains WEEK_END_DATE, STORE NUMBER, UPC, UNITS, VISITS,
HHS, SPEND, PRICE, BASE_PRICE, FEATURE, DISAPLY, and TPR_ONLY. In addition, we were
also given some examples of questions based on the data set. For example, what is the range of prices
offered on products? What will happen to unit sales if the price changes by X%? What is the impact on
sales of promotions, displays, or being featured in the circular? How do the above differ by store price tier
(e.g. upscale stores vs. value stores)? If you were the retailer, for which products would you be more
likely to increase the price to improve profits? Why? We were then motivated and inspired to begin to
look at our data, solve those example questions and come up with our own questions.
We have been given 500,000 data. The best thing is that the database is very fertile, and we can play
around the data and use the data to discover the relationship between units sold and price and also other
factors that affect units in order to give a good prediction to maximize total revenue. The bad thing is that
the dataset is so big that we cannot analyze the whole dataset at the same time. Hence, we decided to
zoom in and focus just one specific store with one specific product, and then zoom out to several stores
with several products.
2
0
20
40
60
80
100
14
-Ja
n-0
9
14
-Fe
b-0
9
14
-Ma
r-0
9
14
-Ap
r-0
9
14
-Ma
y-0
9
14
-Ju
n-0
9
14
-Ju
l-0
9
14
-Au
g-0
9
14
-Se
p-0
9
14
-Oct
-09
14
-No
v-0
9
14
-De
c-0
9
14
-Ja
n-1
0
14
-Fe
b-1
0
14
-Ma
r-1
0
14
-Ap
r-1
0
14
-Ma
y-1
0
14
-Ju
n-1
0
14
-Ju
l-1
0
14
-Au
g-1
0
14
-Se
p-1
0
14
-Oct
-10
14
-No
v-1
0
14
-De
c-1
0
14
-Ja
n-1
1
14
-Fe
b-1
1
14
-Ma
r-1
1
14
-Ap
r-1
1
14
-Ma
y-1
1
14
-Ju
n-1
1
14
-Ju
l-1
1
14
-Au
g-1
1
14
-Se
p-1
1
14
-Oct
-11
14
-No
v-1
1
14
-De
c-1
1
UNITS
2 Statement of Research Problem
By intuition, we were curious about discount, so we simply used base price minus price as discount.
Then, we used Excel to do some essential but vital analysis based on those data sheet. We got the basic
features of data by observing and calculating the maximum, minimum and range of different variables
like units, visits, prices and discount. In addition, we created some useful graphs like histogram,
frequency distribution, scattered plot and time series of those variables. We found that different stores
have different store size and 22 stores do not even have the data of parking space like store 11757 (KY)
Mainstream, store 15531 (OH) Value, Store 4503(TX) upscale and store 4245(IN) Mainstream. Based on
our experiences, Kroger Company not only owns large area of free parking space but also has its own gas
stations in the parking area. Therefore, for those stores without parking space, we considered that it might
due to missing data because Kroger is a supermarket, and parking space is vital element for a store to
attract its customers.
Another interesting finding is that units sold of same UPC might be different at different times. For
example, Table 2.1 includes several selected data of UPC 1111009477 in Store 623 with fixed price
$0.99.
WEEK_END_DATE STORE_NUM UPC UNITS PRICE
4-Mar-09 623 1111009477 57 $ 0.99
26-Aug-09 623 1111009477 46 $ 0.99
2-Dec-09 623 1111009477 28 $ 0.99
30-Dec-09 623 1111009477 36 $ 0.99
9-Mar-11 623 1111009477 29 $ 0.99
30-Mar-11 623 1111009477 14 $ 0.99
For fixed store number (623), UPC code (1111009477) and price ($0.99), units sold is different at
different times. Table 2.1 shows that units sold during week 4-Mar-09 is 57 bags, but units sold during
week 30-Mar-11 is 14 bags. We considered that there should be some factors affect units sold with
unchanged price, store number and UPC.
What’s more, we drew several time series graphs for selected stores and UPC, and it is very obvious
that units sold vary a lot over time for the same UPC at the same store. For example, Figure 2.1 is a time
series graph of UPC 1111009477 in Store 2505.
Table 2.1
Figure 2.1
3
In Figure 2.1, sales of items vary a lot over times based on time series. We observed that there are three
peaks: The first one is between 14-Dec-09 and 14-Jan-10; the second one is between 14-Dec-10 and 14-
Jan-11; and the third one is around 14-Dec-11. The sales significantly increased on Dec 14th for each
year. Conclusion: project is challenging.
Based on those findings, we want to find a focus and define our main purpose for our project. Since
the information we were given of sales of Kroger Company, the largest supermarket chain in USA by
revenue, we stated that the purpose of our team is to maximize total profit. However, we were not given
any costs for any of these stores, so we simplified the purpose to maximize total revenue. Since Kroger
set up their own price for every products, and we don’t know how they determine the price level, we will
consider that quantity is the dependent variable, and price is the independent variable. We wonder what
independent variables will have impact on our dependent variable, units sold. By intuition, our main focus
is the relationship between units sold and price, and we will also consider the relationship between units
sold and other factors such as display, holiday, feature, and discount by building up an appropriate model.
3 Discussion of Mathematical Fields Used in Working On the Problem
In this section, we will discuss our two fundamental mathematical tools to analysis our main focus, the
relationship between units sold and price and the relationship between units sold and holiday. We will
discuss linear regression model in section 3.1 and hypothesis testing in section 3.2. Both discussions will
be theoretical.
3.1 Linear Regression
In statistic, linear regression is an approach for modeling the relationship between a scalar dependent
variable y and one or more explanatory variables (or independent variable) denoted X. The case of one
explanatory variable is called simple linear regression. For more than one explanatory variable, the
process is called multiple linear regression. Simple linear regression functions by assuming that the
variables x and y have a linear relationship within the given set of data. (Stock, James H. Introduction to
Econometrics.) There are three main reasons why we want to use linear regression model to analysis the
relationship between units sold and other factors. The first reason is that linear relationship is the simplest
relationship between two variables. Therefore, we can first assume that units had the linear relationship
with other factors like price. For example, we assume that the linear equation for units and price,
units=slope * price + constant; linear equation for discount, units=slope*discount +constant. By nature
logic, as price increases, units sold should decreases. And as discount increases, the sales will increases.
Therefore, we expect that the slope would be negative for price and positive for discount. Secondly, after
simple regression, we can use multiple regression which allows us to figure out the relationships of units
with more than one explanatory variables. The last reason is for future prediction. According to the linear
regression based on data of three years, we can use the linear equation to predict the which factor has
more and effects on units and which factor may lose its power to explain the units sold as time goes by.
Those analysis are main evidence for us to set future prices and make racial decision and appropriate
adjustment.
We want to introduce an important theorem called “Gauss-Markov Theorem” to help explain our data
and solve our problems. In statistics, the Gauss–Markov theorem was named after Carl Friedrich Gauss
and Andrey Markov, and it assumes that the errors have expectation zero and are uncorrelated and each
error has equal variances in a linear regression model. By Gauss-Markov theorem, given the classical
assumption, the OLS (Ordinary Least Squares) estimators are BLUE (Best Linear Unbiased Estimator).
The theorem says that OLS estimators have minimum variance among linear and unbiased estimators and
hence Gauss-Markov does not say that they are the best of all possible estimators. The theorem does not
depend on any distributional assumptions like normality.
Classical Assumptions includes five parts:
A.1: In the population model, Y is related to X and error e as: Y=𝛽1 + 𝛽2*X+ e, which means the
population is linear.
4
A.2: (𝑥𝑖, 𝑦𝑖), i=1, 2, 3…,n, are random samples from the population model.
A.3: The sample outcome of X, {𝑥1,𝑥2, … … . 𝑥𝑛 } are not all the same values. There are some variation in
the values of xi.
A.4: E (𝑒𝑖|X)=0, for i=1,2…,n, which shows unbiasedness
A.5: Variance (𝑒𝑖)=𝜎2, i=1,2…,n, equal variance of all error 𝑒𝑖, which shows homoskedasticity.
(Stock, James H. 2011 Introduction to Econometrics. Chapter 4.5 The Sampling Distribution of the OLS
Estimators)
3.2 Hypothesis Testing
A statistical hypothesis testing is a method of statistical inference used for testing a statistical hypothesis.
Hypothesis testing contains two hypothesis: the null hypothesis and the alternative hypothesis. The null
hypothesis (Ho) corresponds to the idea that an observed difference is due to chance. The alternative
hypothesis (H1) is that the observed difference is real. In order to measure the difference between the real
case and what is expected on the null hypothesis, a test statistics is used (Freedman, Pisani and Purves.
Statistics. Page 477-481). Since the whole dataset is only a sample dataset, we will use t-test as the test
statistic (we are only able to calculate the sample standard deviation for the given data set). The test
statistics should give us a p-value after calculating by the computer (we can also do this by looking up the
table). We assume that the null hypothesis is right, p-value is not the chance of the null hypothesis being
right. What should we do after calculating the p-value? We should assign our significance level. The
significant level is the criterion used for rejecting the null hypothesis. Normally, the significant level is
denoted by α. The last step for hypothesis testing is how we should interpret it. We call it the decision
rule. Generally, the decision rule is interpreted as following: “If p-value < α, we reject Ho and conclude
that the difference is real; otherwise, we fail to reject Ho.”
For our hypothesis testing, we would like to test whether holiday increases units sold of products.
Furthermore, we would like to test holiday increases the average of units sold. The actual reason we
will introduce in section IV because the reason relates to our observation and thinking process. We set our
null hypothesis and alternative hypothesis:
Ho: µholiday ≤ µnonholiday
H1: µholiday > µnonholiday
µholiday denotes the average units sold on holiday, and µnonholiday denotes the average units sold not
on holiday.
We are required to calculate our t-score for our desired test statistic (t-test) for our hypothesis testing.
For our data set, units sold on holiday and units sold not on holiday have unequal sample sizes and
unequal sample variances, hence we will use the following formula to calculate the t-score.
𝑡 =𝑋1 − 𝑋2
S(𝑋1 − 𝑋2)
𝑋1 is the average of units sold on holiday. 𝑋2 is the average of units sold not on holiday. S(𝑋1 − 𝑋2) is the
standard deviation of (𝑋1 − 𝑋2) .
The formula of calculating S(𝑋1 − 𝑋2) is
S(𝑋1 − 𝑋2) = √𝑆1
2
𝑛1 +
𝑆22
𝑛2
𝑆1 is the standard deviation of units sold on holiday, 𝑆2 is the standard deviation of units sold not on
holiday. 𝑛1 is the sample size of units sold on holiday. 𝑛2 is the sample size of units sold not on holiday.
5
Here, we are assuming that 𝑋1 and 𝑋2 are independent. Otherwise, S(𝑋1 − 𝑋2) needs not be given by the
formulas. Example: 𝑋1 = 𝑋2. Then S(𝑋1 − 𝑋2) = 0.
Since we will use t-test, we should be aware of degree of freedom:
degree of freedom =(𝑆1
2
𝑛1 +𝑆2
2
𝑛2 )2
(𝑆1
2
𝑛1 )2
𝑛1 − 1+
(𝑆2
2
𝑛2 )2
𝑛2 − 1
(PASS Sample Size Software, chapter 425)
We had required mathematical tools and formulas to do our hypothesis test. The further question is
how we could calculate the p-value. At first, we wanted to use MATLAB because it is a useful
mathematical laboratory, and we are using it in our M466 class. However, one big issue of MATLAB is
that it takes dataset as matrices. We had already known that units sold on holiday and units sold not on
holiday have unequal sample sizes. We were not able to do some operations of linear algebra such as dot
product because some operations of linear algebra require units sold on holiday and units sold not on
holiday have equal sample size. Therefore, we decided to use R-studio. R-studio does not have
requirement of sample sizes, and we can see the running process of data on R-studio.
4 The Procedure of Using Required Mathematical Tools
In this section, we will discuss our steps of solving the problem by using linear regression and hypothesis
testing. We will summarize our observations and our analysis. We will introduce our thinking process
related to linear regression and several practical issues in section 4.1. In section 4.2, we will introduce our
procedure of using hypothesis testing. At the end of section 4.2, we will illustrate that our hypothesis
testing indicates some different results compared with linear regression model.
4.1 Linear Regression
Before regression, we defined and added some new variables we were interested to do research like
HOLIDAY_WEEK and DISCOUNT. For HOLIDAY_WEEK, we searched of popular traditional
holidays. music and sports days and finally added some these important days including Super
bowl(football), Oscar(movie), Grammy, NBA final(basketball), MLB final(baseball), VMA(MTV Video
Music Awards), AMA (American music awards), Christmas. Therefore, the definition for
HOLIDAY_WEEK is subjective. And for DISCOUNT, since we were only given BASE_PRICE and
PRICE, we defined BASE_PRICE minus PRICE as DISCOUNT for each product. To simplify our
process, we focus on one store for analysis, so we can just use the same process for other stores. Since
there were only 4 categories, we picked one UPC for each categories and then chose one store which
included 4 categories to run regression.
For example,
Store 367 UPC: 1111087395 FROZEN PIZZA;
Store 367 UPC: 1111038078 MOURTHWASHES;
Store 623 UPC: 1111009477 MINI TWIST PRETZELS;
Store 367 UPC: 1111085350 COLD CEREAL;
We used same method and regressions for the four products. For each UPC, we ran the simple
regressions and multiple regressions. At the beginning, we were not sure which predictors have effect
and how much variation of them can explain the variation of units sold. And we also want to get the most
fitted regression model. We tried using variables including PRICE, BASE_PRICE, DISCOUNT,
HOLIDAY_WEEK, FEATURE, DISPLAY, TPR_ONLY, VISITS and HHS in simple regression model.
However, according to the values of R-square, confidence interval and p-value, the BASE_PRICE and
TPR_ONLY are not helpful for our regression model because their corresponding R-square are 0%,
6
0.33% and p-value is 0.946, 0.478 from data store 367 UPC:1111087395 FROZEN PIZZA. And we did
more for other stores and products, the results are similar. The 0 R-Square indicates that the variation of
those regressions are not helpful to explain the variation of units sold. And the big p-value shows they are
not significant. In addition, HHS means the number of purchasing households and VISITS means the
number of unique purchases (baskets) that included the product. As the number of purchasing household
and their unique purchases that included the product increase, the units sold of the certain product will
definitely increase. HHS and VISITS are 99% correlated with units sold, so they are not the regressions
we are interested to do research. Then, we started to run simple regression (By STATA: Data Analysis
and Statistical Software) like below:
Example1
Store: 367 UPC: 1111038078 PL BL MINT ANTSPTC RINSE PRIVATE LABEL ORAL HYGIENE
PRODUCTS
MOUTHWASHES (ANTISEPTIC) 500 ML
. reg UNITS DISCOUNT
Source| SS df MS Number of obs = 138
-------------+------------------------------ F( 1, 136) = 4.01
Model| 27.3332081 1 27.3332081 Prob > F = 0.0473
Residual| 928.036357 136 6.82379674 R-squared = 0.0286
-------------+------------------------------ Adj R-squared = 0.0215
Total| 955.369565 137 6.97350048 Root MSE = 2.6122
UNITS| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
DISCOUNT| 1.062551 .530906 2.00 0.047 .0126524 2.11245
_cons| 3.833433 .2614647 14.66 0.000 3.316371 4.350496
------------------------------------------------------------------------------
The example I displayed is to explore the relationship between units sold and DISCOUNT for store
367 UPC: 1111038078, MOUTHWASHERES, which is one of categories, Oral hygiene products. There
are three main reasons for choosing this example. Firstly, for all MOUTHWASHES, only two regressors
PRICE and DISCOUNT are useful for our model to explain our sales. And a new problem occurred when
we ran this simple regression (Example1) and multiple regression (Example2). Secondly, it is nature to
think discount can help increase sales. However, based on the data from example1, the fact is not as what
people thought. The third reason is to compare the change of coefficient when we run regression with
PRICE and DISCOUNT separately and when we ran these two regressors together.
From the data, R-square is 2.86%, which means 2.86% of variation of units sold of mouthwashes
(500ml) in store 367, on average, can be explained by the variation of DISCOUNT. Therefore, based the
low R-square, we can conclude that DISCOUNT is not a good predictor for this certain product in this
store. The coefficient is about 1.06, which can be interpreted that as the discount of the product increases
by 1$, on average, the units sold will increase by 1.06. P-value is 0.047.
Another example we listed below is multiple regression for the same product in the same store. We
were faced with some new problems and interesting findings during running this regression. For example,
we ran a regression of MOUTHWASHERES using STATA, a note: “DISPLAY omitted because of
collinearity; HOLIDAY omitted because of collinearity” automatically appeared. Collinearity (or
multicollinear) is a phenomenon in which 2 or more predictor variables in a multiple regression model are
highly correlated.
Example2
Store:367 UPC:1111038078 PL BL MINT ANTSPTC RINSE PRIVATE LABEL ORAL HYGIENE
PRODUCTS
MOUTHWASHES (ANTISEPTIC) 500 ML
. reg UNITS PRICE DISPLAY FEATURE DISCOUNT HOLIDAY
7
note: DISPLAY omitted because of collinearity
note: HOLIDAY omitted because of collinearity
Source| SS df MS Number of obs = 21
-------------+------------------------------ F( 3, 17) = 0.90
Model| 13.5837657 3 4.5279219 Prob > F = 0.4622
Residual| 85.6543296 17 5.03848997 R-squared = 0.1369
-------------+------------------------------ Adj R-squared = -0.0154
Total| 99.2380952 20 4.96190476 Root MSE = 2.2447
------------------------------------------------------------------------------
UNITS| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
PRICE| -4.737992 3.027061 -1.57 0.136 -11.12453 1.648547
DISPLAY| 0 (omitted)
FEATURE| 1.269057 3.149502 0.40 0.692 -5.375811 7.913926
DISCOUNT| -1.900531 2.268131 -0.84 0.414 -6.685868 2.884806
HOLIDAY| 0 (omitted)
_cons| 11.13651 4.748398 2.35 0.031 1.118263 21.15475
------------------------------------------------------------------------------
For this multiple regression model, we used 5 predictors, PRICE, DISPLAY, FEATURE, DISCOUNT
and HOLIDAY_WEEK. However, the R-square is still low at 13.7%. And we observed that the p-values
are very large (over 0.1), which means PRICE, FEATURE and DISCOUNT were not significant in this
model. And then, we tried removing DISPLAY and HOLIDAY, and run regression of UNITS, PRICE,
FEATURE and DISCOUNT. The R-square is 12% and the DISCOUNT is still not significant with 0.382
p-value.
Some other interesting finding were received by comparing the results of run regression of PRICE and
DISPLAY together with separately. When we ran simple regression with units sold and PRICE, the
coefficient of PRICE was -2.206235 with R-square 9.26% and p-value 0. When we ran simple regression
with units sold and DISCOUNT, the coefficient of discount was +1.062551 with R-square 2.86% and p-
value 0. 047. However, if we ran multiple regression using both of PRICE and DISCOUNT as regressors,
the result changed a lot. (See Example3)
Examples3
Store:367 UPC:1111038078 PL BL MINT ANTSPTC RINSE PRIVATE LABEL ORALHYGIENE
PRODUCTS
MOUTHWASHES (ANTISEPTIC) 500 ML
reg UNITS PRICE DISCOUNT
Source| SS df MS Number of obs = 138
-------------+------------------------------ F( 2, 135) = 6.92
Model| 88.7867091 2 44.3933545 Prob > F = 0.0014
Residual| 866.582856 135 6.41913227 R-squared = 0.0929
-------------+------------------------------ Adj Rsquared = 0.0795
Total| 955.369565 137 6.97350048 Root MSE = 2.5336
------------------------------------------------------------------------------
UNITS| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
PRICE| -2.314459 .7480219 -3.09 0.002 -3.793816 -.8351017
DISCOUNT| -.1544378 .647959 -0.24 0.812 -1.43590 1.127026
_cons| 7.570749 1.234216 6.13 0.000 5.129849 10.01165
------------------------------------------------------------------------------
The coefficient of DISCOUNT becomes -.1544378(negative value), which means if the discount of the
product increases, the units sold will decreases, which sounds not reasonable. And compared with the R-
8
square 9.26% of simple regression using PRICE, the R-square 9.29% merely increases 0.03%, which
suggests the variation of DISCOUNT does not help to explain the variation of units sold. In addition, the
big p-value 0.812 of DISCOUNT shows that it is not significant. The reason for the result is that the two
regressors PRICE and DISCOUNT are highly correlated, which means that have similar movement with
each other.
In addition, we also did the regression for same UPC in different stores. We used the same explanatory
variables (PRICE, DISCOUNT, DISPLAY, FEATURE, and HOLIDAY_WEEK) to run regression. And
when we looked at units sold of FROZEN PIZZA UPC: 1111087395 in three different stores 367, 2505,
387, we found that, for same products in different stores, there were different data.
Example4
Store: 367/2505/387 UPC: 1111087395 FROZEN PIZZA
In store 367, R-square is 31.69% for PRICE, 44.57% for DISPLAY, 11.50% for FEATURE and p-values
are 0.
In store 2505, R-square is 43.20% for PRICE, 50.39% for DISPLAY, 39.56% for FEATURE and p-
values are 0.
In store 387, R-square is 42.20% for PRICE, 42.99% for DISPLAY, 25.66% for FEATURE and p-values
are 0.
We could get some results by comparing different data. For example, 31.39% shows that PRICE has
less effect on units sold of FROZEN PIZZA in store 367 than the other two stores. And 50.39% means
Display has more influence on sales of FROZEN PIZZA for store 2505 than the other two.
4.2 Hypothesis Testing
Before hypothesis testing, we first defined holiday. We defined Super Bowl, Grammy, Oscar, NBA Final,
VMA, MLB Final, AMA and Christmas to be holiday. All of these special dates except for Christmas are
Americans’ habits. Almost everyone watches the live of NBA, Super Bowl, etc. Christmas is a traditional
western holiday. Some other traditional holidays such as Easter, St. Patrick, we did not include them
because they are not as popular as Christmas.
After defining holidays, we randomly selected several products and stores because we could not do
hypothesis testing for all products in all stores, One major issue is that how we determine which stores
and which products that we will use for hypothesis testing. We cannot look at data and randomly choose
them because the way that we manually randomly choose data has bias. We decided to choose 10
products among four categories, so we gave each UPC code a number from 1 to n, where n is total
number of products among categories, and used excel command =RANDBETWEEN(1,n). This excel
command randomly gave us an integer between 1 to n, and then we chose the UPCs according to the
randomly selected number. After choosing the required UPCs, we would need to randomly select stores.
We gave each store a number from 1 to 79 and used the same randomly selected method with excel
command =RANDBETWEEN(1,79).
We randomly selected UPC codes, and then we randomly selected 10 stores according to the UPC
code that we chose. After observing the distribution of the selected data, it seems that units sold of
products on holiday is different from units sold of product not on holiday. Hence, we made a guess that
holiday affect units sold.
In order to simplify our hypothesis testing, we randomly reselected one product, frozen pizza with
UPC code 1111087395, and 10 stores that sell this product. We compared units sold on holiday and units
sold not on holiday of our selected product in different selected stores, and we found that units sold of
products on holiday is different from units sold of product not on holiday. Furthermore, we observed that
units sold of products on holiday is more than units sold of production not on holiday for most of stores,
but for some stores units sold of products on holiday is less than units sold of production not on holiday,
such as units sold in store 613, 623 and 2495, but the situation that units sold of products on holiday is
less than units sold of production not on holiday happens less frequently than the situation that units sold
of products on holiday is more than units sold of production not on holiday. However, these were our
9
conclusions based on our observations, and we were not certain that our conclusions were right by only
observing the data. Hence, we claim that holiday increases units sold and would like to use hypothesis
testing to prove our claims. Later on, we drew the frequency curve for units sold on holiday and units sold
not holiday. For example, Figure 4.2.1 and Figure 4.2.2 are frequency curves of units sold on holiday and
units sold not holiday for store 387 and Store 2277. The blue line represents units sold on holiday, and the
orange line represents units sold not on holiday.
The average units sold on holiday is higher than average units sold not holiday in
most of time, but not all the time. We observed that the average (mean = 15.30) of units sold on holiday is
more than the average (mean = 10.41) units sold not on holiday for store 387. However, the average
(mean = 10.70) of units sold on holiday is less than the average (mean = 15.30) units sold not on holiday
for store 2277. There are 8 stores that the average units sold on holiday is higher than average units sold
not holiday in our randomly selected 10 stores. We claim that the average units sold on holiday is higher
than average units sold not holiday frequently. Hence, we changed our test to testing holiday increases the
average of units.
Next, we were able to set our null hypothesis and alternative hypothesis:
Ho: µholiday ≤ µnonholiday
H1: µholiday > µnonholiday
µholiday denotes the average units sold on holiday, and µnonholiday denotes the average units sold not
on holiday.
We then calculated test statistics used R-studio to calculate the p-value. Since our focus is µholiday >
µnonholiday, our hypothesis testing is a 2 sample t-test with different sample size and right-tail test.
Therefore, the formula for p-value should be:
p-value = 1 - probability corresponding to our t-score.
Table 4.2.1 shows t-scores and p-values for 4 stores.
Store Number t-score p-value
387 0.89 19.02%
623 -0.64 73.66%
2277 -1.16 87.34%
2496 4.59 0.003%
The last step is to state our decision rule. For example: suppose we chose α=5%.
Normally, statisticians choose 5% to be the significance value, but it varies depending on situations.
a. For store 387, p-value=19.02% > α=5%, we fail to reject Ho.
Figure 4.2.1 Figure 4.2.2
Table 4.2.1
10
b. For store 623, p-value=73.66% > α=5%, we fail to reject Ho.
c. For store 2277, p-value=87.34% >α=5%, we fail to reject Ho.
d. For store 2495, p-value=0.003%< α=5%, we reject Ho and conclude that µholiday > µnonholiday..
After doing hypothesis testing, we summarized that holiday does not increase average units sold
significantly. The significant level we chose is α=5%. We showed four stores above, and only one store
has significant evidence to prove µholiday > µnonholiday, which implies that holiday increases average
units sold. Hence, we concluded that we do not have significant evidence that holiday increases average
units sold in this products. There exists some relationship between holiday and units sold, but units sold
also depends on the store, the UPC code, etc. Most of these hypothesis testing suggest that holiday has a
weak influence on sales.
5 The Solution of linear regression and hypothesis testing
For some of regression model, not all expected value of error for each x is 0, which violated the classical
assumption of OLS estimators. In classical assumption (A.4), E(ei|Xi)=0 shows unbiasedness. Therefore,
some estimators of parameters in our regression models are biased and the regression model is not fitted,
hence those models are not meaningful for us to do further research. For example, for both simple and
multiple regressions, collinearlity problems occur for HOLIDAY_WEEK and FEATURE of
MOUTHWAHSERS for store 367. The same problem with the multiple regression model using
DISOCUTN and PRICE of MOUTHWAHSERS for store 367. During the creation of regression models,
collinearity can occur. Although the overall model still can have good predictability, the collinearity will
cause invalid results for individual predictors. However, we still need to include them to run multiple
regression as our solution because highly correlated variables does not mean they are perfectly correlated
and when we add DISCOUNT to run regression, the coefficient of DISCOUNT changed a lot. If we just
ignore the effect of DISCOUNT and remove it from multiple regression model, we will overestimate the
effect of PRICE.
From the linear regression model, we observed that the dummy variable HOLIDAY_WEEK does not
fit a regression line. The correlation of units sold and HOLIAY_WEEK is very low which implies that the
linear relationship between units sold and HOLDAY_WEEK is very weak.
For hypothesis testing, we chose α=5% and we rejected 1 of 10 store and conclude that we do not have
significant evidence µholiday > µnonholiday. We recall that null hypothesis testing indicates that the
difference between two means are due to chance, introduced in section 3.2. For our case, we may
conclude that holiday affects units sold is due to chance, specifically holiday increases units sold is due to
chance. The reason that we may conclude but not confidently conclude is that we only did hypothesis
testing for on UPC in 10 stores, so we cannot draw a conclusion among all UPCs and all stores. There
exists some relationship between holiday and units sold that holiday affect units sold. Nevertheless, this
relationship is weak.
In summary, we conclude that linear relationship cannot describe the relationship between units sold
and holiday, and our linear regression model does not fit to predict how holiday affect units sold.
However, this does not mean that there does not exist relationship between units sold and holiday. In
summary, we concluded that there exists relationship between units sold and holiday, but it is not linear
relationship.
6 The implications of the research
In conclusion, although the linear regression suggests the holiday week does not have influence on units
sold of a product, but that does not mean there is no relationship between them. Based on the data of
hypothesis testing, the increase of units sold affected by HOLIDAY_WEEK is not significant. Therefore,
11
we might conclude that the factor of HOLIDAY_WEEK may affect the units sold for some categories of
product like small snacks but the effect is weak.
The first limitation is that we are unable to run all simple and multivariable regression for all different
products in different store. Therefore, what we can do is that we picked three stores and three products
based on category, which allows us to compare one product in different stores and different product in the
same store.
The second limitation is inaccurate definition for new added variables, HOLIDAY_WEEK and
DISCOUNT. And the problem occurs that there are several discounts are negative numbers, which sounds
unreasonable, but we still used them when run our regression.
The third limitation is that we cannot guarantee that our hypothesis testing is optimal. We only did
hypothesis testing for one UPC and 10 stores. Are we certain that our selected stores have reasonable
data? We did not consider outliers in our randomly selected sample. Hence, outliers may affect our
hypothesis testing.
The last limitation is our limited mathematical knowledge. None of us knows how to use other
mathematical or statistical methods to deal with the unknown relationship between holiday week and
units sold.
Acknowledge: Thanks Indiana Statistical Consulting Center for technology supports. Thanks Professor
Kevin Pilgrim for being our mentor. Thanks Professor Natasha Blitvic and Professor Bibo Jiang for being
our consultants.
12
Citation
Dunnhumby Official Website. http://www.dunnhumby.com/us/what-we-do
"Select Your Location." What We Do. N.p., n.d. Web. 23 Apr. 2015
Freeman, David A. "Statistical Models: Theory and Practice" 2 Edition ed.
N.p.: Cambridge UP;, April 27, 2009. Print.
Freedman, David, Robert Pisani, and Roger Purves. “Statistics”.
New York: Norton, 1978. Print.
"Sample Size & Power." NCSS. N.p., n.d. Web. 23 Apr. 2015.
Stock, James H. “Introduction to Econometrics.”
Upper Saddle River, NJ: Pearson Education, 2011. Print.
Wooldridge, Jeffrey M. “Introductory Econometrics: A Modern Approach.”
Mason, OH: South Western, Cengage Learning, 2009. Print.
Anawis, Mark A. "Dealing with Collinearity."
Scientific Computing. N.p., 15 Mar. 2013. Web. 23 Apr. 2015.
"Stata: Data Analysis and Statistical Software." Data Analysis and Statistical Software.
Web. 16 Apr. 2015.