isds 473_project 1_final_2


Upload: claire-chen

Post on 17-Aug-2015


Data Analysis Project for Monthly Total Houses Sold in the United States

Presented to

Dr. Zerom

Prepared by

Ling Wang

Szu Tung Chen

Sarun Sangarungroj

Spring 2015


Table of Contents

Executive Summary

1. Professional Use of Forecast Pro XE

2. Exploring the Data

3. Conclusion


Executive Summary

We are given a time series, “Monthly total houses sold in the United States for the period January 1978 through July 2007”, and our job is to analyze the data pattern and forecast monthly houses sold in order to help managers make decisions.

Using the Expert Selection model, the forecasts in the fitted period are close to the actual data. However, the gap between the forecasts and the actual data gradually widens in the holdout period. We further examined the forecasts by calculating the fit measures MAPE and RMSE and the accuracy measures MAPE and GMRAE. We found that the large gap between the forecasts and the actual data may be caused by the housing market crash in 2007. Because of this financial crisis, housing sales dropped suddenly; Forecast Pro could not foresee this and therefore produced an inaccurate forecast. We suggest that managers increase the holdout period to 3-4 years, or use another forecasting model to get more accurate results.

From a crude visual inspection, we can only tell that the series is not stationary. To identify the trend and seasonality, we use the autocorrelation function, which displays the correlation of the monthly data with its own lagged values. To further illustrate the trend and seasonal pattern, we use differencing to remove the effect of one pattern and examine whether the other pattern exists. From the column graph, we found that the highest number of houses sold each year occurs in March. A possible reason is that people try to reduce their tax payment before April, and a home loan could be a good way to do that. After experimenting with simple and seasonal differencing in the autocorrelation functions, we conclude that the time series is correlated, with a slowly decreasing trend and annual seasonality.

We were also asked to make the time series stationary. Since this time series has both trend and seasonality, both simple and seasonal differencing are applied to remove them. For any further forecasting, we recommend that the manager pick forecasting models that can handle time series with both trend and seasonality.

The following report explains each step we took, from the start to the conclusion, including the analysis in detail.


1. Professional Use of Forecast Pro XE

We analyze the data of monthly total houses sold in the United States for the period from January 1978 to July 2007. After withholding the data from August 2005 to July 2007, we use the Expert Selection method in Forecast Pro to generate forecasts. In Exhibit 1-1, the red line is the forecast and the green line is the actual data. We also apply the same forecasting model to the fitted period.
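The withholding step can be sketched in a few lines of Python. This is only an illustration of the split, not Forecast Pro's own procedure; the function name `split_holdout` and the placeholder values are our assumptions, while the month counts follow from the date ranges stated above.

```python
def split_holdout(series, n_holdout):
    """Split a series into a fitted part and a withheld (holdout) part."""
    if not 0 < n_holdout < len(series):
        raise ValueError("n_holdout must be between 1 and len(series) - 1")
    return series[:-n_holdout], series[-n_holdout:]


# January 1978 through July 2007 is 29 full years plus 7 months.
n_months = 29 * 12 + 7   # 355 observations
n_holdout = 2 * 12       # August 2005 through July 2007

# Placeholder values standing in for the real series (hypothetical data).
series = list(range(n_months))

fitted, holdout = split_holdout(series, n_holdout)
print(len(fitted), len(holdout))  # 331 24
```

The model is estimated on the 331 fitted observations only, and the 24 withheld observations are used afterwards to judge forecast accuracy.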

Exhibit 1-1: Monthly Total Houses Sold in the United States from January 1978 to July 2007

Exhibit 1-1 gives a general picture of the forecasting results of the Expert Selection model. It allows a direct comparison of the forecasts with the actual data in both the fitted period and the holdout period.

In the fitted period, the graph shows an impressive historical fit. The fitted forecasts (in red) are a near-perfect mirror of the actual data (in green). However, in the holdout period, the forecasts (in red) gradually move far away from the actual data (in green), and the actual data even drops below the lower confidence limit (in blue). Therefore, the Expert Selection model appears to fit very well in the fitted period but is not very accurate in the holdout period.

To test this impression, we compare the forecasts with the actual data and calculate forecast errors using the mean absolute percentage error (MAPE) and root mean square error (RMSE) for the fitted period, and MAPE and the geometric mean relative absolute error (GMRAE) for the holdout period. The results are in Exhibit 1-2.


Exhibit 1-2: Results of MAPE and RMSE of Fit Measures and MAPE and GMRAE of Accuracy Measures

                     MAPE      RMSE    GMRAE
Fit Measures         6.34%     4.59    -
Accuracy Measures    32.41%    -       0.75

Mean absolute percentage error (MAPE) measures the deviation as a percentage of the actual data. For the fit measures, the MAPE tells us that the forecast errors of the Expert Selection method for monthly total houses sold are 6.34% on average. Compared to the MAPE of the accuracy measures, this value is quite low, only about one fifth as large. RMSE is based on squared forecast errors, so it recognizes that large errors are disproportionately more “expensive” than small errors. The RMSE indicates that the forecast in the fitted period is typically off by about 4.59, in the units of the series, which is generally acceptable.
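For readers who want to reproduce these two measures outside Forecast Pro, the following is a minimal sketch using the standard textbook definitions. The function names and the three-point example data are hypothetical and are not taken from the housing series.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(actual, forecast):
    """Root mean square error, in the same units as the data."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values, for illustration only.
actual = [100.0, 80.0, 50.0]
forecast = [90.0, 84.0, 55.0]

print(round(mape(actual, forecast), 2))  # 8.33
print(round(rmse(actual, forecast), 2))  # 6.86
```

In this example the single error of 10 contributes 100 of the 141 total squared error, which is how RMSE gives large errors more weight than MAPE does.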

For the accuracy measures, the MAPE tells us that the forecast error is on average 32.41% of the actual number of houses sold, which is extremely high. The GMRAE is 0.75, which means that the forecast errors of the Expert Selection method are 75% of the corresponding forecast errors of the random walk model. This value is in the normal range: a GMRAE below 1 indicates that the model performs better than the naïve model.
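The GMRAE can be sketched as the geometric mean of the ratios between the model's absolute errors and the naïve model's absolute errors. In the sketch below we assume that the naïve (random walk) forecast for every holdout month is the last value of the fitted period; Forecast Pro's exact benchmark may differ, and the data are hypothetical.

```python
import math


def gmrae(actual, forecast, naive_forecast):
    """Geometric mean of |model error| / |naive error|.

    Assumes no error is exactly zero (log and division would fail).
    """
    ratios = [
        abs(a - f) / abs(a - n)
        for a, f, n in zip(actual, forecast, naive_forecast)
    ]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical holdout data, for illustration only.
last_fitted_value = 110.0
actual = [100.0, 90.0, 80.0]
forecast = [105.0, 100.0, 95.0]
naive = [last_fitted_value] * len(actual)  # assumed random walk benchmark

print(round(gmrae(actual, forecast, naive), 2))  # 0.5
```

Here every model error is half the corresponding naïve error, so the GMRAE is 0.5. A value of 0.75, as in Exhibit 1-2, would be read the same way: the model's errors are typically three quarters of the naïve model's errors.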

From both the visualization and the error analysis, we reach the same result: with the Expert Selection model, the forecasts in the holdout period are not as accurate as those in the fitted period. This result may be caused by the housing market crash in 2007. The housing market crash of 2007 was the worst housing crash in U.S. history, and it was the cause of the financial crisis. This crisis nearly caused the U.S. to experience another depression like the Great Depression. United States housing prices experienced a major market correction after unsustainable valuations peaked in 2005-2006. Early signs of trouble appeared when some types of housing loans started to go into default in 2006. Once the credit markets froze in the summer of 2007, the situation became serious and the entire housing market fell dramatically.

This crisis was hard to forecast because it completely changed the trend of the housing market data. Our forecast model cannot follow the change. It still uses the original pattern to predict the data and thus generates an inaccurate forecast.

To better forecast the number of houses sold monthly after 2007, we suggest using a forecasting model other than the Expert Selection model. If the managers still prefer the Expert Selection model, they can apply it with a longer holdout period, such as three or four years, to increase its accuracy.


Exhibit 1-3: Historical Value and Forecast Value within Holdout Period (In Thousands of Units)

Based on the measures we obtained from Forecast Pro and Exhibit 1-2, we conclude that the forecast is not accurate enough, even though it is reasonably accurate for the first 5 months.

2. Exploring the Data

Exhibit 2-1: Monthly Total Houses Sold from January 1978 to July 2007

To select a forecast model that can better predict monthly houses sold, we must first identify the data pattern in Exhibit 2-1.

By visually inspecting the data, we can see that the time series is clearly not stationary, but it is hard to identify a trend or any seasonal pattern. The series declines in the first 5 years, then grows and stays at almost the same average level for around 10 years. After that, the series increases slightly for around 15 years and then starts to drop again.

Instead of showing the historical data as a green line, Exhibit 2-2 uses columns in different colors to highlight the monthly data.


Exhibit 2-2: Yearly Total Houses Sold in US from 1997 to 2006

Exhibit 2-2 shows a significant seasonal pattern in the time series. The highest total of houses sold in the US appears every March. It declines slowly from April and then drops to its lowest point in December. The peak in house sales in March may be caused by the tax season. Usually, people try to reduce their tax payment before April, and a home loan can be a good way to obtain a tax deduction. In addition, the low points from November to January may be caused by the holiday season. During this period, people spend heavily on gifts or travel instead of houses.
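The seasonal pattern read from the column graph can also be checked numerically by averaging the observations for each calendar month. The sketch below uses a hypothetical two-year series shaped like the pattern described above; the function name `average_by_month` and all values are our assumptions, not the actual housing data.

```python
def average_by_month(values, start_month=1):
    """Average the observations that fall in each calendar month (1..12)."""
    totals = [0.0] * 12
    counts = [0] * 12
    for i, value in enumerate(values):
        month_index = (start_month - 1 + i) % 12
        totals[month_index] += value
        counts[month_index] += 1
    return {m + 1: totals[m] / counts[m] for m in range(12) if counts[m]}


# Hypothetical monthly values for two years, starting in January.
year = [50, 55, 70, 68, 66, 64, 60, 58, 54, 52, 46, 42]
values = year + [v + 2 for v in year]

averages = average_by_month(values)
peak_month = max(averages, key=averages.get)
low_month = min(averages, key=averages.get)
print(peak_month, low_month)  # 3 12
```

With the real series, a peak at month 3 and a trough at month 12 would agree with what Exhibit 2-2 shows visually.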

However, summarizing the data by visualization alone is not credible enough, so we use autocorrelation to examine the data further. Autocorrelation is the correlation between a variable and itself lagged one or more periods. If the autocorrelation is high for the first few lags and slowly tends towards zero, it indicates a strong trend in the data. It is therefore a more rigorous way to check whether a trend or seasonality exists in the time series.
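As a rough illustration of this definition, the sample autocorrelation at a given lag can be computed as follows. This is the usual textbook estimator and is not necessarily the exact formula used by Forecast Pro XE; the trending series is synthetic.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself `lag` periods earlier."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# A synthetic series with a pure linear trend (hypothetical data).
trend = [float(t) for t in range(100)]
acf = [autocorrelation(trend, lag) for lag in range(1, 13)]

# High at lag 1 and declining slowly: the signature of a trend.
print(round(acf[0], 2))  # 0.97
```

For this trending series the autocorrelation starts close to 1 and declines slowly with the lag, which is the behavior described in the paragraph above.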

Exhibit 2-3: Autocorrelation Function

Exhibit 2-3 shows the autocorrelation function calculated by Forecast Pro XE. From the autocorrelation function, we can see a negative trend: the autocorrelation coefficient gradually declines from more than 0.9 at lag 1 to around 0.4 at lag 48. In addition, there seems to be a slight seasonal pattern in the data because there are spikes at lags 12, 24, 36, and 48. However, it is not clear whether a seasonal pattern exists.


To clearly illustrate the trend and seasonal pattern in the time series, we use the differencing method to remove the effect of one pattern and verify if another pattern exists.  

We first examine the existence of a trend. We take first-order seasonal differencing to remove possible seasonal effects from the autocorrelation function. The result is shown in Exhibit 2-4.
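Seasonal differencing of monthly data subtracts from each observation the value of the same month one year earlier. A minimal sketch, assuming a lag of 12 for annual seasonality and using a synthetic, purely seasonal series (hypothetical values):

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A purely seasonal series: the same 12-month pattern repeated for 3 years.
pattern = [50, 55, 70, 68, 66, 64, 60, 58, 54, 52, 46, 42]
seasonal_only = pattern * 3

seasonally_differenced = difference(seasonal_only, lag=12)
print(set(seasonally_differenced))  # {0}
```

Because the annual pattern cancels out, whatever structure remains in the autocorrelation function after this step cannot be attributed to seasonality.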

Exhibit 2-4: Autocorrelation Function with Seasonal Differencing

From this correlogram, we can see that after removing the seasonal effects, the autocorrelation drops sharply from 0.8 at lag 1 to zero at lag 12. From lag 13 to lag 35, the autocorrelation rises slightly three times but eventually decreases back to zero. For the remaining lags, it even becomes negative. In other words, this time series can be considered correlated, and this correlation shows a clear decreasing trend.

Second, to determine whether there is a seasonal pattern in this time series, we take first-order simple differencing to remove possible trend effects from the autocorrelation function. The result is shown in Exhibit 2-5.

Exhibit 2-5: Autocorrelation Function with Simple Differencing


From the above correlogram, we can clearly see significant positive spikes every 12 lags, which indicates clear annual seasonality. These spikes are also statistically different from zero. Between the positive spikes, there are also 5 negative spikes that tend to follow the same pattern, but unlike the positive spikes every 12 lags, they are not significant.
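Whether a spike is statistically different from zero is commonly judged against an approximate 95% limit of ±1.96/√n, where n is the number of observations. The sketch below uses that common approximation; the limits drawn by Forecast Pro may be computed differently, and the autocorrelation values are hypothetical rather than read from Exhibit 2-5.

```python
import math


def significance_bound(n_obs, z=1.96):
    """Approximate 95% limit for a sample autocorrelation of white noise."""
    return z / math.sqrt(n_obs)


def significant_lags(acf_values, n_obs):
    """Return the lags (starting at 1) whose autocorrelation exceeds the limit."""
    bound = significance_bound(n_obs)
    return [lag for lag, r in enumerate(acf_values, start=1) if abs(r) > bound]


# 355 monthly observations lose one value after simple differencing.
n_obs = 355 - 1
bound = significance_bound(n_obs)

# Hypothetical autocorrelations for lags 1..12 (not the report's values).
acf_values = [0.05, -0.08, 0.02, -0.03, 0.01, -0.09,
              0.00, -0.04, 0.03, -0.07, 0.06, 0.45]

print(significant_lags(acf_values, n_obs))  # [12]
```

With roughly 354 observations the limit is a little above 0.10, so only the lag-12 spike in this example would count as significant.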

To make the time series stationary, we need to remove both trend and seasonality. To achieve that, we take both first-order simple and first-order seasonal differencing, as shown in Exhibit 2-6.
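The combined effect of the two differencing steps can be sketched on a synthetic series built from a linear trend plus a repeating 12-month pattern. The series and the slope of 2 per month are hypothetical; the point is only that applying simple differencing and then seasonal differencing removes both components.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic series: linear trend plus a repeating annual pattern.
pattern = [50, 55, 70, 68, 66, 64, 60, 58, 54, 52, 46, 42]
series = [2 * t + pattern[t % 12] for t in range(36)]

simple = difference(series, lag=1)      # removes the linear trend
both = difference(simple, lag=12)       # then removes the annual pattern

print(len(series), len(simple), len(both))  # 36 35 23
print(set(both))                            # {0}
```

Each differencing step shortens the series (by 1 and by 12 observations here), which is worth keeping in mind when counting the observations behind a correlogram such as Exhibit 2-6.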

Exhibit 2-6: Autocorrelation Function with both Simple and Seasonal Differencing

In the above correlogram, only 4 significant autocorrelations remain, at lags 10, 12, 27, and 34. Compared to the original ACF in Exhibit 2-3, we can conclude that the data no longer has any trend or seasonality, and thus it is now stationary.

As a result of using autocorrelation analysis to explore the time series, we conclude that there is a clear trend and a significant annual seasonal pattern in monthly houses sold in the U.S. from January 1978 to July 2007. Based on this result, we suggest that managers apply a forecasting model that can handle time series with both trend and seasonality.

3. Conclusion

The Expert Selection model with a 2-year holdout period does not forecast accurately because of the unpredictable financial crisis, which caused a sudden drop in house sales in the holdout period. Increasing the holdout period from 2 years to 3-4 years shows a much better forecasting result.

After taking first-order simple and seasonal differencing, we identified that this time series is non-stationary and correlated, with a slowly decreasing trend and annual seasonality. Moreover, from the column graph, we found that housing sales peak each year in March, possibly because people try to reduce their taxes before April by taking out home loans and buying houses.


To make the time series stationary, both simple and seasonal differencing should be applied to remove any existing trend and seasonality. Lastly, the managers should pick forecast models that can handle time series data with trend and seasonality.
