ds-240 final project movie budgets vs. success · 2018-12-08 · ds-240 final project – movie...


DS-240 Final Project – Movie Budgets vs. Success

Introduction

Hypothesis / Problem Statement

The question proposed: is there any correlation between the budget spent to produce a movie and the movie’s success? The first step, other than finding a solid data set, is determining how to measure success. Several variables are strong candidates. The first is straight “Revenue”, the simplest metric, which is widely reported and common to most data sets. The second is the Popularity of a movie, which is available through several of the movie database sites. The third is Vote_Average, the average viewer rating from several of the movie database sites. The fourth and fifth are calculated variables: a profit (Revenue - Budget) and an ROI (Revenue / Budget). All five are explored in the following analysis.

Data Set Selection & Cleansing

After a review of multiple movie data sets, there were several potential candidates in a wide range of sizes. The selected data set is a mid-sized data set from the tmdb movie database, posted on www.kaggle.com (link: https://www.kaggle.com/kevinmariogerard/tmdbmovies). It has 10.9k rows and the 21 columns listed below:

- Id (tmdb_id) (Qual.)

- Imdb_id (Qual.)

- Popularity (tmdb site data) (Quant.)

- Budget (Quant.)

- Revenue (Quant.)

- Original_title (Qual.)

- Cast (Qual.)

- Homepage (Qual.)

- Director (Qual.)

- Tagline (Qual.)

- Keywords (Qual.)

- Overview (Qual.)

- Runtime (Quant.)

- Genres (Qual.)

- Production_companies (Qual.)

- Release_date (Quant.)

- Vote_count (tmdb site data) (Quant.)

- Vote_average (tmdb site data) (Quant.)

- Release_year (Qual.)

- Budget_adj (Quant.)

- Revenue_adj (Quant.)

As part of data cleansing, several columns that were not necessary for this analysis were dropped. Several columns had holes (NULL values or zeros), but mostly at the low tail end of the data, which was going to be dropped anyway. The threshold settled on was to remove any movie with a recorded budget or revenue lower than $1,000. As a net result, only two values remained NULL or zero. Those values were looked up manually and updated. The cleaned data set landed at about 3.8k rows with 13 of the original columns remaining. The selected columns are shown below, along with two added derived values (*):

- Id (tmdb_id)

- Imdb_id

- Popularity (tmdb site data)

- Budget

- Revenue

- Original_title

- Runtime

- Genres

- Production_companies

- Release_date

- Vote_count (tmdb site data)

- Vote_average (tmdb site data)

- Release_year

- *Pure_Gain (Revenue-Budget) (Quant.)

- *Return_Ratio (Revenue/Budget) (Quant.)
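
The cleansing threshold and the two derived columns can be sketched in a few lines. The original work was done in R; the Python below is only an illustrative sketch, and the rows, the lowercase field names, and the helper `clean_and_derive` are hypothetical stand-ins, not the author's code or data.

```python
# Toy rows standing in for movies.csv; every value here is made up.
movies = [
    {"original_title": "Film A", "budget": 15_000, "revenue": 9_000_000},
    {"original_title": "Film B", "budget": 500, "revenue": 2_000_000},  # budget below threshold
    {"original_title": "Film C", "budget": 40_000_000, "revenue": 0},   # revenue missing (zero)
    {"original_title": "Film D", "budget": 20_000_000, "revenue": 50_000_000},
]

THRESHOLD = 1_000  # dollars, the cut-off stated in the text


def clean_and_derive(rows, threshold=THRESHOLD):
    """Drop rows whose budget or revenue is below the threshold,
    then add the two derived columns to the rows that remain."""
    cleaned = []
    for row in rows:
        if row["budget"] < threshold or row["revenue"] < threshold:
            continue
        enriched = dict(row)
        enriched["pure_gain"] = row["revenue"] - row["budget"]     # profit
        enriched["return_ratio"] = row["revenue"] / row["budget"]  # ROI
        cleaned.append(enriched)
    return cleaned


cleaned = clean_and_derive(movies)
for row in cleaned:
    print(row["original_title"], row["pure_gain"], row["return_ratio"])
```

Filtering before dividing also avoids a division by zero when a budget is recorded as 0.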


Description of Data

Once the data set was loaded into RStudio as movies.csv, a summary function was executed to produce a high-level data summary. The following excerpts show the results for the previously selected “success” variables and others.

Key Variable Ranges for our success variables and others (from summary(movies)):

Derived Variable Ranges from summary function:

Next up, a look at the selected methodology and how it unfolded for this analysis.

Methodology

For this analysis, the main focus will be on the quantitative methodology. Below is an outline of the approach through the three phases of data analysis.

Raw Data:

- Develop Hypothesis

- Data Selection

- Cleanse the data

Quantitative Analysis:

- Selection of significant variables

- List assumptions / Qualitative Factors

- Visual inspection and summary review of data

- Statistical analysis of significant variables

- Check for data Bias

Meaningful Information:

- Visualization development

- Present Findings

- Modify Hypothesis (if needed) and repeat

- Implement Solution / Results


Key Base Assumptions

Some key assumptions are being made with this data set. First and foremost, the budget and revenue figures are assumed to be accurate. Second, dropping the low-end budget and revenue records is assumed to have minimal to no effect on the analysis outcome. Third, genre is not included as a delineator for this analysis, and it is assumed that this will have no appreciable impact. Finally, it is initially assumed that qualitative metrics have minimal impact and that the quantitative metrics will be enough to achieve a reasonable correlation.

Preliminary Analysis

Visual Inspection and Summary Data Review

The first look is at how the data is distributed by release year (Release_Year), grouped by decade (see chart below). The most recent decade (the 2010s), as noted from the summary function, only includes the years 2010-2015, so it covers 60% of the span of the other segments. This explains the dip in the number of movies released in that decade. It would be anticipated that the trend seen in earlier decades would continue if a full decade of data were available (assumption).
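
The decade grouping and the 60% figure can be reproduced with simple arithmetic. This is a minimal Python sketch rather than the original R code; the list of release years is made up and `decade_label` is a hypothetical helper.

```python
from collections import Counter


def decade_label(year):
    """Map a release year to its decade label, e.g. 2015 -> '2010s'."""
    return f"{year // 10 * 10}s"


# Made-up release years, for illustration only.
release_years = [1999, 2003, 2007, 2010, 2012, 2015, 2015]
movies_per_decade = Counter(decade_label(y) for y in release_years)

# The data set covers 2010 through 2015 inclusive, i.e. 6 of 10 years.
years_covered = 2015 - 2010 + 1
coverage = years_covered / 10

print(movies_per_decade)
print(f"2010s coverage: {coverage:.0%}")
```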

Returning to the hypothesis statement (is there any correlation between budget spent and movie success?), let’s take a look at the ggplot2 graphs of budget versus each of the “success” metrics.


Revenue, Popularity and Profit (Pure_Gain) show some directional correlation tendencies, while Vote_Average shows a very slight correlation, if any. It is obvious that the ROI (Return_Ratio) is being skewed by two extreme outliers, both very successful low-budget films (Paranormal Activity and The Blair Witch Project). Both movies had budgets between $10k and $15k but made 12,000 and 9,000 times their budgets, respectively. The next closest value was 700 times.

Correlation and Linear Regression Analysis

Let’s dig a little deeper with a data correlation matrix and some linear regression models for budget and revenue against the other success metrics (shown on the next page).


Some moderate, almost strong, correlations are seen in the 0.5-0.7 range, with the key success variables highlighted in light yellow above. As expected, Revenue to Budget is the strongest at 0.69, followed closely by Revenue to Popularity at 0.61.
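
For readers who want to see what a single cell of such a correlation matrix measures, here is a minimal Python sketch of the Pearson correlation coefficient, assuming that is the coefficient behind the matrix (the text does not say which method was used). The budget and revenue figures are made up, and `pearson_r` is a hypothetical helper rather than the R call used in the original analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up budgets and revenues in $ millions, for illustration only.
budget = [10, 20, 30, 40, 50]
revenue = [12, 45, 40, 110, 95]

r = pearson_r(budget, revenue)
print(f"r = {r:.2f}")
```

The coefficient always falls between -1 and 1; values near 0.5-0.7, as reported above, indicate a moderate positive linear relationship.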

From a linear regression model standpoint, let’s look at the budget first, as our main analysis variable, and then at revenue.

51% of the variance in Budget is explained by the selected variables. The selected “success” variables also show strong significance, with varied levels of impact on explaining the budget.

The revenue linear regression model shows similar variable significance (on the following page).


The revenue model is the strongest of all the linear regression models in the analysis, with the highest R-squared: 61% of the variance in Revenue is explained by the selected variables.
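
To make the R-squared figure concrete, the sketch below fits an ordinary least-squares line with a single predictor and computes the share of variance explained. This is a simplification: the models in the original analysis used several variables in R, whereas this Python example uses one made-up predictor, and `fit_line` and `r_squared` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up budgets and revenues in $ millions, for illustration only.
budget = [10, 20, 30, 40, 50]
revenue = [12, 45, 40, 110, 95]

slope, intercept = fit_line(budget, revenue)
r2 = r_squared(budget, revenue, slope, intercept)
print(f"revenue = {intercept:.1f} + {slope:.2f} * budget, R-squared = {r2:.2f}")
```

With a single predictor, R-squared equals the square of the Pearson correlation, so a correlation of about 0.7 alone would explain roughly half of the variance.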

Qualitative Impact Factors

With 61% being the highest linear regression result, consideration must be given to which qualitative factors may explain the remaining 40-50% or so of the revenue and budget numbers. Several factors should be investigated in deeper studies. Here are some of the candidates:

- Time of year the movie is released?

- Competition at time of release?

- Script Quality?

- Target Audience?

- Genre?

- Production Company?

- Inflation / Economic Cycles?

- Director?

- Other Human Behavior Factors?

- Marvel Magic? Disney Magic?

Budget Distribution Analysis

With budget at the root of the hypothesis, a deeper look at budget segmentation is the next area of analysis. Below is the Budget distribution with the mean line shown in red, along with the Budget summary data.


By breaking the budget numbers into quartiles and looking at budget versus profit (Pure_Gain), trends can be uncovered at the low-end, mid-range, or high-end segments. Below is the assumed “Movie Success Scale” and two views of the quartile distribution (percentage and count) with the stated scale applied.

By applying this five-point scale we can glean a few things from the above three charts.

1. The higher the budget, the slightly higher the chance of profitability.

2. The picture is not as clear cut in the Loss mid-range segment of the scale.

3. The Bombed and Smash Hit segments may be skewed by the way the scale has been set up, since it would be harder for larger budget films to reach the x5 level.

4. Only 72% of the movies analyzed turn a profit.

5. The results may be influenced by how movies secure funding (extra commitment to fund a $50MM+ film?).
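
The quartile breakdown can be sketched as follows. The chart defining the “Movie Success Scale” is not reproduced in this transcript, so the cut-offs below are hypothetical: only the Bombed, Loss and Smash Hit labels and the x5 level come from the text, while the “Profit” and “Hit” labels, the other thresholds, the choice of Return_Ratio as the scale's input, and all film figures are assumptions made for illustration. The original analysis was done in R; this is Python.

```python
import bisect
import statistics
from collections import Counter

# Hypothetical upper bounds on Return_Ratio (Revenue / Budget).
# Only the x5 "Smash Hit" level is mentioned in the text.
SCALE = [(0.5, "Bombed"), (1.0, "Loss"), (2.0, "Profit"), (5.0, "Hit")]


def success_label(return_ratio):
    """Return the first scale label whose upper bound exceeds the ratio."""
    for upper, label in SCALE:
        if return_ratio < upper:
            return label
    return "Smash Hit"


def budget_quartile(budget, cuts):
    """Return 1-4 depending on where the budget falls among the cut points."""
    return bisect.bisect_left(cuts, budget) + 1


# Made-up (budget, revenue) pairs in $ millions.
films = [(1, 8), (3, 2), (8, 12), (15, 40), (30, 20), (60, 150), (120, 90), (200, 700)]
budgets = [b for b, _ in films]
cuts = statistics.quantiles(budgets, n=4)  # three quartile cut points

# Count films per (budget quartile, success label) cell.
table = Counter()
for budget, revenue in films:
    table[(budget_quartile(budget, cuts), success_label(revenue / budget))] += 1

profitable_share = sum(1 for b, r in films if r > b) / len(films)

for (quartile, label), count in sorted(table.items()):
    print(f"Q{quartile} {label}: {count}")
print(f"Share turning a profit: {profitable_share:.1%}")
```

Because the scale here is a multiple of budget, a $200MM film needs $1B in revenue to reach the top label, which is the skew noted in point 3.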

Results

There was a moderate to strong correlation of the key success metrics with budget (0.5-0.7 range). The p-values show strong significance (***) in both the budget and revenue linear regression models. But the R-squared values were a little soft at 0.51 and 0.61 for budget and revenue, respectively, compared to values seen in other statistical analysis studies. So, is this a bad thing? Not necessarily.

After some online investigation, several sources noted that when dealing with “Human Behavior”, the R-squared values are typically lower than 50%. The excerpt below from a blog called “Adventures in statistics” summarized it best. (Link to blog provided below the excerpt.)


http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

Impactful decisions like “Which movie to see?” definitely qualify as “human behavior” type variables. So, 51%-61% is not looking so bad now; it is actually very good!

Conclusion

In conclusion, there are correlations between budget spend and movie success, but they are clouded and somewhat capped quantitatively by human behavior factors. We can account for 61% of the variance in Revenue and 51% of the variance in Budget with our quantitative “success” variables.