Business Analytics with SAS EM on IMDB Data Set - Group 7 - Final Presentation
TRANSCRIPT
Business Analytics using SAS Enterprise Miner for
Group 7: Rahul Prasad, Kenny McDowell, Deepika Gadhella Thulasiram, Rushit Sanjay Shah, Vatsal Ajmera
Agenda:
Background and Motivation
Data Description
Data Pre-processing
Modeling in SAS EM
Classifier Evaluation
Conclusion – Business Implications
Questions?
Background and Motivation
• Movies: the most popular source of entertainment
• Global box office revenue statistics:
• Audience reliance on social media and movie rating websites
• Our focus:
  i) The movie’s rating (the most important movie attribute)
  ii) The gross money that a movie makes (interesting trends related to the commercial and financial success of movies)
• Built models to predict a movie’s IMDB rating before its release, based on different predictors
• The conclusions and insights will help people in the movie business produce highly rated movies and make choosing a movie easier for the average movie watcher
Data Description
• Our search for data from the IMDB website led us to a second-hand data set from kaggle.com
• Motivation to choose the Data set – Rich Data content
• File Format made it easier to interpret and edit data
• An Interesting fact – “human faces in primary poster”
• Data contained characteristics for 5043 movies spanning 100 years and 66 countries
• 2399 unique director names with 1000+ actors/actresses
• The data received had some missing values, which were removed in the data pre-processing stage
• The data set contained attributes of nominal, interval and text data types
Attribute list:
Data Pre-processing
• Originally, our data set contained information on 5043 movies i.e. 5043 rows and 28 predictors
• Cleaned the data set by eliminating records with missing values in attributes
  o Missing values: Gross, Budget, Aspect ratio and Content rating
  o Elimination of unimportant predictors: Color and IMDB_link
• After removing all records with missing values, we had a data set of 3,754 records to work on
• Missing values affected 1,288 rows, about 25.5% of the data set; the remaining data set was still rich
Reasons for not using the Impute or Replacement node of SAS EM:
• The data set was still rich, with about 3754 records after elimination
• The attributes with missing values, such as Gross, Budget, Content Rating and Fb likes, are factual in nature; imputing these values may disrupt the underlying natural logic used for prediction, leading to inaccurate predictions
• It was not feasible to research the missing values and fix each missing field manually
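The cleanup described above was done in SAS EM. As a rough illustration only, the same idea (drop the unimportant columns, then drop every record with a missing value in a required attribute) can be sketched in plain Python. The column names, the helper `clean` and the three toy rows are assumptions made for this sketch; they are not taken from the actual Kaggle file.

```python
# Hypothetical column names; the real file may spell them differently.
REQUIRED = ("gross", "budget", "aspect_ratio", "content_rating")
DROPPED = ("color", "imdb_link")


def clean(rows):
    """Drop unimportant columns and skip rows missing any required value."""
    cleaned = []
    for row in rows:
        # A value counts as missing if it is None or an empty string.
        if any(row.get(col) in (None, "") for col in REQUIRED):
            continue  # listwise deletion: the whole record is removed
        cleaned.append({k: v for k, v in row.items() if k not in DROPPED})
    return cleaned


# Toy records, made up purely for illustration.
rows = [
    {"title": "A", "color": "Color", "imdb_link": "x", "gross": 100.0,
     "budget": 50.0, "aspect_ratio": 1.85, "content_rating": "PG"},
    {"title": "B", "color": "Color", "imdb_link": "y", "gross": None,
     "budget": 20.0, "aspect_ratio": 2.35, "content_rating": "R"},
    {"title": "C", "color": "Color", "imdb_link": "z", "gross": 30.0,
     "budget": 10.0, "aspect_ratio": 1.85, "content_rating": ""},
]

cleaned = clean(rows)
removed = len(rows) - len(cleaned)
print(f"kept {len(cleaned)} rows, removed {removed}")
```

Removing whole records keeps only observed, factual values in the modeling data, at the cost of a smaller data set.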
Modeling in SAS EM
1. Objective
2. Analysis of Predictors
3. Predictor Transformations
4. Train and Valid Data sets
5. Predictive Models
6. Model Evaluation
Classifier Evaluation
• Variant of hold-out, based on partitioning the data in the ratio of 1:1 between training and validation data
• Model evaluation criterion depends on the target data type
• Model evaluation based on Average Squared Error for the validation data set
• Neural Network emerged as the champion
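The evaluation itself was run inside SAS EM. The following is a minimal Python sketch of the same procedure, under stated assumptions: a 1:1 hold-out split, Average Squared Error computed on the validation half, and the model with the lowest validation error chosen as champion. The function names, the toy ratings and the two stand-in "models" (a training-mean baseline and a constant guess) are hypothetical and have nothing to do with the group's actual models or results.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and split it 1:1 into train and validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def average_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pick_champion(validation_ase):
    """Return the name of the model with the lowest validation ASE."""
    return min(validation_ase, key=validation_ase.get)


# Toy ratings, made up for illustration only.
ratings = [6.1, 7.4, 5.0, 8.2, 6.8, 7.0, 4.9, 7.7]
train, valid = split_half(ratings, seed=7)

# Two stand-in predictors: the training mean and a constant guess.
train_mean = sum(train) / len(train)
candidates = {
    "mean_baseline": [train_mean] * len(valid),
    "constant_five": [5.0] * len(valid),
}
validation_ase = {
    name: average_squared_error(valid, preds)
    for name, preds in candidates.items()
}
champion = pick_champion(validation_ase)
print(validation_ase, "champion:", champion)
```

Comparing models on the validation half rather than the training half guards against picking a model that has simply memorized its training data.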
Conclusion – Business Implications
• Our analysis led to very interesting insights into the critical and commercial success of a movie
• The success of a movie is correlated with social media presence, user reviews, actor/director popularity, movie duration and budget
• Social Media Presence: crucial marketing strategy
  o Data showed strong correlations between social media brand value, such as on Facebook, and high movie ratings
  o Social media presence on websites like Facebook, YouTube, and Instagram will result in higher movie ratings and thus better business at the box office
• User Reviews: these influence a prospective viewer’s decision in the entertainment or “experience” industry
  o People rely heavily on user ratings when deciding whether to see a new or old movie
• Interesting insights:
  o Actor/Director Popularity – negatively correlated with the movie rating
  o Duration – the longer the movie, the better the ratings
  o Budget – weak correlation with movie rating but strongly related to commercial success
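The slides do not say which correlation measure was used. Assuming the Pearson coefficient, the strength and sign of a relationship such as duration versus rating could be checked with a short Python sketch like the one below. The function `pearson` and the toy numbers are made up for illustration and are not values from the data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy values, invented for this sketch: minutes and ratings of four movies.
duration = [90, 100, 120, 150]
rating = [5.5, 6.0, 6.8, 7.9]

r_duration = pearson(duration, rating)
print(f"duration vs rating: r = {r_duration:.3f}")
```

A coefficient near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one; correlation alone does not show that one attribute causes the other.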
Questions?