WEB ANALYTICS - ONLINE NEWS POPULARITY
TEAM – 11
KRUTIKA DEDHIA
KINJAL GADA
ANKUR VORA
ADVANCES IN DATA SCIENCES AND ARCHITECTURE
- PROF. SRIKANTH KRISHNAMURTHY
INTRODUCTION
• The dataset summarizes a set of features about articles published by Mashable, a well-known news website, over a period of two years.
• The objective is to predict the number of shares from these features, and from that whether an article about to be published will be popular on the internet.
GOALS
• Create and evaluate regression, classification and clustering models in Microsoft Azure Machine Learning Studio.
• Deploy the models as a web service to generate a REST API.
• Build an interactive web interface to display the predictions.
DATASET
• Data Source : UCI ML Repository
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
• Number of attributes: 61
• Number of records: 39,645
• Dependent variable: Number of shares
DATA MODIFICATION
• Type of Data : 1 – business, 2 – lifestyle, 3 – entertainment, 4 – social media, 5 – technology, 6 – world
• Extracted the date from the URL column.
• Day of week : 0 – Sunday, 1 – Monday, 2 – Tuesday, 3 – Wednesday, 4 – Thursday, 5 – Friday, 6 – Saturday
• Web Scraping : Topics, Channel, Author
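The date extraction and day-of-week encoding above can be sketched as follows. This is a minimal illustration, assuming the article URLs carry the publication date as /YYYY/MM/DD/ in the path; the function names and the sample URL are hypothetical, not taken from the project code.

```python
import re
from datetime import date

# Assumption: URLs look like http://mashable.com/YYYY/MM/DD/article-slug/
_DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def publication_date(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = _DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def day_of_week_code(d):
    """Map a date to the slide's encoding: 0 = Sunday ... 6 = Saturday."""
    # date.weekday() uses 0 = Monday, so shift by one and wrap around.
    return (d.weekday() + 1) % 7


if __name__ == "__main__":
    sample = "http://mashable.com/2013/01/07/some-article/"  # hypothetical URL
    published = publication_date(sample)
    print(published, day_of_week_code(published))
```

Python's own weekday numbering starts on Monday, so the shift by one is what aligns it with the Sunday-first codes listed above.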
PROCESS
• Created training models for regression, classification and clustering in Azure ML.
• Created predictive experiments for the trained models above.
• Deployed the models as a web service and generated a REST API.
• Designed the UI using Java Spring MVC, HTML, Bootstrap and Ajax, with user input validation.
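A client of the deployed web service might build its scoring request roughly as below. This is only a sketch: the payload layout (Inputs / input1 / ColumnNames / Values / GlobalParameters) follows the pattern commonly used by Azure ML Studio request-response services and is assumed here, and the endpoint URL, API key and column name are placeholders. The project's own UI is written in Java; Python is used purely for illustration.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a POST request for a scoring endpoint."""
    body = {
        "Inputs": {
            "input1": {  # assumed default input name
                "ColumnNames": list(column_names),
                # Values are sent as strings, one inner list per article.
                "Values": [[str(value) for value in row] for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Placeholder endpoint and key; nothing is sent over the network here.
    request = build_scoring_request(
        "https://example.invalid/score", "API_KEY", ["n_tokens_title"], [[12]]
    )
    print(request.get_method(), request.data.decode("utf-8"))
```

Sending the request would be a call to urllib.request.urlopen(request); it is left out so the sketch runs without a live endpoint.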
MACHINE LEARNING ALGORITHMS
REGRESSION MODELS
• Used Azure ML regression modules:
• Decision Forest, Neural Network, Poisson Regression and Boosted Decision Tree
• Best Model: Decision Forest (random forest), based on the lowest RMSE value
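Selecting a regression model by lowest RMSE can be illustrated with a few lines of code. This is a minimal sketch, not the Azure ML evaluation module itself; the model names and numbers in the usage part are made up for illustration and are not the project's results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))


def best_by_rmse(actual, predictions_by_model):
    """Return (name of the model with the lowest RMSE, all RMSE scores)."""
    scores = {
        name: rmse(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical share counts and predictions, for illustration only.
    shares = [1200, 800, 3100]
    candidates = {
        "model_a": [1100, 900, 2900],
        "model_b": [2000, 2000, 2000],
    }
    print(best_by_rmse(shares, candidates))
```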
RANDOM FOREST
CLASSIFICATION MODELS
• Used Azure ML classification components: Two Class Decision Forest, Two Class Neural Network and Two Class Boosted Decision Tree
• Added attribute isPopular :
• Shares > 1400 : highly popular
• Shares <= 1400 : less popular
• Best Model : Two Class Boosted Decision Tree, based on the highest Accuracy and AUC values
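The isPopular labelling and the two metrics used to compare the classifiers can be sketched as follows. The sketch assumes that articles with more than 1,400 shares form the popular class; the helper names are hypothetical, and the AUC here is the plain pairwise-ranking definition rather than whatever Azure ML computes internally.

```python
THRESHOLD = 1400  # share count separating the two classes


def is_popular(shares, threshold=THRESHOLD):
    """Label an article as popular when it exceeds the share threshold."""
    return shares > threshold


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for l, p in zip(labels, predictions) if l == p)
    return matches / len(labels)


def auc(labels, scores):
    """Chance that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical share counts and classifier scores, for illustration only.
    labels = [is_popular(s) for s in (900, 1500, 5000, 1400)]
    scores = [0.2, 0.6, 0.9, 0.4]
    predicted = [score >= 0.5 for score in scores]
    print(accuracy(labels, predicted), auc(labels, scores))
```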
TWO CLASS BOOSTED DECISION TREE CLASSIFICATION
CLUSTERING MODELS
• Used K-means Clustering
• Number of clusters: 3 (k = 3).
• Computes the distance of each article from the cluster centroids, based on a few parameters.
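The idea behind K-means with k = 3 can be shown in a short, dependency-free sketch. It assumes Euclidean distance and caller-supplied starting centroids; the Azure ML module's initialisation and distance options may differ, and the two-feature points in the usage part are invented for illustration.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index of the closest centroid, Euclidean distance to it)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign-then-update rounds of K-means."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            index, _distance = nearest_centroid(point, centroids)
            clusters[index].append(point)
        # Move each centroid to the mean of its members; keep empty ones put.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members
            else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


if __name__ == "__main__":
    # Hypothetical two-feature articles and k = 3 starting centroids.
    articles = [(0, 0), (0, 2), (10, 10), (10, 12), (20, 0), (22, 0)]
    final = kmeans(articles, [(0, 0), (10, 10), (20, 0)])
    print(final, nearest_centroid((1, 1), final))
```

The distance returned by nearest_centroid is the quantity the last bullet refers to: how far an article sits from the centre of the cluster it is assigned to.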
DEMO
• Web User Interface
ANALYSIS
TABLEAU ANALYSIS
CHALLENGES
• Formatting the data after web scraping.
• Understanding variables such as keywords and subjectivity.
• Finding relationships between variables and selecting features for modelling.
LINKS
• URL – http://sample-env-1.xhmp4ynr7g.us-east-1.elasticbeanstalk.com/
• Github – https://github.com/voraankur/ADS/tree/master/Final%20Project
REFERENCES
• https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
• https://repositorium.sdum.uminho.pt/bitstream/1822/39169/1/main.pdf
CONTRIBUTION
• Ankur – Regression Models, Web Interface
• Kinjal – Data cleaning, Web Scraping, Clustering, Report
• Krutika – Classification Models, Presentation, Tableau Analysis
THANK YOU