online news popularity analysis

WEB ANALYTICS - ONLINE NEWS POPULARITY

TEAM – 11

KRUTIKA DEDHIA

KINJAL GADA

ANKUR VORA

ADVANCES IN DATA SCIENCES AND ARCHITECTURE

- PROF. SRIKANTH KRISHNAMURTHY

INTRODUCTION

• The dataset summarizes a set of features about articles published by Mashable, a well-known news website, over a period of two years.

• The objective is to predict the number of shares from these features and, from that, whether an article about to be published will be popular online.

GOALS

• Create and evaluate regression, classification and clustering models in Microsoft Azure Machine Learning Studio.

• Deploy the models as a web service to generate a REST API.

• Build an interactive web interface to predict the results.

DATASET

• Data Source : UCI ML Repository

https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

• Number of attributes: 61

• Number of records: 39,645

• Dependent variable: Number of shares

DATA MODIFICATION

• Type of Data : 1 – business, 2 – lifestyle, 3 – entertainment, 4 – social media, 5 – technology, 6 – world

• Extracted the date from the URL column.

• Day of week : 0 – Sunday, 1 – Monday, 2 – Tuesday, 3 – Wednesday, 4 – Thursday, 5 – Friday, 6 – Saturday

• Web Scraping : Topics, Channel, Author
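The date extraction and the numeric codes above can be sketched as follows. This is a minimal illustration, not the team's actual cleaning script: it assumes that the URLs carry the publication date as /YYYY/MM/DD/ and that the channel is stored in the dataset's one-hot data_channel_is_* columns. The helper names and the fallback code 0 are hypothetical.

```python
import re
from datetime import date

# Assumption: the publication date appears in the URL path as /YYYY/MM/DD/.
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def date_from_url(url: str) -> date:
    """Pull the publication date out of an article URL."""
    match = URL_DATE.search(url)
    if match is None:
        raise ValueError(f"no date found in URL: {url}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)

def day_of_week_code(d: date) -> int:
    """Map a date to the slide's coding: 0 = Sunday ... 6 = Saturday."""
    # date.weekday() is Monday=0 ... Sunday=6, so shift by one and wrap.
    return (d.weekday() + 1) % 7

# Assumed one-hot column names, mapped to the slide's type codes.
CHANNEL_CODES = {
    "data_channel_is_bus": 1,
    "data_channel_is_lifestyle": 2,
    "data_channel_is_entertainment": 3,
    "data_channel_is_socmed": 4,
    "data_channel_is_tech": 5,
    "data_channel_is_world": 6,
}

def channel_code(row: dict) -> int:
    """Collapse the one-hot channel flags of one record into a single code."""
    for column, code in CHANNEL_CODES.items():
        if row.get(column) == 1:
            return code
    return 0  # hypothetical fallback: no channel flag set

published = date_from_url("http://mashable.com/2013/01/07/example-article/")
print(published, day_of_week_code(published))
```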

PROCESS

• Created training models for regression, classification and clustering in Azure ML.

• Created predictive experiments for the trained models above.

• Deployed the models as a web service and generated a REST API.

• Designed UI using Java Spring MVC, HTML, Bootstrap, Ajax along with user validations.
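A client call to a deployed model might look roughly like the sketch below. It follows the request-response JSON layout used by Azure ML Studio web services, but the endpoint URL, API key, the input name "input1" and the column names are placeholders: the real values come from the service's own API help page and are not given in these slides.

```python
import json
import urllib.request

def build_request(url, api_key, column_names, values):
    """Build (but do not send) a scoring request for one record."""
    payload = {
        "Inputs": {
            # "input1" is an assumed input name; it depends on the experiment.
            "input1": {"ColumnNames": column_names, "Values": [values]}
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    # Supplying data makes urllib issue a POST request.
    return urllib.request.Request(url, data=body, headers=headers)

def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Placeholder endpoint, key and features, for illustration only.
request = build_request(
    "https://example.invalid/execute?api-version=2.0",
    "YOUR_API_KEY",
    ["n_tokens_title", "num_imgs"],
    ["12", "3"],
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from the network call lets the web interface validate user input and inspect the payload before anything is sent.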

MACHINE LEARNING ALGORITHMS

REGRESSION MODELS

• Used Azure ML regression modules

• Decision Forest, Neural Network, Poisson Regression and Boosted Decision Tree

• Best Model: Decision Forest (random forest), based on the lowest RMSE value
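For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual share counts, so lower is better. Azure ML reports it directly in the Evaluate Model output; the sketch below only shows the arithmetic, and the share counts are made-up numbers.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only, not results from the project.
actual_shares = [1200, 3400, 800]
predicted_shares = [1000, 3000, 1100]
print(rmse(actual_shares, predicted_shares))
```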

RANDOM FOREST

CLASSIFICATION MODELS

• Used Azure ML classification components: Two Class Decision Forest, Two Class Neural Network and Two Class Boosted Decision Tree

• Added attribute isPopular :

• Shares <= 1400 : less popular

• Shares > 1400 : highly popular

• Best Model : Two Class Boosted Decision Tree, based on its high accuracy and AUC value
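The labelling rule and the two selection metrics can be sketched as below. The sketch assumes that articles with more than 1400 shares form the popular class (label 1). The function names are hypothetical and the scores are invented for illustration; Azure ML computes accuracy and AUC itself in the Evaluate Model module.

```python
THRESHOLD = 1400  # share cutoff from the slide

def is_popular(shares, threshold=THRESHOLD):
    """Return 1 for the popular class, 0 otherwise (assumed direction)."""
    return 1 if shares > threshold else 0

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

shares = [500, 1400, 1401, 9000]            # made-up share counts
labels = [is_popular(s) for s in shares]
scores = [0.1, 0.6, 0.4, 0.9]               # made-up model scores
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(labels, accuracy(labels, predictions), auc(labels, scores))
```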

TWO CLASS BOOSTED DECISION TREE CLASSIFICATION

CLUSTERING MODELS

• Used K-means Clustering

• Number of clusters: 3 (k = 3).

• Measures each article's distance from the cluster centroids, based on a few parameters.
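The idea behind k-means with k = 3 can be sketched as plain Lloyd iterations: assign each article to its nearest centroid by Euclidean distance, then move each centroid to the mean of its members. This is not the Azure ML module's internal implementation, and the two-dimensional points and starting centroids below are invented for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: distance(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Invented points standing in for two article features.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
start = [(0, 0), (10, 10), (20, 0)]  # k = 3 starting centroids
centroids, labels = kmeans(points, start)
print(centroids, labels)
```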

DEMO

• Web User Interface

ANALYSIS

TABLEAU ANALYSIS

CHALLENGES

• Formatting data after Web Scraping.

• Understanding variables such as keywords and subjectivity.

• Finding relationships between variables and selecting features for modelling.

LINKS

• URL – http://sample-env-1.xhmp4ynr7g.us-east-1.elasticbeanstalk.com/

• Github – https://github.com/voraankur/ADS/tree/master/Final%20Project

REFERENCES

• https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

• https://repositorium.sdum.uminho.pt/bitstream/1822/39169/1/main.pdf


CONTRIBUTION

• Ankur – Regression Models, Web Interface

• Kinjal – Data cleaning, Web Scraping, Clustering, Report

• Krutika – Classification Models, Presentation, Tableau Analysis

THANK YOU
