Project McNulty


Viral or Bust! Popularity Classification on News and Entertainment Media

The Data:

- 40,000+ articles scraped from mashable.com

- Scraped and pre-processed with attention to the linguistic features of each article

- 56 resulting features to consider

The Data:

Among the 56 features, the main categories are:

- Words

- NLP

- Publication Time

- Digital Media Aspects

Goal: Create a model that will distinguish between popular and unpopular news articles
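The slides do not show how the popular/unpopular label was built. A minimal sketch, assuming the UCI "Online News Popularity" CSV of the Mashable data and a median-shares cutoff (the file name, column names, and threshold are assumptions):

```python
import pandas as pd

# Assumed file and column names from the UCI "Online News Popularity" release
# of the Mashable data; the slides do not show how the label was defined.
df = pd.read_csv("OnlineNewsPopularity.csv")
df.columns = df.columns.str.strip()  # the published CSV has padded column names

# One common convention (an assumption here): an article is "popular" when its
# share count is at or above the median.
df["popular"] = (df["shares"] >= df["shares"].median()).astype(int)

# Keep only the predictive features; url and timedelta are identifiers/metadata.
X = df.drop(columns=["url", "timedelta", "shares", "popular"])
y = df["popular"]
```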

Exploring the Data:

Exploring the Data: Rate of +/- Words

Exploring the Data: +/- Polarity

Exploring the Data: Global Subjectivity

Exploring the Data: Self-reference Links

Exploring the Data: LDA Rank
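As a rough illustration of how these exploratory views could be produced, a minimal plotting sketch; the column names follow the UCI dataset's naming and are assumptions, as are X and y from the sketch above:

```python
import matplotlib.pyplot as plt

# Column names follow the UCI "Online News Popularity" naming (assumptions).
explore_cols = ["rate_positive_words", "global_sentiment_polarity",
                "global_subjectivity", "num_self_hrefs"]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), explore_cols):
    # Class-conditional histograms: popular vs. unpopular articles.
    ax.hist(X.loc[y == 1, col], bins=40, alpha=0.5, density=True, label="popular")
    ax.hist(X.loc[y == 0, col], bins=40, alpha=0.5, density=True, label="unpopular")
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()
```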

Initial Analysis:

Model                 Accuracy   Precision  Recall     F1
kNN                   0.566000   0.594047   0.590866   0.592452
Naive Bayes           0.479654   0.623277   0.064094   0.116236
Random Forest         0.608804   0.640564   0.694331   0.666364
Logistic Regression   0.591984   0.617579   0.668346   0.641960
SVC                   0.533967   0.533928   1.000000   0.697104
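A sketch of how these five baselines could be fit and scored, continuing from the data-loading sketch above; the split, scaling, and default hyperparameters here are assumptions, not the exact settings behind the numbers in the table:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:20s} acc={accuracy_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred):.3f} "
          f"rec={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f}")
```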

Feature Reduction

● Principal Component Analysis to find distribution of variance in the data
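A minimal sketch of that variance check with PCA, assuming the X_train matrix from the sketches above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then fit PCA on the training features to see how variance is spread.
X_scaled = StandardScaler().fit_transform(X_train)
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for k in (10, 20, 30, 40):
    print(f"first {k:2d} components explain {cumulative[k - 1]:.1%} of the variance")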

Feature Reduction

● Eliminated features below a variance threshold of 0.8

● Ran a randomized search and GridSearchCV on the Random Forest to find the ideal parameters (see the sketch below)

● Ran GridSearchCV again with additional specified parameters and graphed the results by feature importance
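A sketch of the threshold-plus-grid-search step; the 0.8 threshold comes from the slides, but the parameter grid below is purely illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV

# Drop low-variance features; the 0.8 threshold comes from the slides.
selector = VarianceThreshold(threshold=0.8)
X_train_sel = selector.fit_transform(X_train)
X_test_sel = selector.transform(X_test)

# Illustrative grid only; the slides do not list the exact parameter values searched.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)
```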

Most Important Features

Rank  Feature
1     Average Keyword Score
2     Data Channel is Entertainment
3     Closeness to LDA topic 2
4     Average Token Length
5     Published on Weekend
6     Closeness to LDA topic 4
7     Data Channel is Technology
8     Max Keyword Score
9     Data Channel is World
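Continuing from the grid-search sketch, a ranking like this could be read off the tuned forest's impurity-based importances:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rank the surviving features by the tuned forest's impurity-based importances.
kept = X_train.columns[selector.get_support()]
importances = pd.Series(search.best_estimator_.feature_importances_,
                        index=kept).sort_values(ascending=False)
print(importances.head(10))

importances.head(10).plot(kind="barh", title="Top Random Forest feature importances")
plt.tight_layout()
plt.show()
```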

Final Results:

Model                 Accuracy   Precision  Recall     F1
kNN                   0.562236   0.581848   0.591262   0.566142
Naive Bayes           0.523288   0.660920   0.140122   0.231222
Random Forest         0.662240   0.662520   0.695117   0.668421
Logistic Regression   0.614035   0.638523   0.566057   0.600111
SVC                   0.531645   0.532263   1.000000   0.697104

Final Results and Findings:

Small but consistent gain in accuracy, likely because:

● The data was already well processed

● Correlation between the features is minimal (see the sketch below)
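The minimal-correlation point can be checked directly; a small sketch using the feature matrix X from the first sketch, with an illustrative 0.8 cutoff:

```python
import numpy as np

# Pairwise feature correlations on the full feature matrix from the first sketch.
corr = X.corr().abs()
mask = ~np.eye(len(corr), dtype=bool)  # ignore the diagonal
high_pairs = int((corr.where(mask) > 0.8).sum().sum()) // 2
print(f"{high_pairs} feature pairs with |correlation| > 0.8")  # 0.8 cutoff is illustrative
```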

Conclusion and Next Steps

● In spite of the difficulty of separating the data, the selected model performed fairly well

● In the future, we would like to rely less on sentiment analysis and focus more on word-vector correlations
