
Final Project: Analyzing Reddit Data to Determine Popularity

2

Project Background: The Problem

Problem: Predict post popularity where the target/label is based on a transformed score metric

Algorithms / Models Applied:
• SVC
• Random Forests
• Logistic Regression

3

Project Background: The Data

Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and was broken out into 2,500 .csv files, one per subreddit.

Data Structure (22 Columns):
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text
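Not from the original deck: a minimal sketch of how the 2,500 per-subreddit .csv files might be combined with pandas. The data/ directory, the glob pattern, and the added subreddit column are assumptions for illustration.

import glob
import os

import pandas as pd

# One .csv file per subreddit, all sharing the 22-column layout above.
csv_paths = glob.glob("data/*.csv")

frames = []
for path in csv_paths:
    df = pd.read_csv(path)
    # Keep track of which subreddit each row came from (the file name).
    df["subreddit"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

posts = pd.concat(frames, ignore_index=True)
print(posts.shape)  # roughly (2_500_000, 23) if all files load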

4

Project Background: The Data - Removed

(Same dataset description and 22-column list as the previous slide, repeated here with the removed columns indicated on the original slide.)

5

Reviewing the Data: Subreddit Topics

AnimalsWithoutNecks, BirdsBeingDicks, CemeteryPorn, CoffeeWithJesus, datasets, dataisbeautiful, FortPorn, learnpython, MachineLearning, misleadingthumbnails, Otters, PenmanshipPorn, PowerWashingPorn, ShowerBeer, StonerPhilosophy, talesfromtechsupport, TreesSuckingAtThings

6

Reviewing the Data: Top Domains

[Bar chart: post count by domain for the top 15 domains: imgur.com, youtube.com, reddit.com, flickr.com, soundcloud.com, quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk]

Top five by post count: Imgur 773,969; YouTube 188,526; Reddit 25,445; Flickr 17,854; SoundCloud 10,397

7

Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title plus some related media content from the aforementioned domains: a link, gif, image, video, etc.

Over 1.6 million posts, approximately 74% of all posts, had no body copy/text (a NaN value in self_text).
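A quick check of that figure, assuming the combined posts DataFrame from the loading sketch earlier:

# Share of posts with no body copy (NaN in self_text).
no_body = posts["self_text"].isna()
print(no_body.sum())   # ~1.6 million posts
print(no_body.mean())  # ~0.74 of all posts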

8

Reviewing the Data: Time Based Data

0"

50000"

100000"

150000"

200000"

250000"

300000"

January"

February"

March"

April"

May"

June"

July"

August"

September"

October"

November"

December"

Winter Months Saw a Dip, Fall Could be Underrepresented Given Data Pulled in August
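Not from the original deck: a sketch of how the month/day/hour views might be derived, assuming the posts DataFrame from earlier and that created_utc is a Unix timestamp in seconds (UTC).

import pandas as pd

# Convert the raw float timestamp into a timezone-aware datetime.
posts["created"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)

posts_per_month = posts["created"].dt.month_name().value_counts()
posts_per_weekday = posts["created"].dt.day_name().value_counts()
posts_per_hour = posts["created"].dt.hour.value_counts().sort_index()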

9

Reviewing the Data: Time Based Data

Tuesday is Slightly the Favorite Day to Post, While the Weekend Sees a Dip

0"

50000"

100000"

150000"

200000"

250000"

300000"

350000"

400000"

Monday"

Tuesday"

Wednesday"

Thursday"

Friday"

Saturday"

Sunday"

10

Reviewing the Data: Time Based Data

Reddit While You Work: Post Volume Picks Up Around 9/10am, Peaking at 12pm, Then Dropping Off Throughout the Afternoon

0"

20000"

40000"

60000"

80000"

100000"

120000"

140000"

160000"

12am" 1am" 2am" 3am" 4am" 5am" 6am" 7am" 8am" 9am" 10am" 11am" 12pm" 1pm" 2pm" 3pm" 4pm" 5pm" 6pm" 7pm" 8pm" 9pm" 10pm" 11pm"

0"20000"40000"60000"80000"

100000"120000"140000"160000"180000"200000"

50)99"

100)199"

200)299"

300)399"

400)499"

500)999"

1000)4999"

5000)9999"

10000+"

Score&Counts&

11

Reviewing the Data: Determining Popularity

~15% of posts meet the popularity cutoff

Note: this is based on only about half the data; IPython was unable to render the histogram, so the data was exported and the histogram was built in Excel.
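The deck defines popularity from a transformed score metric but does not show the transform. A hypothetical version of the labeling step, assuming a simple quantile cutoff on score (the 0.85 quantile is an illustrative choice, not the deck's):

# Label roughly the top 15% of posts by score as "popular" (hypothetical cutoff).
threshold = posts["score"].quantile(0.85)
posts["popular"] = (posts["score"] >= threshold).astype(int)
print(posts["popular"].mean())  # ~0.15 by construction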

12

Analyzing the Data: Issues

Issue: The initial data set was large (2.5 million rows), and transforming the text (CountVectorizer and TF-IDF) expanded it to almost 100,000 columns, which caused problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms.
• Even with this smaller subset, runs could take anywhere from 30 minutes to several hours, making it extremely hard to play around with the data
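A minimal sketch of the vectorization step described above, assuming the titles were the text being transformed with scikit-learn's TfidfVectorizer (the deck names CountVectorizer and TF-IDF but does not show parameters):

from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each post title into a sparse TF-IDF vector; one column per vocabulary
# term, which is where the ~100,000 columns come from.
vectorizer = TfidfVectorizer(stop_words="english")
X_text = vectorizer.fit_transform(posts["title"].fillna(""))
print(X_text.shape)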

Future: Explore platforms that handle large data sets better, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.

13

Analyzing the Data: SVC

[Bar chart (Accuracy): accuracy by kernel (Linear, Poly, Sigmoid, RBF); y-axis 0.88 to 0.94]

[Line chart (Accuracy w/ Linear Kernel): accuracy by C value (0.001, 0.01, 0.1); y-axis 0.92 to 0.938]

Linear kernel = 0.9368; C value of 0.1 = 0.9363

from sklearn import svm
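Not the deck's actual code: a sketch of the kernel and C comparison shown above, assuming the TF-IDF matrix X_text and the popular label from the earlier sketches; the train/test split details are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_text, posts["popular"], test_size=0.3, random_state=42
)

# Compare kernels at the default C.
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    acc = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
    print(kernel, round(acc, 4))

# Tune C for the linear kernel.
for C in [0.001, 0.01, 0.1]:
    acc = SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_test, y_test)
    print("C =", C, round(acc, 4))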

14

Analyzing the Data: Random Forests

from sklearn import ensemble

[Line chart: accuracy by n_estimators (5, 10, 20, 50, 100, 125, 150); y-axis 0.885 to 0.925]

[Line chart: accuracy by max_depth (5, 40, 100, 150, 200, 250, 300); y-axis 0.885 to 0.93]

N estimators of 125 = 0.922; max depth of 250 = 0.924
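Not the deck's actual code: a sketch of the random forest tuning shown above, assuming the same train/test split as the SVC sketch; the value grids mirror the charts.

from sklearn.ensemble import RandomForestClassifier

# Sweep the number of trees.
for n in [5, 10, 20, 50, 100, 125, 150]:
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    print("n_estimators =", n, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))

# Sweep the maximum tree depth at the best n_estimators.
for depth in [5, 40, 100, 150, 200, 250, 300]:
    clf = RandomForestClassifier(n_estimators=125, max_depth=depth, random_state=42)
    print("max_depth =", depth, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))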

15

Analyzing the Data: Logistic

0.925&

0.93&

0.935&

0.94&

0.945&

0.95&

0.001& 0.01& 0.1& 1& 10& 50&

C Value of 1 = .9471 L1 = .947733 L2 = .947066

0.9469&

0.94695&

0.947&

0.94705&

0.9471&

0.94715&

0.9472&

0.94725&

0.9473&

0.94735&

0.9474&

L1& L2&

from sklearn import linear_model
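Not the deck's actual code: a sketch of the logistic regression tuning shown above, assuming the same train/test split as earlier; the liblinear solver is chosen because it supports both L1 and L2 penalties on sparse input.

from sklearn.linear_model import LogisticRegression

# Sweep the regularization strength.
for C in [0.001, 0.01, 0.1, 1, 10, 50]:
    clf = LogisticRegression(C=C, solver="liblinear")
    print("C =", C, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))

# Compare L1 vs. L2 penalties at the best C.
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(C=1, penalty=penalty, solver="liblinear")
    print(penalty, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))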

16

Totally Crushing It!

17

Analyzing the Data: Classification Report

[Classification report tables shown for: Random Forests, Logistic Regression, SVC]
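Not from the original deck: a sketch of how those reports might be produced, reusing the best settings reported on the earlier slides and the same train/test split (the model assembly here is for illustration).

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.svm import SVC

models = {
    "Random Forests": RandomForestClassifier(n_estimators=125, max_depth=250, random_state=42),
    "Logistic Regression": LogisticRegression(C=1, solver="liblinear"),
    "SVC": SVC(kernel="linear", C=0.1),
}

# Precision / recall / F1 per class for each model.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))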

18

Soooo Not Crushing It

19

Feature Reduction: Accuracy

[Bar chart: accuracy for each model with all features vs. reduced features]

All features: Logistic Regression 94.71%, SVC 93.63%, Random Forests 92.4% (matching the earlier per-model slides)

Reduced features: 94.3%, 94.5%, 95.2%

20

Feature Reduction: Classification Report

[Classification report tables shown for each model with all features and with reduced features: Random Forests, Logistic Regression, SVC]

21

Next Steps

Dealing with the processing issues:
• Learn and try out PySpark

Answer some additional questions:
• Reevaluate how I handle the domains
  • I originally bucketed domains by their frequency/occurrence in the data set. However, given that the originating domain of the content and the title make up the majority of the "post", and the top 15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
• Run the data with varying n-gram levels (see the sketch after this list)
  • I tried them, but they expanded the columns to hundreds of thousands, which just seemed to freeze, so hopefully something like PySpark will help with the processing.
• Predict subreddit/category questions:
  • Can I predict the category of a post?
  • Do certain subreddits produce more overall popular content than others? Bears With Beaks vs. ggggg (whatever the hell that is)
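A sketch of the n-gram experiment mentioned above, assuming the posts DataFrame from the earlier sketches; the ranges are illustrative. Including bigrams and trigrams multiplies the vocabulary, which is what pushed the column count into the hundreds of thousands.

from sklearn.feature_extraction.text import CountVectorizer

# Column count grows quickly as longer n-grams are added.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    X_ngrams = vec.fit_transform(posts["title"].fillna(""))
    print(ngram_range, X_ngrams.shape[1], "columns")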

APPENDIX

22

0"20000"40000"60000"80000"

100000"120000"140000"160000"180000"200000"

50)99"

100)199"

200)299"

300)399"

400)499"

500)999"

1000)4999"

5000)9999"

10000+"

Score&Counts&

23

Reviewing the Data: Reevaluate Popularity

~12% of posts

Note: this is based on only about half the data; IPython was unable to render the histogram, so the data was exported and the histogram was built in Excel.

~8% of posts

24

Analyzing the Data: SVC

[Line chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.688 to 0.71]

Accuracy score: C value of 0.1 = 0.7077

[Confusion matrix shown on the original slide]
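Not from the original deck: a sketch of how these confusion matrices might be produced, assuming the train/test split from the earlier sketches (the appendix results use the reevaluated popularity labels, so the popular column would be rebuilt with the new cutoff first).

from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Rows are true classes, columns are predicted classes.
svc = SVC(kernel="linear", C=0.1).fit(X_train, y_train)
print(confusion_matrix(y_test, svc.predict(X_test)))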

25

Analyzing the Data: Random Forest

[Line chart: accuracy by n_estimators (5, 10, 20, 50, 100, 125); y-axis 0.77 to 0.83]

[Line chart: accuracy by max_depth (40, 100, 150, 200, 250); y-axis 0.775 to 0.83]

Accuracy scores: n_estimators of 100 = 0.8218; max_depth of 200 = 0.8247

[Confusion matrix shown on the original slide]

26

Analyzing the Data: Logistic

[Line chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.81 to 0.85]

Accuracy score: C = 1, penalty = L2 gives 0.8453

[Confusion matrix shown on the original slide]
