
Final Project: Analyzing Reddit Data to Determine Popularity

2

Project Background: The Problem

Problem: Predict post popularity where the target/label is based on a transformed score metric

Algorithms / Models Applied:
• SVC
• Random Forests
• Logistic Regression

3

Project Background: The Data

Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and was broken out into 2,500 .csv files, one per subreddit.

Data Structure (22 Columns):
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text
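Not from the original deck: a minimal sketch of how the 2,500 per-subreddit .csv files might be combined with pandas. The data/ directory, the glob pattern, and the added subreddit column are assumptions for illustration.

import glob
import os

import pandas as pd

# One .csv file per subreddit, all sharing the 22-column layout above.
csv_paths = glob.glob("data/*.csv")

frames = []
for path in csv_paths:
    df = pd.read_csv(path)
    # Keep track of which subreddit each row came from (the file name).
    df["subreddit"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

posts = pd.concat(frames, ignore_index=True)
print(posts.shape)  # roughly (2_500_000, 23) if all files load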

4

Project Background: The Data - Removed

(Same dataset description and 22-column list as the previous slide, repeated here with the removed columns indicated on the original slide.)

5

Reviewing the Data: Subreddit Topics

AnimalsWithoutNecks, BirdsBeingDicks, CemeteryPorn, CoffeeWithJesus, datasets, dataisbeautiful, FortPorn, learnpython, MachineLearning, misleadingthumbnails, Otters, PenmanshipPorn, PowerWashingPorn, ShowerBeer, StonerPhilosophy, talesfromtechsupport, TreesSuckingAtThings

6

Reviewing the Data: Top Domains

[Bar chart: post count by domain for the top 15 domains: imgur.com, youtube.com, reddit.com, flickr.com, soundcloud.com, quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk]

Top five by post count: Imgur 773,969; YouTube 188,526; Reddit 25,445; Flickr 17,854; SoundCloud 10,397

7

Reviewing the Data: Most Have No Body Text

Posts rely primarily on the title plus some related media content from the aforementioned domains: a link, gif, image, video, etc.

Over 1.6 million posts, approximately 74% of all posts, had no body copy/text (a NaN value in self_text).
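A quick check of that figure, assuming the combined posts DataFrame from the loading sketch earlier:

# Share of posts with no body copy (NaN in self_text).
no_body = posts["self_text"].isna()
print(no_body.sum())   # ~1.6 million posts
print(no_body.mean())  # ~0.74 of all posts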

8

Reviewing the Data: Time Based Data

0"

50000"

100000"

150000"

200000"

250000"

300000"

January"

February"

March"

April"

May"

June"

July"

August"

September"

October"

November"

December"

Winter Months Saw a Dip, Fall Could be Underrepresented Given Data Pulled in August
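Not from the original deck: a sketch of how the month/day/hour views might be derived, assuming the posts DataFrame from earlier and that created_utc is a Unix timestamp in seconds (UTC).

import pandas as pd

# Convert the raw float timestamp into a timezone-aware datetime.
posts["created"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)

posts_per_month = posts["created"].dt.month_name().value_counts()
posts_per_weekday = posts["created"].dt.day_name().value_counts()
posts_per_hour = posts["created"].dt.hour.value_counts().sort_index()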

9

Reviewing the Data: Time Based Data

Tuesday is Slightly the Favorite Day to Post, While the Weekend Sees a Dip

0"

50000"

100000"

150000"

200000"

250000"

300000"

350000"

400000"

Monday"

Tuesday"

Wednesday"

Thursday"

Friday"

Saturday"

Sunday"

10

Reviewing the Data: Time Based Data

Reddit While You Work: Post Volume Picks Up Around 9/10am, Peaking at 12pm, Then Dropping Off Throughout the Afternoon

0"

20000"

40000"

60000"

80000"

100000"

120000"

140000"

160000"

12am" 1am" 2am" 3am" 4am" 5am" 6am" 7am" 8am" 9am" 10am" 11am" 12pm" 1pm" 2pm" 3pm" 4pm" 5pm" 6pm" 7pm" 8pm" 9pm" 10pm" 11pm"

0"20000"40000"60000"80000"

100000"120000"140000"160000"180000"200000"

50)99"

100)199"

200)299"

300)399"

400)499"

500)999"

1000)4999"

5000)9999"

10000+"

Score&Counts&

11

Reviewing the Data: Determining Popularity

~15% of posts meet the popularity cutoff

Note: this is based on only about half the data; IPython was unable to render the histogram, so the data was exported and the histogram was built in Excel.
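The deck defines popularity from a transformed score metric but does not show the transform. A hypothetical version of the labeling step, assuming a simple quantile cutoff on score (the 0.85 quantile is an illustrative choice, not the deck's):

# Label roughly the top 15% of posts by score as "popular" (hypothetical cutoff).
threshold = posts["score"].quantile(0.85)
posts["popular"] = (posts["score"] >= threshold).astype(int)
print(posts["popular"].mean())  # ~0.15 by construction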

12

Analyzing the Data: Issues

Issue: The initial data set was large (2.5 million rows), and transforming the text (CountVectorizer and TF-IDF) expanded it to almost 100,000 columns, which caused problems processing the data locally on my machine. In the end I was only able to run about 1% of the data through the algorithms.
• Even with this smaller subset, runs could take anywhere from 30 minutes to several hours, making it extremely hard to play around with the data
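A minimal sketch of the vectorization step described above, assuming the titles were the text being transformed with scikit-learn's TfidfVectorizer (the deck names CountVectorizer and TF-IDF but does not show parameters):

from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each post title into a sparse TF-IDF vector; one column per vocabulary
# term, which is where the ~100,000 columns come from.
vectorizer = TfidfVectorizer(stop_words="english")
X_text = vectorizer.fit_transform(posts["title"].fillna(""))
print(X_text.shape)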

Future: Explore platforms that handle large data sets better, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn't address in time.

13

Analyzing the Data: SVC

[Bar chart (Accuracy): accuracy by kernel (Linear, Poly, Sigmoid, RBF); y-axis 0.88 to 0.94]

[Line chart (Accuracy w/ Linear Kernel): accuracy by C value (0.001, 0.01, 0.1); y-axis 0.92 to 0.938]

Linear kernel = 0.9368; C value of 0.1 = 0.9363

from sklearn import svm
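Not the deck's actual code: a sketch of the kernel and C comparison shown above, assuming the TF-IDF matrix X_text and the popular label from the earlier sketches; the train/test split details are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_text, posts["popular"], test_size=0.3, random_state=42
)

# Compare kernels at the default C.
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    acc = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
    print(kernel, round(acc, 4))

# Tune C for the linear kernel.
for C in [0.001, 0.01, 0.1]:
    acc = SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_test, y_test)
    print("C =", C, round(acc, 4))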

14

Analyzing the Data: Random Forests

from sklearn import ensemble

[Line chart: accuracy by n_estimators (5, 10, 20, 50, 100, 125, 150); y-axis 0.885 to 0.925]

[Line chart: accuracy by max_depth (5, 40, 100, 150, 200, 250, 300); y-axis 0.885 to 0.93]

N estimators of 125 = 0.922; max depth of 250 = 0.924
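Not the deck's actual code: a sketch of the random forest tuning shown above, assuming the same train/test split as the SVC sketch; the value grids mirror the charts.

from sklearn.ensemble import RandomForestClassifier

# Sweep the number of trees.
for n in [5, 10, 20, 50, 100, 125, 150]:
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    print("n_estimators =", n, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))

# Sweep the maximum tree depth at the best n_estimators.
for depth in [5, 40, 100, 150, 200, 250, 300]:
    clf = RandomForestClassifier(n_estimators=125, max_depth=depth, random_state=42)
    print("max_depth =", depth, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))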

15

Analyzing the Data: Logistic

0.925&

0.93&

0.935&

0.94&

0.945&

0.95&

0.001& 0.01& 0.1& 1& 10& 50&

C Value of 1 = .9471 L1 = .947733 L2 = .947066

0.9469&

0.94695&

0.947&

0.94705&

0.9471&

0.94715&

0.9472&

0.94725&

0.9473&

0.94735&

0.9474&

L1& L2&

from sklearn import linear_model
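Not the deck's actual code: a sketch of the logistic regression tuning shown above, assuming the same train/test split as earlier; the liblinear solver is chosen because it supports both L1 and L2 penalties on sparse input.

from sklearn.linear_model import LogisticRegression

# Sweep the regularization strength.
for C in [0.001, 0.01, 0.1, 1, 10, 50]:
    clf = LogisticRegression(C=C, solver="liblinear")
    print("C =", C, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))

# Compare L1 vs. L2 penalties at the best C.
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(C=1, penalty=penalty, solver="liblinear")
    print(penalty, round(clf.fit(X_train, y_train).score(X_test, y_test), 4))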

16

Totally Crushing It!

17

Analyzing the Data: Classification Report

[Classification report tables shown for: Random Forests, Logistic Regression, SVC]
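Not from the original deck: a sketch of how those reports might be produced, reusing the best settings reported on the earlier slides and the same train/test split (the model assembly here is for illustration).

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.svm import SVC

models = {
    "Random Forests": RandomForestClassifier(n_estimators=125, max_depth=250, random_state=42),
    "Logistic Regression": LogisticRegression(C=1, solver="liblinear"),
    "SVC": SVC(kernel="linear", C=0.1),
}

# Precision / recall / F1 per class for each model.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))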

18

Soooo Not Crushing It

19

Feature Reduction: Accuracy

[Bar chart: accuracy for each model with all features vs. reduced features]

All features: Logistic Regression 94.71%, SVC 93.63%, Random Forests 92.4% (matching the earlier per-model slides)

Reduced features: 94.3%, 94.5%, 95.2%

20

Feature Reduction: Classification Report

[Classification report tables shown for each model with all features and with reduced features: Random Forests, Logistic Regression, SVC]

21

Next Steps

Dealing with the processing issues:
• Learn and try out PySpark

Answer some additional questions:
• Reevaluate how I handle the domains
  • I originally bucketed domains by their frequency/occurrence in the data set. However, given that the originating domain of the content and the title make up the majority of the "post", and the top 15 domains account for the vast majority of posts, I want to focus on posts from those ~15 domains to get a better picture of how they explicitly affect popularity.
• Run the data with varying n-gram levels (see the sketch after this list)
  • I tried them, but they expanded the columns to hundreds of thousands, which just seemed to freeze, so hopefully something like PySpark will help with the processing.
• Predict subreddit/category questions:
  • Can I predict the category of a post?
  • Do certain subreddits produce more overall popular content than others? Bears With Beaks vs. ggggg (whatever the hell that is)
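A sketch of the n-gram experiment mentioned above, assuming the posts DataFrame from the earlier sketches; the ranges are illustrative. Including bigrams and trigrams multiplies the vocabulary, which is what pushed the column count into the hundreds of thousands.

from sklearn.feature_extraction.text import CountVectorizer

# Column count grows quickly as longer n-grams are added.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    X_ngrams = vec.fit_transform(posts["title"].fillna(""))
    print(ngram_range, X_ngrams.shape[1], "columns")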

APPENDIX

22

0"20000"40000"60000"80000"

100000"120000"140000"160000"180000"200000"

50)99"

100)199"

200)299"

300)399"

400)499"

500)999"

1000)4999"

5000)9999"

10000+"

Score&Counts&

23

Reviewing the Data: Reevaluate Popularity

~12% of posts

Note: this is based on only about half the data; IPython was unable to render the histogram, so the data was exported and the histogram was built in Excel.

~8% of posts

24

Analyzing the Data: SVC

[Line chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.688 to 0.71]

Accuracy score: C value of 0.1 = 0.7077

[Confusion matrix shown on the original slide]
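Not from the original deck: a sketch of how these confusion matrices might be produced, assuming the train/test split from the earlier sketches (the appendix results use the reevaluated popularity labels, so the popular column would be rebuilt with the new cutoff first).

from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Rows are true classes, columns are predicted classes.
svc = SVC(kernel="linear", C=0.1).fit(X_train, y_train)
print(confusion_matrix(y_test, svc.predict(X_test)))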

25

Analyzing the Data: Random Forest

[Line chart: accuracy by n_estimators (5, 10, 20, 50, 100, 125); y-axis 0.77 to 0.83]

[Line chart: accuracy by max_depth (40, 100, 150, 200, 250); y-axis 0.775 to 0.83]

Accuracy scores: n_estimators of 100 = 0.8218; max_depth of 200 = 0.8247

[Confusion matrix shown on the original slide]

26

Analyzing the Data: Logistic

[Line chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.81 to 0.85]

Accuracy score: C = 1, penalty = L2 gives 0.8453

[Confusion matrix shown on the original slide]
