EE219 Project 5
Popularity Prediction on Twitter
Winter 2018
Bakari Hassan, 705035029
Agam Tomar, 704775462
William Leach, 705034360
Germán García, 605068402
Contents

Popularity Prediction
  Problem 1.1: Statistics for each hashtag
  Problem 1.2: Linear Regression model fit on given features
  Problem 1.3: Linear Regression Model fit on extracted features
  Problem 1.4: k-Fold Cross Validation on extracted best features
    The results of performing cross validation on each hashtag
    The results of performing cross validation on each hashtag in Period 1 (Before Feb. 1, 8:00 a.m.)
    The results of performing cross validation on each hashtag in Period 2 (Between Feb. 1, 8:00 a.m. and Feb. 1, 8:00 p.m.)
    The results of performing cross validation on each hashtag in Period 3 (After Feb. 1, 8:00 p.m.)
  Problem 1.5: Testing of models
  Problem 2: Fan Base Prediction
    Preprocessing the tweets
    Binary Classifiers
  Problem 3: Define Your Own Project
Popularity Prediction
Problem 1.1: Statistics for each hashtag

In this problem, we first calculated the following statistics for each hashtag: average number of tweets per hour, average number of followers of users posting the tweets, and average number of retweets. The table below shows the statistics.
Table 1: Hashtag Statistics
As required, we collected the number of tweets per hour over time for two specific hashtags (#SuperBowl and #NFL) and plotted histograms with a 1-hour window.
Figure 1: Number of tweets per hour for #SuperBowl
Figure 2: Number of tweets per hour for #NFL
Problem 1.2: Linear Regression model fit on given features

For all but one of the six models created (one for each hashtag), more than half of the features passed the benchmark of having a p-value less than or equal to 0.05. All features passing the test are highlighted in green below in Table 2. The only exception was #patriots, for which only two features passed the test (number of tweets and total number of retweets). No single feature passed the p-value benchmark for all models, which suggests the selection of features is quite good: every feature is considered useful for one or more of the models. T-test scores were not exceptionally high, with only the number of tweets receiving a double-digit score several times. #sb49 and #superbowl received the best R-squared scores, meaning those models explained a high fraction of the total data variance, leaving only a small residual unexplained. The lowest RMSE was for #gopatriots; however, its R-squared score is mediocre, meaning that model would probably have trouble with unseen datasets and over-fitting may easily occur.
Table 2: Summary of model performance for all hashtags
Hashtag R squared RMSE Feature t-test score p-value
#nfl 0.56 583.5
Number of tweets 5.15 0.00
Total number of retweets -0.10 0.92
Sum of the number of followers 2.31 0.02
Maximum number of followers -1.93 0.05
Time of day 4.68 0.00
#patriots 0.71 2355.9
Number of tweets 21.15 0.00
Total number of retweets -5.68 0.00
Sum of the number of followers 1.38 0.17
Maximum number of followers 1.55 0.12
Time of day 1.42 0.16
#sb49 0.85 3878.5
Number of tweets 32.29 0.00
Total number of retweets -7.23 0.00
Sum of the number of followers 0.73 0.46
Maximum number of followers 4.24 0.00
Time of day -0.86 0.39
#superbowl 0.86 6609.5
Number of tweets 27.36 0.00
Total number of retweets -2.21 0.03
Sum of the number of followers -19.34 0.00
Time of day 9.43 0.00
Number of posts with media -1.98 0.05
#gohawks 0.51 937.2
Number of tweets 9.27 0.00
Total number of retweets -5.23 0.00
Sum of the number of followers -3.88 0.00
Maximum number of followers 1.99 0.05
Time of day 1.79 0.07
#gopatriots 0.61 193.5
Number of tweets 1.26 0.21
Total number of retweets -2.93 0.00
Sum of the number of followers 3.22 0.00
Maximum number of followers -3.51 0.00
Time of day 2.15 0.03
Problem 1.3: Linear Regression Model fit on extracted features

Five features were selected for this problem:
1. Total number of posts in a foreign language
2. Combined total number of post likes for all users
3. Combined total number of impressions1 for each user
4. Harmonic average of all users’ average post rate for the lifetimes of their accounts
5. Total number of posts that included a media attachment2
The total number of posts in a foreign language was of interest due to the international popularity of the Super Bowl and the language diversity within the United States; we wondered whether non-English users' activity influences overall Twitter activity. The combined number of post likes indicates how interactive Twitter users are during that window. The combined number of impressions is a collective measure of user influence and visibility. By taking the harmonic average of users' post rates (posts per hour), we get a sense of how active users have been with original posts throughout their entire use of Twitter. Finally, the total number of posts with media attachments helps determine whether posts with media attached tend to encourage or discourage Tweet activity.
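The harmonic average used in feature 4 down-weights a few hyperactive accounts relative to the arithmetic mean; a minimal sketch (the rates below are made up):

```python
import numpy as np

def harmonic_mean(rates):
    """Harmonic average of per-user lifetime posting rates (posts/hour).

    The harmonic mean is dominated by the smallest values, so a few
    hyperactive accounts do not inflate it the way they would the
    arithmetic mean.
    """
    rates = np.asarray(rates, dtype=float)
    rates = rates[rates > 0]  # a zero rate would drive the mean to zero
    return len(rates) / np.sum(1.0 / rates)

# Hypothetical posting rates for four users active in a one-hour window
print(harmonic_mean([0.1, 0.5, 2.0, 40.0]))
```

Note that the single 40 posts/hour account barely moves the result, whereas the arithmetic mean would jump to over 10.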
Features were first filtered by the requirement that their p-values be less than or equal to 0.05. Then, amongst the features satisfying that criterion, they were ranked in descending order by the absolute value of their t-test scores. The harmonic average of each user's average lifetime posting rate never achieved a p-value less than or equal to 0.05, while every other feature satisfied that criterion at least once. In all cases except #sb49, at least three features passed the p-value test; for #sb49, only two features had p-values low enough for selection.

Feature                               #nfl  #patriots  #sb49  #superbowl  #gohawks  #gopatriots  Avg. rank  Overall rank
Number of posts with media              2       2        5        1          1          2          2.17          1
Number of posts in foreign language     3       1        1        4          2          3          2.33          2
Number of user likes                    1       5        2        3          5          1          2.83          3
Number of user impressions              4       3        4        2          3          4          3.33          4
Average post rate                       5       4        3        5          4          5          4.33          5

When the predicted values are plotted against the values of the three top features, a linear relationship is expected, with the effective slope representing the overall weight assigned to that feature vector. The closer all points are to a single line, the smaller the error and randomness that reside in the model. As expected, for the top feature (number of posts with media), the relationship is linear, with few points at large standoff distances from a linear fit, as shown in Figure 3.

1 Impressions is the number of unique users who have seen the host user's tweets in their Twitter stream.
2 Tweet metadata do not specify what type of media is attached; photos, audio, video, and GIFs are all considered "photos."
Figure 3
The plot for the second feature (number of posts in foreign language) is similar, but with a smaller slope and increased residual error (Figure 4).
Figure 4
And finally, the trend becomes noticeably more chaotic for the number of user likes, shown in Figure 5. The lack of linearity suggests this feature may provide varied performance depending on the hashtag and dataset; that is, the final prediction is less correlated with this feature.
Figure 5
The hourly prediction of all Tweet data using these features resulted in a coefficient of determination
equal to 0.74.
Problem 1.4: k-Fold Cross Validation on extracted best features

We performed cross validation for the models listed in Table 3 using the features listed in Table 4.
Table 3: Table of Regression Models we used in Cross validation
S. No. Regression Models
0 LinearRegression
1 LogisticRegression
2 MLPRegressor
3 Ridge
4 BayesianRidge
5 kNN Regressor
6 Support Vector Regression (SVR) using radial basis function kernel
Table 4: List of features with their F-value and p-value
Features F-Value p-value
longTweet 3400.16454287 0.00000000
userID 3304.16300387 0.00000000
statuses_count 2715.84548810 0.00000000
tweetCount 2634.26209141 0.00000000
lang 1868.91214511 0.00000000
retweetCount 1653.63704932 0.00000000
favoriteCount 1627.61611160 0.00000000
followerSum 280.85821316 0.00000000
rankingScore 91.37921546 0.00000000
impressionCount 1.66588400 0.19711578
Table 4 suggests that longTweet (tweet length > 100), userID, and statuses_count are the top three features, with the highest F-values and p-values less than 0.05, which makes them significant.
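A sketch of how such F-values and p-values can be obtained with scikit-learn's f_regression (the matrix below is a synthetic stand-in, not the project's feature matrix):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(1)
n = 300
# Hypothetical stand-ins for three candidate features
X = rng.normal(size=(n, 3))
y = 4.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)  # feature 2 is pure noise

F, p = f_regression(X, y)
ranked = np.argsort(F)[::-1]                  # highest F-value first
selected = [i for i in ranked if p[i] < 0.05] # keep significant features only
print(F, p, selected)
```

The strongest feature gets the largest F-value, and the p-value threshold removes features that carry no signal.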
The results of performing cross validation on each hashtag:

Table 5: MAE values after performing cross validation on each hashtag and on the combined hashtags over the whole data
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 35.22 91.93 142.67 1325.99 400.45 1231.60 2618.70
6 37.32 190.88 263.10 1420.26 482.28 1389.92 3258.73
4 37.62 216.00 136.53 1090.24 1186.73 1289.34 4631.31
3 36.04 216.93 140.29 1219.07 1369.20 1757.90 5012.99
0 36.03 216.93 140.29 1219.08 1369.22 1757.90 5013.00
1 429.70 2718.54 815.31 30680.40 3821.00 17370.96 27673.25
2 6685.31 22789.28 34424.94 620126.31 22778.62 386025.40 380186.21
Figure 6: MAE value plot for hashtags (Y-axis on log scale)
After performing cross validation with all features and the different regression models (Table 5), we found that the kNN (k-nearest neighbors) regressor performed best at predicting the tweet count for the next hour, both for individual hashtags and for the combined hashtags. Linear regression and Ridge regression perform essentially the same. The neural network regressor performs worst, since we use only one layer with 100 hidden units and the dataset is small for training a network. Neural network performance could be improved by training for more epochs and increasing the number of layers and units (given the limitation on the data); otherwise, increasing the training data would help improve the predictions.
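The model comparison can be sketched with scikit-learn's cross_val_score and MAE scoring. Note that the synthetic data here are linear, so the linear models win; on the project's real data, kNN performed best:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic hourly feature matrix standing in for one hashtag's data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "kNN": KNeighborsRegressor(n_neighbors=5),
}
mae = {}
for name, model in models.items():
    # neg_mean_absolute_error: higher is better, so negate to recover MAE
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
print(mae)
```

Swapping in MLPRegressor, BayesianRidge, LogisticRegression, and SVR into the same dictionary reproduces the full Table 5 comparison.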
Then we moved to prediction of tweet count in three time intervals:
The results of performing cross validation on each hashtag in Period 1 (Before Feb. 1, 8:00 a.m.):
Table 6: MAE value table after performing cross validation on each hashtag and combined hashtag for period 1
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 8.80 75.67 98.50 49.37 115.45 188.39 444.96
4 14.21 196.99 80.20 50.05 158.08 214.77 706.27
3 14.89 196.20 81.03 51.19 166.84 250.57 727.47
0 14.90 196.20 81.03 51.19 166.85 250.57 727.47
6 10.66 136.04 162.02 103.46 167.38 248.84 896.11
1 111.54 1252.23 297.00 968.63 275.63 1881.03 2844.44
2 2510.83 12162.52 22648.36 238459.08 28340.40 29686.48 593798.11
Cross validation with all features on different regression models yields that kNN is performing the best
for individual hashtags and combined hashtags.
Figure 7: MAE value plot for hashtags for period 1 (Y-axis on log scale)
The results of performing cross validation on each hashtag in Period 2 (Between Feb. 1, 8:00 a.m. and Feb. 1, 8:00 p.m.):

Table 7: MAE values after performing cross validation on each hashtag and on the combined hashtags for period 2
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
4 1169.80 3552.73 2222.86 36957.72 13996.46 51628.20 82227.57
1 1657.05 3897.15 3323.65 33685.95 15031.05 76986.80 86606.25
5 890.65 2625.21 2355.09 33025.91 14435.55 49612.86 90519.71
6 1460.45 3344.55 5220.65 31521.85 12097.55 120109.70 99498.55
3 5712.46 1800.86 3891.22 236101.97 45591.78 1151948.6 713835.23
0 5734.50 1802.68 3891.29 236363.71 45591.72 1151698.8 713848.41
2 477141.74 909171.2 1833006 36943729 7119938 18957788 76358613
Cross validation with all features on different regression models yields that Bayesian Ridge, kNN and
Linear regression are performing the best for individual hashtags and combined hashtags.
Figure 8: MAE value plot for hashtags for period 2 (Y-axis on log scale)
The results of performing cross validation on each hashtag in Period 3 (After Feb. 1, 8:00 p.m.):

Table 8: MAE values after performing cross validation on each hashtag and on the combined hashtags for period 3
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 4.63 29.55 168.78 154.95 106.61 288.34 557.66
4 2.25 94.69 125.63 128.60 80.51 199.20 563.95
3 2.72 1066.78 128.73 134.21 81.57 260.43 732.90
0 3.26 3689.54 128.74 134.21 81.58 260.43 732.91
6 4.82 34.77 294.73 344.69 142.34 585.42 1256.85
1 4.79 80.65 308.74 1238.26 381.84 1182.00 1549.60
2 1795.18 45454.38 516471.6 563140.37 103110.0 265952.45 2019312.4
Cross validation with all features on different regression models yields that Bayesian Ridge and kNN are
performing the best for individual hashtags and combined hashtags.
Figure 9: MAE value plot for hashtags for period 3 (Y-axis on log scale)
Problem 1.5: Testing of models
In this part we test the best features and model found from analyzing the training data in the previous sections. We found good features for each hashtag and for the various time periods around the Super Bowl. Next, we use the overall best model from 1.4 to test how well we can predict popularity for a tweet in any of the hashtags from any of the time periods.
Instead of using tweets from one hour to predict the next, we use tweets from up to a five-hour window to predict the next hour. This helps in two ways. First, we have a larger amount of time to pull features from, which gives the model a better picture of the statistical distribution it is predicting from. Second, we concatenate the features from each hour instead of just summing them. For example, with the feature Number of Retweets, instead of summing the number over all 5 hours, we compute the number for each of the 5 hours and concatenate them. This results in an N by (num_features * 5) dimensional feature array, where N is the number of windows, num_features is how many features we extract, and 5 is how many hours we extract over. This concatenation gives us better time precision for our features and can reveal their temporal behavior over a time window: if the features at the beginning of the window are high but those at the end are low, this may be a sign the popularity is dying rather than growing. This works in the opposite situation as well.
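A minimal sketch of this windowing scheme (the numbers are synthetic; the function name is illustrative, not the project's extraction code):

```python
import numpy as np

def window_features(hourly, window=5):
    """Concatenate per-hour feature rows over a sliding window.

    hourly : (T, num_features) array, one row per hour.
    Returns X of shape (T - window, num_features * window) plus, for
    each window, the index of the next hour it should predict.
    """
    T, f = hourly.shape
    X = np.stack([hourly[t:t + window].ravel() for t in range(T - window)])
    targets = np.arange(window, T)  # each window predicts the hour after it
    return X, targets

hourly = np.arange(24).reshape(12, 2)  # 12 hours, 2 features per hour
X, targets = window_features(hourly)
print(X.shape)  # (7, 10): N by (num_features * 5)
```

Each row keeps the five hourly values side by side instead of collapsing them into one sum, preserving the within-window trend.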
From part 1.4 we found that the longTweet, userID, statuses_count, tweetCount, language (English or not), favorite count, follower sum, and ranking score features all had p-values less than 0.05 and high F-values, making them suitable candidates for regression. We also found that k-nearest neighbors with k = 5 worked best overall on the aggregated hashtags and across the 3 time periods. In testing, we used these features with k-nearest neighbors on 5-hour windows to predict the last hour in each test file. We also used the first post date as the time reference.
Table 9: Error for each sample in the test file

Sample          1      2      3    4    5      6     7    8     9    10
Prediction      0   6523    311   26  202   2785    3   107  1794    2
Actual        178  82923    523  201  210  37293  120    11  2790   61
Abs. Error    730  76670    212  175    8  34508  117    96   996   59
From the results we can see that the model performs well in some cases and not in others. One reason is insufficient data: in samples 1, 7, 8, and 10 there were only one hundred to a few hundred tweets, which is not much data to pull features from to accurately predict popularity. Also, in some of the samples from period 2 there is an extreme jump in the number of tweets and a divergence from the normal tweeting pattern. This is due to the Super Bowl occurring in that time period, with everyone tweeting about the game in a way that produces outliers in the data. Lastly, some of the samples were predicted well with low error; these were the samples with at least a few thousand tweets to predict from that were not part of the Super Bowl outliers. For these samples, the regressor trained on the aggregated data performed well.
Problem 2: Fan Base Prediction

In this section, the textual content of the tweets is used to predict each user's location. The problem only considers tweets that include the hashtag #superbowl and that were tweeted from either the state of Washington or Massachusetts.
Preprocessing the tweets

Since the user's location is a free-form field entered by the user, a preprocessing step is needed before applying the algorithms to the data. This is due to the following reasons:
• The number of places people tweet from is enormous, and a simple filter such as matching the string ', MA' is not enough, since it also matches locations like 'Kuala Lumpur, MALAYSIA'.
• Some places contain the name of a state of interest but are not related to it at all; that is the case of 'Washington DC', 'Washingtonville', or 'Washington Heights, NY', for example.
• Some places would match a simple filter like the one already mentioned but are not proper locations, for example: 'Where I come from, MARS'.
For the previous reasons, the preprocessing selected was a set of simple filters on the strings of interest ('Washington', 'WA', 'Massachusetts', 'MA'), followed by a frequency analysis of the locations to determine the most frequent ones and discard those we do not want, for instance:
'BOSTON, MA HOME OF TRUE PREP'
'Seattle, WA aka The TwoOhSix'
After this filtering, considering the 100 most common locations from Massachusetts, 11388 tweets are
considered for this state. On the other hand, using the 180 most common locations from Washington,
11162 tweets are considered for this state.
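A hedged sketch of such a location filter; the function name, blacklist contents, and regular expressions are illustrative assumptions, not the report's exact rules:

```python
import re

def looks_like_state(location, state_abbr, state_name, blacklist):
    """Crude location filter: accept ', WA'-style tokens or the full
    state name, after rejecting known false positives by substring."""
    loc = location.strip()
    if any(bad.lower() in loc.lower() for bad in blacklist):
        return False
    # ', MA' must end at a word boundary, so 'Kuala Lumpur, MALAYSIA' fails
    if re.search(r',\s*' + state_abbr + r'\b', loc):
        return True
    return bool(re.search(r'\b' + state_name + r'\b', loc, re.IGNORECASE))

wa_blacklist = ['Washington DC', 'Washington, D.C.', 'Washingtonville',
                'Washington Heights']
print(looks_like_state('Seattle, WA aka The TwoOhSix', 'WA', 'Washington', wa_blacklist))  # True
print(looks_like_state('Washington Heights, NY', 'WA', 'Washington', wa_blacklist))        # False
```

The frequency analysis described above would then be applied on top of a filter like this to catch anything the regex and blacklist miss.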
Binary Classifiers
After the preprocessing mentioned in the section above, the data is transformed into a TF-IDF feature matrix to be processed by the classifiers. The best parameters for each algorithm were selected based on 5-fold cross-validation recall.
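The TF-IDF plus 5-fold recall pipeline can be sketched as follows. The toy corpus and parameter values are illustrative; the report's "latent factor" is taken here to correspond to a TruncatedSVD component count, which is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus; the project used #superbowl tweets labeled by state
tweets = ["go hawks seattle proud", "hawks win seattle", "seattle loves the hawks",
          "patriots boston strong", "go pats boston", "boston cheers the patriots"] * 10
labels = [0, 0, 0, 1, 1, 1] * 10  # 0 = Washington, 1 = Massachusetts

X = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(tweets)
Z = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)  # "latent factors"
clf = RandomForestClassifier(n_estimators=50, random_state=0)
recall = cross_val_score(clf, Z, labels, cv=5, scoring="recall")
print(recall.mean())
```

Sweeping n_components, max_features, and n_estimators in a grid over this pipeline reproduces the structure of Tables 10–12.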
Random Forest classifier
A random forest model is trained for different configurations of parameters. The results are shown in
the table below.
For each number of latent factors, the parameter configuration that performed best is shown in orange. Finally, the parameters selected are shown in blue. In this case the selection was max features 2, n estimators 150, and latent factor 100 because, even though it was not the setting with the best performance, it performed well with the lowest variance across the 5-fold validation.
Table 10: Parameter tuning random forest
Latent factor 3 4 5 10 20 50 100 200
max features
n estimators
recall
2 20 0.764 0.772 0.775 0.776 0.777 0.771 0.758 0.731
2 40 0.768 0.775 0.780 0.782 0.784 0.784 0.778 0.755
2 80 0.770 0.778 0.782 0.783 0.788 0.787 0.783 0.771
2 150 0.771 0.779 0.785 0.785 0.786 0.789 0.790 0.783
3 20 0.761 0.770 0.774 0.775 0.776 0.776 0.764 0.745
3 40 0.768 0.773 0.779 0.782 0.782 0.780 0.780 0.766
3 80 0.769 0.777 0.781 0.783 0.786 0.790 0.789 0.782
3 150 0.771 0.778 0.781 0.784 0.788 0.791 0.789 0.791
SVM classifier
An SVM model is trained for different configurations of parameters. The results are shown in the table below.
Table 11: Parameter tuning SVM
Latent factor 3 4 5 10 20 50 100 200
C 𝜸 recall
1 0.001 0.516 0.516 0.516 0.515 0.516 0.516 0.515 0.515
1 0.0001 0.500 0.500 0.500 0.500 0.500 0.500 0.500 0.500
10 0.001 0.758 0.757 0.757 0.751 0.756 0.758 0.759 0.759
10 0.0001 0.516 0.516 0.516 0.515 0.516 0.516 0.515 0.515
100 0.001 0.770 0.770 0.771 0.768 0.771 0.774 0.778 0.783
100 0.0001 0.758 0.757 0.757 0.751 0.756 0.758 0.759 0.759
1000 0.001 0.773 0.773 0.757 0.774 0.778 0.780 0.786 0.791
1000 0.0001 0.770 0.770 0.771 0.768 0.771 0.774 0.778 0.783
Neural network classifier
A neural network model is trained for different configurations of parameters. The results are shown in the table below.
Table 12: Parameter tuning NN
Latent factor 3 4 5 10 20 50 100 200
𝜶 𝒍𝒂𝒚𝒆𝒓𝒔 recall
0.0001 (100,) 0.778 0.779 0.783 0.784 0.788 0.792 0.784 0.783
0.0001 (50, 50) 0.778 0.778 0.783 0.784 0.788 0.778 0.777 0.769
0.0001 (50, 50, 50) 0.779 0.780 0.784 0.785 0.785 0.781 0.779 0.769
0.001 (100,) 0.778 0.779 0.783 0.785 0.789 0.792 0.790 0.786
0.001 (50, 50) 0.779 0.778 0.783 0.787 0.787 0.787 0.783 0.768
0.001 (50, 50, 50) 0.777 0.779 0.783 0.785 0.788 0.783 0.769 0.770
0.01 (100,) 0.776 0.777 0.783 0.784 0.790 0.791 0.791 0.788
0.01 (50, 50) 0.779 0.779 0.783 0.787 0.786 0.792 0.776 0.763
0.01 (50, 50, 50) 0.781 0.778 0.783 0.785 0.786 0.790 0.776 0.768
0.1 (100,) 0.778 0.778 0.784 0.784 0.788 0.788 0.791 0.794
0.1 (50, 50) 0.778 0.779 0.784 0.783 0.787 0.788 0.793 0.788
0.1 (50, 50, 50) 0.776 0.778 0.781 0.781 0.788 0.781 0.789 0.772
To visualize the performance of the different algorithms together, a ROC curve is plotted for the test
data.
Figure 10: ROC 3 algorithms
Since the performance is very similar between the algorithms, the figure below zooms in on a region of the previous one where the 3 algorithms show some differences.
Figure 11: Zoom ROC 3 algorithms
The figure above shows how the neural network performs better than the other 2 algorithms for any
chosen threshold.
The following figures show the confusion matrices for the three different algorithms.
Figure 12: Confusion matrix random forest
Figure 13: Confusion matrix SVM
Figure 14: Confusion matrix NN
As Figure 11 and Figure 14 show, the neural network classifier gives the highest percentage of correct classifications. These results are even clearer in Table 13.
Table 13: Performance metrics
Algorithm Accuracy Precision Recall
Random Forest 0.8006 0.7564 0.8929
SVM 0.8020 0.9188 0.9188
Neural Network 0.8037 0.7478 0.9227
Problem 3: Define Your Own Project

Problem Definition: Suppose you are traveling abroad during the Super Bowl in a remote area with no television. Although your phone has weak reception, it is sufficient for low data-rate communication. You'd like a way to follow the game score using your cell phone. It turns out you have the ability to stream Twitter metadata for #superbowl, #sb49, and #nfl. However, you must choose the data wisely due to limited bandwidth, and you must avoid text analysis/filtering to save the battery. The problem is stated as follows: given a stream of tweet metadata absent of partisan hashtags, location data, and Tweet text content, can quasi-real-time estimation of game scores and final scores be completed confidently?

Implementation: The game is assessed in time windows several minutes in length. Each window is considered a single sample and is represented by one row in the feature matrix. Given several inputs for the current window, the model attempts to determine the change in score for the two teams during that window3. Two separate models are trained, one focused on the Patriots and the other on the Seahawks. The models can distinguish field goals from touchdowns; however, they are not trained to recognize safeties or two-point conversions, since those did not occur during the Super Bowl. All time windows have start times during the game, with end times that may occur during the halftime show or after the end of the game. For windows with valid start times but invalid end times, all invalid tweets are filtered out prior to training. #nfl, #sb49, and #superbowl provide roughly 1.5 million tweets, which are reduced to 50,000 once halftime-show, pre-, and post-Super Bowl tweets are removed. The overall estimated game scoring history is produced by plotting the cumulative sum of the score changes predicted for each time window.
For the purposes of this project, it is assumed that all scores are independent of each other. In reality this is not the case: scores are affected by momentum, strategy, and the crowd.

Features: Four features were selected for each hashtag. Each feature is listed and detailed below:

1. Number of tweets during the time window
2. Mean of user IDs during the time window
3. User ID variance during the time window
4. Zipf exponent for the user ID popularity distribution during the time window

Number of tweets during window
The number of tweets during each time window is an indicator of game activity and reflects major occurrences. Figure 15 below shows the number of tweets (unfiltered) for 15-minute windows over the duration of the game. There is a large up-tick for the halftime show, as well as for the last-minute, game-ending interception by the Patriots. For each team's scores there is a delayed increase in the number of tweets, other than for the two scores immediately after the halftime show; this is likely affected by Tweet activity related to the halftime show.
3 Each estimate is rounded to arrive at an integer value for the change in score for that time window.
Figure 15
Mean of user IDs & User ID variance during time window
User ID is an indirect representation of the age of each user's Twitter account. By taking the average, we attempt to infer a group of users' favorite team based on which users are active after a certain team's score. The variance of user IDs active around the time of a team's score represents the range of users' account ages (perhaps one team's fans span a larger time frame of Twitter's existence). Justifying the selection of user ID mean and variance was not straightforward, as there are no truth data for the chosen hashtags. However, insight into the user ID distribution can be gained via the user IDs associated with #gohawks, #gopatriots, and #patriots, which are plotted below in Figure 16. There is a clear distinction between the users who use the Patriots hashtags and those who use the Seahawks hashtag. The assumption is that this holds true for the other hashtags.
Figure 16
Zipf exponent for user ID popularity distribution during time window
The Zipf distribution exponent captures the popularity distribution of users active during a time window. Perhaps one team's Twitter users are more influential or visible on Twitter, affecting the social-media equality distribution. Users ranked by influence are plotted below in Figure 17. This metric is expected to play a larger role for long-duration windows, since the number of active users will increase and the distribution will approach Zipf's law.
Figure 17
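One plausible way to estimate such a Zipf exponent is a log-log fit to the rank-frequency curve; this fitting method is an assumption, as the report does not state how its exponent was computed:

```python
import numpy as np

def zipf_exponent(tweet_counts_per_user):
    """Estimate the Zipf exponent s from a rank-frequency relation:
    freq(rank) ~ rank^(-s), so a log-log linear fit has slope -s."""
    freqs = np.sort(np.asarray(tweet_counts_per_user, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Synthetic counts constructed to follow rank^-1 exactly
counts = 1000.0 / np.arange(1, 101)
print(zipf_exponent(counts))  # 1.0
```

On real per-window user activity, the fit is only approximate, which is why the report expects the metric to behave better for longer windows.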
With four features selected for each of the three hashtags, there were 12 features in total. Ideally, these features are not highly intercorrelated, as that helps avoid linearly dependent features and ill-conditioning. The correlation between the features was assessed and plotted as a correlation matrix in Figure 18 below:
Figure 18
There was a high level of correlation between the #sb49 Zipf exponent and the Zipf exponents for
#superbowl and #nfl. The #nfl user ID variance was also highly correlated (0.9) with #nfl mean user ID,
similar to #sb49 user ID variance and #sb49 tweet count. Finally, the correlation of #sb49 user ID
variance and #sb49 mean user ID was 1.0. Although these features are highly correlated, a majority of
the features have moderate to low correlation. Once the 12 features are calculated for each window,
they’re combined to create the feature matrix β in Equation 1 below:
Equation 1
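The intercorrelation check behind Figure 18 reduces, in code, to a correlation matrix over the window-by-feature array; the 12-column matrix below is random stand-in data with one deliberately near-duplicate pair, not the project's features:

```python
import numpy as np

rng = np.random.default_rng(2)
n_windows = 40
# Hypothetical 12-column feature matrix (4 features x 3 hashtags)
beta = rng.normal(size=(n_windows, 12))
# Make two columns nearly identical, mimicking the report's 1.0 correlation
beta[:, 6] = beta[:, 2] + 0.01 * rng.normal(size=n_windows)

corr = np.corrcoef(beta, rowvar=False)  # 12 x 12 correlation matrix
print(corr.shape, corr[2, 6])
```

Off-diagonal entries near ±1, like the pair engineered here, flag the linearly dependent columns that cause the ill-conditioning discussed below.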
Game score truth data: A challenge was finding a recording of the game that includes all commercials, so that game-clock events could be mapped to local time; none were found. Luckily, Twitter posted an interactive map of geographical tweet activity as the game transpired4, which enabled truth-data generation. Score summary:

Quarter   Game Clock  True Clock (EST)           Time Diff.  Patriots  Seahawks
1         12:00       Sun Feb 01 18:30:00 2015   0:00        0         0
2         9:47        Sun Feb 01 19:13:00 2015   0:43        7         0
2         2:16        Sun Feb 01 19:36:00 2015   0:23        7         7
2         0:31        Sun Feb 01 19:49:00 2015   0:13        14        7
2         0:02        Sun Feb 01 19:59:00 2015   0:10        14        14
Halftime  N/A         Sun Feb 01 20:11:00 2015   0:12        14        14
Halftime  N/A         Sun Feb 01 20:32:00 2015   0:21        14        14
3         11:09       Sun Feb 01 20:38:00 2015   0:39        14        17
3         4:54        Sun Feb 01 20:54:00 2015   0:16        14        24
4         7:55        Sun Feb 01 21:28:00 2015   0:34        21        24
4         2:02        Sun Feb 01 21:48:00 2015   0:20        28        24
4         0:00        Sun Feb 01 22:05:00 2015   0:17        28        24
4 http://com.cartodb.visualizations.s3.amazonaws.com/v/superbowl/index.html?vis=c5539e5c-aa92-11e4-be3b-0e853d047bba&utn=srogers,superbowl15_events&t=@Patriots,B40903%7C@Seahawks,229A00
Time window size implies a sampling frequency, which involves a tradeoff between estimation triviality and the information available for quasi-real-time estimation. If the time window is the length of the game, then the likelihood of all scores occurring within that window is 1; if the window is too small, very few tweets are available for estimation, and the mean and variance metrics lose their ability to generalize over the randomness. Therefore, the effect of window duration on training-set predictions was explored prior to final assessment. Both the RMSE and the coefficient of determination increase with window size, meaning there is an ideal window duration that balances the error and the explained variance. Due to limited time, two window durations were chosen to understand their effects on performance: 5 minutes and 15 minutes, indicated by star markers in Figure 19 below.
Figure 19
Model Selection: Prior to model selection, the feature matrix condition number was calculated to determine whether it was well- or ill-conditioned. Singular value decomposition was used to obtain the feature matrix's singular values:

κ(β) = σmax(β) / σmin(β) = (5×10^18) / (1×10^-2) ≈ 10^20 ≫ 1 ⇒ ill-conditioned

Equation 2

Due to the large magnitude of the condition number, it was determined that ridge regression would be necessary; therefore, both linear and nonlinear models were considered: ordinary least squares regression, ridge regression, and k-neighbors regression were chosen. Additionally, due to the large difference in magnitude of the tweet counts (~1E+03), user IDs (~1E+08), and the Zipf parameter (~1E-01), regression was run on both non-normalized and normalized data.

Performance Assessment:
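The conditioning check described above, and the effect of the normalization also used in the assessment, can be sketched with hypothetical feature magnitudes mirroring the report's (~1E+03 tweet counts, ~1E+08 user IDs, ~1E-01 Zipf exponents); this is not the project's actual matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Unscaled columns spanning many orders of magnitude
beta = np.column_stack([
    1e3 * rng.random(50),   # tweet counts
    1e8 * rng.random(50),   # user IDs
    1e-1 * rng.random(50),  # Zipf exponents
])
s = np.linalg.svd(beta, compute_uv=False)
print(s[0] / s[-1])          # condition number = sigma_max / sigma_min
print(np.linalg.cond(beta))  # same quantity via the library helper

# Standardizing each column collapses the condition number to near 1
beta_norm = (beta - beta.mean(axis=0)) / beta.std(axis=0)
print(np.linalg.cond(beta_norm))
```

The huge raw condition number and the small normalized one illustrate why regression was run on both non-normalized and normalized data.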
An alternative method of data visualization is utilized for this problem in place of a truth-prediction scatter plot. Since the goal is to see how the scores change over the duration of the game, the cumulative score over time is used; it conveys the same information as the truth-prediction scatter plot while also showing how the scores change from beginning to end.
Figure 20
Prior to cross validation, the training set was used for both training and testing of the ordinary regression model to gauge the feasibility of the problem statement. The regressor's ability to estimate the score in quasi-real time seems promising, with Figure 21 showing little to no final-score error (a 1-point overestimate for the Seahawks) and close tracking throughout the game. Additionally, the Fréchet distance was used as a metric for the relative "closeness" of the tracking curves throughout the game. The Fréchet distance between two curves A and B (truth and prediction) is defined as follows:
Equation 3
F(A, B) = inf_{α,β} max_{t ∈ [0,1]} d(A(α(t)), B(β(t)))

where d is the distance function of the vector space (the Frobenius norm) and α, β range over continuous, non-decreasing reparameterizations of [0, 1]. Overall, the Fréchet distances for the Patriots and Seahawks were similar, with the Seahawks having a slightly larger distance because the prediction exceeded the truth scores at all points of the game. This was not the case for the Patriots, whose score-estimation curves regularly intersect the truth curve. Although the Fréchet distance is useful here, it was dropped as a final metric because the RMSE is already a suitable metric for understanding how well the scores match, and because the Fréchet distance varied very little between test runs that did not have drastically different plots.
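For sampled curves, Equation 3 can be approximated by the discrete Fréchet distance via the standard Eiter–Mannila dynamic program. The sketch below is a minimal illustration, not the report's original code; the Euclidean norm plays the role of d.

```python
import numpy as np

def discrete_frechet(A, B):
    # Discrete Fréchet distance between polygonal curves A and B
    # (arrays of points), via the Eiter-Mannila dynamic program.
    n, m = len(A), len(B)
    d = lambda i, j: np.linalg.norm(A[i] - B[j])  # pointwise distance
    ca = np.empty((n, m))
    ca[0, 0] = d(0, 0)
    for i in range(1, n):                 # first column
        ca[i, 0] = max(ca[i - 1, 0], d(i, 0))
    for j in range(1, m):                 # first row
        ca[0, j] = max(ca[0, j - 1], d(0, j))
    for i in range(1, n):                 # interior cells
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1],
                               ca[i, j - 1]), d(i, j))
    return ca[n - 1, m - 1]

# Identical curves give 0; a constant vertical offset of 1 gives 1.
t = np.linspace(0, 1, 20)
A = np.column_stack([t, np.sin(t)])
print(discrete_frechet(A, A))                        # → 0.0
print(discrete_frechet(A, A + np.array([0.0, 1.0]))) # → 1.0
```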
According to this plot, the problem appears tractable given the four features for each dataset and a window duration of 15 minutes. Actual performance will likely depend on over-fitting and on whether the test sets resemble the information contained in the training sets. Cross validation: Five-fold cross validation was run on the training dataset with balanced time-window sample populations. For the ridge and k-neighbors folds, performance was assessed over a range of regularization parameters and neighbor counts. RMSE was used as the main metric for relative model performance, rather than the coefficient of determination, because the interest is in the actual scores and not general data trends. Regression performance for all cases is summarized below in Table 14. Table 14: Regression performance summary
Regression model | Window (min) | Team | Pre-processing | Optimal alpha | Optimal neighbor count | Mean RMSE (points)
Ordinary Least Squares | 5 | Patriots | None | N/A | N/A | 11.83
Ordinary Least Squares | 5 | Patriots | Normalization | N/A | N/A | 34.38
Ordinary Least Squares | 5 | Seahawks | None | N/A | N/A | 21.91
Ordinary Least Squares | 5 | Seahawks | Normalization | N/A | N/A | 19.86
Ordinary Least Squares | 15 | Patriots | None | N/A | N/A | 48.52
Ordinary Least Squares | 15 | Patriots | Normalization | N/A | N/A | 202.15
Ordinary Least Squares | 15 | Seahawks | None | N/A | N/A | 32.52
Ordinary Least Squares | 15 | Seahawks | Normalization | N/A | N/A | 14.39
Ridge | 5 | Patriots | None | 10 | N/A | 11.84
Ridge | 5 | Patriots | Normalization | 0.001 | N/A | 20.13
Ridge | 5 | Seahawks | None | 10 | N/A | 2.03
Ridge | 5 | Seahawks | Normalization | 1 | N/A | 1.82
Ridge | 15 | Patriots | None | 10 | N/A | 102.90
Ridge | 15 | Patriots | Normalization | 0.01 | N/A | 3.09
Ridge | 15 | Seahawks | None | 10 | N/A | 78.34
Ridge | 15 | Seahawks | Normalization | 1 | N/A | 2.80
K Neighbors | 5 | Patriots | None | N/A | 42 | 2.01
K Neighbors | 5 | Patriots | Normalization | N/A | 13 | 2.03
K Neighbors | 5 | Seahawks | None | N/A | 42 | 1.81
K Neighbors | 5 | Seahawks | Normalization | N/A | 42 | 1.81
K Neighbors | 15 | Patriots | None | N/A | 13 | 3.05
K Neighbors | 15 | Patriots | Normalization | N/A | 3 | 3.06
K Neighbors | 15 | Seahawks | None | N/A | 14 | 2.76
K Neighbors | 15 | Seahawks | Normalization | N/A | 13 | 2.81

Figure 21
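A minimal sketch of the cross-validation comparison with scikit-learn, using synthetic stand-in data (feature scales mimic the tweet-count / user-ID / Zipf disparity): the models and the RMSE scoring mirror Table 14, but the data and numbers here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the per-window features and cumulative scores.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(1e2, 5e3, 120),   # tweet counts
                     rng.uniform(1e7, 1e9, 120),   # user IDs
                     rng.uniform(0.05, 0.5, 120)]) # Zipf parameter
y = 0.004 * X[:, 0] + 30.0 * X[:, 2] + rng.normal(0, 1.0, 120)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "OLS": LinearRegression(),
    "Ridge (normalized)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "k-NN (normalized)": make_pipeline(StandardScaler(),
                                       KNeighborsRegressor(n_neighbors=13)),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()  # mean RMSE over the five folds
    print(f"{name}: mean RMSE = {results[name]:.2f}")
```

Wrapping the scale-sensitive models in a `StandardScaler` pipeline keeps normalization inside each fold, avoiding leakage from the held-out windows.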
Ordinary Least Squares Regression: For the ordinary least squares regression method, overall performance was dismal. Score estimation for both the Patriots and Seahawks was highly inaccurate, with large RMSE values.
Figure 22
One trend noticed in the estimation plots in Figure 22 and Figure 23 is that normalization tends to produce monotonically increasing changes in score for every time window. Conversely, for the non-normalized feature matrix cases, the steps are not monotonic, showing both increases and decreases of varying magnitude.
Figure 23
Ridge Regression: In general, score-tracking performance was unsatisfactory using ridge regression. Root mean squared error generally increased with decreasing regularization parameter, with a few exceptions. RMSE values were much higher for the data without normalization pre-processing (shown in Figure 24 and Figure 25 below), and were extremely high for the non-normalized 15-minute-window cross-validation runs. The magnitude of the errors for the Patriots is apparent in the score-estimation plot above in Figure 23 for 15-minute windows, with the accumulated error producing a final score estimate of roughly 3,000 points. An interesting case is the Patriots' regression RMSE for 15-minute windows: over the range of regularization parameters considered, its RMSE is unaffected.
Figure 24 (left: 5-minute window; right: 15-minute window)
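The regularization-parameter sweep can be sketched with scikit-learn's `RidgeCV` over the alpha range appearing in Table 14 (0.001 to 10), assuming a per-window feature matrix `X` and score vector `y`; both are synthetic here and illustrative only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic per-window features X and scores y (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, (100, 4)) * np.array([1e3, 1e8, 1e-1, 1e2])
y = X @ np.array([0.01, 1e-8, 20.0, 0.05]) + rng.normal(0, 2.0, 100)

# Sweep alpha over 0.001 ... 10 and keep the value with the best
# held-out RMSE; normalization happens inside the pipeline.
alphas = np.logspace(-3, 1, 5)
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_root_mean_squared_error"),
)
model.fit(X, y)
print("selected alpha:", model[-1].alpha_)
```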
Quasi-real-time estimates via ridge regression without normalization were very similar to those of the ordinary regression model. While the ridge regression 15-minute-window estimates were identical to the ordinary regression estimates, the magnitude of deviation for the 5-minute window was reduced by a factor of two, accompanied by an RMSE reduction by a factor of 10 (from 21.91 to 2.03 points; Figure 26). Additionally, the overall trend for the Seahawks remained the same: a decrease from the beginning of the game into negative points, an increase around 125 minutes into the game, and then a steep descent until the end of the game.
Figure 26
Similar to the normalized features for ordinary regression, normalizing the features for ridge regression produced mostly monotonic steps in the score estimates from window to window (Figure 27). A particularly interesting case was the 5-minute windows, in which the Patriots' and Seahawks' score estimates over time were identical. While those estimates were always above the actual game scores, this was not the case for 15-minute windows, where estimates were consistently within the range of the truth scores. Additionally, the final score for the Seahawks was correct, while the Patriots' estimate was off by a small error of 3 points, equivalent to a missed field goal.
Figure 25 (left: 5-minute window; right: 15-minute window)
Figure 27
In general, score-tracking performance was unsatisfactory using ridge regression. However, for normalized features over 15-minute time windows, the final score predictions are very accurate. Therefore, if the goal is only to know the final game score, a simple ridge regression model is a candidate, pending test performance on previous Super Bowls. K-Neighbors Regression: The nonlinear model, k-neighbors regression, had the best overall performance, with a mean RMSE of 2.4 points over all eight cases compared to 48 points for ordinary least squares and 27.9 points for ridge regression. K-neighbors regression was run for neighbor counts ranging from one to the number of samples minus one. The optimal number of neighbors ranged from 3 to 42 depending on the case. Figure 28 and Figure 29 show that for 5-minute windows, the Patriots' and Seahawks' RMSE follow similar overall trends, with slight deviations for lower neighbor counts.
Figure 28 (left: 5-minute window; right: 15-minute window)
Figure 29 (left: 5-minute window; right: 15-minute window)
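The neighbor-count sweep can be sketched as a grid search over `n_neighbors`, keeping the count with the lowest cross-validated RMSE. The data is synthetic, and the cap of 39 neighbors is for speed only; the report's sweep ran from one to the number of samples minus one.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, (80, 4))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 80)

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 40))},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"],
      f"(CV RMSE {-grid.best_score_:.3f})")
```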
For ridge regression, monotonically increasing steps in the score were seen only for normalized data; here, however, the trend is present in all test runs, shown in Figure 30 and Figure 31. Five-minute time windows consistently underperformed relative to the 15-minute-window cases, with the estimated score always greater than the actual score and with large errors in the final-game-score estimate, almost twice the actual game scores.
Figure 30
Normalization did not have an impact on score estimation for the Seahawks relative to ridge regression: the RMSE and score predictions over time were almost identical in both cases. Results for the normalized k-neighbors feature set were similar to those of the non-normalized set. Therefore, for k-neighbors regression, normalization does not play an important role in performance, and 15-minute windows perform significantly better than 5-minute windows.
Figure 31
Summary & Improvements: Recall the posed problem: given a stream of tweet metadata absent of partisan hashtags, location data, and tweet text content, can quasi-real-time estimation of game scores and final scores be completed confidently?

For final scores, the answer is a strong yes. The k-neighbors regression model had very low RMSE and consistently made accurate end-of-game score estimates, which is quite impressive given the low model complexity. For quasi-real-time tracking throughout the game, the answer is not a confident "yes": no model is a strong winner for predicting the Patriots' versus the Seahawks' scores as the game progresses, although there is potential for improvement (discussed below). Even so, the attempt is considered a success, since the focus of the posed problem was to see how well scores could be estimated with minimal processing and model complexity. It was realized early on that improved performance would be attainable by using team-specific hashtag features or content analysis to find words such as "field goal," "touchdown," "score," etc.

In summary, ordinary least squares performance is poor for both the Patriots and the Seahawks regardless of time-window duration. Ridge regression should only be used with normalization, in which case the Patriots' and Seahawks' estimates were similar. K-neighbors has the lowest RMSE for the Seahawks' scores.

Areas of consideration for improving the estimates include additional nonlinear models. A neural network could potentially perform well on this task, as tweet volume is directly affected by events such as kickoff (when viewership is high) and the halftime show. Halftime-show tweets could be filtered via text analysis by removing tweets containing "Katy," "Perry," and "Show." Additionally, sentiment could provide more context for scores. However, both sentiment analysis and filtering require text processing, which adds significant resource utilization.
Additionally, if the number of tweets in each time window is unbalanced, that could bias future performance.