EE219 Project 5
Popularity Prediction on Twitter
Winter 2018
Bakari Hassan, 705035029
Agam Tomar, 704775462
William Leach, 705034360
Germán García, 605068402
Contents

Popularity Prediction
  Problem 1.1: Statistics for each hashtag
  Problem 1.2: Linear Regression model fit on given features
  Problem 1.3: Linear Regression Model fit on extracted features
  Problem 1.4: k-Fold Cross Validation on extracted best features
    The results of performing cross validation on each hashtag
    The results of performing cross validation on each hashtag in Period 1 (Before Feb. 1, 8:00 a.m.)
    The results of performing cross validation on each hashtag in Period 2 (Between Feb. 1, 8:00 a.m. and Feb. 1, 8:00 p.m.)
    The results of performing cross validation on each hashtag in Period 3 (After Feb. 1, 8:00 p.m.)
  Problem 1.5: Testing of models
  Problem 2: Fan Base Prediction
    Preprocessing the tweets
    Binary Classifiers
  Problem 3: Define Your Own Project
Popularity Prediction
Problem 1.1: Statistics for each hashtag

In this problem, we first calculated the following statistics for each hashtag: average number of tweets per hour, average number of followers of users posting the tweets, and average number of retweets. The table below shows the statistics.
Table 1: Hashtag Statistics
As required, we collected the number of tweets per hour over time for two specific hashtags (#SuperBowl and #NFL) and plotted histograms with a 1-hour window.
Figure 1: Number of tweets per hour for #SuperBowl
Figure 2: Number of tweets per hour for #NFL
Problem 1.2: Linear Regression model fit on given features

For all but one of the six models created (one for each hashtag), more than half of the features passed the benchmark of having a p-value less than or equal to 0.05. All features passing the test are highlighted in green below in Table 2. The only exception was #patriots, for which only two features passed the test (number of tweets and total number of retweets). No single feature passed the p-value benchmark for all models, which suggests the selection of features is quite good: every feature is considered useful for one or more of the models. T-test scores were not exceptionally high, with only the number of tweets receiving a double-digit score several times. #sb49 and #superbowl received the best R-squared scores, meaning those models explained a high fraction of the total data variance, leaving only a small residual unexplained. The lowest RMSE was for #gopatriots; however, its R-squared score is mediocre, meaning that model would probably have trouble with unseen datasets and over-fitting may easily occur.
Table 2: Summary of model performance for all hashtags
Hashtag R squared RMSE Feature t-test score p-value
#nfl 0.56 583.5
Number of tweets 5.15 0.00
Total number of retweets -0.10 0.92
Sum of the number of followers 2.31 0.02
Maximum number of followers -1.93 0.05
Time of day 4.68 0.00
#patriots 0.71 2355.9
Number of tweets 21.15 0.00
Total number of retweets -5.68 0.00
Sum of the number of followers 1.38 0.17
Maximum number of followers 1.55 0.12
Time of day 1.42 0.16
#sb49 0.85 3878.5
Number of tweets 32.29 0.00
Total number of retweets -7.23 0.00
Sum of the number of followers 0.73 0.46
Maximum number of followers 4.24 0.00
Time of day -0.86 0.39
#superbowl 0.86 6609.5
Number of tweets 27.36 0.00
Total number of retweets -2.21 0.03
Sum of the number of followers -19.34 0.00
Time of day 9.43 0.00
Number of posts with media -1.98 0.05
#gohawks 0.51 937.2
Number of tweets 9.27 0.00
Total number of retweets -5.23 0.00
Sum of the number of followers -3.88 0.00
Maximum number of followers 1.99 0.05
Time of day 1.79 0.07
#gopatriots 0.61 193.5
Number of tweets 1.26 0.21
Total number of retweets -2.93 0.00
Sum of the number of followers 3.22 0.00
Maximum number of followers -3.51 0.00
Time of day 2.15 0.03
Problem 1.3: Linear Regression Model fit on extracted features

Five features were selected for this problem:
1. Total number of posts in a foreign language
2. Combined total number of post likes for all users
3. Combined total number of impressions1 for each user
4. Harmonic average of all users’ average post rate for the lifetimes of their accounts
5. Total number of posts that included a media attachment2
The total number of posts in a foreign language was of interest due to the international popularity of the Super Bowl and the language diversity within the United States; we wondered whether non-English users' activity influences overall Twitter activity. The combined number of post likes indicates how interactive Twitter users are during that window. The combined number of impressions is a collective measure of user influence and visibility. By taking the harmonic average of users' post rates (posts per hour), we get a sense of how active users have been with original posts throughout their entire use of Twitter. Finally, the total number of posts with media attachments helps determine whether posts with media attached tend to encourage or discourage Tweet activity.
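The harmonic average used in feature 4 down-weights a few hyperactive accounts relative to the arithmetic mean; a minimal sketch (the rates below are made up):

```python
import numpy as np

def harmonic_mean(rates):
    """Harmonic average of per-user lifetime posting rates (posts/hour).

    The harmonic mean is dominated by the smallest values, so a few
    hyperactive accounts do not inflate it the way they would the
    arithmetic mean.
    """
    rates = np.asarray(rates, dtype=float)
    rates = rates[rates > 0]  # a zero rate would drive the mean to zero
    return len(rates) / np.sum(1.0 / rates)

# Hypothetical posting rates for four users active in a one-hour window
print(harmonic_mean([0.1, 0.5, 2.0, 40.0]))
```

Note that the single 40 posts/hour account barely moves the result, whereas the arithmetic mean would jump to over 10.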
Features were first filtered by the requirement that their p-values be less than or equal to 0.05. Then, amongst the features satisfying that criterion, they were ranked in descending order by the absolute value of their t-test scores. The harmonic average of each user's average lifetime posting rate never achieved a p-value less than or equal to 0.05, while every other feature satisfied that criterion at least once. In all cases except #sb49, at least three features passed the p-value test; for #sb49, only two features had p-values low enough for selection.

Feature                               #nfl  #patriots  #sb49  #superbowl  #gohawks  #gopatriots  Avg. rank  Overall rank
Number of posts with media              2       2        5        1          1          2          2.17          1
Number of posts in foreign language     3       1        1        4          2          3          2.33          2
Number of user likes                    1       5        2        3          5          1          2.83          3
Number of user impressions              4       3        4        2          3          4          3.33          4
Average post rate                       5       4        3        5          4          5          4.33          5

When the predicted values are plotted against the values of the three top features, a linear relationship is expected, with the effective slope representing the overall weight assigned to that feature vector. The closer all points are to a single line, the smaller the error and randomness that reside in the model. As expected, for the top feature (number of posts with media), the relationship is linear, with few points at large standoff distances from a linear fit, as shown in Figure 3.

1 Impressions is the number of unique users who have seen the host user's tweets in their Twitter stream.
2 Tweet metadata do not specify what type of media is attached; photos, audio, video, and GIFs are all considered "photos."
Figure 3
The plot for the second feature (number of posts in foreign language) is similar, but with a smaller slope and increased residual error (Figure 4).
Figure 4
And finally, the trend becomes noticeably more chaotic for the number of user likes, shown in Figure 5. The lack of linearity suggests this feature may provide varied performance depending on the hashtag and dataset; that is, the final prediction is less correlated with this feature.
Figure 5
The hourly prediction of all Tweet data using these features resulted in a coefficient of determination
equal to 0.74.
Problem 1.4: k-Fold Cross Validation on extracted best features

We performed cross validation for the models listed in Table 3 using the features listed in Table 4.
Table 3: Table of Regression Models we used in Cross validation
S. No. Regression Models
0 LinearRegression
1 LogisticRegression
2 MLPRegressor
3 Ridge
4 BayesianRidge
5 kNN Regressor
6 Support Vector Regression (SVR) using radial basis function kernel
Table 4: List of features with their F-value and p-value
Features F-Value p-value
longTweet 3400.16454287 0.00000000
userID 3304.16300387 0.00000000
statuses_count 2715.84548810 0.00000000
tweetCount 2634.26209141 0.00000000
lang 1868.91214511 0.00000000
retweetCount 1653.63704932 0.00000000
favoriteCount 1627.61611160 0.00000000
followerSum 280.85821316 0.00000000
rankingScore 91.37921546 0.00000000
impressionCount 1.66588400 0.19711578
Table 4 suggests that longTweet (tweet length > 100), userID, and statuses_count are the top three features, with the highest F-values and p-values less than 0.05, which makes them significant.
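A sketch of how such F-values and p-values can be obtained with scikit-learn's f_regression (the matrix below is a synthetic stand-in, not the project's feature matrix):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(1)
n = 300
# Hypothetical stand-ins for three candidate features
X = rng.normal(size=(n, 3))
y = 4.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)  # feature 2 is pure noise

F, p = f_regression(X, y)
ranked = np.argsort(F)[::-1]                  # highest F-value first
selected = [i for i in ranked if p[i] < 0.05] # keep significant features only
print(F, p, selected)
```

The strongest feature gets the largest F-value, and the p-value threshold removes features that carry no signal.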
The results of performing cross validation on each hashtag:

Table 5: MAE values after performing cross validation on each hashtag and on the combined hashtags over the whole data
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 35.22 91.93 142.67 1325.99 400.45 1231.60 2618.70
6 37.32 190.88 263.10 1420.26 482.28 1389.92 3258.73
4 37.62 216.00 136.53 1090.24 1186.73 1289.34 4631.31
3 36.04 216.93 140.29 1219.07 1369.20 1757.90 5012.99
0 36.03 216.93 140.29 1219.08 1369.22 1757.90 5013.00
1 429.70 2718.54 815.31 30680.40 3821.00 17370.96 27673.25
2 6685.31 22789.28 34424.94 620126.31 22778.62 386025.40 380186.21
Figure 6: MAE value plot for hashtags (Y-axis on log scale)
After performing cross validation with all features and the different regression models (Table 5), we found that the kNN (k-nearest neighbors) regressor performed best at predicting the tweet count for the next hour, both for individual hashtags and for the combined hashtags. Linear regression and Ridge regression perform essentially the same. The neural network regressor performs worst, since we use only one layer with 100 hidden units and the dataset is small for training a network. Neural network performance could be improved by training for more epochs and increasing the number of layers and units (given the limitation on the data); otherwise, increasing the training data would help improve the predictions.
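The model comparison can be sketched with scikit-learn's cross_val_score and MAE scoring. Note that the synthetic data here are linear, so the linear models win; on the project's real data, kNN performed best:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic hourly feature matrix standing in for one hashtag's data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "kNN": KNeighborsRegressor(n_neighbors=5),
}
mae = {}
for name, model in models.items():
    # neg_mean_absolute_error: higher is better, so negate to recover MAE
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
print(mae)
```

Swapping in MLPRegressor, BayesianRidge, LogisticRegression, and SVR into the same dictionary reproduces the full Table 5 comparison.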
Then we moved to prediction of tweet count in three time intervals:
The results of performing cross validation on each hashtag in Period 1 (Before Feb. 1, 8:00 a.m.):
Table 6: MAE value table after performing cross validation on each hashtag and combined hashtag for period 1
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 8.80 75.67 98.50 49.37 115.45 188.39 444.96
4 14.21 196.99 80.20 50.05 158.08 214.77 706.27
3 14.89 196.20 81.03 51.19 166.84 250.57 727.47
0 14.90 196.20 81.03 51.19 166.85 250.57 727.47
6 10.66 136.04 162.02 103.46 167.38 248.84 896.11
1 111.54 1252.23 297.00 968.63 275.63 1881.03 2844.44
2 2510.83 12162.52 22648.36 238459.08 28340.40 29686.48 593798.11
Cross validation with all features on different regression models yields that kNN is performing the best
for individual hashtags and combined hashtags.
Figure 7: MAE value plot for hashtags for period 1 (Y-axis on log scale)
The results of performing cross validation on each hashtag in Period 2 (Between Feb. 1, 8:00 a.m. and Feb. 1, 8:00 p.m.):

Table 7: MAE values after performing cross validation on each hashtag and on the combined hashtags for period 2
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
4 1169.80 3552.73 2222.86 36957.72 13996.46 51628.20 82227.57
1 1657.05 3897.15 3323.65 33685.95 15031.05 76986.80 86606.25
5 890.65 2625.21 2355.09 33025.91 14435.55 49612.86 90519.71
6 1460.45 3344.55 5220.65 31521.85 12097.55 120109.70 99498.55
3 5712.46 1800.86 3891.22 236101.97 45591.78 1151948.6 713835.23
0 5734.50 1802.68 3891.29 236363.71 45591.72 1151698.8 713848.41
2 477141.74 909171.2 1833006 36943729 7119938 18957788 76358613
Cross validation with all features on different regression models yields that Bayesian Ridge, kNN and
Linear regression are performing the best for individual hashtags and combined hashtags.
Figure 8: MAE value plot for hashtags for period 2 (Y-axis on log scale)
The results of performing cross validation on each hashtag in Period 3 (After Feb. 1, 8:00 p.m.):

Table 8: MAE values after performing cross validation on each hashtag and on the combined hashtags for period 3
Regressors gopatriots gohawks nfl sb49 patriots superbowl combined
5 4.63 29.55 168.78 154.95 106.61 288.34 557.66
4 2.25 94.69 125.63 128.60 80.51 199.20 563.95
3 2.72 1066.78 128.73 134.21 81.57 260.43 732.90
0 3.26 3689.54 128.74 134.21 81.58 260.43 732.91
6 4.82 34.77 294.73 344.69 142.34 585.42 1256.85
1 4.79 80.65 308.74 1238.26 381.84 1182.00 1549.60
2 1795.18 45454.38 516471.6 563140.37 103110.0 265952.45 2019312.4
Cross validation with all features on different regression models yields that Bayesian Ridge and kNN are
performing the best for individual hashtags and combined hashtags.
Figure 9: MAE value plot for hashtags for period 3 (Y-axis on log scale)
Problem 1.5: Testing of models
In this part we test the best features and model found from analyzing the training data in the previous sections. We found good features for each hashtag and for the various time periods around the Super Bowl. Next, we use the overall best model from 1.4 to test how well we can predict popularity for a tweet in any of the hashtags from any of the time periods.
Instead of using tweets from one hour to predict the next, we use tweets from up to a five-hour window to predict the next hour. This helps in two ways. First, we have a larger amount of time to pull features from, which gives the model a better picture of the statistical distribution it is predicting from. Second, we concatenate the features from each hour instead of just summing them. For example, with the feature Number of Retweets, instead of summing the number over all 5 hours, we compute the number for each of the 5 hours and concatenate them. This results in an N by (num_features * 5) dimensional feature array, where N is the number of windows, num_features is how many features we extract, and 5 is how many hours we extract over. This concatenation gives us better time precision for our features and can reveal their temporal behavior over a time window: if the features at the beginning of the window are high but those at the end are low, this may be a sign the popularity is dying rather than growing. This works in the opposite situation as well.
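A minimal sketch of this windowing scheme (the numbers are synthetic; the function name is illustrative, not the project's extraction code):

```python
import numpy as np

def window_features(hourly, window=5):
    """Concatenate per-hour feature rows over a sliding window.

    hourly : (T, num_features) array, one row per hour.
    Returns X of shape (T - window, num_features * window) plus, for
    each window, the index of the next hour it should predict.
    """
    T, f = hourly.shape
    X = np.stack([hourly[t:t + window].ravel() for t in range(T - window)])
    targets = np.arange(window, T)  # each window predicts the hour after it
    return X, targets

hourly = np.arange(24).reshape(12, 2)  # 12 hours, 2 features per hour
X, targets = window_features(hourly)
print(X.shape)  # (7, 10): N by (num_features * 5)
```

Each row keeps the five hourly values side by side instead of collapsing them into one sum, preserving the within-window trend.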
From part 1.4 we found that the longTweet, userID, statuses_count, tweetCount, language (English or not), favorite count, follower sum, and ranking score features all had p-values less than 0.05 and high F-values, making them suitable candidates for regression. We also found that k-nearest neighbors with k = 5 worked best overall on the aggregated hashtags and across the 3 time periods. In testing, we used these features with k-nearest neighbors on 5-hour windows to predict the last hour in each test file. We also used the first post date as the time reference.
Table 9: Error for each sample in the test file

Sample          1      2      3    4    5      6     7    8     9    10
Prediction      0   6523    311   26  202   2785    3   107  1794    2
Actual        178  82923    523  201  210  37293  120    11  2790   61
Abs. Error    730  76670    212  175    8  34508  117    96   996   59
From the results we can see that the model performs well in some cases and not in others. One reason is insufficient data: in samples 1, 7, 8, and 10 there were only one hundred to a few hundred tweets, which is not much data to pull features from to accurately predict popularity. Also, in some of the samples from period 2 there is an extreme jump in the number of tweets and a divergence from the normal tweeting pattern. This is due to the Super Bowl occurring in that time period, with everyone tweeting about the game in a way that produces outliers in the data. Lastly, some of the samples were predicted well with low error; these were the samples with at least a few thousand tweets to predict from that were not part of the Super Bowl outliers. For these samples, the regressor trained on the aggregated data performed well.
Problem 2: Fan Base Prediction

In this section, the textual content of the tweets is used to predict each user's location. The problem only considers tweets that include the hashtag #superbowl and that were tweeted from either the state of Washington or Massachusetts.
Preprocessing the tweets

Since the user's location is a free-form field entered by the user, a preprocessing step is needed before applying the algorithms to the data. This is due to the following reasons:
• The number of places people tweet from is enormous, and a simple filter such as matching the string ', MA' is not enough, since it also matches locations like 'Kuala Lumpur, MALAYSIA'.
• Some places contain the name of a state of interest but are not related to it at all; that is the case of 'Washington DC', 'Washingtonville', or 'Washington Heights, NY', for example.
• Some places would match a simple filter like the one already mentioned but are not proper locations, for example: 'Where I come from, MARS'.
For the previous reasons, the preprocessing selected was a set of simple filters on the strings of interest ('Washington', 'WA', 'Massachusetts', 'MA'), followed by a frequency analysis of the locations to determine the most frequent ones and discard those we do not want, for instance:
'BOSTON, MA HOME OF TRUE PREP'
'Seattle, WA aka The TwoOhSix'
After this filtering, considering the 100 most common locations from Massachusetts, 11388 tweets are
considered for this state. On the other hand, using the 180 most common locations from Washington,
11162 tweets are considered for this state.
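A hedged sketch of such a location filter; the function name, blacklist contents, and regular expressions are illustrative assumptions, not the report's exact rules:

```python
import re

def looks_like_state(location, state_abbr, state_name, blacklist):
    """Crude location filter: accept ', WA'-style tokens or the full
    state name, after rejecting known false positives by substring."""
    loc = location.strip()
    if any(bad.lower() in loc.lower() for bad in blacklist):
        return False
    # ', MA' must end at a word boundary, so 'Kuala Lumpur, MALAYSIA' fails
    if re.search(r',\s*' + state_abbr + r'\b', loc):
        return True
    return bool(re.search(r'\b' + state_name + r'\b', loc, re.IGNORECASE))

wa_blacklist = ['Washington DC', 'Washington, D.C.', 'Washingtonville',
                'Washington Heights']
print(looks_like_state('Seattle, WA aka The TwoOhSix', 'WA', 'Washington', wa_blacklist))  # True
print(looks_like_state('Washington Heights, NY', 'WA', 'Washington', wa_blacklist))        # False
```

The frequency analysis described above would then be applied on top of a filter like this to catch anything the regex and blacklist miss.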
Binary Classifiers
After the preprocessing mentioned in the section above, the data is transformed into a TF-IDF feature matrix to be processed by the classifiers. The best parameters for each algorithm were selected based on 5-fold cross-validation recall.
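The TF-IDF plus 5-fold recall pipeline can be sketched as follows. The toy corpus and parameter values are illustrative; the report's "latent factor" is taken here to correspond to a TruncatedSVD component count, which is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus; the project used #superbowl tweets labeled by state
tweets = ["go hawks seattle proud", "hawks win seattle", "seattle loves the hawks",
          "patriots boston strong", "go pats boston", "boston cheers the patriots"] * 10
labels = [0, 0, 0, 1, 1, 1] * 10  # 0 = Washington, 1 = Massachusetts

X = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(tweets)
Z = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)  # "latent factors"
clf = RandomForestClassifier(n_estimators=50, random_state=0)
recall = cross_val_score(clf, Z, labels, cv=5, scoring="recall")
print(recall.mean())
```

Sweeping n_components, max_features, and n_estimators in a grid over this pipeline reproduces the structure of Tables 10–12.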
Random Forest classifier
A random forest model is trained for different configurations of parameters. The results are shown in
the table below.
For each number of latent factors, the parameter configuration that performed best is shown in orange. Finally, the parameters selected are shown in blue. In this case the selection was max features 2, n estimators 150, and latent factor 100 because, even though it was not the setting with the best performance, it performed well with the lowest variance across the 5-fold validation.
Table 10: Parameter tuning random forest
Latent factor 3 4 5 10 20 50 100 200
max features
n estimators
recall
2 20 0.764 0.772 0.775 0.776 0.777 0.771 0.758 0.731
2 40 0.768 0.775 0.780 0.782 0.784 0.784 0.778 0.755
2 80 0.770 0.778 0.782 0.783 0.788 0.787 0.783 0.771
2 150 0.771 0.779 0.785 0.785 0.786 0.789 0.790 0.783
3 20 0.761 0.770 0.774 0.775 0.776 0.776 0.764 0.745
3 40 0.768 0.773 0.779 0.782 0.782 0.780 0.780 0.766
3 80 0.769 0.777 0.781 0.783 0.786 0.790 0.789 0.782
3 150 0.771 0.778 0.781 0.784 0.788 0.791 0.789 0.791
SVM classifier
An SVM model is trained for different configurations of parameters. The results are shown in the table below.
Table 11: Parameter tuning SVM
Latent factor 3 4 5 10 20 50 100 200
C 𝜸 recall
1 0.001 0.516 0.516 0.516 0.515 0.516 0.516 0.515 0.515
1 0.0001 0.500 0.500 0.500 0.500 0.500 0.500 0.500 0.500
10 0.001 0.758 0.757 0.757 0.751 0.756 0.758 0.759 0.759
10 0.0001 0.516 0.516 0.516 0.515 0.516 0.516 0.515 0.515
100 0.001 0.770 0.770 0.771 0.768 0.771 0.774 0.778 0.783
100 0.0001 0.758 0.757 0.757 0.751 0.756 0.758 0.759 0.759
1000 0.001 0.773 0.773 0.757 0.774 0.778 0.780 0.786 0.791
1000 0.0001 0.770 0.770 0.771 0.768 0.771 0.774 0.778 0.783
Neural network classifier
A neural network model is trained for different configurations of parameters. The results are shown in the table below.
Table 12: Parameter tuning NN
Latent factor 3 4 5 10 20 50 100 200
𝜶 𝒍𝒂𝒚𝒆𝒓𝒔 recall
0.0001 (100,) 0.778 0.779 0.783 0.784 0.788 0.792 0.784 0.783
0.0001 (50, 50) 0.778 0.778 0.783 0.784 0.788 0.778 0.777 0.769
0.0001 (50, 50, 50) 0.779 0.780 0.784 0.785 0.785 0.781 0.779 0.769
0.001 (100,) 0.778 0.779 0.783 0.785 0.789 0.792 0.790 0.786
0.001 (50, 50) 0.779 0.778 0.783 0.787 0.787 0.787 0.783 0.768
0.001 (50, 50, 50) 0.777 0.779 0.783 0.785 0.788 0.783 0.769 0.770
0.01 (100,) 0.776 0.777 0.783 0.784 0.790 0.791 0.791 0.788
0.01 (50, 50) 0.779 0.779 0.783 0.787 0.786 0.792 0.776 0.763
0.01 (50, 50, 50) 0.781 0.778 0.783 0.785 0.786 0.790 0.776 0.768
0.1 (100,) 0.778 0.778 0.784 0.784 0.788 0.788 0.791 0.794
0.1 (50, 50) 0.778 0.779 0.784 0.783 0.787 0.788 0.793 0.788
0.1 (50, 50, 50) 0.776 0.778 0.781 0.781 0.788 0.781 0.789 0.772
To visualize the performance of the different algorithms together, a ROC curve is plotted for the test
data.
Figure 10: ROC 3 algorithms
Since the performance is very similar between the algorithms, the figure below zooms in on a region of the previous one where the 3 algorithms show some differences.
Figure 11: Zoom ROC 3 algorithms
The figure above shows how the neural network performs better than the other 2 algorithms for any
chosen threshold.
The following figures show the confusion matrices for the three different algorithms.
Figure 12: Confusion matrix random forest
Figure 13: Confusion matrix SVM
Figure 14: Confusion matrix NN
As Figure 11 and Figure 14 show, the neural network classifier gives the highest percentage of correct classifications. These results are even clearer in Table 13.
Table 13: Performance metrics
Algorithm Accuracy Precision Recall
Random Forest 0.8006 0.7564 0.8929
SVM 0.8020 0.9188 0.9188
Neural Network 0.8037 0.7478 0.9227
Problem 3: Define Your Own Project

Problem Definition: Suppose you are traveling abroad during the Super Bowl in a remote area with no television. Although your phone has weak reception, it is sufficient for low data-rate communication. You'd like a way to follow the game score using your cell phone. It turns out you have the ability to stream Twitter metadata for #superbowl, #sb49, and #nfl. However, you must choose the data wisely due to limited bandwidth, and you must avoid text analysis/filtering to save the battery. The problem is stated as follows: given a stream of tweet metadata absent of partisan hashtags, location data, and Tweet text content, can quasi-real-time estimation of game scores and final scores be completed confidently?

Implementation: The game is assessed in time windows several minutes in length. Each window is considered a single sample and is represented by one row in the feature matrix. Given several inputs for the current window, the model attempts to determine the change in score for the two teams during that window3. Two separate models are trained, one focused on the Patriots and the other on the Seahawks. The models can distinguish field goals from touchdowns; however, they are not trained to recognize safeties or two-point conversions, since those did not occur during the Super Bowl. All time windows have start times during the game, with end times that may occur during the halftime show or after the end of the game. For windows with valid start times but invalid end times, all invalid tweets are filtered out prior to training. #nfl, #sb49, and #superbowl provide roughly 1.5 million tweets, which are reduced to 50,000 once halftime-show, pre-, and post-Super Bowl tweets are removed. The overall estimated game scoring history is produced by plotting the cumulative sum of the score changes predicted for each time window.
For the purposes of this project, it is assumed that all scores are independent of each other. In reality this is not the case: scores are affected by momentum, strategy, and the crowd.

Features: Four features were selected for each hashtag. Each feature is listed and detailed below:

1. Number of tweets during the time window
2. Mean of user IDs during the time window
3. User ID variance during the time window
4. Zipf exponent for the user ID popularity distribution during the time window

Number of tweets during window
The number of tweets during each time window is an indicator of game activity and reflects major occurrences. Figure 15 below shows the number of tweets (unfiltered) for 15-minute windows over the duration of the game. There is a large up-tick for the halftime show, as well as for the last-minute, game-ending interception by the Patriots. For each team's scores there is a delayed increase in the number of tweets, other than for the two scores immediately after the halftime show; this is likely affected by Tweet activity related to the halftime show.
3 Each estimate is rounded to arrive at an integer value for the change in score for that time window.
Figure 15
Mean of user IDs & User ID variance during time window
User ID is an indirect representation of the age of each user's Twitter account. By taking the average, we attempt to infer a group of users' favorite team based on which users are active after a certain team's score. The variance of user IDs active around the time of a team's score represents the range of users' account ages (perhaps one team's fans span a larger time frame of Twitter's existence). Justifying the selection of user ID mean and variance was not straightforward, as there are no truth data for the chosen hashtags. However, insight into the user ID distribution can be gained via the user IDs associated with #gohawks, #gopatriots, and #patriots, which are plotted below in Figure 16. There is a clear distinction between the users who use the Patriots hashtags and those who use the Seahawks hashtag. The assumption is that this holds true for the other hashtags.
Figure 16
Zipf exponent for user ID popularity distribution during time window
The Zipf distribution exponent captures the popularity distribution of users active during a time window. Perhaps one team's Twitter users are more influential or visible on Twitter, affecting the social-media equality distribution. Users ranked by influence are plotted below in Figure 17. This metric is expected to play a larger role for long-duration windows, since the number of active users will increase and the distribution will approach Zipf's law.
Figure 17
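One plausible way to estimate such a Zipf exponent is a log-log fit to the rank-frequency curve; this fitting method is an assumption, as the report does not state how its exponent was computed:

```python
import numpy as np

def zipf_exponent(tweet_counts_per_user):
    """Estimate the Zipf exponent s from a rank-frequency relation:
    freq(rank) ~ rank^(-s), so a log-log linear fit has slope -s."""
    freqs = np.sort(np.asarray(tweet_counts_per_user, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Synthetic counts constructed to follow rank^-1 exactly
counts = 1000.0 / np.arange(1, 101)
print(zipf_exponent(counts))  # 1.0
```

On real per-window user activity, the fit is only approximate, which is why the report expects the metric to behave better for longer windows.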
With four features selected for each of the three hashtags, there were 12 features in total. Ideally, these features are not highly intercorrelated, as that helps avoid linearly dependent features and ill-conditioning. The correlation between the features was assessed and plotted as a correlation matrix in Figure 18 below:
Figure 18
There was a high level of correlation between the #sb49 Zipf exponent and the Zipf exponents for
#superbowl and #nfl. The #nfl user ID variance was also highly correlated (0.9) with #nfl mean user ID,
similar to #sb49 user ID variance and #sb49 tweet count. Finally, the correlation of #sb49 user ID
variance and #sb49 mean user ID was 1.0. Although these features are highly correlated, a majority of
the features have moderate to low correlation. Once the 12 features are calculated for each window,
they’re combined to create the feature matrix β in Equation 1 below:
Equation 1
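The intercorrelation check behind Figure 18 reduces, in code, to a correlation matrix over the window-by-feature array; the 12-column matrix below is random stand-in data with one deliberately near-duplicate pair, not the project's features:

```python
import numpy as np

rng = np.random.default_rng(2)
n_windows = 40
# Hypothetical 12-column feature matrix (4 features x 3 hashtags)
beta = rng.normal(size=(n_windows, 12))
# Make two columns nearly identical, mimicking the report's 1.0 correlation
beta[:, 6] = beta[:, 2] + 0.01 * rng.normal(size=n_windows)

corr = np.corrcoef(beta, rowvar=False)  # 12 x 12 correlation matrix
print(corr.shape, corr[2, 6])
```

Off-diagonal entries near ±1, like the pair engineered here, flag the linearly dependent columns that cause the ill-conditioning discussed below.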
Game score truth data: A challenge was finding a recording of the game that includes all commercials, so that game-clock events could be mapped to local time; none were found. Luckily, Twitter posted an interactive map of geographical tweet activity as the game transpired4, which enabled truth-data generation. Score summary:

Quarter   Game Clock  True Clock (EST)           Time Diff.  Patriots  Seahawks
1         12:00       Sun Feb 01 18:30:00 2015   0:00        0         0
2         9:47        Sun Feb 01 19:13:00 2015   0:43        7         0
2         2:16        Sun Feb 01 19:36:00 2015   0:23        7         7
2         0:31        Sun Feb 01 19:49:00 2015   0:13        14        7
2         0:02        Sun Feb 01 19:59:00 2015   0:10        14        14
Halftime  N/A         Sun Feb 01 20:11:00 2015   0:12        14        14
Halftime  N/A         Sun Feb 01 20:32:00 2015   0:21        14        14
3         11:09       Sun Feb 01 20:38:00 2015   0:39        14        17
3         4:54        Sun Feb 01 20:54:00 2015   0:16        14        24
4         7:55        Sun Feb 01 21:28:00 2015   0:34        21        24
4         2:02        Sun Feb 01 21:48:00 2015   0:20        28        24
4         0:00        Sun Feb 01 22:05:00 2015   0:17        28        24
4 http://com.cartodb.visualizations.s3.amazonaws.com/v/superbowl/index.html?vis=c5539e5c-aa92-11e4-be3b-0e853d047bba&utn=srogers,superbowl15_events&t=@Patriots,B40903%7C@Seahawks,229A00
Time window size implies a sampling frequency, which involves a tradeoff between estimation triviality and the information available for quasi-real-time estimation. If the time window is the length of the game, then the likelihood of all scores occurring within that window is 1; if the window is too small, very few tweets are available for estimation, and the mean and variance metrics lose their ability to generalize over the randomness. Therefore, the effect of window duration on training-set predictions was explored prior to final assessment. Both the RMSE and the coefficient of determination increase with window size, meaning there is an ideal window duration that balances the error and the explained variance. Due to limited time, two window durations were chosen to understand their effects on performance: 5 minutes and 15 minutes, indicated by star markers in Figure 19 below.
Figure 19
Model Selection: Prior to model selection, the feature matrix condition number was calculated to determine whether it was well- or ill-conditioned. Singular value decomposition was used to obtain the feature matrix's singular values:

κ(β) = σmax(β) / σmin(β) = (5×10^18) / (1×10^-2) ≈ 10^20 ≫ 1 ⇒ ill-conditioned

Equation 2

Due to the large magnitude of the condition number, it was determined that ridge regression would be necessary; therefore, both linear and nonlinear models were considered: ordinary least squares regression, ridge regression, and k-neighbors regression were chosen. Additionally, due to the large difference in magnitude of the tweet counts (~1E+03), user IDs (~1E+08), and the Zipf parameter (~1E-01), regression was run on both non-normalized and normalized data.

Performance Assessment:
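The conditioning check described above, and the effect of the normalization also used in the assessment, can be sketched with hypothetical feature magnitudes mirroring the report's (~1E+03 tweet counts, ~1E+08 user IDs, ~1E-01 Zipf exponents); this is not the project's actual matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# Unscaled columns spanning many orders of magnitude
beta = np.column_stack([
    1e3 * rng.random(50),   # tweet counts
    1e8 * rng.random(50),   # user IDs
    1e-1 * rng.random(50),  # Zipf exponents
])
s = np.linalg.svd(beta, compute_uv=False)
print(s[0] / s[-1])          # condition number = sigma_max / sigma_min
print(np.linalg.cond(beta))  # same quantity via the library helper

# Standardizing each column collapses the condition number to near 1
beta_norm = (beta - beta.mean(axis=0)) / beta.std(axis=0)
print(np.linalg.cond(beta_norm))
```

The huge raw condition number and the small normalized one illustrate why regression was run on both non-normalized and normalized data.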
An alternative method of data visualization is utilized for this problem in place of a truth-prediction scatter plot. Since the goal is to see how the scores change over the duration of the game, the cumulative score over time is used; it conveys the same information as the truth-prediction scatter plot while also showing how the scores change from beginning to end.
Figure 20
Prior to cross validation, the training set was used for both training and testing of the ordinary regression model to gauge the feasibility of the problem statement. The regressor's ability to estimate the score in quasi-real time seems promising, with Figure 21 showing little to no final-score error (a 1-point overestimate for the Seahawks) and close tracking throughout the game. Additionally, the Fréchet distance was used as a metric for the relative "closeness" of the tracking curves throughout the game. The Fréchet distance between two curves A and B (truth and prediction) is defined as follows:
Equation 3
F(A, B) = inf_{α,β} max_{t ∈ [0,1]} d(A(α(t)), B(β(t)))

where d is the distance function of the vector space (the Frobenius norm) and α, β range over continuous, non-decreasing reparameterizations of [0, 1]. Overall, the Fréchet distances for the Patriots and Seahawks were similar, with the Seahawks having a slightly larger distance because the prediction exceeded the truth scores at all points of the game. This was not the case for the Patriots, whose score-estimation curves regularly intersect the truth curve. Although the Fréchet distance is useful here, it was dropped as a final metric because the RMSE is already a suitable metric for understanding how well the scores match, and because the Fréchet distance varied very little between test runs that did not have drastically different plots.
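For sampled curves, Equation 3 can be approximated by the discrete Fréchet distance via the standard Eiter–Mannila dynamic program. The sketch below is a minimal illustration, not the report's original code; the Euclidean norm plays the role of d.

```python
import numpy as np

def discrete_frechet(A, B):
    # Discrete Fréchet distance between polygonal curves A and B
    # (arrays of points), via the Eiter-Mannila dynamic program.
    n, m = len(A), len(B)
    d = lambda i, j: np.linalg.norm(A[i] - B[j])  # pointwise distance
    ca = np.empty((n, m))
    ca[0, 0] = d(0, 0)
    for i in range(1, n):                 # first column
        ca[i, 0] = max(ca[i - 1, 0], d(i, 0))
    for j in range(1, m):                 # first row
        ca[0, j] = max(ca[0, j - 1], d(0, j))
    for i in range(1, n):                 # interior cells
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1],
                               ca[i, j - 1]), d(i, j))
    return ca[n - 1, m - 1]

# Identical curves give 0; a constant vertical offset of 1 gives 1.
t = np.linspace(0, 1, 20)
A = np.column_stack([t, np.sin(t)])
print(discrete_frechet(A, A))                        # → 0.0
print(discrete_frechet(A, A + np.array([0.0, 1.0]))) # → 1.0
```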
According to this plot, the problem appears tractable given the four features for each dataset and a window duration of 15 minutes. Actual performance will likely depend on over-fitting and on whether the test sets resemble the information contained in the training sets. Cross validation: Five-fold cross validation was run on the training dataset with balanced time-window sample populations. For the ridge and k-neighbors folds, performance was assessed over a range of regularization parameters and neighbor counts. RMSE was used as the main metric for relative model performance, rather than the coefficient of determination, because the interest is in the actual scores and not general data trends. Regression performance for all cases is summarized below in Table 14. Table 14: Regression performance summary
Regression model | Window (min) | Team | Pre-processing | Optimal alpha | Optimal neighbor count | Mean RMSE (points)
Ordinary Least Squares | 5 | Patriots | None | N/A | N/A | 11.83
Ordinary Least Squares | 5 | Patriots | Normalization | N/A | N/A | 34.38
Ordinary Least Squares | 5 | Seahawks | None | N/A | N/A | 21.91
Ordinary Least Squares | 5 | Seahawks | Normalization | N/A | N/A | 19.86
Ordinary Least Squares | 15 | Patriots | None | N/A | N/A | 48.52
Ordinary Least Squares | 15 | Patriots | Normalization | N/A | N/A | 202.15
Ordinary Least Squares | 15 | Seahawks | None | N/A | N/A | 32.52
Ordinary Least Squares | 15 | Seahawks | Normalization | N/A | N/A | 14.39
Ridge | 5 | Patriots | None | 10 | N/A | 11.84
Ridge | 5 | Patriots | Normalization | 0.001 | N/A | 20.13
Ridge | 5 | Seahawks | None | 10 | N/A | 2.03
Ridge | 5 | Seahawks | Normalization | 1 | N/A | 1.82
Ridge | 15 | Patriots | None | 10 | N/A | 102.90
Ridge | 15 | Patriots | Normalization | 0.01 | N/A | 3.09
Ridge | 15 | Seahawks | None | 10 | N/A | 78.34
Ridge | 15 | Seahawks | Normalization | 1 | N/A | 2.80
K Neighbors | 5 | Patriots | None | N/A | 42 | 2.01
K Neighbors | 5 | Patriots | Normalization | N/A | 13 | 2.03
K Neighbors | 5 | Seahawks | None | N/A | 42 | 1.81
K Neighbors | 5 | Seahawks | Normalization | N/A | 42 | 1.81
K Neighbors | 15 | Patriots | None | N/A | 13 | 3.05
K Neighbors | 15 | Patriots | Normalization | N/A | 3 | 3.06
K Neighbors | 15 | Seahawks | None | N/A | 14 | 2.76
K Neighbors | 15 | Seahawks | Normalization | N/A | 13 | 2.81

Figure 21
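A minimal sketch of the cross-validation comparison with scikit-learn, using synthetic stand-in data (feature scales mimic the tweet-count / user-ID / Zipf disparity): the models and the RMSE scoring mirror Table 14, but the data and numbers here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the per-window features and cumulative scores.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(1e2, 5e3, 120),   # tweet counts
                     rng.uniform(1e7, 1e9, 120),   # user IDs
                     rng.uniform(0.05, 0.5, 120)]) # Zipf parameter
y = 0.004 * X[:, 0] + 30.0 * X[:, 2] + rng.normal(0, 1.0, 120)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "OLS": LinearRegression(),
    "Ridge (normalized)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "k-NN (normalized)": make_pipeline(StandardScaler(),
                                       KNeighborsRegressor(n_neighbors=13)),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()  # mean RMSE over the five folds
    print(f"{name}: mean RMSE = {results[name]:.2f}")
```

Wrapping the scale-sensitive models in a `StandardScaler` pipeline keeps normalization inside each fold, avoiding leakage from the held-out windows.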
Ordinary Least Squares Regression: For the ordinary least squares regression method, overall performance was dismal. Score estimation for both the Patriots and Seahawks was highly inaccurate, with large RMSE values.
Figure 22
One trend noticed in the estimation plots in Figure 22 and Figure 23 is that normalization tends to produce monotonically increasing changes in score for every time window. Conversely, for the non-normalized feature matrix cases, the steps are not monotonic, showing both increases and decreases of varying magnitude.
Figure 23
Ridge Regression: In general, score-tracking performance was unsatisfactory using ridge regression. Root mean squared error generally increased with decreasing regularization parameter, with a few exceptions. RMSE values were much higher for the data without normalization pre-processing (shown in Figure 24 and Figure 25 below), and were extremely high for the non-normalized 15-minute-window cross-validation runs. The magnitude of the errors for the Patriots is apparent in the score-estimation plot above in Figure 23 for 15-minute windows, with the accumulated error producing a final score estimate of roughly 3,000 points. An interesting case is the Patriots' regression RMSE for 15-minute windows: over the range of regularization parameters considered, its RMSE is unaffected.
Figure 24 (left: 5-minute window; right: 15-minute window)
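The regularization-parameter sweep can be sketched with scikit-learn's `RidgeCV` over the alpha range appearing in Table 14 (0.001 to 10), assuming a per-window feature matrix `X` and score vector `y`; both are synthetic here and illustrative only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic per-window features X and scores y (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, (100, 4)) * np.array([1e3, 1e8, 1e-1, 1e2])
y = X @ np.array([0.01, 1e-8, 20.0, 0.05]) + rng.normal(0, 2.0, 100)

# Sweep alpha over 0.001 ... 10 and keep the value with the best
# held-out RMSE; normalization happens inside the pipeline.
alphas = np.logspace(-3, 1, 5)
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_root_mean_squared_error"),
)
model.fit(X, y)
print("selected alpha:", model[-1].alpha_)
```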
Quasi-real-time estimates via ridge regression without normalization were very similar to those of the ordinary regression model. While the ridge regression 15-minute-window estimates were identical to the ordinary regression estimates, the magnitude of deviation for the 5-minute window was reduced by a factor of two, accompanied by an RMSE reduction by a factor of 10 (from 21.91 to 2.03 points; Figure 26). Additionally, the overall trend for the Seahawks remained the same: a decrease from the beginning of the game into negative points, an increase around 125 minutes into the game, and then a steep descent until the end of the game.
Figure 26
Similar to the normalized features for ordinary regression, normalizing the features for ridge regression produced mostly monotonic steps in the score estimates from window to window (Figure 27). A particularly interesting case was the 5-minute windows, in which the Patriots' and Seahawks' score estimates over time were identical. While those estimates were always above the actual game scores, this was not the case for 15-minute windows, where estimates were consistently within the range of the truth scores. Additionally, the final score for the Seahawks was correct, while the Patriots' estimate was off by a small error of 3 points, equivalent to a missed field goal.
Figure 25 (left: 5-minute window; right: 15-minute window)
Figure 27
In general, score-tracking performance was unsatisfactory using ridge regression. However, for normalized features over 15-minute time windows, the final score predictions are very accurate. Therefore, if the goal is only to know the final game score, a simple ridge regression model is a candidate, pending test performance on previous Super Bowls. K-Neighbors Regression: The nonlinear model, k-neighbors regression, had the best overall performance, with a mean RMSE of 2.4 points over all eight cases compared to 48 points for ordinary least squares and 27.9 points for ridge regression. K-neighbors regression was run for neighbor counts ranging from one to the number of samples minus one. The optimal number of neighbors ranged from 3 to 42 depending on the case. Figure 28 and Figure 29 show that for 5-minute windows, the Patriots' and Seahawks' RMSE follow similar overall trends, with slight deviations for lower neighbor counts.
Figure 28 (left: 5-minute window; right: 15-minute window)
Figure 29 (left: 5-minute window; right: 15-minute window)
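The neighbor-count sweep can be sketched as a grid search over `n_neighbors`, keeping the count with the lowest cross-validated RMSE. The data is synthetic, and the cap of 39 neighbors is for speed only; the report's sweep ran from one to the number of samples minus one.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, (80, 4))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 80)

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 40))},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"],
      f"(CV RMSE {-grid.best_score_:.3f})")
```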
For ridge regression, monotonically increasing steps in the score were seen only for normalized data; here, however, the trend is present in all test runs, shown in Figure 30 and Figure 31. Five-minute time windows consistently underperformed relative to the 15-minute-window cases, with the estimated score always greater than the actual score and with large errors in the final-game-score estimate, almost twice the actual game scores.
Figure 30
Normalization did not have an impact on score estimation for the Seahawks relative to ridge regression: the RMSE and score predictions over time were almost identical in both cases. Results for the normalized k-neighbors feature set were similar to those of the non-normalized set. Therefore, for k-neighbors regression, normalization does not play an important role in performance, and 15-minute windows perform significantly better than 5-minute windows.
Figure 31
Summary & Improvements: Recall the posed problem: given a stream of tweet metadata absent of partisan hashtags, location data, and tweet text content, can quasi-real-time estimation of game scores and final scores be completed confidently?

For final scores, the answer is a strong yes. The k-neighbors regression model had very low RMSE and consistently made accurate end-of-game score estimates, which is quite impressive given the low model complexity. For quasi-real-time tracking throughout the game, the answer is not a confident "yes": no model is a strong winner for predicting the Patriots' versus the Seahawks' scores as the game progresses, although there is potential for improvement (discussed below). Even so, the attempt is considered a success, since the focus of the posed problem was to see how well scores could be estimated with minimal processing and model complexity. It was realized early on that improved performance would be attainable by using team-specific hashtag features or content analysis to find words such as "field goal," "touchdown," "score," etc.

In summary, ordinary least squares performance is poor for both the Patriots and the Seahawks regardless of time-window duration. Ridge regression should only be used with normalization, in which case the Patriots' and Seahawks' estimates were similar. K-neighbors has the lowest RMSE for the Seahawks' scores.

Areas of consideration for improving the estimates include additional nonlinear models. A neural network could potentially perform well on this task, as tweet volume is directly affected by events such as kickoff (when viewership is high) and the halftime show. Halftime-show tweets could be filtered via text analysis by removing tweets containing "Katy," "Perry," and "Show." Additionally, sentiment could provide more context for scores. However, both sentiment analysis and filtering require text processing, which adds significant resource utilization.
Additionally, if the number of tweets in each time window is unbalanced, that could bias future performance.