A Two Step Ranking Solution for Twitter User Engagement
Behnoush Abdollahi, Mahsa Badami, Gopi Chand Nutakki, Wenlong Sun, Olfa Nasraoui
Knowledge Discovery and Web Mining Lab
University of Louisville
http://webmining.spd.louisville.edu
Challenge@Recsys 2014 1
Outline
• Introduction
• Challenges
• Summary of our approach
• Preprocessing
• Two step ranking model
• Neighborhood-based Repairing for Tweets with
Predicted Zero Engagement
• Results
• Lessons learned
• Conclusion
Introduction
• Data: extended version of the MovieTweetings dataset
o collected from users of the IMDb iOS app who rate movies and share their ratings on Twitter
• Predicting user engagement for:
o #favorites
o #retweets
• Evaluation: nDCG@10 metric
• Approach: learning to rank
• Data statistics: [table not shown in transcript]
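Since nDCG@10 drives the whole design, a minimal sketch of the metric may help. This uses linear gains and a log2 discount; the challenge's exact gain and discount conventions are an assumption here:

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain of the top-k items, in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k=10):
    """nDCG@k: DCG of the given ranking divided by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# engagement values of a user's tweets, in predicted rank order (toy data)
perfect = ndcg_at_k([5, 3, 1, 0])   # already ideally sorted -> 1.0
swapped = ndcg_at_k([3, 5, 1, 0])   # one swap -> strictly below 1.0
```

Rankings are scored per user and averaged; only the ordering of the top 10 tweets matters.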
Challenges
• High dimensionality
• Power law distribution of the user engagement
• Missing values
• Outliers
• Imbalanced engagement distribution
o more than 95% of engagement = zero
• Different from many standard prediction and recommendation problems:
o user engagement may be affected by
• implicit profile information of the user on Twitter
• movie preference data from other users
• movie content data
Summary of our approach
Preprocessing
• Data cleaning and completion
o Removing:
• special characters from text-based features
• empty values
• redundant values
• invalid values, such as movie ratings exceeding 10
o Filling in missing data (such as user id) by looking at the nearest tweets
o Using country code as a location feature
• if missing, other similar geographical features were converted to the country code
Preprocessing
• Feature extraction + engineering from Twitter + IMDb
Features:
• Context-based
• Text-based
• Tweet-Movie similarity
Context-based features
o User Profile (Twitter)
• # of the user's followers/friends
• # of users mentioned/replied to by the current user
• the same features for each retweeted tweet
o Movie Profile (IMDb)
• IMDb movie plot, director, actors, genre, languages, countries
• # of times a movie has been tagged in a tweet
• average rating for a movie
• movie total retweet/favorite count
• # of users who have rated a particular movie
o Twitter Profile
• tweet/retweet flag
• time delay between the tweet and the movie release date
• seasonality (Christmas, Halloween time, ...)
• time features: certain times of the day (day or night), certain days of the week (weekdays or weekends)
Text-based features
• Extracted bag-of-words features, then selected the most relevant based on Mutual Information Gain Ratio with the target, from:
o movie plot
o movie genre
o user description
o hashtags
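The selection step can be sketched with plain mutual information between each bag-of-words column and the engagement target. This is a simplified stand-in for the Mutual Information Gain Ratio named above, on toy binary data:

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two binary arrays."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x = np.mean(x == xv)
            p_y = np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# toy bag-of-words matrix: rows = tweets, columns = word presence flags
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])   # 1 = non-zero engagement, 0 = zero engagement

scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
top = np.argsort(scores)[::-1][:2]   # keep the 2 most informative words
```

The third word occurs independently of engagement, so its score is zero and it is dropped.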
Tweet-Movie Similarity
• Using a common lower-dimensional latent space
• A joint latent space is learned:
o using NMF
o to capture the similarity between a tweet and the movie that it mentions
o to handle the problems of:
• sparsity
• high dimensionality of bag-of-words features
• poor semantics
o based on tweet & movie features, such as:
• hashtags
• user description
• movie genres
• movie plot, ...
Tweet-Movie Similarity
1. Building a semantic tweet representation by factoring the tweet matrix X1,
where n1 = #tweets, m1 = #features, and f1 = #factors
2. Mapping the movie data X2 into the latent space defined by B1 to compute the movie coefficient matrix A2,
where n2 = #movies
3. Computing the similarity using the dot product of the corresponding rows of A1 and A2
X1 (n1 × m1) = A1 (n1 × f1) × B1 (f1 × m1)
X2 (n2 × m1) = A2 (n2 × f1) × B1 (f1 × m1)
sim(X1, X2) = A1 × A2^T
[Figure: the Tweets × Words matrix factored into a Tweets × Latent Factors matrix times a Latent Factors × Words matrix]
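The three steps above can be sketched with a bare-bones multiplicative-update NMF in numpy. The data, factor count, and iteration budget are toy assumptions, not the settings used in the challenge:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(X, f, iters=300):
    """Multiplicative-update NMF: X ~ A @ B with non-negative factors."""
    n, m = X.shape
    A = rng.random((n, f)) + 1e-3
    B = rng.random((f, m)) + 1e-3
    for _ in range(iters):
        A *= (X @ B.T) / (A @ B @ B.T + 1e-9)
        B *= (A.T @ X) / (A.T @ A @ B + 1e-9)
    return A, B

# toy tweet and movie bag-of-words matrices over a shared vocabulary
X1 = rng.random((6, 5))          # n1 = 6 tweets x m1 = 5 words
X2 = rng.random((4, 5))          # n2 = 4 movies x the same 5 words

# Step 1: factor the tweet matrix into A1 (tweets x factors) and B1
A1, B1 = nmf(X1, f=2)

# Step 2: map movies into the latent space defined by the fixed B1
A2 = rng.random((4, 2)) + 1e-3
for _ in range(300):
    A2 *= (X2 @ B1.T) / (A2 @ B1 @ B1.T + 1e-9)

# Step 3: tweet-movie similarities as dot products of latent rows
sim = A1 @ A2.T                  # shape (n1, n2) = (6, 4)
```

Because all factors are non-negative, every tweet-movie similarity score is non-negative as well.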
Two step ranking model
1. Cost-sensitive classifier
o Classify tweets into a zero-engagement class and multiple non-zero classes
o Imbalanced data → used cost-sensitive classification
2. Ranker
o List-wise, point-wise, and pair-wise approaches
o Predicts relevance values, as in Information Retrieval tasks
o Engagement values considered as grades/labels
• y = 1, 2, ..., l
Cost sensitive classifier
• More than 95% of the data have a zero engagement value
o classical classification methods tend to misclassify the minority class (non-zero engagements) as the majority class (zero engagements)
• Cost-sensitive framework
o assign different weights to type I and type II errors
o assign higher weights to errors on tweets classified into the zero-engagement class
o using a cost matrix C, where C(i, j) = the cost of predicting class i when the true class is j, the following loss function is minimized over classes i for a sample x
• Tweets classified as non-zero engagement are then passed to the ranker
                     actual negative | actual positive
predicted negative        C(0,0)     |     C(0,1)
predicted positive        C(1,0)     |     C(1,1)

L(x, i) = Σ_j P(j | x) C(i, j)
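This decision rule can be sketched in a few lines. The cost values below are hypothetical, not the ones tuned in the challenge:

```python
import numpy as np

# Hypothetical cost matrix C[i, j]: cost of predicting class i when the
# true class is j (0 = zero engagement, 1 = non-zero engagement).
# Missing a non-zero tweet (predict 0, truth 1) is penalized most heavily.
C = np.array([[0.0, 10.0],
              [1.0,  0.0]])

def cost_sensitive_predict(probs, C):
    """Pick the class i minimizing L(x, i) = sum_j P(j|x) * C(i, j)."""
    losses = C @ probs           # losses[i] = sum_j C[i, j] * P(j | x)
    return int(np.argmin(losses))

# A tweet a plain classifier would call zero engagement (P(0|x) = 0.7):
# the asymmetric costs flip the decision to the non-zero class.
pred = cost_sensitive_predict(np.array([0.7, 0.3]), C)   # -> 1
```

Only when the classifier is very confident in the zero class does the zero prediction survive, which is exactly the bias the imbalanced data calls for.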
Ranker
• Let the grade/label set be y = 1, 2, ..., l
• Let the Twitter user ids be u1, u2, ..., um, and let Ti = {t_i,1, t_i,2, ..., t_i,ni} be the set of tweets associated with user ui;
then yi = {y_i,1, y_i,2, ..., y_i,ni} is the set of grades associated with user ui (ni is the size of Ti and yi)
• The classified training set is denoted S = {(ui, Ti, yi)}, i = 1, ..., m
• A feature vector x_i,j = F(ui, t_i,j) is generated from each user-tweet pair (ui, t_i,j)
• Our goal: to train ranking models that can assign a score to a given pair (ui, t_i,j)
• Used Random Forest (RF)
o uses independent subsets
o parallelizable, robust to noisy data, capable of learning disjunctive expressions
Neighborhood-based Repairing for
Tweets with Predicted Zero
Engagement
[Diagram: zero-engagement vs. non-zero-engagement tweets]
Neighborhood-based Repairing for Tweets with Predicted Zero Engagement
• Goal: to correct predictions for non-zero tweets misclassified as zero-engagement tweets
• Added a neighborhood-based approach after Step 1:
1. Compute the similarity between training and test tweets in the common latent space (computed using NMF)
2. Find the NT nearest training tweets for each test tweet classified as zero-engagement
3. Reassign the predicted engagement
Neighborhood-based Repairing for Tweets with
Predicted Zero Engagement
• Varied the neighborhood size: NT = 5, 10, or 20
• Used two options to reassign predicted engagements:
1. Predict non-zero engagement if any of the neighbors' engagements is non-zero
2. Predict engagement = the most frequent engagement value among the neighbors
• Selected a margin based on cosine similarity to determine which zero-predicted-engagement test tweets become candidates to be repaired
o i.e., moved to non-zero predicted engagement
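Option 1 can be sketched as follows. The latent vectors are toy placeholders, and the NT and threshold defaults are just the values mentioned on the slides, not claims about the tuned configuration:

```python
import numpy as np

def repair_zero_predictions(test_latent, train_latent, train_engagement,
                            preds, n_neighbors=10, sim_threshold=0.9):
    """Flip a zero-engagement prediction to non-zero (option 1 above) when
    the test tweet lies inside the cosine-similarity margin and any of its
    nearest training neighbors has non-zero engagement."""
    tn = test_latent / (np.linalg.norm(test_latent, axis=1, keepdims=True) + 1e-12)
    rn = train_latent / (np.linalg.norm(train_latent, axis=1, keepdims=True) + 1e-12)
    sims = tn @ rn.T                       # cosine similarity matrix
    repaired = preds.copy()
    for i in np.where(preds == 0)[0]:
        order = np.argsort(sims[i])[::-1][:n_neighbors]
        if sims[i, order[0]] < sim_threshold:
            continue                       # outside the margin: leave as zero
        if (train_engagement[order] > 0).any():
            repaired[i] = 1
    return repaired

# toy latent vectors: the first test tweet sits on top of a non-zero
# training tweet, the second on top of a zero-engagement one
train_latent = np.array([[1.0, 0.0], [0.0, 1.0]])
train_engagement = np.array([5, 0])
test_latent = np.array([[1.0, 0.01], [0.0, 1.0]])
preds = np.array([0, 0])
repaired = repair_zero_predictions(test_latent, train_latent,
                                   train_engagement, preds,
                                   n_neighbors=1, sim_threshold=0.9)
# repaired -> [1, 0]
```

Option 2 would replace the any-non-zero check with a majority vote over the neighbors' engagement values.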
Steps in Process
1. classification
2. refine the classification
3. rank the non-zero tweets
4. merge the zeros with the ranked non-zero tweets
tweets
Results
• Step 1 - Classifier
o Used Weka with a cost-sensitive framework
o The engagement was discretized into 6 classes
o AdaBoost gave the best result:
• 99% classification accuracy
• true positive rate = 0.74 on the minority class (non-zero engagements)
• false positive rate = 0.01
• Step 2 - Ranker
o Built a global ranking model
o Used RankLib, implemented in Java
o Engagement is considered the target value to be ranked
class 1: only 0s
class 2: only 1s
class 3: 2-10
class 4: 11-20
class 5: 21-50
class 6: values > 50
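The discretization above can be written directly, with the class boundaries taken from the slide:

```python
def engagement_class(engagement):
    """Map a raw engagement count to one of the 6 classes listed above."""
    if engagement == 0:
        return 1          # class 1: only 0s
    if engagement == 1:
        return 2          # class 2: only 1s
    if engagement <= 10:
        return 3          # class 3: 2-10
    if engagement <= 20:
        return 4          # class 4: 11-20
    if engagement <= 50:
        return 5          # class 5: 21-50
    return 6              # class 6: values > 50
```

The uneven bin widths mirror the power-law engagement distribution: most of the mass sits at 0 and 1, with a long sparse tail.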
Results
• nDCG@10 for the test data
• Ranker applied indiscriminately to both zero- and non-zero-class tweets
• This result is shown only to appreciate the impact of the classifier in Step 1

Ranking Algorithm | All features | Excluding IMDb features | Excluding graph-propagated features
Random Forest | 0.553 | 0.503 | 0.485
LambdaMART | 0.466 | 0.422 | 0.411
RankBoost | 0.432 | 0.417 | 0.406
Results
Ranking Algorithm | All features | Excluding IMDb features | Excluding graph-propagated features
Random Forest | 0.805 | 0.503 | 0.485
LambdaMART | 0.466 | 0.422 | 0.411
RankBoost | 0.432 | 0.417 | 0.406
• Merging the zero and non-zero predicted engagements to calculate nDCG@10
– by appending the zero-class tweets after the non-zero tweets for each user, and then sorting the tweets again based on the user id
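The merge step can be sketched as follows; the user and tweet ids are toy placeholders, not data from the challenge:

```python
def merge_user_tweets(ranked_nonzero, zero_tweets):
    """Ranked non-zero tweets first, then zero-predicted tweets appended."""
    return list(ranked_nonzero) + list(zero_tweets)

# toy per-user data: (non-zero tweets in ranked order, zero-predicted tweets)
per_user = {
    "u2": (["t7"], ["t5", "t6"]),
    "u1": (["t3", "t1"], ["t2"]),
}

submission = []
for user_id in sorted(per_user):        # final sort by user id
    ranked, zeros = per_user[user_id]
    submission.extend((user_id, t) for t in merge_user_tweets(ranked, zeros))
# submission -> [("u1","t3"), ("u1","t1"), ("u1","t2"),
#                ("u2","t7"), ("u2","t5"), ("u2","t6")]
```

Appending the zero-class tweets last is what makes the Step 1 classifier matter: a tweet wrongly sent to the zero class can never outrank a true non-zero tweet.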
Results
• Effect of repairing the predicted zero-engagement tweets on nDCG@10 values, for varying margin sizes defined by different similarity thresholds and neighborhood sizes
• Varied the neighborhood size: NT = 5, 10, 15
• For NT = 10 nearest neighbors and a similarity threshold of 0.9 to confine the margin where tweets are repaired, the nDCG@10 value increased to 0.817
Lessons Learned
• What helped
o Adding additional sources of data (IMDb)
o Dedicated janitorial work on the data
o Feature engineering: selection, extraction, and construction of relevant features
o Mapping data (tweets and movies) to a latent factor space using NMF
o Binary classification of zero- and non-zero-engagement tweets prior to ranking
o Cost-sensitive classification in Step 1
o Using Learning to Rank (LTR) methods in Step 2
o Repairing misclassified non-zero-engagement tweets within a limited margin, and then re-ranking them
Lessons Learned
• What hurt
o Over- or under-sampling to handle class imbalance
o Spending too much time filling in missing values for many features, which did not have any return on investment
o Not having the luxury of better computational power
o Building separate models based on whether or not the users and movies were common between the training and test sets
Conclusion
• If more time or computational power were within our
reach, we would further explore several directions:
o Exploring other LTR options in addition to the pointwise approach, including pairwise and listwise LTR
o Domain-informed transductive learning
o Exploring additional extracted or constructed features that may affect engagement
o Exploring high-performance computing for the many embarrassingly parallelizable tasks