TRANSCRIPT
CSE 291: Trends in Recommender Systems and Human Behavioral Modeling
Week 10 Project Presentations
Neural Rating Regression with Abstractive Tips Generation for Recommendation
Balasubramaniam Srinivasan, Nitin Kalra, Prem Nagarajan
Problem Statement
Given a user and an item, simultaneously predict a precise rating and generate tips.
Dataset
Amazon dataset
Rating regression categories: Electronics, Movies, Books (size ~ 1 GB each)
Multi-task learning categories: Pet Supplies, Arts & Crafts, Cell Phone Accessories

Dataset statistics per category (the columns appear to be #users, #items, #reviews):
603,668   367,982    8,887,781
192,403    63,001    1,684,779
123,960    50,052    1,697,533
Architecture
Baseline model
● Deep Learning based framework named NRT (Neural Rating and Tips generation)
● A multi-layer perceptron maps user and item latent factors into a rating
● A Gated Recurrent Unit (GRU) translates user and item latent factors into tips
● Uses a beam search algorithm to generate tips from the trained model
● A multi-task learning framework integrates both rating prediction and tips generation through the objective function
Evaluation metrics
● For the rating prediction task (formulas recalled below):
  ○ Mean Absolute Error (MAE)
  ○ Root Mean Square Error (RMSE)
● For the tip generation task:
  ○ ROUGE-N score
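For reference, the standard definitions of the two rating-prediction metrics over a test set T (standard formulations, not copied from the slides):

```latex
\mathrm{MAE} = \frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}} \lvert \hat{r}_{u,i} - r_{u,i} \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}} \bigl(\hat{r}_{u,i} - r_{u,i}\bigr)^{2}}
```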
Extension 1
● Effect of the following metadata on the ratings:
  a. Also viewed
  b. Also bought
  c. Bought together
● Modelling the new features as graphs
● Learning from the node2vec representation of the nodes (a sketch of this idea follows below)
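A minimal sketch of how the "also viewed"/"also bought" metadata could be turned into item embeddings: the item pairs below are hypothetical, and the walks use a node2vec-like simplification (uniform walks, i.e. p = q = 1) with gensim's skip-gram rather than the presenters' actual pipeline.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Hypothetical "also viewed" edges: (item_a, item_b) pairs from Amazon metadata.
also_viewed = [("B001", "B002"), ("B001", "B003"), ("B002", "B004")]

G = nx.Graph()
G.add_edges_from(also_viewed)

def random_walks(graph, walks_per_node=10, walk_length=20):
    """Uniform random walks (node2vec with p = q = 1)."""
    walks = []
    for node in graph.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Skip-gram over the walks gives one embedding per item; these vectors could be
# fed alongside the NRT latent factors as extra rating-regression input.
model = Word2Vec(random_walks(G), vector_size=128, window=5, min_count=0, sg=1)
item_vec = model.wv["B001"]
```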
Extension 2
● Using the factoid answers dataset to improve rating prediction and tip generation
● Contains question and answer data from Amazon
Results for rating prediction (MAE / RMSE)

Model                      Books          Electronics      Movies & TV
Baseline model (NRT)       —              0.805 / 1.060    0.885 / 1.130
NRT + Also viewed (128)    * / *          0.794 / 1.039    0.921 / 1.126
NRT + Also bought (128)    * / *          0.802 / 1.052    0.905 / 1.119

* Sampled down
Results for tip generation

Pet Supplies (F1 / P / R):
              ROUGE-1                 ROUGE-2               ROUGE-L
NRT           27.98 / 22.11 / 45.67   2.03 / 1.68 / 3.63    22.84 / 21.66 / 45.67
NRT + Q/A     28.31 / 22.32 / 46.09   2.13 / 1.56 / 4.27    23.09 / 21.83 / 45.88

Arts and Crafts (F1 / P / R):
              ROUGE-1                 ROUGE-2               ROUGE-L
NRT           30.85 / 26.06 / 44.90   1.21 / 1.12 / 1.88    26.33 / 25.59 / 44.90
NRT + Q/A     31.14 / 23.33 / 55.84   0.86 / 0.79 / 1.13    28.72 / 26.98 / 55.84
Results for tip generation

Cell Phone Accessories (F1 / P / R):
              ROUGE-1                 ROUGE-2               ROUGE-L
NRT           28.25 / 19.82 / 60.21   1.02 / 0.64 / 2.68    20.86 / 19.40 / 60.21
NRT + Q/A     29.08 / 25.73 / 43.27   0.45 / 0.34 / 0.71    23.22 / 22.09 / 43.14
Results for rating prediction (multi-task learning categories)

Pet Supplies:
              MAE      RMSE
NRT           0.712    0.822
NRT + Q/A     0.706    0.784

Arts and Crafts:
              MAE      RMSE
NRT           0.543    0.9310
NRT + Q/A     0.543    1.087
Results for rating prediction (multi-task learning categories, continued)

Cell Phone Accessories:
              MAE      RMSE
NRT           0.539    0.487
NRT + Q/A     0.493    0.477
Limitations
● Large datasets
● Model is compute intensive
● Extensions are compute intensive
Work in Progress
● Analyze the importance of time or season on product ratings and reviews
  ○ Capturing user and item state
● Books dataset sampling
Thank you!
Extension to Neural Collaborative Filtering
Wen Liang, Zeng Fan
Original Paper
Presents the GMF (Generalized Matrix Factorization) model and the NeuMF model.
Motivations
● Use user and item attributes in the dataset
● Tackle the sparsity issue
Datasets
● MovieLens
  ○ User-movie ratings
  ○ User information: gender, age, occupation
  ○ Movie information: genre (e.g. adventure, comedy, etc.)
● Pinterest
  ○ User-item pairs
  ○ Number of each user's pins
  ○ User's category
Evaluation and Metrics
● Evaluation
  ○ Leave-one-out evaluation: for each user, leave one user-item interaction out for the test set
● Metrics
  ○ Hit Ratio@10
  ○ Normalized Discounted Cumulative Gain (NDCG)@10
Revisit the NeuMF Model
NeuMF combines GMF and MLP to better capture the implicit user-item relationship.
Using the GMF model alone is efficient and costs little in performance.
Attribute-aware deep CF model
● An extension of the NeuMF model
● Social-network based
● Adds a pooling layer above the embedding layer
Wang et al. (2017) Item Silk Road: Recommending Items from Information Domains to Social Users
Proposed Model
● Use a shared user embedding to address the cold-start problem
● Use a weight to balance the element-wise products between pairs of user, item, and attribute vectors (a sketch follows below)
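A minimal sketch of the scoring idea described above, assuming a GMF-style model in which the user-item and user-attribute element-wise products are mixed with a balancing weight w. The class name, layer sizes, and example ids are illustrative, not the presenters' code.

```python
import torch
import torch.nn as nn

class AttrGMF(nn.Module):
    """Sketch of an attribute-aware GMF: element-wise products of (user, item)
    and (user, attribute) embeddings are blended with a weight w before the
    prediction layer; the user embedding is shared across both branches."""
    def __init__(self, n_users, n_items, n_attrs, k=8, w=0.5):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)   # shared user embedding
        self.item_emb = nn.Embedding(n_items, k)
        self.attr_emb = nn.Embedding(n_attrs, k)   # e.g. genre / age-bucket ids
        self.out = nn.Linear(k, 1)
        self.w = w

    def forward(self, u, i, a):
        pu, qi, qa = self.user_emb(u), self.item_emb(i), self.attr_emb(a)
        interaction = self.w * (pu * qi) + (1 - self.w) * (pu * qa)
        return torch.sigmoid(self.out(interaction)).squeeze(-1)

# Hypothetical MovieLens-sized instantiation and a single forward pass.
model = AttrGMF(n_users=6040, n_items=3706, n_attrs=30)
score = model(torch.tensor([0]), torch.tensor([10]), torch.tensor([3]))
```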
Results
[Figures: Hit Ratio@10, Normalized Discounted Cumulative Gain (NDCG@10), and training loss vs. epochs]
Questions
Final Project: A Synthetic Approach for Recommendation
Yan Cheng, Moyuan Huang
Overview
1. Objective: predict customer ratings for businesses
2. Metric: root mean square error
3. Dataset: subset of Yelp
4. Models
Dataset
1. Yelp - dataset
2. Select 5,000 data points for simplicity
3. To avoid sparsity in the recommendation matrix, we work with users who have more than 30 reviews
Dataset
1. Ratings: business_id, business_stars, user_id, and user_average_stars
2. Relations: user_id and friend_ids
3. Reviews: business_id, user_id, and rating and review_text
Model Overview
1. Basic Model
   a. Mean estimation
   b. Matrix Factorization
2. MF with latent factors
3. Topic MF
   a. original version
   b. modified version
4. Social MF
   a. Friend relation
   b. Social popularity
   c. User similarity
● Basic Model (a small sketch of both parts follows below)
  a. Mean estimation:
     rating = mean(ratings) + [mean(user) - mean(ratings)] + [mean(business) - mean(ratings)]
  b. Matrix Factorization: sklearn.decomposition.NMF
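A minimal sketch of the two basic models on a toy rating matrix (the matrix values are made up; the real project uses the Yelp subset). The mean-estimation formula follows the slide; the factorization uses sklearn's NMF, as the slide names it.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user x business rating matrix (0 = missing), standing in for the Yelp subset.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# (a) Mean estimation: global mean plus user and business offsets.
mask = R > 0
global_mean = R[mask].mean()
user_mean = np.where(mask.any(1), (R * mask).sum(1) / np.maximum(mask.sum(1), 1), global_mean)
biz_mean = np.where(mask.any(0), (R * mask).sum(0) / np.maximum(mask.sum(0), 1), global_mean)
pred_mean = global_mean + (user_mean[:, None] - global_mean) + (biz_mean[None, :] - global_mean)

# (b) Matrix factorization with sklearn's NMF.
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)          # user factors
H = nmf.components_               # business factors
pred_mf = W @ H
```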
Model Overview
● MF with latent factors
● Topic MF (incorporating reviews) - a sketch follows below
  a. Input: tf vector for each review
  b. LDA
  c. Output: vector of topic distributions
● Different implementations
  a. original version did not work out
  b. modified version
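A minimal sketch of the Topic MF input step, assuming gensim's LDA: term frequencies per review go in, a topic-distribution vector per review comes out. The toy reviews and topic count are placeholders, not the project's actual configuration.

```python
from gensim import corpora, models

# Toy tokenized reviews; in the project this is the tf input per Yelp review.
reviews = [["great", "food", "service"],
           ["slow", "service", "rude", "staff"],
           ["great", "pizza", "cheap"]]

dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(tokens) for tokens in reviews]   # tf per review

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Output: a topic-distribution vector per review, usable as a side feature in MF.
topic_vectors = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
```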
Model Overview
● Topic MF (incorporating reviews)
  a. original version
  b. modified version
● Social MF (social relationship information) [WIP]
  a. Friend relation
  b. Social popularity
  c. User similarity
Result
1. Basic Model
   a. Mean estimation: 0.804
   b. Matrix Factorization: 0.800
2. MF with latent factors: ?
3. Topic MF
   a. original version: 0.907
   b. modified version: 0.794
4. Social MF
   a. Relation: 0.796
   b. Popularity: 0.773
   c. User similarity: 0.804
WIP
1. Using word representations instead of bag-of-words
2. Combine social MF together
3. Compare performance across models
4. Explain the result
Dynamic Recurrent Network for Next-Basket Recommendation with Attention
TEAM MEMBERS :
KRITI AGGARWAL, SUDHANSHU BAHETY, DIGVIJAY KARAMCHANDANI
Original Paper: A Dynamic Recurrent Model for Next Basket Recommendation (DREAM)
▶ Original DREAM model proposes a dynamic recurrent basket model based on RNN for next basket recommendation
▶ Merges the current items in a user's basket and global sequential basket features into the user's recurrent, dynamic representation using an RNN (LSTM).
▶ It shows that a nonlinear operation (max-pooling) for learning the representation of a basket does well in capturing elaborate interactions among multiple factors of items (i.e., item embeddings are learned as part of the network using a feed-forward network).
▶ Extensive experiments on two public datasets (T-mall and Ta-Feng) demonstrated the effectiveness of the proposed model.
Original Network Architecture
Extension 1: Implementing and Adapting the DREAM Model to Instacart Dataset
▶ We used the Instacart Market Analysis dataset as the original datasets were not available to us.
▶ The reason this dataset was chosen was because this was found to be the closest to the original datasets while doing literature review.
▶ We needed to communicate with the authors to clarify certain parts of the paper.
▶ Implemented the original DREAM model in PyTorch.
▶ Dataset description:
  ▶ Anonymized; contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.
  ▶ Between 4 and 100 orders per user, with the sequence of products purchased in each order.
Extension 2: Adding Attention to the DREAM MODEL
▶ We took the idea of adding attention from the ICLR 2015 paper “NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE”.
At time t, Yt is the representation of the input basket and st-1 is the hidden representation of the user (the LSTM hidden state). We add attention (+) for a weighted focus on previous user hidden states. After attention we get a context vector Ci of the same size as st-1. Ci is the weighted sum of previous user representations, so the model attends to the most important user hidden factors.
Adding Attention to the DREAM MODEL
▶ Drawing parallels from attention used successfully in Seq2Seq models, we wanted the LSTM to take input the most important parts of the last few user hidden representations.
▶ The hidden representations of the users are captured at each time step t.
▶ The attention is based on an alignment score, i.e., how correlated the current input is to each of the previous baskets in a window.
▶ We have a hyper-parameter k, which decides the appropriate window size over previous baskets (a sketch follows below).
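A minimal sketch of the windowed attention described above: dot-product alignment scores between the current basket representation and the last k user hidden states give attention weights, which produce the context vector. Tensor sizes and the dot-product scoring choice are assumptions, not the presenters' exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_context(prev_hidden, basket_repr, k=5):
    """prev_hidden: (T, d) user hidden states from earlier time steps
    basket_repr: (d,)  pooled representation of the current basket (Yt)
    Returns a context vector of size d: a weighted sum of the last k states."""
    window = prev_hidden[-k:]                        # last k hidden states
    scores = window @ basket_repr                    # dot-product alignment scores
    weights = F.softmax(scores, dim=0)               # attention weights over the window
    return (weights.unsqueeze(1) * window).sum(dim=0)

# Example with hypothetical sizes: 12 previous steps, hidden size 64.
c_t = attention_context(torch.randn(12, 64), torch.randn(64), k=5)
```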
RESULTS
▶ Due to computational limitations, we sampled 10% of the dataset.
▶ We ran the DREAM model on 32,000 users and 44,440 unique items.
▶ Padding was performed differently for each batch.
▶ Runtime: 500s for each epoch.
▶ The model saturated after about 10 epochs.
▶ Final results:

            F1@k     NDCG     Precision@k   Recall@k
Baseline    0.0548   1.2688   0.2822        0.0367
Our Model   0.0493   1.2377   0.2767        0.0303
Key Takeaways and Future Work
▶ Attention did not help in our case.
▶ Learning item embedding as a part of the network and max pooling being a non linear operation does well in capturing elaborate interactions among multiple factors of items.
▶ Padding specific to each basket performs better than having the same pad length.
Questions ?
Extensions on Generating and Personalizing Bundle Recommendations on Steam
Yiwen Gong, Siyu Jiang, Kuang-Hsuan Lee
Objectives
1. Predict the preference rating of items/bundles for a given user
2. Recommend bundles to the given user according to their preference
3. Generate new personalized bundles
Original paper: Architecture
[Diagram: an Item BPR model trained on user-item data and a Bundle BPR model trained on user-bundle data (latent factors Bi, Pu, Qi); an initial bundle plus candidate items form candidate bundles, which are compared to produce the recommended bundle]
Weakness of Original Paper
● Naive model; more features could be considered given the data
  ○ Game genre
  ○ Bundle discount rate
● Unstable results: model AUC varies from 0.63 to 0.88
● Users share the preference for bundle diversity
● Flaws in bundle generation
  ○ Generates new bundles consisting only of items from existing bundles
  ○ Tends to include popular items in bundles (to increase profit, common bundles usually carry a small number of unpopular items)
Original Bundle Ranking
● 2-step BPR
  ○ Item BPR
  ○ Bundle BPR

Extended Bundle Ranking
● 2-step BPR
  ○ Item BPR
  ○ Bundle BPR
Extended Bundle Ranking: discount effect
Bundle discount ∈ [0, 1] → tan(x), x ∈ [-π/2, π/2] → sigmoid(x) ∈ [0, 1] → discount effect ∈ [0, Tu]

Discount        Increase in utility
0% - 10%        high
40% - 50%       few
90% - 100%      high
Original Bundle Generation: Method and Results
● Recommends bundles with items a user has already bought
● For 100 users, only 7 of the items are new to everyone
● AUC is not always right!
● People can buy popular items without a recommendation system; one goal of the system is to activate the user's demand
Extended Bundle Generation
● The Item BPR model has learnt information about items outside of all existing bundles, so our bundle generation can produce bundles with items new to existing bundles
● Ensure generated bundles consist only of items a user never bought
● Tend to include unpopular items for profit considerations
New Bundle Generation Algorithm (a sketch follows below)
1. Assign a picking probability to each item based on its popularity; less popular items get higher probability.
2. Initialize a bundle with a random size from [Average Bundle Size - s, Average Bundle Size + s]. Items are chosen according to their assigned probabilities Pu,i.
3. Generate a candidate set. Choose half of the items in the set from items not bought; choose the other half from all items using Pu,i.
4. Generate new bundles by adding, deleting, or replacing items in the initial bundle using items from the candidate set.
5. Choose the bundle with the largest Xu,b as the new bundle.
6. Repeat steps 3 to 5 until convergence.
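A minimal sketch of steps 1-3 of the algorithm above. Item ids, popularity counts, and parameter values are made up; scoring the candidate bundles with the learned Xu,b (step 5) is left to the trained bundle BPR model and is not shown.

```python
import random
import numpy as np

def pick_probs(item_popularity):
    """Step 1: less popular items get higher picking probability (inverse popularity)."""
    inv = 1.0 / (np.asarray(item_popularity, dtype=float) + 1.0)
    return inv / inv.sum()

def init_bundle(items, probs, avg_size, s):
    """Step 2: random bundle size in [avg_size - s, avg_size + s], items drawn by probs."""
    size = random.randint(avg_size - s, avg_size + s)
    return list(np.random.choice(items, size=size, replace=False, p=probs))

def candidate_set(items, probs, bought, n):
    """Step 3: half from items the user never bought, half from all items (using probs)."""
    unbought = [i for i in items if i not in bought]
    half = n // 2
    cand = list(np.random.choice(unbought, size=min(half, len(unbought)), replace=False))
    cand += list(np.random.choice(items, size=n - len(cand), replace=False, p=probs))
    return cand

# Hypothetical usage with 100 items and random popularity counts.
items = list(range(100))
probs = pick_probs(np.random.randint(1, 50, size=100))
bundle = init_bundle(items, probs, avg_size=5, s=2)
cands = candidate_set(items, probs, bought={1, 2, 3}, n=10)
```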
Results - Bundle Ranking
Results - Bundle Generation
t-SNE embedding of latent representations
Conclusion
● We propose several extensions to the original bundle recommendation method.
● Our method achieves a large improvement in BPR ranking results over the original method.
● Our method achieves better and more reasonable bundle generation for specific users.
Limitations & Future Work
● About 97% of users bought fewer than 10 bundles. If a user only bought a few items or bundles, it is hard to estimate that user's sensitivity to bundle prices.
● The current model generates bundles based on users' preferences. However, without knowing the commercial information (cost, etc.), it is hard to generate bundles that are beneficial for game distributors.
TransRec: Smarter Translation Vectors
Rajiv Pasricha
Original Paper
Translation-based Recommendation, by Ruining He, Wang-Cheng Kang, and Julian McAuley
● Sequential model for recommendation
  ○ Embed users and items into a low-dimensional "translation space"
  ○ Each user travels along their personalized trajectory of item interactions
The TransRec Model
● Probability of next item j given user u and previous item i (see the formula below)
● βj = item bias (captures overall item popularity)
● d = distance function (e.g. L1 or L2)
● γi = previous-item factors, γj = next-item factors
● Tu = user translation vector
● Φ, Ψ = transition space and subspace; restricting factors helps regularization (TransRec: L2 ball)
● Trained using Sequential BPR Loss, SGD
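Written out, the scoring function the bullets above describe (as recalled from the TransRec paper, with the symbols defined above; the factors are constrained to the subspace Φ ⊆ Ψ, an L2 ball in TransRec):

```latex
\widehat{\mathrm{Prob}}(j \mid u, i) \;\propto\; \beta_j - d\bigl(\gamma_i + T_u,\ \gamma_j\bigr)
```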
Datasets and Evaluation
Evaluation: AUC
Extensions: Personalization
● Personalized translation vector
  ○ Model "typical" sequences of items that are common across users
  ○ AUC on the Amazon Video Games dataset: 0.7610 → 0.7633
Extensions: Temporal Dynamics
● Time Delta model
  ○ Incorporate the time delay between interactions
  ○ Interactions that are farther apart can have larger translations between them
  ○ Amazon Video Games dataset: 0.7610 → 0.7544
● Personalized Time Delta model
  ○ Add a user-specific scaling factor to the above time deltas
  ○ Learn the scaling factor from the data
  ○ Amazon Video Games dataset: 0.7610 → 0.7570
Extensions: Extra Translation Vector
● Introduce separate user offsets for short-term and long-term interactions
  ○ Learn two translation vectors per user, with a threshold at a time delay of 6 months
  ○ Allow users to exhibit different tendencies based on temporal data
  ○ If the delay is under 6 months, use the short-term translation vector; otherwise, use the long-term one
  ○ Amazon Video Games dataset: 0.7610 → 0.7646
Extensions: Nonlinear Translation Vectors
● Use a neural network to model more complex translation relationships
● (1) Model a nonlinear relationship between the previous item and the user translation vector
  ○ Amazon Video Games dataset: 0.7610 → 0.7661
● (2) Directly estimate the probability of transitioning to the next item
  ○ Amazon Video Games dataset: 0.7610 → 0.7552
Extensions: Nonlinear Temporal Models
● Add temporal information into the nonlinear neural network models
● Add the delta between the previous and next interaction times
  ○ Neural net translation vector model, Amazon Video Games dataset: 0.7610 → 0.7661 → 0.7665
  ○ Neural net distance model, Amazon Video Games dataset: 0.7610 → 0.7552 → 0.7629
● Add the raw previous and next interaction times
  ○ Neural net translation vector model, Amazon Video Games dataset: 0.7610 → 0.7661 → 0.7662
  ○ Neural net distance model, Amazon Video Games dataset: 0.7610 → 0.7552 → 0.7661
Visualization
● “Transition space” learned by the model (when k = 2)
Training sequence of items for one user in the dataset
Visualization (without normalization)
● “Transition space” learned by the model (when k = 2)
Training sequence of items for one user in the dataset
Discussion and Future Work
● Adding nonlinear translation vectors helps the model learn more complex relationships between items.
● Adding temporal information helps when integrated with nonlinear models.
● It will be helpful to also compare results using different evaluation metrics in addition to AUC, e.g. Hit@50
● Additional visualizations, come up with a model that more clearly arranges items sequentially in the transition space.
Questions?
Extensions to Personalized Ranking Metric Embedding (PRME)
Shreyas Udupa Balekudru
Problem Statement
Next New POI Recommendation problem: new POIs with respect to the user's current location are to be recommended.
Input: User ID, current POI, physical location (latitude and longitude), check-in time
Output: Recommended POI
Dataset
Foursquare check-ins
Check-ins in Singapore between 08/2010 and 07/2011
Number of check-ins = 151,589
Number of users = 2,321
Number of POIs = 5,596
Training for PRME
Sequential transition space and user preference space (the weights of the metric spaces are parameterized).
Stochastic gradient descent; model parameters initialized from a normal distribution.
If the check-in time difference is greater than a threshold, only parameters in the user preference space are updated.
Hyperparameters Used
K = 20
Number of iterations = 1000
Alpha = 0.2
Learning rate = 0.005
Regularization factor = 0.03
Incorporating Distance (PRMEG)
Include geographical distance as a multiplicative factor in the distance metric.
Users prefer visiting nearer POIs over farther POIs.
Issues Faced
Units for distance are not specified in the original paper.
Training time for higher values of K is prohibitive (training algorithm complexity: O(IK|C|)).
Dates in the data are specified with an ID; it is unclear whether consecutive IDs represent consecutive days.
Evaluation Metric
Mean Reciprocal Rank (see the formula below), where Q is the number of queries and Rank_i is the rank of the next POI in the test set compared against 20 randomly sampled 'negative' POIs from the dataset.
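The standard MRR definition matching the description above:

```latex
\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{\mathrm{Rank}_i}
```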
Results
The higher the value of K, the better the results.
Not all functions of the geographical distance lead to an improvement in performance.
Visualization
Work in Progress
Re-evaluating the PRMEG results from the paper as a sanity check.
Interpretation of the metric embedding visualization.
Evaluating a PRMEG-like approach for new product recommendation using rating as a distance metric.
Questions?
Wednesday Presentations
Personalized Next Song Recommendation
Kiran Kannar, Rahul Dubey
Dec 06, 2017
Problem Statement
Given a user's song listening history, provide personalized next-song recommendation using metric embeddings.
[Example sequence diagram: a listening session with songs s1-s4 (Viva La Vida - Coldplay, Just The Way You Are - Bruno Mars, Firework - Katy Perry) and the next song to predict marked "?"]
Datasets
Measure                Now Playing    30 Music
# sessions             9,288          100,000
# users                1,032          7,146
# tracks               76,652         694,817
Avg. # sessions/user   9              9.33
PRME Model
Transition Probability
MAP
Gradient update equation
PRME-Au: Personalizing alpha
- Non-convex problem
- Use an alternating minimization technique
- Empirical results showed random normal clipping works better than sigmoid/tanh or 0/1 clipping
- Best initialization: initialize to the global alpha of PRME!
- A bounded tradeoff works better than an unbounded tradeoff
PRME Social
Similarity score (asymmetric)
MAP
Gradient update equation
Results
[Figures: AUC vs. iterations (Now Playing, 30 Music); MRR and Hit Rate vs. embedding dimensions; visualization of songs in sequence space]
Alpha_U statistics - I: (30M dataset)
Median: 0.1964205
Mean: 0.216677464557
Standard deviation: 0.146711865691
Minimum value: 9.9e-05
Maximum value: 0.931338
Alpha_U statistics - II: (30M dataset)
Alpha_U statistics - III: (30M dataset)
Thank you!
FashionGAN: A generative model for fashion recommendation
By Vignesh Gokul
Base paper
● Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences (Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala and Serge Belongie)
● The paper implements a Siamese CNN with strategic sampling to learn an embedding space for all items, and uses these embeddings to build a better item recommender system
Siamese CNN Architecture
FashionGAN
● A generative model that outputs a compatible image given an input image
● Conditioned on the input image
● Related work:
  ○ Image-to-Image Translation with Conditional Adversarial Networks
Image to Image Translation with CGANs
FashionGAN
Siamese GAN
Figure: Architecture of the Generator
Siamese GAN(Results)
Evaluation
● Inception score
● Opposite SSIM
● The Siamese GAN could be improved by using variational encoders.

Model                 Inception score   Opposite SSIM
Image to Image GAN    3.2733448         0.64017811
Siamese GAN           1.9960622         0.6160975
Another Extension (Work in Progress)
● Use deep supervision to improve Siamese CNNs
Questions?
TransNets: Learning to transform for Recommendation
By: Akanksha Grover, Dhruv Sharma, Rishab Gulati
TransNets: Using Review Text for rating prediction
● TransNets represents users and items using the past reviews given by/to them
● Learns a latent representation of the prospective review from the <User, Item> interaction
● Optimizes the MSE of the predicted ratings
● Models the interaction between a user and an item using only the reviews
● TransNet-Ext models the interaction using both the user-item latent vectors and the review as input
Dataset and Code
● The original model uses the Yelp 2016 dataset: https://www.yelp.com/dataset_challenge
● We ran the model on the Yelp 2017 dataset.
● Data Statistics:
○ 4,700,000 reviews
○ 156,000 businesses
○ 1,100,000 users
● This data is larger than the original data the model was run on when the paper was published.
● Original code taken from: https://github.com/rosecatherinek/TransNets
● Modifications have been done to the above code for extensions
Train, Test and Validation Epochs
● We divided the entire dataset, including all reviews, randomly into 3 parts: train, test and validation sets.
● Number of datapoints:
○ Train-3,789,517
○ Test-473,689
○ Validation-473,689
● Due to computational reasons, we limited our experiments to a single training epoch.
● We ran the original model and our modified model for one epoch and compared the MSE values.
Original Model
● A Target Network processes the target review revAB.
● A Source Network processes the texts of the (userA, itemB) pair, which do not include the joint review revAB.
● The original model concatenates all user reviews and all item reviews (except the common review) for the user and item respectively, up to a length of 1000 words.
TransNet Original
1. Running time: 9 hours per epoch
2. Review length = 800
3. Embedding trained on the top 50K most frequent words in Yelp 2017
4. Test MSE: 1.81
Extension 1:
Issues:
● The original model concatenates all reviews into a single composite review
● It does not consider the variation in review text across reviews with different ratings
● It requires a large matrix of word embeddings for each user/item/review

Proposed Solution:
● Each column should represent a review's embedding
● Reviews are sorted by rating to allow the CNN to learn variation via spatial correlation in the matrix
● Requires only a matrix of K x (size of embedding)
Extension 1:
Each column is the latent representation of a user review / item review.
The input is a set of K user/item review embeddings sorted by rating in increasing order (see the sketch below).
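A minimal sketch of how the Extension 1 input could be built, assuming the review embedding is a sum of word embeddings as in Experiments 1-3. The function names, padding-with-zeros choice, and dimensions are illustrative assumptions, not the authors' code.

```python
import random
import numpy as np

def review_matrix(reviews, embed, k=10, dim=64):
    """One column per sampled review, columns sorted by the review's rating so
    the CNN can exploit spatial order.

    reviews: list of (rating, text) pairs for one user or item
    embed:   callable mapping review text -> a `dim`-dimensional vector
    """
    sampled = random.sample(reviews, k) if len(reviews) >= k else list(reviews)
    sampled = sorted(sampled, key=lambda r: r[0])            # sort by rating
    cols = [embed(text) for _, text in sampled]
    while len(cols) < k:                                      # pad when fewer than k reviews
        cols.append(np.zeros(dim))
    return np.stack(cols, axis=1)                             # shape: (dim, k)

# Hypothetical usage with a dummy embedding function.
mat = review_matrix([(5, "great food"), (2, "slow service")],
                    embed=lambda t: np.random.randn(64), k=10, dim=64)
```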
Experiments
We sampled the reviews for each user/item using the following methods:
● For all methods we fixed a threshold K, i.e. the number of user/item reviews to be sampled. We used K = 10 and K = 20.
● Review embeddings were built by summing the word embeddings of all the words in a review in the first three experiments.
● Review embeddings were learnt from a separate DeepCoNN network in the last experiment.
● We did a total of 4 experiments for this extension.
● One training epoch takes about 3-4 hours for each experiment.
Experiments
1. Sample K reviews using user/item reviews + global sampling
   ○ We randomly sampled K user/item reviews and sorted them in increasing order of review rating.
   ○ If the user or item had fewer than K reviews, we sampled the remainder from a global set of all reviews.
   MSE: 1.923229
   Possible issue: sampling from a global set of reviews might not be relevant for the <user, item> pair.
Experiments
2. User/item reviews + corresponding item/user reviews + global sampling
   ○ We randomly sampled K user/item reviews and sorted them in increasing order of review rating.
   ○ If the user or item had fewer than K reviews, we sampled reviews of the corresponding item/user from the training data. If the reviews were still fewer than K, we sampled the rest from a global set of all reviews.
   MSE: 1.858533
   Possible issue: most users and items have very few reviews, hence most training samples are for cold-start users/items.
Experiments
3. Filtered training data + user/item reviews + corresponding item/user reviews + global sampling
   ○ The sampling process was the same as (2), but we filtered the training set to keep only those users/items with at least K reviews.
   MSE: 1.886834
   Possible issue: training on data with no cold-start <user, item> pairs did not generalize well to the test set.
Experiments
4. Generate review embeddings using DeepCoNN
   ○ The sampling process was the same as (2).
   ○ Review embeddings were generated by a separate DeepCoNN network trained on a small sample of the training set with equal representation of each rating.
   ○ Acts as a step that produces pretrained "review embeddings".
   ○ Tried these review embeddings only with Experiment 2.
   MSE: 1.730112
   Performs the best among all experiments and the baseline.
Results for Extension 1 (MSE)

Baseline    Experiment 1 (k=10)   Experiment 2 (k=10)   Experiment 2 (k=20)   Experiment 3 (k=10)   Experiment 4 (k=10)
1.813865    1.923229              1.858533              1.891809              1.886834              1.730112
                             Baseline    Extension (k=10)   Extension (k=20)
Number of input parameters   153,600     52,480             53,760
Hours to train per epoch     ~9.5 hrs    ~4 hrs             ~5 hrs
Extension 2
● We show that TransNet can be used for tips generation on the Yelp dataset
● Inspired by the paper "Neural Rating Regression with Abstractive Tips Generation for Recommendation", Piji Li et al.
● The review latent representation learned from TransNets' transform layer is used as a context vector to generate tips
Methodology (1/3)
● The original Yelp dataset has about 400,000 data points that have both reviews and tips. We take the top 50,000 training data points.
● Train the entire TransNet for just 1 epoch.
● Transfer the output of the Transform layer for each data point in the train set as well as the test set to the RNN.
Methodology (2/3)
● Sequence length = 3
● GloVe embeddings of the most common 50,000 words in reviews.
● Added an <UNK> word to represent all words not in the vocabulary.
● Add the embeddings of 2 <UNK> vectors and the embedding of the first word of the tip -> 50-dim vector.
● Concatenated with the 64-dimension vector from TransNet that represents the corresponding review.
● The concatenated vector is fed as the input at each step, with the review vector acting as the context.
Methodology (3/3)
● We train for 500 epochs, which takes about 6 hrs.
● For test data, we used 2,000 data points from the original 121k data points.
● At test time, we concatenate the 64-dimension review vector from TransNet with the 50-dimensional representation of <UNK>.
● This generates the first word of the tip; we then use the embedding of each generated word at each time step, concatenated with the vector from TransNet.
● We sample each word based on the output probability. (A sketch of this input construction follows below.)
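A minimal sketch of the decoder input described above: at every step the 50-dim word embedding (GloVe or <UNK>) is concatenated with the fixed 64-dim review vector from TransNet's transform layer and fed to a recurrent decoder that predicts the next tip word. The GRU choice, layer sizes, and random tensors are illustrative assumptions, not the presenters' exact code.

```python
import torch
import torch.nn as nn

vocab_size, word_dim, ctx_dim, hidden = 50_000, 50, 64, 128

embedding = nn.Embedding(vocab_size, word_dim)
rnn = nn.GRU(word_dim + ctx_dim, hidden, batch_first=True)
out = nn.Linear(hidden, vocab_size)

word_ids = torch.randint(0, vocab_size, (1, 3))        # sequence length 3, as on the slide
context = torch.randn(1, ctx_dim)                       # TransNet transform-layer output

# Concatenate the per-step word embeddings with the (repeated) review context vector.
inputs = torch.cat([embedding(word_ids),
                    context.unsqueeze(1).expand(-1, word_ids.size(1), -1)], dim=-1)
outputs, _ = rnn(inputs)
next_word_probs = torch.softmax(out(outputs[:, -1]), dim=-1)   # sample the next tip word
```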
Generated Tips
made place to beer . . . this amazing i i
they great place my to sushi onion spot in dog .
tea pizza . . week ' chicken items place very service
sale , great . this worth to cardio were this don
lots spot hair some is they be but no amazing tim
Baseline and Evaluation
● We use LexRank as our baseline.
● LexRank produces a summary of the whole review.
● We calculate ROUGE-1 and ROUGE-2 as our evaluation measures.

            LexRank                                TransNet + RNN
Score       Precision   Recall      F-1 Score      Precision   Recall      F-1 Score
ROUGE-1     0.0694001   0.0451601   0.0456172      0.0242294   0.0292329   0.0239379
ROUGE-2     0.0025476   0.0013569   0.0014726      0.0003033   0.0003055   0.0002377
Conclusion & Future Work
1. Pre-trained review embeddings gave the highest boost
2. We are still not very sure about the best way to sample the K reviews of the user/item
and we want to investigate further how review embeddings change our results
3. There is no analysis on robustness to temporal change
4. The absolute value of MSE is very high
5. Future Work:
a. Combine temporal signals and use global, user and item biases
b. Extend the Transnet model to use implicit feedback and ranking prediction
c. Evaluate Tip Generation against more baselines
Questions
Neural Collaborative Filtering [He et al. 2017] Extensions
Kulshreshth Dhiman, Sai Kolasani
Overview
● Extensions to Neural Collaborative Filtering [He et al. 2017]
● Extensions
  1. Pairwise ranking
  2. Cold start
  3. Experiments with architecture
● Dataset
  ○ MovieLens 20M user-item interactions (converted to implicit feedback)
  ○ Movie features from IMDB
Model
GMF:
● Uses the inner product of the user and item representations in the latent space.
MLP:
● A multi-layer perceptron network using the concatenation of the user and item latent representations as the input feature.
NeuMF:
● Combines GMF and MLP into a single deep network for better accuracy.
Pairwise Ranking Model
GMF pairwise model (shared weights)
● The pairwise networks use shared weights and a shared user embedding.
● The objective function is modified to maximize the difference between the score of a preferred item and a non-preferred item.
● During evaluation we tried two approaches:
  ○ Take the output from the final sigmoid layer (calculating the sigmoid of the difference of scores)
  ○ Take the sigmoid of the output from the linear dense layer just before the sigmoid - this works better.
● Pairwise versions of the other models are designed similarly.
Pairwise Ranking Algorithm
● We can find the position of the positive item in the ranking efficiently using batch prediction (see the sketch below).
  ○ We find the pairwise ranking of the positive item against all the negative items (N) in a single batch and count the number of times the positive item is preferred (k).
  ○ The rank of the positive item is then N - k.
● If the exact ranking of K items is needed, a heap-based algorithm can be used.
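A minimal sketch of the batch-prediction trick above. The `pairwise_score` callable stands in for the trained pairwise GMF model (its signature is an assumption for illustration), and the dummy scorer only demonstrates the shapes involved.

```python
import numpy as np

def rank_of_positive(pairwise_score, pos_item, neg_items, user):
    """Run the pairwise model on (positive, negative) pairs in one batch and count wins.

    pairwise_score(user, items_a, items_b) -> array of P(item_a preferred over item_b)
    Returns the rank of the positive item among the negatives (0 = best).
    """
    n = len(neg_items)
    wins = (pairwise_score(user, np.repeat(pos_item, n), np.asarray(neg_items)) > 0.5).sum()
    return n - wins

# Hypothetical usage with a dummy scorer standing in for the trained model.
dummy = lambda u, a, b: np.random.rand(len(b))
rank = rank_of_positive(dummy, pos_item=42, neg_items=list(range(99)), user=7)
```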
Cold Start Models
● GMF cold start model
● MLP cold start model
● NeuMF cold start model
Dataset
● MovieLens 20M* dataset with user movie ratings from 2012 to 2015, converted to implicit feedback data.
● Sampling
  ○ Movies released after 1990
  ○ At least 20 items per user
  ○ Randomly sampled 7000 users
*https://grouplens.org/datasets/movielens/20m/
Data Statistics
#users 7000
#items 8491
#ratings 724K
Sparsity 98.78%
Items per user Min:20, Max:1980, Median:59
Users per item Min:1, Max:3609, Median:12
Item Features
● Collected item features using the http://www.theimdbapi.org/ API
● From the data extracted from IMDB we used the following features:
  ○ Year of release: binned years (5 years) and then one-hot encoding
  ○ Genre: many-hot encoding over 24 different genres
  ○ Text features: we used the gensim doc2vec library to learn vector representations of the storylines of the movies in the training set (a sketch follows below)
● Randomly sampled 10% of items as cold-start items; the rest go into the training set
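A minimal sketch of the doc2vec text-feature step, assuming gensim's Doc2Vec with each movie's storyline tagged by a movie id. The storylines, ids, and vector size below are placeholders, not the project's real data or settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy storylines; in the project these come from the IMDB API response per movie.
storylines = {
    "tt0111161": "two imprisoned men bond over a number of years",
    "tt0068646": "the aging patriarch of an organized crime dynasty",
}

docs = [TaggedDocument(words=text.split(), tags=[movie_id])
        for movie_id, text in storylines.items()]

model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=40)

# One vector per movie, used alongside genre and year features for cold-start items.
storyline_vec = model.dv["tt0111161"]
```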
Train-Test Set
● Test set: the latest 2 positive user-item pairs per user
● Test set - cold start (completely new items): no user-item pairs in the train set
● Test set - pseudo cold start (relatively new items): 10 positive user-item pairs in the train set, the rest in the test set
● Train set: #negatives per positive user-item pair = 4
Evaluation
● For evaluating the performance of the model we use:
  ○ HR: Hit Rate@10
  ○ NDCG: Normalized Discounted Cumulative Gain (NDCG@10), with gain log(2)/log(rank + 2)
  ○ AUC
● Randomly sampled 99 negative user-item pairs (not in the train set) and ranked the positive item among the negative items
Results - Base
[Figures]
Results - Pairwise
● Pointwise ranking performed better than pairwise ranking
Results - Cold Start
● Cold start models outperformed the base models
● Item features improved performance over the general test set as well
● MLP models had relatively high hit rates for cold-start items
● The NeuMF cold start model had the highest AUC
Architecture Experiments
● NeuMF: shared embeddings for the GMF and MLP models
  ○ GMF and MLP learn different latent spaces
● GMF: add a dense layer after the MF layer (dim = latent_size/2)

            HR       NDCG     AUC
Separate    0.8039   0.5021   0.9288
Shared      0.7962   0.4975   0.9261

HR@10
PF    GMF with dense layer   Base GMF
8     0.7981                 0.7986
16    0.7999                 0.8046
32    0.7999                 0.8046
Conclusion
● The pointwise ranking model works better than the pairwise ranking model
● Item features like storyline, genre and year improve hit rates for cold-start as well as non-cold-start items
● GMF had higher hit rates for non-cold-start items, and MLP had higher hit rates for cold-start items
Questions? Thank you!
Jointly Modeling Aspects, Ratings and Sentiments for Movie Recommendation (JMARS)
Presented By: Rishabh Misra, Tushar Bansal
Problem Statement
● Motivation: Uncovering aspects and sentiments from reviews could provide a better understanding of users, movies (items), and the process involved in generating ratings.
● Approach: Capture the interest distribution of users and the content distribution for movies and provide a link between interest and relevance on a per-aspect basis. Authors also differentiate between positive and negative sentiments on a per-aspect basis. This all leads to better rating prediction.
Model
Algorithm
● Objective:
● EM Algorithm
● E-Step: sample {y, z, s} for each word from the current distribution
● M-Step:
  ○ Fix the sampled {y, z, s} for each word
  ○ Optimize the other parameters using L-BFGS
Data
Original paper: IMDB dataset
● 54,671 users | 22,380 movies | 348,415 reviews
Our Implementation:
● Amazon Clothing Category Dataset○ 1981 Users | 1962 Items | 11935 Reviews
● Amazon Instant Video Dataset○ 2000 Users | 1643 Items | 14355 Reviews
● We opted for small datasets because the inference of JMARS on large number of reviews is computationally expensive and time intensive, and we spent most of our time implementing the original method.
Extension
● Add temporal dynamics to the user latent factors, biases and interest distribution.
● Idea borrowed from Collaborative Filtering with Temporal Dynamics (Koren, 2009); one common form of that drift is recalled below.
● This formulation doesn't lead to a significant increase in parameters.
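For reference, the linear time-drift form from Koren (2009) that extensions of this kind typically follow (our notation; the slides do not show their exact parameterization):

```latex
\mathrm{dev}_u(t) = \operatorname{sign}(t - t_u)\,\lvert t - t_u \rvert^{\beta},
\qquad b_u(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t)
```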
Quantitative Results

Amazon Clothing Data (MSE)
Model                 Without Temporal Dynamics   With Temporal Dynamics   Improvement
Baseline              1.1505                      1.1420                   0.74%
JMARS (A=6; K=5)      1.1251                      1.1152                   0.88%
JMARS (A=12; K=5)     1.1244                      1.1150                   0.84%

Baseline: JMARS without language models (i.e. a simple latent factor model).
Evaluation metric: MSE

Amazon Video Data (MSE)
Model                 Without Temporal Dynamics   With Temporal Dynamics   Improvement
Baseline              1.1269                      1.1170                   0.88%
JMARS (A=6; K=5)      1.0945                      1.0843                   0.93%
Qualitative Results
● Background words
  ○ Price, Product, Picture, Fit, Wear, Quality, Purchase, Material
● General sentiment words
  ○ Positive: Comfort, Nice, Well, Love, Buy, Good, Great, Pretty
  ○ Negative: Problem, Waste, Flaw, Review, Nothing, Worst
● Aspect words
  ○ Material/Color: Color, Material, Elastic, Light, Care, Weather
  ○ Size/Fit: Tight, Wear, Comfort, 8/10 (shoe sizes), Inch
Qualitative Results
● Aspect sentiment words
  ○ Material/Color: Great, Design, Soft, Quality, Durable, Cheap
  ○ Size/Fit: Shrink, True, Doesn't/Don't, Small, Thick
● Item-specific words
  ○ Item 1: Bag, Compartment, Pocket, Purse
  ○ Item 2: Shoe, Clarks, Merrell, Timberland
Temporal Effect
Interest distribution change for the aspect material/color.
Date: 06/11/2013
My hubby is hard on his shoes, so I like to find him good ones at a reduced price, such as these. He likes the fit and feel of New Balance, so these will be his next pair when his current ones are too tattered to wear anymore. Good grippy sole for our rocky western trails, and decent laces that shouldn’t break with his hard use.
Date: 04/02/2014
Thanks to another reviewer I got the green ones instead of the raspberry. The green insoles have just the right arch support for my plantar fasciitis-ridden feet. I am glad to have them in my everyday Merrell slip on shoes. These insoles are not too soft, but soft enough, and after just one day of wear I don't notice them at all, which is perfect. Based on the last pair (I had the raspberry) I expect about a year from these, but will happily accept a longer wear time from them.
Conclusion and Future Work
● The extension did improve on the current model, but only by a small amount.
● The reasons for only a small improvement could be:
  ○ The dataset we use is relatively small (because of limited resources), with few reviews for each user, so the temporal dynamics might not be learned properly.
  ○ The linear time function might not be the best way to capture the temporal dynamics across different aspects. Other options like binning might work better.
● Future work: add hierarchical structuring to the language models.
Questions?
TransNets++: Learning to Translate Better by Accounting for Higher Order Interactions
Sejal Shah, Siddharth Dinesh
Goal
What effect does the inclusion of higher order interactions have on a complex feature extraction mechanism such as TransNets?
Motivation
● Neural networks are predominantly used for preprocessing of data in recommender systems
● Neural factorization machines have not been evaluated in settings where the features are neurally extracted
TransNets
Factorization Machines
Neural Factorization Machine vs. Plain Old Factorization Machine
Implementation of the paper
1. Data: Yelp Dataset 2017
   a. 4.7 million reviews
   b. The TransNets paper uses only 4.1 million reviews: the filtering criteria are unclear
2. Results
   a. Our implementation resulted in an MSE of 1.7559 (random epochs, filtered reviews)
   b. Used the result from the TransNet implementation as our baseline
Extension 1: L2 Loss
● TransNets optimizes the Factorization Machine using L1 loss.
● We report MSE, so it makes sense to optimize L2 loss directly.

Extension 2: Batch Normalization
● Batch normalization is new-age alchemy to induce faster convergence of SGD.

Extension 3: Neural Factorization Machines (a sketch follows below)
● Added neural layers to the factorization machine
● Experimented with 0, 1, and 2 hidden layers
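A minimal sketch of a neural factorization machine layer of the kind Extension 3 describes: the pairwise feature interactions are pooled with the standard bi-interaction identity and passed through configurable hidden layers. The class name, sizes, and input below are illustrative, not the presenters' implementation.

```python
import torch
import torch.nn as nn

class TinyNFM(nn.Module):
    """Bi-interaction pooling 0.5 * ((sum_i v_i x_i)^2 - sum_i (v_i x_i)^2),
    followed by 0, 1 or 2 hidden layers and a linear term."""
    def __init__(self, n_features, k=16, hidden=(64,)):
        super().__init__()
        self.v = nn.Parameter(torch.randn(n_features, k) * 0.01)   # factor matrix
        self.linear = nn.Linear(n_features, 1)
        layers, d = [], k
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):                       # x: (batch, n_features)
        vx = x @ self.v                         # sum_i v_i x_i
        vx2 = (x ** 2) @ (self.v ** 2)          # sum_i (v_i x_i)^2
        bi = 0.5 * (vx ** 2 - vx2)              # bi-interaction pooling
        return self.linear(x).squeeze(-1) + self.mlp(bi).squeeze(-1)

# Hypothetical usage on a batch of 4 neurally extracted 128-dim feature vectors.
pred = TinyNFM(n_features=128)(torch.randn(4, 128))
```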
Conclusions
● The number of training epochs is important when comparing results
● The way training epoch batches are created produces variance in the MSE of TransNet predictions, as TransNet only considers 1000 words from the reviews
● The Neural Factorization Machine only slightly improves predictions when the input is already constructed using non-linear transformations
● Would NFM improve rating prediction if one-hot embeddings of users and items were also served as input to the Factorization Machine?
  ○ Like in TransNet-Ext
● How much do these results depend on the dataset?
  ○ Confirm the lack of improvement from NFM using another dataset:
    ■ Google Local
    ■ Amazon Reviews
Questions?
Efficient Bayesian Methods for Graph-based Recommendation Systems
Aditi Mavalankar, Stephanie Chen and Ajitesh Gupta
Original Model Overview
● Authors proposed a fast graph based method for general purpose recommendation
● It scores all items available on a 3-step path from the user in order to provide new recommendations.
● Scoring is done by making use of probability distributions based on the item ratings
[Diagram: a 3-step path — (1) target user → item 1, (2) item 1 → user X, (3) user X → item 2, a potential recommendation]
Original Model Overview - Reliability of an Item
● Binary random variable Yj = 0 for a negative assessment, 1 for a positive assessment
● P(Yj = 1) = θj ~ reliability of the item, modelled with a Beta distribution
∴ θj | Ratings ~ Beta(R+, R-) (conjugate distributions)
● R+ = number of positive ratings
● R- = number of negative ratings
Original Model Overview - Scoring Functions (a sketch follows below)
● Posterior Inequality Scoring (PIS): the probability that the reliability of candidate item x is greater than the reliability of item v in the user's history.
● Posterior Prediction Scoring (PPS): the probability of both v and x receiving positive assessments, assuming Yv and Yx are independent.
● Posterior Odds Ratio Scoring (PORS): how large the odds of x receiving a positive assessment are compared to the odds of v receiving a positive assessment.
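A minimal sketch of how PIS and PPS could be computed from the Beta posteriors defined above. PPS uses the posterior means directly; PIS is estimated here by Monte Carlo sampling (the paper may use a closed form), and the rating counts are made up. Prior pseudocounts are omitted, matching the slide's Beta(R+, R-).

```python
import numpy as np
from scipy import stats

def scores(pos_v, neg_v, pos_x, neg_x, n_samples=10_000, seed=0):
    """PPS = E[theta_v] * E[theta_x]; PIS = P(theta_x > theta_v), estimated by sampling."""
    rng = np.random.default_rng(seed)
    theta_v = stats.beta(pos_v, neg_v)
    theta_x = stats.beta(pos_x, neg_x)
    pps = theta_v.mean() * theta_x.mean()
    pis = np.mean(theta_x.rvs(n_samples, random_state=rng) >
                  theta_v.rvs(n_samples, random_state=rng))
    return pis, pps

# Item v: 30 positive / 10 negative ratings; candidate x: 12 positive / 2 negative.
pis, pps = scores(30, 10, 12, 2)
```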
[Diagram: target user → item V → user W → item X (potential recommendation)]
Positives of the original model
● Existing approaches often use random walks
  ○ A large number of transition matrices must be stored
  ○ Large matrix multiplication operations
  ○ A large number of simulations to converge in some cases
● No matrix multiplications ⇒ 1-2 orders of magnitude faster
● No large matrices to store ⇒ much lower space complexity
Negatives of original model - Motivation for extensions
● It does not involve user information in the recommendation process
  ○ Binary interactions - how similar are users to each other?
  ○ Unary information - how experienced is the user? How many items have they rated before?
● It does not involve the binary interactions between items either
  ○ How similar are two items?
Extension 1 - User Reliability Score
● Users that give ratings to more items are more significant.
● We generate a reliability score for each user, and multiply each item's PIS/PPS/PORS score by it to determine whether the item ought to be recommended.
[Diagram: target user → item V → user W → item X, with Modified_score(Ix) = Rel(Uw) * Score(Ix)]
Extension 2 - User Similarity Score
[Worked example: users U1 and U2 interact with items I1-I4; they share 2 items in common out of 4 total, so similarity_score = similarity / total common items = 2/4 = 0.5]
Extension 3 - Item similarity score
User similarity and item similarity
[Figures: user similarity heatmap, item similarity heatmap]
Evaluation metrics (formula slides): Mean Average Precision, Mean Reciprocal Rank, Precision@5, Precision@10, Normalized Discounted Cumulative Gain@5, Normalized Discounted Cumulative Gain@10
Results on ML-100k
Method MAP MRR P@5 P@10 NDCG@5 NDCG@10
PIS 0.1459 0.4173 0.2049 0.1654 0.2482 0.2310
PIS_USS 0.1472 0.4209 0.2049 0.1667 0.2486 0.2319
PIS_ISS 0.1476 0.4266 0.2023 0.1653 0.2464 0.2307
PIS_USS_ISS 0.1479 0.4264 0.2023 0.1657 0.2465 0.2309
PPS 0.1531 0.4213 0.2102 0.1724 0.2546 0.2410
PPS_USS 0.1546 0.4235 0.2106 0.1743 0.2554 0.2432
PPS_ISS 0.1546 0.4304 0.2095 0.1727 0.2544 0.2412
PPS_USS_ISS 0.1558 0.4317 0.2076 0.1735 0.2534 0.2415
PORS 0.1147 0.2949 0.1525 0.1330 0.1694 0.1643
PORS_USS 0.1149 0.2931 0.1529 0.1326 0.1693 0.1639
PORS_ISS 0.1188 0.3054 0.1540 0.1372 0.1737 0.1704
PORS_USS_ISS 0.1195 0.3038 0.1559 0.1384 0.1751 0.1716
Conclusion
● User-user similarity is observed to be more useful than item-item similarity.
● Introducing either kind of similarity improves the quality of recommendations.
● The user reliability score proves to be too naive, and hence provides no improvement.
● PPS remains the top performer among the scoring techniques.
● Since the results are consistent on FilmTrust as well as ML-100k, it is safe to say that similar results would be exhibited on the other 5 datasets used in the original paper.
● Future work: different algorithms to calculate user and item similarities.
THANK YOU!