Deep Learning for Recommender Systems
Alexandros Karatzoglou (Scientific Director @ Telefonica Research) alexk@tid.es, @alexk_z
Balázs Hidasi (Head of Research @ Gravity R&D) balazs.hidasi@gravityrd.com, @balazshidasi
RecSys’17, 29 August 2017, Como
Why Deep Learning?
[Figure: ImageNet challenge error rates over the years; red line = human performance]
Why Deep Learning?
Complex Architectures
Neural Networks are Universal Function Approximators
Inspiration for Neural Learning
Neural Model
Neuron a.k.a. Unit
Feedforward Multilayered Network
Learning
Stochastic Gradient Descent
• Generalization of (Stochastic) Gradient Descent
Stochastic Gradient Descent
Backpropagation
Backpropagation
• Does not work well in plain (“normal”) multilayer deep networks
• Vanishing gradients • Slow learning
• SVMs easier to train • 2nd Neural Winter
Modern Deep Networks
• Ingredients:
• Rectified Linear activation function a.k.a. ReLU
Modern Deep Networks • Ingredients:
• Dropout:
Modern Deep Networks • Ingredients:
• Mini-batches: – Stochastic Gradient Descent – Compute the gradient over many (50-100) data points (a mini-batch) and update.
Modern Feedforward Networks • Ingredients: • Adagrad a.k.a. adaptive learning rates (a sketch combining these ingredients follows below)
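A minimal sketch combining the listed ingredients, assuming PyTorch (layer sizes and data are invented stand-ins, not from the tutorial):

```python
# ReLU activations, dropout, mini-batch updates, and Adagrad in one tiny model.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # adaptive learning rates
loss_fn = nn.CrossEntropyLoss()

# One update on a mini-batch of 64 random examples (stand-ins for real data).
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()   # backpropagation
optimizer.step()  # one step on the mini-batch gradient
```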
• Feature extraction directly from the content • Image, text, audio, etc. • Instead of metadata • For hybrid algorithms
• Heterogeneous data handled easily • Dynamic/sequential behaviour modeling with RNNs • More accurate representation learning of users and items
• Natural extension of CF & more • RecSys is a complex domain
• Deep learning worked well in other complex domains • Worth a try
Deep Learning for RecSys
• As of summer 2017, main topics: • Learning item embeddings • Deep collaborative filtering • Feature extraction directly from content • Session-based recommendations with RNNs
• And their combinations
Research directions in DL-RecSys
• Start simple • Add improvements later
• Optimize code • GPU/CPU optimizations may differ
• Scalability is key • Open source code • Experiment (also) on public datasets • Don’t use very small datasets • Don’t work on irrelevant tasks, e.g. rating prediction
Best practices
Item embeddings & 2vec models
Embeddings
• Embedding: a (learned) real-valued vector representing an entity – Also known as:
• Latent feature vector • (Latent) representation
– Similar entities’ embeddings are similar • Use in recommenders: – Initialization of the item representation in more advanced algorithms
– Item-to-item recommendations
Matrix factorization as learning embeddings
• MF: user & item embedding learning – Similar feature vectors
• Two items are similar • Two users are similar • User prefers item
– MF representation as a simple neural network • Input: one-hot encoded user ID • Input-to-hidden weights: user feature matrix • Hidden layer: user feature vector • Hidden-to-output weights: item feature matrix • Output: preference (of the user) over the items
[Figure: R ≈ U·I as a network: the one-hot user ID (0,0,…,0,1,0,0,…,0) is multiplied by W_U to give the user feature vector, which is multiplied by W_I to give the predicted preferences r_{u,1}, r_{u,2}, …, r_{u,N} over all items]
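A minimal sketch of this view, assuming PyTorch (sizes are invented): the one-hot-input-to-hidden weights are exactly an embedding lookup, and the hidden-to-output weights are the item embedding matrix.

```python
# MF as embedding learning: user lookup + dot products with all item vectors.
import torch
import torch.nn as nn

n_users, n_items, dim = 1000, 500, 32
user_emb = nn.Embedding(n_users, dim)  # rows: user feature vectors (W_U)
item_emb = nn.Embedding(n_items, dim)  # rows: item feature vectors (W_I)

u = torch.tensor([42])                     # a user ID (one-hot lookup)
scores = user_emb(u) @ item_emb.weight.T   # predicted preference over all items
```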
Word2Vec • [Mikolov et al., 2013a] • Representation learning of words • Shallow model • Data: (target) word + context pairs – Sliding window on the document – Context = words near the target
• In the sliding window • 1-5 words in both directions
• Two models – Continuous Bag of Words (CBOW) – Skip-gram
Word2Vec - CBOW • Continuous Bag of Words • Maximizes the probability of the target word given the context • Model
– Input: one-hot encoded words – Input-to-hidden weights
• Embedding matrix of words – Hidden layer
• Sum of the embeddings of the words in the context – Hidden-to-output weights – Softmax transformation
• Smooth approximation of the max operator • Highlights the highest value
• $s_i = \frac{e^{r_i}}{\sum_{j=1}^{N} e^{r_j}}$ ($r_j$: scores)
– Output: likelihood of the words of the corpus given the context • Embeddings are taken from the input-to-hidden matrix
– The hidden-to-output matrix also contains item representations (but these are not used)
[Figure: CBOW: the one-hot context words word(t-2), word(t-1), word(t+1), word(t+2) are embedded via a shared matrix E, averaged, and fed through W with a softmax to produce p(w_i|c), the likelihood of the target word(t)]
Word2Vec - Skip-gram
• Maximizes the probability of the context, given the target word
• Model – Input: one-hot encoded word – Input-to-hidden matrix: embeddings – Hidden state
• Item embedding of the target – Softmax transformation – Output: likelihood of the context words (given the input word)
• Reported to be more accurate
[Figure: Skip-gram: the one-hot target word(t) is embedded via E and fed through W with a softmax to produce p(w_i|c) for the context words word(t-2), word(t-1), word(t+1), word(t+2)]
Geometry of the Embedding Space
King − Man + Woman = Queen
Paragraph2vec, doc2vec
• [Le & Mikolov, 2014] • Learns representations of paragraphs/documents
• Based on the CBOW model • The paragraph/document embedding is added to the model as global context
[Figure: paragraph2vec: the paragraph ID is embedded via P into p_i and averaged together with the context word embeddings (via E) before the classifier predicts word(t)]
...2vec for Recommendations
Replace words with items in a session/user profile (a minimal sketch follows the figure below)
[Figure: the CBOW diagram with items: the context items item(t-2), item(t-1), item(t+1), item(t+2) are embedded, averaged, and used to predict item(t)]
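Since the substitution is literally "item IDs instead of words", off-the-shelf word2vec tooling can be reused. A sketch assuming gensim (the sessions are toy data): skip-gram with negative sampling (SGNS) trained on sessions, with item-to-item recommendations read off via nearest neighbors.

```python
# item2vec-style training: sessions as "sentences", item IDs as "words".
from gensim.models import Word2Vec

sessions = [  # toy sessions; real data would be click/purchase logs
    ["item_1", "item_7", "item_3"],
    ["item_7", "item_3", "item_9", "item_1"],
]
model = Word2Vec(sessions, vector_size=64, window=3, sg=1, negative=5, min_count=1)
similar = model.wv.most_similar("item_7")  # item-to-item recommendations
```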
Prod2Vec • [Grbovic et al., 2015] • Skip-gram model on products
– Input: the i-th product purchased by the user – Context: the other purchases of the user
• Bagged prod2vec model – Input: products purchased in one basket by the user
• Basket: sum of the product embeddings – Context: the other baskets of the user
• Learning user representations – Follows paragraph2vec – User embedding added as global context – Input: user + products purchased except for the i-th – Target: the i-th product purchased by the user
• [Barkan & Koenigstein, 2016] later proposed the same model as item2vec – Skip-gram with Negative Sampling (SGNS) applied to event data
Prod2Vec
[Grbovic et al., 2015]
[Figure: prod2vec skip-gram model on products]
Bagged Prod2Vec
[Grbovic et al., 2015]
[Figure: bagged prod2vec model updates]
User-Prod2Vec
[Grbovic et al., 2015]
[Figure: user embeddings for user-to-product predictions]
Utilizing more information • Meta-Prod2vec [Vasile et al., 2016]
– Based on the prod2vec model – Uses item metadata
• Embedded metadata • Added to both the input and the context
– Losses between: target/context item/metadata • The final loss is the combination of 5 of these losses
• Content2vec [Nedelec et al., 2017] – Separate modules for multimodal information
• CF: Prod2vec • Image: AlexNet (a type of CNN) • Text: Word2Vec and TextCNN
– Learns pairwise similarities • Likelihood of two items being bought together
[Figure: Meta-Prod2vec: item(t) is embedded via I into i_t and meta(t) via M into m_t; classifiers predict the surrounding items and metadata, e.g. item(t-2), meta(t-1), item(t+1), meta(t+1), as well as item(t) itself]
References
• [Barkan & Koenigstein, 2016] O. Barkan, N. Koenigstein: ITEM2VEC: Neural Item Embedding for Collaborative Filtering. IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP 2016).
• [Grbovic et al., 2015] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp: E-commerce in Your Inbox: Product Recommendations at Scale. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15).
• [Le & Mikolov, 2014] Q. Le, T. Mikolov: Distributed Representations of Sentences and Documents. 31st International Conference on Machine Learning (ICML 2014).
• [Mikolov et al., 2013a] T. Mikolov, K. Chen, G. Corrado, J. Dean: Efficient Estimation of Word Representations in Vector Space. ICLR 2013 Workshop.
• [Mikolov et al., 2013b] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean: Distributed Representations of Words and Phrases and Their Compositionality. 26th Advances in Neural Information Processing Systems (NIPS 2013).
• [Morin & Bengio, 2005] F. Morin, Y. Bengio: Hierarchical Probabilistic Neural Network Language Model. International Workshop on Artificial Intelligence and Statistics, 2005.
• [Nedelec et al., 2017] T. Nedelec, E. Smirnova, F. Vasile: Specializing Joint Representations for the Task of Product Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017).
• [Vasile et al., 2016] F. Vasile, E. Smirnova, A. Conneau: Meta-Prod2Vec – Product Embeddings Using Side-Information for Recommendations. 10th ACM Conference on Recommender Systems (RecSys’16).
Deep Collaborative Filtering
CF with Neural Networks • Natural application area • Some exploration during the Netflix prize • E.g.: NSVD1 [Paterek, 2007]
– Asymmetric MF – The model:
• Input: sparse vector of interactions – Item-NSVD1: ratings given for the item by users
» Alternatively: metadata of the item – User-NSVD1: ratings given by the user
• Input-to-hidden weights: “secondary” feature vectors • Hidden layer: item/user feature vector • Hidden-to-output weights: user/item feature vectors • Output:
– Item-NSVD1: predicted ratings on the item by all users – User-NSVD1: predicted ratings of the user on all items
– Training with SGD – Implicit counterpart by [Pilászy & Tikk, 2009] – No non-linearities in the model
[Figure: User-NSVD1: ratings of the user → secondary feature vectors → user features → item feature vectors → predicted ratings]
Restricted Boltzmann Machines (RBM) for recommendation
• RBM – Generative stochastic neural network – Visible & hidden units connected by (symmetric) weights
• Stochastic binary units • Activation probabilities:
– $p(h_j = 1 \mid v) = \sigma\left(b_j^h + \sum_{i=1}^{N} w_{i,j} v_i\right)$
– $p(v_i = 1 \mid h) = \sigma\left(b_i^v + \sum_{j=1}^{M} w_{i,j} h_j\right)$
– Training • Set visible units based on the data • Sample hidden units • Sample visible units • Modify weights so that the configuration of the visible units approaches the data
• In recommenders [Salakhutdinov et al., 2007] – Visible units: ratings on the movies
• Softmax unit – Vector of length 5 (one for each rating value) in each unit – Ratings are one-hot encoded
• Units corresponding to movies not rated by the user are ignored – Hidden binary units
[Figure: RBM for CF: visible units v_1 … v_5 (one-hot encoded ratings r_i on a 1-5 scale) connected to hidden binary units h_1 … h_3]
Deep Boltzmann Machines (DBM) • Layer-wise training – Train the weights between visible and hidden units in an RBM
– Add a new layer of hidden units
– Train the weights connecting the new layer to the network • All other weights (e.g. visible-hidden weights) are fixed
[Figure: layer-wise DBM training: first train the visible-hidden RBM (v_1 … v_5, h^1_1 … h^1_3); then add hidden layer 2 (h^2) and train its weights while the lower weights are fixed; then add layer 3 (h^3) and repeat]
Autoencoders
• Autoencoder – One hidden layer – Same number of input and output units – Tries to reconstruct the input on the output – Hidden layer: compressed representation of the data
• Constraining the model improves generalization – Sparse autoencoders
• Activations of units are limited • Activation penalty • Requires the whole training set to compute
– Denoising autoencoders [Vincent et al., 2008] • Corrupt the input (e.g. set random values to zero) • Restore the original on the output
• Deep version – Stacked autoencoders – Layer-wise training (historically) – End-to-end training (more recently)
[Figure: denoising autoencoder: data → corrupted input → hidden layer → reconstructed output, compared against the original data]
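The denoising idea transfers directly to interaction data, as in the models on the next slides. A minimal sketch assuming PyTorch (sizes, corruption rate, and data are invented; this is not any paper's exact model):

```python
# Denoising autoencoder over user interaction vectors: corrupt the input by
# zeroing random entries, then reconstruct the original vector on the output.
import torch
import torch.nn as nn

n_items, hidden = 1000, 64
encoder = nn.Sequential(nn.Linear(n_items, hidden), nn.ReLU())
decoder = nn.Linear(hidden, n_items)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = (torch.rand(32, n_items) < 0.05).float()          # toy binary interaction vectors
corrupted = x * (torch.rand_like(x) > 0.3).float()    # drop ~30% of the interactions
recon = decoder(encoder(corrupted))
loss = nn.functional.binary_cross_entropy_with_logits(recon, x)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```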
Autoencoders for recommendation
• Reconstruct corrupted user interaction vectors – CDL [Wang et al., 2015]: Collaborative Deep Learning. Uses Bayesian stacked denoising autoencoders. Uses tags/metadata instead of the item ID.
Autoencoders for recommendation
• Reconstruct corrupted user interaction vectors – CDAE [Wu et al., 2016]: Collaborative Denoising Auto-Encoder.
Additional user node on the input and bias node beside the hidden layer.
Recurrent autoencoder
• CRAE [Wang et al., 2016] – Collaborative Recurrent Autoencoder – Encodes text (e.g. movie plot, review) – Autoencoding with RNNs • Encoder-decoder architecture • The input is corrupted by replacing words with a designated BLANK token
– CDL model + text encoding simultaneously • Joint learning
Deep CF methods
• MV-DNN [Elkahky et al., 2015] – Multi-domain recommender – Separate feedforward networks for users and items per domain (D+1 networks) • Features are first embedded • Then run through several layers
Deep CF methods
• TDSSM [Song et al., 2016] • Temporal Deep Semantic Structured Model • Similar to MV-DNN • User features are the combination of a static and a temporal part • The time-dependent part is modeled by an RNN
Deep CF methods
• Coevolving features [Dai et al., 2016] • Users’ tastes and items’ audiences change over time • User/item features depend on time and are composed of
• Time drift vector • Self evolution • Co-evolution with items/users • Interaction vector • The feature vectors are learned by RNNs
Deep CF methods
• Product Neural Network (PNN) [Qu et al., 2016] – For CTR estimation – Embed features – Pairwise layer: all pairwise combinations of embedded features
• Like Factorization Machines • Outer/inner products of feature vectors or both
– Several fully connected layers
• CF-NADE [Zheng et al., 2016]
– Neural Autoregressive Collaborative Filtering – User events → preference (0/1) + confidence (based on occurrence) – Reconstructs some of the user events based on others (not the full set)
• Random ordering of user events • Reconstruct preference i based on the preferences and confidences up to i-1
– Loss is weighted by confidences
Applications: app recommendations
• Wide & Deep Learning [Cheng et al., 2016] • Ranking of results matching a query • Combination of two models (sketched below)
– Deep neural network • On embedded item features • “Generalization”
– Linear model • On embedded item features • And cross products of item features • “Memorization”
– Joint training – Logistic loss
• Improved online performance – +2.9% deep over wide – +3.9% deep+wide over wide
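A hedged sketch of the wide + deep combination, assuming PyTorch (feature sizes and layer widths are invented; the point is the shared logit and joint training, not the paper's exact configuration):

```python
# Linear "wide" part on (cross-product) features + "deep" DNN on embedded
# features; their outputs are summed into one logit and trained jointly.
import torch
import torch.nn as nn

n_wide, n_emb, emb_dim = 10000, 5000, 32
wide = nn.Linear(n_wide, 1)                    # memorization
deep = nn.Sequential(                          # generalization
    nn.Linear(emb_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
embedding = nn.EmbeddingBag(n_emb, emb_dim)    # embeds sparse ID features

x_wide = torch.rand(8, n_wide)                 # toy cross-product features
ids = torch.randint(0, n_emb, (8, 4))          # toy categorical features
logit = wide(x_wide) + deep(embedding(ids))    # joint prediction
loss = nn.functional.binary_cross_entropy_with_logits(
    logit.squeeze(1), torch.ones(8))           # logistic loss on toy labels
```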
Applications: video recommendations
• YouTube Recommender [Covington et al., 2016] – Two networks – Candidate generation
• Recommendations as classification – Items clicked / not clicked when they were recommended
• Feedforward network on many features – Average watch embedding vector of the user (last few items) – Average search embedding vector of the user (last few searches) – User attributes – Geographic embedding
• Negative item sampling + softmax – Reranking
• More features – Actual video embedding – Average video embedding of watched videos – Language information – Time since last watch – Etc.
• Weighted logistic regression on top of the network
References
• [Cheng et al., 2016] HT. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah: Wide & Deep Learning for Recommender Systems. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).
• [Covington et al., 2016] P. Covington, J. Adams, E. Sargin: Deep Neural Networks for YouTube Recommendations. 10th ACM Conference on Recommender Systems (RecSys’16).
• [Dai et al., 2016] H. Dai, Y. Wang, R. Trivedi, L. Song: Recurrent Co-Evolutionary Latent Feature Processes for Continuous-time Recommendation. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).
• [Elkahky et al., 2015] A. M. Elkahky, Y. Song, X. He: A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. 24th International Conference on World Wide Web (WWW’15).
• [Paterek, 2007] A. Paterek: Improving Regularized Singular Value Decomposition for Collaborative Filtering. KDD Cup and Workshop 2007.
• [Pilászy & Tikk, 2009] I. Pilászy, D. Tikk: Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata. 3rd ACM Conference on Recommender Systems (RecSys’09).
• [Qu et al., 2016] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu: Product-based Neural Networks for User Response Prediction. 16th International Conference on Data Mining (ICDM 2016).
• [Salakhutdinov et al., 2007] R. Salakhutdinov, A. Mnih, G. Hinton: Restricted Boltzmann Machines for Collaborative Filtering. 24th International Conference on Machine Learning (ICML 2007).
• [Song et al., 2016] Y. Song, A. M. Elkahky, X. He: Multi-Rate Deep Learning for Temporal Recommendation. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16).
• [Vincent et al., 2008] P. Vincent, H. Larochelle, Y. Bengio, P. A. Manzagol: Extracting and Composing Robust Features with Denoising Autoencoders. 25th International Conference on Machine Learning (ICML 2008).
• [Wang et al., 2015] H. Wang, N. Wang, DY. Yeung: Collaborative Deep Learning for Recommender Systems. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15).
• [Wang et al., 2016] H. Wang, X. Shi, DY. Yeung: Collaborative Recurrent Autoencoder: Recommend while Learning to Fill in the Blanks. Advances in Neural Information Processing Systems (NIPS 2016).
• [Wu et al., 2016] Y. Wu, C. DuBois, A. X. Zheng, M. Ester: Collaborative Denoising Auto-encoders for Top-n Recommender Systems. 9th ACM International Conference on Web Search and Data Mining (WSDM’16).
• [Zheng et al., 2016] Y. Zheng, C. Liu, B. Tang, H. Zhou: Neural Autoregressive Collaborative Filtering for Implicit Feedback. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).
Feature Extraction from Content
Content features in recommenders • Hybrid CF+CBF systems
– Interaction data + metadata • Model-based hybrid solutions
– Initializing • Obtain the item representation based on metadata • Use this representation as initial item features
– Regularizing • Obtain metadata-based representations • The interaction-based representation should be close to the metadata-based one • Add a regularizing term on this difference to the loss
– Joining • Obtain metadata-based representations • Have the item feature vector be a concatenation
– Fixed metadata-based part – Learned interaction-based part
Feature extraction from content • Deep learning is capable of direct feature extraction
– Work with content directly – Instead of (or besides) metadata
• Images – E.g.: product pictures, video thumbnails/frames – Extraction: convolutional networks – Applications (e.g.):
• Fashion • Video
• Text – E.g.: product description, content of the product, reviews – Extraction:
• RNNs • 1D convolutional networks • Weighted word embeddings • Paragraph vectors
– Applications (e.g.): • News • Books • Publications
• Music/audio – Extraction: convolutional networks (or RNNs)
Convolutional Neural Networks (CNN)
• Speciality of images – Huge amount of information • 3 channels (RGB) • Lots of pixels • Number of weights required to fully connect a 320×240 image to 2048 hidden units: – 3·320·240·2048 = 471,859,200
– Locality • An object’s presence is independent of its location or orientation • Objects are spatially restricted
Convolutional Neural Networks (CNN) • Image input
– 3D tensor • Width • Height • Channels (R, G, B)
• Text/sequence inputs – Matrix – of one-hot encoded entities
• Inputs must be of the same size – Padding
• (Classic) Convolutional Nets – Convolution layers – Pooling layers – Fully connected layers
Convolutional Neural Networks (CNN) • Convolutional layer (2D)
– Filter • Learnable weights, arranged in a small tensor (e.g. 3×3×D)
– The tensor’s depth equals the depth of the input • Recognizes certain patterns on the image
– Convolution with a filter • Apply the filter on regions of the image
– $y_{m,n} = f\left(\sum_{i,j,d} w_{i,j,d}\, I_{i+m-1,\, j+n-1,\, d}\right)$
» Filters are applied over all channels (the depth of the input tensor) » The activation function is usually some kind of ReLU
– Start from the upper left corner – Move right by one and apply again – Once reaching the end, go back and shift down by one
• Result: a 2D map of activations, high at places corresponding to the pattern recognized by the filter – Convolution layer: multiple filters of the same size
• Input size ($W_1 \times W_2 \times D$) • Filter size ($F \times F \times D$) • Stride (shift value) ($S$) • Number of filters ($N$) • Output size: $\left(\frac{W_1 - F}{S} + 1\right) \times \left(\frac{W_2 - F}{S} + 1\right) \times N$
• Number of weights: $F \times F \times D \times N$ – Another way to look at it:
• Hidden neurons organized in a $\left(\frac{W_1 - F}{S} + 1\right) \times \left(\frac{W_2 - F}{S} + 1\right) \times N$ tensor • Weights are shared between neurons with the same depth • A neuron processes an $F \times F \times D$ region of the input • Neighboring neurons process regions shifted by the stride value
• Example: convolving a 4×4 input with a 3×3 filter (stride 1) yields a 2×2 output:
Input: [[1 3 8 0], [0 7 2 1], [2 5 5 1], [4 2 3 0]]
Filter: [[-1 -2 -1], [-2 12 -2], [-1 -2 -1]]
Output: [[48 -27], [19 28]]
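The worked example above can be verified with a few lines of numpy (a plain cross-correlation, stride 1, no padding):

```python
import numpy as np

I = np.array([[1, 3, 8, 0],
              [0, 7, 2, 1],
              [2, 5, 5, 1],
              [4, 2, 3, 0]])
W = np.array([[-1, -2, -1],
              [-2, 12, -2],
              [-1, -2, -1]])
# Slide the 3x3 filter over every 3x3 region of the 4x4 input.
out = np.array([[(I[r:r+3, c:c+3] * W).sum() for c in range(2)] for r in range(2)])
print(out)  # [[ 48 -27]
            #  [ 19  28]]
```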
Convolutional Neural Networks (CNN) • Pooling layer
– Mean pooling: replace an $R \times R$ region with the mean of its values – Max pooling: replace an $R \times R$ region with the maximum of its values – Used to quickly reduce the size – Cheap, but a very aggressive operator
• Avoid when possible • Often needed, because convolutions don’t decrease the number of inputs fast enough
– Input size: $W_1 \times W_2 \times N$ – Output size: $\frac{W_1}{R} \times \frac{W_2}{R} \times N$
• Fully connected layers – Final few layers – Each hidden neuron is connected with every neuron in the next layer
• Residual connections (improvement) [He et al., 2016] – Very deep networks degrade performance – Hard to find the proper mappings – Reformulation of the problem: F(x) → F(x) + x
[Figure: residual block: the input x goes through two layers producing F(x); a skip connection adds x, giving F(x) + x]
Convolutional Neural Networks (CNN)
• Some examples • GoogLeNet [Szegedy et al., 2015]
• Inception-v3 model [Szegedy et al., 2016]
• ResNet (up to 200+ layers) [He et al., 2016]
Images in recommenders • [McAuley et al., 2015] – Learns a parameterized distance metric over visual features • Visual features are extracted from a pretrained CNN • Distance function: Euclidean distance of “embedded” visual features – Embedding here: multiplication with a weight matrix to reduce the number of dimensions
– Personalized distance • Reweights the distance with a user-specific weight vector
– Training: maximizing the likelihood of an existing relationship with the target item • Over uniformly sampled negative items
Images in recommenders • VisualBPR [He & McAuley, 2016]
– Model composed of • Bias terms • MF model • Visual part
– Pretrained CNN features – Dimension reduction through “embedding” – The product of this visual item feature and a learned user feature vector is used in the model
• Visual bias
– Product of the pretrained CNN features and a global bias vector over its features
– BPR loss – Tested on clothing datasets (9-25% improvement)
Music representations • [Oord et al., 2013]
– Extends iALS/WMF with audio features • To overcome cold-start
– Music feature extraction • Time-frequency representation • CNN applied to 3-second samples • Latent factor of the clip: average of the predictions on consecutive windows of the clip
– Integration with MF • (a) Minimize the distance between the music features and the MF’s feature vectors
• (b) Replace the item features with the music features (minimize the original loss)
Textual information improving recommendations
• [Bansal et al., 2016] – Paper recommendation – Item representation
• Text representation – Two-layer GRU (RNN): bidirectional layer followed by a unidirectional layer – The representation is created by pooling over the hidden states of the sequence
• ID-based representation (item feature vector) • Final representation: ID + text added
– Multi-task learning • Predict both user scores • And the likelihood of tags
– End-to-end training • All parameters are trained simultaneously (no pretraining) • Loss
– User scores: weighted MSE (as in iALS) – Tags: weighted log-likelihood (unobserved tags are downweighted)
References
• [Bansal et al., 2016] T. Bansal, D. Belanger, A. McCallum: Ask the GRU: Multi-Task Learning for Deep Text Recommendations. 10th ACM Conference on Recommender Systems (RecSys’16).
• [He et al., 2016] K. He, X. Zhang, S. Ren, J. Sun: Deep Residual Learning for Image Recognition. CVPR 2016.
• [He & McAuley, 2016] R. He, J. McAuley: VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. 30th AAAI Conference on Artificial Intelligence (AAAI’16).
• [McAuley et al., 2015] J. McAuley, C. Targett, Q. Shi, A. Hengel: Image-based Recommendations on Styles and Substitutes. 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’15).
• [Oord et al., 2013] A. van den Oord, S. Dieleman, B. Schrauwen: Deep Content-based Music Recommendation. Advances in Neural Information Processing Systems (NIPS 2013).
• [Szegedy et al., 2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich: Going Deeper with Convolutions. CVPR 2015.
• [Szegedy et al., 2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna: Rethinking the Inception Architecture for Computer Vision. CVPR 2016.
Session-based Recommendations with RNNs
Recurrent Neural Networks
• Input: sequential information ($\{x_t\}_{t=1}^{T}$)
• Hidden state ($h_t$): – representation of the sequence so far – influenced by every element of the sequence up to t
• $h_t = f(W x_t + U h_{t-1} + b)$
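To make the recurrence concrete, a numpy sketch of the update (assuming tanh for f; sizes and data are arbitrary):

```python
import numpy as np

dim_in, dim_h = 10, 20
W = np.random.randn(dim_h, dim_in)
U = np.random.randn(dim_h, dim_h)
b = np.zeros(dim_h)

h = np.zeros(dim_h)                      # initial hidden state
for x_t in np.random.randn(5, dim_in):   # a toy sequence of 5 inputs
    h = np.tanh(W @ x_t + U @ h + b)     # h_t = f(W x_t + U h_{t-1} + b)
```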
RNN-based machine learning • Sequence to value
– Encoding, labeling – E.g.: time series classification
• Value to sequence – Decoding, generation – E.g.: sequence generation
• Sequence to sequence – Simultaneous
• E.g.: next-click prediction – Encoder-decoder architecture
• E.g.: machine translation • Two RNNs (encoder & decoder)
– The encoder produces a vector describing the sequence » Last hidden state » Combination of hidden states (e.g. mean pooling) » Learned combination of hidden states
– The decoder receives the summary and generates a new sequence » The generated symbol is usually fed back to the decoder » The summary vector can be used to initialize the decoder » Or can be given as a global context
• Attention mechanism (optionally)
[Figure: the four RNN setups: sequence to value (x_1 … x_3 → h_1 … h_3 → y), value to sequence (x → y_1 … y_3), simultaneous sequence to sequence (x_t → y_t at every step), and encoder-decoder, where the encoder’s summary vector s is passed to the decoder]
Exploding/Vanishing gradients
• $h_t = f(W x_t + U h_{t-1} + b)$ • Gradient of $h_t$ w.r.t. $x_1$ – Simplification: linear activations
• In reality: bounded
– $\frac{\partial h_t}{\partial x_1} = \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial x_1} = U^{t-1} W$
• $\|U\|_2 < 1$ → vanishing gradients
– The effect of values further in the past is neglected – The network forgets
• $\|U\|_2 > 1$ → exploding gradients – Gradients become very large on longer sequences – The network becomes unstable
Handling exploding gradients • Gradient clipping – If the gradient is larger than a threshold, scale it back to the threshold (see the snippet below)
– Updates are not accurate – Vanishing gradients are not solved
• Enforce $\|U\|_2 = 1$ – Unitary RNN – Unable to forget
• Gated networks – Long Short-Term Memory (LSTM) – Gated Recurrent Unit (GRU) – (and some other variants)
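Gradient clipping as described above, assuming PyTorch (the model and data are placeholders):

```python
import torch

model = torch.nn.RNN(input_size=10, hidden_size=20)
out, _ = model(torch.randn(5, 1, 10))   # toy sequence: 5 steps, batch of 1
out.sum().backward()
# Rescale the gradient to the threshold if its norm exceeds it.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```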
Long Short-Term Memory (LSTM) • [Hochreiter & Schmidhuber, 1997] • Instead of rewriting the hidden state during the update, add a delta
– $s_t = s_{t-1} + \Delta s_t$ – Keeps the contribution of earlier inputs relevant
• Information flow is controlled by gates – Gates depend on the input and the hidden state – Between 0 and 1 – Forget gate (f): 0/1 → reset/keep hidden state – Input gate (i): 0/1 → don’t/do consider the contribution of the input – Output gate (o): how much of the memory is written to the hidden state
• Hidden state is separated into two (read before you write) – Memory cell (c): internal state of the LSTM cell – Hidden state (h): influences gates, updated from the memory cell
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W x_t + U h_{t-1} + b)$
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ \tanh(c_t)$
[Figure: LSTM cell: the input and the hidden state feed the i, f, o gates; the memory cell C is updated additively and read out through the output gate to h]
Gated Recurrent Unit (GRU) • [Cho et al., 2014] • Simplified information flow
– Single hidden state – Input and forget gates merged → update gate (z) – No output gate – Reset gate (r) to break the information flow from the previous hidden state
• Similar performance to LSTM
[Figure: GRU cell: the input and the hidden state feed the z and r gates; the hidden state h is updated as an interpolation controlled by z]
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1} + b)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
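The GRU equations above, written out as a numpy sketch (weight shapes follow the equations; σ is the logistic sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, W, U, b):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)           # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)           # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev) + b)  # candidate state
    return z * h_prev + (1 - z) * h_tilde              # new hidden state
```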
Session-based recommendations • Sequence of events – User identification problem – Disjoint sessions (instead of a consistent user history)
• Tasks – Next click prediction – Predicting intent
• Classic algorithms can’t cope with it well – Item-to-item recommendations as an approximation in live systems
• Area revitalized by RNNs
GRU4Rec (1/3) • [Hidasi et al., 2015] • Network structure
– Input: one-hot encoded item ID – Optional embedding layer – GRU layer(s) – Output: scores over all items – Target: the next item in the session
• Adapting GRU to session-based recommendations – Sessions of (very) different length & lots of short sessions: session-parallel mini-batching – Lots of items (inputs, outputs): sampling on the output – The goal is ranking: listwise loss functions on pointwise/pairwise scores
[Figure: GRU4Rec network: one-hot item ID → GRU layer → weighted output f() → scores on items, trained against the one-hot encoded next item ID]
GRU4Rec (2/3) • Session-parallel mini-batches
– The mini-batch is defined over sessions – Update with one-step BPTT
• Lots of sessions are very short • 2D mini-batching and updating on longer sequences (with or without padding) didn’t improve accuracy
• Output sampling – Computing scores for all items (100K-1M) in every step is slow
– One positive item (target) + several samples – Fast solution: scores on mini-batch targets
• The items of the other mini-batch examples are negative samples for the current example
• Loss functions (see the formulas and sketch below) – Cross-entropy + softmax – Average of BPR scores – TOP1 score (average of ranking error + regularization over score values)
[Figure: session-parallel mini-batching: sessions 1-5 of different lengths are laid out in parallel; the input is the current event of each active session and the desired output is its next event; when a session ends, the next session takes its place]
[Figure: sampled output: for mini-batch targets i_1, i_5, i_8, the score matrix ŷ is computed; the one-hot rows mark each example’s own target, while the other columns act as negative samples]
• $XE = -\log s_i$, where $s_i = \frac{e^{\hat{y}_i}}{\sum_{j=1}^{N_S} e^{\hat{y}_j}}$
• $BPR = -\frac{1}{N_S} \sum_{j=1}^{N_S} \log \sigma(\hat{y}_i - \hat{y}_j)$
• $TOP1 = \frac{1}{N_S} \left( \sum_{j=1}^{N_S} \sigma(\hat{y}_j - \hat{y}_i) + \sum_{j=1}^{N_S} \sigma(\hat{y}_j^2) \right)$
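The three losses operate on the B×B matrix of scores that each mini-batch example assigns to all mini-batch targets. A sketch assuming PyTorch; for brevity the target-vs-itself terms on the diagonal are not masked out here, which the actual GRU4Rec implementation does:

```python
import torch

# y_hat: (B, B) scores; row k scores example k against all mini-batch targets,
# so y_hat[k, k] is the score of example k's own (positive) target.

def xe_loss(y_hat):
    # Cross-entropy + softmax over the sampled scores.
    return -torch.log_softmax(y_hat, dim=1).diag().mean()

def bpr_loss(y_hat):
    # Average of BPR scores: target score vs. each negative sample's score.
    diff = y_hat.diag().unsqueeze(1) - y_hat
    return -torch.log(torch.sigmoid(diff)).mean()

def top1_loss(y_hat):
    # Ranking error + regularization over the negative score values.
    diff = y_hat - y_hat.diag().unsqueeze(1)
    return (torch.sigmoid(diff) + torch.sigmoid(y_hat ** 2)).mean()
```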
GRU4Rec (3/3) • Observations – Similar accuracy with/without embedding – Multiple layers rarely help
• Sometimes a slight improvement with 2 layers • Sessions span a short time, no need for multiple time scales
– Quick convergence: only small changes after 5-10 epochs – Upper bound for model capacity
• No improvement when adding additional units after a certain threshold
• This threshold can be lowered with some techniques • Results – 20-30% improvement over item-to-item recommendations
Improving GRU4Rec • Recall@20 on RSC15 by GRU4Rec: 0.6069 (100 units), 0.6322 (1000 units) • Data augmentation [Tan et al., 2016]
– Generate additional sessions by taking every possible sequence starting from the end of a session – Randomly remove items from these sequences – Long training times – Recall@20 on RSC15 (using the full training set for training): ~0.685 (100 units)
• Bayesian version (ReLeVar) [Chatzis et al., 2017] – Bayesian formulation of the model – Basically additional regularization by adding random noise during sampling
– Recall@20 on RSC15: 0.6507 (1500 units)
• New losses and additional sampling [Hidasi & Karatzoglou, 2017]
– Use additional samples besides the mini-batch samples – Design better loss functions
• $BPR\text{-}max = -\log\left(\sum_{j=1}^{N_S} s_j\, \sigma(r_i - r_j)\right) + \lambda \sum_{j=1}^{N_S} r_j^2$
– Recall@20 on RSC15: 0.7119 (100 units)
Extensions • Multi-modal information (p-RNN model) [Hidasi et al., 2016]
– Use the image and description besides the item ID – One RNN per information source – Hidden states concatenated – Alternating training
• Item metadata [Twardowski, 2016] – Embed item metadata – Merge with the hidden layer of the RNN (session representation) – Predict compatibility using feedforward layers
• Contextualization [Smirnova & Vasile, 2017] – Merging both current and next context – Current context on the input module – Next context on the output module – The RNN cell is redefined to learn context-aware transitions
• Personalizing by inter-session modeling – Hierarchical RNNs [Quadrana et al., 2017], [Ruocco et al., 2017]
• One RNN works within the session (next click prediction) • The other RNN predicts the transition between the sessions of the user
References
• [Chatzis et al., 2017] S. P. Chatzis, P. Christodoulou, A. Andreou: Recurrent Latent Variable Networks for Session-Based Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.04026
• [Cho et al., 2014] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio: On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. https://arxiv.org/abs/1409.1259
• [Hidasi et al., 2015] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk: Session-based Recommendations with Recurrent Neural Networks. International Conference on Learning Representations (ICLR 2016). https://arxiv.org/abs/1511.06939
• [Hidasi et al., 2016] B. Hidasi, M. Quadrana, A. Karatzoglou, D. Tikk: Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. 10th ACM Conference on Recommender Systems (RecSys’16).
• [Hidasi & Karatzoglou, 2017] B. Hidasi, A. Karatzoglou: Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. https://arxiv.org/abs/1706.03847
• [Hochreiter & Schmidhuber, 1997] S. Hochreiter, J. Schmidhuber: Long Short-term Memory. Neural Computation, 9(8):1735-1780.
• [Quadrana et al., 2017] M. Quadrana, A. Karatzoglou, B. Hidasi, P. Cremonesi: Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. 11th ACM Conference on Recommender Systems (RecSys’17). https://arxiv.org/abs/1706.04148
• [Ruocco et al., 2017] M. Ruocco, O. S. Lillestøl Skrede, H. Langseth: Inter-Session Modeling for Session-Based Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.07506
• [Smirnova & Vasile, 2017] E. Smirnova, F. Vasile: Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.07684
• [Tan et al., 2016] Y. K. Tan, X. Xu, Y. Liu: Improved Recurrent Neural Networks for Session-based Recommendations. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). https://arxiv.org/abs/1606.08117
• [Twardowski, 2016] B. Twardowski: Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. 10th ACM Conference on Recommender Systems (RecSys’16).
Conclusions • Deep Learning is now in RecSys • Huge potential, but a lot to do – E.g. explore more advanced DL techniques
• Current research directions – Item embeddings – Deep collaborative filtering – Feature extraction from content – Session-based recommendations with RNNs
• Scalability should be kept in mind • Don’t fall for the hype BUT don’t disregard the achievements of DL and its potential for RecSys
Thank you!