Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

Master’s Degree Project
Stockholm, Sweden, October 2014

XR-EE-KT 2014:012



Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

Master’s Thesis at VionLabs AB
Supervisor: Roelof Pieters
Examiner: Markus Flierl

XR-EE-KT 2014:012


Abstract

User-written movie reviews carry substantial amounts of movie-related features such as descriptions of location, time period, genres, characters, etc. Using natural language processing and topic modeling based techniques, it is possible to extract features from movie reviews and find movies with similar features. In this thesis, a feature extraction method is presented and the use of the extracted features in finding similar movies is investigated. We perform text pre-processing on a collection of movie reviews. We then extract topics from the collection using topic modeling techniques and store the topic distribution for each movie. A similarity metric such as the Hellinger distance is then used to find movies with a similar topic distribution. Furthermore, the extracted topics are used as an explanation during subjective evaluation. Experimental results show that our extracted topics represent useful movie features and that they can be used to find similar movies efficiently.


Acknowledgements

This thesis has been carried out at Vionlabs AB. From the initial idea to the final execution, everyone at Vionlabs supported the endeavour to build and create something around movies and technology. I would like to thank my supervisor, Roelof Pieters, for his guidance and for the many endless discussions around NLP, topic modeling and movie recommendation systems.

I would also like to thank the main author of the Gensim library, Radim, for his endless suggestions and ideas. I extend my gratitude to the great community of programmers and engineers who took the time to reply to my questions on Stack Overflow and offer suggestions.

I would like to thank my coordinator and examiner, Markus Flierl, for giving valuable guidance and suggestions at each stage of the project. I would also like to thank all the movie judges at Vionlabs for their time and effort in rating movies. Finally, I would like to thank my family and friends, who constantly supported me throughout the thesis.


Contents

List of Figures

1 Introduction
   1.1 Question
   1.2 Goals
   1.3 Outline

2 Background
   2.1 Movie Data Processing: A Literature review
   2.2 Document representation
   2.3 Topic Modeling
      2.3.1 Overview
      2.3.2 Latent Dirichlet Allocation
   2.4 Similarity Metrics
      2.4.1 Cosine Similarity
      2.4.2 Kullback-Leibler (KL) divergence
      2.4.3 Hellinger Distance

3 Recommendation based on Movie Topics
   3.1 User Reviews of Movies as Data
   3.2 Text Preprocessing
   3.3 Feature Extraction
      3.3.1 Overview
      3.3.2 Movie Topics
   3.4 Topic Similarity

4 Experimental Setup and Results
   4.1 Experimental Setup
      4.1.1 Text processing
      4.1.2 Training LDA model
      4.1.3 Calculating movie similarity
   4.2 Evaluation
      4.2.1 Evaluation criteria
      4.2.2 Web based movie evaluation setup
   4.3 Results
      4.3.1 Evaluation result
      4.3.2 Rating correlation
      4.3.3 Observations on Subjective evaluation

5 Conclusion and Future Directions
   5.1 Conclusion
   5.2 Future Directions
      5.2.1 Movie review preprocessing
      5.2.2 Building complex topic models

Bibliography


List of Figures

2.1 Vector Space Model of documents. Figure by pyevolve [24].
2.2 The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variables, i.e. the words of the documents, and the unshaded circles are the hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei’s paper [20].
2.3 Angle between two documents in a 2-d document-term space.
3.1 The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model, and extracting topics from it. The topics are then used to find similar movies.
3.2 Screenshot of a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.
3.3 Collection and preprocessing of movie reviews.
3.4 Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. The representation is inspired by Chris Kiehl’s blog [37].
3.5 Tree showing the nltk-based chunking technique applied on movie data.
3.6 Sample topics generated from user movie reviews for the movie Gravity.
3.7 Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the Hellinger distance similarity score whereas the Y-axis represents the cosine similarity score.
4.1 A tree diagram showing the movie review corpus.
4.2 Chart showing genres of popular movies from the last 10 years.
4.3 A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents movie reviews denoted by their corresponding ids while the horizontal axis represents movie topics.
4.4 Front page of the movie evaluation system, showing five target movies. A user clicks on a target movie and five similar movies are presented for evaluation.
4.5 Web based movie evaluation system. Shown on the left is a target movie; the front page upon log-in shows 10 target movies.
4.6 Movie evaluation system with explanation.
4.7 Result of average rating for Genre (top) and Genre with explanation (bottom).
4.8 Result of average rating for Mood (top) and Mood with explanation (bottom).
4.9 Result of average rating for Plot (top) and Plot with explanation (bottom).
4.10 Result of average rating for Overlap (top) and Overlap with explanation (bottom).
4.11 Result shows average ratings for the movie topics.
4.12 Strong positive correlation between Genre and Mood.
4.13 Strong positive correlation of ratings between two judges. Judges agree most with rating 1 and then with ratings 2 and 3.


Chapter 1

Introduction

The advent of movie streaming services has made thousands of movies available at the click of a button [1]. We now have movies not only from Hollywood, but also from international cinema, documentaries, indie movies, etc. With so many movies at hand, the consumer faces the dilemma of what to watch. At the end of the day, people just want to relax and watch something that matches their mood, taste and style. This is where Recommendation Systems (RS) can help, suggesting movies that match user taste and viewing habits. In order to recommend movies, we need to understand movies first. The more we understand movie features (genre, keywords, mood, etc.), the better recommendations we can serve.

Commercial streaming services such as Netflix [2] and Jinni [3] combine semantic information about movies with user ratings to get an optimal hybrid RS. However, they still depend on human taggers [4], [5] for the basic feature representation needed to classify movies or songs. Although the results obtained from human taggers are quite good, such an approach is not scalable when tagging hundreds of thousands of movies or millions of daily generated videos.

For a system to understand a movie, it needs movie features such as the movie cast, genre, plot, etc. With this information, a system can better categorize movies. User-written movie reviews are one such source of features. They carry a substantial amount of movie-related information such as location, time period, genre, lead characters and memorable scene descriptions.

Since a user-written movie review contains both useful (i.e. keywords) and useless (i.e. stopwords) information, some text pre-processing is required before it can be used by a RS. With pre-processed movie data, the next step is to find a good feature representation for movies. In this thesis, we explore feature extraction from movie reviews using Natural Language Processing (NLP) and topic modeling techniques and use the extracted features to find similar movies. The experiments are done on a small set of movies to show that movie topics are efficient features for RS.


1.1 Question

• Is it possible to extract or generate movie features from user reviews?

• Is it possible to use extracted features to find similar movies?

• What is a good feature and how can we distinguish good features from badones?

1.2 Goals

The goals of this master’s thesis are:

• Extract movies features from user reviews of movies.

• Investigate extracted features to find similar movies.

• Draw conclusions about the performance of the developed prototype system.

1.3 Outline

This work is presented in the following chapters:

• Chapter 2 discusses the background study done during the project. Technicalconcepts that have been used in the project will be presented.

• Chapter 3 presents a recommendation system based on movie topics. Majorsteps involved in topic extraction are discussed in detail. The chapter closesby discussing the implementation of similarity metrics used to find similarmovies based on topics.

• Chapter 4 presents the experimental setup, evaluation system and results.

• Chapter 5 concludes the project and discusses future directions.


Chapter 2

Background

A Recommendation System provides items and suggestions to a person based on his or her interests and past usage history. Such a system is the backbone of many of today’s content streaming services such as Netflix1, Pandora2 and Youtube3. Recommendation Systems (RS) are usually classified based on the approach used to filter information: content based filtering, collaborative filtering (based on user activities) and hybrid (combining both).

Collaborative filtering based RS have seen much interest lately because of the Netflix competition [6], whereas content based systems face the challenge of efficient feature representation of meta-data from audio, video and text. Luckily, for the movie domain, a lot of textual information is readily available, such as plot lines, dialogues and reviews.

2.1 Movie Data Processing: A Literature review

Movie data in the form of keywords, scripts, dialogue and reviews has been used in research over the past decade [7]–[10]. [8] explores movie recommendation using cultural metadata such as user comments, plot outlines, keywords, etc., and shows the highest precision with user comments. The report [9] discusses movie classification using NLP based methods such as a Named Entity Recognizer (NER) and a Part-of-Speech (POS) tagger with movie scripts as input. It concludes that NLP based features perform well when compared to non-NLP features (without the use of NER and POS), although it reports only 50% accuracy because of the small corpus size.

We decided to use movie reviews written by moviegoers primarily because: a) they are easily available [7], [11] and computationally inexpensive to process on off-the-shelf hardware; b) each movie review can be considered a single document representing the movie, which allows us to use document based classification methods on movies; and c) the simple heuristic of combining several user-written movie reviews of a single movie into a single document has the potential to discover semantic patterns at the movie level. Furthermore, combining all the individual movie documents into a collection allows us to explore patterns across the collection (essentially, across genres).

1 www.netflix.com
2 http://www.pandora.com/
3 www.youtube.com

In order to use movie reviews as data, it is necessary to remove irrelevant words, symbols, HTML tags, etc. In NLP, a large number of open source tools and libraries [12]–[14] are available and used as the first step in any kind of text processing. Chapter 7 of [12] describes the steps involved in extracting information from text. [15] uses the nltk toolkit for the stopword removal and stemming steps. Noisy text data can drastically affect the result of any kind of NLP based model training. Text filtering helps in removing unnecessary information and allows us to apply complex mathematical models to the data.
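This filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the thesis’s actual pipeline: the tiny stopword list stands in for the fuller NLTK list, and `preprocess` is a hypothetical helper name.

```python
import re

# Illustrative stopword list only; the thesis uses the fuller NLTK list.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "in", "to"}

def preprocess(review):
    """Lowercase, strip HTML tags, keep alphabetic tokens, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", review.lower())  # remove HTML remnants
    tokens = re.findall(r"[a-z]+", text)            # alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<b>Gravity</b> is a thriller set in space."))
# -> ['gravity', 'thriller', 'set', 'space']
```

A real pipeline would also stem tokens and handle punctuation-heavy review text, as discussed in Chapter 3.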

The paper [8] compares the results obtained by preprocessing meta-data. It simply computes the cosine similarity from the document-term matrix of movie data. Although the paper showed the highest precision with user comments of movies, it did not analyze the data further with advanced techniques such as LSA. After preprocessing, reducing the data dimensionality is the next step in feature extraction.

The document-term matrix is used as the input to many semantic analysis techniques, from the basic tf-idf scheme to complex models such as LSI, LSA and LDA. These dimensionality reduction techniques can yield semantic features of big data on off-the-shelf hardware [13]. Such models are interesting candidates for investigating semantic concepts in movie data. The thesis [10] studies sentiment analysis of movie reviews using LSA but concludes that the dimensions that capture the most variance are not always the most discriminating features for classification.

On the other hand, [16] shows interesting results when using topic modeling in a content based RS. Probabilistic topic modeling allows us to extract hidden features, i.e. “topics”, from documents. LDA, a model based on topic modeling, shows good results in both document clustering [17] and recommendation systems [18], [19]. It can capture important intra-document statistical structure by considering mixture models for exchangeability of both words and documents within a corpus [20]. With probabilistic techniques such as LDA, it is possible to derive semantic similarities from textual movie data. Such extracted semantic information can be used to find similar movies. Moreover, LDA can assign a topic distribution to new unseen documents, an important requirement for building a scalable RS for movies, as it should be trivial to add new movies on a regular basis. For a RS, computing similarity is an essential part, be it similarity of content or of user ratings.

Clustering, an unsupervised classification of patterns, is a technique applied to movie meta-data in RS. The review paper on clustering [21] briefly discusses similarity measures but emphasizes that similarity is fundamental to clustering. [22], [23] have done detailed studies of commonly used similarity measures in text clustering. Since the input data for our project, i.e. movie reviews, are in the form of text documents, we can begin with the similarity measures discussed in [22]. Once the similarity between movies is computed, it is important to evaluate the results obtained. For unsupervised learning techniques such as LDA, evaluation is still a challenge. Since our project is based on movies, subjective evaluation is an obvious choice, as the movies are ultimately watched by people.

In the end, even though topic modeling has shown good results in recommender systems [16], [19], it has hardly been explored for movie recommendation. Understanding movie data still faces challenges, and we need algorithms with semantic understanding to solve them.

2.2 Document representation

Before stepping into NLP based techniques, it is important to understand basic document representation. Let’s say we have a set of documents and, for the sake of simplicity, each document consists of a single sentence. We can represent such a model in a vector space as shown in Figure 2.1. Such a representation is called a Vector Space Model (VSM). Each word corresponds to a dimension and each document is a vector with non-negative values on each dimension. Figure 2.1 is an example in a 3-dimensional space, but in practice the document space usually runs into tens of thousands of dimensions. The VSM allows us to take 2- and 3-dimensional geometric formulae and extend them to m dimensions, where m is the number of distinct terms appearing in the set of documents.

Figure 2.1. Vector Space Model of documents. Figure by pyevolve [24].

To represent a document as a term vector, consider each word as a term. Obviously, some terms appear more frequently and are considered important for the document. Let D = {d1, ..., dn} be a corpus of documents and T = {t1, ..., tm} be the set of distinct terms occurring in D. Let tf(d, t) represent the frequency of term t ∈ T in document d ∈ D. For document d, we can then represent the m-dimensional vector t⃗d [22] as


\vec{t}_d = (tf(d, t_1), \ldots, tf(d, t_m))

In practice, more complicated schemes such as tf-idf weighting are used. For basic prototyping, the document-term vector t⃗d is good to start with.
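The term-frequency vector above can be built in a few lines of Python. This is a toy illustration under the definitions just given; `docs`, `terms` and `tf_vector` are hypothetical names, and a real corpus has thousands of terms.

```python
from collections import Counter

# Two toy "documents" already tokenized, standing in for movie reviews.
docs = {
    "d1": ["space", "thriller", "space", "astronaut"],
    "d2": ["romance", "paris", "romance"],
}

# The set T of distinct terms across the corpus, in a fixed order.
terms = sorted({t for doc in docs.values() for t in doc})

def tf_vector(doc):
    """m-dimensional vector (tf(d, t1), ..., tf(d, tm)) for one document."""
    counts = Counter(doc)
    return [counts[t] for t in terms]

print(terms)                   # ['astronaut', 'paris', 'romance', 'space', 'thriller']
print(tf_vector(docs["d1"]))   # [1, 0, 0, 2, 1]
```

Stacking these vectors for all documents gives the document-term matrix used by the models in the next section.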

2.3 Topic Modeling

2.3.1 Overview

Modeling text corpora is a central problem of information retrieval (IR) and classification. tf-idf, a widely used scheme in the IR domain, is based on the document-term methodology. It describes the importance of a word to a document in the collection and reduces documents to a fixed-length matrix representation. But tf-idf hardly gives any insight into intra- and inter-document statistical structure, and it still yields a term × document sized matrix, quite a high dimension. To tackle these problems, Latent Semantic Indexing (LSI) was proposed, which uses singular value decomposition on the document-term matrix. LSI, being a dimensionality reduction technique, quickly became popular. In 1999, Hofmann proposed an improvement over LSI called probabilistic LSI (pLSI) [25]. pLSI models each word in a document as a sample from a mixture model [20], thereby giving a representation of a document in terms of a probability distribution over “topics”. The mixture components representing topics are basically multinomial random variables.

Although an improvement over LSI, pLSI lacked a probabilistic model at the level of documents. This led to the pLSI parameters growing linearly with the size of the corpus. Another challenge was to assign topic proportions to new unseen documents. Improving on pLSI’s shortcomings, the LDA model was introduced by David Blei [20].

Before going into LDA, it is important to distinguish between features and hidden features. In image analysis, a feature is a “point of interest” for image description. A “good feature” is said to have useful properties [26] such as being:

• perceptually meaningful (as to humans)

• analytically special (e.g. maxima)

• identifiable on different images

Hidden features, mostly used in statistical and probabilistic modeling, are hidden random variables that are inferred from observed variables. In the topic modeling sense, the hidden variables are topics representing the thematic structure of a document collection, and the observed variables are the words of the documents.

2.3.2 Latent Dirichlet Allocation

In the original paper [27], a topic is defined as a distribution over a fixed vocabulary. Such a distribution allows us to represent a document in terms of multiple topics with different proportions, thereby making it easier to classify, store and find similar documents in a collection.

LDA defines a generative process for documents with the assumption that the topics are generated first, before the documents. Hence, when training with the number of topics equal to 100, we are basically assuming that there are 100 topics in the collection of documents. For each document in the collection, the words are generated in a two-stage [27] process:

1. Randomly choose a distribution over topics.

2. For each word in the document:

   a) Randomly choose a topic from the distribution over topics chosen in step 1.

   b) Randomly choose a word from the corresponding distribution over the vocabulary.

The above process reflects the idea of LDA that documents exhibit multiple topics. Step 1 shows that each document exhibits the topics in different proportions. Further, each word within each document is picked from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a).
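The two-stage process can be simulated directly. The sketch below uses two hand-written topics over a four-word vocabulary purely for illustration; in LDA the topics and per-document proportions are of course inferred from data, not fixed by hand.

```python
import random

random.seed(7)

# Assumed toy topics: each topic is a distribution over a fixed vocabulary.
vocab = ["space", "astronaut", "love", "paris"]
topics = [
    [0.6, 0.4, 0.0, 0.0],  # a "sci-fi" topic
    [0.0, 0.0, 0.5, 0.5],  # a "romance" topic
]

def generate_document(theta, n_words):
    """Two-stage process: pick topic z from theta (step 2a),
    then a word from topic z's distribution over the vocabulary (step 2b)."""
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(topics)), weights=theta)[0]
        w = random.choices(vocab, weights=topics[z])[0]
        words.append(w)
    return words

# A document that is mostly "sci-fi" (80%) with some "romance" (20%).
doc = generate_document(theta=[0.8, 0.2], n_words=6)
print(doc)
```

Running this repeatedly makes the per-document topic proportions visible: with theta = [0.8, 0.2], most words come from the sci-fi topic.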

The generative process for LDA can be written as the joint distribution of the hidden and observed variables:

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \qquad (2.1)

where β1:K are the topics and each βk is a distribution over the vocabulary. wd are the observed words for document d, and wd,n is the nth word in document d. The topic proportions for the dth document are θd, where θd,k is the topic proportion for topic k in document d. The topic assignments for the dth document are zd, where zd,n is the topic assignment for the nth word in document d. Figure 2.2 shows the graphical model of LDA with three levels. First, α and η are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θd are document-level variables, sampled once per document. Finally, the variables zd,n and wd,n are word-level variables, sampled once for each word in each document [27].

After obtaining the joint distribution, we now compute the conditional distribution of the hidden variables (the topics) given the observed variables (the words). In Bayesian statistics, this is called the posterior of the hidden variables given the observed variables.

p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \qquad (2.2)


Figure 2.2. The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variables, i.e. the words of the documents, and the unshaded circles are the hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei’s paper [20].

The numerator is the joint distribution of all the random variables and the denominator is the marginal probability, summing over all possible ways of assigning each observed word of the collection to one of the topics [27]. With an exponentially large computation in the denominator, various approximation techniques are used to approximate the posterior. We used the MALLET [28] package, which uses Gibbs sampling for posterior approximation.

As mentioned in [27], relaxing and extending the statistical assumptions made by LDA could narrow the topics down to specific semantic patterns. Nowadays, topic modeling implementations have been optimized with features such as online learning of the LDA model for documents arriving in a stream and multi-threading support.

2.4 Similarity Metrics

Finding similar movies for a target movie is the objective of a content based RS. Media content can be in the form of audio, video and text. In our case, each movie is represented by a single document consisting of movie review text; hence, it is useful to look at similarity metrics currently used in the document clustering domain. In document clustering, the closeness between documents is defined in terms of the similarity or distance between them. In the rest of this chapter, some commonly used similarity metrics are discussed.

2.4.1 Cosine Similarity

Cosine Similarity (CS) is the most used measure of document similarity. Its usage can be seen in the information retrieval domain, e.g. measuring similarity between documents with data obtained from the LSI algorithm [29]. In order to measure the similarity of two documents, we can calculate the cosine of the angle between the two term vectors of the documents. Figure 2.3 shows the angle in a two-dimensional document space.

Given two documents t⃗a and t⃗b, their cosine similarity is given by


Figure 2.3. Angle between two documents in a 2-d document-term space.

\mathrm{sim}_{cs}(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{|\vec{t}_a| \times |\vec{t}_b|}, \qquad (2.3)

where t⃗a and t⃗b are m-dimensional vectors over the term set T = {t1, ..., tm}. It is important to note that for documents the tf-idf weights are non-negative. Hence, the CS is always in [0, 1].
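Equation (2.3) translates directly into code. This is a minimal sketch; `cosine_similarity` is an illustrative helper, not a library function, and it assumes neither vector is all-zero.

```python
import math

def cosine_similarity(ta, tb):
    """Cosine of the angle between two term vectors, as in Eq. (2.3)."""
    dot = sum(a * b for a, b in zip(ta, tb))
    norm = math.sqrt(sum(a * a for a in ta)) * math.sqrt(sum(b * b for b in tb))
    return dot / norm

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # parallel vectors -> 1.0
print(cosine_similarity([1, 0], [0, 1]))        # no shared terms -> 0.0
```

Note that the score depends only on the angle, not the vector lengths, so a long and a short review with the same term proportions compare as identical.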

Although cosine similarity is a widely used similarity metric, it is important to consider metrics based on probability distributions if the input data is a topic distribution. The Kullback-Leibler divergence has been shown to effectively cluster text data using both terms [22] and topic distributions [30].

2.4.2 Kullback-Leibler (KL) divergence

In the field of information theory, a document is described by a probability distribution over terms. We can then calculate the similarity between two documents as the distance between the two corresponding probability distributions [22]. For two distributions P and Q, the KL divergence of Q from P is

D_{kl}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (2.4)

In other words, the KL divergence of Q from P is a measure of the information lost when Q is used to approximate P [31].

The limitation of KL divergence when using it for similarity between documents based on a probability distribution over topics is that it is not symmetric. For a distance measure to be considered a metric of similarity, it must be symmetric, i.e. the distance from x to y is the same as the distance from y to x. For the case of KL divergence, consider the following equation, again in the document scenario:

D_{kl}(\vec{t}_a \,\|\, \vec{t}_b) - D_{kl}(\vec{t}_b \,\|\, \vec{t}_a) = \sum_{t=1}^{m} (w_{t,a} + w_{t,b}) \log \frac{w_{t,a}}{w_{t,b}} \qquad (2.5)

Page 22: EfficientFeaturesforMovie RecommendationSystems759691/FULLTEXT01.pdf · i.e. “topics” from documents. LDA, a model based on topic modeling, shows good results in both document

10 CHAPTER 2. BACKGROUND

Since the above expression is not zero in general, the KL divergence is not symmetric. The solution is to use the arithmetic average of Dkl(P||Q) and Dkl(Q||P) or to calculate the Hellinger distance (HL) for such cases [32], [33]. In this work, we explored the HL distance further.
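The asymmetry is easy to verify numerically. The sketch below implements Equation (2.4) directly; the two distributions are arbitrary toy values chosen only to make the two directions differ.

```python
import math

def kl_divergence(p, q):
    """D_kl(P || Q) per Eq. (2.4); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

# The two directions disagree, so KL divergence alone is not a metric.
print(kl_divergence(P, Q))  # ~0.36
print(kl_divergence(Q, P))  # ~0.39
```

Averaging the two directions, or switching to the Hellinger distance, restores symmetry.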

2.4.3 Hellinger Distance

The Hellinger distance is a metric of similarity between two probability distributions. For probability distributions P = {p_i}_{i∈[n]} and Q = {q_i}_{i∈[n]} supported on [n], the Hellinger distance [34] between P and Q is defined as

h(P, Q) = \frac{1}{\sqrt{2}} \left\| \sqrt{P} - \sqrt{Q} \right\|_2. \quad (2.6)

It is important to note that for cosine similarity, a higher value is better whereasfor the Hellinger distance, a smaller value represents more similarity.
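A minimal sketch of Eq. (2.6) on two toy topic distributions (hypothetical values): unlike KL divergence, the Hellinger distance is symmetric, lies in [0, 1], and is zero for identical distributions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, Eq. (2.6)."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Toy topic distributions (hypothetical values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

assert hellinger(p, p) == 0.0                          # identical: most similar
assert abs(hellinger(p, q) - hellinger(q, p)) < 1e-12  # symmetric, unlike KL
```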

The motivation to improve movie recommendation was the initial push to explore NLP and topic modeling techniques. With this knowledge about document processing, topic modeling and similarity measures, we can now discuss the design approach taken during the project implementation.


Chapter 3

Recommendation based on Movie Topics

This chapter discusses the implementation of the major steps involved in prototyping a movie-topics-based RS. The algorithm for the overall system is visualized in four steps, as shown in Figure 3.1.

Figure 3.1. The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model, and extracting topics from it. The topics are later used to find similar movies.

Summarizing the system

1. Two datasets were created and preprocessed for the experiment:

• Corpus A, a set of user-written movie reviews extracted from the web: a list of 943 popular movies from the last 10 years, rated by users on the web.

• Corpus B, a list of ten target movies representing popular genres, hand-picked by two movie lovers who later evaluated the results.

2. An LDA-based model is trained on Corpus A to generate movie topics.


3. Using the trained model, indexes of topic distributions for both Corpus A and the unseen Corpus B are created.

4. Using a similarity metric, a list of five similar movies for each target movie is created and presented for evaluation.

To implement the above system, Python [35] is used as the programming language of choice because of the large ecosystem of machine learning (ML) tools and libraries around it. Python-based ML systems are easy to scale, as most of the open-source libraries are memory efficient and support multiple threads of execution.

We start by analyzing and preprocessing the movie data. Next, feature extraction is performed on the processed data. Finally, the extracted features are used to find the similarity between movies.

3.1 User Reviews of Movies as Data

Figure 3.2. Screenshot showing a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.

Movie reviews are widely available in audio, video and text form, so we needed to narrow down our choice of initial data. We decided to use text-based movie reviews, as they are easy to extract over the Internet and have low computational complexity when prototyping with different algorithms. Reviews themselves are written by movie critics or by users. Basing our feature extraction on movie critic reviews could result in a biased view of a movie. Combining a large number of reviews written by users and using them as the source for our feature extraction system has the benefit that we might pick up semantic patterns considered or agreed upon by a wide audience of cinema. Figure 3.2 shows such semantic patterns that we want to extract in this project. In the sample review for the movie Gravity, observe the description of another movie, “Apollo 13”. Users connect movies while writing


reviews, and this could be useful in finding semantic patterns across movies belonging to the same genre.

In this report, we use the term “document” to be consistent with the IR and topic modeling domains, but in our experimental setup a document consists of the user-written movie reviews for a single movie, i.e., it represents a movie.

3.2 Text Preprocessing

In Natural Language Processing, a corpus is a collection of text data [36] used for verifying hypotheses about language, such as extracting features from text or finding patterns of word usage. For the movie review data, we collected the text data and followed the preprocessing shown in Figure 3.3. During preprocessing, irrelevant words such as {of, and, or} are removed using a common English stopword list.
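A minimal sketch of this cleaning step; the inline `STOPWORDS` set is a tiny stand-in for a full English stopword list such as NLTK's, and the sample sentence is invented for illustration.

```python
import re

# Tiny stand-in for a full English stopword list (NLTK's is much larger).
STOPWORDS = {"of", "and", "or", "the", "a", "an", "is", "in", "to"}

def remove_stopwords(review):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("The shuttle debris and the exploration of space"))
# -> ['shuttle', 'debris', 'exploration', 'space']
```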

Figure 3.3. Collection and preprocessing of movie reviews.


Figure 3.4. Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. The above representation is inspired by Chris Kiehl's blog [37].

Next, NLTK's default lemmatizer is used for lemmatisation. It uses the WordNet Database1 to look up lemmas. A lemmatizer reduces all derivationally related forms of a word to a common base form; for example, the word “cars” is reduced to “car”. This allows us to keep the concept words and remove other forms of the same word in a corpus.

Since text preprocessing is done on 1k movie reviews, it is useful to process them in parallel. Figure 3.4 shows the multiprocessing approach taken to implement preprocessing of movie reviews in parallel. The Python multiprocessing package is used, as it allows new processes to be spawned to utilize the multiple processors of a given machine [35]. This saves time during prototyping and allows us to scale the system.
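The fan-out of Figure 3.4 can be sketched with the standard-library pool API. The worker body and the toy corpus are illustrative only; a thread pool is used here so the sketch stays portable, while the thesis uses the process-based `multiprocessing.Pool`, which exposes the same `map()` interface.

```python
from multiprocessing.pool import ThreadPool  # same map() API as multiprocessing.Pool

def preprocess(review):
    """Illustrative worker: lowercase and tokenize one movie's reviews."""
    return review.lower().split()

reviews = ["Great space THRILLER", "Bleak poetic drama"]  # toy corpus

# Fan the per-movie work out over a pool of workers, as in Figure 3.4.
with ThreadPool(2) as pool:
    processed = pool.map(preprocess, reviews)

print(processed)
# -> [['great', 'space', 'thriller'], ['bleak', 'poetic', 'drama']]
```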

With the preprocessed data at hand, we explored a number of techniques in the NLP domain. We experimented with chunk extraction on the movie data. Chunking is useful for segmenting and labelling multi-token sequences in a sentence. One such result is shown in Figure 3.5.

Figure 3.5. Tree showing the NLTK-based chunking technique applied to movie data.

1http://wordnet.princeton.edu/


Although a chunking-based approach is useful for tasks such as information extraction, it is not the right tool for analyzing semantic patterns in large volumes of unlabeled text such as movie reviews. In the IR domain, analyzing large amounts of unlabeled text is a common requirement, and this motivated us to look at various IR techniques such as LSI and LDA.

3.3 Feature Extraction

3.3.1 Overview

The goal of feature extraction is to transform data from images or text into numerical features for the purpose of analysis. In text processing, techniques such as the document-term methodology convert text documents into numerical data. We can then easily feed such matrix-form data into machine learning algorithms to observe the thematic structure of documents. Mathematical techniques such as Latent Semantic Indexing (LSI) are then used to project the document-term matrix from a high-dimensional to a lower-dimensional space in order to identify semantic meaning and similarity between documents. LSI is essentially an application of Singular Value Decomposition (SVD) to a document-term matrix. Another approach in text processing is to express words and documents in terms of probability distributions, leading to models useful for finding semantic information. Probabilistic LSI (pLSI) and Latent Dirichlet Allocation (LDA) are such probabilistic models. Compared to LDA, pLSI provides no probabilistic model at the level of documents. For analyzing movies, it is necessary to model at the level of movies in a collection. Another benefit of LDA is that it fits new unseen documents better (new upcoming movies in our case), an important requirement for a movie recommendation system.
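The document-term step described above can be sketched as a simple count matrix, which SVD-based methods such as LSI would then factorize. The three documents below are hypothetical, already-preprocessed reviews, not thesis data.

```python
from collections import Counter

# Toy preprocessed documents (one per movie).
docs = [
    "space shuttle debris space",
    "bleak poetic drama",
    "shuttle exploration drama",
]

# Build a sorted vocabulary and a document-term count matrix
# (one row per document, one column per term).
vocab = sorted({word for doc in docs for word in doc.split()})
counts = [Counter(doc.split()) for doc in docs]
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)      # -> ['bleak', 'debris', 'drama', 'exploration', 'poetic', 'shuttle', 'space']
print(matrix[0])  # -> [0, 1, 0, 0, 0, 1, 2]
```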

In Figure 3.2, we can observe that the movie review talks about the concept space with words such as “science” and “cosmic”, and about genres with words such as “drama” and “thriller”. Hence, a single movie review blends multiple topics in different proportions. Essentially, a movie is a combination of different genres, where each genre can be present in a different proportion. As discussed in Section 2.3.2 of Chapter 2, the LDA model matches this idea of representing a document (a movie in our case) with multiple topics. Hence, we experimented with LDA modeling on the movie reviews dataset to analyze the movie topics.

3.3.2 Movie Topics

For the project, Gensim's [13] Python wrapper for MALLET's LDA [28] is used. MALLET has a number of benefits, such as multi-threading support and a fast implementation of Gibbs sampling. To generate movie topics, we first train the LDA model on the 1k movie reviews corpus. We then obtain the topic distribution by passing a movie review to the trained LDA model.

Figure 3.6 shows five topics generated from reviews of the movie Gravity. Each column represents one topic. It can be observed from the figure that the topics t1,


t2 and t4 represent the movie Gravity with words such as “shuttle”, “exploration”, “debris” and “adrenaline”. The topics t3 and t5 do not give an accurate description and need some more filtering in order to get better topics.

Prototyping with the review dataset gave us the following insights about the quality of topics:

• Preprocess the reviews extensively, remove unnecessary words.

• Use descriptive reviews as they are more useful compared to reviews with justsentiment value.

Ultimately, training an LDA model on movie reviews is just one step in getting good movie features. As a post-processing step, similarity measures can be used to find movies with similar topic distributions.

3.4 Topic Similarity

During prototyping we explored commonly used similarity measures such as Cosine Similarity (CS), Kullback-Leibler (KL) divergence and the Hellinger Distance (HL). As mentioned in 2.4.2, KL divergence is a non-symmetric measure. Hence, we calculated both CS and HL as similarity metrics for ten target movies against the corpus of 1k reviews. The similarity values were then converted to a common similarity score of 0-100 for comparison. Figure 3.7 shows the positive correlation obtained from 50 movie scores, computed separately for both CS and HL. Considering that the movie topics are probability distributions, we used the Hellinger distance as the similarity measure for the experimental setup. The distance metric is calculated as follows:

1. Index the topic distribution of the query movies q and the movie corpus C.

2. Apply distance metric formula on indexed q and C.

3. Sort and pick the top five movies.
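The three steps above can be sketched end to end. The movie names and three-topic distributions below are hypothetical; a real run would use the distributions inferred by the trained LDA model for the query movie q and the corpus C.

```python
import math

def hellinger(p, q):
    """Hellinger distance; a smaller value means more similar."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

# Step 1: indexed topic distributions (hypothetical values).
corpus = {
    "movie_a": [0.6, 0.3, 0.1],
    "movie_b": [0.1, 0.1, 0.8],
    "movie_c": [0.5, 0.4, 0.1],
}
query = [0.6, 0.3, 0.1]  # topic distribution of the target movie

# Steps 2-3: apply the distance metric, sort ascending, and keep the
# closest movies (the thesis keeps the top five).
ranked = sorted(corpus, key=lambda movie: hellinger(query, corpus[movie]))
print(ranked)  # -> ['movie_a', 'movie_c', 'movie_b']
```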


Figure 3.6. Sample topics generated from user movie reviews for the movie Gravity. Each column in the figure lists the top words of one topic:

Topic 1: clooney, willis, bullock, sandra, gravity, mcclane, debris, brucewillis, justin, shuttle
Topic 2: suicide, flashbacks, philosophical, bleak, symbolism, narration, linear, exploration, poetic, artsy
Topic 3: man, damaged, outset, cards, watcher, whathappens, nerve, onhis, maintains, atfirst
Topic 4: johnson, weapons, assassin, bullets, installment, bullet, adrenaline, actionsequences, matrix, combat
Topic 5: gaps, lastedialso, welldone, ifeel, posted, edit, knowthat, insteadofahuge


Figure 3.7. Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the similarity score for the Hellinger distance, whereas the Y-axis represents the cosine similarity score.


Chapter 4

Experimental Setup and Results

4.1 Experimental Setup

We started the setup by collecting the necessary movie reviews from the web. A list of top movies from the last 10 years was created. It consists of roughly the top 50 movies from each year between 2004 and 2013. IMDB creates a popular movie list1 for each year based on user votes. Such a list represents a good mixture of popular genres liked by moviegoers. With the list of 943 movies, user-written movie reviews were scraped and stored in raw HTML format. Next, the text content was extracted from the HTML files using BeautifulSoup [38], an open-source library.

The extracted text is stored in a directory where each text file consists of the user-written movie reviews for a single movie. In total, the corpus has 943 movie review files, as shown in the tree structure in Figure 4.1. The prepared corpus is a balanced mixture of popular genres, as shown in Figure 4.2. This allows us to experiment without any bias towards a particular movie genre.

A point to note is that we also created a corpus by processing the Large Movie Review Dataset [39], but due to the computational complexity we decided to scale down and prototype on the smaller dataset mentioned above.

4.1.1 Text processing

For text processing on the reviews we used NLTK [40], an open-source library. First, we iterate over the corpus and tokenize each file. Tokens are the basic elements of text mining, allowing us to analyze and process text at the word level. Since we now have access to individual words, we remove the punctuation and unwanted words, i.e., stopwords. Stopwords are high-frequency grammatical words which are usually ignored, as they do not provide any useful information. Examples of stopwords are {other, there, the, of, are}. We used NLTK's default stopword list2 for the English language.

1http://www.imdb.com/search/title?year=2013,2013&title_type=feature
2http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/english.stop


src
└── movie-reviews-10-years
    ├── movie1.txt
    ├── movie2.txt
    ├── ...
    └── movie943.txt

Figure 4.1. A tree diagram showing the movie review corpus.

Figure 4.2. Chart showing the genres of popular movies from the last 10 years.

We kept all word tokens longer than two alphabetical characters. Words occurring only once in the whole corpus were also removed.
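The two filtering rules above (keep tokens longer than two characters; drop words occurring only once in the whole corpus) can be sketched as follows. The token lists are toy examples, not thesis data.

```python
from collections import Counter

# Toy tokenized corpus: one token list per movie's reviews.
corpus = [
    ["shuttle", "debris", "shuttle", "ok"],
    ["debris", "drama", "xx", "serendipity"],
]

# Corpus-wide frequencies, then drop tokens that are two characters or
# shorter, or that occur only once in the whole corpus.
freq = Counter(token for doc in corpus for token in doc)
filtered = [[t for t in doc if len(t) > 2 and freq[t] > 1] for doc in corpus]

print(filtered)  # -> [['shuttle', 'debris', 'shuttle'], ['debris']]
```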

During prototyping, we picked a few meaningless topics (from the movie recommendation point of view; words such as good, great, bad, excellent, review, film) and created a stopword list out of them. The visualization in Figure 4.3 shows meaningless topics as high-density blue columns. These columns represent topics with common words present throughout the corpus. With the new list as feedback to the system, we re-preprocessed our corpus and obtained improved topics.


Figure 4.3. A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents the movie reviews, denoted by their corresponding ids, while the horizontal axis represents the movie topics.


4.1.2 Training LDA model

The movie reviews corpus is passed to Gensim's [13] Python wrapper for MALLET LDA [28]. We tested the quality of the generated topics with 100, 150 and 250 topics and found 100 topics to be the right size for our 1k movie review corpus. Having too few or too many topics could affect the quality of the topics. Also, training the model for 100 topics is computationally efficient and saves time. This allows us to repeat the process with different inputs. With the generated movie topics, we decided to investigate further and experimented with using the topics to find similar movies.

4.1.3 Calculating movie similarity

Once the model is trained on the movie reviews, we can infer topic distributions for new, unseen movies by passing their reviews in the same way as we passed the original corpus. We indexed and stored the topic distributions for the ten target movies that we wish to use during evaluation. Using the HL distance metric, we then find movies with a similar distribution. The final result is saved in JSON format and passed to the web evaluation system for evaluation purposes.

4.2 Evaluation

The goal of our experimental evaluation is twofold: first, to evaluate the performance of the system; second, to verify the topics themselves and their effectiveness in representing movies. We chose subjective evaluation with explanation, as it fits our twofold goal. Subjective evaluation measures are expressions of the users about the system or their interaction with the system [41]. They are commonly used to evaluate the usability of recommender systems.

During implementation, we borrowed ideas from approaches used in RS with explanation [42]. A traditional RS behaves like a black box to the end user. This leaves the end user confused as to why a particular movie has been recommended. An RS with explanation can help the user understand the system better and complement it by giving feedback. For our evaluation, the movie topics are presented as an explanation of each recommended movie and as a criterion to receive feedback on it.

4.2.1 Evaluation criteria

Table 4.1 shows the five-point criteria for the evaluation system. Genre, mood and plot are basic to movie similarity. We observed the presence of actor names in the extracted movie topics; hence, evaluating the effect of actor overlap could be useful.


Criteria                Explanation
Genre                   Similarity of genres between the target movie and the recommended movies.
Mood                    Similarity of mood.
Plot                    Similarity of plot.
Overlap                 Overlap of actors/actresses/director or lead cast.
Topic-relevance-score   Relevance of the topics as an explanation of the recommended movies.

Table 4.1. Evaluation criteria used in our web based movie evaluation system.

4.2.2 Web based movie evaluation setup

A web-based movie evaluation system was created to evaluate the results obtained from our experimental setup. Figure 4.4 shows the home page of the system. It allows users to log in over the web and rate movies. Evaluation starts by

• presenting a target movie and a recommended movie. Judges then rate therecommended movie based on various evaluation criteria.

• presenting an explanation in the form of movie topics; the judges then re-rate the movie after reading the explanation.

We decided to show an explanation for each recommended movie in order to evaluate how well the topics represent a movie. Figures 4.5 and 4.6 show the two-step evaluation mentioned above. Finally, for each target movie, five movies are presented following these steps. The ratings are then saved to a database and later used to analyze the results. For the project, three judges were invited to rate movies. Our judges are regular moviegoers and watch movies from a wide spectrum of genres.


Figure 4.4. Front-page of the movie evaluation system, showing five target movies.A user clicks on a target movie and five similar movies are presented for evaluation.


Figure 4.5. Web-based movie evaluation system. Shown on the left is a target movie. The front page upon log-in shows 10 target movies.


Figure 4.6. Movie evaluation system with explanation.


4.3 Results

Initially we did an evaluation for ten target movies, but realized that subjective evaluation is a slow process. Hence, we updated the system and did the evaluation for five target movies only. Figures 4.7-4.11 show the evaluation results for the evaluation criteria. For each of the evaluation criteria, we also show the movie topics as an explanation. The result for evaluation with explanation is shown in the bottom visualization of each figure. The movie judges rated the movies on a scale of 1-4, where 1 means “Not Similar”, 2 means “Somewhat Similar”, 3 means “Similar”, and 4 means “Perfect”, representing that the user is happy with the recommended movie.

4.3.1 Evaluation result

Some observations about the results:

• Out of the five movies given to the judges for evaluation, one movie had a similarity score of 40-50% and hence received the lowest ratings, whereas for most of the other movies the scores were in the range of 50-70%.

• As shown in the top of Figure 4.7, the genre criterion shows 30-35% of the ratings between 2 and 3, with a median of 3 for genre only and 2 for genre with explanation. As observable in the re-ratings (bottom figure), the movie topics differ slightly from the judges' understanding of genre. But overall, the judges agree with the movie topics as an explanation of movie genre.

• As shown in the top of Figure 4.8, the mood criterion shows 25-30% of the ratings between 2 and 3, with a median of 2. Both genre and mood information have been captured quite well by the movie topics. As observable, the ratings (top) and re-ratings (bottom) are almost the same for the mood evaluation. Hence, the judges agree with the movie topics as an explanation of the mood aspect of movies.

• As shown in Figure 4.9, the majority of recommended movies are not similar at all in terms of movie plot, with 40% of the ratings given to “Not Similar”. This shows that capturing the plot is much more difficult than capturing genre or mood. The ratings (top) and re-ratings (bottom) are almost the same for the plot evaluation. Hence, the judges agree that plot information is not well captured by the movie topics, and more information is needed to recommend movies with similar plots.

• In the LDA model, the order of words and the order of documents are not considered. For modeling movie plot information, a time-based description of concepts and events is important. In order to better extract plot information, topics must evolve over the timeline of a movie.


• The overlap criterion has been rated “Not Similar” by 60-70% of the ratings. This is understandable, as our collection contains only 1k movies and it is difficult to find overlaps of actors within a smaller corpus. Again, the ratings (top) and re-ratings (bottom) are almost the same for the overlap evaluation. Hence, the judges agree that actor overlaps are not well captured by the movie topics.

• Figure 4.11 shows the ratings for the topic relevance score. A combined rating of 73% is given between 2 and 3. This represents the overall usefulness of the topics in finding similar movies and of using topics as an explanation.

• We did the subjective evaluation on a smaller scale, as rating movies is a slow process. The judges needed some time to watch previously unseen movies before rating them.

• Overall, the genre, mood and topic relevance score criteria have shown useful results.

It is important to observe that the topics generated by the LDA model change every time the model is re-trained. Running the model anew generates a different set of topics, slightly changed from the previous one. Hence, the final list of similar movies might change as well, based on the generated topics.

4.3.2 Rating correlation

• Although we analyzed the other criteria for correlation as well, we observed a strong positive correlation only between genre and mood, as shown in Figure 4.12.

• For the correlation between the judges' ratings, we analyzed all the ratings. Figure 4.13 shows a strong positive correlation between two judges: both judges agree with each other, with 34.4% of the ratings in rank 1, followed by 13.6% of the ratings in rank 2.

4.3.3 Observations on subjective evaluation

Subjective evaluation is useful for getting feedback on recommended movies. Our judges gave feedback about the movie topics and showed interest in rating topics individually in a future evaluation. This could be useful for filtering noise and maintaining a top-rated list of movie topics. Although our system has only 100 topics, topic rating could be highly relevant when building a hierarchical list of movie topics, as the key challenge with a higher number of topics is to keep the good topics and remove the bad ones. In the end, subjective evaluation has a time constraint, as evaluating movies and topics individually is a slow process, but the outcome is a quite accurate conclusion about the extracted features, the recommended movies and the system itself.


Figure 4.7. Result of average rating for Genre (top) and Genre with explanation(bottom).


Figure 4.8. Result of average rating for Mood (top) and Mood with explanation(bottom).


Figure 4.9. Result of average rating for Plot (top) and Plot with explanation(bottom).


Figure 4.10. Result of average rating for Overlap (top) and Overlap with explana-tion (bottom).


Figure 4.11. Result shows average ratings for the movie topics.

Figure 4.12. Strong positive correlation between Genre and Mood.


Figure 4.13. Strong positive correlation of ratings between two judges. The judges agree most often on rating 1, followed by ratings 2 and 3.


Chapter 5

Conclusion and Future Directions

5.1 Conclusion

In this project, we developed a prototype system for extracting movie features, i.e., topics. We trained a model on a collection of movie reviews and used the trained model to find similar movies based on the Hellinger distance between movie topic distributions.

The evaluation results show that such an approach gives good results even with a small movie collection. The results show that movie topics are efficient features, as they perform fairly well in capturing movie genre and mood. The movie plot results are somewhat satisfactory, but descriptive plot information and better methods that can capture the story-line are needed. Our small movie corpus resulted in very little overlap between actors. Topics as an explanation in movie recommendation are quite useful, but need to be fine-tuned with the ability to rate individual topics. User-rated movie topics could be used as feedback to the system.

Finally, movie topics are efficient features for movie recommendation systems, as they represent the semantic patterns behind movies. With user movie reviews as data, movie topics capture essential movie aspects such as genre and mood. Our prototyping approach to feature extraction has the potential to scale to a large number of movies.

5.2 Future Directions

In this project, we considered user-written movie reviews for extracting features. Such a method could be extended or combined with other forms of movie metadata such as plot, genres and keywords. With the recent advancements in deep learning, it would be interesting to study the effect of using LDA as a preprocessing step in deep learning analysis of movie reviews. In the following, we discuss a few interesting future directions.


5.2.1 Movie review preprocessing

The basic LDA model itself does not care about word order. As is easily observable, word order matters in several cases, especially for bi-gram movie keywords such as “dark comedy” or “nordic horror”. We did a small experiment with bi-grams but ended up with noisy bi-gram-based movie topics, as the bi-grams were not consistent in their representation. Approaches from language construction [43] could be used to create multi-word movie keywords. Finally, extracting and using word constructions from movie reviews has the potential to further capture movie semantics.

5.2.2 Building complex topic models

The LDA model can be considered a base model, and more complex models can be built on top of it based on the needs we have for the data at hand. The Correlated Topic Model (CTM) [44] and the Dynamic Topic Model (DTM) [45] are such models built on top of LDA. For example, DTM could be used to observe changing movie patterns over the years. With TV shows being made for 10-15 seasons, DTM could highlight the rise and fall of characters over the seasons.

Topic models can also be extended to include additional information such as metadata. For example, author-topic models attach the topic proportions to authors, making it possible to calculate author similarity [27] based on topic proportions. Hierarchical LDA models [46] are another direction to explore, as extending hundreds of topics to thousands could represent a wide spectrum of movie genres. Recommending movies based on the topics liked by users, and rating the topics themselves, are some of the ways to improve the extracted topics and build a system based on topic modeling.

With so many choices of streamable content, the challenge is to efficiently extract features from all forms of metadata, recommend relevant content to the end user, and keep serendipity in the recommendations.


Bibliography

[1] J. Booton, One-click Netflix button to make movie streaming even easier, Fox Business, Aug. 2011. [Online]. Available: http://www.foxbusiness.com/markets/2011/01/04/click-netflix-button-appear-remote-controls-movie-streaming/ (visited on Jun. 11, 2014).

[2] A. C. Madrigal, How Netflix reverse engineered Hollywood, Jan. 2014. [Online]. Available: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/ (visited on May 13, 2014).

[3] Me TV: how jinni is revolutionizing search. [Online]. Available: http://www.forbes.com/sites/dorothypomerantz/2013/02/18/me-tv-how-jinni-is-revolutionizing-search/ (visited on May 13, 2014).

[4] B. Fritz, “Cadre of film buffs helps netflix viewers sort through the clutter”,en-US, Los Angeles Times, Sep. 2012, issn: 0458-3035. [Online]. Available:http://articles.latimes.com/2012/sep/03/business/la-fi-0903-ct-netflix-taggers-20120903 (visited on May 15, 2014).

[5] J. Layton. (May 2006). How pandora radio works, [Online]. Available: http://computer.howstuffworks.com/internet/basics/pandora.htm.

[6] X. Amatriain, The netflix tech blog: netflix recommendations: beyond the 5stars (part 1). [Online]. Available: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html (visited on Jun. 11,2014).

[7] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classificationusing machine learning techniques”, in Proceedings of EMNLP, 2002, pp. 79–86.

[8] S. Ah and C.-K. Shi, “Exploring movie recommendation system using cul-tural metadata”, in 2008 International Conference on Cyberworlds, Sep. 2008,pp. 431–438. doi: 10.1109/CW.2008.13.

[9] A. Blackstock and M. Spitz, “Classifying movie scripts by genre with a MEMM using NLP-based features”, Stanford, M.Sc. Course Natural Language Processing, student report, Jun. 2008. [Online]. Available: http://nlp.stanford.edu/courses/cs224n/2008/reports/06.pdf.


[10] R. Berendsen, “Movie reviews: do words add up to a sentiment?”, PhD thesis,Rijksuniversiteit Groningen, Sep. 2010.

[11] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews”, in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, ser. CSCW ’94, Chapel Hill, North Carolina, USA: ACM, 1994, pp. 175–186, isbn: 0-89791-689-1. doi: 10.1145/192844.192905. [Online]. Available: http://doi.acm.org/10.1145/192844.192905.

[12] S. Bird, E. Klein, and E. Loper, Natural language processing with Python,1st ed. O’Reilly, 2009.

[13] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling withLarge Corpora”, English, in Proceedings of the LREC 2010 Workshop onNew Challenges for NLP Frameworks, http://is.muni.cz/publication/884893/en, Valletta, Malta: ELRA, May 22, 2010, pp. 45–50.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: machine learning in Python”, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[15] J. Vig, S. Sen, and J. Riedl, “Tagsplanations”, in Proceedings of the 13th International Conference on Intelligent User Interfaces - IUI ’09, 2009. doi: 10.1145/1502650.1502661. [Online]. Available: http://dx.doi.org/10.1145/1502650.1502661.

[16] T. Luostarinen and O. Kohonen, “Using topic models in content-based news recommender systems”, English, in nodalida13, ser. 19, vol. 85, Oslo, Norway: Linköping University Electronic Press, Linköping, Sweden, May 2013, p. 239, isbn: 978-91-7519-589-6. [Online]. Available: http://emmtee.net/oe/nodalida13/conference/11.pdf.

[17] R. K. V and K. Raghuveer, “Legal documents clustering using latent dirichlet allocation”, International Journal of Applied Information Systems, vol. 2, no. 6, pp. 27–33, May 2012, published by Foundation of Computer Science, New York, USA.

[18] R. Krestel, P. Fankhauser, and W. Nejdl, “Latent dirichlet allocation for tag recommendation”, in Proceedings of the Third ACM Conference on Recommender Systems - RecSys ’09, 2009. doi: 10.1145/1639714.1639726. [Online]. Available: http://dx.doi.org/10.1145/1639714.1639726.

[19] C. Wang and D. M. Blei, “Collaborative topic modeling for recommending scientific articles”, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11, San Diego, California, USA: ACM, 2011, pp. 448–456, isbn: 978-1-4503-0813-7. doi: 10.1145/2020408.2020480. [Online]. Available: http://doi.acm.org/10.1145/2020408.2020480.

[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation”, J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003, issn: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944937.

[21] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review”, CSUR,vol. 31, no. 3, pp. 264–323, 1999. doi: 10.1145/331499.331504. [Online].Available: http://dx.doi.org/10.1145/331499.331504.

[22] A. Huang, “Similarity measures for text document clustering”, in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, 2008, pp. 49–56.

[23] S. Bordag, “A comparison of co-occurrence and similarity measures as simulations of context”, in Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, 2008, pp. 52–63. [Online]. Available: http://dl.acm.org/citation.cfm?id=1787584.

[24] C. Perone. (Sep. 2013). Machine learning :: cosine similarity for vector space models (part III) | Pyevolve, [Online]. Available: http://pyevolve.sourceforge.net/wordpress/?p=2497.

[25] T. Hofmann, “Probabilistic latent semantic indexing”, in Proceedings of the22nd annual international ACM SIGIR conference on Research and develop-ment in information retrieval, ACM, 1999, pp. 50–57.

[26] A. Aichert, “Feature extraction techniques”, in CAMP MEDICAL SEMINAR,2008.

[27] D. M. Blei, “Introduction to probabilistic topic models”, Communications ofthe ACM, 2011. [Online]. Available: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf.

[28] A. K. McCallum, Mallet: a machine learning for language toolkit, 2002. [On-line]. Available: http://mallet.cs.umass.edu.

[29] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008, isbn: 0521865719, 9780521865715.

[30] D. Olszewski, “Fraud detection in telecommunications using Kullback-Leibler divergence and latent dirichlet allocation”, English, in Adaptive and Natural Computing Algorithms, ser. Lecture Notes in Computer Science, A. Dobnikar, U. Lotrič, and B. Šter, Eds., vol. 6594, Springer Berlin Heidelberg, 2011, pp. 71–80, isbn: 978-3-642-20266-7. doi: 10.1007/978-3-642-20267-4_8. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-20267-4_8.

[31] Wikipedia. (Sep. 2014). Kullback–Leibler divergence, [Online]. Available: https://en.wikipedia.org/wiki/Kullback-Leibler_divergence.


[32] DocEng 2011: document visual similarity measure for document search, Oct. 2011. [Online]. Available: https://www.youtube.com/watch?v=KVFY-r-BLJQ&feature=youtube_gdata_player (visited on Aug. 26, 2014).

[33] D. M. Blei and J. D. Lafferty, “Topic models”, Text mining: classification,clustering, and applications, vol. 10, p. 71, 2009.

[34] P. Harsha. (Sep. 2011). Hellinger distance, [Online]. Available: http://www.tcs.tifr.res.in/~prahladh/teaching/2011-12/comm/lectures/l12.pdf.

[35] G. van Rossum and F. L. Drake, The Python Language Reference Manual.Network Theory Ltd., 2011, isbn: 1906966141, 9781906966140.

[36] D. Crystal, What is a corpus? What is corpus linguistics?, English, university website. [Online]. Available: http://www.tu-chemnitz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm.

[37] C. Kiehl. (Dec. 2013). Parallelism in one line, [Online]. Available: https://medium.com/@thechriskiehl/parallelism-in-one-line-40e9b2b36148.

[38] L. Richardson. (Apr. 2007). Beautiful soup documentation, [Online]. Avail-able: http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

[39] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis”, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA: Association for Computational Linguistics, Jun. 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015.

[40] E. Loper and S. Bird. (May 17, 2002). NLTK: The Natural Language Toolkit. arXiv: cs/0205028, [Online]. Available: http://arxiv.org/abs/cs/0205028.

[41] Recsyswiki. (Feb. 2011). Subjective evaluation measures, [Online]. Available:http://recsyswiki.com/wiki/Subjective_evaluation_measures.

[42] N. Tintarev and J. Masthoff, “Designing and evaluating explanations for recommender systems”, English, in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Springer US, 2011, pp. 479–510, isbn: 978-0-387-85819-7. doi: 10.1007/978-0-387-85820-3_15. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-85820-3_15.

[43] M. Sahlgren and O. Knutsson, “Proceedings of the workshop on extractingand using constructions in nlp”, Swedish Institute of Computer Science, 2009.

[44] D. M. Blei and J. D. Lafferty, “A correlated topic model of science”, TheAnnals of Applied Statistics, vol. 1, no. 1, pp. 17–35, Jun. 2007. doi: 10.1214/07-AOAS114. [Online]. Available: http://dx.doi.org/10.1214/07-AOAS114.


[45] D. M. Blei and J. D. Lafferty, “Dynamic topic models”, in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06, Pittsburgh, Pennsylvania: ACM, 2006, pp. 113–120, isbn: 1-59593-383-2. doi: 10.1145/1143844.1143859. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143859.

[46] D. Griffiths and M. Tenenbaum, “Hierarchical topic models and the nested chinese restaurant process”, Advances in Neural Information Processing Systems, vol. 16, p. 17, 2004.