Presentation on a specialist topic in Data Mining and Text Analytics (TV and Films)

By Klejdi Muca & Stephen Quinn


Post on 25-Feb-2016


Page 1

By Klejdi Muca & Stephen Quinn

Presentation on a specialist topic in Data Mining and Text Analytics

(TV and Films)

Page 2

What is Data Mining?

A method used by companies such as IMDB and Netflix to turn raw data into useful information. For example:

• It helps companies concentrate on the most important behavioural data they have collected from their users, and even from potential users.

• It enables companies such as Blockbuster to mine their video rental history database and recommend rentals to individual customers.

• The techniques and algorithms used in data mining do more than change how data is presented; they discover previously unknown relationships in the data.
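As a rough illustration of mining a rental history for recommendations, the sketch below counts which titles are rented by the same customers and suggests the most frequent companions of what a customer has already rented. The customer names, titles and the `recommend` helper are all hypothetical; this is not a description of how Blockbuster or any other company actually does it.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical rental histories: customer -> set of titles rented.
RENTALS = {
    "alice": {"Alien", "Blade Runner", "The Matrix"},
    "bob": {"Alien", "Blade Runner"},
    "carol": {"Blade Runner", "The Matrix"},
    "dave": {"Alien", "Toy Story"},
}


def build_cooccurrence(rentals):
    """Count how often each ordered pair of titles was rented by the same customer."""
    counts = defaultdict(Counter)
    for titles in rentals.values():
        for a, b in permutations(titles, 2):
            counts[a][b] += 1
    return counts


def recommend(customer, rentals, top_n=2):
    """Suggest unseen titles that co-occur most often with the customer's rentals."""
    counts = build_cooccurrence(rentals)
    seen = rentals[customer]
    scores = Counter()
    for title in seen:
        for other, n in counts[title].items():
            if other not in seen:
                scores[other] += n
    # Sort by score (descending), then by title, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


print(recommend("bob", RENTALS))
```

Counting co-occurrences is about the simplest relationship that can be mined from such a table; real systems would also weight by recency, popularity and so on.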

Page 3

Introduction to IMDB

The Internet Movie Database (IMDB) provides current film and TV programme information to users free of charge. It includes plot summaries, actors and production crew and, significantly, a rating system that allows users to rate films on a scale of one to ten. “The database aims to capture any and all information associated with movies from any part of the world, starting with the earliest cinema to the very latest releases.” IMDB uses data mining techniques to find relationships in its dataset and structures the data so that users can navigate the website easily and efficiently.

Page 4

Movie Popularity Classification based on Inherent Movie Attributes using C4.5 (IMDB)

In 2012, a project at the American International University-Bangladesh (AIUB) attempted to create a classification scheme for pre-release movie popularity based on inherent attributes, using C4.5 (an algorithm used to generate a decision tree). The aim was a system that would predict how popular a film or TV title would be, based on relationships found in data gathered from other film and TV titles. The data gathered included:

• production budget
• actors
• directors
• country
• language
• release date

All of this information was parsed and inserted into an SQL database. Queries were then used to build the final data sets, which were analysed in WEKA for patterns in the relationships. Examples include whether spending more money on a film results in a greater financial return, or whether films directed by a certain director are more likely to be popular.
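C4.5 chooses which attribute to split on by comparing gain ratios (information gain divided by split information). The sketch below computes that criterion on a tiny made-up data set; the rows, the attribute bands and the function names are assumptions for illustration and are not taken from the AIUB project. It leaves out the rest of C4.5, such as handling continuous attributes and pruning.

```python
import math
from collections import Counter

# Hypothetical training rows: (budget band, language, popularity class).
ROWS = [
    ("high", "English", "popular"),
    ("high", "English", "popular"),
    ("high", "Other", "popular"),
    ("low", "English", "unpopular"),
    ("low", "Other", "unpopular"),
    ("low", "English", "popular"),
]


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attr_index, label_index=-1):
    """Gain ratio of splitting the rows on the attribute at attr_index."""
    labels = [row[label_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[label_index])
    total = len(rows)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attr_index] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


for name, index in (("budget", 0), ("language", 1)):
    print(name, round(gain_ratio(ROWS, index), 3))
```

On this toy data the budget band scores far higher than language, so a C4.5-style tree would split on budget first.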

Page 5

Figure 1: Overview of Process Architecture

Conclusion

“The model and theoretical machine learning steps as shown in this paper will benefit various internet sites that are dealing with movie information. It will also aid producers and directors. It will also assist the film financing organizations to make decisions on movie rentals, streaming services, brand sponsorship, etc”

Page 6

Introduction to Netflix

Netflix is an American internet streaming service that provides on-demand TV programmes and films to its subscribers. Netflix mines the films and TV programmes that each subscriber has watched, along with the ratings they gave, then uses data mining techniques to find patterns in that data and produce recommendations for the subscriber.
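One simple way to turn past ratings into a recommendation is user-based collaborative filtering: predict a subscriber's rating for an unseen title from the ratings of subscribers with similar taste. The sketch below is a minimal version of that idea using cosine similarity over co-rated titles; the subscribers, titles and ratings are invented, and it is not a description of Netflix's actual system.

```python
import math

# Hypothetical star ratings: subscriber -> {title: stars}.
RATINGS = {
    "ann": {"Heat": 5, "Up": 1, "Jaws": 4},
    "ben": {"Heat": 4, "Up": 2, "Jaws": 5, "Big": 4},
    "cat": {"Heat": 1, "Up": 5, "Big": 2},
}


def cosine(u, v):
    """Cosine similarity between two subscribers over the titles both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = math.sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)


def predict(user, title, ratings):
    """Similarity-weighted average of other subscribers' ratings for the title."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, theirs in ratings.items():
        if other == user or title not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        weighted_sum += sim * theirs[title]
        weight_total += sim
    # None signals that nobody comparable has rated the title.
    return weighted_sum / weight_total if weight_total else None


print(round(predict("ann", "Big", RATINGS), 2))
```

Here ann's tastes are much closer to ben's than to cat's, so the prediction for "Big" leans towards ben's rating of 4.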

The 'Netflix Prize' began on 2 October 2006. The aim of the competition was for entrants to create a collaborative filtering algorithm that improved the accuracy of Netflix's rating predictions by 10%. The winners were the team BellKor's Pragmatic Chaos, who in 2009 achieved an improvement of 10.06%.

Why did they do this? Customer satisfaction and retention are key to Netflix, so it has a strong interest in improving its recommendation system.
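The slides do not say how the improvement was measured; the competition is widely documented as having used root mean squared error (RMSE) between predicted and actual star ratings. The sketch below shows that calculation and how a percentage improvement follows from two RMSE values. The figures 0.9525 (Netflix's existing system) and 0.8567 (the winning entry) are the commonly reported test-set scores and are quoted here as an assumption, not from the presentation.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def improvement_percent(baseline_rmse, new_rmse):
    """Relative reduction in RMSE, expressed as a percentage of the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse * 100


# Toy ratings to show the error metric itself.
print(round(rmse([3.5, 4.0, 2.0], [4, 4, 1]), 3))

# Commonly reported Netflix Prize test-set figures (assumed, see lead-in).
print(round(improvement_percent(0.9525, 0.8567), 2))
```

A lower RMSE means predictions sit closer to the ratings users actually gave, which is why the prize target was stated as a percentage reduction.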

Page 7

Introduction to

Page 8

Data Mining Techniques

Classification

This technique is commonly used to predict a specific outcome, such as a star rating or whether a user is likely to watch a TV programme or film.

Attribute Importance

This technique ranks attributes by the strength of their relationship with a target attribute, for example the relationship between a film's budget and how popular the film will be. The same can be done for the actors, actresses or directors involved with a film, to estimate how likely the film is to be popular based on those attributes.
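For numeric attributes, one straightforward way to rank the strength of a relationship with the target is the Pearson correlation coefficient. The sketch below ranks two attributes against a popularity score; the film figures and attribute names are invented, and other importance measures (such as information gain) are equally valid choices.

```python
import math

# Hypothetical films: budget in millions, runtime in minutes, popularity score.
FILMS = [
    {"budget_m": 10, "runtime": 120, "popularity": 5.0},
    {"budget_m": 50, "runtime": 95, "popularity": 6.0},
    {"budget_m": 100, "runtime": 130, "popularity": 6.5},
    {"budget_m": 150, "runtime": 100, "popularity": 7.5},
    {"budget_m": 200, "runtime": 110, "popularity": 8.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(rows, target, attributes):
    """Rank attributes by absolute correlation with the target, strongest first."""
    ys = [row[target] for row in rows]
    scored = [(attr, pearson([row[attr] for row in rows], ys)) for attr in attributes]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


for attr, r in rank_attributes(FILMS, "popularity", ["budget_m", "runtime"]):
    print(attr, round(r, 3))
```

Correlation only captures linear relationships and says nothing about cause, so a high score for budget here would be a prompt for further analysis rather than a conclusion.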

Page 9

Data Mining Techniques

Clustering

This technique is used to find natural groups within a data set, for example movie genres, films by a certain director, or TV programmes and films that feature a specific actor or actress.

Anomaly Detection

This technique is used to detect results that do not follow the normal pattern. A good example comes from the Netflix Prize, where the film ‘Napoleon Dynamite’ caused problems for participants because users' ratings of it varied so widely: some rated the film poorly whereas others rated it very highly, which made it very hard to predict how a given user would rate it. Some contestants claimed to be out by eight-tenths of a star on average, but off by an average of 1.2 stars on films such as ‘Napoleon Dynamite’.
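A very simple way to flag titles like the ‘Napoleon Dynamite’ case is to look at how widely each title's ratings are spread. The sketch below marks a title as polarising when the standard deviation of its star ratings exceeds a threshold; the titles, ratings and the threshold of 1.5 are assumptions chosen for illustration.

```python
from statistics import pstdev

# Hypothetical star ratings per title.
RATINGS = {
    "Title A": [4, 4, 5, 4, 4, 5],
    "Title B": [3, 3, 4, 3, 3, 4],
    "Title C": [1, 5, 1, 5, 5, 1],  # love-it-or-hate-it pattern
    "Title D": [2, 2, 3, 2, 3, 2],
}


def rating_spread(ratings):
    """Population standard deviation of the star ratings for each title."""
    return {title: pstdev(stars) for title, stars in ratings.items()}


def polarising_titles(ratings, threshold=1.5):
    """Titles whose rating spread exceeds the (assumed) threshold."""
    spread = rating_spread(ratings)
    return sorted(title for title, sd in spread.items() if sd > threshold)


print(polarising_titles(RATINGS))
```

Titles flagged this way are the ones where an average-based prediction is least trustworthy, which matches the difficulty the contestants described.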

Page 10

What is Text Analytics?

Text analytics is the process of extracting high-quality information and knowledge from a piece of text. This is done through the use of software such as:

• Autonomy
• AeroText
• Medallia

These pieces of software analyse the text to find patterns and trends through statistical pattern learning.

It is commonly estimated that around 80% of the world's information is stored as unstructured text.

Page 11

Text Analytics

We can analyse the popularity of a film or TV programme by extracting reviews from websites such as Rotten Tomatoes, IMDB and Twitter. Both Rotten Tomatoes and Twitter provide APIs (application programming interfaces) that allow us to write a program to query their data and extract what we need. IMDB, however, does not provide an API, meaning we would have to extract the data manually.

Page 12

Text Analytics Techniques

On Twitter we can search for a title by its hashtag or by any words that relate to it. For example, for the TV series Breaking Bad, a user can search for Breaking Bad or #BreakingBad to see what other users around the world are saying about it.

To refine the results, the user can add key words such as good, amazing or terrible to the search and be presented with other people's opinions of the title.

Each tweet can then be analysed to find key words and phrases that are commonly used, giving an understanding of the trends and patterns.
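Once tweets have been collected, finding commonly used key words can be sketched with a word count, and a rough opinion label can come from a small word list. In the example below the tweets, the stop-word list and the positive/negative word lists are all made up; fetching tweets from Twitter is deliberately left out, and real sentiment analysis needs far more than keyword matching.

```python
import re
from collections import Counter

# Hypothetical tweets, standing in for results of a hashtag search.
TWEETS = [
    "Breaking Bad is amazing, best show ever #BreakingBad",
    "Just finished #BreakingBad and the ending was amazing",
    "Not sure why people like Breaking Bad, the pilot was terrible",
]

# Tiny illustrative word lists (assumptions, not a real lexicon).
STOPWORDS = {"is", "and", "the", "was", "just", "not", "why", "ever"}
POSITIVE = {"good", "amazing"}
NEGATIVE = {"terrible"}


def tokenize(text):
    """Lower-case the text and split it into words and hashtags."""
    return re.findall(r"#?[a-z']+", text.lower())


def top_keywords(tweets, n=5):
    """Most common non-stop-word tokens across all tweets."""
    counts = Counter(
        word for tweet in tweets for word in tokenize(tweet) if word not in STOPWORDS
    )
    return counts.most_common(n)


def sentiment(tweet):
    """Label a tweet by comparing counts of positive and negative key words."""
    words = set(tokenize(tweet))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(top_keywords(TWEETS))
print([sentiment(tweet) for tweet in TWEETS])
```

The word "bad" is kept out of the negative list on purpose, since it appears in the title itself; that kind of adjustment is typical when keyword lists are applied to film and TV names.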