Post on 15-Jan-2015

A Probabilistic Approach to Tweets’ Sentiment Classification

Francesco Colace, Massimo De Santo, Luca Greco

DIEM – Università degli Studi di Salerno

{fcolace, desanto, lgreco}@unisa.it

ACII 2013 – Geneva, 2-5 September 2013

Motivation

Web 2.0 (or Web X.Y) rules!

Social networks, blogs, microblogs, review collector sites: a huge quantity of heterogeneous and opinionated data

Motivation

Open issues:

o How to manage this information?
o How to extract the sentiment inside the data?
o How to understand something about the users?
o How to evaluate the opinion of people about some topics or products?

Sentiment Analysis

Outline

Brief introduction to Sentiment Analysis
o Related Works

Towards a Sentiment Analysis Framework
o The Proposed Approach
• The LDA Approach
• The Mixed Graph of Terms
• A sentiment mining algorithm

Experimental results

Conclusions and Future Works

Sentiment Analysis

Sentiment:

o a thought, view, or attitude, especially one based mainly on emotion instead of reason

Sentiment Analysis (also known as Opinion Mining):

o use of Natural Language Processing (NLP) and computational techniques to automate the extraction and classification of sentiment from unstructured texts

Sentiment Analysis: Why?

Consumer information
o Product reviews (Amazon, eBay, …)

Marketing
o Consumer attitudes
o Trends

Politics
o Politicians want to know voters’ points of view
o Voters want to know politicians’ stances and who else supports them

Social
o Find like-minded individuals or communities

Sentiment Analysis: Open Issues

Which features to adopt?
o Words
o Sentences

How to interpret features for sentiment detection?
o As a bag of words
o By the use of annotated lexicons
o According to syntactic patterns
o By analyzing the paragraph structure

Sentiment Analysis: Approaches

Naïve Bayes

Maximum Entropy Classifier

SVM

Markov Blanket Classifier

… … …

Latent Dirichlet Allocation (LDA)

The Proposed Approach: from the Bag-of-Words …

By the use of the Bag of Words approach, a document can be represented as an unordered collection of words (word order is discarded, only occurrences are kept)

Problems:

o Which words best express the sentiment in a text?

o How to compare the various «bags of words» derived from texts with the same sentiment?

o Is it possible to represent the documents’ domain of interest by the use of a bag of words?
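As a small illustration of the bag-of-words representation discussed on this slide, the sketch below builds a word multiset with Python's standard library. The tokenisation rule and the function name `bag_of_words` are assumptions made for this example only; they are not taken from the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset of lowercased word tokens; word order is discarded."""
    # Assumed tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

a = bag_of_words("Great movie, great cast")
b = bag_of_words("cast great, movie great")

# Both texts yield the same bag because only word counts are kept.
print(a == b)
print(a["great"])
```

Two texts with the same words in a different order are indistinguishable in this representation, which is one reason the slides move on to a graph-based structure.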

… to mixed Graph of Terms (mGT)

The mixed Graph of Terms is a «graph based» representation of documents

In the proposed approach, a mixed Graph of Terms is obtained by an automatic extraction of words based on probabilistic clustering techniques such as Latent Dirichlet Allocation (LDA)

In a mixed Graph of Terms the words are linked according to their mutual occurrence probability, and an «aggregating_word» and its «aggregated_words» can be recognized

Our proposal: a mixed Graph of Terms can be used as a «sentiment filter»
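To make the idea of linking words by their mutual occurrence probability more concrete, here is a minimal sketch. It is not the authors' extraction procedure: the slides derive the probabilities through LDA, whereas this example substitutes a plain document-level co-occurrence frequency. The function name `cooccurrence_graph` and the `min_prob` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs, min_prob=0.0):
    """Map each unordered word pair to the fraction of documents containing both words."""
    pair_counts = Counter()
    for doc in docs:
        # Sorting gives every pair a canonical (alphabetical) order.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    n = len(docs)
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_prob}

docs = ["good plot good cast", "good cast", "bad plot"]
graph = cooccurrence_graph(docs)
for pair, prob in sorted(graph.items()):
    print(pair, round(prob, 2))
```

Each key is an edge of the graph and each value is its weight; raising `min_prob` prunes weak links.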

mGT: a different point of view

In the proposed approach, two different layers can be recognized in a mixed Graph of Terms:

The Aggregator Layer: the words with the highest degree of interconnection with the other words in the documents

The “Aggregated Words” Layer: the words that have a high degree of interconnection with one or more aggregator words
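The two layers can be sketched on a weighted word graph as follows. This is an illustrative assumption, not the selection rule used in the paper: aggregators are taken to be the `n_aggregators` words with the largest total edge weight, and every other word is attached to the aggregators it is linked to. The function name `split_layers` and the example weights are hypothetical.

```python
from collections import defaultdict

def split_layers(edges, n_aggregators):
    """edges: {(w1, w2): weight}. Returns (aggregators, aggregated), where
    aggregated maps each aggregator to its linked non-aggregator words."""
    strength = defaultdict(float)
    for (w1, w2), weight in edges.items():
        strength[w1] += weight
        strength[w2] += weight
    # Rank by total edge weight, breaking ties alphabetically.
    ranked = sorted(strength, key=lambda w: (-strength[w], w))
    aggregators = ranked[:n_aggregators]
    agg_set = set(aggregators)
    aggregated = {a: [] for a in aggregators}
    for (w1, w2), weight in edges.items():
        if w1 in agg_set and w2 not in agg_set:
            aggregated[w1].append((w2, weight))
        elif w2 in agg_set and w1 not in agg_set:
            aggregated[w2].append((w1, weight))
    return aggregators, aggregated

edges = {("good", "cast"): 0.6, ("good", "plot"): 0.5,
         ("bad", "plot"): 0.3, ("bad", "acting"): 0.4, ("bad", "boring"): 0.5}
aggregators, aggregated = split_layers(edges, n_aggregators=2)
print(aggregators)
print(aggregated)
```

Note that a word such as "plot" can be aggregated by more than one aggregator, which matches the "one or more" wording of the slide.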

Latent Dirichlet Allocation

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups, which account for why some parts of the data are similar

For example, if the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's occurrence is attributable to one of the document's topics

The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words

By the use of Latent Dirichlet Allocation, a set of documents can be represented as a mixed Graph of Terms
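The generative story behind LDA can be sketched in a few lines of standard-library Python. This example only samples documents from given topics; it does not perform inference, and it is not the authors' implementation. The topic-word distributions are fixed by hand here (in full LDA they are themselves drawn from a Dirichlet prior), and the names `sample_dirichlet` and `generate_document`, the toy vocabulary, and the `alpha` values are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """topics: list of {word: probability} dicts, one per topic."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Choose a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[k].items())
        # Choose the word from the selected topic's word distribution.
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)
topics = [{"goal": 0.5, "match": 0.5}, {"vote": 0.5, "party": 0.5}]
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA means inverting this process: given only the words, estimate the topic mixtures and the topic-word distributions.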

Extraction of a Mixed Graph of Terms

mGT: an example

Sentiment Classification by the use of mGT

Step_1: Learn a mixed Graph of Terms from labelled documents (i.e. Positive or Negative), obtaining:
o mGT positive
o mGT negative

Step_2: Use the mixed Graphs of Terms as filters in order to classify the sentiment of texts
o Comparing concepts that are both in the mGTs and in the text
o Comparing words that are both in the mGTs and in the text
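The two steps above can be sketched as a simple filter, under loudly stated assumptions. The slides do not give the scoring function, so this example uses a hypothetical rule: each mGT is a dictionary of weighted word pairs, a text's score against an mGT is the sum of the weights of the pairs whose two words both occur in the text, and the label with the higher score wins. The names `mgt_score` and `classify`, the toy graphs, and the "undecided" outcome for ties are all illustrative.

```python
def mgt_score(text, mgt_pairs):
    """Sum the weights of the mGT word pairs whose two words both occur in the text."""
    words = set(text.lower().split())
    return sum(weight for (a, b), weight in mgt_pairs.items()
               if a in words and b in words)

def classify(text, mgt_positive, mgt_negative):
    """Compare the text against both graphs and return the better-matching label."""
    pos = mgt_score(text, mgt_positive)
    neg = mgt_score(text, mgt_negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "undecided"  # tie handling is an assumption, not from the slides

# Toy graphs standing in for the mGTs learned in Step_1.
mgt_positive = {("great", "cast"): 0.6, ("great", "plot"): 0.5}
mgt_negative = {("boring", "plot"): 0.7, ("bad", "acting"): 0.4}

print(classify("great cast but boring", mgt_positive, mgt_negative))
print(classify("boring plot and bad acting", mgt_positive, mgt_negative))
```

Because classification is only a lookup against two precomputed graphs, it is cheap at prediction time; the cost sits in building the graphs.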

Sentiment Classification by the use of mGT

Experimental Results

Dataset: Movie Reviews

Approach                   Accuracy (%)
Support Vector Machine*    82.90
Naive Bayes*               81.50
Maximum Entropy*           81.00
mGT-LDA                    88.50

* [Bo Pang, 2002]

Experimental Results

Dataset: Real Tweets related to Politics

Training Set: 3980 Tweets

Test Set: 32185 Tweets

Approach       Accuracy (%)
mGT-LDA        87.10
SVM            79.20
Naive Bayes    76.60

Experimental Results

http://193.205.190.209/elezioni2013/

Experimental Results

[Plot; axis labels: days, accuracy]

Experimental Results

Masterchef - http://193.205.190.209/tvshow/masterchef/

Conclusions

Pros:
o Independent of language
o Fast classification
o Continuous upgrade
o Small training set

Cons:
o In general, the mGT building process takes a long time
o An annotated lexicon is needed

Future Works

To improve the classification by continuously updating the training set

To introduce SentiWordNet as the annotated lexicon

To adopt an ontological formalism for a better representation of the mGT

To build a larger dataset of tweets

Any Questions?

Don’t forget to tweet your sentiment!!!
