
Sentiment Detection

Naveen Sharma (02005010)
Prateek Choudhary (02005016)
Yashpal Meena (02005030)

Under the guidance of
Prof. Pushpak Bhattacharyya

Outline

Problem Statement

Challenges

Earlier Work and Traditional Approaches

Recent Advances

Conclusion/Future Directions

Sentiment Analysis

What is Sentiment Analysis?
– Determining the overall polarity of a given document

Polarity:
– Positive
– Negative
– Mixed
– Neutral

Motivation

Individual
– Movie reviews on the web (thumbs up or thumbs down)

Commercial
– Feedback/evaluation forms
– Opinions about a product
– Recognizing and discarding "flames" on newsgroups

Political
– Opinions on government policies, e.g. the Iraq War, taxation

Sentiment Analysis

A type of text classification. Other types of text classification:
– Author-based classification
– Topic categorization

Sentiment analysis and topic categorization:
– Topics: the subject matter
– Sentiments: the opinion towards the subject matter

Challenges

Reference to multiple objects in the same document
– The NR70 is trendy. T-Series is fast becoming obsolete.

Dependence on the context of the document
– "Unpredictable" plot; "unpredictable" performance

Negations have to be captured
– Monochrome display is not what the user wants
– It is not like the movie is a total waste of time

Challenges (contd.)

Metaphors/Similes
– The metallic body is solid as a rock

Part-of and Attribute-of relationships
– The small keypad is inconvenient

Subtle Expression
– How can someone sit through this movie?

Earlier Work (First Approaches)

Naive Bayes

Maximum Entropy

Support Vector Machines

Naïve Bayes

What is a Naïve Bayesian classifier?

Difficulty
– More than a few variables

How to overcome this difficulty
– Assume independence of the variables

Naïve Bayes (contd.)

Let $\{f_1, f_2, \ldots, f_m\}$ be a set of predefined features.
– Features can be representative words or word patterns.

Each document $d$ is represented by the document vector
$$\vec{d} = (n_1(d), \ldots, n_m(d))$$
where $n_i(d)$ is the number of times feature $f_i$ occurs in $d$.

Assign a document $d$ to the class
$$c^* = \arg\max_c P(c \mid d)$$
where, by Bayes' rule,
$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$$
$P(d)$ plays no role in selecting $c^*$.

Naïve Bayes (contd.)

Assuming the $f_i$'s are conditionally independent given the class, the Naïve Bayes estimate decomposes as
$$P_{NB}(c \mid d) = \frac{P(c)\,\prod_{i=1}^{m} P(f_i \mid c)^{\,n_i(d)}}{P(d)}$$

Advantages:
– Simple
– Performs well
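A minimal sketch of such a bag-of-features Naïve Bayes sentiment classifier in Python, for illustration only (the whitespace tokenization, Laplace smoothing, and toy training data are assumptions, not the setup used in the cited work):

```python
from collections import Counter
import math

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over word-count features, following
    P_NB(c|d) proportional to P(c) * prod_i P(f_i|c)^n_i(d)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # Laplace smoothing constant (an assumption)
        self.class_priors = {}
        self.feature_counts = {}      # class -> Counter of feature occurrences
        self.class_totals = {}
        self.vocab = set()

    def train(self, documents, labels):
        label_counts = Counter(labels)
        self.class_priors = {c: n / len(labels) for c, n in label_counts.items()}
        self.feature_counts = {c: Counter() for c in label_counts}
        for doc, c in zip(documents, labels):
            tokens = doc.lower().split()
            self.feature_counts[c].update(tokens)
            self.vocab.update(tokens)
        self.class_totals = {c: sum(cnt.values()) for c, cnt in self.feature_counts.items()}

    def classify(self, document):
        scores = {}
        for c in self.class_priors:
            # log P(c) + sum_i n_i(d) * log P(f_i|c); P(d) is ignored, as above
            score = math.log(self.class_priors[c])
            for f, n in Counter(document.lower().split()).items():
                p = (self.feature_counts[c][f] + self.alpha) / \
                    (self.class_totals[c] + self.alpha * len(self.vocab))
                score += n * math.log(p)
            scores[c] = score
        return max(scores, key=scores.get)

# toy usage
nb = NaiveBayesSentiment()
nb.train(["a great and enjoyable movie", "boring plot and terrible acting"],
         ["positive", "negative"])
print(nb.classify("great acting and an enjoyable plot"))
```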

Recent Advances

An unsupervised learning algorithm (Turney, 2002).

Extract phrases from the review based on patterns of part-of-speech tags (a sketch of this extraction step follows the table below).

JJ = adjective, NN = noun, NNS = plural noun

E.g., extracting two-word patterns:

First word    Second word     Third word (not extracted)
JJ            NN or NNS       Anything
JJ            JJ              Not NN nor NNS
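A minimal sketch of the two-word pattern extraction, assuming NLTK's default tokenizer and POS tagger as stand-ins (Turney's full pattern table has more rows than the two shown above):

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are installed

# (first-word tags, second-word tags, tags the third word must NOT have)
PATTERNS = [
    ({"JJ"}, {"NN", "NNS"}, set()),    # JJ + NN/NNS, third word: anything
    ({"JJ"}, {"JJ"}, {"NN", "NNS"}),   # JJ + JJ, third word: not NN nor NNS
]

def extract_phrases(review):
    """Return two-word phrases whose POS tags match one of the patterns above."""
    tagged = nltk.pos_tag(nltk.word_tokenize(review))
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        for first, second, third_not in PATTERNS:
            if t1 in first and t2 in second and t3 not in third_not:
                phrases.append(f"{w1} {w2}")
    return phrases

print(extract_phrases("The camera has a sharp display and very responsive controls."))
```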

Unsupervised Learning (contd.)

Estimate the Semantic Orientation of the extracted phrases.

PMI (Pointwise Mutual Information) is used as the strength of semantic association:
$$\mathrm{PMI}(word_1, word_2) = \log_2\!\left[\frac{p(word_1\ \&\ word_2)}{p(word_1)\,p(word_2)}\right]$$

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{"excellent"}) - \mathrm{PMI}(phrase, \text{"poor"})$$

Unsupervised Learning (contd.)

Determine the Semantic Orientation (SO) of the phrases.

Search on AltaVista (using the NEAR operator) and estimate SO from hit counts:
$$\mathrm{SO}(phrase) = \log_2\frac{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"excellent"})\cdot \mathrm{hits}(\text{"poor"})}{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"poor"})\cdot \mathrm{hits}(\text{"excellent"})}$$

Unsupervised Learning (contd.)

Calculate the average semantic orientation of the phrases in the given review and classify the review as "recommended" if the average is positive, and as "not recommended" otherwise.
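A minimal sketch of the scoring and classification steps, with a hypothetical hit_count(query) helper standing in for the original AltaVista NEAR queries (any corpus or search API that supports proximity counts could play that role):

```python
import math

def hit_count(query):
    """Hypothetical stand-in for AltaVista hit counts of a NEAR query;
    in practice this would query a search API or count co-occurrences
    within a fixed window over a large corpus."""
    raise NotImplementedError

def semantic_orientation(phrase):
    # log2 of the odds that the phrase co-occurs with "excellent" rather than "poor";
    # the small constant guards against zero hit counts (an assumption)
    num = hit_count(f'{phrase} NEAR "excellent"') * hit_count('"poor"')
    den = hit_count(f'{phrase} NEAR "poor"') * hit_count('"excellent"')
    return math.log2((num + 0.01) / (den + 0.01))

def classify_review(phrases):
    """Average the SO of the extracted phrases; positive average => recommended."""
    average = sum(semantic_orientation(p) for p in phrases) / len(phrases)
    return "recommended" if average > 0 else "not recommended"
```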

Recent Advances (contd.)

Subjectivity and min-cuts approach by Pang and Lee:
– Step 1: labeling sentences as subjective or objective.
– Step 2: applying a standard machine learning classifier to the subjective extract.

Min-cut approach (contd.)

Formalization: suppose we have n items $x_1, \ldots, x_n$ to divide into classes $C_1$ and $C_2$.

We need two types of scores:
– Individual scores $\mathrm{ind}_j(x_i)$: an estimate of each $x_i$'s preference for class $C_j$
– Association scores $\mathrm{assoc}(x_i, x_k)$: an estimate of the importance of $x_i$ and $x_k$ being in the same class

Min-cut approach (contd.)

Maximize individual preference.
Penalize tightly associated items in different classes.

Optimization problem, with cost
$$\mathrm{cost}(C_1, C_2) = \sum_{x \in C_1} \mathrm{ind}_2(x) + \sum_{x \in C_2} \mathrm{ind}_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} \mathrm{assoc}(x_i, x_k)$$

Build an undirected graph G with vertices $\{v_1, \ldots, v_n, s, t\}$:
– edge $(s, v_i)$ with weight $\mathrm{ind}_1(x_i)$

Min-cut approach (contd.)

– edge $(v_i, t)$ with weight $\mathrm{ind}_2(x_i)$
– edge $(v_i, v_k)$ with weight $\mathrm{assoc}(x_i, x_k)$

The classification problem now reduces to finding minimum cuts in the graph.
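A minimal sketch of this graph construction, using the networkx library's minimum_cut as a stand-in for the exact max-flow routines used by Pang and Lee (the library choice and the toy scores are assumptions):

```python
import networkx as nx  # assumes the networkx package is available

def min_cut_partition(ind1, ind2, assoc):
    """Split items into C1 (source side) and C2 (sink side) via a minimum s-t cut.

    ind1[i], ind2[i] : item i's individual preference for class C1 / C2
    assoc[(i, k)]    : association weight between items i and k
    """
    G = nx.Graph()
    for i in range(len(ind1)):
        G.add_edge("s", i, capacity=ind1[i])   # edge (s, v_i), weight ind_1(x_i)
        G.add_edge(i, "t", capacity=ind2[i])   # edge (v_i, t), weight ind_2(x_i)
    for (i, k), w in assoc.items():
        G.add_edge(i, k, capacity=w)           # edge (v_i, v_k), weight assoc(x_i, x_k)
    _, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    C1 = sorted(x for x in source_side if x != "s")
    C2 = sorted(x for x in sink_side if x != "t")
    return C1, C2

# toy usage: item 0 prefers C1, item 2 prefers C2,
# item 1 is undecided on its own but strongly associated with item 0
print(min_cut_partition(ind1=[0.9, 0.5, 0.1],
                        ind2=[0.1, 0.5, 0.9],
                        assoc={(0, 1): 1.0, (1, 2): 0.1}))
# expected: ([0, 1], [2])
```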

Min-cut approach (contd.)

Advantages/Analysis:
– Different algorithms
– Maximum-flow algorithms
– N most subjective sentences
– Last N sentences
– Most subjective N sentences

Recent Advances

Using linguistic knowledge and WordNet synonymy graphs (Agarwal and Bhattacharyya)

On movie reviews.

Bag-of-words features.

Strength of an adjective $w$:
$$\mathrm{EVA}(w) = \frac{d(w, \mathrm{bad}) - d(w, \mathrm{good})}{d(\mathrm{good}, \mathrm{bad})}$$
where $d(\cdot,\cdot)$ is the distance between two words in the WordNet synonymy graph.
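A sketch of one way such a distance-based adjective score could be computed with NLTK's WordNet interface; the link types used to build the graph (synonyms, "similar to", antonyms) and the fallback for unreachable words are assumptions, and the paper's exact distance definition may differ:

```python
from collections import deque
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is downloaded

def neighbours(word):
    """Words one hop from `word` in a simple WordNet graph built from
    synonym, 'similar to', and antonym links (the link set is an assumption)."""
    out = set()
    for syn in wn.synsets(word):
        for s in [syn] + syn.similar_tos():
            out.update(l.name().replace("_", " ") for l in s.lemmas())
        for lemma in syn.lemmas():
            out.update(a.name().replace("_", " ") for a in lemma.antonyms())
    out.discard(word)
    return out

def graph_distance(src, dst, max_depth=6):
    """Breadth-first shortest-path length between two words; None if unreachable."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        word, depth = frontier.popleft()
        if word == dst:
            return depth
        if depth < max_depth:
            for nxt in neighbours(word) - seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

def eva(word):
    """EVA(w) = (d(w, bad) - d(w, good)) / d(good, bad)."""
    d_bad, d_good = graph_distance(word, "bad"), graph_distance(word, "good")
    d_ref = graph_distance("good", "bad")
    if None in (d_bad, d_good, d_ref) or d_ref == 0:
        return 0.0   # neutral score for unreachable words (an assumption)
    return (d_bad - d_good) / d_ref

print(eva("excellent"), eva("terrible"))
```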

WordNet Approach (contd.)

"About" and "of" sentences:
– About the movie (the review)
– What is in the movie

Two kinds of weights:
– Individual weights: probability estimates from an SVM classifier
– Mutual weights: the tendency of two documents to fall in the same category

Physical separation:
– Paragraph boundaries

Contextual similarity:
– Total adjective strength
– Scaling and a distance measure

WordNet Approach (contd.)

Minimum-cut algorithm similar to Pang and Lee.

Mutual Similarity Coefficient (MSC), where $f_k$ is the k-th feature and $F_i(f_k) = 1$ if the k-th feature is present in document $d_i$, and $0$ otherwise:
$$\mathrm{MSC}(d_i, d_j) = \frac{\sum_k F_i(f_k)\,F_j(f_k) - s_{\min}}{s_{\max} - s_{\min}}$$
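A minimal sketch of this scaled feature-overlap score; the toy vocabulary and documents, and the choice to take s_min and s_max over all document pairs, are assumptions for illustration:

```python
from itertools import combinations

def feature_vector(doc, vocabulary):
    """F_i: 1 if a feature (here, a word) is present in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if f in words else 0 for f in vocabulary]

def mutual_similarity(docs, vocabulary):
    """MSC(d_i, d_j): feature-overlap counts, min-max scaled across all pairs."""
    F = [feature_vector(d, vocabulary) for d in docs]
    overlap = {(i, j): sum(a * b for a, b in zip(F[i], F[j]))
               for i, j in combinations(range(len(docs)), 2)}
    s_min, s_max = min(overlap.values()), max(overlap.values())
    span = (s_max - s_min) or 1   # avoid division by zero if all overlaps are equal
    return {pair: (s - s_min) / span for pair, s in overlap.items()}

docs = ["a gripping and well acted film",
        "well acted but a predictable plot",
        "the soundtrack was dull and predictable"]
vocab = ["gripping", "well", "acted", "predictable", "plot", "dull", "soundtrack"]
print(mutual_similarity(docs, vocab))   # highest score for the two "well acted" reviews
```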

WordNet Approach (contd.)

An SVM is trained to give $\Pr_{\mathrm{good}}$ and $\Pr_{\mathrm{bad}}$ for each document.

The SVM probabilities and the MSC values form the weight matrix.

The min-cut approach is then applied.

WordNet Approach (contd.)

Analysis:
– Exploits mutual relationships between documents
– The graph-cut technique is simple and powerful
– Decline in accuracy with subjectivity extracts
– WordNet is a useful lexical resource

Conclusion/Future Directions

Practical utility

Harder than other text classification tasks

Traditional machine learning techniques don't perform that well

Linguistic knowledge needs to be used (e.g. WordNet)

Subjectivity extracts and mutual dependencies

Conclusion/Future Directions

Better measures to incorporate linguistic knowledge

Better measures for the degree of similarity

Formulation as a multiclass problem
– E.g. emotional icons in messengers
– May be helpful in building psychological profiles from newsgroup mails

References

Alekh Agarwal and Pushpak Bhattacharyya, Sentiment Analysis: A New Approach for Effective Use of Linguistic Knowledge and Exploiting Similarities in a Set of Documents to be Classified, International Conference on Natural Language Processing (ICON 05), IIT Kanpur, India, December 2005.

Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL, 2004.

Bo Pang, Lillian Lee and Shivakumar Vaithyanathan, Thumbs Up? Sentiment Classification Using Machine Learning Techniques, Proceedings of EMNLP 2002, pp. 79-86.

Peter Turney, Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, Proceedings of ACL, 2002.

Thank You