Sentiment Analysis - Stanford University · 2018-10-25 · Dan Jurafsky
TRANSCRIPT
Sentiment Analysis
What is Sentiment Analysis?

Dan Jurafsky
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Dan Jurafsky
Google Shopping aspects
https://www.google.com/shopping/product/7914298775914872081
Dan Jurafsky
Twitter sentiment versus Gallup Poll of Consumer Confidence
Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. ICWSM-2010.
Dan Jurafsky
Twitter sentiment:
Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2:1, 1-8. 10.1016/j.jocs.2010.12.007.
Dan Jurafsky
Target Sentiment on Twitter
• Twitter Sentiment App
• Alec Go, Richa Bhayani, Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision.
Dan Jurafsky
Sentiment analysis has many other names
• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis
Dan Jurafsky
Why sentiment analysis?
• Movies: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
Dan Jurafsky
Scherer Typology of Affective States
• Emotion: brief, organically synchronized … evaluation of a major event
  • angry, sad, joyful, fearful, ashamed, proud, elated
• Mood: diffuse, non-caused, low-intensity, long-duration change in subjective feeling
  • cheerful, gloomy, irritable, listless, depressed, buoyant
• Interpersonal stances: affective stance toward another person in a specific interaction
  • friendly, flirtatious, distant, cold, warm, supportive, contemptuous
• Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
  • liking, loving, hating, valuing, desiring
• Personality traits: stable personality dispositions and typical behavior tendencies
  • nervous, anxious, reckless, morose, hostile, jealous
Dan Jurafsky
Sentiment Analysis
• Sentiment analysis is the detection of attitudes: "enduring, affectively colored beliefs, dispositions towards objects or persons"
  1. Holder (source) of the attitude
  2. Target (aspect) of the attitude
  3. Type of attitude
     • From a set of types: like, love, hate, value, desire, etc.
     • Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength
  4. Text containing the attitude: a sentence or an entire document
Dan Jurafsky
Sentiment Analysis
• Simplest task:
  • Is the attitude of this text positive or negative?
• More complex:
  • Rank the attitude of this text from 1 to 5
• Advanced:
  • Detect the target (stance detection)
  • Detect the source
  • Complex attitude types
Sentiment Analysis
What is Sentiment Analysis?

Sentiment Analysis
A Baseline Algorithm
Dan Jurafsky
Sentiment Classification in Movie Reviews
• Polarity detection: is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0: http://www.cs.cornell.edu/people/pabo/movie-review-data

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.
Dan Jurafsky
IMDB data in the Pang and Lee database

✓ when _star wars_ came out some twenty years ago, the image of traveling throughout the stars has become a commonplace image. […] when han solo goes light speed, the stars change to bright lines, going towards the viewer in lines that converge at an invisible point. cool. _october sky_ offers a much simpler image - that of a single white dot, traveling horizontally across the night sky. [...]

✗ "snake eyes" is the most aggravating kind of movie: the kind that shows so much potential then becomes unbelievably disappointing. it's not just because this is a brian depalma film, and since he's a great director and one who's films are always greeted with at least some fanfare. and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.
Dan Jurafsky
Baseline Algorithm (adapted from Pang and Lee)
• Tokenization
• Feature extraction
• Classification using different classifiers
  • Naive Bayes
  • MaxEnt
  • SVM
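A minimal sketch of such a pipeline in Python, assuming scikit-learn is available; this is not Pang and Lee's exact system, and the tiny texts/labels toy data below are illustrative stand-ins for the Polarity 2.0 corpus.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical stand-in data (0 = negative, 1 = positive)
texts = ["unbelievably disappointing",
         "full of zany characters and richly applied satire , and some great plot twists",
         "this is the greatest screwball comedy ever filmed",
         "it was pathetic , the worst part about it was the boxing scenes"]
labels = [0, 1, 1, 0]

# tokenization + bag-of-words feature extraction + Naive Bayes classification
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["richly applied satire and great plot twists"]))  # -> [1]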
Dan Jurafsky
Sentiment Tokenization Issues
• Deal with HTML and XML markup
• Twitter mark-up (names, hashtags)
• Capitalization (preserve for words in all caps)
• Phone numbers, dates
• Emoticons
• Useful code:
  • Christopher Potts sentiment tokenizer
  • Brendan O'Connor twitter tokenizer

Potts emoticon pattern:
    [<>]?                          # optional hat/brow
    [:;=8]                         # eyes
    [\-o\*\']?                     # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
    |                              #### reverse orientation
    [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
    [\-o\*\']?                     # optional nose
    [:;=8]                         # eyes
    [<>]?                          # optional hat/brow
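For illustration, the pattern above can be used directly with Python's re module in verbose mode, which keeps the inline comments; the test string is made up.

import re

EMOTICON = re.compile(r"""
    [<>]?                          # optional hat/brow
    [:;=8]                         # eyes
    [\-o\*\']?                     # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
    |                              #### reverse orientation
    [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
    [\-o\*\']?                     # optional nose
    [:;=8]                         # eyes
    [<>]?                          # optional hat/brow
    """, re.VERBOSE)

print(EMOTICON.findall("great movie :) but the ending ... :-( d:"))
# -> [':)', ':-(', 'd:']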
Dan Jurafsky
Extracting Features for Sentiment Classification
• How to handle negation?
  • I didn't like this movie
  vs.
  • Don't dismiss this film
Dan Jurafsky
Negation
Add NOT_ to every word between the negation and the following punctuation:

  didn't like this movie , but I
  didn't NOT_like NOT_this NOT_movie , but I

Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
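A rough Python sketch of this NOT_ marking; the negation and punctuation patterns are simplifications, not the exact Das and Chen / Pang et al. implementation.

import re

NEGATION = re.compile(r"^(?:not|no|never|n't|.*n't)$", re.IGNORECASE)
PUNCT = re.compile(r"^[.,:;!?]$")

def mark_negation(tokens):
    """Prepend NOT_ to every token between a negation word and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False          # punctuation closes the negation scope
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):  # negation token opens the scope
                negated = True
    return out

print(mark_negation("didn't like this movie , but I".split()))
# -> ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']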
Dan Jurafsky
Extracting Features for Sentiment Classification
• Which words to use?
  • Only adjectives
  • All words
• All words turns out to work better, at least on this data
Dan Jurafsky
Reminder: Naive Bayes

c_{NB} = \argmax_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)
Dan Jurafsky
Reminder: Naive Bayes
Let N_c be the number of documents with class c.
Let N_doc be the total number of documents.
positions ← all word positions in the test document

c_{NB} = \argmax_{c \in C} P(c) \prod_{i \in positions} P(w_i \mid c)    (6.9)

Naive Bayes calculations, like calculations for language modeling, are done in log space, to avoid underflow and increase speed. Thus Eq. 6.9 is generally instead expressed as

c_{NB} = \argmax_{c \in C} \left[ \log P(c) + \sum_{i \in positions} \log P(w_i \mid c) \right]    (6.10)

By considering features in log space, Eq. 6.10 computes the predicted class as a linear function of input features. Classifiers that use a linear combination of the inputs to make a classification decision (like naive Bayes and also logistic regression) are called linear classifiers.

6.2 Training the Naive Bayes Classifier

How can we learn the probabilities P(c) and P(f_i | c)? Let's first consider the maximum likelihood estimate. We'll simply use the frequencies in the data. For the document prior P(c) we ask what percentage of the documents in our training set are in each class c. Let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then:

\hat{P}(c) = \frac{N_c}{N_{doc}}    (6.12)

To learn the probability P(f_i | c), we'll assume a feature is just the existence of a word in the document's bag of words, and so we'll want P(w_i | c), which we compute as the fraction of times the word w_i appears among all words in all documents of topic c. We first concatenate all documents with category c into one big "category c" text. Then we use the frequency of w_i in this concatenated document to give a maximum likelihood estimate of the probability:

\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}    (6.13)

Here the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c.

There is a problem, however, with maximum likelihood training. Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but suppose there are no training documents that both contain the word "fantastic" and are classified as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative. In such a case the probability for this feature will be zero:
Dan Jurafsky
Reminder: Naive Bayes
• Likelihoods
• What about zeros? Suppose "fantastic" never occurs?
• Add-one smoothing
\hat{P}(\textrm{"fantastic"} \mid positive) = \frac{count(\textrm{"fantastic"}, positive)}{\sum_{w \in V} count(w, positive)} = 0    (6.14)

But since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in the likelihood term for any class will cause the probability of the class to be zero, no matter the other evidence!

The simplest solution is the add-one (Laplace) smoothing introduced in Chapter 4. While Laplace smoothing is usually replaced by more sophisticated smoothing algorithms in language modeling, it is commonly used in naive Bayes text categorization:

\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}    (6.15)

Note once again that it is crucial that the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c (try to convince yourself why this must be true; see the exercise at the end of the chapter).

What do we do about words that occur in our test data but are not in our vocabulary at all because they did not occur in any training document in any class? The standard solution for such unknown words is to ignore them: remove them from the test document and not include any probability for them at all.

Finally, some systems choose to completely ignore another class of words: stop words, very frequent words like "the" and "a". This can be done by sorting the vocabulary by frequency in the training set and defining the top 10-100 vocabulary entries as stop words, or alternatively by using one of the many predefined stop word lists available online. Then every instance of these stop words is simply removed from both training and test documents as if they had never occurred. In most text classification applications, however, using a stop word list doesn't improve performance, and so it is more common to make use of the entire vocabulary and not use a stop word list.

Fig. 6.2 shows the final algorithm.

6.3 Worked example

Let's walk through an example of training and testing naive Bayes with add-one smoothing. We'll use a sentiment analysis domain with the two classes positive (+) and negative (-), and take the following miniature training and test documents simplified from actual movie reviews.

           Cat  Documents
  Training  -   just plain boring
            -   entirely predictable and lacks energy
            -   no surprises and very few laughs
            +   very powerful
            +   the most fun film of the summer
  Test      ?   predictable with no fun

The prior P(c) for the two classes is computed via Eq. 6.12 as N_c / N_doc:
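The worked example above can be reproduced in a few lines of Python; this is a sketch of standard multinomial naive Bayes with add-one smoothing (Eqs. 6.10-6.15), not the textbook's reference implementation.

import math
from collections import Counter

train = [("-", "just plain boring"),
         ("-", "entirely predictable and lacks energy"),
         ("-", "no surprises and very few laughs"),
         ("+", "very powerful"),
         ("+", "the most fun film of the summer")]

counts = {c: Counter() for c in ("+", "-")}
ndoc = {c: 0 for c in ("+", "-")}
for c, doc in train:
    ndoc[c] += 1
    counts[c].update(doc.split())

vocab = set(w for c in counts for w in counts[c])

def log_score(doc, c):
    logp = math.log(ndoc[c] / len(train))          # log prior, Eq. 6.12
    for w in doc.split():
        if w not in vocab:                         # ignore unknown words
            continue
        num = counts[c][w] + 1                     # add-one smoothing, Eq. 6.15
        den = sum(counts[c].values()) + len(vocab)
        logp += math.log(num / den)
    return logp

test = "predictable with no fun"
print({c: log_score(test, c) for c in ("+", "-")})  # "-" wins, matching the textbook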
Dan Jurafsky
Binarized (Boolean feature) Multinomial Naive Bayes
• Intuition:
  • For sentiment (and probably for other text classification domains), word occurrence may matter more than word frequency.
  • The occurrence of the word "fantastic" tells us a lot; the fact that it occurs 5 times may not tell us much more.
• "Binary Naive Bayes": clip all the word counts in each document at 1
Dan Jurafsky
Boolean Multinomial Naive Bayes: Learning
• From the training corpus, extract the Vocabulary
• Calculate the P(c_j) terms
  • For each c_j in C do
    docs_j ← all docs with class = c_j
    P(c_j) ← |docs_j| / |total # documents|
• Calculate the P(w_k | c_j) terms
  • Remove duplicates in each doc: for each word type w in doc_j, retain only a single instance of w
  • Text_j ← single doc containing all of docs_j
  • For each word w_k in Vocabulary
    n_k ← # of occurrences of w_k in Text_j
    P(w_k | c_j) ← (n_k + α) / (n + α·|Vocabulary|)
Dan Jurafsky
Boolean Multinomial Naive Bayes (Binary NB) on a test document d
• First remove all duplicate words from d
• Then compute NB using the same equation:

c_{NB} = \argmax_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)
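A small sketch of the binarization (duplicate-removal) step in Python; binarize is a hypothetical helper name, not from the lecture.

def binarize(doc_tokens):
    """Keep only the first occurrence of each word, clipping per-document counts at 1."""
    seen, out = set(), []
    for tok in doc_tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out

print(binarize("it was pathetic the worst part was the boxing scenes".split()))
# -> ['it', 'was', 'pathetic', 'the', 'worst', 'part', 'boxing', 'scenes']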
Dan Jurafsky
Normal vs. Binary NB
P(+)P(S \mid +) = \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^3} = 3.2 \times 10^{-5}

The model thus predicts the class negative for the test sentence.

6.4 Optimizing for Sentiment Analysis

While standard naive Bayes text classification can work well for sentiment analysis, some small changes are generally employed that improve performance.

First, for sentiment classification and a number of other text classification tasks, whether a word occurs or not seems to matter more than its frequency. Thus it often improves performance to clip the word counts in each document at 1. This variant is called binary multinomial naive Bayes or binary NB. The variant uses the same Eq. 6.10 except that for each document we remove all duplicate words before concatenating them into the single big document. Fig. 6.3 shows an example in which a set of four documents (shortened and text-normalized for this example) are remapped to binary, with the modified counts shown in the table on the right. The example is worked without add-1 smoothing to make the differences clearer. Note that the resulting counts need not be 1; the word "great" has a count of 2 even for binary NB, because it appears in multiple documents.

Four original documents:
  - it was pathetic the worst part was the boxing scenes
  - no plot twists or great scenes
  + and satire and great plot twists
  + great scenes great film

After per-document binarization:
  - it was pathetic the worst part boxing scenes
  - no plot twists or great scenes
  + and satire great plot twists
  + great scenes film

              NB Counts   Binary Counts
              +    -       +    -
  and         2    0       1    0
  boxing      0    1       0    1
  film        1    0       1    0
  great       3    1       2    1
  it          0    1       0    1
  no          0    1       0    1
  or          0    1       0    1
  part        0    1       0    1
  pathetic    0    1       0    1
  plot        1    1       1    1
  satire      1    0       1    0
  scenes      1    2       1    2
  the         0    2       0    1
  twists      1    1       1    1
  was         0    2       0    1
  worst       0    1       0    1

Figure 6.3 An example of binarization for the binary naive Bayes algorithm.

A second important addition commonly made when doing text classification for sentiment is to deal with negation. Consider the difference between "I really like this movie" (positive) and "I didn't like this movie" (negative). The negation expressed by "didn't" completely alters the inferences we draw from the predicate "like". Similarly, negation can modify a negative word to produce a positive review ("don't dismiss this film", "doesn't let us get bored").

A very simple baseline that is commonly used in sentiment to deal with negation is, during text normalization, to prepend the prefix NOT to every word after a token of logical negation (n't, not, no, never) until the next punctuation mark. Thus the phrase
Dan Jurafsky
Binary NB
• Binary works better than full word counts for sentiment classification

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Wang, Sida, and Christopher D. Manning. 2012. "Baselines and bigrams: Simple, good sentiment and topic classification." Proceedings of ACL, 90-94.
Dan Jurafsky
Cross-Validation
• Break up the data into 5 folds
  • (Equal positive and negative inside each fold?)
• For each fold
  • Choose that fold as a temporary test set
  • Train on the other 4 folds, compute performance on the test fold
• Report the average performance of the 5 runs

[Figure: 5 iterations; in each iteration a different fold serves as the test set and the remaining 4 folds as training data.]
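A sketch of this 5-fold setup assuming scikit-learn; the small stand-in corpus below is made up (in practice it would be the 2000 labeled Pang and Lee reviews), and StratifiedKFold keeps the positive/negative proportions equal across folds.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great plot and a wonderful cast", "pathetic and boring",
         "a terrific , moving film", "unbelievably disappointing",
         "the greatest comedy ever filmed", "the worst movie of the year",
         "richly applied satire", "lacks energy and surprises",
         "very powerful and fun", "plain boring"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# 5 folds, each with the same positive/negative ratio as the whole set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv)
print(scores, scores.mean())   # per-fold accuracy and the averaged result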
Dan Jurafsky
Other issues in Classification
• Logistic Regression and SVMs tend to do better than Naive Bayes
Dan Jurafsky
Problems: What makes reviews hard to classify?
• Subtlety:
  • Perfume review in Perfumes: the Guide:
    "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut."
  • Dorothy Parker on Katharine Hepburn:
    "She runs the gamut of emotions from A to B"
Dan Jurafsky
Thwarted Expectations and Ordering Effects
• "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up."
• "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised."
Sentiment Analysis
A Baseline Algorithm

Sentiment Analysis
Sentiment Lexicons
Dan Jurafsky
The General Inquirer
• Homepage: http://www.wjh.harvard.edu/~inquirer
• List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
• Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
• Categories:
  • Positiv (1915 words) and Negativ (2291 words)
  • Strong vs. Weak, Active vs. Passive, Overstated vs. Understated
  • Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
• Free for research use

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Dan Jurafsky
LIWC (Linguistic Inquiry and Word Count)
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX.
• Homepage: http://www.liwc.net/
• 2300 words, >70 classes
• Affective Processes
  • negative emotion (bad, weird, hate, problem, tough)
  • positive emotion (love, nice, sweet)
• Cognitive Processes
  • Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
• Pronouns, Negation (no, never), Quantifiers (few, many)
• $30 or $90 fee
Dan Jurafsky
MPQA Subjectivity Cues Lexicon
• Homepage: http://mpqa.cs.pitt.edu/lexicons/
• 6885 words from 8221 lemmas
  • 2718 positive
  • 4912 negative
• Each word annotated for intensity (strong, weak)
• GNU GPL

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
Dan Jurafsky
Bing Liu Opinion Lexicon
• Bing Liu's Page on Opinion Mining
• http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
• 6786 words
  • 2006 positive
  • 4783 negative

Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004.
Sentiment Analysis
Sentiment Lexicons

Sentiment Analysis
Learning Sentiment Lexicons
Dan Jurafsky
Semi-supervised learning of lexicons
• What to do for domains where you don't have a lexicon?
• Learn a lexicon!
• Use a small amount of information
  • A few labeled examples
  • A few hand-built patterns
• to bootstrap a lexicon
Dan Jurafsky
Semi-supervised learning of lexicons

The General Inquirer is a freely available web resource with lexicons of 1915 positive words and 2291 negative words (and also includes other lexicons we'll discuss in the next section).

The MPQA Subjectivity lexicon (Wilson et al., 2005) has 2718 positive and 4912 negative words drawn from a combination of sources, including the General Inquirer lists, the output of the Hatzivassiloglou and McKeown (1997) system described below, and a bootstrapped list of subjective words and phrases (Riloff and Wiebe, 2003) that was then hand-labeled for sentiment. Each phrase in the lexicon is also labeled for reliability (strongly subjective or weakly subjective). The polarity lexicon of Hu and Liu (2004) gives 2006 positive and 4783 negative words, drawn from product reviews, labeled using a bootstrapping method from WordNet described in the next section.

Positive: admire, amazing, assure, celebration, charm, eager, enthusiastic, excellent, fancy, fantastic, frolic, graceful, happy, joy, luck, majesty, mercy, nice, patience, perfect, proud, rejoice, relief, respect, satisfactorily, sensational, super, terrific, thank, vivid, wise, wonderful, zest
Negative: abominable, anger, anxious, bad, catastrophe, cheap, complaint, condescending, deceit, defective, disappointment, embarrass, fake, fear, filthy, fool, guilt, hate, idiot, inflict, lazy, miserable, mourn, nervous, objection, pest, plot, reject, scream, silly, terrible, unfriendly, vile, wicked

Figure 18.2 Some samples of words with consistent sentiment across three sentiment lexicons: the General Inquirer (Stone et al., 1966), the MPQA Subjectivity lexicon (Wilson et al., 2005), and the polarity lexicon of Hu and Liu (2004).

18.2 Semi-supervised induction of sentiment lexicons

Some affective lexicons are built by having humans assign ratings to words; this was the technique for building the General Inquirer starting in the 1960s (Stone et al., 1966), and for modern lexicons based on crowd-sourcing to be described in Section 18.5.1. But one of the most powerful ways to learn lexicons is to use semi-supervised learning.

In this section we introduce three methods for semi-supervised learning that are important in sentiment lexicon extraction. The three methods all share the same intuitive algorithm, which is sketched in Fig. 18.3.

function BuildSentimentLexicon(posseeds, negseeds) returns poslex, neglex
  poslex ← posseeds
  neglex ← negseeds
  until done:
    poslex ← poslex + FindSimilarWords(poslex)
    neglex ← neglex + FindSimilarWords(neglex)
  poslex, neglex ← PostProcess(poslex, neglex)

Figure 18.3 Schematic for semi-supervised sentiment lexicon induction. Different algorithms differ in how words of similar polarity are found, in the stopping criterion, and in the post-processing.
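A runnable Python rendering of the Fig. 18.3 schematic; find_similar_words is a hypothetical plug-in (conjunction patterns, co-occurrence with seeds, or embedding neighbors in real systems), and the toy neighbor table exists only to exercise the loop.

def build_sentiment_lexicon(pos_seeds, neg_seeds, find_similar_words, n_rounds=3):
    poslex, neglex = set(pos_seeds), set(neg_seeds)
    for _ in range(n_rounds):                  # "until done": fixed rounds here
        poslex |= find_similar_words(poslex)
        neglex |= find_similar_words(neglex)
    overlap = poslex & neglex                  # post-process: drop words in both lists
    return poslex - overlap, neglex - overlap

# toy similarity function just to show the control flow
toy_neighbors = {"good": {"great", "nice"}, "great": {"excellent"},
                 "bad": {"poor", "terrible"}, "poor": {"awful"}}
sim = lambda words: set().union(*(toy_neighbors.get(w, set()) for w in words))
print(build_sentiment_lexicon({"good"}, {"bad"}, sim))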
Dan Jurafsky
Hatzivassiloglou and McKeown intuition for identifying word polarity
• Adjectives conjoined by "and" have the same polarity
  • Fair and legitimate, corrupt and brutal
  • *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by "but" do not
  • fair but brutal

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174-181.
Dan Jurafsky
Hatzivassiloglou & McKeown 1997, Step 1
• Label a seed set of 1336 adjectives (all >20 in a 21 million word WSJ corpus)
  • 657 positive
    • adequate central clever famous intelligent remarkable reputed sensitive slender thriving …
  • 679 negative
    • contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting …
Dan Jurafsky
Hatzivassiloglou & McKeown 1997, Step 2
• Expand the seed set to conjoined adjectives
  • nice, helpful
  • nice, classy
Dan Jurafsky
Hatzivassiloglou & McKeown 1997, Step 3
• A supervised classifier assigns a "polarity similarity" to each word pair, resulting in a graph:

[Figure: graph over the words classy, nice, helpful, fair, brutal, irrational, corrupt, with edges weighted by polarity similarity.]
Dan Jurafsky
Hatzivassiloglou & McKeown 1997, Step 4
• Clustering for partitioning the graph into two

[Figure: the same graph partitioned into a positive cluster (classy, nice, helpful, fair) and a negative cluster (brutal, irrational, corrupt).]
Dan Jurafsky
Output polarity lexicon
• Positive
  • bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty …
• Negative
  • ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful …
Dan Jurafsky
Turney Algorithm
1. Extract a phrasal lexicon from reviews
2. Learn the polarity of each phrase
3. Rate a review by the average polarity of its phrases

Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.
Dan Jurafsky
Extract two-word phrases with adjectives

  First Word   Second Word   Third Word (not extracted)
  Adj          Noun          anything
  Adverb       Adj           not noun
  Adj          Adj           not noun
  Noun         Adj           not noun
  Adverb       Verb          anything
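A sketch of extracting such two-word phrases from POS-tagged text, assuming NLTK and mapping the table's Adj/Adverb/Noun/Verb roughly onto Penn Treebank tags; Turney's original system used different tooling, and the example sentence is made up.

import nltk  # may require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

ADJ, ADV = {"JJ", "JJR", "JJS"}, {"RB", "RBR", "RBS"}
NOUN, VERB = {"NN", "NNS"}, {"VB", "VBD", "VBN", "VBG"}

def extract_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    phrases = []
    # look at each consecutive word pair plus the tag of the following word
    for (w1, t1), (w2, t2), (_, t3) in zip(tagged, tagged[1:], tagged[2:] + [("", "")]):
        ok = ((t1 in ADJ and t2 in NOUN) or
              (t1 in ADV and t2 in ADJ and t3 not in NOUN) or
              (t1 in ADJ and t2 in ADJ and t3 not in NOUN) or
              (t1 in NOUN and t2 in ADJ and t3 not in NOUN) or
              (t1 in ADV and t2 in VERB))
        if ok:
            phrases.append(w1 + " " + w2)
    return phrases

print(extract_phrases("The branch is inconveniently located but the online experience is very handy ."))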
Dan Jurafsky
How to measure the polarity of a phrase?
• Positive phrases co-occur more with "excellent"
• Negative phrases co-occur more with "poor"
• But how to measure co-occurrence?
Dan Jurafsky
Pointwise Mutual Information
• Mutual information between two random variables X and Y:

I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)}

• Pointwise mutual information:
  • How much more do events x and y co-occur than if they were independent?

PMI(X, Y) = \log_2 \frac{P(x, y)}{P(x)P(y)}
Dan Jurafsky
Pointwise Mutual Information
• Pointwise mutual information:
  • How much more do events x and y co-occur than if they were independent?

PMI(X, Y) = \log_2 \frac{P(x, y)}{P(x)P(y)}

• PMI between two words:
  • How much more do two words co-occur than if they were independent?

PMI(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)P(word_2)}
Dan Jurafsky
How to Estimate Pointwise Mutual Information
• Query a search engine:
  • P(word) estimated by hits(word)/N
  • P(word_1, word_2) by hits(word_1 NEAR word_2)/N
• (Caveat: more correctly the bigram denominator should be kN, because there are a total of N consecutive bigrams (word_1, word_2) but kN bigrams that are k words apart; we just use N on the rest of this slide and the next.)

PMI(word_1, word_2) = \log_2 \frac{\frac{1}{N} hits(word_1\ NEAR\ word_2)}{\frac{1}{N} hits(word_1) \cdot \frac{1}{N} hits(word_2)}
Dan Jurafsky
Does the phrase appear more with "poor" or "excellent"?

Polarity(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")

= \log_2 \frac{\frac{1}{N} hits(phrase\ NEAR\ "excellent")}{\frac{1}{N} hits(phrase) \cdot \frac{1}{N} hits("excellent")} - \log_2 \frac{\frac{1}{N} hits(phrase\ NEAR\ "poor")}{\frac{1}{N} hits(phrase) \cdot \frac{1}{N} hits("poor")}

= \log_2 \left( \frac{hits(phrase\ NEAR\ "excellent")}{hits(phrase)\, hits("excellent")} \cdot \frac{hits(phrase)\, hits("poor")}{hits(phrase\ NEAR\ "poor")} \right)

= \log_2 \frac{hits(phrase\ NEAR\ "excellent")\, hits("poor")}{hits(phrase\ NEAR\ "poor")\, hits("excellent")}
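A sketch of the final simplified form in Python; hits is a hypothetical lookup standing in for search-engine hit counts, and the small additive constant is just to avoid taking the log of zero (Turney used a similar smoothing constant).

import math

def polarity(phrase, hits, smooth=0.01):
    """log2 odds of co-occurring with 'excellent' vs. 'poor' (smoothed)."""
    num = hits(phrase + " NEAR excellent") * hits("poor") + smooth
    den = hits(phrase + " NEAR poor") * hits("excellent") + smooth
    return math.log2(num / den)

# toy hit counts just to exercise the formula
toy = {"low fees NEAR excellent": 15, "low fees NEAR poor": 5,
       "excellent": 1000, "poor": 1200}
print(polarity("low fees", lambda q: toy.get(q, 0)))   # > 0, i.e. positive polarity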
Dan Jurafsky
Learned phrases (reviews of a bank)

  Phrase                   Polarity
  online experience          2.3
  very handy                 1.4
  low fees                   0.3
  inconveniently located    -1.5
  other problems            -2.8
  unethical practices       -8.5
Dan Jurafsky
Summary on Learning Lexicons
• Why:
  • Learn a lexicon that is specific to a domain
  • Learn a lexicon with more words (more robust) than off-the-shelf lexicons
• Intuition:
  • Start with a seed set of words ('good', 'poor')
  • Find other words that have similar polarity:
    • Using "and" and "but"
    • Using words that occur nearby in the same document
  • Add them to the lexicon
Dan Jurafsky
Modern versions of lexicon learning
(Roughly the same algorithm)
• Start with a seed set of words
• Expand to words that have "similar meaning"
• Measure similarity using embeddings like word2vec: deep-learning-based vector models of meaning
• We'll cover these in week 7, Vector Semantics!
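A heavily simplified sketch of the embedding-based variant, assuming gensim and one of its downloadable pretrained vector sets; the model name and the seed words are illustrative, not from the lecture.

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")        # pretrained word vectors (downloads on first use)
pos_seeds, neg_seeds = ["good", "excellent"], ["bad", "terrible"]

# expand each seed set with its nearest neighbors in embedding space
pos_expanded = [w for w, _ in vectors.most_similar(positive=pos_seeds, topn=10)]
neg_expanded = [w for w, _ in vectors.most_similar(positive=neg_seeds, topn=10)]
print(pos_expanded)
print(neg_expanded)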
Sentiment Analysis
Learning Sentiment Lexicons

Sentiment Analysis
Other Sentiment Tasks
Dan Jurafsky
Finding the sentiment of a sentence
• Important for finding aspects or attributes
  • Target of sentiment

  The food was great but the service was awful
Dan Jurafsky
Finding the aspect/attribute/target of sentiment
• Frequent phrases + rules
  • Find all highly frequent phrases across reviews ("fish tacos")
  • Filter by rules like "occurs right after a sentiment word"
    • "…great fish tacos" means fish tacos is a likely aspect

  Casino              casino, buffet, pool, resort, beds
  Children's Barber   haircut, job, experience, kids
  Greek Restaurant    food, wine, service, appetizer, lamb
  Department Store    selection, department, sales, shop, clothing

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD.
S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar. 2008. Building a Sentiment Summarizer for Local Service Reviews. WWW Workshop.
Dan Jurafsky
Finding the aspect/attribute/target of sentiment
• The aspect name may not be in the sentence
• For restaurants/hotels, aspects are well understood
• Supervised classification
  • Hand-label a small corpus of restaurant review sentences with aspect
    • food, décor, service, value, NONE
  • Train a classifier to assign an aspect to a sentence
    • "Given this sentence, is the aspect food, décor, service, value, or NONE?"
Dan Jurafsky
Putting it all together: Finding sentiment for aspects

[Figure: pipeline. Reviews → Text Extractor → Sentences & Phrases → Sentiment Classifier and Aspect Extractor → Aggregator → Final Summary.]

S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar. 2008. Building a Sentiment Summarizer for Local Service Reviews. WWW Workshop.
Dan Jurafsky
Results of the Blair-Goldensohn et al. method

Rooms (3/5 stars, 41 comments)
(+) The room was clean and everything worked fine - even the water pressure...
(+) We went because of the free room and was pleasantly pleased...
(-) …the worst hotel I had ever stayed at...

Service (3/5 stars, 31 comments)
(+) Upon checking out another couple was checking early due to a problem...
(+) Every single hotel staff member treated us great and answered every...
(-) The food is cold and the service gives new meaning to SLOW.

Dining (3/5 stars, 18 comments)
(+) our favorite place to stay in biloxi. the food is great also the service...
(+) Offer of free buffet for joining the Play...
Dan Jurafsky
Summary on Sentiment
• Generally modeled as a classification or regression task
  • predict a binary or ordinal label
• Features:
  • Negation is important
  • Using all words (in naive Bayes) works well for some tasks
  • Finding subsets of words may help in other tasks
    • Hand-built polarity lexicons
    • Use seeds and semi-supervised learning to induce lexicons
Sentiment Analysis
Extra
Dan Jurafsky
Analyzing the polarity of each word in IMDB
• How likely is each word to appear in each sentiment class?
• Count("bad") in 1-star, 2-star, 3-star, etc.
• But we can't use raw counts; instead, the likelihood:

P(w \mid c) = \frac{f(w, c)}{\sum_{w \in c} f(w, c)}

• Make them comparable between words with the scaled likelihood:

\frac{P(w \mid c)}{P(w)}

Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
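A small sketch of computing the scaled likelihood from per-class counts; the counts below are made-up numbers (not Potts's data), and P(w) is approximated here by averaging P(w|c) over the rating classes, i.e. assuming a uniform class prior.

# hypothetical word counts per rating class; "total" is the class's token count
counts = {1:  {"bad": 30, "good": 5,  "total": 1000},
          5:  {"bad": 4,  "good": 40, "total": 1000},
          10: {"bad": 1,  "good": 60, "total": 1000}}

def scaled_likelihood(w):
    pw_c = {c: counts[c][w] / counts[c]["total"] for c in counts}   # P(w|c)
    pw = sum(pw_c.values()) / len(pw_c)                             # P(w), uniform class prior
    return {c: pw_c[c] / pw for c in pw_c}                          # P(w|c) / P(w)

print(scaled_likelihood("bad"))    # highest for the 1-star class in this toy data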
Dan Jurafsky
Analyzing the polarity of each word in IMDB
Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
[Figure: "Potts diagrams" (Potts, Christopher. 2011. NSF workshop on restructuring adjectives). Scaled likelihood P(w|c)/P(w) plotted against rating (1-10) for positive scalars (good, great, excellent), negative scalars (disappointing, bad, terrible), emphatics (totally, absolutely, utterly), and attenuators (somewhat, fairly, pretty). The accompanying "Example: attenuators" panels show somewhat/r, fairly/r, and pretty/r across IMDB, OpenTable, Goodreads, and Amazon/Tripadvisor reviews, with per-corpus token counts and quadratic regression fits.]
Dan Jurafsky
Other sentiment feature: Logical negation
• Is logical negation (no, not) associated with negative sentiment?
• Potts experiment:
  • Count negation (not, n't, no, never) in online reviews
  • Regress against the review rating

Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
Dan Jurafsky
Potts 2011 Results: More negation in negative sentiment

[Figure: scaled likelihood P(w|c)/P(w) of negation tokens by rating category; negation is most frequent in the lowest-rated reviews.]