Sentiment Analysis Through the Use of Unsupervised Deep Learning (S7330). Monday 8th May 2017, GPU Technology Conference. A. Stephen McGough, Noura Al Moubayed, Durham University, UK

Page 1

Sentiment Analysis Through the Use of Unsupervised Deep Learning

S7330

Monday 8th May 2017, GPU Technology Conference

A. Stephen McGough, Noura Al Moubayed

Durham University, UK

Page 2

Intel Parallel Computing Centre

Page 3

Outline

• The Problem
• Unsupervised Deep Learning Model
• Text Processing for Topic Modelling
• Detecting Anomalies in Text
• Sentiment Classification

Page 4

The Problem

• “90% of all the data in the world has been generated over the last two years” … IBM
• “85% of worldwide data is held in un-structured formats” … Berry and Kogan
• How can we understand it? … or better still, make use of it?
• How can we determine the most pertinent information? … and then act on it?
• How can we find the needle if we are not sure what it looks like or what hay looks like?
• “Without labelling, you cannot train a machine with a new task” … IBM

Page 6

Deep Learning: Categorization and Regression

[Diagram: feed-forward network with an input vector, three hidden layers, and an output (True / False)]

But these require labelled data for training

Page 7

Anomaly Detection: Unsupervised Deep Learning

Stacked Denoising Autoencoder (SDA)
- Trained layer by layer
- Input is corrupted with noise

[Diagram: stacked layers of visible (v) and hidden (h) units. Labels: Input Data, Construct, Output Data, Reconstruct, Reconstructed Input Data]

Input data: We’re all going to the cinema on Saturday

Reconstructed input data: We’re all going to the theatre on Saturday

Low reconstruction error

Page 8

Anomaly Detection: Unsupervised Deep Learning

Stacked Denoising Autoencoder (SDA)
- Trained layer by layer
- Input is corrupted with noise

[Diagram: stacked layers of visible (v) and hidden (h) units. Labels: Input Data, Construct, Output Data, Reconstruct, Reconstructed Input Data]

Input data: Buy Viagra from our online meds store

Reconstructed input data: By the valley our overall medals store

High reconstruction error
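
The contrast between low and high reconstruction error on these two slides can be imitated on toy data. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it trains one denoising autoencoder layer in plain Python (an SDA would stack several such layers, training each on the hidden codes of the layer below). The class name `DenoisingAutoencoder`, the 4-dimensional toy vectors, the masking-noise rate, the layer sizes and the learning rate are all illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenoisingAutoencoder:
    """A single denoising layer (hypothetical helper, not the authors' code)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_visible)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b2 = [0.0] * n_visible

    def encode(self, v):
        return [sigmoid(sum(w * x for w, x in zip(row, v)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * x for w, x in zip(row, h)) + b)
                for row, b in zip(self.W2, self.b2)]

    def reconstruction_error(self, v):
        # Squared distance between the clean input and its reconstruction.
        r = self.decode(self.encode(v))
        return sum((vi - ri) ** 2 for vi, ri in zip(v, r))

    def train_step(self, v, lr=0.5, corruption=0.25):
        # Masking noise: each input is zeroed with probability `corruption`.
        noisy = [0.0 if self.rng.random() < corruption else x for x in v]
        h = self.encode(noisy)
        r = self.decode(h)
        # Squared-error gradient (up to a constant); the target is the CLEAN input.
        d_out = [(ri - vi) * ri * (1.0 - ri) for ri, vi in zip(r, v)]
        d_hid = [hj * (1.0 - hj) * sum(self.W2[i][j] * d_out[i] for i in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for i, d in enumerate(d_out):
            for j, hj in enumerate(h):
                self.W2[i][j] -= lr * d * hj
            self.b2[i] -= lr * d
        for j, d in enumerate(d_hid):
            for i, x in enumerate(noisy):
                self.W1[j][i] -= lr * d * x
            self.b1[j] -= lr * d


rng = random.Random(0)
prototype = [0.7, 0.2, 0.1, 0.0]  # toy "normal" feature vector (assumption)
normal = [[min(1.0, max(0.0, p + rng.uniform(-0.05, 0.05))) for p in prototype]
          for _ in range(50)]

dae = DenoisingAutoencoder(n_visible=4, n_hidden=3, rng=rng)
for _ in range(200):
    for v in normal:
        dae.train_step(v)

normal_err = sum(dae.reconstruction_error(v) for v in normal) / len(normal)
anomaly_err = dae.reconstruction_error([0.0, 0.1, 0.2, 0.7])  # unlike anything seen in training
print(f"mean normal error {normal_err:.4f}, anomaly error {anomaly_err:.4f}")
```

Because the network only ever sees vectors near the prototype, it learns to reproduce that region well and reconstructs the unfamiliar vector poorly, which is the signal used to flag an anomaly.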

Page 10

Probabilistic Topic Modelling

• Unsupervised analysis of text
• Too many documents to label manually
• Allows us to uncover automatically themes that are latent in a collection of documents
• Same words may have different meanings depending on their co-occurrence with other words in a document
• Statistically identify the topics in a set of documents
• Which topics appear in which documents
• Statistically identify words in topics
• Which words co-occur in a document containing topic X
• Document transformed to feature vector of topics

This report presents a proof of concept of our approach to solve anomaly detection problems using unsupervised deep learning. The work focuses on two specific models namely deep restricted Boltzmann machines and stacked denoising autoencoders. The approach is tested on two datasets: VAST Newsfeed Data and the Commission for Energy Regulation smart meter project dataset with text data and numeric data respectively. Topic modeling is used for features extraction from textual data. The results show high correlation between the output of the two modeling techniques. The outliers in energy data detected by the deep learning model show a clear pattern over the period of recorded data demonstrating the potential of this approach in anomaly detection within big data problems where there is little or no prior knowledge or labels. These results show the potential of using unsupervised deep learning methods to address anomaly detection problems. For example it could be used to detect suspicious money transactions and help with detection of terrorist funding activities or it could also be applied to the detection of potential criminal or terrorist activity using phone or digitil).

Topics
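
To make "document transformed to feature vector of topics" concrete, here is a hedged sketch. It assumes the word probabilities per topic are already known and estimates one document's topic proportions with a plain maximum-likelihood EM loop; this is a simplification swapped in for illustration, not full LDA inference and not necessarily what the authors used. The function name `topic_feature_vector`, the toy vocabulary and the two topics are hypothetical.

```python
def topic_feature_vector(word_ids, beta, n_iter=50):
    """Estimate topic proportions for one document with word-topic probabilities fixed.

    beta[k][w] is the probability of word w under topic k (assumed already learned).
    """
    n_topics = len(beta)
    theta = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        counts = [0.0] * n_topics
        for w in word_ids:
            # E-step: share of responsibility each topic takes for this word.
            joint = [theta[k] * beta[k][w] for k in range(n_topics)]
            total = sum(joint)  # assumes every word has non-zero probability in some topic
            for k in range(n_topics):
                counts[k] += joint[k] / total
        # M-step: proportions are the averaged responsibilities.
        theta = [c / len(word_ids) for c in counts]
    return theta


vocab = ["election", "government", "malware", "php"]  # toy vocabulary (assumption)
beta = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical "politics" topic
    [0.05, 0.05, 0.45, 0.45],  # hypothetical "security" topic
]
document = ["election", "government", "election", "government", "malware"]
word_ids = [vocab.index(w) for w in document]

features = topic_feature_vector(word_ids, beta)
print(features)  # one number per topic; this vector is what a downstream model would see
```

The resulting vector has one entry per topic and sums to one, so documents of any length map to inputs of the same fixed size.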

Page 11

Topic Modelling: Latent Dirichlet Allocation (LDA)

Topics

Randomly select topics

US Govt Data Shows Russia Used Outdated Ukrainian PHP Malware

The United States government earlier this year officially accused Russia of interfering with the US elections. Earlier this year on October 7th, the Department of Homeland Security and the Office of the Director of National Intelligence

Randomly select words from topics

Page 12

Topic Modelling: Latent Dirichlet Allocation (LDA)

• Each document contains a proportion (θ) of each topic
• Proportion may be zero
• The probability of each word appearing in a topic (β)
• LDA assumes documents were constructed via:
  Words in document = N
  Per-document topic proportions θ ∼ Dirichlet(α)
  For each of the N words w_n:
  ⎻ Choose a topic z_n ∼ Multinomial(θ)
  ⎻ Choose a word w_n from topic z_n (based on probabilities β)
• Need to determine α, β, θ given we have the documents
• Solve using variational Bayesian methods or Gibbs sampling

[Diagram: LDA plate model with nodes α, θ, z, β, w and plates labelled Topic, Document, Words in Document]
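
The generative story on this slide can be written out directly. The sketch below only runs the process forwards (sampling a document from given α and β); recovering α, β and θ from real documents is the hard inverse problem that variational Bayes or Gibbs sampling addresses, and it is not attempted here. The vocabulary, the two topics and the function names are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, beta, vocab, rng):
    """Sample one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topics, words = [], []
    for _ in range(n_words):
        # Choose a topic z_n according to theta.
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Choose a word w_n according to that topic's word probabilities.
        w = rng.choices(range(len(vocab)), weights=beta[z], k=1)[0]
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


rng = random.Random(7)
vocab = ["election", "government", "malware", "php"]  # toy vocabulary (assumption)
beta = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical "politics" topic
    [0.05, 0.05, 0.45, 0.45],  # hypothetical "security" topic
]
theta, topics, words = generate_document(20, alpha=[0.5, 0.5], beta=beta, vocab=vocab, rng=rng)
print(theta)
print(" ".join(words))
```

Small values of α (below 1) tend to produce documents dominated by a few topics, which matches the slide's remark that a topic's proportion may be close to zero.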

Page 14

Example: Anomaly Identification

SPAM and HAM in SMS

• Train LDA using the labelled data only, e.g. SPAM
• Unlabeled (HAM + SPAM) used to train SDA
• Auto-identification of SPAM from HAM in SMS messages
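
Once every message has a reconstruction error, separating the two groups comes down to choosing a cut-off. The slides do not say how that cut-off is chosen, so the rule below (median plus k times the median absolute deviation) is purely an assumption for illustration, as are the function name `flag_anomalies` and the example error values.

```python
import statistics


def flag_anomalies(errors, k=3.0):
    """Return indices whose reconstruction error exceeds median + k * MAD."""
    median = statistics.median(errors)
    mad = statistics.median(abs(e - median) for e in errors)
    threshold = median + k * mad
    return [i for i, e in enumerate(errors) if e > threshold]


# Hypothetical reconstruction errors for six messages; the fifth stands out.
errors = [0.02, 0.03, 0.01, 0.02, 0.90, 0.03]
flagged = flag_anomalies(errors)
print(flagged)
```

A median-based rule is used here because a single very large error would inflate a mean and standard deviation and could hide itself.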

Page 15

Example: Anomaly Identification

SPAM and HAM in SMS

Page 17

Multiple Labels Example: Deep Learning for Sentiment Classification

• Train an anomaly detector on each polarity of sentiment
• Can have many
• E.g. positive & negative view
• Generate overall reconstruction error (ORE) for each label
• Use simple classifier on OREs to identify class

[Diagram labels: (SDA), (LDA), (IDA)]
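
The per-label scheme can be sketched as follows. Two things here are assumptions rather than the authors' method: `MeanDetector` is a deliberately trivial stand-in for a trained SDA (it "reconstructs" every input as the mean of its training vectors), and the "simple classifier" is taken to be the rule of picking the label with the lowest ORE, since the slide does not specify one. The two-dimensional vectors are toy data.

```python
class MeanDetector:
    """Stand-in anomaly detector: reconstructs every input as its training mean.

    A trained SDA would take this role; the placeholder only keeps the sketch runnable.
    """

    def __init__(self, training_vectors):
        n = len(training_vectors)
        self.mean = [sum(column) / n for column in zip(*training_vectors)]

    def overall_reconstruction_error(self, v):
        return sum((vi - mi) ** 2 for vi, mi in zip(v, self.mean))


def classify(v, detectors):
    """Compute one ORE per label and return the label with the smallest one."""
    ores = {label: d.overall_reconstruction_error(v) for label, d in detectors.items()}
    return min(ores, key=ores.get), ores


# One detector per sentiment polarity, each trained only on its own label.
detectors = {
    "positive": MeanDetector([[0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]),
    "negative": MeanDetector([[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]),
}

label, ores = classify([0.75, 0.25], detectors)
print(label, ores)
```

The vector of OREs could equally be fed to any other small classifier; the lowest-error rule is just the simplest choice that needs no further training.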

Page 18

Performance Analysis

[Scatter plot: axes Best in Literature (%) and Our Best Approach (%), each running from 80 to 100]

Problem       Best Literature (%)   Best Our Approach (%)
SMS Spam      97.64                 97.92
MDSD-B        80.4                  99.6
MDSD-D        82.4                  99.4
MDSD-E        84.4                  99.4
MDSD-K        87.7                  99.45
Movie-Sub     93.6                  90.72
Movie-Rev1    82.9                  87.28
Movie-Rev2    90.2                  98.05
IMDB          91.22                 85.61

Page 19

Summary

• Stacked Denoising Autoencoders allow us to spot anomalies within unlabeled data
• Probabilistic Topic Modelling allows us to look at documents at a higher level than just words
• When we have labels we can train on the labels to:
  • Identify sentiment
  • Classify the data

We are recruiting:
- 2 PostDocs (Machine Learning / NLP)
- 1 PostDoc (Parallel Programming)
- Always looking for good PhD candidates

[email protected]@dur.ac.uk