sentiment analysis through the use of unsupervised deep ......ibm. outline •the problem ... latent...

SentimentAnalysisThroughtheUseofUnsupervised

DeepLearningS7330

Monday8th May2017GPUTechnologyConference

A.StephenMcGough,Noura AlMoubayed

DurhamUniversity,UK

IntelParallelComputingCentre

Outline

• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification

TheProblem

• “90%ofallthedataintheworldhasbeengeneratedoverthelasttwoyears”…IBM

• “85%ofworldwidedataisheldinun-structuredformats”…BerryandKogan

• Howcanweunderstandit?….orbetterstillmakeuseofit?• Howcanwedeterminethemostpertinentinformation?…andthenactonit?• Howcanwefindtheneedleifwearenotsurewhatitlookslikeorwhathaylookslike?

• "Withoutlabelling,youcannottrainamachinewithanewtask”…IBM

Outline


DeepLearning:CategorizationandRegression

InputVectorButtheserequirelabelleddatafortraining

True False

HiddenLayer

Output

HiddenLayer

HiddenLayer

…

AnomalyDetection:UnsupervisedDeepLearning

StackedDenoisingAutoencoder (SDA)- Trainedlayerbylayer- Inputiscorrupted

withnoise

Reconstruct

Construct

InputData ReconstructedInputData

vvv v v v

h h h

hhhhh

h h h h h

hh h h h

h h h h hh h h h

v v v v v v v v v v v v

OutputData

We’reallgoingtothecinemaonSaturday

We’reallgoingtothetheatreonSaturday

Lowreconstructionerror

AnomalyDetection:UnsupervisedDeepLearning

StackedDenoisingAutoencoder (SDA)- Trainedlayerbylayer- Inputiscorrupted

withnoise

Reconstruct

Construct

InputData ReconstructedInputData

vvv v v v

h h h

hhhhh

h h h h h

hh h h h

h h h h hh h h h

v v v v v v v v v v v v

OutputData

BuyViagrafromouronlinemedsstore

Bythevalleyouroverallmedalsstore

Highreconstructionerror

Outline


ProbabilisticTopicModelling

• Unsupervisedanalysisoftext• Toomanydocumentstolabelmanually

• Allowsustouncoverautomaticallythemesthatarelatentinacollectionofdocuments• Samewordsmayhavedifferentmeaningsdependingontheirco-occurrencewithotherwordsinadocument• Statisticallyidentifythetopicsinasetofdocuments• Whichtopicsappearinwhichdocuments

• Statisticallyidentifywordsintopics• Whichwordsco-occurinadocumentcontainingtopicX

• Documenttransformedtofeaturevectoroftopics

Thisreportpresentsaproofofconceptofourapproachtosolveanomalydetectionproblemsusingunsuperviseddeeplearning. TheworkfocusesontwospecificmodelsnamelydeeprestrictedBoltzmannmachinesandstackeddenoising autoencoders.Theapproachistestedontwodatasets:VASTNewsfeedDataandtheCommissionforEnergyRegulationsmartmeterprojectdatasetwithtextdataandnumericdatarespectively.Topicmodelingisusedforfeaturesextractionfromtextualdata. Theresultsshowhighcorrelationbetweentheoutputofthetwomodelingtechniques.Theoutliersinenergydatadetectedbythedeeplearningmodelshowaclearpatternovertheperiodofrecordeddatademonstratingthepotentialofthisapproachinanomalydetectionwithinbigdataproblemswherethereislittleornopriorknowledgeorlabels.Theseresultsshowthepotentialofusingunsuperviseddeeplearningmethodstoaddressanomalydetectionproblems. Forexampleitcouldbeusedtodetectsuspiciousmoneytransactionsandhelpwithdetection ofterroristfundingactivitiesoritcouldalsobeappliedtothedetectionofpotentialcriminalorterroristactivityusingphoneordigitil).

Topics

TopicModelling:LatentDirichlet Allocation(LDA)

Topics

RandomlySelecttopics

US Govt Data Shows RussiaUsed Outdated Ukrainian PHPMalware

The United States governmentearlier this year officiallyaccused Russia of interferingwith the US elections. Earlierthis year on October 7th, theDepartment of HomelandSecurity and the Office of theDirector of National Intelligence

RandomlySelectwordsfromtopics

TopicModelling:LatentDirichlet Allocation(LDA)• Eachdocumentcontainsaproportion(θ)ofeachtopic

• Proportionmaybezero• Theprobabilityofeachwordappearinginatopic(β)• LDAassumesdocumentswereconstructedvia:Wordsindocument=NPerdocumenttopicproportions=θ =Dirichlet function(α)ForeachoftheNwordswn:

⎻ Chooseatopiczn ∼Multinomial(θ)⎻ Chooseawordwn fromtopiczn (basedonprobabilitiesβ)

• Needtodetermineα,β,θ givenwehavethedocuments• Solveusingvariational BayesianmethodsorGibbssampling

α

θ

z

β

w

Topic

Document

WordsinDocum

ent

Outline


Example:AnomalyIdentificationSPAMandHAMinSMS

• TrainLDAusingthelabelleddataonlye.g.SPAM

• Unlabeled(HAM+SPAM)usedtotrainSDA

• Auto-identificationofSPAMfromHAMinSMSmessages

Example:AnomalyIdentificationSPAMandHAMinSMS

Outline


MultipleLabelsExample:DeepLearningforSentimentClassification• Trainananomalydetectoroneachpolarityofsentiment• Canhavemany• E.g.positive&negativeview

• Generateoverallreconstructionerror(ORE)foreachlabel• UsesimpleclassifieronOREstoidentifyclass

(SDA)

(LDA)

(IDA)

PerformanceAnalysis

80 82 84 86 88 90 92 94 96 98 10080

82

84

86

88

90

92

94

96

98

100

BestinLiterature(%)

OurBestA

pproach(%

)

Problem BestLiterature

BestOurApproach

SMSSpam 97.64 97.92MDSD-B(%) 80.4 99.6MDSD-D(%) 82.4 99.4MDSD-E(%) 84.4 99.4MDSD-K(%) 87.7 99.45Movie-Sub 93.6 90.72Movie-Rev1 82.9 87.28Movie-Rev2 90.2 98.05IMDB 91.22 85.61

Summary

• StackedDenoising Autoencoder allowustospotanomalieswithinunlabeleddata• ProbabilisticTopicModellingallowsustolookatdocumentsatahigherlevelthanjustwords•Whenwehavelabelswecantrainonthelabelsto:• Identifysentiment• Classifythedata

WeArerecruiting:- 2 PostDoc (MachineLearning/NLP)- 1PostDoc (ParallelProgramming)- AlwayslookingforgoodPhDCandidates

[email protected]@dur.ac.uk

sentiment analysis through the use of unsupervised deep ......ibm. outline •the problem ... latent...

Documents