sentiment analysis through the use of unsupervised deep ......ibm. outline •the problem ... latent...
TRANSCRIPT
SentimentAnalysisThroughtheUseofUnsupervised
DeepLearningS7330
Monday8th May2017GPUTechnologyConference
A.StephenMcGough,Noura AlMoubayed
DurhamUniversity,UK
IntelParallelComputingCentre
Outline
• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification
TheProblem
• “90%ofallthedataintheworldhasbeengeneratedoverthelasttwoyears”…IBM
• “85%ofworldwidedataisheldinun-structuredformats”…BerryandKogan
• Howcanweunderstandit?….orbetterstillmakeuseofit?• Howcanwedeterminethemostpertinentinformation?…andthenactonit?• Howcanwefindtheneedleifwearenotsurewhatitlookslikeorwhathaylookslike?
• "Withoutlabelling,youcannottrainamachinewithanewtask”…IBM
Outline
• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification
DeepLearning:CategorizationandRegression
InputVectorButtheserequirelabelleddatafortraining
True False
HiddenLayer
Output
HiddenLayer
HiddenLayer
…
AnomalyDetection:UnsupervisedDeepLearning
StackedDenoisingAutoencoder (SDA)- Trainedlayerbylayer- Inputiscorrupted
withnoise
Reconstruct
Construct
InputData ReconstructedInputData
vvv v v v
h h h
hhhhh
h h h h h
hh h h h
h h h h hh h h h
v v v v v v v v v v v v
OutputData
We’reallgoingtothecinemaonSaturday
We’reallgoingtothetheatreonSaturday
Lowreconstructionerror
AnomalyDetection:UnsupervisedDeepLearning
StackedDenoisingAutoencoder (SDA)- Trainedlayerbylayer- Inputiscorrupted
withnoise
Reconstruct
Construct
InputData ReconstructedInputData
vvv v v v
h h h
hhhhh
h h h h h
hh h h h
h h h h hh h h h
v v v v v v v v v v v v
OutputData
BuyViagrafromouronlinemedsstore
Bythevalleyouroverallmedalsstore
Highreconstructionerror
Outline
• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification
ProbabilisticTopicModelling
• Unsupervisedanalysisoftext• Toomanydocumentstolabelmanually
• Allowsustouncoverautomaticallythemesthatarelatentinacollectionofdocuments• Samewordsmayhavedifferentmeaningsdependingontheirco-occurrencewithotherwordsinadocument• Statisticallyidentifythetopicsinasetofdocuments• Whichtopicsappearinwhichdocuments
• Statisticallyidentifywordsintopics• Whichwordsco-occurinadocumentcontainingtopicX
• Documenttransformedtofeaturevectoroftopics
Thisreportpresentsaproofofconceptofourapproachtosolveanomalydetectionproblemsusingunsuperviseddeeplearning. TheworkfocusesontwospecificmodelsnamelydeeprestrictedBoltzmannmachinesandstackeddenoising autoencoders.Theapproachistestedontwodatasets:VASTNewsfeedDataandtheCommissionforEnergyRegulationsmartmeterprojectdatasetwithtextdataandnumericdatarespectively.Topicmodelingisusedforfeaturesextractionfromtextualdata. Theresultsshowhighcorrelationbetweentheoutputofthetwomodelingtechniques.Theoutliersinenergydatadetectedbythedeeplearningmodelshowaclearpatternovertheperiodofrecordeddatademonstratingthepotentialofthisapproachinanomalydetectionwithinbigdataproblemswherethereislittleornopriorknowledgeorlabels.Theseresultsshowthepotentialofusingunsuperviseddeeplearningmethodstoaddressanomalydetectionproblems. Forexampleitcouldbeusedtodetectsuspiciousmoneytransactionsandhelpwithdetection ofterroristfundingactivitiesoritcouldalsobeappliedtothedetectionofpotentialcriminalorterroristactivityusingphoneordigitil).
Topics
TopicModelling:LatentDirichlet Allocation(LDA)
Topics
RandomlySelecttopics
US Govt Data Shows RussiaUsed Outdated Ukrainian PHPMalware
The United States governmentearlier this year officiallyaccused Russia of interferingwith the US elections. Earlierthis year on October 7th, theDepartment of HomelandSecurity and the Office of theDirector of National Intelligence
RandomlySelectwordsfromtopics
TopicModelling:LatentDirichlet Allocation(LDA)• Eachdocumentcontainsaproportion(θ)ofeachtopic
• Proportionmaybezero• Theprobabilityofeachwordappearinginatopic(β)• LDAassumesdocumentswereconstructedvia:Wordsindocument=NPerdocumenttopicproportions=θ =Dirichlet function(α)ForeachoftheNwordswn:
⎻ Chooseatopiczn ∼Multinomial(θ)⎻ Chooseawordwn fromtopiczn (basedonprobabilitiesβ)
• Needtodetermineα,β,θ givenwehavethedocuments• Solveusingvariational BayesianmethodsorGibbssampling
α
θ
z
β
w
Topic
Document
WordsinDocum
ent
Outline
• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification
Example:AnomalyIdentificationSPAMandHAMinSMS
• TrainLDAusingthelabelleddataonlye.g.SPAM
• Unlabeled(HAM+SPAM)usedtotrainSDA
• Auto-identificationofSPAMfromHAMinSMSmessages
Example:AnomalyIdentificationSPAMandHAMinSMS
Outline
• TheProblem• UnsupervisedDeepLearningModel• TextProcessingforTopicModelling• DetectingAnomaliesinText• SentimentClassification
MultipleLabelsExample:DeepLearningforSentimentClassification• Trainananomalydetectoroneachpolarityofsentiment• Canhavemany• E.g.positive&negativeview
• Generateoverallreconstructionerror(ORE)foreachlabel• UsesimpleclassifieronOREstoidentifyclass
(SDA)
(LDA)
(IDA)
PerformanceAnalysis
80 82 84 86 88 90 92 94 96 98 10080
82
84
86
88
90
92
94
96
98
100
BestinLiterature(%)
OurBestA
pproach(%
)
Problem BestLiterature
BestOurApproach
SMSSpam 97.64 97.92MDSD-B(%) 80.4 99.6MDSD-D(%) 82.4 99.4MDSD-E(%) 84.4 99.4MDSD-K(%) 87.7 99.45Movie-Sub 93.6 90.72Movie-Rev1 82.9 87.28Movie-Rev2 90.2 98.05IMDB 91.22 85.61
Summary
• StackedDenoising Autoencoder allowustospotanomalieswithinunlabeleddata• ProbabilisticTopicModellingallowsustolookatdocumentsatahigherlevelthanjustwords•Whenwehavelabelswecantrainonthelabelsto:• Identifysentiment• Classifythedata
WeArerecruiting:- 2 PostDoc (MachineLearning/NLP)- 1PostDoc (ParallelProgramming)- AlwayslookingforgoodPhDCandidates
[email protected]@dur.ac.uk