Techniques for Automating Quality Assessment of Context-specific Content on Social Media Services
Prateek Dewan, PhD Thesis Defense
November 14, 2017
Committee members
Dr. Alessandra Sala
Dr. Sanasam Ranbir Singh
Dr. Aditya Telang
Dr. Ponnurangam Kumaraguru (Advisor)
Who am I?
• Data Scientist at Apple
• PhD student since February 2012, IIIT-Delhi
• Masters (2010–2012), IIIT-Delhi
• Collaborations
  • IBM IRL (Delhi and Bengaluru), Symantec Research Labs (Pune), Dublin City University (Ireland), UFMG (Brazil)
• Worked in Privacy and Security on Online Social Media
• Research interests
  • Applied Machine Learning
  • Natural Language Processing
  • Web Security
Online Social Media: The Big Picture
“With great power comes great responsibility”
Thesis statement
• To design and evaluate automated techniques for quality assessment of context-specific content on social media services in real time
• Focus: Facebook
  • Biggest online social media service
  • 2.01 billion monthly active users
  • 2 out of every 7 human beings on the planet use Facebook
  • Most sought-after OSN for news
Proposed Solution
Identify, Characterize, Model, Prototype, Deploy, Evaluate
Facebook Inspector: Demo
Scope
• Establishing the definition of poor quality content
• What content is poor in quality?
  • Untrustworthy
  • Child unsafe
  • Misleading information
  • Hoaxes, scams, clickbait
  • Violence, hate speech
• Definition conforming to
  • Facebook’s community standards [1]
  • Definitions of page spam
[1] https://www.facebook.com/communitystandards
Approach
Characterize
• Poor quality posts published on Facebook
• Facebook pages publishing poor quality content
• Misinformation spread on Facebook through images
Model
• Ground truth extraction using URL blacklists and human annotation
• Experiments with multiple supervised learning techniques
• Two-fold model to identify malicious content in real time
Implement
• Facebook Inspector (FbI) architecture
• Live deployment via REST API and browser plug-ins for Chrome and Firefox
• 3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed
• Evaluation in terms of response time, performance, and usability
Dataset
Data type | Quantity
Unique posts | 4,465,371
Unique entities | 3,373,953
Unique users | 2,983,707
Unique pages | 390,246
Unique URLs | 480,407
Unique posts with one or more URLs | 1,222,137
Unique entities posting URLs | 856,758
Unique posts with one or more malicious URLs | 11,217
Unique entities posting one or more malicious URLs | 7,962
Unique malicious URLs | 4,622
Establishing Ground Truth
• Extracted posts containing one or more URLs
  • 1.2 million out of 4.4 million posts in total
  • 480k unique URLs
• Used six URL blacklists
  • Google Safe Browsing (malware/phishing)
  • VirusTotal (spam/malware/phishing)
  • SURBL (spam)
  • Web of Trust (trust score)*
  • Spamhaus (spam)
  • PhishTank (phishing)
• Posts containing one or more blacklisted URLs were marked as poor quality posts (11,217 in all)
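The labeling rule above (a post is poor quality if any of its URLs appears on any blacklist) can be sketched as follows. This is a minimal illustration, not the thesis code: the in-memory sets stand in for the real blacklist services, and the names `is_poor_quality`, `spam_list`, and `phish_list` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

def is_poor_quality(post_urls: Iterable[str],
                    blacklist_checks: List[Callable[[str], bool]]) -> bool:
    """Return True if any URL in the post is flagged by any blacklist check."""
    return any(check(url) for url in post_urls for check in blacklist_checks)

# Hypothetical stand-ins for blacklist lookups; real services need API calls.
spam_list = {"http://spam.example/win"}
phish_list = {"http://phish.example/login"}
checks = [lambda u: u in spam_list, lambda u: u in phish_list]

# Each post is represented only by the URLs it contains.
posts: Dict[str, List[str]] = {
    "p1": ["http://news.example/story"],
    "p2": ["http://news.example/story", "http://phish.example/login"],
    "p3": [],  # posts without URLs cannot be labeled this way
}
labels = {pid: is_poor_quality(urls, checks) for pid, urls in posts.items()}
```

Posts without URLs always come out unflagged under this rule, which is why a URL-only ground truth cannot cover them.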
Web of Trust
A URL (e.g. http://www.domain.com) is marked Malicious if:
• Reputation: Unsatisfactory / Poor / Very poor (less than 60); Confidence: High (greater than 10)
• OR Category: Negative
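A minimal sketch of the Web of Trust rule, using the thresholds from the slide (reputation below 60, confidence above 10, or a negative category). Combining reputation and confidence with AND is an assumption, since the slide only places them side by side. The function name and argument layout are hypothetical and do not mirror the WOT API.

```python
from typing import Sequence

def wot_is_malicious(reputation: int, confidence: int,
                     categories: Sequence[str]) -> bool:
    """Apply the slide's thresholds to one WOT response for a URL."""
    # Assumption: a low reputation only counts when the confidence is high.
    low_reputation = reputation < 60 and confidence > 10
    # Any category labeled "negative" marks the URL regardless of reputation.
    negative_category = any(c.lower() == "negative" for c in categories)
    return low_reputation or negative_category
```

For example, `wot_is_malicious(45, 30, [])` flags the URL, while `wot_is_malicious(45, 5, [])` does not because the confidence is too low.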
Findings
• Facebook’s current techniques do not suffice
  • 65% of all poor quality posts existed on Facebook after 4 (or more) months
  • Gathered likes from 52,169 unique users; comments from 8,784 unique users
• Facebook’s partnership with Web of Trust?
  • 88% of all malicious URLs had poor reputation on WOT
  • No warning pages
Platforms used to post
Distribution of poor quality posts
[Figure labels: Pages, Users; Entities, Posts]
Facebook Pages posting poor quality content
Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. Prateek Dewan, Shrey Bagroy, and Ponnurangam Kumaraguru (Short paper). Published at IEEE/ACM Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, USA. 2016.
Ground Truth extraction: Facebook pages
• 4.4 million posts
• 10,341 malicious posts (1,557 pages; 5,868 users)
• Filter: 1 or more malicious URLs in the most recent 100 posts
• 627 malicious pages
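The page-level filter (one or more malicious URLs in the most recent 100 posts) can be sketched as below. It assumes posts are already sorted newest first and that each post is reduced to its list of URLs. The names `is_malicious_page` and `is_malicious_url` are hypothetical.

```python
from typing import Callable, List

def is_malicious_page(posts_newest_first: List[List[str]],
                      is_malicious_url: Callable[[str], bool],
                      window: int = 100) -> bool:
    """True if any URL in the `window` most recent posts is malicious."""
    recent = posts_newest_first[:window]
    return any(is_malicious_url(url) for post in recent for url in post)

# Toy example: one bad URL, placed inside or just outside the window.
bad = {"http://bad.example/x"}
inside = [[] for _ in range(99)] + [["http://bad.example/x"]]    # 100th post
outside = [[] for _ in range(100)] + [["http://bad.example/x"]]  # 101st post
flag_inside = is_malicious_page(inside, lambda u: u in bad)
flag_outside = is_malicious_page(outside, lambda u: u in bad)
```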
Dataset of pages posting poor quality content
WOT response | No. of pages | No. of posts
Child unsafe | 387 | 10,891
Untrustworthy | 317 | 8,057
Questionable | 312 | 8,859
Negative | 266 | 5,863
Adult content | 162 | 3,290
Spam | 124 | 4,985
Phishing | 39 | 495
Total | 627 (31) | 20,999
• Numbers in brackets are Verified pages
Content analysis (page names)
• Sentence tokenization → Word tokenization → Case normalization → Stemming → Stopword removal
• N-gram analysis (n = 1, 2, 3)
• Politically polarized entities amongst poor quality pages
  • British National Party (BNP), The Tea Party, English Defense League, American Defense League, American Conservatives, Geert Wilders supporters…
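The preprocessing steps and n-gram counts listed above could look roughly like this. It is a sketch only: the slide does not name a tokenizer or stemmer, so `crude_stem` is a deliberately naive stand-in for a real stemmer, and `STOPWORDS` is a tiny illustrative list.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "for", "in"}  # illustrative only

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = []
    for sentence in re.split(r"[.!?]+", text):            # sentence tokenization
        for word in re.findall(r"[A-Za-z']+", sentence):  # word tokenization
            word = crude_stem(word.lower())               # case normalization, stemming
            if word not in STOPWORDS:                     # stopword removal
                tokens.append(word)
    return tokens

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical page names; count unigrams, bigrams, and trigrams.
page_names = ["The Lovers of Free Games", "Free Games for Kids"]
counts = {n: Counter() for n in (1, 2, 3)}
for name in page_names:
    toks = preprocess(name)
    for n in counts:
        counts[n].update(ngrams(toks, n))
```

Here `counts[2].most_common(1)` surfaces the bigram `("free", "game")`, which occurs in both page names.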
Network analysis
• Collusive behavior within pages posting poor quality content
[Figure panels: Shares, Likes, Comments]
Temporal activity
• Activity ratio = (no. of time units active) / (total no. of time units during complete observation period)
• Malicious pages are more active than benign pages
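A small worked example of the activity ratio. It assumes the time unit is one calendar day and that a page counts as active on a day when it publishes at least one post. The function name is hypothetical.

```python
from datetime import date

def activity_ratio(post_dates, start, end):
    """Fraction of days in [start, end] with at least one post."""
    total_days = (end - start).days + 1
    # A set collapses several posts on the same day into one active day.
    active_days = {d for d in post_dates if start <= d <= end}
    return len(active_days) / total_days

# Four posts on three distinct days within a 10-day observation period.
ratio = activity_ratio(
    [date(2015, 4, 1), date(2015, 4, 1), date(2015, 4, 3), date(2015, 4, 10)],
    start=date(2015, 4, 1),
    end=date(2015, 4, 10),
)
```

A page that posts on 3 of 10 observed days gets a ratio of 0.3.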
Why?: The Human Brain - Images versus text
• Human brain processes images 60,000 times faster than text
Are we doing enough to "understand" images?
• Most research to analyze social media content focuses on text
  • Topic modelling
  • Sentiment analysis
  • Does it capture everything?
• Studies related to images are limited to small scale
  • Few hundred images manually annotated and analyzed
• What can be done?
  • Automated techniques for image summarization; Deep Learning and Convolutional Neural Networks (CNNs) to scale across large no. of images
  • Domain transfer learning
  • Optical Character Recognition
Methodology
• Images posted on Facebook during the Paris Attacks, November 2015
• 3-tier pipeline for extracting high level image descriptors from images

Unique posts | 131,548
Unique users | 106,275
Posts with images | 75,277
Total images extracted | 57,748
Total unique images | 15,123

Pipeline: Images → Human understandable descriptors
• Tier 1: Visual Themes. Themes (Inception v3), followed by manual calibration
• Tier 2: Image Sentiment. Image Sentiment (DeCAF trained on SentiBank)
• Tier 3: Text embedded in images. Optical Character Recognition, then Text Sentiment (LIWC) + Topics (TF)
Tier I: Visual Themes
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2012
  • 1.2 million images, 1,000 categories
• Model: Google’s Inception-v3, trained on the ILSVRC 2012 data (top-1 error: 17.2%)
  • 48-layer Deep Convolutional Neural Network
Tier I: Visual Themes contd.
• All images labeled using Inception-v3
• Validation:
  • Random sample of 2,545 images annotated by 3 human annotators
  • 38.87% accuracy (majority voting)
• Manual calibration
  • Renamed 7 out of the top 30 (most frequently occurring) labels
  • New accuracy: 51.3%
• Why rename? Example: images labeled “Bolo Tie” by Inception-v3 were “Peace For Paris” images in our dataset
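One way to score model labels against three annotators with majority voting is sketched below. The slide does not describe the exact annotation protocol, so this assumes each annotator supplies a label per image and that images without a strict majority are skipped. All names and the toy labels are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def accuracy_against_majority(predictions, annotations):
    """Share of images where the model label equals the majority label."""
    scored = correct = 0
    for pred, votes in zip(predictions, annotations):
        gold = majority_label(votes)
        if gold is None:  # no majority among annotators: skip the image
            continue
        scored += 1
        correct += int(pred == gold)
    return correct / scored if scored else 0.0

predictions = ["bolo tie", "candle", "flag"]
annotations = [
    ["peace sign", "peace sign", "bolo tie"],  # majority: peace sign
    ["candle", "candle", "torch"],             # majority: candle
    ["flag", "banner", "poster"],              # no majority, skipped
]
acc = accuracy_against_majority(predictions, annotations)
```

Renaming a model label, for instance mapping "bolo tie" to "peace sign" before scoring, would turn the first image into a match, which is the effect the manual calibration step aims for.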
Tier II: Image Sentiment
• Domain Transfer Learning
  • Inception-v3’s last layer retrained using SentiBank
• SentiBank
  • Images collected from Flickr using Adjective Noun Pairs (ANPs) as search query
  • ANPs: happy dog, adorable baby, abandoned house
  • Weakly labeled dataset of images carrying emotion
• Final training set: 133,108 negative + 305,100 positive sentiment images
• 10-fold random subsampling
• 69.8% accuracy
Tier III: Text embedded in images
• Optical Character Recognition (OCR)
  • Tesseract OCR (Python)
  • 31,689 images had text
• Manually extracted text from a random sample of 1,000 images
• Compared with OCR output using string similarity metrics
• ~62% accuracy
Tesseract output:
No-one thinks that these people are representative of Christians. So why do so many think that these people are representative of Muslims?
Image and post text had different topics
• Text embedded in images depicted more negative sentiment than user generated textual content
[Figure panels: Text embedded in images; User generated text]
Sentiment: Images versus text
• Image sentiment was more positive than text sentiment
[Figure: Sentiment Value / Volume Fraction (0 to 0.6) against No. of hours after the attacks (8 to 280); series: Post Text, Image Text, Image, Volume Fraction]
Poor quality image content popular on Facebook
Revisiting: Establishing Ground Truth
• Extracted posts containing one or more URLs
  • 1.2 million out of 4.4 million posts in total
  • 480k unique URLs
• Used six URL blacklists
  • Google Safe Browsing (malware/phishing)
  • VirusTotal (spam/malware/phishing)
  • SURBL (spam)
  • Web of Trust (trust score)*
  • Spamhaus (spam)
  • PhishTank (phishing)
• Posts containing one or more blacklisted URLs were marked as poor quality posts (11,217 in all)
Ground Truth extraction – Dataset II
• What if a post does not have a URL?
• 500 random Facebook posts x 17 events x 3 annotators
• Definition of malicious post
  • “Any irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc. are categorized as spam. In terms of online social media, social spam is any content which is irrelevant/unrelated to the event under consideration, and/or aimed at spreading phishing, malware, advertisements, self promotion etc., including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, scams, fake information etc.”
• Final dataset (all 3 annotators agreed on the same label)
  • 571 malicious posts
  • 3,841 benign posts
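Keeping only the posts on which all three annotators agree can be sketched as follows. The post IDs, label strings, and function name are hypothetical.

```python
def unanimous_labels(annotations):
    """Keep posts where every annotator gave the same label."""
    return {
        post_id: votes[0]
        for post_id, votes in annotations.items()
        if votes and len(set(votes)) == 1
    }

annotations = {
    "post_1": ["malicious", "malicious", "malicious"],
    "post_2": ["benign", "malicious", "benign"],  # disagreement: dropped
    "post_3": ["benign", "benign", "benign"],
}
final_dataset = unanimous_labels(annotations)
```

Dropping posts with any disagreement trades dataset size for label reliability.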
Feature set: Facebook Posts
Source | Features
Entity (9) | isPage, gender, pageCategory, hasUsername, usernameLength, nameLength, numWordsInName, locale, pageLikes
Textual content (18) | Presence of !, ?, !!, ??, emoticons (smile, frown), numWords, avgWordLength, numSentences, avgSentenceLength, numDictionaryWords, numHashtags, hashtagsPerWord, numCharacters, numURLs, URLsPerWord, numUppercaseCharacters, numWords / numUniqueWords
Metadata (10) | Application, Presence of facebook.com URL, Presence of apps.facebook.com URL, Presence of Facebook event URL, hasMessage, hasStory, hasPicture, hasLink, type, linkLength
Link (7) | http/https, numHyphens, numParameters, avgParameterLength, numSubdomains, pathLength
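The link features named in the table can be computed from a URL with the standard library, as sketched below. The exact definitions are not given on the slide. This sketch assumes that `avgParameterLength` is the mean length of query parameter values and that `numSubdomains` counts host labels beyond the last two (a naive rule that ignores suffixes such as co.uk). The key `isHttps` is a hypothetical name for the http/https feature.

```python
from urllib.parse import urlparse, parse_qsl

def link_features(url):
    """Extract simple lexical features from a URL string."""
    parts = urlparse(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        "isHttps": parts.scheme == "https",
        "numHyphens": url.count("-"),
        "numParameters": len(params),
        # Assumption: average over the lengths of parameter values.
        "avgParameterLength": (
            sum(len(value) for _, value in params) / len(params) if params else 0.0
        ),
        # Assumption: everything left of the last two host labels is a subdomain.
        "numSubdomains": max(len(labels) - 2, 0),
        "pathLength": len(parts.path),
    }

features = link_features(
    "https://free-gifts.promo.example.com/win/now?id=12345&ref=fb"
)
```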
Supervised learning: Dataset I
Classifier / Features | Entity | Text | Metadata | Link | All | Top 7
Naïve Bayes | 54.79 | 52.41 | 71.60 | 69.25 | 56.15 | 74.72
Decision Tree | 63.02 | 64.78 | 80.56 | 82.34 | 84.67 | 86.17
Random Forest | 63.47 | 66.25 | 80.67 | 82.56 | 85.05 | 86.62
SVM (RBF) | 61.77 | 64.89 | 78.75 | 81.45 | 75.89 | 83.66
Supervised learning: Dataset II
Classifier / Features | Entity | Text | Metadata | Link | All
Naïve Bayes | 51.67 | 51.60 | 72.45 | 77.58 | 67.63
Decision Tree | 51.66 | 73.16 | 79.01 | 81.04 | 76.17
Random Forest | 52.86 | 76.56 | 79.87 | 81.49 | 80.56
SVM (RBF) | 53.16 | 76.52 | 78.18 | 80.37 | 73.79
Feature set: Facebook Pages
Page features | Likes, talking about, description length, bio, category, name, location, check-ins, …
Posting behavior | Daily activity ratio, post types, post likes, post comments, post shares, post engagement ratio, post language, average post length, no. of unique URLs in posts, no. of unique domains in posts, etc.
• Supervised learning
  • Page + post features
    • 55 features from page information
    • 41 features from posting behavior
  • Bag of words
    • Content generated by pages
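Supervised learning: Page + post features
Classifier | Feature set | Accuracy (%) | ROC AUC
Naïve Bayesian | Page | 63.95 | 0.685
Naïve Bayesian | Post | 69.61 | 0.753
Naïve Bayesian | Page + Post | 70.81 | 0.776
Logistic Regression | Page | 67.38 | 0.745
Logistic Regression | Post | 76.55 | 0.825
Logistic Regression | Page + Post | 76.71 | 0.846
Decision Trees | Page | 65.55 | 0.668
Decision Trees | Post | 71.37 | 0.720
Decision Trees | Page + Post | 70.81 | 0.758
Random Forest | Page | 67.86 | 0.750
Random Forest | Post | 74.95 | 0.829
Random Forest | Page + Post | 75.27 | 0.837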
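Supervised learning: Bag of words
Classifier | Feature set | Accuracy (%) | ROC AUC
Naïve Bayesian | Unigrams | 68.27 | 0.682
Naïve Bayesian | Bigrams | 69.06 | 0.690
Naïve Bayesian | Trigrams | 69.77 | 0.697
Logistic Regression | Unigrams | 74.18 | 0.795
Logistic Regression | Bigrams | 74.34 | 0.791
Logistic Regression | Trigrams | 73.93 | 0.789
Decision Trees | Unigrams | 68.12 | 0.678
Decision Trees | Bigrams | 67.05 | 0.678
Decision Trees | Trigrams | 66.63 | 0.672
Random Forest | Unigrams | 72.26 | 0.794
Random Forest | Bigrams | 71.80 | 0.802
Random Forest | Trigrams | 72.18 | 0.794
Sparse NN | Unigrams | 81.74 | 0.862
Sparse NN | Bigrams | 84.12 | 0.872
Sparse NN | Trigrams | 84.13 | 0.900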
Model for real time detection
• Model for pages depends on posts published by pages
  • Can’t be used for detection in real time
• Two-fold supervised learning based model using post features
• Utilizing class probabilities for decision making
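Decision boundary
[Figure: decision boundary over the class probabilities from Classifier 1 and Classifier 2 (axes from 0 to 1); region labels: High, High, Low; classes: Malicious, Benign]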
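One plausible reading of the decision boundary is sketched below, and it should be treated as an assumption rather than the rule used in the thesis: a post is marked malicious when either classifier assigns a high malicious-class probability, and benign when both are low. The 0.5 threshold and all names are hypothetical.

```python
def two_fold_decision(p1_malicious, p2_malicious, threshold=0.5):
    """Combine two classifiers' malicious-class probabilities into one label.

    Assumed rule: a high probability from either classifier marks the post
    as malicious; both low means benign.
    """
    for p in (p1_malicious, p2_malicious):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return "malicious" if max(p1_malicious, p2_malicious) >= threshold else "benign"
```

With this rule, `two_fold_decision(0.9, 0.1)` and `two_fold_decision(0.2, 0.7)` both return "malicious", while `two_fold_decision(0.2, 0.3)` returns "benign". Using probabilities instead of hard labels lets the threshold be tuned to trade false positives against misses.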
Facebook Inspector (FbI): Architecture
FbI stats
Date of public launch | August 23, 2015
Total incoming requests | 9 million+
Total public posts analyzed | 3.5 million+
Total downloads | 5,000+
Daily active users | 250+
Total unique browsers | 1,250+
Posts marked as malicious | 615,000+
Posts marked as benign | 2.9 million+
FbI evaluation: Response time
• ~80% posts processed within 3 seconds
• Average time per post: 2.635 seconds
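The two response-time figures (share of posts processed within 3 seconds, and the average time per post) can be computed from per-post timings as sketched here. The sample timings are made up for illustration and are not measurements from FbI.

```python
from statistics import mean

def response_time_summary(seconds, cutoff=3.0):
    """Return the mean time and the fraction of posts processed within `cutoff`."""
    within = sum(1 for s in seconds if s <= cutoff)
    return {
        "mean_seconds": mean(seconds),
        "fraction_within_cutoff": within / len(seconds),
    }

# Hypothetical per-post processing times in seconds.
summary = response_time_summary([1.0, 2.0, 2.5, 3.0, 6.5])
```

A few slow posts can pull the mean up even when most posts finish quickly, which is why the slide reports both numbers.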
FbI evaluation: Usability
• Usability study with 53 participants
  • SUS score: 81.36 (A grade)
  • Higher perceived usability than >90% of all systems evaluated using the SUS scale
• 98.1% participants found FbI “easy to use”
• 67.9% participants would like to use FbI frequently
• Quotes from users:
  • “Saves your time spent on spam links and hence enhances user experience.”
  • “[Facebook Inspector] Can be useful for minors and people who lack the judgement to decide how the post is.”
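For reference, a System Usability Scale (SUS) score is computed per participant from ten items rated 1 to 5: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. A study-level score is the mean over participants. The sketch below follows that standard scoring; the sample responses are invented.

```python
from statistics import mean

def sus_score(responses):
    """Score one participant's ten SUS responses (each 1..5) on a 0..100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5

# Hypothetical responses from two participants.
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral answers
]
study_score = mean(sus_score(p) for p in participants)
```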
Contributions summary
• Identified and characterized poor quality content spread on Facebook, with the purpose of identifying poor quality posts published during news-making events in real time
• Evaluated supervised learning approaches for identifying poor quality posts on Facebook in real time, using entity, textual, metadata, and URL features
• Deployed and evaluated a novel framework and system for real time detection of poor quality posts on Facebook during news-making events
How does it help?
• Social media services are the primary source of information for the majority of Internet users
  • Content is unmoderated and crowd-sourced; not everything you see is true
• Facebook Inspector provides a useful and usable real world solution to assist users
• Methodology for fast and accurate summarization of image datasets pertaining to a given topic
  • Government agencies / brands can use this methodology to quickly produce high-level summaries of events / products and gauge the pulse of the masses
Real world impact
• Real time system Facebook Inspector, built to identify poor quality content, is used by 250+ Facebook users and has processed over 9 million requests
• A unique dataset of Facebook posts containing malicious URLs, pages posting malicious content, and images depicting misinformation from 20+ news-making events
Limitations and future work
• Current system does not incorporate user feedback
  • We would like to enable users to provide feedback to make a more personalized detection model
• Computer vision techniques have limited accuracy on social media content
  • Object detection, sentiment analysis, and optical character recognition techniques we used are not tested thoroughly on social media content
• Identify and rank users on the basis of degree of malice
  • The more malicious content generated, the higher the ranking
Acknowledgements
• NIXI for travel support (eCRS, 2014)
• IIIT-Delhi for travel support (ASONAM, 2017)
• Govt. of India for funding during PhD
• Collaborators and co-authors: Dr. Anand Kashyap, Shrey Bagroy, Anshuman Suri, Varun Bharadhwaj, Aditi Mithal
• Monitoring committee: Dr. Vinayak and Dr. Sambuddho
• Peers: Dr. Niharika Sachdeva, Anupama Aggarwal, Dr. Paridhi Jain, Dr. Aditi Gupta, Srishti Gupta, Rishabh Kaushal
• Members of Precog@IIITD and CERC
• Everyone else who has been part of my journey…
Publications – Part of thesis
• Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook. Book chapter, Lecture Notes in Social Networks, Springer, 2017 (To appear)
• Dewan, P., Suri, A., Bharadhwaj, V., Mithal, A., and Kumaraguru, P. Towards Understanding Crisis Events On Online Social Networks Through Pictures. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017.
• Dewan, P., and Kumaraguru, P. Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook. Social Network Analysis and Mining Journal (SNAM), 2017. Volume 7, Issue 1.
• Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2016 (Short paper)
• Dewan, P., and Kumaraguru, P. Towards Automatic Real Time Identification of Malicious Posts on Facebook. Thirteenth Annual Conference on Privacy, Security and Trust (PST), 2015
• Dewan, P., Kashyap, A., and Kumaraguru, P. Analyzing Social and Stylometric Features to Identify Spear phishing Emails. APWG eCrime Research Symposium (eCRS), 2014
Publications – Other
• Kaushal, R., Chandok, S., Jain, P., Dewan, P., Gupta, N., and Kumaraguru, P. Nudging Nemo: Helping Users Control Linkability across Social Networks. 9th International Conference on Social Informatics (SocInfo), 2017 (Short paper).
• Deshpande, P., Joshi, S., Dewan, P., Murthy, K., Mohania, M., Agrawal, S. The Mask of ZoRRo: preventing information leakage from documents. Knowledge and Information Systems Journal, 2014
• Mittal, S., Gupta, N., Dewan, P., Kumaraguru, P. Pinned it! A large scale study of the Pinterest network. 1st ACM IKDD Conference on Data Sciences (CoDS), 2014
• Dewan, P., Gupta, M., Goyal, K., and Kumaraguru, P. MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media. IBM ICARE, 2013
• Magalhães, T., Dewan, P., Kumaraguru, P., Melo-Minardi, R., and Almeida, V. uTrack: Track Yourself! Monitoring Information on Online Social Media. 22nd International World Wide Web Conference (WWW), 2013
• Conway, M., Dewan, P., Kumaraguru, P., McInerney, L. 'White Pride Worldwide': A Meta-analysis of Stormfront.org. Internet, Politics, Policy 2012: Big Data, Big Challenges?, Oxford Internet Institute, University of Oxford.