Verifying Multimedia Use at MediaEval 2016

Christina Boididou¹, Symeon Papadopoulos¹, Duc-Tien Dang-Nguyen², Giulia Boato², Michael Riegler³, Stuart E. Middleton⁴, Andreas Petlund³, and Yiannis Kompatsiaris¹

¹ Information Technologies Institute, CERTH, Greece. [boididou,papadop,ikom]@iti.gr
² University of Trento, Italy. [dangnguyen,boato]@disi.unitn.it
³ Simula Research Laboratory, Norway. [email protected], [email protected]
⁴ University of Southampton IT Innovation Centre, Southampton, UK. [email protected]

ABSTRACT

This paper provides an overview of the Verifying Multimedia Use task that takes place as part of the 2016 MediaEval Benchmark. The task motivates the development of automated techniques for detecting manipulated and misleading use of web multimedia content. Splicing, tampering and reposting videos and images are examples of manipulation that are part of the task definition. For the 2016 edition of the task, a corpus of images/videos and their associated posts is made available, together with labels indicating the appearance of misuse (fake) or not (real) in each case, as well as some useful post metadata.

1. INTRODUCTION

Social media platforms such as Twitter and Facebook are very popular as a means of news sharing and are often used by governments and politicians to reach the public. The speed at which news spreads on such platforms often leads to the appearance of large amounts of misleading multimedia content. Given the need for automated real-time verification of this content, several techniques have been presented by researchers. For instance, previous work focused on the classification between fake and real tweets spread during Hurricane Sandy [6] and other events [2], or on automatic methods for assessing posts' credibility [3]. Several systems for checking content credibility have been proposed, such as Truthy [8], TweetCred [5] and Hoaxy [9].
The second edition of this task aims to encourage the development of new verification approaches. This year, the task is extended by introducing a sub-task focused on identifying digitally manipulated multimedia content. To this end, we encourage participants to create text-focused and/or image-focused approaches equally.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands.

[Figure 1: Examples of misleading (fake) image use: (a) reposting of a real photo claiming to show two Vietnamese siblings at the 2015 Nepal earthquake; (b) reposting of artwork as a photo of the solar eclipse (March 2015); (c) sharks spliced onto a photo during Hurricane Sandy in 2012.]

2. TASK OVERVIEW

Main task. The definition of the main task is the following: "Given a social media post, comprising a text component, an associated piece of visual content (image/video) and a set of metadata originating from the social media platform, the task requires participants to return a decision (fake, real or unknown) on whether the information presented by this post sufficiently reflects the reality." In practice, participants receive a list of posts that are associated with images and are required to automatically predict, for each post, whether it is trustworthy or deceptive (real or fake, respectively). In addition to fully automated approaches, the task also considers human-assisted approaches, provided that they are practical (i.e., fast enough) in real-world settings. The following definitions should also be taken into account:

• A post is considered fake when it shares multimedia content that does not represent the event that it refers to. Figure 1 presents examples of such content.

• A post is considered real when it shares multimedia that legitimately represents the event it refers to.
• A post that shares multimedia content that does not represent the event it refers to, but reports the false information or refers to it with a sense of humour, is considered neither fake nor real (and hence is not included in the task dataset).

Sub-task. This version of the task addresses the problem of detecting digitally manipulated (tampered) images. The definition of the sub-task is the following: "Given an image, the task requires participants to return a decision (tampered, non-tampered or unknown) on whether the image has been digitally modified or not." In practice, participants receive a list of images and are required to predict, using multimedia forensic analysis, whether each image is tampered or not. It should also be noted that an image is considered tampered when it is digitally altered.

In both cases, the task also asks participants to optionally return an explanation (which can be a text string, or URLs pointing to evidence) that supports the verification decision.

The explanation is not used for quantitative evaluation, but rather for gaining qualitative insights into the results.

3. VERIFICATION CORPUS

Development dataset (devset). This is provided together with ground truth and is used by participants to develop their approach. For the main task, it contains posts related to the 17 events of Table 1, comprising in total 193 cases of real and 220 cases of misused images/videos, associated with 6,225 real and 9,596 fake posts posted by 5,895 and 9,216 unique users, respectively. This data is the union of last year's devset and testset [1]. Note that several of the events, e.g., Columbian Chemicals and Passport Hoax, are hoaxes, hence all multimedia content associated with them is misused. For several real events (e.g., MA flight 370) no real images (and hence no real posts) are included in the dataset, since none came up as a result of the data collection process that is described below. For the sub-task, the development set contains 33 cases of non-tampered and 33 cases of tampered images derived from the same events, along with their labels (tampered and non-tampered).

Test dataset (testset). This is used for evaluation. For the main task, it comprises 104 cases of real and misused images and 25 cases of real and misused videos in total, associated with 1,107 and 1,121 posts, respectively. For the sub-task, it includes 64 cases of both tampered and non-tampered images from the testset events.

The data for both datasets are publicly available¹. Similar to the 2015 edition of the task, the posts were collected around a number of known events or news stories and contain fake and real multimedia content, manually verified by cross-checking online sources (articles and blogs). Having defined a set of keywords K for each testset event, we collected a set of posts P (using the Twitter API and specific keywords) and a set of unique fake and real pictures around these events, resulting in the fake and real image sets I_F, I_R respectively. We then used the image sets as seeds to create our reference verification corpus P_C ⊂ P, which includes only those posts that contain at least one image of the predefined sets I_F, I_R. However, in order not to restrict the posts to the ones pointing to the exact image, we employed a scalable visual near-duplicate search strategy [10]: we used I_F, I_R as visual queries and, for each query, we checked whether each post image from the P set exists as an image item or a near-duplicate image item of the I_F or the I_R set. In addition to this process, we also used a real-time system that collects posts using keywords and a location filter [7]. This was performed mainly to increase the real samples for events that occurred in known locations.
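The seed-based filtering described above can be sketched in a few lines. Note this is a simplified illustration, not the task's actual pipeline: the corpus was built with a scalable VLAD-based visual search [10], whereas the sketch below stands in a basic perceptual average hash, and all function and field names are hypothetical.

```python
# Sketch of seed-based corpus filtering: keep only posts whose image is a
# near-duplicate of a known fake (I_F) or real (I_R) seed image.
# Assumption: images are already downscaled to 8x8 grayscale (64 ints);
# the real task used VLAD-based visual search [10], not aHash.

def average_hash(pixels):
    """64-bit average hash: each bit is 1 if the pixel is >= the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def filter_corpus(posts, seed_hashes, max_dist=5):
    """Return the reference corpus P_C: posts whose image hash lies within
    max_dist bits of any seed hash (i.e., a near-duplicate of a seed)."""
    corpus = []
    for post in posts:
        h = average_hash(post["image_pixels"])
        if any(hamming(h, s) <= max_dist for s in seed_hashes):
            corpus.append(post)
    return corpus
```

A small Hamming-distance budget (here 5 bits of 64) is what tolerates recompressed or lightly cropped reposts that an exact-match check would miss.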

To further extend the testset, we carried out a crowdsourcing campaign using the microWorkers platform². We asked each worker to provide three cases of manipulated multimedia content that they found on the web. Furthermore, they had to provide a link with information and a description of each case, along with online resources containing evidence of its misleading nature. We also asked them to provide the original content, if available. To avoid cheating, they had to provide a manual description of the manipulation. We also tested the task in two pilot studies to be sure that the information we got would also be useful. Overall, the data collected was very useful. We performed 75 tasks and each worker earned 2.75$ per task.

¹ https://github.com/MKLab-ITI/image-verification-corpus/tree/master/mediaeval2016
² https://microworkers.com

Table 1: devset events. For each event, we report the numbers of unique real (if available) and fake images/videos (I_R, I_F respectively), unique posts that shared those images (P_R, P_F), and unique Twitter accounts that posted those tweets (U_R, U_F).

Name | I_R | P_R | U_R | I_F | P_F | U_F
Hurricane Sandy | 148 | 4664 | 4446 | 62 | 5559 | 5432
Boston Marathon bombing | 28 | 344 | 310 | 35 | 189 | 187
Sochi Olympics | - | - | - | 26 | 274 | 252
Bring Back Our Girls | - | - | - | 7 | 131 | 126
MA flight 370 | - | - | - | 29 | 501 | 493
Columbian Chemicals | - | - | - | 15 | 185 | 87
Passport hoax | - | - | - | 2 | 44 | 44
Rock Elephant | - | - | - | 1 | 13 | 13
Underwater bedroom | - | - | - | 3 | 113 | 112
Livr mobile app | - | - | - | 4 | 9 | 9
Pig fish | - | - | - | 1 | 14 | 14
Nepal earthquake | 11 | 1004 | 934 | 21 | 356 | 343
Solar Eclipse | 4 | 140 | 133 | 6 | 137 | 135
Garissa Attack | 2 | 73 | 72 | 2 | 6 | 6
Samurai and Girl | - | - | - | 4 | 218 | 212
Syrian Boy | - | - | - | 1 | 1786 | 1692
Varoufakis and ZDF | - | - | - | 1 | 61 | 59
Total | 193 | 6225 | 5895 | 220 | 9596 | 9216

For every item of the datasets, we extracted and made available three types of features, similar to the ones we made available for the 2015 edition of the task: (i) features extracted from the post itself, i.e., the number of words, hashtags, mentions, etc. in the post's text [1]; (ii) features extracted from the user account, i.e., number of friends and followers, whether the user is verified, etc. [1]; and (iii) forensic features extracted from the image, i.e., the probability map of the aligned double JPEG compression, the estimated quantization steps for the first six DCT coefficients of the non-aligned JPEG compression, and the Photo-Response Non-Uniformity (PRNU) [4].
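To make category (i) concrete, a minimal sketch of post-level feature extraction might look like the following. The released feature set is richer and was precomputed by the organizers [1]; the function and field names here are illustrative, not the official schema.

```python
import re

def post_features(text):
    """Toy post-level features of the kind released with the dataset [1]:
    word count, hashtag count, mention count, and URL presence.
    Illustrative only; the official feature set is larger."""
    return {
        "num_words": len(text.split()),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "has_url": bool(re.search(r"https?://\S+", text)),
    }
```

Such shallow text features are cheap to compute per post, which matters for the real-time verification setting the task targets.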

4. EVALUATION

Overall, the main task is interested in the accuracy with which an automatic method can distinguish between use of multimedia in posts in ways that faithfully reflect reality versus ways that spread false impressions. Hence, given a set of labelled instances (post + image + label) and a set of predicted labels (included in the submitted runs) for these instances, the classic IR measures (i.e., Precision P, Recall R and F-score) are used to quantify the classification performance, where the target class is the class of fake tweets. Since the two classes (fake/real) are represented in a relatively balanced way in the testset, the classic IR measures are good proxies of the classifier accuracy. Note that task participants are allowed to classify a tweet as unknown. Obviously, in case a system produces many unknown outputs, it is likely that its precision will benefit, assuming that the selection of unknown is done wisely, i.e., successfully avoiding erroneous classifications. However, the recall of such a system will suffer in case the tweets that are labelled as unknown turn out to be fake (the target class). Similarly, in the sub-task case, given the instances of (image + label), we use the same IR measures to quantify the performance of the approach, where the target class is tampered.
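The interplay between the unknown label and the two measures can be sketched as follows, assuming (as the text above implies, though the exact scoring script is not given here) that an unknown prediction simply does not count as a fake prediction:

```python
def fake_class_scores(gold, pred):
    """Precision/Recall/F-score with 'fake' as the target class.
    Posts predicted 'unknown' are never counted as fake predictions,
    so they cannot hurt precision; but a truly fake post labelled
    'unknown' is a miss, which lowers recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "fake" and p == "fake")
    pred_fake = sum(1 for p in pred if p == "fake")
    gold_fake = sum(1 for g in gold if g == "fake")
    precision = tp / pred_fake if pred_fake else 0.0
    recall = tp / gold_fake if gold_fake else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

For example, a system that answers unknown on one of three fake posts keeps perfect precision while recall drops to 2/3, matching the trade-off described above.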

5. ACKNOWLEDGEMENTS

This work is supported by the REVEAL and InVID projects, partially funded by the European Commission (FP7-610928 and H2020-687786, respectively).

6. REFERENCES

[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.

[2] C. Boididou, S. Papadopoulos, Y. Kompatsiaris, S. Schifferes, and N. Newman. Challenges of computational verification in social multimedia. In Proceedings of the 23rd International Conference on World Wide Web, pages 743–748. ACM, 2014.

[3] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 675–684. ACM, 2011.

[4] V. Conotter, D.-T. Dang-Nguyen, M. Riegler, G. Boato, and M. Larson. A crowdsourced data set of edited images online. In Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, CrowdMM '14, pages 49–52, New York, NY, USA, 2014. ACM.

[5] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. TweetCred: Real-time credibility assessment of content on Twitter. In International Conference on Social Informatics, pages 228–243. Springer, 2014.

[6] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi. Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 729–736, 2013.

[7] S. E. Middleton, L. Middleton, and S. Modafferi. Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems, 29(2):9–17, 2014.

[8] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer. Truthy: Mapping the spread of astroturf in microblog streams. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 249–252. ACM, 2011.

[9] C. Shao, G. L. Ciampaglia, A. Flammini, and F. Menczer. Hoaxy: A platform for tracking online misinformation. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 745–750. International World Wide Web Conferences Steering Committee, 2016.

[10] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and Product Quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 16(6):1713–1728, 2014.
