
Crowdsourcing the Annotation of Rumourous Conversations in Social Media

Arkaitz Zubiaga¹, Maria Liakata¹, Rob Procter¹, Kalina Bontcheva², Peter Tolmie¹

¹ University of Warwick, Coventry, UK
² University of Sheffield, Sheffield, UK

{a.zubiaga,m.liakata,rob.procter}@warwick.ac.uk
k.bontcheva@sheffield.ac.uk, [email protected]

ABSTRACT
Social media are frequently rife with rumours, and the study of rumour conversational aspects can provide valuable knowledge about how rumours evolve over time and are discussed by others who support or deny them. In this work, we present a new annotation scheme for capturing rumour-bearing conversational threads, as well as the crowdsourcing methodology used to create high quality, human annotated datasets of rumourous conversations from social media. The rumour annotation scheme is validated through comparison between crowdsourced and reference annotations. We also found that only a third of the tweets in rumourous conversations contribute towards determining the veracity of rumours, which reinforces the need for developing methods to extract the relevant pieces of information automatically.

Categories and Subject Descriptors
I.2 [Applications and Expert Systems]: Natural language interfaces; H.5.2 [Information Interfaces and Presentation]: User Interfaces—Evaluation/methodology, Natural language

1. INTRODUCTION
Social media has become a ubiquitous platform that every day more and more people use to communicate with one another, and to stay abreast of current affairs and breaking news as they unfold [2]. Popular social media, such as Twitter or Facebook, become even more important in emergency situations, such as shootings or social upheavals, where information is often posted and shared first [7], before even news media [9]. However, the streams of posts associated with these crisis events are often riddled with rumours that are neither verified nor corroborated [20] and hence need to be handled carefully.

The ease with which uncorroborated information can be propagated in social media, as well as the potential resulting damage to society from the dissemination of inaccurate information [10], highlights the need to study and understand how rumours spread. However, little is known today about the way rumours propagate in social media, or about how to mitigate their undesirable effects. Controversial information introduced by unconfirmed rumours usually sparks discussion among the user community. Users contribute their opinions and provide more evidence that either backs or denies the rumour [14]. The conversations produced in these discussions can provide valuable information to help manage rumours posted in social media [12] and assess their veracity. Previous research in this area has investigated the way rumours are spread by looking at whether individual social media posts support or deny a rumour [15, 14, 3, 18]. However, to the best of our knowledge, the present work is the first to look into the conversational aspects of rumours, offering a more thorough analysis of social media posts in the context of rumours.

We describe our method to identify and collect rumourous conversations on Twitter, carried out in collaboration with journalists, which addresses the shortcomings of previous rumour collection methods in the literature. We introduce an annotation scheme that enables the annotation of tweets participating in rumourous conversations, making it possible to track the trajectory of these conversations, e.g., towards a consensus clarifying the veracity of the rumourous story. We assess the validity of the annotation scheme through experiments on a crowdsourcing platform, comparing the resulting annotations to our reference annotations. The novelty of our approach consists in providing a methodology for obtaining rumour datasets semi-automatically from social media and providing a framework for annotating and analysing rumourous conversations.

2. RELATED WORK
The emergence and spread of rumours have been studied for decades in different fields, primarily from a psychological perspective [1, 16, 5]. However, the advent of the Internet and social media offers opportunities to transform the way we communicate, giving rise to new forms of communicating rumours to a broad community of users [10]. Previous work on the study and classification of rumourous tweets has addressed the task of establishing veracity by looking only at whether each individual tweet supports or denies a rumour [15, 3, 14, 18]. While this is an important part of such analysis, it fails to capture the interaction between tweets that leads to the eventual verification or discrediting of a rumour. Additionally, some of this work led to contradictory conclusions regarding the extent to which the community of social media users manage to debunk inaccurate statements and buttress truthful facts.

While [3] found a similar number of tweets supporting and denying false rumours during the 2010 Chilean earthquake, [18] reported that the majority of tweets supported false rumours during the 2013 Boston Marathon bombings. Factors such as people's inherent trust in information they see repeatedly from multiple sources, irrespective of its actual validity, seem to play an important role [19]. We have collected a diverse set of rumours with more detailed annotations to help shed light on the conversational aspects of tweets. We have also defined a more detailed annotation scheme which enables the annotation of additional features that play an important role in determining the veracity of rumours, capturing the interaction between tweets around a rumourous story.

We adopt the definition of a rumour given in [20], as "a circulating story of questionable veracity, which is apparently credible but hard to verify, and produces sufficient skepticism and/or anxiety". We collect rumours in a process guided by journalists, and develop an annotation scheme for social media rumours, which we test and validate through crowdsourcing. While similar efforts have been made for the annotation of social media messages posted during crises [8], to the best of our knowledge this is the first work to delve into the data collection and sampling method, as well as to set forth and validate an annotation scheme for social media messages posted in the context of rumours.

3. DATA HARVESTING AND CURATION
The process we followed to collect rumourous datasets from Twitter is semi-automatic, from accessing the Twitter API through to collecting manual annotations.

3.1 Harvesting Source Tweets
We collect tweets from Twitter's streaming API. Contrary to existing approaches [15], we do not necessarily know in advance what specific rumours we are collecting data for. Instead, we track events associated with breaking news that are likely to be rife with rumours. By following this approach, we aim to emulate the scenario faced by journalists who are tracking breaking news in social media: journalists follow an ongoing event, and new rumours emerge constantly as the event unfolds. Hence, for the collection of events, we ask journalists for keywords and hashtags associated with relevant events and collect associated tweets as they are being posted. We applied this process to three different events. Two of them were breaking news events expected to generate rumours: (1) the Ferguson unrest in August 2014 (tracking "#ferguson"), where many citizens protested after the killing of a black adolescent by the police in the United States; (2) the Ottawa shootings in October 2014 (tracking "#ottawashooting" and "#ottawashootings", as well as relevant keywords like "ottawa shooting" and "parliament hill"), where a soldier was shot by a gunman in Canada. The third event is a specific rumour identified by the journalists beforehand: on 12th October 2014, a rumour circulated that AC Milan footballer Michael Essien had contracted Ebola (tracking "ebola AND essien"), which was later denied.
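As an illustration of this collection step, the sketch below tracks an event's keywords through Twitter's streaming API. The paper does not name its tooling, so the use of the tweepy 3.x client, the credential placeholders and the output file are our assumptions; the tracked terms are the Ottawa examples from the text.

```python
# Sketch of the keyword-based collection step (Section 3.1). Assumes the
# tweepy 3.x streaming client; credentials and the output file layout are
# illustrative placeholders, not details taken from the paper.
import json
import tweepy

# Keywords/hashtags supplied by journalists for the Ottawa shootings event.
TRACK_TERMS = ["#ottawashooting", "#ottawashootings",
               "ottawa shooting", "parliament hill"]

class EventListener(tweepy.StreamListener):
    """Writes every matching tweet to a JSON-lines file as it arrives."""

    def __init__(self, out_path, api=None):
        super().__init__(api)
        self.out = open(out_path, "a", encoding="utf-8")

    def on_status(self, status):
        self.out.write(json.dumps(status._json) + "\n")
        return True  # keep the stream open

    def on_error(self, status_code):
        # Disconnect on rate limiting (HTTP 420), otherwise keep listening.
        return status_code != 420

if __name__ == "__main__":
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholders
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    stream = tweepy.Stream(auth=auth, listener=EventListener("ottawa.jsonl"))
    stream.filter(track=TRACK_TERMS)
```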

3.2 Obtaining Conversation Threads
Once the tweets for each event were collected, the next step involved identifying rumour-bearing tweets and associated conversations (tweet threads). This involves three steps: (i) as the datasets were originally very large (8.7 million tweets for Ferguson, and 1.1 million tweets for the Ottawa shootings), we automatically sampled the data based on certain criteria to make manual annotation feasible; (ii) we collected the conversations sparked by the sampled set of tweets, where a conversation includes all tweets which replied to the source tweet, directly or indirectly. These conversations/threads provide additional context for manual annotation. Finally, in (iii) we separated rumours from non-rumours using manual annotation.

To sample the large sets of tweets collected for each of the events, we relied on the number of retweets as a human-sourced signal that a tweet has produced interest, in line with our definition of rumours. This filtered set of tweets was manually curated by journalists, distinguishing between rumours and non-rumours for the most eventful days of the three events under study. We refer to the latter sampled set of tweets as source tweets, for which associated conversations were also collected. Given the very different nature of the datasets, we used 100 as the retweet threshold for the Ferguson unrest and the Ottawa shootings, which were very popular, and 2 as the threshold for the story of Michael Essien having contracted Ebola.
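A minimal sketch of this retweet-threshold sampling is given below, assuming tweets are stored as one streaming-API JSON object per line; the field names and storage format are illustrative, while the thresholds follow the text.

```python
# Retweet-threshold sampling of candidate source tweets (Section 3.2).
# Thresholds follow the text (100 for Ferguson/Ottawa, 2 for Essien/Ebola);
# the JSON-lines storage and field names are assumptions.
import json

THRESHOLDS = {"ferguson": 100, "ottawashooting": 100, "ebola_essien": 2}

def candidate_source_tweets(jsonl_path, event):
    """Yield tweets popular enough to be passed to the journalists."""
    threshold = THRESHOLDS[event]
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            if "retweeted_status" in tweet:
                continue  # keep originals only; retweets point back to them
            if tweet.get("retweet_count", 0) >= threshold:
                yield tweet
```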

Conversations associated with each of the source tweets were obtained by retrieving all the tweets that replied to them. Since Twitter's API does not provide an endpoint to retrieve such replying tweets directly, we scraped the web page of each source tweet to retrieve the IDs of the replying tweets, which were then collected from Twitter's API by tweet ID. This allows us to form the complete conversation sparked by a source tweet, which we have used for: (i) assisting the manual annotation of rumours vs non-rumours with additional context, and (ii) studying how conversations evolve around rumours. Figure 1 shows an example of a conversation collected from Twitter and visualised as a forum-like thread.
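The reply-collection step could look roughly as follows: scrape the conversation page of a source tweet for status IDs, then hydrate those IDs through the REST API. The regular expression and the use of requests/tweepy are assumptions on our part; the paper only states that tweet pages were scraped and replies then fetched by ID.

```python
# Sketch of reply collection: extract candidate tweet IDs from the source
# tweet's conversation page, then fetch the full tweet objects by ID.
# The markup pattern is an assumption; page structure changes over time.
import re
import requests
import tweepy

STATUS_ID_RE = re.compile(r"/status(?:es)?/(\d+)")

def reply_ids(source_tweet_url, source_id):
    """Return candidate reply IDs found in the conversation page HTML."""
    html = requests.get(source_tweet_url, timeout=30).text
    ids = set(STATUS_ID_RE.findall(html))
    ids.discard(str(source_id))
    return sorted(ids)

def hydrate(api, ids, batch=100):
    """Fetch tweets by ID via an authenticated tweepy.API instance
    (statuses/lookup accepts up to 100 IDs per call)."""
    tweets = []
    for i in range(0, len(ids), batch):
        tweets.extend(api.statuses_lookup(ids[i:i + batch]))
    return tweets
```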

3.3 Manual Annotation of Source Tweets
Finally, having sampled the source tweets and collected the conversations associated with them, we used manual annotation to distinguish between rumours and non-rumours. This process was performed by a team of journalists at swissinfo.ch, following our definition of rumours. To facilitate the process, we developed an annotation tool that visualises a timeline of tweets. This tool makes it easier to visualise the threads associated with source tweets and to provide source tweet annotations. Each annotation is assigned a title, which allows us to group together different conversations on the same category or story. More details are available in [20]. The annotation of the three events mentioned above led to the identification of 291 rumourous conversations for the Ferguson unrest, 475 for the Ottawa shootings, and 18 for the Ebola rumour.

Figure 1: Example of conversation sparked by a rumourous tweet in the context of Ferguson.

4. ANNOTATION SCHEME FOR SOCIAL MEDIA RUMOURS
In order to manually annotate tweets with respect to how they contribute to rumourous conversations, we have developed an annotation scheme which addresses different aspects of rumours. The datasets produced using human annotation will then be used to train machine learning classifiers to recognise patterns characteristic of rumours observed in new events and associated conversations. The annotation scheme we introduce here has been developed through an iterative process of rounds of annotation and evaluation with PhD students experienced in conversation analysis (we omit details due to lack of space).

The annotation scheme has been designed to annotate Twitter conversations both in terms of the properties of each tweet and its relation to its parent tweet, that is, the tweet it is linked to through a reply. The conversation is thus modelled as a tree structure. Within the conversation tree, we distinguish two types of tweets: (i) the source tweet, which is the first tweet in the tree and has triggered the entire conversation, and (ii) replies, which are tweets that respond to other tweets in the conversation tree. We make a further distinction between first-level replies, tweets that directly reply to the source tweet, and deep replies, which are all subsequent replies in the conversation that do not reply directly to the source tweet but to other replies.
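One possible way to represent such a conversation tree and the source/first-level/deep distinction programmatically is sketched below; the class and field names are illustrative, not part of the paper.

```python
# A minimal conversation-tree model with the tweet-type distinction used in
# Section 4. Field names (tweet_id, in_reply_to) are illustrative.
from dataclasses import dataclass, field
from typing import Optional, List, Dict

@dataclass
class TweetNode:
    tweet_id: str
    text: str
    in_reply_to: Optional[str] = None          # parent tweet id; None for the source
    children: List["TweetNode"] = field(default_factory=list)

def build_tree(tweets: List[TweetNode]) -> Optional[TweetNode]:
    """Link replies to their parents and return the source tweet (the root)."""
    by_id: Dict[str, TweetNode] = {t.tweet_id: t for t in tweets}
    root = None
    for t in tweets:
        if t.in_reply_to is None:
            root = t
        elif t.in_reply_to in by_id:
            by_id[t.in_reply_to].children.append(t)
    return root

def tweet_type(node: TweetNode, root: TweetNode) -> str:
    """'source', 'first-level reply' or 'deep reply', as defined above."""
    if node is root:
        return "source"
    return "first-level reply" if node.in_reply_to == root.tweet_id else "deep reply"
```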

The main goal of the annotation scheme is to cover three aspects of a tweet conversation that are key in determining the veracity of a story: (i) whether a post supports or denies a story, as also used in previous work [14], (ii) the certainty with which the author of a post presents their view, and (iii) the evidence that is given along with the post/statement to back up the author's view. In this work, we introduce the latter two to enable a more thorough analysis of posts in the context of rumours, beyond the fact that they support or deny a story. We also extend support to cover responses to all tweets within the conversation, not just the story in the source tweet as in previous work (response type). Our annotation scheme therefore includes four different features, two of which apply to both source tweets and responses (certainty and evidentiality), plus two variations of support that depend on the tweet type (i.e., support for the story in source tweets, and response type for the relation between replies, or between replies and the source tweet). Figure 2 shows a diagram depicting the annotation scheme designed for rumourous conversations in social media. In what follows we present in detail each of the features forming the annotation scheme.

Support: Support is only annotated for source tweets. It defines whether the message in the source tweet is conveyed as a statement that supports or denies the rumour. It is hence different from the rumour's truth value, and is intended to reflect what the tweet suggests is the author's view towards the rumour's veracity. The support given by the author of a tweet can be deemed as: (1) supporting the rumour, (2) denying it, or (3) underspecified, when the author's view is unclear. This feature is related to the "Polarity" feature in the factuality scheme by Saurí et al. [17].

Response Type: Response type is used to designate support for replying tweets. Given a source tweet that introduces a rumourous story, other users can reply to the author, leaning for instance in favour of or against the statement. Some replies can be very helpful in determining the veracity of the rumour, and thus we annotate the type of reply with one of the following four values: (1) agreed, when the author of the reply supports the statement they are replying to, (2) disagreed, when they deny it, (3) appeal for more information, when they ask for additional evidence to back up the original statement, or (4) comment, when the author of the reply makes their own comment without adding anything to the veracity of the story. Note that the response type is annotated twice for deep replies, i.e., tweets that are not directly replying to the source tweet. In these cases, the response type is annotated for a tweet with respect to two different aspects: (i) how the tweet replies to the rumour in the source tweet, and (ii) how the tweet replies to its parent tweet, the one it is directly replying to. This double annotation allows us to better analyse the way conversations flow, and how opinions evolve with respect to veracity. The inclusion of this feature in the annotation scheme was inspired by Procter et al. [14], who originally introduced these four types of responses for rumours.

Certainty: Certainty measures the degree of confidence expressed by the author when posting a statement in the context of a rumour, and applies to both source tweets and replies. The author can express different degrees of certainty when posting a tweet, from being 100% certain to considering it a dubious or unlikely occurrence. Note that the value annotated for either support or response type has no effect on the annotation of certainty, and thus it is coded regardless of whether the statement supports or denies the rumour. The values for certainty include: (1) certain, when the author is fully confident or is not showing any kind of doubt, (2) somewhat certain, when they are not fully confident, and (3) uncertain, when the author is clearly unsure. This feature and its possible values were inspired by Saurí et al. [17], who referred to it as "modality" when annotating the factuality of news headlines.

Evidentiality: Evidentiality determines the type of evidence (if any) provided by an author, and applies to both source tweets and replying tweets. It is important to note that the evidence has to be directly related to the rumour being discussed in the conversation; any other kind of evidence that is irrelevant in that context should not be annotated here. Evidentiality can have the following values: (1) first-hand experience, when the author claims to have witnessed events associated with the rumour, (2) attachment of a URL pointing to evidence, (3) quotation of a person or organisation, when an accessible source is being quoted as a source of evidence, (4) attachment of a picture, (5) quotation of an unverifiable source, when the source being mentioned is not accessible, such as "my friend said that...", (6) employment of reasoning, when the author explains the reasoning behind their view, and (7) no evidence, when none of the other types of evidence is given in the tweet. Contrary to the rest of the features, more than one value can be picked for evidentiality, except when "no evidence" is selected. Hence, we cater for the fact that a tweet can provide more than one type of evidence, e.g., quoting a news organisation while also attaching a picture that provides evidence.

Figure 2: Annotation scheme for rumourous social media conversations. Source tweets are annotated for support, certainty and evidentiality; replying tweets are annotated for response type (twice for deep replies), certainty and evidentiality.
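For reference, the label sets of Figure 2 can be written out as simple enumerations, e.g. for validating annotation files; the encoding below is ours and not prescribed by the paper.

```python
# The annotation scheme of Figure 2 encoded as Python enums. The class and
# member names are our own; the label strings follow the paper.
from enum import Enum

class Support(Enum):                 # source tweets only
    SUPPORTING = "supporting"
    DENYING = "denying"
    UNDERSPECIFIED = "underspecified"

class ResponseType(Enum):            # replies; annotated twice for deep replies
    AGREED = "agreed"
    DISAGREED = "disagreed"
    APPEAL_FOR_MORE_INFORMATION = "appeal for more information"
    COMMENT = "comment"

class Certainty(Enum):               # source tweets and replies
    CERTAIN = "certain"
    SOMEWHAT_CERTAIN = "somewhat certain"
    UNCERTAIN = "uncertain"

class Evidentiality(Enum):           # multi-valued, unless NO_EVIDENCE is chosen
    FIRST_HAND_EXPERIENCE = "first-hand experience"
    URL = "URL pointing to evidence"
    QUOTATION_OF_PERSON_OR_ORGANISATION = "quotation of person or organisation"
    PICTURE = "attachment of picture"
    UNVERIFIABLE_SOURCE = "quotation of unverifiable source"
    REASONING = "employment of reasoning"
    NO_EVIDENCE = "no evidence"
```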

5. CROWDSOURCING ANNOTATION OF RUMOUROUS CONVERSATIONS

In order to annotate the harvested Twitter conversations with the annotation scheme described above, we used a crowdsourcing platform, so as to maximise speed [13]. Crowdsourcing has been used extensively for the annotation of Twitter corpora [6, 11]. We used CrowdFlower (http://www.crowdflower.com/) as the crowdsourcing platform, as it provides a flexible interface and has fewer restrictions than Amazon Mechanical Turk. Non-commercial crowdsourcing platforms were not a viable alternative at this stage. To validate and assess the viability of crowdsourcing annotations using our scheme, we sampled 8 different source tweets and their associated conversations from the 784 rumours identified for the three events. This includes 4 source tweets for the Ferguson unrest, and 2 source tweets each for the Ottawa shootings and the rumourous story of Essien having contracted Ebola (Table 1 shows the number of source tweets and replies included in each case).

Event              Src. tweets   1st-level replies   Deep replies
Ferguson unrest    4             63                  58
Ottawa shootings   2             20                  35
Ebola              2             22                  10
TOTAL              8             105                 103

Table 1: Tweets sampled for annotation.

In order to have a set of reference annotations to compare the crowdsourced annotations against, the whole annotation task was also performed by one of the authors of this paper; we use this as the reference annotation (REF). A second annotator, one of the journalists who contributed to the manual annotation of source tweets, annotated one third of the data (REF2). The inter-annotator agreement between REF and REF2, measured as the overlap, was 78.57%. This serves as a reference to assess the performance of the crowdsourced annotations in subsequent steps.

5.1 Disaggregating the Annotation Task into Microtasks
The annotation of an entire rumourous conversation can become time-consuming and cumbersome, as it involves the annotation of all four features for all tweets in a conversation. As a first step, we split the conversation into triples, where each triple consists of a tweet which replies to the source tweet either directly or indirectly, its parent tweet (the tweet it replies to directly) and the source tweet. If the tweet replies directly to the source tweet and to no other previous tweet in the conversation, then this is a tuple rather than a triple. Where the objective is to annotate the source tweet, it appears on its own. Along with these tweets, we also show annotators the title assigned to the conversation during the rumour identification phase (see Section 3.3), which facilitates crowdsourced annotation of conversations by keeping in focus what the rumour is about.

To further facilitate the task of the annotators [4], we narrowed down the annotation unit to a single feature for each tweet triple, i.e., an annotator who accepts a microtask can focus on a single feature (e.g., Response Type) without having to switch to other features. This can significantly speed up the process of annotating the same feature across triples or even across different conversation threads. An alternative way of narrowing down the annotation unit would be to ask each worker to annotate all the features for a single tweet. However, this would involve having to focus on different features and understanding the annotation guidelines for all of them at the same time, and requires more effort and concentration. Instead, our approach lets workers focus on a single feature, which makes the task guidelines easier to read and understand. The disaggregation produced a total of 10 different microtasks that we set up in the crowdsourcing platform: 3 tasks for source tweets (annotation of support, certainty and evidentiality), 3 tasks for first-level replies (annotation of response type wrt the source tweet, certainty and evidentiality), and 4 for deep replies (annotation of response type wrt the source tweet, response type wrt the parent tweet, certainty and evidentiality). Each of these represents a separate job on the crowdsourcing platform.
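A sketch of this disaggregation is given below: each conversation yields tuples/triples paired with a single feature, producing the 3 + 3 + 4 microtask types described above. The data layout is illustrative.

```python
# Generate (tweets_shown, feature) annotation units from one conversation,
# following the disaggregation in Section 5.1. Input format is illustrative.
FEATURES = {
    "source": ["support", "certainty", "evidentiality"],
    "first-level": ["response type wrt source", "certainty", "evidentiality"],
    "deep": ["response type wrt source", "response type wrt parent",
             "certainty", "evidentiality"],
}

def annotation_units(source_id, replies):
    """replies: list of (tweet_id, parent_id) pairs, parent_id being the tweet
    each reply responds to. Yields (tweets_shown, feature) microtask units."""
    for feature in FEATURES["source"]:
        yield (source_id,), feature                     # source tweet on its own
    for tweet_id, parent_id in replies:
        if parent_id == source_id:                      # tuple: reply + source
            shown, kind = (tweet_id, source_id), "first-level"
        else:                                           # triple: reply + parent + source
            shown, kind = (tweet_id, parent_id, source_id), "deep"
        for feature in FEATURES[kind]:
            yield shown, feature

# Example: one first-level reply and one deep reply give 3 + 3 + 4 = 10 units.
units = list(annotation_units("s1", [("r1", "s1"), ("r2", "r1")]))
```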

5.2 Crowdsourcing Task Parameters
Our annotation units consist of either a triple/tuple of tweets or a single source tweet, annotated for a particular feature. For each annotation unit we collected annotations from at least 5 different workers. Each CrowdFlower job consists of 10 annotation units as described above; this is thus the minimum an annotator commits to when accepting a job. We paid $0.15 for the annotation of each set of 10 units. In order to make sure that the annotators had a good command of English, we restricted participation to workers from the United States and the United Kingdom.

We performed an initial test on CrowdFlower to evaluate these parameters, which allowed further optimisation for the final crowdsourcing task. All 8 threads were annotated in this initial test, which led us to optimise the following two aspects. Firstly, we identified that always having 5 annotators (as in our initial configuration) was not optimal, as more annotators were often needed to reach agreement (defined below) in difficult cases. Thus, we enabled the variable judgments mode, which allows us to have at least 5 annotators per unit, and occasionally more, up to a maximum number of annotators until a confidence value of 0.65 is reached. In most cases it was sufficient to set the maximum number of annotators to 7, apart from evidentiality, where it was set to 10. Evidentiality is more challenging as one can assign 7 different values and more than one option can be picked, thus increasing the chance of a diverse set of annotations. Secondly, we noticed that some annotators were completing the task too fast, annotating a set of 10 units in a few seconds. To avoid this, we changed the settings to force annotators to spend at least 60 seconds annotating sets of 10 source tweets, and at least 90 seconds annotating sets of 10 units of replying tweets. With the revised settings, the cost for the annotation of all 8 threads amounted to $102.78.
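For clarity, the revised settings reported in this section can be summarised as a plain configuration record; the keys below are descriptive names of ours, not CrowdFlower API parameters.

```python
# Summary of the revised crowdsourcing settings reported in Section 5.2,
# written as a plain dictionary for documentation purposes only.
JOB_SETTINGS = {
    "units_per_page": 10,
    "payment_per_page_usd": 0.15,
    "allowed_countries": ["US", "GB"],
    "judgments_per_unit": {
        "minimum": 5,
        "dynamic_collection": True,        # CrowdFlower's variable judgments mode
        "target_confidence": 0.65,
        "maximum": {"default": 7, "evidentiality": 10},
    },
    "minimum_seconds_per_page": {"source_tweets": 60, "replying_tweets": 90},
}
```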

5.3 Crowdsourcing Task Results
The crowdsourcing task involved the annotation of 216 different tweets, which are part of the 8 tweet threads sampled for the three events. This amounts to the annotation of 4,974 units (tweet triple + feature combinations), and was performed by 98 different contributors. The final set of annotations was obtained by combining the annotations of all workers through majority voting for each annotation unit. In order to report inter-annotator agreement, we rely on the percentage of overlap between annotators, i.e., the ratio of annotations that they agreed upon. When we compare the decisions of each of the annotators against the majority vote, we observe an overall inter-annotator agreement of 60.2%. When we compare the majority vote against our reference annotations, REF, they achieve an overall agreement of 68.84%. While this agreement is somewhat lower than the 78.57% agreement between REF and REF2, it is only worse when annotating "certainty", as we will show later. This also represents a significant increase over earlier crowdsourcing tests performed before revising the settings, where the annotators achieved a lower agreement rate of 62.5%. When breaking down the agreement rate for each of the features (see Table 2), we see that the agreement values range from 58.17% for certainty in reply tweets to 100% for support in source tweets. The agreement rates are significantly higher for source tweets, given that the annotation is easier as there is only the need to look at one tweet, instead of tuples/triples. This analysis also allows us to compare the agreement by feature between the crowdsourced annotations (CS) and REF, as well as between REF and REF2. We observe that agreements are comparable in most cases, except for the agreement on certainty, which is significantly higher between REF and REF2. The latter represents the major concern here, where the crowdsourcing annotators performed worse, which we explain later.

Source tweets
              Support   Certainty   Evident.
CS vs REF     100%      87.5%       87.5%
REF vs REF2   100%      62.5%       87.5%

Replying tweets
              Resp. type   Certainty   Evident.
CS vs REF     70.42%       58.17%      74.52%
REF vs REF2   71.82%       87.14%      78.89%

Table 2: Inter-annotator agreement by feature.
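A minimal sketch of the aggregation and agreement computation described above (majority voting per unit, agreement as percentage overlap) follows; tie-breaking is arbitrary here, as the paper does not specify its rule.

```python
# Majority-vote aggregation of worker labels and percent-overlap agreement,
# as used to report the figures in Section 5.3. Ties are broken arbitrarily
# by Counter.most_common in this illustration.
from collections import Counter

def majority_vote(labels_per_unit):
    """labels_per_unit: {unit_id: [label, label, ...]} -> {unit_id: label}"""
    return {unit: Counter(labels).most_common(1)[0][0]
            for unit, labels in labels_per_unit.items()}

def overlap_agreement(ann_a, ann_b):
    """Percentage of shared units on which two annotation dicts agree."""
    shared = set(ann_a) & set(ann_b)
    if not shared:
        return 0.0
    agreed = sum(ann_a[u] == ann_b[u] for u in shared)
    return 100.0 * agreed / len(shared)

# e.g. overlap_agreement(majority_vote(crowd_labels), reference_labels)
```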

In more detail, Table 3 shows the distribution of annotated categories, as well as the agreement rates for each feature when compared to the reference annotations, REF. Looking at the agreement rates, annotators agreed substantially with the reference annotations for source tweets (100% agreement). For replying tweets, as discussed above, the depth of the conversation and the additional context lead to lower agreement rates, especially for some of the categories. The agreement rates are above 60% for the most frequent values, including response types that are "comments" (67.69%), authors that are "certain" (60.22%), and tweets with "no evidence" (85.37%). The agreement is lower for the other annotations, which appear less frequently. This confirms that the annotation of replies is harder than the annotation of source tweets, as the conversation gets deeper and occasionally deviates from the topic discussed in the source tweet. One of the cases with a low agreement rate is when the evidence provided is "reasoning". This shows the need to emphasise even more, in subsequent crowdsourcing tasks, the way this type of evidence should be annotated, by stressing that the reasoning given in a tweet must be related to the rumourous story and not be another type of reasoning.

Source tweets
  Support:        supporting 100% (agreement 100%); denying 0% (–); underspecified 0% (–)
  Certainty:      certain 75% (100%); somewhat certain 12.5% (100%); uncertain 12.5% (0%)
  Evidentiality:  no evidence 37.5% (100%); author quoted 37.5% (100%); picture attached 25% (50%); URL given, unverifiable source, witnessed, reasoning 0% (–)

Replying tweets
  Response type:  comment 66.56% (67.69%); disagreed 15.43% (53.70%); agreed 10.61% (50%); appeal for more info 7.40% (33.33%)
  Certainty:      certain 54.33% (60.22%); somewhat certain 25.96% (40%); uncertain 19.71% (41.18%)
  Evidentiality:  no evidence 79.81% (85.37%); reasoning 9.62% (29.17%); author quoted 3.37% (62.5%); URL given 3.37% (50%); picture attached 2.89% (33.33%); witnessed 0.48% (0%); unverifiable source 0.48% (0%)

Table 3: Distribution of annotations: percentage of times each category was picked, followed in parentheses by the agreement with respect to our reference annotations (CS vs REF).

When we look at the distribution of values the annotators chose, we observe an imbalance in most cases. For response type, we see that as many as 66.5% of the replies are comments, which means that only the remaining 33.5% provide information that adds something to the veracity of the story. Evidentiality is even more skewed towards tweets that provide no evidence at all, which amount to 85.4% of the cases. Both the abundance of comments and the dearth of evidence emphasise the need for carefully analysing these conversations when building machine learning tools to pick out content that is useful for determining the veracity of rumourous stories. The certainty feature is slightly better distributed, but still skewed, with more than 54% of the cases being certain statements; this could be due to the fact that many users do not express uncertainty in short, written texts even when they are not 100% sure.

To better understand how the different annotated features fit together, we investigated the combinations of values selected for the replying tweets. Interestingly, we observe that among the replying tweets annotated as comments, as many as 80.3% were annotated as having no evidence, and 47.5% as being certain. Given that comments do not add anything to the veracity of the rumour, it is to be expected that there would be no evidence. We also investigated several cases to understand how certainty was being annotated for comments; we observed that different degrees of certainty were being assigned to comments for which certainty can hardly be determined, as it does not seem to apply, e.g., the tweet "My heart goes out to his family". This also helped us understand the low agreement rate between CS and REF for certainty, which may drop because of comments with an unclear value of certainty. For these two reasons, together with the fact that comments represent tweets that do not add anything to the veracity of the story, we are considering revising the annotation scheme so that these two features are not annotated for comments. This, in turn, would significantly reduce the cost of running the crowdsourcing tasks, given that for the 66.5% of replying tweets that represent comments we would avoid two annotation tasks.
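The co-occurrence analysis described above can be reproduced with a simple tabulation over the per-reply labels; the input format below is an assumption.

```python
# For replies labelled as comments, tabulate which evidentiality and
# certainty values they received (cf. the 80.3% / 47.5% figures above).
# Each element of reply_annotations is assumed to be a dict of final labels.
from collections import Counter

def comment_profiles(reply_annotations):
    """reply_annotations: iterable of dicts with 'response type', 'certainty'
    and 'evidentiality' keys; returns value counts restricted to comments."""
    evid, cert = Counter(), Counter()
    for ann in reply_annotations:
        if ann["response type"] == "comment":
            evid[ann["evidentiality"]] += 1
            cert[ann["certainty"]] += 1
    return evid, cert
```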

6. DISCUSSION
We have described a novel method to collect and annotate rumourous conversations from Twitter, and introduced an annotation scheme specifically designed for the annotation of tweets taking part in these rumourous conversations. This scheme has been revised iteratively and used for the crowdsourced categorisation of rumour-bearing messages. As far as we know, this is the first conversation-based annotation scheme specifically designed for social media rumours, and it complements related annotation work in the literature. Earlier work only considered whether a message supported or denied a certain rumour. Our annotation scheme also captures the certainty with which those messages are posted, the evidence that accompanies them, and the flow of support for a rumour within a conversation, all of which are key additional aspects when considering the veracity of a story. The agreement achieved between the crowdsourced annotations and our reference annotation, which is comparable to and occasionally better than the agreement between our own reference annotations, has enabled us to validate both the crowdsourcing process and the annotation scheme. While the annotation scheme has so far been applied to a relatively small data sample, it reveals some interesting patterns, in particular suggesting that conversations around rumours in social media mostly involve comments, which do not add anything to the veracity of the story. This reinforces the motivation of our work: categorising tweets in these conversations so as to identify those that do provide useful knowledge and evidence for determining the veracity of a rumourous story. The annotation tests have also helped identify suitable settings for the crowdsourcing tasks, and have ultimately revealed a way of simplifying the scheme while keeping the main, required annotations and reducing the cost of running the task.

Our next step is to apply the annotation scheme to a larger dataset of social media rumours, collected for a broader set of events and including tweets in languages other than English. The creation of this dataset will then enable us to perform a conversation analysis study, as well as to develop machine learning tools to deal with social media rumours.

Acknowledgments
We would like to thank the team of journalists at swissinfo.ch for the annotation work as well as for the discussion. This work has been supported by the PHEME FP7 project (grant No. 611233).


7. REFERENCES
[1] G. W. Allport and L. Postman. The Psychology of Rumor. 1947.
[2] M. Anderson and A. Caumont. How social media is reshaping news. http://www.pewresearch.org/fact-tank/2014/09/24/how-social-media-is-reshaping-news/, 2014.
[3] C. Castillo, M. Mendoza, and B. Poblete. Predicting information credibility in time-sensitive social media. Internet Research, 23(5):560–588, 2013.
[4] J. Cheng, J. Teevan, S. T. Iqbal, and M. S. Bernstein. Break it down: A comparison of macro- and microtasks. In Proceedings of CHI, 2015.
[5] L. Festinger, D. Cartwright, K. Barber, J. Fleischl, J. Gottsdanker, A. Keysen, and G. Leavitt. A study of a rumor: its origin and spread. Human Relations, 1948.
[6] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 80–88. Association for Computational Linguistics, 2010.
[7] M. Imran, C. Castillo, F. Diaz, and S. Vieweg. Processing social media messages in mass emergency: A survey. arXiv preprint arXiv:1407.7071, 2014.
[8] M. Imran, C. Castillo, J. Lucas, M. Patrick, and J. Rogstadius. Coordinating human and machine intelligence to classify microblog communications in crises. In Proceedings of the International Conference on Information Systems for Crisis Response and Management, 2014.
[9] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.
[10] S. Lewandowsky, U. Ecker, C. Seifert, N. Schwarz, and J. Cook. Misinformation and its correction: Continued influence and successful debiasing. Psychological Science in the Public Interest, 13(3):106–131, 2012.
[11] S. A. Paul, L. Hong, and E. Chi. What is a question? Crowdsourcing tweet categorization. CHI 2011, 2011.
[12] R. Procter, J. Crump, S. Karstedt, A. Voss, and M. Cantijoch. Reading the riots: What were the police doing on Twitter? Policing and Society, 23(4):413–436, 2013.
[13] R. Procter, W. Housley, M. Williams, A. Edwards, P. Burnap, J. Morgan, O. Rana, E. Klein, M. Taylor, A. Voss, et al. Enabling social media research through citizen social science. In ECSCW 2013 Adjunct Proceedings, 2013.
[14] R. Procter, F. Vis, and A. Voss. Reading the riots on Twitter: methodological innovation for the analysis of big data. International Journal of Social Research Methodology, 16(3):197–214, 2013.
[15] V. Qazvinian, E. Rosengren, D. R. Radev, and Q. Mei. Rumor has it: Identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1589–1599. Association for Computational Linguistics, 2011.
[16] R. L. Rosnow and G. A. Fine. Rumor and Gossip: The Social Psychology of Hearsay. Elsevier, 1976.
[17] R. Saurí and J. Pustejovsky. FactBank: A corpus annotated with event factuality. Language Resources and Evaluation, 43(3):227–268, 2009.
[18] K. Starbird, J. Maddock, M. Orand, P. Achterman, and R. M. Mason. Rumors, false flags, and digital vigilantes: Misinformation on Twitter after the 2013 Boston Marathon bombing. In Proceedings of iConference. iSchools, 2014.
[19] A. Zubiaga and H. Ji. Tweet, but verify: epistemic study of information verification on Twitter. Social Network Analysis and Mining, 4(1):1–12, 2014.
[20] A. Zubiaga, M. Liakata, R. Procter, K. Bontcheva, and P. Tolmie. Towards detecting rumours in social media. In AAAI Workshop on AI for Cities, 2015.
