
Online Human-Bot Interactions: Detection, Estimation, and Characterization

Onur Varol,1,* Emilio Ferrara,2 Clayton A. Davis,1 Filippo Menczer,1 Alessandro Flammini1
1 Center for Complex Networks and Systems Research, Indiana University, Bloomington, US
2 Information Sciences Institute, University of Southern California, Marina del Rey, CA, US

Abstract

Increasing evidence suggests that a growing amount of social media content is generated by autonomous entities known as social bots. In this work we present a framework to detect such entities on Twitter. We leverage more than a thousand features extracted from public data and meta-data about users: friends, tweet content and sentiment, network patterns, and activity time series. We benchmark the classification framework by using a publicly available dataset of Twitter bots. This training data is enriched by a manually annotated collection of active Twitter users that includes both humans and bots of varying sophistication. Our models yield high accuracy and agreement with each other and can detect bots of different nature. Our estimates suggest that between 9% and 15% of active Twitter accounts are bots. Characterizing ties among accounts, we observe that simple bots tend to interact with bots that exhibit more human-like behaviors. Analysis of content flows reveals retweet and mention strategies adopted by bots to interact with different target groups. Using clustering analysis, we characterize several subclasses of accounts, including spammers, self promoters, and accounts that post content from connected applications.

Introduction

Social media are powerful tools connecting millions of people across the globe. These connections form the substrate that supports information dissemination, which ultimately affects the ideas, news, and opinions to which we are exposed. There exist entities with both strong motivation and technical means to abuse online social networks — from individuals aiming to artificially boost their popularity, to organizations with an agenda to influence public opinion. It is not difficult to automatically target particular user groups and promote specific content or views (Ferrara et al. 2016a; Bessi and Ferrara 2016). Reliance on social media may therefore make us vulnerable to manipulation.

Social bots are accounts controlled by software, algorithmically generating content and establishing interactions. Many social bots perform useful functions, such as dissemination of news and publications (Lokot and Diakopoulos 2016; Haustein et al. 2016) and coordination of volunteer activities (Savage, Monroy-Hernandez, and Höllerer 2016). However, there is a growing record of malicious applications of social bots. Some emulate human behavior to manufacture fake grassroots political support (Ratkiewicz et al. 2011), promote terrorist propaganda and recruitment (Berger and Morgan 2015; Abokhodair, Yoo, and McDonald 2015; Ferrara et al. 2016c), manipulate the stock market (Ferrara et al. 2016a), and disseminate rumors and conspiracy theories (Bessi et al. 2015).

A growing body of research is addressing social bot activity, its implications for the social network, and the detection of these accounts (Lee, Eoff, and Caverlee 2011; Boshmaf et al. 2011; Beutel et al. 2013; Yang et al. 2014; Ferrara et al. 2016a; Chavoshi, Hamooni, and Mueen 2016). The magnitude of the problem was underscored by a Twitter bot detection challenge recently organized by DARPA to study information dissemination mediated by automated accounts and to detect malicious activities carried out by these bots (Subrahmanian et al. 2016).

Contributions and Outline

Here we demonstrate that accounts controlled by software exhibit behaviors that reflect their intents and modus operandi (Bakshy et al. 2011; Das et al. 2016), and that such behaviors can be detected by supervised machine learning techniques. This paper makes the following contributions:

• We propose a framework to extract a large collection of features from data and meta-data about social media users, including friends, tweet content and sentiment, network patterns, and activity time series. We use these features to train highly-accurate models to identify bots. For a generic user, we produce a [0, 1] score representing the likelihood that the user is a bot.

• The performance of our detection system is evaluated against both an existing public dataset and an additional sample of manually-annotated Twitter accounts collected with a different strategy. We enrich the previously-trained models using the new annotations, and investigate the effects of different datasets and classification models.

• We classify a sample of millions of English-speaking active users. We use different models to infer thresholds in the bot score that best discriminate between humans and bots. We estimate that the percentage of Twitter accounts exhibiting social bot behaviors is between 9% and 15%.

• We characterize friendship ties and information flow between users that show behaviors of different nature: human and bot-like. Humans tend to interact with more human-like accounts than bot-like ones, on average. Reciprocity of friendship ties is higher for humans. Some bots target users more or less randomly; others can choose targets based on their intentions.

• Clustering analysis reveals certain specific behavioral groups of accounts. Manual investigation of samples extracted from each cluster points to three distinct bot groups: spammers, self promoters, and accounts that post content from connected applications.

Bot Detection Framework

In the next section, we introduce a Twitter bot detection framework (truthy.indiana.edu/botornot) that is freely available online. This system leverages more than one thousand features to evaluate the extent to which a Twitter account exhibits similarity to the known characteristics of social bots (Davis et al. 2016).

Feature Extraction

Data collected using the Twitter API are distilled into 1,150 features in six different classes. The classes and types of features are reported in Table 1 and discussed next.

User-based features. Features extracted from user meta-data have been used to classify users and patterns before (Mislove et al. 2011; Ferrara et al. 2016a). We extract user-based features from meta-data available through the Twitter API. Such features include the number of friends and followers, the number of tweets produced by the users, profile description and settings.

Friends features. Twitter actively fosters inter-connectivity. Users are linked by follower-friend (followee) relations. Content travels from person to person via retweets. Also, tweets can be addressed to specific users via mentions. We consider four types of links: retweeting, mentioning, being retweeted, and being mentioned. For each group separately, we extract features about language use, local time, popularity, etc. Note that, due to Twitter's API limits, we do not use follower/followee information beyond these aggregate statistics.

Network features. The network structure carries crucial information for the characterization of different types of communication. In fact, the usage of network features significantly helps in tasks like political astroturf detection (Ratkiewicz et al. 2011). Our system reconstructs three types of networks: retweet, mention, and hashtag co-occurrence networks. Retweet and mention networks have users as nodes, with a directed link between a pair of users that follows the direction of information spreading: toward the user retweeting or being mentioned. Hashtag co-occurrence networks have undirected links between hashtag nodes when two hashtags occur together in a tweet. All networks are weighted according to the frequency of interactions or co-occurrences. For each network, we compute a set of features, including in- and out-strength (weighted degree) distributions, density, and clustering. Note that out-degree and out-strength are measures of popularity.
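As a concrete illustration, here is a minimal sketch (not the authors' implementation) of how one such weighted network and a few of its aggregate features could be built with the networkx library; the tweet dictionaries and field names (`user`, `retweeted_user`) are hypothetical placeholders.

```python
# Hedged sketch of the retweet-network features described above.
# Edges point from the retweeted user to the retweeting user,
# following the direction of information spreading.
import networkx as nx

def build_retweet_network(tweets):
    """tweets: iterable of dicts with hypothetical keys
    'user' and 'retweeted_user' (None if not a retweet)."""
    g = nx.DiGraph()
    for t in tweets:
        if t.get("retweeted_user") is not None:
            u, v = t["retweeted_user"], t["user"]
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1       # weight = interaction frequency
            else:
                g.add_edge(u, v, weight=1)
    return g

def network_features(g):
    """A few of the aggregate statistics listed above."""
    in_strength = [s for _, s in g.in_degree(weight="weight")]
    out_strength = [s for _, s in g.out_degree(weight="weight")]
    return {
        "density": nx.density(g),
        "mean_in_strength": sum(in_strength) / max(len(in_strength), 1),
        "mean_out_strength": sum(out_strength) / max(len(out_strength), 1),
        "clustering": nx.average_clustering(g.to_undirected()),
    }

tweets = [{"user": "a", "retweeted_user": "b"},
          {"user": "c", "retweeted_user": "b"},
          {"user": "a", "retweeted_user": "b"}]
print(network_features(build_retweet_network(tweets)))
```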

Temporal features. Prior research suggests that the temporal signature of content production and consumption may reveal important information about online campaigns and their evolution (Ghosh, Surachawala, and Lerman 2011; Ferrara et al. 2016b; Chavoshi, Hamooni, and Mueen 2016). To extract this signal we measure several temporal features related to user activity, including average rates of tweet production over various time periods and distributions of time intervals between events.

Content and language features. Many recent papers have demonstrated the importance of content and language features in revealing the nature of social media conversations (Danescu-Niculescu-Mizil et al. 2013; McAuley and Leskovec 2013; Mocanu et al. 2013; Botta, Moat, and Preis 2015; Letchford, Moat, and Preis 2015; Das et al. 2016). For example, deceiving messages generally exhibit informal language and short sentences (Briscoe, Appling, and Hayes 2014). Our system does not employ features capturing the quality of tweets, but collects statistics about the length and entropy of tweet text. Additionally, we extract language features by applying the Part-of-Speech (POS) tagging technique, which identifies different types of natural language components, or POS tags. Tweets are therefore analyzed to study how POS tags are distributed.
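As a small illustration of these content statistics (not the authors' code), the word count and word entropy of a single tweet can be computed as follows; the tokenization by whitespace is a simplifying assumption.

```python
# Minimal sketch: two content features from Table 1, the number of
# words in a tweet and the entropy of its word distribution.
import math
from collections import Counter

def word_entropy(text):
    words = text.lower().split()          # naive whitespace tokenization
    n = len(words)
    if n == 0:
        return 0.0
    counts = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tweet = "bots retweet bots and bots mention bots"
print(len(tweet.split()), word_entropy(tweet))
```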

Sentiment features. Sentiment analysis is a powerful tool to describe the emotions conveyed by a piece of text, and more broadly the attitude or mood of an entire conversation. Sentiment extracted from social media conversations has been used to forecast offline events including financial market fluctuations (Bollen, Mao, and Zeng 2011), and is known to affect information spreading (Mitchell et al. 2013; Ferrara and Yang 2015). Our framework leverages several sentiment extraction techniques to generate various sentiment features, including arousal, valence and dominance scores (Warriner, Kuperman, and Brysbaert 2013), happiness score (Kloumann et al. 2012), polarization and strength (Wilson, Wiebe, and Hoffmann 2005), and emoticon score (Agarwal et al. 2011).
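A hedged sketch of the lexicon-based scoring idea behind such features: the tiny lexicon and its values below are invented for illustration (the paper uses the Warriner et al. norms), and the plain average stands in for the weighted average over lexicon words mentioned in Table 1.

```python
# Illustrative lexicon-based valence score for a tweet. The lexicon
# entries and values here are hypothetical, not the actual norms.
VALENCE = {"happy": 8.2, "sad": 2.1, "news": 5.3}

def tweet_valence(text, lexicon=VALENCE):
    """Average lexicon score over the tweet's words that appear in
    the lexicon; returns None when no word matches."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else None

print(tweet_valence("happy news today"))  # (8.2 + 5.3) / 2 = 6.75
```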

Model Evaluation

To train our system we initially used a publicly available dataset consisting of 15K manually verified Twitter bots identified via a honeypot approach (Lee, Eoff, and Caverlee 2011) and 16K verified human accounts. We collected the most recent tweets produced by those accounts using the Twitter Search API. We limited our collection to 200 public tweets from a user timeline and up to 100 of the most recent public tweets mentioning that user. This procedure yielded a dataset of 2.6 million tweets produced by manually verified bots and 3 million tweets produced by human users.

We benchmarked our system using several off-the-shelf algorithms provided in the scikit-learn library (Pedregosa et al. 2011). In a generic evaluation experiment, the classifier under examination is provided with numerical vectors, each describing the features of an account. The classifier returns a numerical score in the unit interval. A higher score indicates a stronger belief that the account is a bot.

Table 1: List of 1,150 features extracted by our framework.

User meta-data: Screen name length; Number of digits in screen name; User name length; Time offset (sec.); Default profile (binary); Default picture (binary); Account age (days); Number of unique profile descriptions; (*) Profile description lengths; (*) Number of friends distribution; (*) Number of followers distribution; (*) Number of favorites distribution; Number of friends (signal-noise ratio and rel. change); Number of followers (signal-noise ratio and rel. change); Number of favorites (signal-noise ratio and rel. change); Number of tweets (per hour and total); Number of retweets (per hour and total); Number of mentions (per hour and total); Number of replies (per hour and total); Number of retweeted (per hour and total).

Friends (†): Number of distinct languages; Entropy of language use; (*) Account age distribution; (*) Time offset distribution; (*) Number of friends distribution; (*) Number of followers distribution; (*) Number of tweets distribution; (*) Description length distribution; Fraction of users with default profile and default picture.

Network (‡): Number of nodes; Number of edges (also for reciprocal); (*) Strength distribution; (*) In-strength distribution; (*) Out-strength distribution; Network density (also for reciprocal); (*) Clustering coeff. (also for reciprocal).

Content: (*,**) Frequency of POS tags in a tweet; (*,**) Proportion of POS tags in a tweet; (*) Number of words in a tweet; (*) Entropy of words in a tweet.

Timing: (*) Time between two consecutive tweets; (*) Time between two consecutive retweets; (*) Time between two consecutive mentions.

Sentiment: (***) Happiness scores of aggregated tweets; (***) Valence scores of aggregated tweets; (***) Arousal scores of aggregated tweets; (***) Dominance scores of aggregated tweets; (*) Happiness score of single tweets; (*) Valence score of single tweets; (*) Arousal score of single tweets; (*) Dominance score of single tweets; (*) Polarization score of single tweets; (*) Entropy of polarization scores of single tweets; (*) Positive emoticons entropy of single tweets; (*) Negative emoticons entropy of single tweets; (*) Emoticons entropy of single tweets; (*) Positive and negative score ratio of single tweets; (*) Number of positive emoticons in single tweets; (*) Number of negative emoticons in single tweets; (*) Total number of emoticons in single tweets; Ratio of tweets that contain emoticons.

† We consider four types of connected users: retweeting, mentioning, retweeted, and mentioned.
‡ We consider three types of network: retweet, mention, and hashtag co-occurrence networks.
* Distribution types. For each distribution, the following eight statistics are computed and used as individual features: min, max, median, mean, std. deviation, skewness, kurtosis, and entropy.
** Part-of-Speech (POS) tags. There are nine POS tags: verbs, nouns, adjectives, modal auxiliaries, pre-determiners, interjections, adverbs, wh-, and pronouns.
*** For each feature, we compute the mean and std. deviation of the weighted average across words in the lexicon.
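To make the (*) distribution-type features concrete, here is a sketch computing the eight statistics for one distribution; scipy and numpy are assumed, and the histogram binning used for the entropy term is our choice, since the paper does not specify it.

```python
# Hedged sketch of the eight per-distribution statistics in Table 1.
import numpy as np
from scipy import stats

def distribution_features(values, bins=10):
    v = np.asarray(values, dtype=float)
    hist, _ = np.histogram(v, bins=bins)   # binning is an assumption
    return {
        "min": v.min(), "max": v.max(),
        "median": np.median(v), "mean": v.mean(), "std": v.std(),
        "skewness": stats.skew(v), "kurtosis": stats.kurtosis(v),
        "entropy": stats.entropy(hist),    # entropy of the histogram
    }

print(distribution_features([1, 2, 2, 3, 5, 8, 13]))
```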

A model's accuracy is evaluated by measuring the Area Under the receiver operating characteristic Curve (AUC) with 5-fold cross-validation, and computing the average AUC score across the folds using Random Forests, AdaBoost, Logistic Regression, and Decision Tree classifiers. The best classification performance of 0.95 AUC was obtained by the Random Forest algorithm. In the rest of the paper we use the Random Forest model trained using 100 estimators and the Gini impurity criterion to measure the quality of splits.
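A minimal reproduction of this evaluation protocol with scikit-learn (which the paper itself uses); the feature matrix X and labels y below are random placeholders.

```python
# Sketch of the 5-fold cross-validated AUC evaluation described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(1000, 1150)        # placeholder feature vectors
y = np.random.randint(0, 2, 1000)     # placeholder bot/human labels

clf = RandomForestClassifier(n_estimators=100, criterion="gini")
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("mean AUC: %.2f" % auc_scores.mean())
```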

Large-Scale Evaluation

We realistically expect that the nature and sophistication of bots evolve over time and change in specific conversational domains. It is therefore important to determine how reliable and consistent the predictions produced by a system trained on one dataset are when tested on different data (in the wild). Also, the continuously-evolving nature of bots dictates the need to constantly update the models based on newly available training data.

To obtain an updated evaluation of the accuracy of our model, we constructed an additional, manually-annotated collection of Twitter user accounts. We hypothesize that this recent collection includes some bots that are more sophisticated than the ones obtained years earlier with the honeypot method. We leveraged these manual annotations to evaluate the model trained using the honeypot dataset and then to update the classifier's training data, producing a merged dataset to train a new model that ensures better generalization to more sophisticated accounts. User IDs and annotation labels in our extended dataset are publicly available (truthy.indiana.edu/botornot/data).

Data Collection

Our data collection focused on users producing content in English, as inferred from profile meta-data. We identified a large, representative sample of users by monitoring a Twitter stream, accounting for approximately 10% of public tweets, for 3 months starting in October 2015. This approach avoids known biases of other methods such as snowball and breadth-first sampling, which rely on the selection of an initial group of users (Gjoka et al. 2010; Morstatter et al. 2013). We focus on English-speaking users as they represent the largest group on Twitter (Mocanu et al. 2013).

To restrict our sample to recently active users, we introduce the further criteria that they must have produced at least 200 tweets in total and 90 tweets during the three-month observation window (one per day on average). Our final sample includes approximately 14 million user accounts that meet both criteria. For each of these accounts, we collected their tweets through the Twitter Search API. We restricted the collection to the most recent 200 tweets and 100 mentions of each user, as described earlier. Owing to Twitter API limits, this restriction greatly improved our data collection speed. This choice also reduces the response time of our service and API. However, the limitation adds noise to the features, due to the scarcity of data available to compute them.

Manual Annotations

We computed classification scores for each of the active accounts using our initial classifier trained on the honeypot dataset. We then grouped accounts by their bot scores, allowing us to evaluate our system across the spectrum of human and bot accounts without being biased by the distribution of bot scores. We randomly sampled 300 accounts from each bot-score decile. The resulting balanced set of 3,000 accounts was manually annotated by inspecting their public Twitter profiles. Some accounts have obvious flags, such as using a stock profile image or retweeting every message of another account within seconds. In general, however, there is no simple set of rules to assess whether an account is human or bot. With the help of four volunteers, we analyzed profile appearance, content produced and retweeted, and interactions with other users in terms of retweets and mentions. Annotators were not given a precise set of instructions to perform the classification task, but rather were shown a consistent number of both positive and negative examples. The final decisions reflect each annotator's opinion and are restricted to: human, bot, or undecided. Accounts labeled as undecided were eliminated from further analysis.

We annotated all 3,000 accounts. We will refer to this set of accounts as the manually annotated dataset. Each annotator was assigned a random sample of accounts from each decile. We enforced a minimum 10% overlap between annotations to assess the reliability of each annotator. This yielded an average pairwise agreement of 75% and moderate inter-annotator agreement (Cohen's κ = 0.41). We also computed the agreement between annotators and classifier outcomes, assuming that a classification score above 0.5 is interpreted as a bot. This resulted in an average pairwise agreement of 79% and a moderately high Cohen's κ = 0.5. These results suggest high confidence in the annotation process, as well as in the agreement between annotations and model predictions.
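For reference, agreement statistics of this kind can be computed with scikit-learn's cohen_kappa_score; the annotator labels and classifier scores below are placeholders, not our data.

```python
# Sketch: Cohen's kappa between two annotators, and between an
# annotator and the classifier (scores above 0.5 treated as "bot").
from sklearn.metrics import cohen_kappa_score

annotator_a = ["human", "bot", "human", "bot", "human"]
annotator_b = ["human", "bot", "bot", "bot", "human"]
print(cohen_kappa_score(annotator_a, annotator_b))

scores = [0.1, 0.9, 0.3, 0.7, 0.2]           # classifier bot scores
predicted = ["bot" if s > 0.5 else "human" for s in scores]
print(cohen_kappa_score(annotator_a, predicted))
```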

Figure 1: ROC curves of models trained and tested on different datasets. Accuracy is measured by AUC.

Evaluating Models Using Annotated Data

To evaluate our classification system trained on the honeypot dataset, we examined the classification accuracy separately for each bot-score decile of the manually annotated dataset. We achieved classification accuracy greater than 90% for the accounts in the (0.0, 0.4) range, which includes mostly human accounts. We also observe accuracy above 70% for scores in the (0.8, 1.0) range (mostly bots). Accuracy for accounts in the grey-area range (0.4, 0.8) fluctuates between 60% and 80%. Intuitively, this range contains the most challenging accounts to label, as reflected also in the low inter-annotator agreement in this region. When the accuracy of each bin is weighted by the population density in the large dataset from which the manually annotated set has been extracted, we obtain 86% overall classification accuracy.

We also compare annotator agreement scores for the accounts in each bot-score decile. We observe that agreement scores are higher for accounts in the (0.0, 0.4) range and lower for accounts in the (0.8, 1.0) range, indicating that it is more difficult for human annotators to identify bot-like as opposed to human-like behavior.

We observe a similar pattern for the amount of time required on average to annotate human and bot accounts. Annotators took on average 33 seconds to label human accounts and 37 seconds for bot accounts.

Fig. 1 shows the results of experiments designed to investigate our ability to detect manually annotated bots. The baseline ROC curve is obtained by testing the honeypot model on the manually annotated dataset. Unsurprisingly, the baseline accuracy (0.85 AUC) is lower than that obtained by cross-validating on the honeypot data (0.95 AUC), because the model is not trained on the newer bots.

Dataset Effect on Model Accuracy

We can update our models by combining the manually-annotated and honeypot datasets. We created multiple balanced datasets and performed 5-fold cross-validation to evaluate the accuracy of the corresponding models:

• Annotation: We trained and tested a model by only using annotated accounts and labels assigned by the majority of annotators. This yields 0.89 AUC, a reasonable accuracy considering that the dataset contains recent and possibly sophisticated bots.

• Merged: We merged the honeypot and annotation datasets for training and testing. The resulting classifier achieves 0.94 AUC, only slightly worse than the honeypot (training and test) model, although the merged dataset contains a variety of more recent bots.

• Mixture: Using mixtures with different ratios of accounts from the manually annotated and honeypot datasets, we obtain an accuracy ranging between 0.90 and 0.94 AUC.

Figure 2: Distribution of classifier scores for human and bot accounts in the two datasets.

In Fig. 2, we plot the distributions of classification scores for human and bot accounts according to each dataset. The mixture model trained on 2K annotated and 10K honeypot accounts is used to compute the scores. Human accounts in both datasets have similar distributions, peaked around 0.1. The difference between bots in the two datasets is more prominent. The distribution of simple, honeypot bots peaks around 0.9. The newer bots from the manually annotated dataset have typically smaller scores, with a distribution peaked around 0.6. They are more sophisticated, and exhibit characteristics more similar to human behavior. This raises the issue of how to properly set a threshold on the score when a strictly binary classification between humans and bots is needed. To infer a suitable threshold, we compute classification accuracies for varying thresholds, considering all accounts scoring below each threshold as human, and then select the threshold that maximizes accuracy.

We compared scores for accounts in the manually annotated dataset by pairs of models (i.e., trained with different mixtures) for labeled humans, bots, and a random subset of accounts (Fig. 3). As expected, both models assign lower scores to humans and higher scores to bots. High correlation coefficients indicate agreement between the models.

Figure 3: Comparison of scores for different models. Each account is represented as a point in the scatter plot, with a color determined by its category. Test points are randomly sampled from our large-scale collection. Pearson correlations between scores are also reported, along with estimated thresholds and corresponding accuracies.

Feature Importance Analysis

To compare the usefulness of different features, we trained models using each class of features alone. We achieved the best performance with user meta-data features; content features are also effective. Both yielded AUC above 0.9. Other feature classes yielded AUC above 0.8.

We analyzed the importance of single features using the Gini impurity score produced by our Random Forests model. To rank the top features for a given dataset, we randomly select a subset of 10,000 accounts and compute the top features across 100 randomized experiments. The top 10 features are sufficient to reach a performance of 0.9 AUC. Sentiment and content of mentioned tweets are important features, along with the statistical properties of retweet networks. Features of the friends with whom a user interacts are strong predictors as well. We observed redundancy among many correlated features, such as distribution-type features (cf. Table 1), especially in the content and sentiment categories. Further analysis of feature importance is the subject of ongoing investigation.
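For illustration, scikit-learn's Random Forest exposes the Gini-based (mean decrease in impurity) importances used for this kind of ranking; a self-contained sketch with placeholder data follows.

```python
# Sketch: ranking features by Gini importance with a fitted forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 1150)        # placeholder feature matrix
y = np.random.randint(0, 2, 1000)     # placeholder labels

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# Mean-decrease-in-impurity importances, largest first
top = np.argsort(clf.feature_importances_)[::-1][:10]
for rank, idx in enumerate(top, 1):
    print(rank, "feature", idx, round(clf.feature_importances_[idx], 4))
```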

False Positive and False Negative Cases

Neither human annotators nor machine-learning models perform flawlessly. Humans are better at generalizing and learning new features from observed data. Machines outperform human annotators at processing large numbers of relations and searching for complex patterns. We analyzed our annotated accounts and their bot scores to highlight when disagreement occurs between annotators and classification models. Using an optimal threshold, we measured false positive and false negative rates at 0.15 and 0.11, respectively, in our extended dataset. In these experiments, human annotation is considered as ground truth.

We identified the cases in which disagreement between classifier score and annotations occurs, and manually examined a sample of these accounts to investigate the errors. Accounts annotated as human can be classified as bots when an account posts tweets created by connected applications from other platforms. Some unusually active users are also classified as bots; those users tend to have more retweets in general. This is somewhat intuitive, as retweeting has lower cost than creating new content. We encountered examples of misclassification for organizational and promotional accounts. Such accounts are often operated by multiple individuals, or combinations of users and automatic tools, generating misleading cues for the classifiers. Finally, the language of the content can also cause errors: our models tend to assign high bot scores to users who tweet in multiple languages. To mitigate this problem, the public version of our system now includes a classifier that ignores language-dependent features.

Estimation of Bot Population

In a 2014 report by Twitter to the US Securities and Exchange Commission, the company put forth an estimate that between 5% and 8.5% of their user base consists of bots.¹ We would like to offer our own assessment of the proportion of bot accounts as measured with our approach. Since our framework provides a continuous bot score as opposed to a discrete bot/human judgement, we must first determine an appropriate bot-score threshold separating human and bot accounts in order to estimate the proportion of bot accounts.

To infer a suitable threshold, we computed classification accuracies for varying thresholds, considering all accounts scoring below each threshold as human. We then selected the threshold yielding maximum accuracy (see insets of Fig. 4).
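A hedged sketch of this threshold search; the scores and labels below are placeholder arrays, and the 101-point grid is our choice rather than the paper's.

```python
# Sketch: sweep candidate thresholds and keep the accuracy-maximizing
# one. Accounts scoring at or above the threshold count as bots (1).
import numpy as np

def best_threshold(scores, labels, grid=np.linspace(0, 1, 101)):
    accs = [np.mean((scores >= t).astype(int) == labels) for t in grid]
    return grid[int(np.argmax(accs))], max(accs)

scores = np.array([0.1, 0.2, 0.55, 0.8, 0.9, 0.35])
labels = np.array([0, 0, 1, 1, 1, 0])
print(best_threshold(scores, labels))
```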

We estimated the population of bots using different models. This approach allows us to identify lower and upper bounds for the prevalence of Twitter bots. Models trained using the annotated dataset alone yield estimates of up to 15% of accounts being bots. Recall that the honeypot dataset was obtained earlier and therefore does not include newer, more sophisticated bots. Thus models trained on the honeypot data alone are less sensitive to these sophisticated bots, yielding a more conservative estimate of 9%. Mixing the training data from these two sources results in estimates between these bounds depending on the ratio of the mixture, as illustrated in Fig. 4. Taken together, these numbers suggest that estimates about the prevalence of Twitter bots are highly dependent on the definition and sophistication of the bots.

Figure 4: Estimation of bot population obtained from models with different sensitivity to sophisticated bots. The main charts show the score distributions based on our dataset of 14M users; accounts identified as bots are highlighted. The inset plots show how the thresholds are computed by maximizing accuracy. The titles of each subplot reflect the number of accounts from the annotated and honeypot datasets, respectively.

Some other remarks are in order. First, we do not exclude the possibility that very sophisticated bots can systematically escape a human annotator's judgement. These complex bots may be active on Twitter, and therefore present in our datasets, and may have been incorrectly labeled as humans, making even the 15% figure a conservative estimate. Second, increasing evidence suggests the presence on social media of hybrid human-bot accounts (sometimes referred to as cyborgs) that perform automated actions with some human supervision (Chu et al. 2012; Clark et al. 2016). Some have allegedly been used for terrorist propaganda and recruitment purposes. It remains unclear how these accounts should be labeled, and how pervasive they are.

Characterization of User Interactions

Let us next characterize social connectivity, information flow, and shared properties of users. We analyze the creation of social ties by accounts with different bot scores, and their interactions through shared content. We also cluster accounts and investigate shared properties of users in each cluster. Here and in the remainder of this paper, bot scores are computed with a model trained on the merged dataset.

¹ www.sec.gov/Archives/edgar/data/1418091/000156459014003474/twtr-10q_20140630.htm

Social connectivity

To characterize the social connectivity, we collected the social networks of the accounts in our dataset using the Twitter API. The resulting friend and follower relations account for 46 billion social ties, 7 billion of which represent ties within the initially collected user set.

Our observations on social connectivity are presented in Fig. 5. We computed bot-score distributions of friends and followers of accounts for each score interval. The dark line in the top panel shows that human accounts (low score) mostly follow other human accounts. The dark line in the bottom panel shows a principal peak around 0.1 and a secondary one around 0.5. This indicates that humans are typically followed by other humans, but also by sophisticated bots (intermediate scores). The lines corresponding to high scores in the two panels show that bots tend to follow other bots and are mostly followed by bots. However, simple bots (0.8–1.0 ranges) can also attract human attention. This happens when, e.g., humans follow benign bots such as those that share news. This gives rise to the secondary peak of the red line in the bottom panel. In summary, the creation of social ties leads to a homophily effect.

Fig. 6 illustrates the extent to which connections are reciprocated, given the nature of the accounts forming the ties. The reciprocity score of a user is defined as the fraction of friends who are also followers. We observe that human accounts reciprocate more (dark line). Increasing bot scores correlate with lower reciprocity. We also observe that simple bot accounts (0.8–1.0 ranges) have bimodal reciprocity distributions, indicating the existence of two distinct behaviors. The majority of high-score accounts have a reciprocity score smaller than 0.2, possibly because simple bots follow users at random. The slight increase as the reciprocity score approaches one may be due to botnet accounts that coordinate by following each other.
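The reciprocity score as defined above is straightforward to compute from friend and follower ID sets; a minimal sketch with made-up IDs:

```python
# Sketch: reciprocity = fraction of an account's friends who also
# follow it back. Inputs are collections of user IDs.
def reciprocity(friends, followers):
    friends, followers = set(friends), set(followers)
    return len(friends & followers) / len(friends) if friends else 0.0

print(reciprocity(friends={1, 2, 3, 4}, followers={2, 4, 9}))  # 0.5
```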

Information flow

Twitter is a platform that fosters social connectivity and the broadcasting of popular content. In Fig. 7 we analyze information flow in terms of mentions/retweets as a function of the score of the account being mentioned or retweeted.

Simple bots tend to retweet each other (lines for scores in the 0.8–1.0 ranges peak around 0.8 in the bottom panel), while they frequently mention sophisticated bots (peaking around 0.5 in the top panel). More sophisticated bots (scores in the 0.5–0.7 ranges) retweet, but do not mention, humans. They might be unable to engage in meaningful exchanges with humans. While humans also retweet bots, as bots may post interesting content (see peaks of the dark lines in the bottom panel), they have no interest in mentioning bots directly (dark lines in the top panel).


Figure 5: Distributions of bot scores for friends (top) and followers (bottom) of accounts in different score intervals.

Figure 6: Distribution of reciprocity scores for accounts in different score intervals.

Figure 7: Bot score distributions of users mentioned (top) and retweeted (bottom) by accounts with different scores.

Clustering accounts

To characterize different account types, let us group accounts into behavioral clusters. We apply K-Means to normalized vectors of the 100 most important features selected by our Random Forests model. We identify 10 distinct clusters based on different evaluation criteria, such as silhouette scores and percentage of variance explained. In Fig. 8, we present a 2-dimensional projection of users obtained by a dimensionality reduction technique called t-SNE (Maaten and Hinton 2008). In this method, the similarity between users is computed based on their 100-dimensional representation in the feature space. Similar users are projected into nearby points and dissimilar users are kept distant from each other.
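A minimal sketch of this pipeline with scikit-learn; the matrix X_top stands in for the per-account vectors of the 100 most important features, and the normalization choice (standard scaling) is our assumption, not specified in the paper.

```python
# Sketch: K-Means clustering of normalized feature vectors, followed
# by a 2-d t-SNE projection for visualization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_top = np.random.rand(500, 100)               # placeholder vectors
X_norm = StandardScaler().fit_transform(X_top)

clusters = KMeans(n_clusters=10, n_init=10).fit_predict(X_norm)
embedding = TSNE(n_components=2).fit_transform(X_norm)
print(embedding.shape, np.bincount(clusters))  # (500, 2), cluster sizes
```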

Figure 8: t-SNE embedding of accounts. Points are colored based on clustering in high-dimensional space. For each cluster, the distribution of scores is presented on the right.

Let us investigate shared cluster properties by manual inspection of random subsets of accounts from each cluster. Three of the clusters, namely C0–C2, have high average bot scores. The presence of significant numbers of bot accounts in these clusters was manually verified. These bot clusters exhibit some prominent properties. Cluster C0, for example, consists of legit-looking accounts that are promoting themselves (recruiters, porn actresses, etc.). They are concentrated in the lower part of the 2-dimensional embedding, suggesting homogeneous patterns of behavior. C1 contains spam accounts that are very active but have few followers. Accounts in C2 frequently use automated applications to share activity from other platforms like YouTube and Instagram, or to post links to news articles. Some of the accounts in C2 might belong to actual humans who are no longer active and whose posts are mostly sent by connected apps.

Cluster C3 contains a mix of sophisticated bots, cyborg-like accounts (mixes of bot and human features), and human users. Clusters of predominantly human accounts, namely C4–C9, separate from one another in the embedding due to different activity styles, user popularity, and content production and consumption patterns. For instance, accounts in C7 engage more with their friends, unlike accounts from C8 that mostly retweet with few other forms of interaction. Clusters C5, C6, and C9 contain common Twitter users who produce experiential tweets, share pictures, and retweet their friends.

Related Work

Also known as "sybil" accounts, social bots can pollute online discussion by lending false credibility to their messages and can influence other users (Ferrara et al. 2016a; Aiello et al. 2012). Recent studies quantify the extent to which automated systems can dominate discussions on Twitter about topics ranging from electronic cigarettes (Clark et al. 2015) to elections (Bessi and Ferrara 2016). Large collections of social bots, also known as botnets, are controlled by botmasters and used for coordinated activities. Examples of such botnets have been identified for advertisement (Echeverría and Zhou 2017) and for influence campaigns in the Syrian civil war (Abokhodair, Yoo, and McDonald 2015). Social bots also vary greatly in terms of their behavior, intent, and vulnerabilities, as illustrated in a categorization scheme for bot attacks (Mitter, Wagner, and Strohmaier 2013).

Much of the previous work on detecting bots is from the perspective of the social network platform operators, implying full access to all data. These studies focus on collecting large-scale data to either cluster behavioral patterns of users (Wang et al. 2013a) or classify accounts using supervised learning techniques (Yang et al. 2014; Lee, Eoff, and Caverlee 2011). For instance, Beutel et al. decomposed event data in time, user, and activity dimensions to extract similar behaviors (Beutel et al. 2013). These techniques are useful to identify coordinated large-scale attacks directed at a common set of targets at the same time, but accounts with similar strategies might also target different groups and operate separately from each other.

Structural connectivity may provide important cues. However, Yang et al. studied large-scale sybil attacks and observed sophisticated sybils that develop strategies for building normal-looking social ties, making themselves harder to detect (Yang et al. 2014). Some sybil attacks analyze the social graph of targeted groups to infiltrate specific organizations (Elyashar et al. 2013). SybilRank is a system developed to identify attacks from their underlying topology (Cao et al. 2012). Alvisi et al. surveyed the evolution of sybil defense protocols that leverage the structural properties of the social graph (Alvisi et al. 2013).

The work presented here follows several previous contributions to the problem of social bot detection that leverage learning models trained with data collected from human and bot accounts. Chu et al. built a classification system identifying accounts controlled by humans, bots, and cyborgs (Chu et al. 2010; Chu et al. 2012). Wang et al. analyzed sybil attacks using annotations by experts and crowdsourcing workers to evaluate the consistency and effectiveness of different detection systems (Wang et al. 2013b). Clark et al. labeled 1,000 accounts by hand and found natural language text features to be very effective at discriminating between human and automated accounts (Clark et al. 2016). Lee et al. used a honeypot approach to collect the largest sample of bot accounts available to date (Lee, Eoff, and Caverlee 2011). That study generated the honeypot dataset used in the present paper. Here, we extend this body of prior work by exploring many different categories of features, contributing a new labeled dataset, estimating the number of bot accounts, analyzing information flow among accounts, identifying several classes of behaviors, and providing a public bot detection service.

An alternative approach to studying social bots and sybil attacks is to understand what makes certain groups and individuals more appealing as targets. Wald et al. studied the factors affecting the likelihood of a user being targeted by social bots (Wald et al. 2013). These approaches point to effective strategies that future social bots might develop.

Recently, we have observed efforts to facilitate research collaborations on the topic of social bots. DARPA organized a bot detection challenge in the domain of anti-vaccine campaigns on Twitter (Subrahmanian et al. 2016). We released our Twitter bot detection system online for public use (Davis et al. 2016). Since its release, our system has received millions of requests, and we are improving the models based on feedback we have received from our users. The increasing availability of software and datasets on social bots will help design systems that are capable of co-evolving with recent social bots and, hopefully, of mitigating the effects of their malicious activities.

Conclusions

Social media make it easy for accounts controlled by hybrid or automated approaches to create content and interact with other accounts. Our project aims to identify these bots. Such a classification task could be a first step toward studying modes of communication among different classes of entities on social media.

In this article, we presented a framework for bot detection on Twitter. We introduced our machine learning system that extracts more than a thousand features in six different classes: user and friends meta-data, tweet content and sentiment, network patterns, and activity time series. We evaluated our framework when initially trained on an available dataset of bots. Our initial classifier achieves 0.95 AUC when evaluated using 5-fold cross-validation. Our analysis of the contributions of different feature classes suggests that user meta-data and content features are the two most valuable sources of data to detect simple bots.

To evaluate the performance of our classifier on a more recent and challenging sample of bots, we randomly selected Twitter accounts covering the whole spectrum of classification scores. The accuracy of our initial classifier trained on the honeypot dataset decreased to 0.85 AUC when tested on the more challenging dataset. By retraining the classifier with the two datasets merged, we achieved high accuracy (0.94 AUC) in detecting both simple and sophisticated bots.

We also estimated the fraction of bots in the active English-speaking population on Twitter. We classified nearly 14M accounts using our system and inferred the optimal threshold scores that separate human and bot accounts for several models with different mixes of simple and sophisticated bots. Training data have an important effect on classifier sensitivity. Our estimates for the bot population range between 9% and 15%. This points to the importance of tracking increasingly sophisticated bots, since deception and detection technologies are in a never-ending arms race.

To characterize user interactions, we studied social connectivity and information flow between different user groups. We showed that the selection of friends and followers correlates with an account's bot likelihood. We also highlighted how bots use different retweet and mention strategies when interacting with humans or other bots.

We concluded our analysis by characterizing subclasses of account behaviors. Clusters identified by this analysis point mainly to three types of bots. These results emphasize that Twitter hosts a variety of users with diverse behaviors; this is true for both human and bot accounts. In some cases, the boundary separating these two groups is not sharp, and an account can exhibit characteristics of both.

Acknowledgments. We thank M. JafariAsbagh and P. Shiralkar for helpful discussions. We also thank undergraduate students A. Toms, A. Fulton, A. Witulski, and M. Johnston for contributing to data annotation. This work was supported in part by ONR (N15A-020-0053), DARPA (W911NF-12-1-0037), NSF (CCF-1101743), and the J.S. McDonnell Foundation.

References

[Abokhodair, Yoo, and McDonald 2015] Abokhodair, N.; Yoo, D.; and McDonald, D. W. 2015. Dissecting a social botnet: Growth, content and influence in Twitter. In Proc. of the 18th ACM Conf. on Computer Supported Cooperative Work & Social Computing, 839–851. ACM.

[Agarwal et al. 2011] Agarwal, A.; Xie, B.; Vovsha, I.; Rambow, O.; and Passonneau, R. 2011. Sentiment analysis of Twitter data. In Proc. of the Workshop on Languages in Social Media, 30–38. ACL.

[Aiello et al. 2012] Aiello, L.; Deplano, M.; Schifanella, R.; and Ruffo, G. 2012. People are strange when you're a stranger: Impact and influence of bots on social networks. In Proc. 6th Intl. AAAI Conf. on Weblogs & Soc. Media (ICWSM).

[Alvisi et al. 2013] Alvisi, L.; Clement, A.; Epasto, A.; Lattanzi, S.; and Panconesi, A. 2013. SoK: The evolution of sybil defense via social networks. In Proc. IEEE Symposium on Security and Privacy (SP), 382–396.

[Bakshy et al. 2011] Bakshy, E.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011. Everyone's an influencer: Quantifying influence on Twitter. In Proc. 4th ACM Intl. Conf. on Web Search and Data Mining, 65–74.

[Berger and Morgan 2015] Berger, J., and Morgan, J. 2015. The ISIS Twitter census: Defining and describing the population of ISIS supporters on Twitter. The Brookings Project on US Relations with the Islamic World 3:20.

[Bessi and Ferrara 2016] Bessi, A., and Ferrara, E. 2016. Social bots distort the 2016 US presidential election online discussion. First Monday 21(11).

[Bessi et al. 2015] Bessi, A.; Coletto, M.; Davidescu, G. A.; Scala, A.; Caldarelli, G.; and Quattrociocchi, W. 2015. Science vs conspiracy: Collective narratives in the age of misinformation. PLoS ONE 10(2):e0118093.

[Beutel et al. 2013] Beutel, A.; Xu, W.; Guruswami, V.; Palow, C.; and Faloutsos, C. 2013. CopyCatch: Stopping group attacks by spotting lockstep behavior in social networks. In Proc. 22nd Intl. ACM Conf. World Wide Web (WWW), 119–130.

[Bollen, Mao, and Zeng 2011] Bollen, J.; Mao, H.; and Zeng, X. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2(1):1–8.

[Boshmaf et al. 2011] Boshmaf, Y.; Muslukhov, I.; Beznosov, K.; and Ripeanu, M. 2011. The socialbot network: When bots socialize for fame and money. In Proc. 27th Annual Computer Security Applications Conf.

[Botta, Moat, and Preis 2015] Botta, F.; Moat, H. S.; and Preis, T. 2015. Quantifying crowd size with mobile phone and Twitter data. Royal Society Open Science 2(5):150162.

[Briscoe, Appling, and Hayes 2014] Briscoe, E.; Appling, S.; and Hayes, H. 2014. Cues to deception in social media communications. In Hawaii Intl. Conf. on Syst. Sci.

[Cao et al. 2012] Cao, Q.; Sirivianos, M.; Yang, X.; and Pregueiro, T. 2012. Aiding the detection of fake accounts in large scale social online services. In 9th USENIX Symp. on Netw. Sys. Design & Implement., 197–210.

[Chavoshi, Hamooni, and Mueen 2016] Chavoshi, N.; Hamooni, H.; and Mueen, A. 2016. Identifying correlated bots in Twitter. In Social Informatics: 8th Intl. Conf., 14–21.

[Chu et al. 2010] Chu, Z.; Gianvecchio, S.; Wang, H.; and Jajodia, S. 2010. Who is tweeting on Twitter: Human, bot, or cyborg? In Proc. 26th Annual Computer Security Applications Conf., 21–30.

[Chu et al. 2012] Chu, Z.; Gianvecchio, S.; Wang, H.; and Jajodia, S. 2012. Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Trans. Dependable & Secure Comput. 9(6):811–824.

[Clark et al. 2015] Clark, E.; Jones, C.; Williams, J.; Kurti, A.; Nortotsky, M.; Danforth, C.; and Dodds, P. 2015. Vaporous marketing: Uncovering pervasive electronic cigarette advertisements on Twitter. arXiv preprint arXiv:1508.01843.

[Clark et al. 2016] Clark, E.; Williams, J.; Jones, C.; Galbraith, R.; Danforth, C.; and Dodds, P. 2016. Sifting robotic from organic text: A natural language approach for detecting automation on Twitter. Journal of Computational Science 16:1–7.

[Danescu-Niculescu-Mizil et al. 2013] Danescu-Niculescu-Mizil, C.; West, R.; Jurafsky, D.; Leskovec, J.; and Potts, C. 2013. No country for old members: User lifecycle and linguistic change in online communities. In Proc. of the 22nd Intl. Conf. on World Wide Web, 307–318.

[Das et al. 2016] Das, A.; Gollapudi, S.; Kiciman, E.; and Varol, O. 2016. Information dissemination in heterogeneous-intent networks. In Proc. ACM Conf. on Web Science.

[Davis et al. 2016] Davis, C. A.; Varol, O.; Ferrara, E.; Flammini, A.; and Menczer, F. 2016. BotOrNot: A system to evaluate social bots. In Proc. 25th Intl. Conf. Companion on World Wide Web, 273–274.

[Echeverría and Zhou 2017] Echeverría, J., and Zhou, S. 2017. The 'Star Wars' botnet with >350k Twitter bots. arXiv preprint arXiv:1701.02405.

[Elyashar et al. 2013] Elyashar, A.; Fire, M.; Kagan, D.; and Elovici, Y. 2013. Homing socialbots: Intrusion on a specific organization's employee using socialbots. In Proc. IEEE/ACM Intl. Conf. on Advances in Social Networks Analysis and Mining, 1358–1365.

[Ferrara and Yang 2015] Ferrara, E., and Yang, Z. 2015. Quantifying the effect of sentiment on information diffusion in social media. PeerJ Comp. Sci. 1:e26.

[Ferrara et al. 2016a] Ferrara, E.; Varol, O.; Davis, C.; Menczer, F.; and Flammini, A. 2016a. The rise of social bots. Comm. ACM 59(7):96–104.

[Ferrara et al. 2016b] Ferrara, E.; Varol, O.; Menczer, F.; and Flammini, A. 2016b. Detection of promoted social media campaigns. In Proc. Intl. AAAI Conference on Web and Social Media.

[Ferrara et al. 2016c] Ferrara, E.; Wang, W.-Q.; Varol, O.; Flammini, A.; and Galstyan, A. 2016c. Predicting online extremism, content adopters, and interaction reciprocity. In Social Informatics: 8th Intl. Conf., SocInfo 2016, Bellevue, WA, USA, 22–39.

[Ghosh, Surachawala, and Lerman 2011] Ghosh, R.; Surachawala, T.; and Lerman, K. 2011. Entropy-based classification of retweeting activity on Twitter. In Proc. of KDD Workshop on Social Network Analysis.

[Gjoka et al. 2010] Gjoka, M.; Kurant, M.; Butts, C. T.; and Markopoulou, A. 2010. Walking in Facebook: A case study of unbiased sampling of OSNs. In Proc. IEEE INFOCOM, 1–9.

[Haustein et al. 2016] Haustein, S.; Bowman, T. D.; Holmberg, K.; Tsou, A.; Sugimoto, C. R.; and Larivière, V. 2016. Tweets as impact indicators: Examining the implications of automated "bot" accounts on Twitter. Journal of the Association for Information Science and Technology 67(1):232–238.

[Kloumann et al. 2012] Kloumann, I. M.; Danforth, C. M.; Harris, K. D.; Bliss, C. A.; and Dodds, P. S. 2012. Positivity of the English language. PLoS ONE 7(1):e29484.

[Lee, Eoff, and Caverlee 2011] Lee, K.; Eoff, B. D.; and Caverlee, J. 2011. Seven months with the devils: A long-term study of content polluters on Twitter. In Proc. 5th AAAI Intl. Conf. on Web and Social Media.

[Letchford, Moat, and Preis 2015] Letchford, A.; Moat, H. S.; and Preis, T. 2015. The advantage of short paper titles. Royal Society Open Science 2(8):150266.

[Lokot and Diakopoulos 2016] Lokot, T., and Diakopoulos, N. 2016. News bots: Automating news and information dissemination on Twitter. Digital Journalism 4(6):682–699.

[Maaten and Hinton 2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.

[McAuley and Leskovec 2013] McAuley, J., and Leskovec, J. 2013. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proc. 22nd Intl. ACM Conf. World Wide Web, 897–908.

[Mislove et al. 2011] Mislove, A.; Lehmann, S.; Ahn, Y.-Y.; Onnela, J.-P.; and Rosenquist, J. N. 2011. Understanding the demographics of Twitter users. In Proc. of the 5th Intl. AAAI Conf. on Weblogs and Social Media.

[Mitchell et al. 2013] Mitchell, L.; Harris, K. D.; Frank, M. R.; Dodds, P. S.; and Danforth, C. M. 2013. The geography of happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE 8(5):e64417.

[Mitter, Wagner, and Strohmaier 2013] Mitter, S.; Wagner, C.; and Strohmaier, M. 2013. A categorization scheme for socialbot attacks in online social networks. In Proc. of the 3rd ACM Web Science Conference.

[Mocanu et al. 2013] Mocanu, D.; Baronchelli, A.; Perra, N.; Goncalves, B.; Zhang, Q.; and Vespignani, A. 2013. The Twitter of Babel: Mapping world languages through microblogging platforms. PLoS ONE 8(4):e61981.

[Morstatter et al. 2013] Morstatter, F.; Pfeffer, J.; Liu, H.; and Carley, K. 2013. Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. In 7th Intl. Conf. on Weblogs & Soc. Media.

[Pedregosa et al. 2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

[Ratkiewicz et al. 2011] Ratkiewicz, J.; Conover, M.; Meiss, M.; Goncalves, B.; Flammini, A.; and Menczer, F. 2011. Detecting and tracking political abuse in social media. In 5th Intl. Conf. on Weblogs & Soc. Media, 297–304.

[Savage, Monroy-Hernandez, and Höllerer 2016] Savage, S.; Monroy-Hernandez, A.; and Höllerer, T. 2016. Botivist: Calling volunteers to action using online bots. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 813–822. ACM.

[Subrahmanian et al. 2016] Subrahmanian, V.; Azaria, A.; Durst, S.; Kagan, V.; Galstyan, A.; Lerman, K.; Zhu, L.; Ferrara, E.; Flammini, A.; Menczer, F.; et al. 2016. The DARPA Twitter Bot Challenge. IEEE Computer 6(49):38–46.

[Wald et al. 2013] Wald, R.; Khoshgoftaar, T. M.; Napolitano, A.; and Sumner, C. 2013. Predicting susceptibility to social bots on Twitter. In Proc. 14th Intl. IEEE Conf. on Information Reuse and Integration, 6–13.

[Wang et al. 2013a] Wang, G.; Konolige, T.; Wilson, C.; Wang, X.; Zheng, H.; and Zhao, B. Y. 2013a. You are how you click: Clickstream analysis for sybil detection. In Proc. USENIX Security, 1–15.

[Wang et al. 2013b] Wang, G.; Mohanlal, M.; Wilson, C.; Wang, X.; Metzger, M.; Zheng, H.; and Zhao, B. Y. 2013b. Social Turing tests: Crowdsourcing sybil detection. In Proc. of the 20th Network & Distributed System Security Symposium (NDSS).

[Warriner, Kuperman, and Brysbaert 2013] Warriner, A. B.; Kuperman, V.; and Brysbaert, M. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 1–17.

[Wilson, Wiebe, and Hoffmann 2005] Wilson, T.; Wiebe, J.; and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In ACL Conf. on Human Language Techn. & Empirical Methods in NLP, 347–354.

[Yang et al. 2014] Yang, Z.; Wilson, C.; Wang, X.; Gao, T.; Zhao, B. Y.; and Dai, Y. 2014. Uncovering social network sybils in the wild. ACM Trans. Knowledge Discovery from Data 8(1):2.