sune lehman: #socialbots and how we created the boston banksy hoax

59
the thrilling story of the #socialbots Sune Lehmann Associate Professor Department of Applied Mathematics and Computer Science Technical University of Denmark @suneman

Upload: arjan-haring

Post on 11-May-2015

288 views

Category:

Documents


1 download

DESCRIPTION

Prof. Sune Lehman used his course machine learning to create socialbots, that in their turn created a Boston Banksy Hoax.

TRANSCRIPT

Page 1: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

the thrilling story of the

#socialbots

Sune Lehmann Associate Professor Department of Applied Mathematics and Computer Science Technical University of Denmark @suneman

Page 2: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

Art by Allan Criss

Page 3: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

Artwork: C. Hidalgo

network structure

Page 4: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

LETTERS

Link communities reveal multiscale complexity innetworksYong-Yeol Ahn1,2*, James P. Bagrow1,2* & Sune Lehmann3,4*

Networks have become a key approach to understanding systemsof interacting objects, unifying the study of diverse phenomenaincluding biological organisms and human society1–3. One crucialstep when studying the structure and dynamics of networks is toidentify communities4,5: groups of related nodes that correspondto functional subunits such as protein complexes6,7 or socialspheres8–10. Communities in networks often overlap9,10 such thatnodes simultaneously belong to several groups. Meanwhile, manynetworks are known to possess hierarchical organization, wherecommunities are recursively grouped into a hierarchical struc-ture11–13. However, the fact that many real networks have com-munities with pervasive overlap, where each and every nodebelongs to more than one group, has the consequence that a globalhierarchy of nodes cannot capture the relationships between over-lapping groups. Here we reinvent communities as groups of linksrather than nodes and show that this unorthodox approach suc-cessfully reconciles the antagonistic organizing principles of over-lapping communities and hierarchy. In contrast to the existingliterature, which has entirely focused on grouping nodes, linkcommunities naturally incorporate overlap while revealing hier-archical organization. We find relevant link communities in manynetworks, including major biological networks such as protein–protein interaction6,7,14 and metabolic networks11,15,16, and showthat a large social network10,17,18 contains hierarchically organizedcommunity structures spanning inner-city to regional scales whilemaintaining pervasive overlap. Our results imply that link com-munities are fundamental building blocks that reveal overlap andhierarchical organization in networks to be two aspects of thesame phenomenon.

Although no common definition has been agreed upon, it is widelyaccepted that a community should have more internal than externalconnections19–24. Counterintuitively, highly overlapping communitiescan have many more external than internal connections (Fig. 1a, b).Because pervasive overlap breaks even this fundamental assumption, anew approach is needed.

The discovery of hierarchy and community organization has alwaysbeen considered a problem of determining the correct membership(or memberships) of each node. Notice that, whereas nodes belong tomultiple groups (individuals have families, co-workers and friends;Fig. 1c), links often exist for one dominant reason (two people are inthe same family, work together or have common interests). Instead ofassuming that a community is a set of nodes with many links betweenthem, we consider a community to be a set of closely interrelated links.

Placing each link in a single context allows us to reveal hierarchicaland overlapping relationships simultaneously. We use hierarchicalclustering with a similarity between links to build a dendrogramwhere each leaf is a link from the original network and branches

represent link communities (Fig. 1d, e and Methods). In this den-drogram, links occupy unique positions whereas nodes naturallyoccupy multiple positions, owing to their links. We extract link com-munities at multiple levels by cutting this dendrogram at variousthresholds. Each node inherits all memberships of its links and canthus belong to multiple, overlapping communities. Even though weassign only a single membership per link, link communities can alsocapture multiple relationships between nodes, because multiplenodes can simultaneously belong to several communities together.

The link dendrogram provides a rich hierarchy of structure, but toobtain the most relevant communities it is necessary to determine thebest level at which to cut the tree. For this purpose, we introduce anatural objective function, the partition density, D, based on linkdensity inside communities; unlike modularity20, D does not sufferfrom a resolution limit25 (Methods). Computing D at each level of thelink dendrogram allows us to pick the best level to cut (althoughmeaningful structure exists above and below that threshold). It isalso possible to optimize D directly. We can now formulate overlap-ping community discovery as a well-posed optimization problem,accounting for overlap at every node without penalizing that nodesparticipate in multiple communities.

As an illustrative example, Fig. 1f shows link communities aroundthe word ‘Newton’ in a network of commonly associated Englishwords. (See Supplementary Information, section 6, for details onnetworks used throughout the text.) The ‘clever, wit’ community iscorrectly identified inside the ‘smart/intellect’ community. Thewords ‘Newton’ and ‘Gravity’ both belong to the ‘smart/intellect’,‘weight’ and ‘apple’ communities, illustrating that link communitiescapture multiple relationships between nodes. See SupplementaryInformation, section 3.6, for further visualizations.

Having unified hierarchy and overlap, we provide quantitative,real-world evidence that a link-based approach is superior to exist-ing, node-based approaches. Using data-driven performance mea-sures, we analyse link communities found at the maximum partitiondensity in real-world networks, compared with node communitiesfound by three widely used and successful methods: clique percola-tion9, greedy modularity optimization26 and Infomap21. Clique per-colation is the most prominent overlapping community algorithm,greedy modularity optimization is the most popular modularity-based20 technique and Infomap is often considered the most accuratemethod available27.

We compiled a test group of 11 networks covering many domainsof active research and representing the wide body of available data(Supplementary Table 2). These networks vary from small to large,from sparse to dense, and from those with modular structure to thosewith highly overlapping structure. We highlight a few data sets ofparticular scientific importance: The mobile phone network is the

*These authors contributed equally to this work.

1Center for Complex Network Research, Department of Physics, Northeastern University, Boston, Massachusetts 02115, USA. 2Center for Cancer Systems Biology, Dana-FarberCancer Institute, Harvard University, Boston, Massachusetts 02215, USA. 3Institute for Quantitative Social Science, Harvard University, Cambridge, Massachusetts 02138, USA.4College of Computer and Information Science, Northeastern University, Boston, Massachusetts 02115, USA.

doi:10.1038/nature09182

1Macmillan Publishers Limited. All rights reserved©2010

most comprehensive proxy of a large-scale social network currentlyin existence17,18; the metabolic network iAF1260, from Escherichia coliK-12 MG1655 strain, is one of the most elaborate reconstructionscurrently available16; and the three protein–protein interaction net-works of Saccharomyces cerevisiae are the most recent and completeprotein–protein interaction data yet published14.

These networks possess rich metadata that allow us to describe thestructural and functional roles of each node. For example, the bio-logical roles of each protein in the protein–protein interaction net-work can be described by a controlled vocabulary (Gene Ontologyterms28). By calculating metadata-based similarity measures betweennodes (Methods and Supplementary Information, section 5), we candetermine the quality of communities by the similarity of the nodesthey contain (‘community quality’). Likewise, we can use metadata toestimate the expected amount of overlap around a node, testing thequality of the discovered overlap according to the metadata (‘overlapquality’). For example, metabolites that participate in more meta-bolic pathways are expected to belong to more communities thanmetabolites that participate in fewer pathways. Some methods mayfind high-quality communities but only for a small fraction of thenetwork; coverage measures describe how much of the network wasclassified by each algorithm (‘community coverage’) and how muchoverlap was discovered (‘overlap coverage’). Each community algo-rithm is tested by comparing its output with the metadata, to deter-mine how well the discovered community structure reflects themetadata, according to the four measures. Each measure is normalizedsuch that the best method attains a value of one. ‘Composite perfor-mance’ is the sum of these four normalized measures, such that themaximum achievable score is four. Full details are in Methods andSupplementary Information, sections 5 and 6.

Figure 2 displays the results of this quantitative comparison, show-ing that link communities reveal more about every network’s meta-data than other tested methods. Not only is our approach the overallleader in every network, it is also the winner in most individual aspectsof the composite performance for all networks, particularly the qualitymeasures. The performance of link communities stands out for densenetworks, such as the metabolic and word association networks,which are expected to have pervasively overlapping structure.

University

Joint appointment

Home and work

FamilyBuildings in same neighborhood

Inertia

Law

Newton

Physics

Lab Biology

Scientific

Chemical

Chemistry

Science

EinsteinTheory

Hypothesis

Theorem

Gravity

Relativity

Biologist

Smart

Scientist

Sly

Bright

Genius

Intelligence

Clever

IntelligentGifted

Wise

Inventor Brilliant

Wisdom

Kinetic

Exceptional

Retarded

Invent

Chemist

Wit

Velocity

Intellect

Cunning

Outfox

Flask Beaker

Test tube

Experiment

Research

Apple

Weight

Experiment, science

Newton, gravity, apple

Smart, intellect, scientists

Clever, wit

Science, scientists

21

3

5

3–42–41–42–31–21–34–75–64–64–57–97–88–9

4

67

8

9

a b

c

d

f

e

MeasuresOverlap coverage

Overlap qualityCommunity coverage

Community quality

MethodsLCGI

––––

LinksClique percolationGreedy modularity Infomap

Other networksSocial networksBiological networks

885,9896.34⟨k⟩

N 1,04216.81

1,6473.06

1,00416.57

1,2134.21

2,7298.92

67,4118.90

39038.95

1,2199.80

5,01822.02

18,1425.09

0

1

2

3

4

L C G I L C G I L C G I L C G I L C G I L C G I L C G I L C G I L C G I L C G I L C G I

Com

posi

te p

erfo

rman

ce

Amazon.comWord assoc.PhilosopherUS CongressActorPhonePPI (all)PPI (LC)PPI (AP/MS)PPI (Y2H)Metabolic

Figure 2 | Assessing the relevance of link communities using real-worldnetworks. Composite performance (Methods and SupplementaryInformation) is a data-driven measure of the quality (relevance of discoveredmemberships) and coverage (fraction of network classified) of communityand overlap. Tested algorithms are link clustering, introduced here; cliquepercolation9; greedy modularity optimization26; and Infomap21. Test

networks were chosen for their varied sizes and topologies and to representthe different domains where network analysis is used. Shown for each are thenumber of nodes, N, and the average number of neighbours per node, Ækæ.Link clustering finds the most relevant community structure in real-worldnetworks. AP/MS, affinity-purification/mass spectrometry; LC, literaturecurated; PPI, protein–protein interaction; Y2H, yeast two-hybrid.

Figure 1 | Overlapping communities lead to dense networks and preventthe discovery of a single node hierarchy. a, Local structure in manynetworks is simple: an individual node sees the communities it belongs to.b, Complex global structure emerges when every node is in the situationdisplayed in a. c, Pervasive overlap hinders the discovery of hierarchicalorganization because nodes cannot occupy multiple leaves of a nodedendrogram, preventing a single tree from encoding the full hierarchy.d, e, An example showing link communities (colours in d), the link similaritymatrix (e; darker entries show more similar pairs of links) and the linkdendrogram (e). f, Link communities from the full word association networkaround the word ‘Newton’. Link colours represent communities and filledregions provide a guide for the eye. Link communities capture conceptsrelated to science and allow substantial overlap. Note that the words wereproduced by experiment participants during free word associations.

LETTERS NATURE

2Macmillan Publishers Limited. All rights reserved©2010

It is instructive to examine further the statistics of link communitiesin the metabolic and mobile phone networks (Fig. 3). The communitysize distribution at the optimum value of D is heavy tailed for bothnetworks, whereas the number of communities per node distinguishesthem (Fig. 3, insets): Mobile phone users are limited to a smaller rangeof community memberships, most likely as a result of social and timeconstraints. Meanwhile, the membership distribution of the metabolicnetwork displays the universality of currency metabolites (water, ATPand so on) through the large number of communities they participatein. Notable previous work11,15 removed currency metabolites beforeidentifying meaningful community structure. The statistics presentedhere match current knowledge about the two systems, further con-firming the communities’ relevance.

Having established that link communities at the maximal partitiondensity are meaningful and relevant, we now show that the linkdendrogram reveals meaningful communities at different scales.Figure 4a–c shows that mobile phone users in a community arespatially co-located. Figure 4a maps the most likely geographic loca-tions of all users in the network; several cities are present. In Fig. 4b,we show (insets) several communities at different cuts above theoptimum threshold, revealing small, intra-city communities. Belowthe optimum threshold, larger, yet still spatially correlated, com-munities exist (Fig. 4c). Because we expect a tight-knit communityto have only small geographical dispersion, the clustered structureson the map indicate that the communities are meaningful. The geo-graphical correlation of each community does not suddenly breakdown, but is sustained over a wide range of thresholds. In Fig. 4d, welook more closely at the social network of the largest community inFig. 4c, extracting the structure of its largest subcommunity alongwith its remaining hierarchy and revealing the small-scale structuresencoded in the link dendrogram. This example provides evidence forthe presence of spatial, hierarchical organization at a societal scale. Tovalidate the hierarchical organization of communities quantitatively

throughout the dendrogram, we use a randomized control dendro-gram that quantifies how community quality would evolve if therewere no hierarchical organization beyond a certain point. Figure 4eshows that the quality of the actual communities decays much moreslowly than the control, indicating that real link dendrograms possessa large range of high quality community structures. The quantitativeresults of Fig. 4 are typical for the full test group, implying that rich,meaningful community structure is contained within the link den-drogram. Additional results supporting these conclusions are pre-sented in Supplementary Information, section 7.

Many cutting-edge networks are far from complete. For example,an ambitious project to map all protein–protein interactions in yeastis currently estimated to detect approximately 20% of connections14.As the rate of data collection continues to increase, networks become

Num

ber o

f com

mun

ities

Number of users per community

106

105

104

103

102

101

1000 5 10 15 20 25 30 35

Num

ber o

f use

rs

Number of communitiesper user

103

102

101

106

105

104

103

102

101

100

100

101 102 103

101 102 103

Num

ber o

f com

mun

ities

Number of metabolites per community

103

102

101

1000 50 100 150 200

Num

ber o

fm

etab

olite

s

Number of communitiesper metabolite

Mobile phone

Metabolic

H2O, H+

ATPADP

Pi

Figure 3 | Community and membership distributions for the metabolic andmobile phone networks. The distribution of community sizes and nodememberships (insets). Community size shows a heavy tail. The number ofmemberships per node is reasonable for both networks: we do not observephone users that belong to large numbers of communities and we correctlyidentify currency metabolites, such as water, ATP and inorganic phosphate(Pi), that are prevalently used throughout metabolism. The appearance ofcurrency metabolites in many metabolic reactions is naturally incorporatedinto link communities, whereas their presence hindered communityidentification in previous work11,15.

Threshold, t = 0.20

t = 0.24

t = 0.27

t = 0.27

50 km

a

0.4

D

0.6 0.8 1

d  Largest community Largestsubcommunity

Remaininghierarchy

t

e

b

c

Q/Q

max

0 0.2 0.4 0.6 0.8

Word association

0 0.2 0.4 0.6 0.8Link dendrogram threshold, t

Metabolic

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8

Phone

ActualControl

LargestcommunitySecondlargestThird largest

Figure 4 | Meaningful communities at multiple levels of the linkdendrogram. a–c, The social network of mobile phone users displays co-located, overlapping communities on multiple scales. a, Heat map of themost likely locations of all users in the region, showing several cities.b, Cutting the dendrogram above the optimum threshold yields small, intra-city communities (insets). c, Below the optimum threshold, the largestcommunities become spatially extended but still show correlation. d, Thesocial network within the largest community in c, with its largestsubcommunity highlighted. The highlighted subcommunity is shown alongwith its link dendrogram and partition density, D, as a function of threshold,t. Link colours correspond to dendrogram branches. e, Community quality,Q, as a function of dendrogram level, compared with random control(Methods).

NATURE LETTERS

3Macmillan Publishers Limited. All rights reserved©2010

Page 5: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 6: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 7: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 8: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 9: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

relapse

Page 10: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 11: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

02805 Social graphs and interactions

Page 12: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 13: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

twitterbots!

Page 14: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

twitterbots!

Page 15: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 16: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 17: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

anecdote #1: you must be careful what you say

Page 18: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 19: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 20: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 21: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

[FAIL]

Page 22: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

competition #1

[v2.0 bots inspired by Tim Hwang]

who can get the most Justin Bieber followers?

Page 23: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

strategy

1. manually create a “realistic” profile (including a few tweets)

2. pick users with 50 - 200 followers 3. follow 10-20 new users per day 4. un-follow whoever doesn’t follow you back

within 24 hours 5. repeat step 2-4

Page 24: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 25: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

some early insights

Page 26: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

strategy

1. manually create a “realistic” profile (including a few tweets)

2. pick users with 50 - 200 followers 3. follow 10-20 new users per day 4. un-follow whoever doesn’t follow you back

within 24 hours 5. repeat step 2-4

100[initial goal was approx 50 followers per bot]

Page 27: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 28: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

bieber has a lot of (dark matter) spam followers!

Page 29: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

anecdote #2: an accidental turing test

Page 30: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

Noise

1

Twitter mood predicts the stock market.Johan Bollen1,?,Huina Mao1,?,Xiao-Jun Zeng2.

?: authors made equal contributions.

Abstract—Behavioral economics tells us that emotions canprofoundly affect individual behavior and decision-making. Doesthis also apply to societies at large, i.e. can societies experiencemood states that affect their collective decision making? Byextension is the public mood correlated or even predictive ofeconomic indicators? Here we investigate whether measurementsof collective mood states derived from large-scale Twitter feedsare correlated to the value of the Dow Jones Industrial Average(DJIA) over time. We analyze the text content of daily Twitterfeeds by two mood tracking tools, namely OpinionFinder thatmeasures positive vs. negative mood and Google-Profile of MoodStates (GPOMS) that measures mood in terms of 6 dimensions(Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validatethe resulting mood time series by comparing their ability todetect the public’s response to the presidential election andThanksgiving day in 2008. A Granger causality analysis anda Self-Organizing Fuzzy Neural Network are then used toinvestigate the hypothesis that public mood states, as measured bythe OpinionFinder and GPOMS mood time series, are predictiveof changes in DJIA closing values. Our results indicate that theaccuracy of DJIA predictions can be significantly improved bythe inclusion of specific public mood dimensions but not others.We find an accuracy of 87.6% in predicting the daily up anddown changes in the closing values of the DJIA and a reductionof the Mean Average Percentage Error by more than 6%.

Index Terms—stock market prediction — twitter — moodanalysis.

I. INTRODUCTION

STOCK market prediction has attracted much attentionfrom academia as well as business. But can the stock

market really be predicted? Early research on stock marketprediction [1], [2], [3] was based on random walk theoryand the Efficient Market Hypothesis (EMH) [4]. Accordingto the EMH stock market prices are largely driven by newinformation, i.e. news, rather than present and past prices.Since news is unpredictable, stock market prices will follow arandom walk pattern and cannot be predicted with more than50 percent accuracy [5].

There are two problems with EMH. First, numerous studiesshow that stock market prices do not follow a random walkand can indeed to some degree be predicted [5], [6], [7], [8]thereby calling into question EMH’s basic assumptions. Sec-ond, recent research suggests that news may be unpredictablebut that very early indicators can be extracted from onlinesocial media (blogs, Twitter feeds, etc) to predict changesin various economic and commercial indicators. This mayconceivably also be the case for the stock market. For example,[11] shows how online chat activity predicts book sales. [12]uses assessments of blog sentiment to predict movie sales.[15] predict future product sales using a Probabilistic LatentSemantic Analysis (PLSA) model to extract indicators of

sentiment from blogs. In addition, Google search queries havebeen shown to provide early indicators of disease infectionrates and consumer spending [14]. [9] investigates the relationsbetween breaking financial news and stock price changes.Most recently [13] provide a ground-breaking demonstrationof how public sentiment related to movies, as expressed onTwitter, can actually predict box office receipts.

Although news most certainly influences stock marketprices, public mood states or sentiment may play an equallyimportant role. We know from psychological research thatemotions, in addition to information, play an significant rolein human decision-making [16], [18], [39]. Behavioral financehas provided further proof that financial decisions are sig-nificantly driven by emotion and mood [19]. It is thereforereasonable to assume that the public mood and sentiment candrive stock market values as much as news. This is supportedby recent research by [10] who extract an indicator of publicanxiety from LiveJournal posts and investigate whether itsvariations can predict S&P500 values.

However, if it is our goal to study how public moodinfluences the stock markets, we need reliable, scalable andearly assessments of the public mood at a time-scale andresolution appropriate for practical stock market prediction.Large surveys of public mood over representative samples ofthe population are generally expensive and time-consumingto conduct, cf. Gallup’s opinion polls and various consumerand well-being indices. Some have therefore proposed indirectassessment of public mood or sentiment from the results ofsoccer games [20] and from weather conditions [21]. Theaccuracy of these methods is however limited by the lowdegree to which the chosen indicators are expected to becorrelated with public mood.

Over the past 5 years significant progress has been madein sentiment tracking techniques that extract indicators ofpublic mood directly from social media content such as blogcontent [10], [12], [15], [17] and in particular large-scaleTwitter feeds [22]. Although each so-called tweet, i.e. anindividual user post, is limited to only 140 characters, theaggregate of millions of tweets submitted to Twitter at anygiven time may provide an accurate representation of publicmood and sentiment. This has led to the development of real-time sentiment-tracking indicators such as [17] and “Pulse ofNation”1.

In this paper we investigate whether public sentiment, asexpressed in large-scale collections of daily Twitter posts, canbe used to predict the stock market. We use two tools tomeasure variations in the public mood from tweets submitted

1http://www.ccs.neu.edu/home/amislove/twittermood/

arX

iv:1

010.

3003

v1 [

cs.C

E] 1

4 O

ct 2

010

?

Page 31: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 32: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

a robot education• recognize bots • use ML to recognize “good” content (to tweet and re-

tweet (using features, sentiment, etc) • detect topics in your tweet stream • analyze the twitter network to find communities,

influentials, etc

Page 33: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

social influence?

Page 34: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 35: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

trends are local

so bots must be Bostonian

little is know about exact form of trending algorithm (twitter secret sauce)

we know that there’s a focus on novelty, and that sustained growth + retweets + faves might matter

Page 36: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

how?

Page 37: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguise

how?

Page 38: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguise

how?

Page 39: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguise

how?

Page 40: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguiseuse machine learning to separate bots from humans (so we can focus on humans)

how?

Page 41: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguiseuse machine learning to separate bots from humans (so we can focus on humans)use natural language processing and ML to find quality content & topics to tweet and re-tweet

how?

Page 42: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

build convincing avatars and use high follower counts as part of your disguiseuse machine learning to separate bots from humans (so we can focus on humans)

use network analysis to explore communities surrounding existing followers, making sure bot actions reach entire communities

use natural language processing and ML to find quality content & topics to tweet and re-tweet

how?

Page 43: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

use network analysis to explore communities surrounding existing followers, making sure bot actions reach entire communities

difficult41,42, and we interpret complex contagion broadly to includehomophily; we focus on how both social reinforcement and homo-phily effects collectively boost the trapping of memes within densecommunities, not on the distinctions between them.

To examine and quantify the spreading patterns of memes, weanalyze a dataset collected from Twitter, a micro-blogging platformthat allows millions of people to broadcast short messages (‘tweets’).People can ‘follow’ others to receive their messages, forward(‘retweet’ or ‘‘RT’’ in short) tweets to their own followers, or mention(‘@’ in short) others in tweets. People often label tweets with topicalkeywords (‘hashtags’). We consider each hashtag as a meme.

ResultsCommunities and communication volume. Do memes spread likecomplex contagions in general? If social reinforcement andhomophily significantly influence the spread of memes, we expectmore communication within than across communities. Let us definethe weight w of an edge by the frequency of communication betweenthe users connected by the edge. Nodes are partitioned into densecommunities based on the structure of the network, but withoutknowledge of the weights (see Methods). For each community c,the average edge weights of intra- and inter-community links,Æw æc and Æw æc, quantify how much information flows withinand across communities, respectively. We measure weights byaggregating all the meme spreading events in our data. If memesspread obliviously to community structure, like simple contagions,we would expect no difference between intra- and inter-communitylinks. By contrast, we observe that the intra-community links carry

more messages (Fig. 2(A)). Similar results have been reported fromother datasets35,37. In addition, by defining the focus of an individualas the fraction of activity that is directed to each neighbor in the samecommunity, f , or in different communities, f , we find that peopleinteract more with members of the same community (Fig. 2(B)). Allthe results are statistically significant (p=0:001) and robust acrosscommunity detection methods (see Supplementary Information foradditional details).

Meme concentration in communities. These results suggest thatcommunities strongly trap communication. To quantify this effectfor individual memes, let us define the concentration of a meme incommunities. We expect more concentrated communication and

Figure 1 | The importance of community structure in the spreading of social contagions. (A) Structural trapping: dense communities with few outgoinglinks naturally trap information flow. (B) Social reinforcement: people who have adopted a meme (black nodes) trigger multiple exposures to others (rednodes). In the presence of high clustering, any additional adoption is likely to produce more multiple exposures than in the case of low clustering,inducing cascades of additional adoptions. (C) Homophily: people in the same community (same color nodes) are more likely to be similar and to adoptthe same ideas. (D) Diffusion structure based on retweets among Twitter users sharing the hashtag #USA. Blue nodes represent English users and rednodes are Arabic users. Node size and link weight are proportional to retweet activity. (E) Community structure among Twitter users sharing the hashtags#BBC and #FoxNews. Blue nodes represent #BBC users, red nodes are #FoxNews users, and users who have used both hashtags are green. Node size isproportional to usage (tweet) activity, links represent mutual following relations.

Figure 2 | Meme concentration in communities. We measure weightsand focus in terms of retweets (RT) or mentions (@). We show (A)community edge weight and (B) user community focus using box plots. Boxescover 50% of data and whisker cover 95%. The line and triangle in a boxrepresent the median and mean, respectively.

www.nature.com/scientificreports

SCIENTIFIC REPORTS | 3 : 2522 | DOI: 10.1038/srep02522 2

Lilian Weng, Filippo Menczer & Yong-Yeol Ahn Virality Prediction and Community Structure in Social Networks Scientific Reports, 3:2522

Page 44: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

interventions: collaboration is key

share list of followers (yy’s result/multiple exposures)

favorite everything with hashtag

coordinate timing

geotag & time w Bostontwo manual tweets per bot

retweet everything with hashtag (but don’t go crazy)

Page 45: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

interventions: collaboration is key

share list of followers (yy’s result/multiple exposures)

favorite everything with hashtaggeotag & time w Bostontwo manual tweets per bot

retweet everything with hashtag (but don’t go crazy)

Page 46: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

interventions: collaboration is key

share list of followers (yy’s result/multiple exposures)

favorite everything with hashtaggeotag & time w Bostontwo manual tweets per bot

retweet everything with hashtag (but don’t go crazy)

Page 47: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

#bostonthanks

Page 48: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

#MeInThree

coffee, coffee, coffeemath, insight, irreverenceiron & wine + white buffalo + donavon frankenreiter

red, white, and blue

[intended to work via engagement, but failed]

Page 49: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

#banksyinboston

Page 50: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

[crude photoshop work]

Page 51: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 52: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 53: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax
Page 54: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

real world action

“discovery of existing art”

work to verify rumors (similar to london riots)

great investigative journalism by @abtran

Page 55: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

work to verify rumors (similar to london riots)

Page 56: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

lessons

surprisingly easy to get followers, impersonate humans using crude meansin some areas Twitter is very different from what we experience

Page 57: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

perspectivesprobably no real influence (“stickyness” still central)

bots can make a difference

Page 58: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

perspectivesprobably no real influence (“stickyness” still central)

bots can make a difference

frightening/dystopian perspectives

Page 59: Sune Lehman: #socialbots and how we created the Boston Banksy Hoax

[thanks!]