who says what to whom on twitter

Who Says What to Whom on Twitter

Shaomei WuCornell University, [email protected]

Jake M. HofmanYahoo! Research, NY, [email protected]

Winter A. MasonYahoo! Research, NY, USAwinteram@yahoo-

inc.com

Duncan J. WattsYahoo! Research, NY, [email protected]

ABSTRACTWe study several longstanding questions in media communi-cations research, in the context of the microblogging serviceTwitter, regarding the production, ow, and consumption ofinformation. To do so, we exploit a recently introduced fea-ture of Twitter known as \lists" to distinguish between eliteusers|by which we mean celebrities, bloggers, and represen-tatives of media outlets and other formal organizations|andordinary users. Based on this classication, we nd a strik-ing concentration of attention on Twitter, in that roughly50% of URLs consumed are generated by just 20K eliteusers, where the media produces the most information, butcelebrities are the most followed. We also nd signicanthomophily within categories: celebrities listen to celebrities,while bloggers listen to bloggers etc; however, bloggers ingeneral rebroadcast more information than the other cate-gories. Next we re-examine the classical \two-step ow" the-ory of communications, nding considerable support for iton Twitter. Third, we nd that URLs broadcast by dierentcategories of users or containing dierent types of contentexhibit systematically dierent lifespans. And nally, we ex-amine the attention paid by the dierent user categories todierent news topics.

Categories and Subject DescriptorsH.1.2 [Models and Principles]: User/Machine Systems;J.4 [Social and Behavioral Sciences]: Sociology

General Termstwo-step ow, communications, classication

KeywordsCommunication networks, Twitter, information ow

1. INTRODUCTIONA longstanding objective of media communications re-

search is encapsulated by what is known as Lasswell's maxim:

Part of this research was performed while the author wasvisiting Yahoo! Research, New York. The author was alsosupported by NSF grant IIS-0910664.

Copyright is held by the International World Wide Web Conference Com-mittee (IW3C2). Distribution of these papers is limited to classroom use,and personal use by others.WWW 2011, March 28April 1, 2011, Hyderabad, India.ACM 978-1-4503-0637-9/11/03.

\who says what to whom in what channel with what ef-fect" [12], so-named for one of the pioneers of the eld,Harold Lasswell. Although simple to state, Laswell's maximhas proven dicult to answer in the more-than 60 yearssince he stated it, in part because it is generally dicult toobserve information ows in large populations, and in partbecause dierent channels have very dierent attributes andeects. As a result, theories of communications have tendedto focus either on \mass" communication, dened as \one-way message transmissions from one source to a large, rela-tively undierentiated and anonymous audience," or on \in-terpersonal" communication, meaning a \two-way messageexchange between two or more individuals." [16].Correspondingly, debates among communication theorists

have tended to revolve around the relative importance ofthese two putative modes of communication. For exam-ple, whereas early theories such as the \hypodermic needle"model posited that mass media exerted direct and relativelystrong eects on public opinion, mid-century researchers [13,9, 14, 4] argued that the mass media inuenced the pub-lic only indirectly, via what they called a two-step ow ofcommunications, where the critical intermediate layer wasoccupied by a category of media-savvy individuals calledopinion leaders. The resulting \limited eects" paradigmwas then subsequently challenged by a new generation ofresearchers [6], who claimed that the real importance of themass media lay in its ability to set the agenda of publicdiscourse. But in recent years rising public skepticism ofmass media, along with changes in media and communica-tion technology, have tilted conventional academic wisdomonce more in favor of interpersonal communication, whichsome identify as a \new era" of minimal eects [2].Recent changes in technology, however, have increasingly

undermined the validity of the mass vs. interpersonal di-chotomy itself. On the one hand, over the past few decadesmass communication has experienced a proliferation of newchannels, including cable television, satellite radio, special-ist book and magazine publishers, and of course an arrayof web-based media such as sponsored blogs, online com-munities, and social news sites. Correspondingly, the tra-ditional mass audience once associated with, say, networktelevision has fragmented into many smaller audiences, eachof which increasingly selects the information to which it isexposed, and in some cases generates the information it-self [15]. Meanwhile, in the opposite direction interpersonalcommunication has become increasingly amplied throughpersonal blogs, email lists, and social networking sites to

aord individuals ever-larger audiences. Together, thesetwo trends have greatly obscured the historical distinctionbetween mass and interpersonal communications, leadingsome scholars to refer instead to \masspersonal" communi-cations [16].A striking illustration of this erosion of traditional me-

dia categories is provided by the micro-blogging platformTwitter. For example, the top ten most-followed users onTwitter are not corporations or media organizations, butindividual people, mostly celebrities. Moreover, these indi-viduals communicate directly with their millions of followersvia their tweets, often managed by themselves or publicists,thus bypassing the traditional intermediation of the massmedia between celebrities and fans. Next, in addition toconventional celebrities, a new class of \semi-public" individ-uals like bloggers, authors, journalists, and subject matterexperts has come to occupy an important niche on Twit-ter, in some cases becoming more prominent (at least interms of number of followers) than traditional public guressuch as entertainers and elected ocials. Third, in spiteof these shifts away from centralized media power, mediaorganizations|along with corporations, governments, andNGOs|all remain well represented among highly followedusers, and are often extremely active. And nally, Twitteris primarily made up of many millions of users who seemto be ordinary individuals communicating with their friendsand acquaintances in a manner largely consistent with tra-ditional notions of interpersonal communication.Twitter, therefore, represents the full spectrum of commu-

nications from personal and private to\masspersonal"to tra-ditional mass media. Consequently it provides an interestingcontext in which to address Lasswell's maxim, especially asTwitter|unlike television, radio, and print media|enablesone to easily observe information ows among the membersof its ecosystem. Unfortunately, however, the kinds of ef-fects that are of most interest to communications theorists,such as changes in behavior, attitudes, etc., remain dicultto measure on Twitter. Therefore in this paper we limitour focus to the \who says what to whom" part of Laswell'smaxim.To this end, our paper makes three main contributions:

We introduce a method for classifying users using Twit-ter Lists into \elite" and \ordinary" users, further clas-sifying elite users into one of four categories of interest|media, celebrities, organizations, and bloggers.

We investigate the ow of information among thesecategories, nding that although audience attention ishighly concentrated on a minority of elite users, muchof the information they produce reaches the massesindirectly via a large population of intermediaries.

We nd that dierent categories of users emphasize dif-ferent types of content, and that dierent content typesexhibit dramatically dierent characteristic lifespans,ranging from less than a day to months.

The remainder of the paper proceeds as follows. In thenext section, we review related work. In Section 3 we dis-cuss our data and methods, including Section 3.3 in whichwe describe how we use Twitter Lists to classify users, out-line two dierent sampling methods, and show that theydeliver qualitatively similar results. In Section 4 we ana-lyze the production of information on Twitter, particularly

who pays attention to whom. In section 4.1, we revisit thetheory of the two-step ow|arguably the dominant theoryof communications for much of the past 50 years|ndingconsiderable support for the theory. In Section 5, we con-sider \who listens to what", examining rst who shares whatkinds of media content, and second the lifespan of URLs as afunction of their origin and their content. Finally, in Section6 we conclude with a brief discussion of future work.

2. RELATEDWORKAside from the communications literature surveyed above,

a number of recent papers have examined information dif-fusion on Twitter. Kwak et al. [11] studied the topologicalfeatures of the Twitter follower graph, concluding from thehighly skewed nature of the distribution of followers and thelow rate of reciprocated ties that Twitter more closely resem-bled an information sharing network than a social network|a conclusion that is consistent with our own view. In ad-dition, Kwak et al. compared three dierent measures ofinuence|number of followers, page-rank, and number ofretweets|nding that the ranking of the most inuentialusers diered depending on the measure. In a similar vein,Cha et al. [3] compared three measures of inuence|numberof followers, number of retweets, and number of mentions|and also found that the most followed users did not neces-sarily score highest on the other measures. Weng et al. [17]compared number of followers and page rank with a modiedpage-rank measure which accounted for topic, again ndingthat ranking depended on the inuence measure. Finally,Bakshy et al. [1] studied the distribution of retweet cascadeson Twitter, nding that although users with large followercounts and past success in triggering cascades were on aver-age more likely to trigger large cascades in the future, thesefeatures are in general poor predictors of future cascade size.Our paper diers from this earlier work by shifting atten-

tion from the ranking of individual users in terms of variousinuence measures to the ow of information among dier-ent categories of users. In this sense, it is related to recentwork by Crane and Sornette [5], who posited a mathemati-cal model of social inuence to account for observed tempo-ral patterns in the popularity of YouTube videos, and alsoto Gomez et al [7], who studied the diusion of informa-tion among blogs and online news sources. Here, however,our focus is on identifying specic categories of \elite" users,who we dierentiate from \ordinary" users in terms of theirvisibility, and understanding their role in introducing infor-mation into Twitter, as well as how information originatingfrom traditional media sources reaches the masses.

3. DATA AND METHODS

3.1 Twitter Follower GraphIn order to understand how information is transmitted on

Twitter, we need to know the channels by which it ows;that is, who is following whom on Twitter. To this end, weused the follower graph studied by Kwak et al. [11], whichincluded 42M users and 1.5B edges. This data representsa crawl of the graph seeded with all users on Twitter asobserved by July 31st, 2009, and is publicly available1. Asreported by Kwak et al. [11], the follower graph is a directed

1The data is free to download fromhttp://an.kaist.ac.kr/traces/WWW2010.html

network characterized by highly skewed distributions both ofin-degree (# followers) and out-degree (#\friends", Twitternomenclature for how many others a user follows); however,the out-degree distribution is even more skewed than thein-degree distribution. In both friend and follower distribu-tions, for example, the median is less than 100, but the max-imum # friends is several hundred thousand, while a smallnumber of users have millions of followers. In addition, thefollower graph is also characterized by extremely low reci-procity (roughly 20%)|in particular, the most-followed in-dividuals typically do not follow many others. The Twitterfollower graph, in other words, does not conform to the usualcharacteristics of social networks, which exhibit much higherreciprocity and far less skewed degree distributions [10], butinstead resembles more the mixture of one-way mass com-munications and reciprocated interpersonal communicationsdescribed above.

3.2 Twitter FirehoseIn addition to the follower graph, we are interested in the

content being shared on Twitter, and so we examined thecorpus of all 5B tweets generated over a 223 day period fromJuly 28, 2009 to March 8, 2010 using data from the Twitter\rehose," the complete stream of all tweets2. Because ourobjective is to understand the ow of information, it is use-ful for us to restrict attention to tweets containing URLs,for two reasons. First, URLs add easily identiable tags toindividual tweets, allowing us to observe when a particularpiece of content is either retweeted or subsequently reintro-duced by another user. And second, because URLs pointto online content outside of Twitter, they provide a muchricher source of variation than is possible in the typical 140character tweet 3. Finally, we note that almost all URLsbroadcast on Twitter have been shortened using one of anumber of URL shorteners, of which the most popular ishttp://bit.ly/. From the total of 5B tweets recorded duringour observation period, therefore, we focus our attention onthe subset of 260M containing bit.ly URLs; thus all subse-quent counts are implicitly understood to be restricted tothis content.

3.3 Twitter ListsOur method for classifying users exploits a relatively re-

cent feature of Twitter: Twitter Lists. Since its launch onNovember 2, 2009, Twitter Lists have been used extensivelyto group sets of users into topical or other categories, andthereby to better organize and/or lter incoming tweets. Tocreate a Twitter List, a user provides a name (required) anddescription (optional) for the list, and decides whether thenew list is public (anyone can view and subscribe to this list)or private (only the list creator can view or subscribe to thislist). Once a list is created, the user can add/edit/deletelist members. As the purpose of Twitter Lists is to helpusers organize users they follow, the name of the list canbe considered a meaningful label for the listed users. The

2http://dev.twitter.com/doc/get/statuses/rehose3Naturally, this restriction also has downsides, in particularthat some users may be more likely to include URLs in theirtweets than others, and thus will appear to be relativelymore active and/or have more impact than if we were insteadto consider all tweets. For our purposes, however, we believethat the practical advantages of the restriction outweigh thepotential for bias.

classication of users can therefore eectively exploit the\wisdom of crowds" with these created lists, both in termsof their importance to the community (number of lists onwhich they appear), and also how they are perceived (e.g.news organization vs. celebrity, etc.).Before describing our methods for classifying users in terms

of the lists on which they appear, we emphasize that weare motivated by a particular set of substantive questionsarising out of communications theory. In particular, weare interested in the relative importance of mass commu-nications, as practiced by media and other formal organiza-tions, masspersonal communications as practiced by celebri-ties and prominent bloggers, and interpersonal communica-tions, as practiced by ordinary individuals communicatingwith their friends. In addition, we are interested in the re-lationships between these categories of users, motivated bytheoretical arguments such as the theory of the two-step

ow [9]. Rather than pursuing a strategy of automatic clas-sication, therefore, our approach depends on dening andidentifying certain predetermined classes of theoretical in-terest, where both approaches have advantages and disad-vantages. In particular, we restrict our attention to fourclasses of what we call \elite" users: media, celebrities, orga-nizations, and bloggers, as well as the relationships betweenthese elite users and the much larger population of \ordi-nary" users.Analytically, our approach has some disadvantages. In

particular, by determining the categories of interest in ad-vance, we reduce the possibility of discovering unanticipatedcategories that may be of equal or greater relevance thanthose we selected. Thus although we believe that for our par-ticular purposes, the advantages of our approach|namelyconceptual clarity and ease of interpretation|outweigh thedisadvantages, automated classication methods remain aninteresting topic for future work. Finally, in addition tothese theoretically-imposed constraints, our proposed clas-sication method must also satisfy a practical constraint|namely that the rate limits established by Twitter's APIeectively preclude crawling all lists for all Twitter users4.Thus we instead devised two dierent sampling schemes|asnowball sample and an activity sample|each with someadvantages and disadvantages, discussed below.

3.3.1 Snowball sample of Twitter ListsThe rst method for identifying elite users employed snow-

ball sampling. For each category, we chose a number u0 ofseed users that were highly representative of the desired cat-egory and appeared on many category-related lists. For eachof the four categories above, the following seeds were chosen:

Celebrities: Barack Obama, Lady Gaga, Paris Hilton

Media: CNN, New York Times

Organizations: Amnesty International, WorldWildlifeFoundation, Yahoo! Inc., Whole Foods

4The Twitter API allows only 20K calls per hour, where atmost 20 lists can be retrieved for each API call. Under themodest assumption of 40M users, where each user is includedon at most 20 lists, this would require roughly 11 weeks.Clearly this time could be reduced by deploying multipleaccounts, but it also likely underestimates the real time quitesignicantly, as many users appear on many more than 20lists (e.g. Lady Gaga appears on nearly 140,000).

Blogs5: BoingBoing, FamousBloggers, problogger, mash-able. Chrisbrogan, virtuosoblogger, Gizmodo, Ileane,dragonblogger, bbrian017, hishaman, copyblogger, en-gadget, danielscocco, BlazingMinds, bloggersblog, Ty-coonBlogger, shoemoney, wchingya, extremejohn,GrowMap, kikolani, smartbloggerz, Element321, bran-donacox, remarkablogger, jsinkeywest, seosmarty, No-tAProBlog, kbloemendaal, JimiJones, ditesco

After reviewing the lists associated with these seeds, thefollowing keywords were hand-selected based on (a) theirrepresentativeness of the desired categories; and (b) theirlack of overlap between categories:

Celebrities: star, stars, hollywood, celebs, celebrity,celebrities, celebsveried, celebrity-list,celebrities-on-twitter, celebrity-tweets

Media: news, media, news-media Organizations: company, companies, organization,organisation, organizations, organisations, corporation,brands, products, charity, charities, causes, cause, ngo

Blogs: blog, blogs, blogger, bloggers

Having selected the seeds and the keywords for each cate-gory, we then performed a snowball sample of the bipartitegraph of users and lists (see Figure 1). For each seed, wecrawled all lists on which that seed appeared. The resulting\list of lists" was then pruned to contain only the l0 listswhose names matched at least one of the chosen keywordsfor that category. For instance, Lady Gaga is on lists called\faves", \celebs", and\celebrity", but only the latter two listswould be kept after pruning. We then crawled all u1 usersappearing in the pruned \list of lists" (for instance, nd-ing all users that appeared in the \celebrity" list with LadyGaga), and then repeated these last two steps to completethe crawl. In total, 524; 116 users were obtained, who ap-peared on 7; 000; 000 lists; however, many of the more promi-nent users appeared on lists in more than one category|forexample Oprah Winfrey was frequently included in lists of\celebrity" as well as \media." To resolve this ambiguity, wecomputed a user i's membership score in category c:

wic =nicNc

;

where nic is the number of lists in category c that containuser i and Nc is the total number of lists in category c.We then assigned each user to the category in which heor she had the highest membership score (i.e., belonged tothe highest fraction of the category's lists). The number ofusers assigned in this manner to each category is reportedin Table 1.

3.3.2 Activity Sample of Twitter ListsAlthough the snowball sampling method is convenient and

is easily interpretable with respect to our theoretical moti-vation, it is also potentially biased by our particular choiceof seeds. To address this concern, we also generated a sam-ple of users based on their activity. Specically, we crawled

5The blogger category required many more seeds becausebloggers are in general lower prole than the seeds for theother categories

u0

l0

u1

l1

u2

l2

Figure 1: Schematic of the Snowball SamplingMethod

Table 1: Distribution of users over categories

Snowball Sample Activity Samplecategory # of users % of users # of users % of usersceleb 82,770 15:8% 14,778 13:0%media 216,010 41:2% 40,186 35:3%org 97,853 18:7% 14,891 13:1%blog 127,483 24:3% 43,830 38:6%total 524,116 100% 113,685 100%

all lists associated with all users who tweeted at least onceevery week for our entire observation period.This \activity-based" sampling method is also clearly bi-

ased towards users who are consistently active. Importantly,however, the bias is likely to be quite dierent from any in-troduced by the snowball sample; despite these dierences,the qualitative results that follow are similar for both sam-ples, providing evidence that our ndings are not artifactsof the sampling procedures. This method initially yielded750k users and 5M lists; however, after pruning the lists tothose that contained at least one of the keywords above, andassigning users to unique categories (as described above), weobtained a rened sample of 113,685 users, where Table 1reports the number of users assigned to each category. Wenote that the number of lists obtained by the activity sam-pling methods is considerably smaller than that obtainedby the snowball sample, and that bloggers are more heav-ily represented among the activity sample at the expense ofthe other three categories|consistent with our claim thatthe two methods introduce dierent biases. Interestingly,however, 97,614 of the activity sample, or 85%, also appearin the snowball sample, suggesting that the two samplingmethods identify similar populations of elite users|as in-deed we conrm in the next section.

3.3.3 Classifying Elite UsersHaving classied users into the desired categories, we next

rened the categories to identify\elite"users within each set.In doing so, we sought to reduce the size of each categorywhile still accounting for a large fraction of content con-sumed from these categories. In addition, we xed the fourcategories to be of the same size, as categories of very dier-ent sizes would require us to draw two sets of comparisons|one on the basis of total activity/impact, the other on aper-capita basis|rather than just one. To this end, we rstranked all users in each of category by how frequently theyare listed in that category. Next, we measured the ow of in-formation from the top k users in each of the four categoriesto a random sample of 100K ordinary (i.e. unclassied) users

010

2030

4050

friends

top k

aver

age

%

1000 4000 7000 10000

celebmediaorgblog

010

2030

4050

tweets received

top k

aver

age

%

1000 4000 7000 10000

celebmediaorgblog

(a) Snowball sample

010

2030

4050

friends

top k

aver

age

%

1000 4000 7000 10000

celebmediaorgblog

010

2030

4050

tweets received

top k

aver

age

%

1000 4000 7000 10000

celebmediaorgblog

(b) Activity sample

Figure 2: Average fraction of # following (blue line)and # tweets (red line) for a random user that areaccounted for by the top K elites users crawled

in two ways: the proportion of accounts the user follows ineach category, and the proportion of tweets the user receivedfrom everyone the user follows in each category.Figures 2(a) and 2(b) show the fraction of following links

(square symbols) and tweets received (diamonds) by an av-erage user from each category, respectively. Although thenumerical values dier slightly, the two sets of results arequalitatively similar. In particular, for both sampling meth-ods, celebrities outrank all other categories, followed by themedia, organizations, and bloggers. Also in both cases, thebulk of the attention is accounted for by a relatively smallnumber of users within each category, as evidenced by therelatively at slope of the attention curves, where we notethat the curve for celebrities asymptotes more slowly thanfor the other three categories. Balancing the requirementsdescribed above, therefore, we chose k = 5000 as a cut-ofor the elite categories, where all remaining users are hence-forth classied as ordinary. Naturally, imposing categoricaldistinctions of any kind articially transforms dierences ofdegree (e.g. more or less prominent users) into dierencesof kind (\elite" vs. \ordinary"), but again we feel the in-terpretability gained by this distinction outweighs the costs.Moreover, because the choice of k = 5000 is arbitrary, wereplicated our analysis with a range of values of k, ndingqualitatively indistinguishable results. Thus, from this pointon, we restrict our analysis to the top 5,000 users in eachcategory identied by the snowball sampling method, notingthat both methods generate similar results.Based on this denition of elite users, Table 2 shows that

although ordinary users collectively introduce by far thehighest number of URLs, members of the elite categories arefar more active on a per-capita basis. In particular, usersclassied as \media" easily outproduce all other categories,

Table 2: # of URLs initiated by category

# of URLscategory # of URLs per-capitaceleb 139,058 27.81media 5,119,739 1023.94org 523,698 104.74blog 1,360,131 272.03ordinary 244,228,364 6.10

Table 3: Top 5 users in each category

Celebrity Media Org Blogaplusk cnnbrk google mashableladygaga nytimes Starbucks problogger

TheEllenShow asahi twitter kibelocotaylorswift13 BreakingNews joinred naosalvo

Oprah TIME ollehkt dooce

followed by bloggers, organizations, and celebrities. Ordi-nary users originate on average only about 6 URLs each,compared with over 1,000 for media users. In the rest ofthis paper, therefore, when we talk about \celebrity", \me-dia", \organization", \blog", we refer the top 5K users drawnfrom the snowball sample listed as \celebrity", \media", \or-ganization", \blog", respectively.Table 3, which shows the top 5 users in each of the four

categories, suggests that the sampling method yields re-sults that are consistent with our objective of identifyingusers who are prominent exemplars of our target categories.Among the celebrity list, for example, \aplusk," is the han-dle for actor Ashton Kusher, one of the rst celebrities toembrace Twitter and still one of the most followed users,while the remaining celebrity users|Lady Gaga, Ellen De-generes, Oprah Winfrey, and Taylor Swift, are all householdnames. In the media category, CNN Breaking News and theNew York Times are most prominent, followed by BreakingNews, Time, and Asahi, a leading Japanese daily newspa-per. Among organizations, Google, Starbucks, and Twit-ter are obviously large and socially prominent corporations,while JoinRed is the charity organization started by Bono ofU2, and ollehkt is the Twitter account for KT, formerly Ko-rean Telecom. Finally, among the blogging category, Mash-able and ProBlogger are both prominent US blogging sites,while Kibe Loco and Nao Salvo are popular blogs in Brazil,and dooce is the blog of Heather Armstrong, a widely read\mommy blogger" with over 1.5M followers.

4. WHO LISTENS TOWHOMThe results of the previous section provide qualied sup-

port for the conventional wisdom that audiences have be-come increasingly fragmented. Clearly, ordinary users onTwitter are receiving their information from many thou-sands of distinct sources, most of which are not traditionalmedia organizations|even though media outlets are by farthe most active users on Twitter, only about 15% of tweetsreceived by ordinary users are received directly from themedia. Equally interesting, however, is that in spite of thisfragmentation, it remains the case that 20K elite users, com-prising less than 0.05% of the user population, attract almost50% of all attention within Twitter. Thus, while attentionthat was formerly restricted to mass media channels is now

Celeb Media

Org Blog

A B

Category of Twitter Users

B receive tweets from A

% of tweets received fromCeleb Media Org Blog

Celeb 38.27 6.23 1.55 3.98Media 3.91 26.22 1.66 5.69Org 4.64 6.41 8.05 8.70Blog 4.94 3.89 1.58 22.55

Figure 3: Share of tweets received among elite cat-egories

shared amongst other \elites", information ows have notbecome egalitarian by any means.The prominence of elite users also raises the question of

how these dierent categories listen to each other. To ad-dress this issue, we compute the volume of tweets exchangedbetween elite categories. Specically, Figure 3 shows theaverage percentage of tweets that category i receives fromcategory j (indicated by edge thickness), exhibiting notice-able homophily with respect to attention: celebrities over-whelmingly pay attention to other celebrities, media actorspay attention to other media actors, and so on. The oneslight exception to this rule is that organizations pay moreattention to bloggers than to themselves. In general, in fact,attention paid by organizations is more evenly distributedacross categories than for any other category.Figure 3, it should be noted, shows only how many URLs

are received by category i from category j, a particularlyweak measure of attention for the simple reason that manytweets go unread. A stronger measure of attention, there-fore, is to consider instead only those URLs introduced bycategory i that are subsequently retweeted by category j.Figure 4 shows how much information originating from eachcategory is retweeted by other categories. As with our previ-ous measure of attention, retweeting is strongly homophilousamong elite categories; however, bloggers are disproportion-ately responsible for retweeting URLs originated by all cate-gories, issuing 93 retweets per person, compared to only 1.1retweets per person for ordinary users. This result thereforereects the conventional characterization of bloggers as re-cyclers and lters of information. Interestingly, however, wealso note that the total number of URLs retweeted by blog-gers (465k) is vastly outweighed by the number retweeted byordinary users (46M); thus in spite of the much greater per-capita activity, their overall impact is still relatively small.

4.1 Two-Step Flow of InformationExamining information ow on Twitter also sheds new

light on the theory of the two-step ow [8], arguably the the-ory that has most successfully captured the dueling impor-tance of mass media and interpersonal inuence. As we havealready noted, on Twitter the ow of information from themedia to the masses accounts for only a fraction of the totalvolume of information. Nevertheless, it is still a substantialfraction, so it is still interesting to ask: for the special caseof information originating from media sources, what propor-tion is broadcast directly to the masses, and what proportionis transmitted indirectly via some population of intermedi-

Celeb Media

Org Blog

A B

Category of Twitter Users

A retweet B

# of retweets byCeleb Media Org Blog

Celeb 4,334 1,489 1,543 5,039Media 4,624 40,263 7,628 32,027Org 1,570 2,539 18,937 11,175Blog 3,710 6,382 5,762 99,818

Figure 4: RT behavior among elite categories

aries? In addition, we may inquire whether these interme-diaries, to the extent they exist, are drawn from other elitecategories or from ordinary users, as claimed by the two-step ow theory; and if the latter, in what respects theydier from other ordinary users.Before proceeding with this analysis, we note that there

are two ways information can pass through an intermediaryin Twitter. The rst is via retweeting, which occurs whena users explicitly rebroadcasts a URL that he or she has re-ceived from a friend, along with an explicit acknowledgementof the source|either using the ocial retweet functionalityprovided by Twitter or by making use of an informal con-vention such as \RT @user" or \via @user." Alternatively,a user may tweet a URL that has previously been posted,but without acknowledgement of a source; in this case weassume the information was independently rediscovered andlabel this a \reintroduction" of content. For the purposesof studying when a user receives information directly fromthe media or indirectly through an intermediary, we treatretweets and reintroductions equivalently. If the rst occur-rence of a URL in Twitter came from a media user, but auser received the URL from another source, then that sourcecan be considered an intermediary, whether they are citingthe source within Twitter by retweeting the URL, or rein-troducing it, having discovered the URL outside of Twitter.To quantify the extent to which ordinary users get their

information indirectly versus directly from the media, wesampled 1M random ordinary users6, and for each user,counted the number n of bit.ly URLs they had received thathad originated from one of our 5K media users, where ofthe 1M total, 600K had received at least one such URL.For each member of this 600K subset we then counted thenumber n2 of these URLs that they received via non-mediafriends; that is, via a two-step ow. The average fractionn2=n = 0:46 therefore represents the proportion of media-originated content that reaches the masses via an interme-diary rather than directly. As Figure 5 shows, however,this average is somewhat misleading. In reality, the pop-ulation comprises two types|those who receive essentiallyall of their media-originating information via two-step owsand those who receive virtually all of it directly from the me-dia. Unsurprisingly, the former type is exposed to less totalmedia than the latter. What is surprising, however, is that

6As before, performing this analysis for the entire populationof over 40M ordinary users proved to be computationallyunfeasible.

10 102 103 104 105 106

0.0

0.4

0.8

random sample

# mediaoriginated URLs

indi

rect

flow

rat

io a

10 102 103 104 105 106

random sample


# us

ers

050

000

1500

00 b

10 102 103 104 105 106

0.0

0.4

0.8

intermediaries


indi

rect

flow

rat

io c

10 102 103 104 105 106

intermediaries


# us

ers

010

0000

d

Figure 5: Percentage of information that receivedvia an intermediary as a function of total volume ofmedia content to which a user is exposed

even users who received up to 100 media URLs during ourobservation period received all of them via intermediaries.Who are these intermediaries, and how many of them are

there? In total, the population of intermediaries is smallerthan that of the users who rely on them, but still surprisinglylarge, roughly 490K, the vast majority of which (484K, or99%) are classied as ordinary users, not elites. To illustratethe dierence, we note that whereas the top 20K elite userscollectively account for nearly 50% of attention, the top 10Kmost-followed ordinary users account for only 5%. Moreover,Figure 5c also shows that at least some intermediaries alsoreceive the bulk of their media content indirectly, just likeother ordinary users.Comparing Figure 5a and 5c, however, we note that in-

termediaries are not like other ordinary users in that theyare exposed to considerably more media than randomly se-lected users (9165 media-originated URLs on average vs.1377), hence the number of intermediaries who rely on two-step ows is smaller than for random users. In addition,we nd that on average intermediaries have more followersthan randomly sampled users (543 followers versus 34) andare also more active (180 tweets on average, versus 7). Fi-nally, Figure 6 shows that although all intermediaries, bydenition, pass along media content to at least one otheruser, a minority satises this function for multiple users,where we note that the most prominent intermediaries aredisproportionately drawn from the 4% of elite users|AshtonKucher (aplusk), for example, acts as an intermediary forover 100,000 users.Interestingly, these results are all broadly consistent with

the original conception of the two-step ow, advanced over50 years ago, which emphasized that opinion leaders were\distributed in all occupational groups, and on every so-cial and economic level," corresponding to our classicationof most intermediaries as ordinary [9]. The original theoryalso emphasized that opinion leaders, like their followers,also received at least some of their information via two-step

ows, but that in general they were more exposed to themedia than their followers|just as we nd here. Finally,the theory predicted that opinion leadership was not a bi-nary attribute, but rather a continuously varying one, cor-

# of twostep recipients

# of

opi

nion

lead

ers

0 10 102 103 104 1050

10

102103104105

Figure 6: Frequency of intermediaries binned by #randomly sampled users to whom they transmit me-dia content

responding to our nding that intermediaries vary widely inthe number of users for whom they act as lters and trans-mitters of media content. Given the length of time that haselapsed since the theory of the two-step ow was articulated,and the transformational changes that have taken place incommunications technology in the interim|given, in fact,that a service like Twitter was likely unimaginable at thetime|it is remarkable how well the theory agrees with ourobservations.

5. WHO LISTENS TOWHAT?The results in Section 4 demonstrate the \elite" users ac-

count for a substantial portion of all of the attention onTwitter, but also show clear dierences in how the attentionis allocated to the dierent elite categories. It is thereforeinteresting to consider what kinds of content is being sharedby these categories. Given the large number of URLs in ourobservation period (260M), and the many dierent ways onecan classify content (video vs. text, news vs. entertainment,political news vs. sports news, etc.), classifying even a smallfraction of URLs according to content is an onerous task.Bakshy et al. [1], for example, used Amazon's MechanicalTurk to classify a stratied sample of 1,000 URLs along avariety of dimensions; however, this method does not scalewell to larger sample sizes.Instead, we restricted attention to URLs originated by the

New York Times which, with over 2.5M followers, is the mostactive and the second-most-followed news organization onTwitter (after CNN Breaking News). To classify NY Timescontent, we exploited a convenient feature of their format|namely that all NY Times URLs are classied in a consistentway by the section in which they appear (e.g. U.S., World,Sports, Science, Arts, etc.) 7. Of the 6398 New York Timesbit.ly URLs we observed, 6370 could be successfully unshort-ened and assigned to one of 21 categories. Of these, how-ever, only 9 categories had more than 100 URLs during theobservation period, one of which|\NY region"|was highlyspecic to the New York metropolitan area; thus we focusedour attention on the remaining 8 topical categories. Figure7 shows the proportion of URLs from each New York Timessection retweeted or reintroduced by each category. World

7http://www.nytimes.com/year/month/day/category/title.html?ref=category

User Category

% R

Ts a

nd

Re

-in

tro

du

ctio

ns

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

1. World News

3. Business

5. Health

7. Science

blog celeb media org other

2. U.S. News

4. Sports

6. Technology

8. Arts

blog celeb media org other

Figure 7: Number of RTs and Reintroductions ofNew York Times stories by content category

news is the most popular category, followed by U.S. News,Business, and Sports, where increasingly niche categorieslike Health, Arts, Science, and Technology are less popu-lar still. In general, the overall pattern is replicated for allcategories of users, but there are some minor deviations: inparticular, organizations show disproportionately little in-terest in business and arts-related stories, and dispropor-tionately high interest in science, technology, and possiblyworld news. Celebrities, by contrast, show greater interestin sports and less interest in health, while the media showssomewhat greater interest in U.S. news stories.

5.1 Lifespan of ContentIn addition to dierent types of content, URLs introduced

by dierent types of elite users or ordinary users may exhibitdierent lifespans, by which we mean the time lag betweenthe rst and last appearance of a given URL on Twitter.Naively, measuring lifespan seems a trivial matter; how-

ever, a nite observation period|which results in censoringof our data|complicates this task. In other words, a URLthat is last observed towards the end of the observation pe-riod may be retweeted or reintroduced after the period ends,while correspondingly, a URL that is rst observed towardthe beginning of the observation window may in fact havebeen introduced before the window began. What we observeas the lifespan of a URL, therefore, is in reality a lower boundon the lifespan. Although this limitation does not createmuch of a problem for short-lived URLs|which account forthe vast majority of our observations|it does potentiallycreate large biases for long lived URLs. In particular, URLsthat appear towards the end of our observation period will

first observation

of URL

!!

last observation

of URL

!

estimation period = 133 days evaluation period = 90 days

Total observation window = 223 days

Figure 8: (a) Denition of URL lifespan (b)Schematic of lifespan estimation procedure

be systematically classied as shorter-lived than URLs thatappear towards the beginning.To address the censoring problem, we seek to determine

a buer at both the beginning and the end of our 223-day period, and only count URLs as having a lifespan of if (a) they do not appear in the rst days, (b) they rstappear in the interval between the buers, and (c) they donot appear in the last days, as illustrated in Figure 8(a).To determine we rst split the 223 day period into twosegments|the rst 133 day estimation period and the last90 day evaluation period (see Figure 8(b))|and then ask: ifwe (a) observe a URL rst appear in the rst (133 ) daysand (b) do not see it in the days prior to the onset of theevaluation period, how likely are we see it in the last 90 days?Clearly this depends on the actual lifespan of the URL, asthe longer a URL lives, the more likely it will re-appearin the future. Using this estimation/evaluation split, wend an upper-bound on lifespan for which we can determinethe actual lifespan with 95% accuracy as a function of .Finally, because we require a beginning and ending buer,and because we can only classify a URL as having lifespan if it appears at least days before the end of our window, weneed to pick and such that +2 223. We determinedthat = 70 and = 70 suciently satised our constraints;thus for the following analysis, we consider only URLs thathave a lifespan 70 8.

5.2 Lifespan By CategoryHaving established a method for estimating URL lifespan,

we now explore the lifespan of URLs introduced by dierentcategories of users, as shown in Figure 9(a). URLs initi-ated by the elite categories exhibit a similar distribution overlifespan to those initiated by ordinary users. As Figure 9(b)shows, however, when looking at the percentage of URLs ofdierent lifespans initiated by each category, we see two ad-ditional results: rst, URLs originated by media actors gen-erate a large portion of short-lived URLs (especially URLswith = 0, those that only appeared once); and second,URLs originated by bloggers are overrepresented among thelonger-lived content. Both of these results can be explainedby the type of content that originates from dierent sources:whereas news stories tend to be replaced by updates on adaily or more frequent basis, the sorts of URLs that arepicked up by bloggers are of more persistent interest, andso are more likely to be retweeted or reintroduced months

8We also performed our analysis with dierent values of ,nding very similar results; thus our conclusions are robustwith respect to the details of our estimation procedure.

0 10 20 30 40 50 60 70

lifespan (day)

# of

URL

s

0

102

104

106

108othercelebmediaorgblog

(a) Count

0 10 20 30 40 50 60 7001234567

lifespan (day)

% o

f URL

s fro

m e

lites

cat

egor

y

celebmediaorgblog

(b) Percent

Figure 9: 9(a) Count and 9(b) percentage of URLsinitiated by 4 categories, with dierent lifespans

or even years after their initial introduction. Twitter, inother words, should be viewed as a subset of a much largermedia ecosystem in which content exists and is repeatedlyrediscovered by Twitter users. Some of this content|suchas daily news stories|has a relatively short period of rel-evance, after which a given story is unlikely to be reintro-duced or rebroadcast. At the other extreme, classic musicvideos, movie clips, and long-format magazine articles havelifespans that are eectively unbounded, and can seeminglybe rediscovered by Twitter users indenitely without losingrelevance.To shed more light on the nature of long-lived content on

Twitter, we used the bit.ly API service to unshorten 35Kof the most long-lived URLs (URLs that lived at least 200days), and mapped them into 21034 web domains. As Figure10 shows, the population of long-lived URLs is dominatedby videos, music, and consumer goods. Two related pointsare illustrated by Figure 11, which shows the average RTrate (the proportion of tweets containing the URL that areretweets of another tweet) of URLs with dierent lifespans,grouped by the categories that introduced the URL9. First,for ordinary users, the majority of appearances of URLs af-ter the initial introduction derives not from retweeting, butrather from reintroduction, where this result is especiallypronounced for long-lived URLs. For the vast majority ofURLs on Twitter, in other words, longevity is determinednot by diusion, but by many dierent users independently

9Note here that URLs with lifespan = 0 are those URLsthat only appeared once in our dataset, thus the RT rate iszero.

123greetings.comsmashingmagazine.com

vimeo.comcollegehumor.com

flickr.comgoogle.comtwitter.com

facebook.commyspace.com

imdb.comecademy.com

ted.comfeedproxy.google.com

mashable.combitrebels.com

en.wikipedia.orgpollpigeon.com

amazon.comlast.fm

youtube.com

count10 102 103 104

Figure 10: Top 20 domains for URLs that lived morethan 200 days

0 10 20 30 40 50 60 700.0

0.2

0.4

0.6

0.8

1.0

lifespan (day)

RT ra

te =

#

of R

Ts

tota

l # o

f occ

urre

nces

othercelebmediaorgblog

Figure 11: Average RT rate by lifespan for each ofthe originating categories

rediscovering the same content, consistent with our inter-pretation above. Second, however, for URLs introduced byelite users, the result is somewhat the opposite|that is, theyare more likely to be retweeted than reintroduced, even forURLs that persist for weeks. Although it is unsurprisingthat elite users generate more retweets than ordinary users,the size of the dierence is nevertheless striking, and sug-gests that in spite of the dominant result above that contentlifespan is determined to a large extent by the type of con-tent, the source of its origin also impacts its persistence, atleast on average|a result that is consistent with previousndings [1].

6. CONCLUSIONSIn this paper, we investigated a classic problem in me-

dia communications research, captured by the rst part ofLaswell's maxim|\who says what to whom"|in the contextof Twitter. In particular, we nd that although audience at-tention has indeed fragmented among a wider pool of contentproducers than classical models of mass media, attention re-mains highly concentrated, where roughly 0.05% of the pop-ulation accounts for almost half of all posted URLs. Withinthis population of elite users, moreover, we nd that atten-tion is highly homophilous, with celebrities following celebri-

ties, media following media, and bloggers following bloggers.Second, we nd considerable support for the two-step owof information|almost half the information that originatesfrom the media passes to the masses indirectly via a diuseintermediate layer of opinion leaders, who although classiedas ordinary users, are more connected and more exposed tothe media than their followers. Third, we nd that althoughall categories devote a roughly similar fraction of their atten-tion to dierent categories of news (World, U.S., Business,etc), there are some dierences|organizations, for exam-ple, devote a surprisingly small fraction of their attentionto business-related news. We also nd that dierent typesof content exhibit very dierent lifespans: media-originatedURLs are disproportionately represented among short-livedURLs while those originated by bloggers tend to be over-represented among long-lived URLs. Finally, we nd thatthe longest-lived URLs are dominated by content such asvideos and music, which are continually being rediscoveredby Twitter users and appear to persist indenitely.By restricting our attention to URLs shared on Twitter,

our conclusions are necessarily limited to one narrow cross-section of the media landscape. An interesting directionfor future work would therefore be to apply similar meth-ods to quantifying information ow via more traditionalchannels, such as TV and radio on the one hand, and in-terpersonal interactions on the other hand. Moreover, al-though our approach of dening a limited set of predeter-mined user-categories allowed for relatively convenient anal-ysis and straightforward interpretation, it would be interest-ing to explore automatic classication schemes from whichadditional user categories could emerge. Finally, anothertwo areas for future work are rst, to extract content infor-mation in a more systematic manner|the \what" of Lass-well's maxim; and second, to focus more on the eects ofcommunication by merging the data regarding information

ow on Twitter with other sources of outcome data, such asthe opinions or actions of the recipients of the information.

7. REFERENCES[1] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J.

Watts. Identifying `inuencers' on twitter. In FourthACM International Conference on Web Seach andData Mining (WSDM), Hong Kong, 2011. ACM.

[2] W. L. Bennett and S. Iyengar. A new era of minimaleects? the changing foundations of politicalcommunication. Journal of Communication,58(4):707{731, 2008.

[3] M. Cha, H. Haddadi, F. Benevenuto, and K. P.Gummad. Measuring user inuence on twitter: Themillion follower fallacy. In 4th Int'l AAAI Conferenceon Weblogs and Social Media, Washington, DC, 2010.

[4] J. S. Coleman, E. Katz, and H. Menzel. The diusionof an innovation among physicians. Sociometry,20(4):253{270, 1957.

[5] R. Crane and D. Sornette. Robust dynamic classesrevealed by measuring the response function of asocial system. Proceedings of the National Academy ofSciences, 105(41):15649, 2008.

[6] T. Gitlin. Media sociology: The dominant paradigm.Theory and Society, 6(2):205{253, 1978.

[7] M. Gomez Rodriguez, J. Leskovec, and A. Krause.Inferring networks of diusion and inuence. In

Proceedings of the 16th ACM SIGKDD internationalconference on Knowledge discovery and data mining,pages 1019{1028. ACM, 2010.

[8] E. Katz. The two-step ow of communication: Anup-to-date report on an hypothesis. Public OpinionQuarterly, 21(1):61{78, 1957.

[9] E. Katz and P. F. Lazarsfeld. Personal inuence; thepart played by people in the ow of masscommunications. Free Press, Glencoe, Ill.

"1955.

[10] G. Kossinets and D. J. Watts. Empirical analysis of anevolving social network. Science, 311(5757):88{90,2006.

[11] H. Kwak, C. Lee, H. Park, and S. Moon. What istwitter, a social network or a news media? InProceedings of the 19th international conference onWorld Wide Web, pages 591{600. ACM, 2010.

[12] H. D. Lasswell. The structure and function ofcommunication in society. In L. Bryson, editor, TheCommunication of Ideas, pages 117{130. University ofIllinois Press, Urbana, IL, 1948.

[13] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. Thepeople's choice; how the voter makes up his mind in apresidential campaign. Columbia University Press,New York, 3rd edition, 1968.

[14] R. K. Merton. Patterns of inuence: Local andcosmopolitan inuentials. In R. K. Merton, editor,Social theory and social structure, pages 441{474. FreePress, New York, 1968.

[15] C. Sunstein. Going to extremes: how like minds uniteand divide. Oxford University Press, USA, 2009.

[16] J. B. Walther, C. T. Carr, S. S. W. Choi, D. C.DeAndrea, J. Kim, S. T. Tong, and B. Van Der Heide.Interaction of interpersonal, peer, and media inuencesources online. In Z. Papacharissi, editor, A NetworkedSelf: Identity, Community, and Culture on SocialNetwork Sites, pages 17{38. Routledge, 2010.

[17] J. Weng, E. P. Lim, J. Jiang, and Q. He. Twitterrank:nding topic-sensitive inuential twitterers. InProceedings of the third ACM international conferenceon Web search and data mining, pages 261{270. ACM,2010.

who says what to whom on twitter

Documents