Collecting Twitter data

Post on 27-Nov-2014


DESCRIPTION

Talk held at the Royal Statistical Society in London as part of the event series "Blurring the boundaries - New social media, new social science?". I thank Grant Blank from the OII for inviting me to this exciting workshop.

TRANSCRIPT

Collecting Twitter data

Dr. Cornelius Puschmann

School of Library and Information Science, Humboldt-University of Berlin / Humboldt Institute for Internet and Society

16 April 2013

Royal Statistical Society

Overview

1. Examples of research using Twitter data

2. Twitter's data infrastructure

3. Tools for collecting data

4. Sampling issues

Examples of research using Twitter data

• Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on the World Wide Web (WWW ’10) (pp. 591–600). Raleigh, NC.

• González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The dynamics of protest recruitment through an online network. Scientific Reports, 1, 197. doi:10.1038/srep00197

• Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures and topics of a networked public sphere. Information, Communication & Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050

• Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective News and Networked Publics: The Rhythms of News Storytelling on #Egypt. Journal of Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x

Example questions

Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?

Twitter as a platform
• How can Twitter's structure be described?

Social graph
• Who follows whom?
• How does information spread?

Example questions

Prediction/application
• Can election results/flu outbreaks/consumption patterns be reliably predicted?

URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?

Creative approaches
• Where, when, and with what devices do people call taxis?

#phdchat data set (30k tweets): visualization of keywords using Gephi
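A keyword visualization like the one above needs an edge list that Gephi can import. The following is a minimal sketch, not the method used for the #phdchat data set: it assumes that two keywords are linked whenever they occur in the same tweet, and the stopword list, the length threshold, and the function names are illustrative assumptions.

```python
import csv
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; a real study would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "this", "that"}


def keywords(text):
    """Return the distinct keywords of one tweet (hashtags keep their '#')."""
    tokens = re.findall(r"#?\w+", text.lower())
    # Assumption: tokens of one or two characters carry no topical meaning.
    return sorted({t for t in tokens if t not in STOPWORDS and len(t) > 2})


def cooccurrence_edges(tweets):
    """Count how often each pair of keywords appears in the same tweet."""
    edges = Counter()
    for text in tweets:
        edges.update(combinations(keywords(text), 2))
    return edges


def write_edge_list(edges, handle):
    """Write a Source/Target/Weight table, a layout Gephi's CSV import accepts."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
```

The resulting file can be loaded as an edges table, with the weight column driving edge thickness.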

Extracting Twitter data

• Application Programming Interface (API)
• HTTP request: return all data from a given user/hashtag/geolocation/...
• Data (usually in a database or spreadsheet)

Tweet in browser

Tweet source via API
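The step from tweet source to spreadsheet can be sketched as follows. This is an illustration under assumptions: the field names (`id_str`, `created_at`, `text`, `user.screen_name`) follow the JSON tweet objects as the API returned them at the time, the sample tweet is made up, and the choice of columns is arbitrary.

```python
import csv
import json

# Columns kept in the spreadsheet; an arbitrary selection for illustration.
FIELDS = ["id_str", "created_at", "screen_name", "text"]


def flatten_tweet(raw_json):
    """Turn one JSON tweet into a flat dict with one value per column."""
    tweet = json.loads(raw_json)
    return {
        "id_str": tweet["id_str"],
        "created_at": tweet["created_at"],
        "screen_name": tweet["user"]["screen_name"],
        # Line breaks inside a tweet would split a CSV row in some tools.
        "text": tweet["text"].replace("\n", " "),
    }


def tweets_to_csv(raw_tweets, handle):
    """Write a header row followed by one row per tweet."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for raw in raw_tweets:
        writer.writerow(flatten_tweet(raw))


# Made-up example tweet, not real data.
SAMPLE = json.dumps({
    "id_str": "123456789",
    "created_at": "Tue Apr 16 10:00:00 +0000 2013",
    "text": "Collecting data\nfor my thesis #phdchat",
    "user": {"screen_name": "example_user"},
})
```

Calling `tweets_to_csv([SAMPLE], open("tweets.csv", "w", newline=""))` would produce a file that opens directly in a spreadsheet program.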

Three Twitter APIs

Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform

REST API
• traditionally used by most client software
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1

Search API
• same functionality as Twitter search
• rate-limited
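Because the Streaming API delivers data largely unprocessed, a collector has to separate tweets from other messages itself. The sketch below assumes one JSON message per line, blank keep-alive lines, deletion notices under a `delete` key, and limit notices whose `track` value is a running count of withheld tweets. These message shapes are assumptions to verify against the current API documentation, and the function name is hypothetical.

```python
import json


def read_stream(lines):
    """Split raw stream lines into tweets and a tally of non-tweet messages."""
    tweets = []
    stats = {"deletes": 0, "missed": 0}
    for line in lines:
        line = line.strip()
        if not line:
            # Assumed keep-alive: an empty line sent to hold the connection open.
            continue
        message = json.loads(line)
        if "delete" in message:
            stats["deletes"] += 1
        elif "limit" in message:
            # Assumed to be cumulative, so only the highest value is kept.
            stats["missed"] = max(stats["missed"], message["limit"].get("track", 0))
        elif "text" in message:
            tweets.append(message)
    return tweets, stats
```

Keeping the `missed` count alongside the data documents how much of the stream was never captured.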

1) data: tweets, social graph
2) complex tools needed
3) constraints on how much data can be captured

Legal issues: Twitter's terms of service

"By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed)."

"You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use."

"We encourage and permit broad re-use of Content. The Twitter API exists to enable this."

Legal issues: API rules

"You will not attempt or encourage others to: sell, rent, lease, sublicense, redistribute, or syndicate access to the Twitter API or Twitter Content to any third party without prior written approval from Twitter. If you provide an API that returns Twitter data, you may only return IDs (including tweet IDs and user IDs). You may export or extract non-programmatic, GUI-driven Twitter Content as a PDF or spreadsheet by using "save as" or similar functionality. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted."

"Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services."
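The rule that only IDs may be returned suggests reducing a collected data set to identifiers before passing it on. The following sketch shows one way this might look; the field names `id_str` and `user.id_str` are assumed from the JSON tweet format, the function name is hypothetical, and whether the result satisfies the rules has to be checked against the terms in force.

```python
def shareable_ids(tweets):
    """Keep only tweet IDs and user IDs, dropping text and all other fields."""
    return [
        {"tweet_id": tweet["id_str"], "user_id": tweet["user"]["id_str"]}
        for tweet in tweets
    ]
```

Recipients would then have to retrieve the full tweets through the API themselves.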

Tweet Archivist Desktop (Windows desktop software)

yourTwapperKeeper (runs on a dedicated web server)

140kit (hosted platform for academic research)

DataSift/Gnip (social data resellers)

Sampling approaches

Strategy #1: Sample by hashtag, keyword, user, geographical location, or other filtering parameters
- representativeness unclear on multiple levels
- time frame and parameters have to be carefully chosen

Strategy #2: Use the 1% or 10% sample provided by the Streaming API
+ generally assumed to be representative (of Twitter)
- time frame has to be carefully chosen

Strategy #3: Capture Twitter's entire throughput
+ highly representative (of Twitter)
- technically very difficult/costly
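Sampling by hashtag and time frame can be sketched as a filter over tweets that have already been collected. This is an illustration only: the `created_at` format is assumed from the JSON tweet format, the hashtag pattern is a simplification, and the time window shown is an arbitrary example.

```python
import re
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. "Tue Apr 16 10:00:00 +0000 2013".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"

# Example window: one day, start inclusive, end exclusive.
START = datetime(2013, 4, 16, tzinfo=timezone.utc)
END = datetime(2013, 4, 17, tzinfo=timezone.utc)


def hashtags(text):
    """Return the lowercased hashtags in a tweet (simplified pattern)."""
    return {tag.lower() for tag in re.findall(r"#\w+", text)}


def filter_sample(tweets, hashtag, start, end):
    """Keep tweets that carry the hashtag and fall inside the time frame."""
    sample = []
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        if hashtag.lower() in hashtags(tweet["text"]) and start <= created < end:
            sample.append(tweet)
    return sample
```

Both the hashtag and the window are explicit arguments, which makes the sampling decisions easy to report alongside the results.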

Summary

1. develop a question/general direction

2. collect data using these or other tools

3. store in a database or spreadsheet (CSV)

4. annotate, analyze and visualize using a variety of tools (Excel, Tableau, R, Gephi, NVivo, ...)
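As a small example of the analysis step, the sketch below counts the most frequent hashtags in a stored CSV file. It assumes a column named `text` holding the tweet text; the column name and the function name are illustrative.

```python
import csv
import re
from collections import Counter


def top_hashtags(handle, n=10):
    """Return the n most frequent hashtags in the 'text' column of a CSV file."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts.update(tag.lower() for tag in re.findall(r"#\w+", row["text"]))
    return counts.most_common(n)
```

The same counts could be pasted into Excel or Tableau for charting.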
