Collecting Twitter data
Dr. Cornelius Puschmann
School of Library and Information Science, Humboldt-University of Berlin / Humboldt Institute for Internet and Society
16 April 2013
Royal Statistical Society
Overview
1. Examples of research using Twitter data
2. Twitter's data infrastructure
3. Tools for collecting data
4. Sampling issues
Examples of research using Twitter data
• Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on World Wide Web (WWW ’10) (pp. 591–600). Raleigh, NC.
• González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The dynamics of protest recruitment through an online network. Scientific Reports, 1, 197. doi:10.1038/srep00197
• Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures and topics of a networked public sphere. Information, Communication & Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050
• Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective News and Networked Publics: The Rhythms of News Storytelling on #Egypt. Journal of Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x
Example questions
Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?
Twitter as a platform
• How can Twitter's structure be described?
Social graph
• Who follows whom?
• How does information spread?
Prediction/application
• Can election results/flu outbreaks/consumption patterns be reliably predicted?
URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?
Creative approaches
• Where, when, and with what devices do people call taxis?
#phdchat data set (30k tweets): visualization of keywords using Gephi
Extracting Twitter data
Application Programming Interface (API)
• HTTP request: return all data from a given user/hashtag/geolocation/...
• Data (usually in a database or spreadsheet)
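To make the "HTTP request" step concrete, here is a minimal sketch of how such a request URL might be assembled. It assumes the v1.1 search endpoint (`search/tweets.json`) and its `q`, `count`, and `geocode` parameters; the function name `build_search_url` is hypothetical. Authentication (v1.1 requires OAuth) and the actual network call are deliberately left out.

```python
from urllib.parse import urlencode

# Assumption: the v1.1 search endpoint; check the current API documentation.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100, geocode=None):
    """Return the URL of an HTTP GET request for tweets matching `query`.

    Hypothetical helper: it only builds the URL and sends nothing.
    """
    params = {"q": query, "count": count}
    if geocode is not None:
        # Expected form: "latitude,longitude,radius", e.g. "52.52,13.40,10km".
        params["geocode"] = geocode
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("#phdchat"))
```

The hashtag is percent-encoded by `urlencode`, so `#phdchat` is sent as `%23phdchat`.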
Tweet in browser
Tweet source via API
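The difference between the rendered tweet and its source can be illustrated with a small sketch that flattens one tweet object into a spreadsheet-style row. The JSON below is hand-written for illustration and is not a real tweet; the field names (`id_str`, `created_at`, `text`, `user.screen_name`, `entities.hashtags`) follow the v1.1 tweet format as far as I know it, and `flatten_tweet` is a hypothetical helper.

```python
import json
from datetime import datetime

# Hand-written sample for illustration only; not a real tweet.
RAW_TWEET = """
{
  "id_str": "1",
  "created_at": "Tue Apr 16 10:00:00 +0000 2013",
  "text": "Collecting Twitter data #phdchat",
  "user": {"screen_name": "example_user"},
  "entities": {"hashtags": [{"text": "phdchat"}]}
}
"""


def flatten_tweet(raw_json):
    """Turn one tweet's JSON source into a flat dictionary (one row)."""
    tweet = json.loads(raw_json)
    # Assumed timestamp layout: "Tue Apr 16 10:00:00 +0000 2013".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "id": tweet["id_str"],
        "created_at": created.isoformat(),
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "hashtags": [h["text"] for h in hashtags],
    }


if __name__ == "__main__":
    print(flatten_tweet(RAW_TWEET))
```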
Three Twitter APIs
Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform
REST API
• traditionally used by most client software
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1
Search API
• same functionality as Twitter search
• rate-limited
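Because requests are rate-limited, a collection script usually has to pause once its quota is used up. The sketch below assumes the response carries `x-rate-limit-remaining` and `x-rate-limit-reset` headers (the latter as a Unix timestamp), as v1.1 did to my knowledge, and that header names have already been lowercased. `seconds_until_reset` is a hypothetical helper, not part of any Twitter library.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before sending the next request.

    `headers` is a dict of response headers with lowercase keys
    (an assumption of this sketch). Returns 0.0 if requests remain.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("x-rate-limit-reset", now))
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - now)


if __name__ == "__main__":
    example = {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1000"}
    print(seconds_until_reset(example, now=940))
```

A collector could call `time.sleep()` with the returned value before retrying.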
1) data: tweets, social graph
2) complex tools needed
3) constraints on how much data can be captured
Legal issues: Twitter's terms of service
"By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed)."
"You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use."
"We encourage and permit broad re-use of Content. The Twitter API exists to enable this."
Legal issues: API rules
"You will not attempt or encourage others to: sell, rent, lease, sublicense, redistribute, or syndicate access to the Twitter API or Twitter Content to any third party without prior written approval from Twitter. If you provide an API that returns Twitter data, you may only return IDs (including tweet IDs and user IDs). You may export or extract non-programmatic, GUI-driven Twitter Content as a PDF or spreadsheet by using "save as" or similar functionality. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted."
"Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services."
Tweet Archivist Desktop (Windows desktop software)
yourTwapperKeeper (runs on a dedicated web server)
140kit (hosted platform for academic research)
DataSift/Gnip (social data resellers)
Sampling approaches
Strategy #1: Sample by hashtag, keyword, user, geographical location, or other filtering parameters
- representativeness unclear on multiple levels
- time frame and parameters have to be carefully chosen
Strategy #2: Use the 1% or 10% sample provided by the Streaming API
+ generally assumed to be representative (of Twitter)
- time frame has to be carefully chosen
Strategy #3: Capture Twitter's entire throughput
+ highly representative (of Twitter)
- technically very difficult/costly
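Sampling by hashtag within a chosen time frame can be sketched on already-collected tweets as follows. The tweet dictionaries, their keys (`created_at` as a timezone-aware `datetime`, `hashtags` as a list of strings), and the function `sample_by_hashtag` are assumptions made for this illustration; the data is invented.

```python
from datetime import datetime, timezone


def sample_by_hashtag(tweets, hashtag, start, end):
    """Keep tweets that carry `hashtag` and fall within [start, end)."""
    wanted = hashtag.lstrip("#").lower()
    return [
        t for t in tweets
        if start <= t["created_at"] < end
        and wanted in (h.lower() for h in t["hashtags"])
    ]


if __name__ == "__main__":
    utc = timezone.utc
    # Invented example data, not real tweets.
    tweets = [
        {"id": "1", "created_at": datetime(2013, 4, 16, 10, tzinfo=utc), "hashtags": ["PhDchat"]},
        {"id": "2", "created_at": datetime(2013, 4, 16, 11, tzinfo=utc), "hashtags": ["other"]},
        {"id": "3", "created_at": datetime(2013, 5, 1, 9, tzinfo=utc), "hashtags": ["phdchat"]},
    ]
    start = datetime(2013, 4, 1, tzinfo=utc)
    end = datetime(2013, 5, 1, tzinfo=utc)
    print([t["id"] for t in sample_by_hashtag(tweets, "#phdchat", start, end)])
```

The result depends directly on the chosen hashtag and time frame, which is why both have to be selected with care.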
Summary
1. Develop a question/general direction
2. Collect data using these or other tools
3. Store in a database or spreadsheet (CSV)
4. Annotate, analyze and visualize using a variety of tools (Excel, Tableau, R, Gephi, NVivo, ...)
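The storage step can be as simple as writing one row per tweet to a CSV file, which Excel, Tableau, and R can all read. This is a minimal sketch using Python's standard `csv` module; the column names in `FIELDS` and the function `write_tweets_csv` are choices made for this example, and the row shown is invented.

```python
import csv
import io

# Columns chosen for this example; adapt them to the research question.
FIELDS = ["id", "created_at", "user", "text"]


def write_tweets_csv(tweets, handle):
    """Write tweet dictionaries to an open text file as CSV rows."""
    # extrasaction="ignore" drops keys that are not listed in FIELDS.
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(tweets)


if __name__ == "__main__":
    # Invented example row, not a real tweet.
    rows = [{
        "id": "1",
        "created_at": "2013-04-16T10:00:00+00:00",
        "user": "example_user",
        "text": "Hello, #phdchat",
        "hashtags": ["phdchat"],
    }]
    buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
    write_tweets_csv(rows, buffer)
    print(buffer.getvalue())
```

Tweet text that contains commas or quotation marks is quoted automatically by the `csv` module, which avoids broken columns when the file is opened in a spreadsheet.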
Questions?