Mining Social Web APIs with IPython Notebook - Data Day Texas 2014
DESCRIPTION
Slides from a 2-hour workshop at Data Day Texas 2014 on how to mine social web APIs. This workshop specifically focused on extracting insight from Twitter data and was partitioned into two hour-long segments. The first segment focused on familiarity with Twitter's API, while the latter segment focused on using pandas to extract insight from tweets from the firehose via the Streaming API.
TRANSCRIPT
Mining Social Web APIs with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
Data Day Texas - 11 January 2014
Intro
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
Overview
Intro (5 mins)
Module 1 - Virtual Machine & IPython Notebook Overview (10 mins)
Module 2 - Twitter Intro/Overview (45 mins)
Module 3 - Twitter Firehose Analysis with pandas (45 mins)
Module 4 - Overview of other MTSW IPython Notebooks (5 mins)
Wrap Up/Final Q&A (10 mins)
Workshop Objective
To send you away as a social web hacker
Hands-on experience hacking on Twitter data
Empowered to walk away ready to hack on Facebook, LinkedIn, Google+, etc.
Broad working knowledge of popular social web APIs
To have fun and learn a few things
Just a Few More Things
This workshop is...
An adaptation of Chapters 1+9 from Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a lecture)
Designed to be very hands-on, not a lecture
I'm available 24/7 this week (and beyond) to help you be successful
Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers
Or you want to learn to do these things now...
And you're a quick learner
Module 1: Virtual Machine Setup
Why do you need a VM?
To save time
Because installation and configuration management is harder than it first appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating system
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
Virtualbox, VMWare, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a Virtualbox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install Virtualbox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
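If the browser shows nothing at that address, a quick TCP connection attempt tells you whether anything is listening on port 8888 at all. This is only a minimal sketch; the helper name `is_port_open` is made up here and is not part of the workshop materials.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError when nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_port_open("localhost", 8888)` should return True once the guest VM's IPython Notebook server is up, assuming the port is forwarded to the host as the quick-start steps describe.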
What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers
Go to http://bit.ly/mtsw-ddtx14 and pick a machine
Please do not share the URLs outside of this workshop!
With a cherry on top...
A Hosted Virtual Machine
Is it free?
Perhaps...
...Sign-up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later
See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes
http://wp.me/p3QiJd-3T
One More Thing
There's a new alpha product from O'Reilly Media that hosts IPython Notebooks and other software to enhance reading experiences
I can share out "invites" with any interested volunteers
Module 2: Twitter Intro/Overview
Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
Learn about a Twitter cookbook that you can easily adapt
Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
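A rough sketch of two of the ideas above: building the example GET URL, and walking a cursored resource. Cursored Twitter endpoints start at cursor -1 and report `next_cursor` as 0 on the last page. The names `build_url`, `iterate_cursor`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and the in-memory pages stand in for real API calls, which would need OAuth credentials.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build a GET URL for a REST resource such as statuses/user_timeline."""
    return "{}/{}.json?{}".format(API_ROOT, resource, urlencode(params))


def iterate_cursor(fetch_page):
    """Yield every item from a cursored resource, one page at a time.

    fetch_page(cursor) must return a dict with an "ids" list and a
    "next_cursor" value. Iteration starts at cursor -1 and stops once
    the response reports next_cursor == 0.
    """
    cursor = -1
    while cursor != 0:
        page = fetch_page(cursor)
        for item in page["ids"]:
            yield item
        cursor = page["next_cursor"]


# Hypothetical in-memory pages standing in for authenticated API responses.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 10},
    10: {"ids": [4, 5], "next_cursor": 0},
}


def fake_fetch(cursor):
    """Return the canned page for a cursor value."""
    return FAKE_PAGES[cursor]


url = build_url("statuses/user_timeline", screen_name="SocialWebMining")
all_ids = list(iterate_cursor(fake_fetch))
```

Unlike page numbers, a cursor is an opaque token: you can only ask for the next page by handing back the value the previous response gave you.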
Twitter is an Interest Graph
[Figure: interest graph diagram; node labels: Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
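Because entities arrive pre-parsed in a tweet's JSON, extracting them is mostly dictionary access. The sketch below assumes the v1.1 tweet layout, with an `entities` field holding `user_mentions`, `hashtags`, `urls`, `symbols`, and `media`. The sample tweet is hand-written with invented values, and `extract_entities` is a hypothetical helper name.

```python
# Hand-written sample shaped like a v1.1 tweet; all values are invented.
sample_tweet = {
    "text": "Slides from #DataDayTexas by @ptwobrussell http://t.co/abc $AMZN",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "DataDayTexas"}],
        "urls": [
            {"url": "http://t.co/abc",
             "expanded_url": "http://MiningTheSocialWeb.com"}
        ],
        "symbols": [{"text": "AMZN"}],
    },
}


def extract_entities(tweet):
    """Collect the entity values of one tweet into plain lists."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"]
                         for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" is only present when the tweet has attached media
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


found = extract_entities(sample_tweet)
```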
Data Mining Is Often Just...
Counting
Comparing
Filtering
Ranking
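Each of these four operations is a line or two of Python. The sketch below uses invented hashtag lists for two hypothetical accounts; none of the data comes from the workshop.

```python
from collections import Counter

# Invented hashtag lists, one inner list per tweet, for two hypothetical accounts.
hashtags_a = [["python", "data"], ["python"], ["pandas", "python"]]
hashtags_b = [["data"], ["pandas", "data"], ["ipython"]]

# Counting: how often does each hashtag appear?
counts_a = Counter(tag for tags in hashtags_a for tag in tags)
counts_b = Counter(tag for tags in hashtags_b for tag in tags)

# Comparing: which hashtags do both accounts use?
shared = set(counts_a) & set(counts_b)

# Filtering: keep only hashtags that appear more than once.
frequent_a = {tag: n for tag, n in counts_a.items() if n > 1}

# Ranking: order hashtags from most to least frequent.
ranked_a = counts_a.most_common()
```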
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
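To make the binning concrete, here is a small sketch that groups invented retweet counts into fixed-width ranges and prints a text-only chart. The `histogram` helper is hypothetical. In a notebook you would more likely pass the same values to a plotting call such as matplotlib's `plt.hist`.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower bound to the number of values in that bin.

    A value v falls in the bin [lower, lower + bin_width).
    Bins that contain no values are omitted.
    """
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Invented retweet counts for a handful of tweets.
retweet_counts = [0, 3, 7, 12, 18, 19, 25, 41, 44, 48]
bin_width = 10

binned = histogram(retweet_counts, bin_width)
for lower, freq in binned.items():
    # x-axis label is the range; bar length is the combined frequency
    print("{:>3}-{:<3} {}".format(lower, lower + bin_width - 1, "#" * freq))
```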
Example: Histogram of Retweets
Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
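One way to picture the four steps is as a tiny pipeline, one function per step. This is only a sketch under stated assumptions: the question, the tweets, and the function names are all invented, and a plain-text bar chart stands in for a real plot.

```python
from collections import Counter

# Aspire: state the question to answer (hypothetical example).
QUESTION = "Which words appear most often in these tweets?"


def acquire():
    """Acquire: get the data. Invented tweets stand in for an API call."""
    return [
        "Prime Air drones are amazing",
        "Are drones safe?",
        "Drones drones drones",
    ]


def analyze(tweets):
    """Analyze: count lowercased word tokens across all tweets."""
    return Counter(
        word.strip("?!.,").lower() for tweet in tweets for word in tweet.split()
    )


def summarize(counts, top_n=3):
    """Summarize: render the top_n counts as a plain-text bar chart."""
    return "\n".join(
        "{:<8} {}".format(word, "#" * n) for word, n in counts.most_common(top_n)
    )


word_counts = analyze(acquire())
report = summarize(word_counts)
print(QUESTION)
print(report)
```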
Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
See https://vimeo.com/79220146 for a helpful video
Execute each example sequentially
Customize queries, explore tweet metadata, count tweet entities, etc.
Explore the "Chapter 9 (Twitter Cookbook)" notebook
In particular, check out Example 9-8 (Twitter's Streaming API)
Module 3: Twitter Firehose Analysis with pandas
Objectives
To understand how to capture data from Twitter's firehose
To understand basic pandas usage for tweets
To work through a data science experiment with a systematic 4-step process
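As a rough illustration of basic pandas usage for tweets, the sketch below loads a few invented records into a DataFrame and then counts, filters, and ranks them. The column names are assumptions that mirror common tweet metadata fields. This is not the workshop's Amazon Prime Air notebook.

```python
import pandas as pd

# Invented records holding a few fields pulled from tweet metadata.
records = [
    {"screen_name": "alice", "lang": "en", "retweet_count": 12},
    {"screen_name": "bob", "lang": "es", "retweet_count": 0},
    {"screen_name": "alice", "lang": "en", "retweet_count": 3},
    {"screen_name": "carol", "lang": "en", "retweet_count": 40},
]

df = pd.DataFrame(records)

# Counting: number of tweets per language
tweets_per_lang = df["lang"].value_counts()

# Filtering: tweets that were retweeted at least once
retweeted = df[df["retweet_count"] > 0]

# Ranking: total retweets per author, highest first
retweets_by_author = (
    df.groupby("screen_name")["retweet_count"].sum().sort_values(ascending=False)
)
```

With tweets captured from the Streaming API, `records` would instead be built from each tweet's parsed JSON.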
Social Media Analysis Framework
Remember:
Aspire
Acquire
Analyze
Summarize
Understanding the Reaction to Amazon Prime Air
Open up the notebook entitled __Understanding the Reaction to Amazon Prime Air.ipynb and follow along
Or, visit http://bit.ly/mtsw-amazon-prime-air and follow along if you're just joining us
Module 4: Overview of other MTSW IPython Notebooks
Mining the Social Web ToC
Chapter 1 - Mining Twitter
Chapter 2 - Mining Facebook
Chapter 3 - Mining LinkedIn
Chapter 4 - Mining Google+
Chapter 5 - Mining Web Pages
Chapter 6 - Mining Mailboxes
Chapter 7 - Mining GitHub
Chapter 8 - Mining the Semantically Marked-Up Web
Chapter 9 - Twitter Cookbook
A Recommendation
Bookmark http://nbviewer.ipython.org
Take note of Mining the Social Web under "Books"
Notice lots of other terrific notebooks, too
Wrap Up / Final Q&A
Helpful Links & Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts