Mining Social Web APIs with IPython Notebook - Data Day Texas 2014

Mining Social Web APIs with IPython Notebook - Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com - Data Day Texas - 11 January 2014

Upload: matthew-russell

Post on 10-May-2015

DESCRIPTION

Slides from a 2-hour workshop at Data Day Texas 2014 on how to mine social web APIs. The workshop focused on extracting insight from Twitter data and was split into two hour-long segments. The first segment built familiarity with Twitter's API; the second used pandas to extract insight from firehose tweets collected via the Streaming API.

TRANSCRIPT

Page 1: Mining Social Web APIs with IPython Notebook - Data Day Texas 2014

Mining Social Web APIs with IPython Notebook

Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com

Data Day Texas - 11 January 2014

Page 2

Intro

Page 3

Hello, My Name Is ... Matthew

Background in Computer Science

Data mining & machine learning

CTO @ Digital Reasoning Systems

Data mining; machine learning

Author @ O'Reilly Media

5 published books on technology

Principal @ Zaffra

Selective boutique consulting

Page 4

Transforming Curiosity Into Insight

An open source software (OSS) project

http://bit.ly/MiningTheSocialWeb2E

A book

http://bit.ly/135dHfs

Accessible to (virtually) everyone

Virtual machine with turn-key coding templates for data science experiments

Think of the book as "premium" support for the OSS project

Page 5

The Social Web Is All the Rage

World population: ~7B people

Facebook: 1.15B users

Twitter: 500M users

Google+: 343M users

LinkedIn: 238M users

~200M+ blogs (conservative estimate)

Page 6

Overview

Intro (5 mins)

Module 1 - Virtual Machine & IPython Notebook Overview (10 mins)

Module 2 - Twitter Intro/Overview (45 mins)

Module 3 - Twitter Firehose Analysis with pandas (45 mins)

Module 4 - Overview of other MTSW IPython Notebooks (5 mins)

Wrap Up/Final Q&A (10 mins)

Page 7

Workshop Objective

To send you away as a social web hacker

Hands-on experience hacking on Twitter data

Empowered to walk away ready to hack on Facebook, LinkedIn, Google+, etc.

Broad working knowledge of popular social web APIs

To have fun and learn a few things

Page 8

Just a Few More Things

This workshop is...

An adaptation of Chapters 1+9 from Mining the Social Web, 2nd Edition

More of a guided hacking session where you follow along (vs a lecture)

Designed to be very hands-on, not a lecture

I'm available 24/7 this week (and beyond) to help you be successful

Page 9

Assumptions

At some point in your life, you have

Programmed with Python

Worked with JSON

Made requests and processed responses to/from web servers

Or you want to learn to do these things now...

And you're a quick learner

Page 10

Module 1: Virtual Machine Setup

Page 11

Why do you need a VM?

To save time

Because installation and configuration management is harder than it first appears

So that you can focus on the task at hand instead

So that I can support you regardless of your hardware and operating system

Page 12

But I can do all of that myself...

True...

If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages

Including scientific computing tools that require underlying C/C++ code to be compiled

Which requires specific versions of developer libraries to be installed

You get the idea...

Page 13

The Virtual Machine Experience

Vagrant

A nice abstraction around virtual machine providers

One ring to rule them all

VirtualBox, VMware, AWS, ...

IPython Notebook

The easiest way to program with Python

A better REPL (interpreter)

Great for hacking

Page 14

What happens when you vagrant up?

Vagrant follows the instructions in your Vagrantfile

Starts up a VirtualBox instance

Uses Chef to provision it

Installs OS patches/updates

Installs MTSW software dependencies

Starts IPython Notebook server on port 8888

Page 15

Why Should I Use IPython Notebook?

Because it's great for hacking

And hacking is usually the first step

Because it's great for collaboration

Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad

Think of it as "executable paper"

Page 16

Page 17

Page 18

VM Quick Start Instructions

Go to http://MiningTheSocialWeb.com/quick-start/

Follow the instructions

And watch the screencasts!

Basically:

Install VirtualBox & Vagrant

Run "vagrant up" in a terminal to start a guest VM

Then, go to http://localhost:8888 on your host machine's web browser
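
A minimal sketch for checking that last step from Python before opening the browser. It only tests whether something accepts TCP connections on port 8888 (the port named on the slides); the helper name is_listening is an illustration, not part of the MTSW project.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8888 is the port the slides give for the IPython Notebook server.
    if is_listening("localhost", 8888):
        print("Something is listening at http://localhost:8888")
    else:
        print("Nothing on port 8888 yet; 'vagrant up' may still be running")
```

A successful check only shows that the port is open, not that the notebook server finished provisioning.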

Page 19

What Could Be Easier?

A hosted version of the VM!

But only for a few hours during this workshop

Because it costs money to run these servers

Go to http://bit.ly/mtsw-ddtx14 and pick a machine

Please do not share the URLs outside of this workshop!

With a cherry on top...

Page 20

A Hosted Virtual Machine

Is it free?

Perhaps...

...Sign up for the AWS free tier at http://aws.amazon.com/free/

But not right now. Do it later

See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes

http://wp.me/p3QiJd-3T

Page 21

One More Thing

There's a new alpha product from O'Reilly Media that hosts IPython Notebooks and other software to enhance reading experiences

I can share out "invites" with any interested volunteers

Page 22

Module 2: Twitter Intro/Overview

Page 23

Objectives

Be able to identify Twitter primitives

Understand tweet metadata and how to use it

Learn how to extract entities such as user mentions, hashtags, and URLs

Apply techniques for performing frequency analysis with Python

Be able to plot histograms of Twitter data with IPython Notebook

Learn about a Twitter cookbook that you can easily adapt

Page 24

Twitter Primitives

Account Types: "Anything"

"Following" Relationships

Favorites

Retweets

Replies

(Almost) No Privacy Controls

Page 25

API Requests

RESTful requests

Everything is a "resource"

You GET, PUT, POST, and DELETE resources

Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters

JSON responses

Cursors (not quite pagination)
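
A rough sketch of two ideas from this slide: building the example GET URL, and walking a cursored resource. It assumes the v1.1 convention of starting at cursor -1 and stopping when next_cursor is 0. The helpers build_url and iterate_cursor and the canned FAKE_PAGES are made up for illustration; real requests also need OAuth credentials, which this sketch leaves out.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource: str, **params) -> str:
    """Build the URL for a GET request against a named resource."""
    query = urlencode(sorted(params.items()))
    return f"{API_ROOT}/{resource}.json" + (f"?{query}" if query else "")


def iterate_cursor(fetch_page, key):
    """Yield items from every page of a cursored resource.

    fetch_page(cursor) must return a dict holding the items under `key`
    plus a `next_cursor` field. Iteration starts at cursor -1 and ends
    when next_cursor comes back as 0.
    """
    cursor = -1
    while cursor != 0:
        page = fetch_page(cursor)
        yield from page[key]
        cursor = page["next_cursor"]


# Hypothetical stand-in for real HTTP calls: three canned pages of IDs.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 100},
    100: {"ids": [3, 4], "next_cursor": 200},
    200: {"ids": [5], "next_cursor": 0},
}

print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(list(iterate_cursor(FAKE_PAGES.__getitem__, "ids")))
```

Unlike page numbers, a cursor is an opaque token handed back by the previous response, which is why the loop cannot jump straight to "page 3".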

Page 26

Twitter is an Interest Graph

[Figure: interest graph; node labels: Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]

Page 27

What's in a Tweet?

140 Characters ...

... Plus ~5KB of metadata!

Authorship

Time & location

Tweet "entities"

Replying, retweeting, favoriting, etc.

Page 28

What are Tweet Entities?

Essentially, the "easy to get at" data in the 140 characters

@usermentions

#hashtags

URLs

multiple variations

(financial) symbols

stock tickers

media
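
A small sketch of pulling these entities out of a tweet's JSON. The field names follow the v1.1 "entities" layout (user_mentions, hashtags, urls, symbols, media); the sample tweet is hand-made for illustration and extract_entities is a hypothetical helper, not code from the book.

```python
def extract_entities(tweet: dict) -> dict:
    """Collect the readable value of each entity type found in a tweet."""
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # The media key is only present when a tweet carries media.
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Hand-made sample; a real tweet carries far more metadata.
sample_tweet = {
    "text": "Mining @SocialWebMining data with #python http://t.co/abc $AAPL",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "python"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "AAPL"}],
    },
}

for kind, values in extract_entities(sample_tweet).items():
    print(kind, values)
```

Because the API has already parsed the entities, no regular expressions over the 140 characters are needed.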

Page 29

Data Mining Is Often Just...

Counting

Comparing

Filtering

Ranking
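
One way to sketch all four operations with nothing but the standard library. The hashtag list is invented for illustration.

```python
from collections import Counter

# Invented sample: hashtags already extracted from a batch of tweets.
hashtags = ["python", "data", "Python", "pandas", "data", "python"]

# Counting: normalize case first so "Python" and "python" are one item.
counts = Counter(tag.lower() for tag in hashtags)

# Ranking: most_common() sorts items from most to least frequent.
ranked = counts.most_common()

# Filtering: keep only hashtags seen at least twice.
frequent = [(tag, n) for tag, n in ranked if n >= 2]

# Comparing: is one hashtag more popular than another?
python_beats_data = counts["python"] > counts["data"]

print(ranked)
print(frequent)
print(python_beats_data)
```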

Page 30

Histograms

A chart that is handy for frequency analysis

They look like bar charts...except they're not bar charts

Each value on the x-axis is a range (or "bin") of values

Not categorical data

Each value on the y-axis is the combined frequency of values in each range
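
A minimal sketch of the binning idea in plain Python, assuming fixed-width bins. The retweet counts are invented, and the histogram helper is illustrative; in a notebook, a plotting call such as matplotlib's hist would normally do this binning for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower edge to the count of values in [edge, edge + bin_width)."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Invented retweet counts for ten tweets.
retweet_counts = [0, 1, 3, 7, 12, 15, 18, 42, 45, 97]
bin_width = 10

# Crude text rendering: one '#' per tweet that falls in the bin.
for edge, freq in histogram(retweet_counts, bin_width).items():
    print(f"{edge:3d}-{edge + bin_width - 1:3d} | {'#' * freq}")
```

Note that this sketch omits empty bins, whereas a plotted histogram would show them as gaps along the x-axis.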

Page 31

Example: Histogram of Retweets

Page 32

Social Media Analysis Framework

A memorable four-step process to guide data science experiments:

Aspire

To test a hypothesis (answer a question)

Acquire

Get the data

Analyze

Count things

Summarize

Plot the results
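
The four steps can be sketched end to end on toy data. Everything here is an assumption for illustration: the question, the four hand-made tweets, and the function names. A text bar stands in for a real plot.

```python
from statistics import mean

# Aspire: do tweets with a hashtag get retweeted more than those without?

# Acquire: a hand-made sample stands in for data pulled from the API.
tweets = [
    {"text": "Hello #python", "retweet_count": 8, "hashtags": ["python"]},
    {"text": "Lunch time", "retweet_count": 1, "hashtags": []},
    {"text": "#pandas tips", "retweet_count": 6, "hashtags": ["pandas"]},
    {"text": "Good morning", "retweet_count": 3, "hashtags": []},
]


def analyze(tweets):
    """Analyze: average retweets for tweets with and without hashtags."""
    with_tags = [t["retweet_count"] for t in tweets if t["hashtags"]]
    without = [t["retweet_count"] for t in tweets if not t["hashtags"]]
    return {"with_hashtags": mean(with_tags), "without_hashtags": mean(without)}


def summarize(result):
    """Summarize: print one text bar per group."""
    for label, value in result.items():
        print(f"{label:>17}: {'#' * round(value)} ({value})")


summarize(analyze(tweets))
```

Four tweets prove nothing, of course; the point is only the shape of the workflow.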

Page 33

Exercises

Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook

Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook

Fill in Example 1-1 with credentials and begin work

See https://vimeo.com/79220146 for a helpful video

Execute each example sequentially

Customize queries, explore tweet metadata, count tweet entities, etc.

Explore the "Chapter 9 (Twitter Cookbook)" notebook

In particular, check out Example 9-8 (Twitter's Streaming API)
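
This is not Example 9-8 itself, just a sketch of what consuming a stream involves once the connection is open: the Streaming API delivers newline-delimited JSON, with blank keep-alive lines and non-tweet notices mixed in. An in-memory buffer stands in for the network connection, and iter_tweets is a hypothetical helper.

```python
import io
import json


def iter_tweets(lines):
    """Yield tweet dicts from an iterable of newline-delimited JSON strings."""
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are keep-alives, not data.
            continue
        message = json.loads(line)
        if "text" in message:
            # Anything without text (e.g. a delete notice) is skipped.
            yield message


# Stand-in for the live connection: two tweets, a keep-alive, a delete notice.
raw_stream = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    "\n"
    '{"delete": {"status": {"id": 99}}}\n'
    '{"id": 2, "text": "second tweet"}\n'
)

captured = list(iter_tweets(raw_stream))
print([t["text"] for t in captured])
```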

Page 34

Module 3: Twitter Firehose Analysis with pandas

Page 35

Objectives

To understand how to capture data from Twitter's firehose

To understand basic pandas usage for tweets

To work through a data science experiment with a systematic 4-step process
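
A taste of basic pandas usage on tweets, as a sketch rather than the workshop's own notebook code. The four tweets are invented, and flattening each one to three columns is a simplifying assumption; it requires pandas to be installed.

```python
import pandas as pd

# Invented tweets, trimmed to the fields used below.
tweets = [
    {"user": {"screen_name": "alice"}, "lang": "en", "retweet_count": 5},
    {"user": {"screen_name": "bob"}, "lang": "es", "retweet_count": 2},
    {"user": {"screen_name": "alice"}, "lang": "en", "retweet_count": 9},
    {"user": {"screen_name": "carol"}, "lang": "en", "retweet_count": 0},
]

# Flatten the nested JSON into one row per tweet.
df = pd.DataFrame(
    {
        "screen_name": t["user"]["screen_name"],
        "lang": t["lang"],
        "retweet_count": t["retweet_count"],
    }
    for t in tweets
)

# Counting: how many tweets per language?
lang_counts = df["lang"].value_counts()

# Ranking: the two most retweeted tweets.
top = df.sort_values("retweet_count", ascending=False).head(2)

# Comparing: average retweets per language.
mean_by_lang = df.groupby("lang")["retweet_count"].mean()

print(lang_counts)
print(top)
print(mean_by_lang)
```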

Page 36

Social Media Analysis Framework

Remember:

Aspire

Acquire

Analyze

Summarize

Page 38

Module 4: Overview of other MTSW IPython Notebooks

Page 39

Mining the Social Web ToC

Chapter 1 - Mining Twitter

Chapter 2 - Mining Facebook

Chapter 3 - Mining LinkedIn

Chapter 4 - Mining Google+

Chapter 5 - Mining Web Pages

Chapter 6 - Mining Mailboxes

Chapter 7 - Mining GitHub

Chapter 8 - Mining the Semantically Marked-Up Web

Chapter 9 - Twitter Cookbook

Page 40

A Recommendation

Bookmark http://nbviewer.ipython.org

Take note of Mining the Social Web under "Books"

Notice lots of other terrific notebooks, too

Page 41

Wrap Up / Final Q&A

Page 42

Helpful Links & Free Stuff

http://MiningTheSocialWeb.com

Mining the Social Web 2E Chapter 1 (Chimera)

http://bit.ly/13XgNWR

Source Code (GitHub)

http://bit.ly/MiningTheSocialWeb2E

http://bit.ly/1fVf5ej (numbered examples)

Screencasts (Vimeo)

http://bit.ly/mtsw2e-screencasts
