twitter big data

31
Twitter Big Data Colin Surprenant @colinsurprenant Lead ninja warning: this presentation contains a few rage faces

Upload: colin-surprenant

Post on 15-Jan-2015

4.491 views

Category:

Technology


2 download

DESCRIPTION

How to access & analyse Twitter big data. Full working example using Storm and RedStorm in Ruby & JRuby. Code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/

TRANSCRIPT

Page 1: Twitter Big Data

Twitter Big Data

Colin Surprenant@colinsurprenantLead ninja warning: this presentation contains a few rage faces

Page 2: Twitter Big Data

Twitter - Spring 2012

● 350 million tweets/day● 140 million active users● >1 million applications using API

Page 3: Twitter Big Data

Daily Twitter @ Needium

50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data

Page 4: Twitter Big Data

Needium Geo

Page 5: Twitter Big Data

Anatomy of a tweet

What's in a tweet? Anything else?

avatar

usertimestamp

message

Page 6: Twitter Big Data
Page 7: Twitter Big Data

How to get the tweets?

● Streaming API

Subscribe to realtime feeds moving forward ● REST Search API

Search request on past data (1 week)

Page 8: Twitter Big Data

Streaming APIpublic statuses from all users

● status/filtertrack/location/follow○ 5000 follow user ids

○ 400 track keywords

○ 25 location boxes

○ rate limited

● status/sample○ 1% of all public statuses (message id mod 100)

○ two status/sample streams will result in same data

Page 9: Twitter Big Data

Streaming APIper user streams

● User Streams

○ all data required to update a user's display○ requires user's OAuth token○ statuses from followings, direct messages, mentions○ cannot open large number of user streams from same host

● Site Streams○ multiplexing of multiple User Streams

Page 10: Twitter Big Data

Streaming APIFirehose

need more/full data? only through partners ● gnip.com● datasift.com

● filtering/tracking

● partial to full Firehose

What's the catch?

Page 11: Twitter Big Data

Streaming APIFirehose

Base Twitter data license

$0.10 per 1000 tweets ~$1 million/monthapprox for full Firehose

Page 12: Twitter Big Data

Streaming APIFirehose

startup?

Page 13: Twitter Big Data

Search API

● REST API (http request/response)

○ search query○ geocode (lat, long, radius)○ result type (mixed/recent/popular)○ since id

● max 100 rpp and 1500 results● rate limited (~1 request/sec)

Page 14: Twitter Big Data

Twitter Geo

NO simple way to grabALL tweets for a given region

Page 15: Twitter Big Data

Twitter GeoStreaming API

● status/filter + location (bounding box)○ only tweets with explicit coordinates○ < 10% of all tweets

Page 16: Twitter Big Data

Twitter GeoStreaming API

● Firehose○ < 10% of all tweets contains explicit coordinates○ must do reverse geocoding on user profile location○ user profile location is free form

Page 17: Twitter Big Data

Twitter GeoSearch API

● geocode (lat, long, radius)● tweets with explicit coordinates● tweets reverse geocoded from user profile location

● location field: free form text (Montreal / Montreal,Qc / Mtl / Mourial)

● false positives

● REST API: not for frequent polling● rate limited (1 req/sec/ip)

Page 18: Twitter Big Data

Twitter Geo

Solutions? That's your job! But seriously?

Page 19: Twitter Big Data

Twitter Geo

● search API intelligent polling farm○ adjust polling interval to minimize polling in relation to traffic

● streaming API status/filter/follow reader farm?○ find N relevant users from city, # stream readers = N / 5000

○ must do reverse geocoding

○ user list dynamic update

● TOS gray zone

Page 20: Twitter Big Data

StormDistributed and fault-tolerant realtime computation

https://github.com/nathanmarz/storm

Page 21: Twitter Big Data

Storm

The promise ● Guaranteed data processing● Horizontal scalability● Fault-tolerance● No intermediate message brokers● Higher level abstraction than message passing● Just work

Page 22: Twitter Big Data

RedStorm

JRuby integration & DSL for Storm

Simplicity of Ruby + power of Storm

https://github.com/colinsurprenant/redstorm

+

Page 23: Twitter Big Data

StormTypical use cases

Stream processing Continuous computation

Page 24: Twitter Big Data

StormConcepts

Streams

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple

Page 25: Twitter Big Data

StormConcepts

Spouts

Source of streams

TupleTuple

TupleTuple

Tuple

TupleTuple

TupleTuple

Tuple

Page 26: Twitter Big Data

StormConcepts

Bolts

Processes input streams and produce new streams

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

Page 27: Twitter Big Data

StormConcepts

Topology

Network of spouts and bolts

Page 28: Twitter Big Data

Storm

What Storm does ● Distributes code● Robust process management● Monitors topologies and reassigns failed tasks● Provides reliability by tracking tuple trees● Routing and partitioning of stream

Page 29: Twitter Big Data

TweitgeistLive top 10 trending hashtags on Twitter

DEMO

https://github.com/colinsurprenant/tweitgeistLive demo: http://tweitgeist.needium.com

Page 30: Twitter Big Data

Tweitgeist

TwitterStreamSpout

ExtractMessageBolt

ExtractHashtagBolt

RollingCountBolt

shuf

fle

shuf

fle

field

hash

tag

RankBolt

field

hash

tag

MergeBolt

glob

al

Shuffle grouping: Tuples are randomly distributed across the bolt's tasksFields grouping: The stream is partitioned by the fields specified in the groupingGlobal grouping: The entire stream goes to a single one of the bolt's tasks

Redisqueue

Redisqueue

Twitter U

I

streamreader

messageextract

hashtagextract

rollingcounter

ranking merging

Page 31: Twitter Big Data

TweitgeistTopology definition