twitter stream processing

of 26/26
Twitter stream processing Colin Surprenant @colinsurprenant Lead ninja Big Data Montreal April 2012

Post on 10-May-2015




3 download

Embed Size (px)


Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github and live demo


  • 1.Twitterstream processingColin SurprenantBig Data Montreal @colinsurprenantApril 2012Lead ninja

2. Twitter - Spring 2012 350 million tweets/day 140 million active users >1 millionapplications using API 3. Daily Twitter @ Needium50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data 4. Anatomy of a tweetWhats in a tweet?timestamp usermessageavatarAnything else? 5. How to get the tweets? Streaming APISubscribe to realtime feeds moving forward REST Search APISearch request on past data (1 week) 6. Streaming APIpublic statuses from all users status/filtertrack/location/follow 5000 follow user ids 400 track keywords 25 location boxes rate limited status/sample 1% of all public statuses (message id mod 100) two status/sample streams will result in same data 7. Streaming APIper user streams User Streams all data required to update a users display requires users OAuth token statuses from followings, direct messages, mentions cannot open large number of user streams from same host Site Streams multiplexing of multiple User Streams 8. Streaming APIFirehoseneed more/full data? only through partners filtering/tracking $$$partial to full FirehoseWhats the catch? lots of it 9. Search API REST API (http request/response) search query geocode (lat, long, radius) result type (mixed/recent/popular) since id max 100 rpp and 1500 results rate limited (~1 request/sec) 10. StormDistributed and fault-tolerant realtime computation 11. Storm The promise Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers Higher level abstraction than message passing Just work 12. RedStormStorm: Java + ClosureRedStorm: JRuby integration & DSL for Storm +Ruby + Storm on JVM 13. StormTypical use casesStreamContinuousDistributedprocessingcomputationRPC 14. StormConcepts StreamsTupleTupleTupleTuple Tuple Unbounded sequence of tuples 15. StormConcepts SpoutsTupleTupleTuple Tuple Tuple Tuple TupleTupleTupleTuple Source of streams 16. StormConceptsBoltsTupleTupleTupleTupleTupleTuple Tuple Tuple Tuple TupleTupleTupleTupleTupleTuple Processes input streams and produce new streams 17. StormConceptsTopology Network of spouts and bolts 18. StormConcepts bolt A Shuffle bolt B Allbolt A bolt B random+fair distribution replication to all tasks across tasks Grouping bolt AFields bolt Bbolt A Globalbolt Bfield Xfield Y partition on specified field entire stream to single task 19. Storm What Storm does Distributes code Robust process management Monitors topologies and reassigns failed tasks Provides reliability by tracking tuple trees Routing and partitioning of stream 20. TweitgeistLive top 10 trending hashtags on TwitterDEMO demo: 21. TweitgeistArchitecturePartitionedPartitionedcounting ranking Parallel extraction MergingPartitionedPartitionedcounting rankingQueue QueueN partitions N partitions Twitter stream Frontend 22. TweitgeistArchitecture Parallel computation across partitions of the stream Scalable architecture Work for full Twitter firehose with very little mods 23. TweitgeistStorm Topology TwitterStream ExtractMessage ExtractHashtag RollingCount RankMerge Spout Bolt Bolt Bolt BoltBoltTwitterfield hashtag field hashtag UI global shuffleshuffle RedisRedis queuequeue streammessage hashtagrolling ranking merging readerextract extractcounterShuffle grouping:Tuples are randomly distributed across the bolts tasksFields grouping: The stream is partitioned by the fields specified in the groupingGlobal grouping: The entire stream goes to a single one of the bolts tasks 24. TweitgeistTopology definition 25. TweitgeistWhereCode Live demo