Big Data and NoSQL in Real Time
DESCRIPTION
Explains the challenge of real-time analytics in big data and NoSQL applications, with Facebook and Twitter examples.
TRANSCRIPT
![Page 1: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/1.jpg)
Big Data and NoSQL in REAL TIME
Facebook and Twitter Examples
Ron Zavner
![Page 2: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/2.jpg)
Agenda
• Our real time world…
• Flavors of Big Data
• Facebook messaging and real time analytics system
• Twitter analytics system
• Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
![Page 3: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/3.jpg)
What is Real Time?
![Page 4: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/4.jpg)
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
![Page 5: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/5.jpg)
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
![Page 6: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/6.jpg)
The Two Vs of Big Data
Velocity Volume
![Page 7: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/7.jpg)
The Flavors of Big Data Analytics
Counting Correlating Research
![Page 8: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/8.jpg)
Analytics – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
![Page 9: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/9.jpg)
Analytics – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
![Page 10: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/10.jpg)
Analytics – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
![Page 11: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/11.jpg)
It’s All about Timing
• Event driven / stream processing
  • High resolution – every tweet gets counted
• Ad-hoc querying
  • Medium resolution
• Long running batch jobs (ETL, map/reduce)
  • Low resolution (trends & patterns)
This is what we’re here to discuss
![Page 12: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/12.jpg)
FACEBOOK REAL-TIME ANALYTICS SYSTEM
![Page 13: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/13.jpg)
Store 135+ Billion Messages A Month
![Page 14: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/14.jpg)
The actual analytics...
• Like button analytics
• Comments box analytics
![Page 15: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/15.jpg)
Goals
• Show why plugins are valuable
• Make the data more actionable
• Make the data more timely
• Remove points of failure
• Handle massive load - 200K events per second
![Page 16: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/16.jpg)
Technology Evaluation
• MySQL DB Counters
• In-Memory Counters
• MapReduce
• Cassandra
• HBase
![Page 17: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/17.jpg)
The solution..
[Architecture diagram. Components: Facebook, Log (x3), Scribe, HDFS, PTail, Puma, HBase. Labels: Real Time, Long Term, Batch 1.5 sec, 10,000 writes/sec per server.]
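The "Batch 1.5 Sec" and "10,000 write/sec per server" labels on this slide suggest that events are aggregated in memory and written out in short batches rather than with one write per event. A minimal sketch of that idea follows. The `BatchingCounter` class and the caller-supplied `flush_fn` (standing in for a write to the store) are hypothetical names that do not come from the slides, and this is not Facebook's actual code.

```python
import time
from collections import Counter


class BatchingCounter:
    """Aggregates increments in memory and flushes them as one batch.

    Hypothetical illustration: flush_fn stands in for a write to a
    store such as HBase; the 1.5 s default mirrors the slide label.
    """

    def __init__(self, flush_fn, interval_sec=1.5, clock=time.monotonic):
        self._flush_fn = flush_fn
        self._interval = interval_sec
        self._clock = clock          # injectable so the batching can be tested
        self._pending = Counter()
        self._last_flush = clock()

    def increment(self, key, amount=1):
        # Update the in-memory counter, then flush if the batch window elapsed.
        self._pending[key] += amount
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        # Write all pending counts in a single call; skip empty batches.
        if self._pending:
            self._flush_fn(dict(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()
```

With this shape, many increments to the same key inside one window cost a single write, which is one way a server could absorb far more events than it performs store writes.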
![Page 18: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/18.jpg)
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
![Page 19: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/19.jpg)
TWITTER REAL-TIME ANALYTICS SYSTEM
![Page 20: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/20.jpg)
Twitter Reach – Here’s One Use Case
![Page 21: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/21.jpg)
Let’s start with some statistics ….
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
![Page 22: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/22.jpg)
It takes a week for users to send 1 billion Tweets.
![Page 23: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/23.jpg)
On average, 140 million tweets get sent every day.
![Page 24: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/24.jpg)
The highest throughput to date is 6,939 tweets/sec.
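The figures quoted so far (1 billion tweets a week, 140 million a day, a peak of 6,939 per second) can be cross-checked with simple arithmetic. A small sketch, where the variable names are illustrative and the inputs are the numbers on the slides:

```python
TWEETS_PER_WEEK = 1_000_000_000   # "It takes a week ... 1 billion Tweets"
TWEETS_PER_DAY = 140_000_000      # "140 million tweets get sent every day"
PEAK_TWEETS_PER_SEC = 6_939       # "highest throughput to date"

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by the daily figure.
avg_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY
# Weekly total implied by the daily figure, to compare with the weekly claim.
weekly_from_daily = TWEETS_PER_DAY * 7
# How far the recorded peak sits above the average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_per_sec

print(f"average rate: {avg_per_sec:,.0f} tweets/sec")
print(f"weekly total from daily average: {weekly_from_daily:,}")
print(f"share of the 1 billion/week figure: {weekly_from_daily / TWEETS_PER_WEEK:.0%}")
print(f"peak / average: {peak_to_avg:.1f}x")
```

The daily and weekly figures agree to within a few percent, and the peak is several times the average rate, so a system sized only for the average would fall behind during bursts.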
![Page 25: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/25.jpg)
460,000 new accounts are created daily.
![Page 26: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/26.jpg)
5% of the users generate 75% of the content.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
![Page 27: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/27.jpg)
Challenge – Word Count
[Diagram labels: Tweets, Count?, Word:Count]
• Hottest topics
• URL mentions
• etc.
![Page 28: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/28.jpg)
Word Count - Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
![Page 29: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/29.jpg)
Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
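The pipeline on this slide can be read as three stages that pass events along: raw tweets are tokenized, tokens are filtered, and the surviving tokens are counted. A minimal single-process sketch follows. The function names and the tiny stop-word list are hypothetical, and none of them appear in the slides.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and"}  # illustrative only

# Keep URLs whole; otherwise split into word-like tokens (with optional # or @).
TOKEN_RE = re.compile(r"https?://\S+|[#@]?\w+")


def tokenize(raw_tweets):
    """Tokenizer stage: raw tweet text in, individual tokens out."""
    for tweet in raw_tweets:
        for token in TOKEN_RE.findall(tweet):
            # URLs can be case-sensitive, so only lowercase ordinary tokens.
            yield token if token.startswith("http") else token.lower()


def filter_tokens(tokens):
    """Filterer stage: drop tokens that are not worth counting."""
    for token in tokens:
        if token not in STOP_WORDS:
            yield token


def count(tokens, counters=None):
    """Counter stage: aggregate one counter per surviving token."""
    counters = Counter() if counters is None else counters
    for token in tokens:
        counters[token] += 1
    return counters


if __name__ == "__main__":
    tweets = ["The Storm is coming", "Storm and #storm http://t.co/x"]
    print(count(filter_tokens(tokenize(tweets))).most_common(3))
```

Because each stage only consumes the output of the previous one, the stages could be moved to separate processes connected by queues without changing their logic.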
![Page 30: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/30.jpg)
Sharding (Partitioning)
Tokenizer 1 Filterer 1
Tokenizer 2 Filterer 2
Tokenizer 3 Filterer 3
Tokenizer n Filterer n
Counter Updater 1
Counter Updater 2
Counter Updater 3
Counter Updater n
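Sharding the counters means each word is always routed to the same counter updater, so no two partitions update the same key. A sketch of one common routing scheme, a stable hash modulo the partition count, follows. The function names are hypothetical, and the slides do not say which partitioning function is actually used.

```python
import zlib
from collections import Counter


def partition_for(word, num_partitions):
    """Pick the partition that owns a word.

    crc32 is stable across processes, unlike Python's built-in hash()
    for strings, which is randomized per interpreter run.
    """
    return zlib.crc32(word.encode("utf-8")) % num_partitions


def route(words, num_partitions):
    """Send each word to its owning partition's counter."""
    partitions = [Counter() for _ in range(num_partitions)]
    for word in words:
        partitions[partition_for(word, num_partitions)][word] += 1
    return partitions


def merge(partitions):
    """Combine per-partition counters into a global view for querying."""
    total = Counter()
    for partition in partitions:
        total.update(partition)
    return total
```

Plain modulo routing remaps most keys when the partition count changes, so a system that must grow while running would need something more elaborate, such as consistent hashing.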
![Page 31: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/31.jpg)
Computing Reach with Event Streams
![Page 32: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/32.jpg)
Twitter Storm
![Page 33: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/33.jpg)
Twitter Storm
![Page 34: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/34.jpg)
Storm Overview
![Page 35: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/35.jpg)
Storm Cluster
![Page 36: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/36.jpg)
Streaming word count with Storm
![Page 37: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/37.jpg)
Storm Limitation
Storm concepts: Spouts, Bolts, Topologies
Limitations: Storage, Data Persistency, Querying
![Page 38: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/38.jpg)
Winner is… Storm & in-memory data grids
Event driven / flow Reliable Storage Data Persistency Querying
![Page 39: Big data and noSQL in real time](https://reader034.vdocument.in/reader034/viewer/2022051513/546d2237b4af9f8e2c8b540a/html5/thumbnails/39.jpg)
References
Facebook messages: http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Facebook real-time analytics: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics
Detailed blog post: http://bit.ly/gs-bigdata-analytics
Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
Twitter Storm: http://bit.ly/twitter-storm