cassandra summit-2013
Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi, @dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
• Velocity
• Volume
We’re Living in a Real Time World…
• Homeland Security
• Real Time Search
• Social
• eCommerce
• User Tracking & Engagement
• Financial Services
The Flavors of Big Data Analytics
• Counting
• Correlating
• Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
– Countries and cities
– Gender
– Age groups
– Device types
– …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis: “Obama is popular”
• Trends: “People like to tweet after watching American Idol”
• Spam patterns: How can you tell when a user spams?
It’s All about Timing
• “Real time” (< few seconds)
• Reasonably quick (seconds - minutes)
• Batch (hours/days)
It’s All about Timing
• Event driven / stream processing: high resolution, every tweet gets counted
• Ad-hoc querying: medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce): low resolution (trends & patterns)
This is what we’re here to discuss
VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
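The flow on this slide can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the XAP or Cassandra API: `WriteBehindGrid`, `accept`, `enrich`, and `listeners` are hypothetical names, and a dict stands in for Cassandra.

```python
import queue
import threading


class WriteBehindGrid:
    """Toy in-memory store that persists asynchronously to a backing store."""

    def __init__(self, backing_store, accept=None, enrich=None):
        self.memory = {}                    # in-memory "grid" data
        self.backing_store = backing_store  # stand-in for Cassandra (a dict here)
        self.accept = accept or (lambda key, value: True)   # optional filter
        self.enrich = enrich or (lambda key, value: value)  # optional enrichment
        self.listeners = []                 # callbacks notified on every stored write
        self._pending = queue.Queue()       # writes waiting to reach the backing store
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, key, value):
        if not self.accept(key, value):
            return False                    # filtered out before it is stored
        value = self.enrich(key, value)
        self.memory[key] = value            # result is readable immediately
        for listener in self.listeners:
            listener(key, value)
        self._pending.put((key, value))     # persisted later, off the caller's path
        return True

    def _drain(self):
        # Background loop: move queued writes into the backing store.
        while True:
            key, value = self._pending.get()
            self.backing_store[key] = value
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has reached the backing store."""
        self._pending.join()


store = {}
seen = []
grid = WriteBehindGrid(
    store,
    accept=lambda key, value: value is not None,
    enrich=lambda key, value: {"payload": value, "source": "grid"},
)
grid.listeners.append(lambda key, value: seen.append(key))
grid.write("tweet:1", "hello")
grid.write("tweet:2", None)   # rejected by the filter
grid.flush()
```

The caller only pays for the in-memory write; the backing store catches up in the background, which is the trade the slide describes.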
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In memory data must be mapped to column families
– Configurable class to column family mapping
• Must serialize individual fields
– Fixed fields can use defined types
– Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
– By default, nested fields are flattened
– Can be overridden by custom serializer
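As a rough illustration of what flattening nested fields means, here is a minimal Python sketch. The dot-joined column names and the `flatten` helper are assumptions made for this example; the actual naming scheme used by the XAP – Cassandra interface is not stated on the slide.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested object: recurse and carry the parent name as a prefix.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


user = {
    "id": 42,
    "name": "ann",
    "address": {"city": "Austin", "geo": {"lat": 30.27, "lon": -97.74}},
}
row = flatten(user)
# row has one entry per leaf field, e.g. "address.geo.lat"
```

A custom serializer would replace the recursive branch, for example by storing the whole nested object as one serialized column.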
Virtues and Limitations
• Complete stack, not just two tools of many
• Fast
– Microsecond latencies for in memory operations
– Fast enough for almost anybody
• Highly available/self healing
• Elastic
BUT
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers
Storm Background
• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
– Cassandra, because of its write performance, is commonly used
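The shape of a streaming word count can be shown without Storm itself. The sketch below is plain Python generators standing in for a spout and two bolts; none of these function names come from Storm's API, and the `sink` dict is a hypothetical stand-in for an external store such as Cassandra.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt stand-in: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt stand-in: keeps a running count per word and emits (word, count)."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


def run_topology(sentences, sink):
    """Wire spout -> split -> count, writing each updated count to the sink."""
    counts = Counter()
    pipeline = count_bolt(split_bolt(sentence_spout(sentences)), counts)
    for word, count in pipeline:
        sink[word] = count   # persistence is delegated to whatever the sink is
    return counts


persisted = {}
counts = run_topology(["the cow jumped", "the moon"], persisted)
```

In a real topology each stage would run in parallel across a cluster, with tuples for the same word routed to the same counting task.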
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any kind of quasi-queue like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
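The replay requirement can be sketched as a source that remembers every emitted tuple until it is acknowledged. This is a simplified illustration, not Storm code: `ReplayableSpout` and `process_all` are hypothetical, and the method names only loosely follow the next-tuple / ack / fail pattern of a spout.

```python
from collections import deque


class ReplayableSpout:
    """Queue-like source that can re-emit tuples whose processing failed."""

    def __init__(self, items):
        self._queue = deque(enumerate(items))  # (message id, payload) pairs
        self._pending = {}                     # emitted but not yet acked

    def next_tuple(self):
        if not self._queue:
            return None
        msg_id, payload = self._queue.popleft()
        self._pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Processing succeeded: the tuple can be forgotten.
        self._pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: put the tuple back so it is emitted again.
        if msg_id in self._pending:
            self._queue.append((msg_id, self._pending.pop(msg_id)))


def process_all(spout, handler):
    """Drain the spout, acking successes and failing (replaying) errors."""
    processed = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, payload = emitted
        try:
            processed.append(handler(payload))
            spout.ack(msg_id)
        except Exception:
            spout.fail(msg_id)
    return processed


failed_once = set()


def flaky_upper(word):
    """Handler that fails the first time it sees "b", then succeeds."""
    if word == "b" and word not in failed_once:
        failed_once.add(word)
        raise RuntimeError("transient failure")
    return word.upper()


spout = ReplayableSpout(["a", "b", "c"])
result = process_all(spout, flaky_upper)
```

Note that the replayed tuple arrives after later ones, so downstream logic must tolerate reordering.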
Fast. Want to go faster?
• Eliminate non-memory components
• Replace the disk based queue with a reliable in-memory queue
• Replace disk based state persistence with in-memory persistence
• Asynchronously update disk based state (C*)
Sample Architecture
References
• Try the Cloudify recipe
– Download Cloudify: http://www.cloudifysource.org/
– Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
– Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Challenge – Word Count
[Diagram labels: Tweets, Count?, Word:Count]
• Hottest topics
• URL mentions
• etc.
Streaming word count with Storm
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability
• Tweets, events, whatever…
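Batching writes to a slow store is simple to sketch. The class below is a hypothetical helper written for illustration (it is not part of Storm): it buffers records and hands them to a `write_batch` callable in groups, so a slow backend sees one call per batch instead of one per record.

```python
class BatchingWriter:
    """Buffer records and hand them to a slow store in groups."""

    def __init__(self, write_batch, batch_size=3):
        self.write_batch = write_batch  # callable that persists a list of records
        self.batch_size = batch_size
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered, including a final partial batch.
        if self._buffer:
            self.write_batch(list(self._buffer))
            self._buffer.clear()


calls = []                      # each entry records one round trip to the "store"
writer = BatchingWriter(calls.append, batch_size=3)
for n in range(7):
    writer.add(n)
writer.flush()
```

Seven records reach the store in three calls here; the cost is that buffered records are lost on a crash unless the source can replay them.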
XAP Real Time Analytics
Two Layer Approach
• Advantage: minimal “impedance mismatch” between layers
– Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability
[Diagram labels: Raw Event Stream (×3), In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE]
Simplified Architecture
Key Concepts
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
– Disk: 5-10 ms
– RAM: ~0.001 ms
Take Aways
A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm
– Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
– Roll your own
• Simplify overall architecture while enhancing scalability
– Ultra high performance/low latency
– Dynamically scalable processing and in-memory storage
– Eliminate messaging tier
– Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/