An efficient data mining solution by integrating Spark and Cassandra
DESCRIPTION
Integrating Cassandra (C*) and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than running Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS is. With Cassandra, the aim is a system that mines all the information stored in C* much more efficiently than if the information were stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.
TRANSCRIPT
![Page 2: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/2.jpg)
Hadoop?
![Page 3: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/3.jpg)
Cassandra?
![Page 4: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/4.jpg)
Spark?
![Page 5: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/5.jpg)
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell
#StratioBD
![Page 6: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/6.jpg)
![Page 7: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/7.jpg)
Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?
![Page 8: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/8.jpg)
Cassandra
• Based on Dynamo…
• Replication, Key/Value, P2P
• And based on Bigtable…
• Column oriented
![Page 9: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/9.jpg)
ROBUST FAST EFFICIENT
![Page 10: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/10.jpg)
NO BOTTLENECK REPLICATE
DECENTRALIZED
![Page 11: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/11.jpg)
Another Database?
![Page 12: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/12.jpg)
Why?
![Page 13: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/13.jpg)
One user – Lots of data
Case A
![Page 14: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/14.jpg)
Many users – Little data
Case B
![Page 15: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/15.jpg)
Many users – Lots of data
Case C
![Page 16: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/16.jpg)
Crawler app
Cassandra, I choose you
100M indexed pages
3k reads
Query time < 1s
![Page 17: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/17.jpg)
But…
![Page 18: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/18.jpg)
Marketing walks in
![Page 19: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/19.jpg)
New query
“I need to find all the references to the domain ACME. I need the answer by Friday.”
![Page 20: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/20.jpg)
Problem
Cassandra is not well suited to resolving this type of query.
You need to design the schema with the query in mind.
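To make the problem concrete, here is a minimal sketch in plain Python (not Cassandra code). It assumes a hypothetical table of crawled pages partitioned by page URL; the `pages` dictionary, its fields, and the sample URLs are invented for illustration. A lookup by partition key touches one entry, while marketing's question has no matching key and forces a scan of every row.

```python
from urllib.parse import urlparse

# Hypothetical crawler table, modelled as a dict keyed by the partition key (page URL).
pages = {
    "http://example.org/a": {"links": ["http://acme.com/products", "http://example.org/b"]},
    "http://example.org/b": {"links": ["http://other.net/"]},
    "http://other.net/": {"links": ["http://www.acme.com/", "http://example.org/a"]},
}

def get_page(url):
    """Query the schema was designed for: one lookup by partition key."""
    return pages.get(url)

def pages_referencing(domain):
    """Ad hoc query: no key helps, so every row has to be read and filtered."""
    hits = []
    for url, row in pages.items():
        for link in row["links"]:
            host = urlparse(link).hostname or ""
            if host == domain or host.endswith("." + domain):
                hits.append(url)
                break
    return sorted(hits)

print(get_page("http://example.org/b"))
print(pages_referencing("acme.com"))
```

With 100M indexed pages, the second function is the kind of full scan that calls for a distributed processing engine rather than a single client loop.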
![Page 21: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/21.jpg)
Challenge accepted
![Page 22: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/22.jpg)
What options do we have?
• Run Hive Query on top of C*
• Write an ETL script and load data into another DB
• Clone the cluster
![Page 23: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/23.jpg)
What options do we have?
Run Hive Query on top of C*
Write ETL scripts and load into another DB
Clone the cluster
![Page 24: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/24.jpg)
And now… what can we do?
“We can't solve problems by using the same kind of thinking
we used when we created them”
Albert Einstein
![Page 25: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/25.jpg)
• Alternative to MapReduce
• A low latency cluster computing system
• For very large datasets
• Created by UC Berkeley AMP Lab in 2010.
• May be 100 times faster than MapReduce for:
Iterative algorithms
Interactive data mining
Spark
![Page 26: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/26.jpg)
Logistic regression in Spark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
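Logistic regression trained by gradient descent is an iterative algorithm: every pass reads the same dataset again. The following single-machine sketch in plain Python (not Spark code; the toy data, learning rate, and function names are assumptions made for illustration) shows that access pattern. Keeping the points in memory between passes is what Spark's RDD caching offers, whereas a chain of MapReduce jobs would typically re-read its input from disk on each iteration.

```python
import math

# Toy, linearly separable data: (feature, label). Invented for illustration.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent for one-feature logistic regression."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Each iteration scans the full dataset, which stays in memory.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in data)
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
    return w, b

w, b = train(points)
predictions = [int(sigmoid(w * x + b) >= 0.5) for x, _ in points]
print(predictions)
```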
![Page 27: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/27.jpg)
WHO USES SPARK?
![Page 28: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/28.jpg)
Spark and Cassandra
Integration points
![Page 29: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/29.jpg)
Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.
Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA
![Page 30: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/30.jpg)
Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
Uses Cassandra’s new CqlPagingInputFormat
INTEGRATION POINTS: HDFS OVER CASSANDRA
![Page 31: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/31.jpg)
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity
CQL3 Integration
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
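Data locality here means that each slice of the Cassandra token ring is processed by a worker running on a node that already holds that slice. A minimal sketch of the idea in plain Python, assuming a hypothetical list of splits and workers (the host names, the `Split` type, and the `assign` function are invented and are not StratioDeep or Cassandra APIs):

```python
from collections import namedtuple

# A split is a token range plus the hosts that store replicas of it (hypothetical values).
Split = namedtuple("Split", ["start_token", "end_token", "replicas"])

splits = [
    Split(0, 99, ["node1", "node2"]),
    Split(100, 199, ["node2", "node3"]),
    Split(200, 299, ["node3", "node1"]),
    Split(300, 399, ["node4"]),
]

def assign(splits, workers):
    """Give each split to the least loaded worker holding a replica; fall back to any worker."""
    load = {worker: 0 for worker in workers}
    plan = {}
    for split in splits:
        local = [host for host in split.replicas if host in load]
        candidates = local or list(workers)
        chosen = min(candidates, key=lambda host: load[host])
        load[chosen] += 1
        plan[(split.start_token, split.end_token)] = (chosen, chosen in split.replicas)
    return plan

plan = assign(splits, ["node1", "node2", "node3"])
for token_range, (worker, is_local) in plan.items():
    print(token_range, worker, "local" if is_local else "remote")
```

Only the last split, whose single replica lives on a host with no worker, has to be read over the network in this toy setup.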
![Page 32: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/32.jpg)
CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
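The POJO mapping idea can be sketched as follows. This is a Python analogue rather than the StratioDeep Java API: a dataclass stands in for the serializable POJO, and the `Page` class, its fields, and the sample rows are assumptions invented for illustration. Rows read from a column family are turned into typed objects, and the computation is then written against those objects instead of raw columns.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    """Stand-in for a user-defined POJO mapped to one row of a column family."""
    url: str
    domain: str
    num_links: int

# Raw rows as they might come back from the store (hypothetical column names).
rows = [
    {"url": "http://acme.com/", "domain": "acme.com", "num_links": 12},
    {"url": "http://acme.com/jobs", "domain": "acme.com", "num_links": 3},
    {"url": "http://example.org/", "domain": "example.org", "num_links": 7},
]

def to_entity(row):
    """Map one raw row to a typed entity; unknown columns raise TypeError."""
    return Page(**row)

entities = [to_entity(row) for row in rows]

# The computation works on typed objects: count pages and total links per domain.
pages_per_domain = Counter(page.domain for page in entities)
links_per_domain = Counter()
for page in entities:
    links_per_domain[page.domain] += page.num_links

print(dict(pages_per_domain))
print(dict(links_per_domain))
```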
![Page 33: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/33.jpg)
Demo
![Page 34: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/34.jpg)
Drawbacks:
• Still not performing as well as we’d like: uses Cassandra’s Hadoop Interface
• No analyst-friendly interface: no SQL-like query features
CQL3 Integration (III)
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
![Page 35: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/35.jpg)
Bring the integration to another level:
• Dump Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files.
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
Future extensions
What are we currently working on?
![Page 36: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/36.jpg)
Conclusion
![Page 37: An efficient data mining solution by integrating Spark and Cassandra](https://reader033.vdocument.in/reader033/viewer/2022061223/54c2be0a4a795900628b45cc/html5/thumbnails/37.jpg)
THANKS