Data Analytics with Apache Spark and Cassandra
#bigdatabe @maasg
BigData.be Meetup 8/Sep/2015
Gerard Maas (@maasg), Data Processing Team Lead
Agenda
• Motivation
• Sparkling Refreshment
• Quick Cassandra Overview
• Connecting the Dots . . .
• Examples
• Resources
What is Apache Spark?
Spark is a fast and general engine for large-scale distributed data processing.
Fast Functional
// `spark` is the SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Growing Ecosystem
The Big Idea...
Express computations in terms of transformations and actions on a distributed data set.
Spark Core Concept: RDD => Resilient Distributed Dataset
Think of an RDD as an immutable, distributed collection of objects
• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)
RDDs are memory-intensive. Caching behavior is controllable.
RDDs
.textFile("...")
.flatMap(l => l.split(" "))
.map(w => (w,1))
.reduceByKey(_ + _)
[Diagram: three partitions. map emits a 1 for every word; reduceByKey first sums per key inside each partition (2 4 1 1 / 2 2 2 1 / 3 1 2 1) and then merges the partial sums across partitions into the totals 7 7 5 3.]
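The two phases of reduceByKey (sum inside each partition, then merge across partitions) can be mimicked on plain Scala collections. This is only a sketch: no Spark is involved, the sample lines are invented for illustration, and `ReduceByKeySketch`, `toPairs`, `combineLocally` and `mergeAcrossPartitions` are hypothetical names, not Spark APIs.

```scala
object ReduceByKeySketch {
  // Three "partitions" of input lines; the contents are made up for illustration.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("spark and cassandra", "spark"),
    Seq("cassandra and spark"),
    Seq("and spark")
  )

  // flatMap + map: split every line into words and pair each word with 1.
  def toPairs(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(l => l.split(" ")).map(w => (w, 1))

  // Phase 1: sum the 1s per key inside a single partition (no data movement).
  def combineLocally(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // Phase 2: merge the per-partition sums per key (stands in for the shuffle).
  def mergeAcrossPartitions(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // The whole pipeline: map side per partition, then the merge.
  def wordCount: Map[String, Int] =
    mergeAcrossPartitions(partitions.map(p => combineLocally(toPairs(p))))
}
```

With this sample input, `ReduceByKeySketch.wordCount` yields spark -> 4, and -> 3, cassandra -> 2. Only the small per-partition sums cross partition boundaries, not the individual (word, 1) pairs.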
RDD Lineage
Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
val scoreRDD = wordsRDD.map{case (k,v) => (v,k)}
[Diagram: lineage chain]
HadoopRDD
MappedRDD
FlatMappedRDD
MappedRDD
MapPartitionsRDD
ShuffledRDD
MapPartitionsRDD (wordsRDD)
MappedRDD (scoreRDD)
rdd.toDebugString is your friend
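To look at a lineage yourself, something like the following should work in Spark local mode. It is a sketch under assumptions: Spark is on the classpath, an in-memory collection stands in for the HDFS file, and `LineageSketch` and `build` are hypothetical names. The RDD class names printed by toDebugString vary between Spark versions, so the output will not match the diagram exactly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Builds the same chain of transformations as the slide; nothing runs yet.
  def build(sc: SparkContext): RDD[(Int, String)] = {
    // In-memory stand-in for the HDFS file.
    val file = sc.parallelize(Seq("to be or not to be", "to do or not to do"))
    val wordsRDD = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
    wordsRDD.map { case (k, v) => (v, k) }
  }

  def main(args: Array[String]): Unit = {
    // Local mode, so no cluster is needed; the app name is arbitrary.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try {
      val scoreRDD = build(sc)
      // Transformations only record the lineage; print it.
      println(scoreRDD.toDebugString)
      // collect() is an action and triggers the actual computation.
      scoreRDD.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The indentation in the printed lineage marks the shuffle boundary introduced by reduceByKey, which is where the scheduler splits the job into stages.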
What is Apache Cassandra?
Cassandra is a distributed, high-performance, scalable, and fault-tolerant column-oriented "NoSQL" database.
Bigtable
Data Model
- wide rows, sparse arrays
- high write throughput
Dynamo
Infrastructure
- P2P gossip
- "kv" store
- Tunable consistency
Cassandra Architecture
Nodes use gossip to communicate ring state
Data is distributed over the cluster
Each node is responsible for a segment of tokens
Data is replicated to n (configurable) nodes
CREATE TABLE meetup.tweets (
  handle TEXT,
  ts TIMESTAMP,
  txt TEXT,
  PRIMARY KEY (handle, ts)
);
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441709070, 'working on my presentation');
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('peter_v', 1441721070, 'meetup tonight!!!');
INSERT INTO meetup.tweets (handle, ts, txt) VALUES ('maasg', 1441719110, 'almost ready');

Partition Key (handle)   Clustering Key (ts)   txt
maasg                    1441709070            working on my presentation
                         1441719110            almost ready
peter_v                  1441721070            meetup tonight!!!
...
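One way to picture this layout is as a map from partition key to rows kept sorted by clustering key. The sketch below models that on plain Scala collections; it is a mental model only, not how Cassandra stores data on disk, and `TweetsModelSketch`, `Table` and `insert` are hypothetical names.

```scala
import scala.collection.immutable.SortedMap

object TweetsModelSketch {
  // partition key (handle) -> rows ordered by clustering key (ts) -> txt
  type Table = Map[String, SortedMap[Long, String]]

  // Upsert one row, like a CQL INSERT: the same (handle, ts) overwrites.
  def insert(table: Table, handle: String, ts: Long, txt: String): Table = {
    val partition = table.getOrElse(handle, SortedMap.empty[Long, String])
    table.updated(handle, partition.updated(ts, txt))
  }

  // The three rows from the slides, inserted in the same order.
  val tweets: Table = Seq(
    ("maasg", 1441709070L, "working on my presentation"),
    ("peter_v", 1441721070L, "meetup tonight!!!"),
    ("maasg", 1441719110L, "almost ready")
  ).foldLeft(Map.empty[String, SortedMap[Long, String]]) {
    case (t, (h, ts, txt)) => insert(t, h, ts, txt)
  }
}
```

Both maasg rows end up in one partition, ordered by ts, which is why a query for one handle over a time range only has to read one contiguous slice.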
Cassandra Architecture
[Diagram: token ring with positions 000, 200, 400, 600, 800, wrapping at 1000. Each row is placed on the ring by hashing its partition key.]
Murmur3Hash("maasg") = 451
Murmur3Hash("peter_v") = 42
maasg 1441709070 working on my presentation
      1441719110 almost ready
peter_v 1441721070 meetup tonight!!!
[Rows that share a partition key hash to the same token, so both maasg rows are stored together.]
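The placement rule can be sketched on the toy ring from the slides. Two assumptions are baked in and labelled in the code: a node owns the token range that ends at its own token, and replicas go to the next nodes clockwise. The hash values 451 and 42 are the illustrative ones from the slides, and `TokenRingSketch`, `primaryIndex` and `replicas` are hypothetical names.

```scala
object TokenRingSketch {
  // Toy ring: five nodes at tokens 0, 200, 400, 600, 800; the ring wraps at 1000.
  val nodeTokens: Vector[Int] = Vector(0, 200, 400, 600, 800)

  // Assumption: a node owns the range (previous token, own token],
  // so a key belongs to the first node whose token is >= the key's token.
  def primaryIndex(token: Int): Int = {
    val i = nodeTokens.indexWhere(_ >= token)
    if (i == -1) 0 else i // past the last token: wrap around to the first node
  }

  // Assumption: replicas are placed on the next nodes clockwise on the ring.
  def replicas(token: Int, replicationFactor: Int): Seq[Int] = {
    require(replicationFactor >= 1 && replicationFactor <= nodeTokens.size)
    val start = primaryIndex(token)
    (0 until replicationFactor).map(k => nodeTokens((start + k) % nodeTokens.size))
  }
}
```

Under these assumptions `TokenRingSketch.replicas(451, 3)` gives the nodes at tokens 600, 800 and 0 for maasg, and `TokenRingSketch.replicas(42, 2)` gives 200 and 400 for peter_v.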
“This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.”
[Diagram: RDD partitions shown against the Cassandra token ring (000, 200, 400, 600, 800, 1000)]
cassandraTable, joinWithCassandraTable
repartitionByCassandraReplica
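A minimal sketch of the three connector calls named above, run against the meetup.tweets table from the earlier slides. It assumes the spark-cassandra-connector is on the classpath and a Cassandra node is reachable at 127.0.0.1 (both assumptions, not from the talk). `Handle` and `ConnectorSketch` are hypothetical names. The `Handle` field is named after the partition key column so the connector can map it.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Field name matches the partition key column of meetup.tweets.
case class Handle(handle: String)

object ConnectorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("connector-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed: local Cassandra node
    val sc = new SparkContext(conf)
    try {
      // Expose the table as an RDD of CassandraRow; select and where
      // are pushed down to Cassandra instead of filtering in Spark.
      val mine = sc.cassandraTable("meetup", "tweets")
        .select("ts", "txt")
        .where("handle = ?", "maasg")
      mine.collect().foreach(row => println(row.getString("txt")))

      // Move each key to a Spark partition local to a replica that holds it,
      // then fetch only the Cassandra partitions for those keys.
      val handles = sc.parallelize(Seq(Handle("maasg"), Handle("peter_v")))
      val joined = handles
        .repartitionByCassandraReplica("meetup", "tweets")
        .joinWithCassandraTable("meetup", "tweets")
      joined.collect().foreach { case (h, row) =>
        println(s"${h.handle}: ${row.getString("txt")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Repartitioning before the join is what lets the lookups stay node-local when Spark executors run on the Cassandra nodes. On a single local node it has no visible effect.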
Spark Notebook Software: https://github.com/andypetrella/spark-notebook
Meetup Notebooks: https://github.com/maasg/spark-notebooks
Resources
Project website: http://spark.apache.org/
Spark presentations: http://spark-summit.org/2015
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark
More Advanced Questions: [email protected]
Code: https://github.com/apache/spark
Getting involved: http://spark.apache.org/community.html
Resources
Project website: http://cassandra.apache.org/
Community Site: www.planetcassandra.org
Questions: http://stackoverflow.com/questions/tagged/cassandra
Training: https://academy.datastax.com/
Spark Cassandra Connector: https://github.com/datastax/spark-cassandra-connector
Excellent deep-dive in data locality implementation: http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
Resources
Spark-Notebook: https://github.com/andypetrella/spark-notebook
Meetup code: https://github.com/maasg/spark-notebooks
Slides (soon): http://www.virdata.com/category/tech/