spark cassandra connector dataframes
TRANSCRIPT
Cassandra And Spark Dataframes
Russell Spitzer Software Engineer @ Datastax
Tungsten Gives Dataframes Off-Heap Power!
Data can be stored off-heap and compared bitwise! Code generation!
The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra
/** Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data to and scans a Cassandra table. If filterPushdown is true, it pushes down
 * some filters to CQL. */
DataFrame
CassandraSourceRelation
CassandraTableScanRDD
(source: org.apache.spark.sql.cassandra)
Configuration Can Be Done on a Per Source Level

Properties are scoped as clusterName:keyspaceName/propertyName. Example changing cluster/keyspace level properties:

val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"))
  .load()
With those settings the connector resolves properties per namespace:

Namespace: ClusterOne → spark.cassandra.input.split.size_in_mb = 32
Namespace: default, Keyspace: test → spark.cassandra.input.split.size_in_mb = 128

Reading the same table from cluster "default" picks up the default:test override instead:

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "default"))
  .load()

Reading keyspace "other" on cluster "default" matches neither override, so the Connector Default applies:

val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "other",
    "cluster" -> "default"))
  .load()
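The lookup order described above (most specific scope wins, then the connector default) can be illustrated outside Spark. This is a hypothetical sketch of the resolution rule, not the connector's actual code:

```python
# Hypothetical sketch of per-source property resolution: the most specific
# scope wins -- cluster:keyspace/property, then cluster/property, then the
# bare property name, and finally the connector default.
def resolve(conf, cluster, keyspace, prop, default):
    for key in ("%s:%s/%s" % (cluster, keyspace, prop),
                "%s/%s" % (cluster, prop),
                prop):
        if key in conf:
            return conf[key]
    return default  # the "Connector Default" case, e.g. keyspace "other"

conf = {
    "ClusterOne/spark.cassandra.input.split.size_in_mb": "32",
    "default:test/spark.cassandra.input.split.size_in_mb": "128",
}
```

With this conf, cluster "ClusterOne" resolves to 32, cluster "default" with keyspace "test" resolves to 128, and keyspace "other" falls through to the default.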
Predicate Pushdown Is Automatic!
SELECT * FROM cassandraTable WHERE clusteringKey > 100
The logical plan begins as DataFromC* → Filter (clusteringKey > 100) → Show. Catalyst hands the filter to the connector, which adds the WHERE clause "clusteringKey > 100" to the CQL it runs, so the filter is applied inside Cassandra rather than in Spark: DataFromC* AND the pushed-down CQL filter → Show.

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
What can be pushed down?
1. Only push down non-partition-key column predicates with =, >, <, >=, <=.
2. Only push down primary key column predicates with = or IN.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ. If there is only one clustering column predicate, it can be any non-IN predicate.
6. There are no pushdown predicates if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them is an equality or IN predicate.
If you can write it in CQL, it will get pushed down.
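Rule 5 is the subtle one. Here is a toy sketch of which clustering-column predicates survive pushdown under that rule; it is illustrative only, not the connector's real PredicatePushDown logic:

```python
# Toy illustration of rule 5: EQ predicates on a prefix of the clustering key
# can be pushed down, and one trailing non-EQ predicate may end that prefix.
# Predicates on columns after the first non-EQ predicate (or after a gap in
# the prefix) are not pushed down.
def pushable_clustering_predicates(clustering_columns, predicates):
    """predicates maps column name -> operator, e.g. {'c1': '=', 'c2': '>'}."""
    pushed = []
    for col in clustering_columns:
        op = predicates.get(col)
        if op is None:
            break            # gap in the clustering-key prefix: stop
        pushed.append((col, op))
        if op != '=':
            break            # one non-EQ predicate terminates the pushdown
    return pushed
```

For example, with clustering key (c1, c2, c3), `c1 = ? AND c2 > ? AND c3 = ?` pushes only the c1 and c2 predicates, and a predicate on c2 alone pushes nothing.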
What are we Pushing Down To?
CassandraTableScanRDD
All of the underlying code is the same as with sc.cassandraTable, so everything about reading and writing applies.
https://academy.datastax.com/ Watch me talk about this in the privacy of your own home!
How the Spark Cassandra Connector
Reads Data
Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: an RDD made of partitions 1 through 9, distributed across Node 1 through Node 4]
Cassandra Data is Distributed By Token Range

[Diagram: a token ring running 0 → 500 → 999 and back to 0, divided among Node 1 through Node 4. Without vnodes each node owns one contiguous slice of the ring; with vnodes each node owns many small, scattered ranges]
The Connector Uses Information on the Node to Make Spark Partitions

spark.cassandra.input.split.size_in_mb = 1
Reported density is 100 tokens per MB

[Diagram: Node 1 owns token ranges 0-50, 120-220, 300-500, and 780-830. At 100 tokens per MB, a 1 MB split should cover about 100 tokens, so the connector groups small ranges together and subdivides large ones: 300-500 is cut into 300-400 and 400-500, and the pieces are combined until four Spark partitions of roughly 100 tokens each cover the node's data]
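The grouping step can be sketched as follows. This is a simplified, hypothetical model of the partitioning logic, assuming a uniform token density; the connector's real code handles vnodes, replica placement, and non-uniform estimates:

```python
# Simplified sketch: each Spark partition should cover roughly
# split_size_mb * tokens_per_mb tokens. Large token ranges are subdivided
# and small ones grouped until each partition approaches that target.
def make_spark_partitions(ranges, split_size_mb, tokens_per_mb):
    target = split_size_mb * tokens_per_mb   # tokens per Spark partition
    partitions, current, current_tokens = [], [], 0
    for start, end in ranges:
        while end - start > 0:
            take = min(end - start, target - current_tokens)
            current.append((start, start + take))
            current_tokens += take
            start += take
            if current_tokens >= target:     # partition is full: emit it
                partitions.append(current)
                current, current_tokens = [], 0
    if current:                              # leftover tokens form a final partition
        partitions.append(current)
    return partitions
```

Running this on the ranges from the diagram (400 tokens total, 100-token target) yields four Spark partitions, each stitched together from pieces of the node's token ranges.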
spark.cassandra.input.page.row.size = 50

Data is Retrieved Using the DataStax Java Driver

[Diagram: for the Spark partition covering token ranges 0-50 and 780-830 on Node 1, the driver issues

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

and streams the results back in pages of 50 CQL rows at a time until each range is exhausted]
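The per-range queries and the paging can be mimicked in a few lines. A hedged sketch only; real fetching goes through the DataStax Java Driver, not code like this:

```python
# Sketch: build one token-range query per range, then consume results in
# pages of page_row_size rows, mirroring spark.cassandra.input.page.row.size.
def range_queries(keyspace, table, ranges):
    return [
        "SELECT * FROM %s.%s WHERE token(pk) > %d AND token(pk) <= %d"
        % (keyspace, table, start, end)
        for start, end in ranges
    ]

def paged(rows, page_row_size=50):
    # yield successive pages of at most page_row_size rows
    for i in range(0, len(rows), page_row_size):
        yield rows[i:i + page_row_size]
```

For the partition above, `range_queries("keyspace", "table", [(780, 830), (0, 50)])` produces the two SELECT statements, and `paged` chops each result set into 50-row pages.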
How The Spark Cassandra Connector
Writes Data
Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: the same RDD partitions-across-nodes picture as in the reading section]
The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs:

rdd.saveToCassandra("Keyspace", "Table")
Node 1

A Java Driver connection is made to the local node and a prepared statement is built for the target table. Batches are then built from the data in each Spark partition.

[Diagram: rows such as (1,1,1), (1,2,1), (2,1,1), (3,8,1), (3,2,1), (3,4,1), (3,5,1), (3,1,1), (1,4,1), (5,4,1), (2,4,1), (8,4,1), (9,4,1), and (3,9,1) flow from the Spark partition into the Java Driver]
By default these batches only contain CQL rows which share the same partition key.

spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5

[Diagram: rows (1,1,1) and (1,2,1) are collected into a batch for PK=1]
When an element is not part of an existing batch, a new batch is started.

[Diagram: row (2,1,1) does not match the open PK=1 batch, so a new PK=2 batch is started alongside it]
If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

[Diagram: rows (3,8,1), (3,2,1), (3,4,1), and (3,5,1) fill the PK=3 batch to batch.size.rows = 4, so it is executed, while the PK=1 and PK=2 batches stay open]
If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

[Diagram: with PK=1, PK=2, and PK=3 batches open, row (5,4,1) would start a fourth batch, exceeding batch.buffer.size = 3, so the largest open batch, PK=1, is executed and a PK=5 batch is started]
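Taken together, the grouping, batch-size, and buffer-size rules can be sketched like this. It is an illustrative model with hypothetical names, not the connector's actual writer code:

```python
# Illustrative sketch of the batching rules above: rows are grouped by
# partition key; a batch is executed when it reaches batch_size_rows, and
# when a row would open more than buffer_size batches, the largest open
# batch is executed first to make room.
def batch_rows(rows, batch_size_rows=4, buffer_size=3):
    open_batches = {}   # partition key -> rows buffered for that key
    executed = []       # batches handed to the driver, in execution order
    for row in rows:
        pk = row[0]
        if pk not in open_batches and len(open_batches) >= buffer_size:
            # buffer is full: execute the largest open batch to make room
            largest = max(open_batches, key=lambda k: len(open_batches[k]))
            executed.append(open_batches.pop(largest))
        open_batches.setdefault(pk, []).append(row)
        if len(open_batches[pk]) >= batch_size_rows:
            executed.append(open_batches.pop(pk))   # batch is full: execute it
    executed.extend(open_batches.values())          # flush the remainder
    return executed
```

Feeding in the rows from the diagrams reproduces the story above: the PK=3 batch fills to four rows and is executed first, and when (5,4,1) arrives the largest open batch (PK=1) is executed to stay within the buffer.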
If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has completed.

[Diagram: with two batches in flight (concurrent.writes = 2), the next full batch waits; once a write is acknowledged, it is sent and a new batch, PK=8, begins filling]
The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

[Diagram: even after writes are acknowledged, the next batch is blocked until the megabytes written in the trailing second drop below throughput_mb_per_sec = 5]
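The throttling rule can be sketched as a sliding-window check. A hedged sketch with hypothetical names; timestamps are passed in explicitly to keep it testable, while the real connector works against wall-clock time:

```python
# Sketch of throughput_mb_per_sec throttling: track (timestamp, megabytes)
# per completed write and block new batches while the trailing one-second
# window already holds the allowed amount or more.
class ThroughputLimiter:
    def __init__(self, mb_per_sec):
        self.mb_per_sec = mb_per_sec
        self.history = []          # (time_sec, megabytes) of recent writes

    def record(self, now, mb):
        self.history.append((now, mb))

    def should_block(self, now):
        written = sum(mb for t, mb in self.history if now - t < 1.0)
        return written >= self.mb_per_sec
```

With a 5 MB/s limit, two 3 MB writes half a second apart block further batches until the first write ages out of the one-second window.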
Thanks for coming, and I hope you have a great time at C* Summit!
http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/
Also ask these guys really hard questions: Jacek, Piotr, Alex