#CASSANDRAEU CASSANDRASUMMITEU
Richard Low | @richardalow
Mixing Batch and Real-time: Cassandra with Shark
#CASSANDRAEU @richardalow
About me
* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previously: lead Cassandra and analytics dev at Acunu
Outline
* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work
Batch analytics on real-time databases
Batch and real-time analytics
* Wherever there is data, there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or they will impact your real-time SLA
Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on the same database
* Some are write-heavy
* A good fit for Cassandra!
Example data model

CREATE TABLE user_accounts (
  userid uuid PRIMARY KEY,
  username text,
  email text,
  password text,
  last_visited timestamp,
  country text
);
Example data model

SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email             | last_visited        | password | username
----------+---------+-------------------+---------------------+----------+----------
 a03dcf03 | UK      | [email protected] | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | [email protected] | 2013-08-17 13:07:36 | moh7eksn | jean88
Marketing walks in
Ad-hoc query
“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com.
I need the answer by Monday.”
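In Hive/Shark QL, that request maps to something like the following sketch. Table and column names come from the user_accounts model above; the July cutoff and the yahoo.com pattern are read straight from the request:

```sql
-- Sketch of the ad-hoc marketing query: users from Brazil who
-- haven't logged in since July and have a yahoo.com address.
SELECT userid, username, email, last_visited
FROM user_accounts
WHERE country = 'BR'
  AND unix_timestamp(last_visited) < unix_timestamp('2013-07-01 00:00:00')
  AND email LIKE '%@yahoo.com';
```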
Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from the data model
* It's on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning

* Mostly: some of the people who haven't visited for a while may suddenly come back
Why?
* Underrepresented use case in a plethora of tools
* Seen days of dev time wasted
* Want to see what can be done
Current solutions
Options
* Run Hive query on top of Cassandra
  * Will compete with Cassandra for I/O, memory, CPU and network
  * Will cause extra GC pressure on Cassandra
  * Could flush the filesystem cache
Options
* Write an ETL script and load into another DB
  * All custom code
  * Single-threaded
  * Unreliable
  * Will still flush the cache on Cassandra nodes
Options
* Clone the cluster
  * Worst possible network load
  * Manual import each time
  * No incremental update
  * Needs duplicate hardware
Options
* Add a 'batch analytics' DC and run Hive there
  * Initial copy is slow and affects real-time performance
  * Needs duplicate hardware
  * Will drop writes when really busy
Spark and Shark
Spark
* Developed by the AMPLab (UC Berkeley)
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables:

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts WHERE country = 'BR';
Shark on Cassandra
Shark on Cassandra
* CqlStorageHandler
* Can use the existing hive-cassandra storage handler
* Can work well: see Evan Chan's (Ooyala) talk from #cassandra13
* But suffers from the same problems as Hive + Hadoop on Cassandra
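Wiring Shark to an existing Cassandra table goes through a Hive storage handler declaration along these lines. This is only a sketch: the exact handler class name and SERDE property keys vary between hive-cassandra builds, so check the one you deploy:

```sql
-- Sketch only: the handler class and property names below are
-- illustrative and depend on the hive-cassandra build in use.
CREATE EXTERNAL TABLE user_accounts
  (userid string, username string, email string,
   password string, last_visited timestamp, country string)
STORED BY 'org.apache.hadoop.hive.cassandra.CqlStorageHandler'
WITH SERDEPROPERTIES ('cassandra.host' = '127.0.0.1',
                      'cassandra.ks.name' = 'mykeyspace');
```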
Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in a separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate-limiting raw disk access
* Skip the filesystem cache
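Because Shark keeps Hive's storage-handler interface, the direct reader would plug in the same way, with only the handler class swapped. The name below is hypothetical, since the project is still unnamed (see the Future slide):

```sql
-- Hypothetical declaration: same Hive DDL shape, different handler.
-- The class name is illustrative, not a released artifact.
CREATE EXTERNAL TABLE user_accounts_direct
  (userid string, username string, email string,
   password string, last_visited timestamp, country string)
STORED BY 'SSTableStorageHandler';
```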
Cassandra on Spark: through the CQL interface

[Diagram: the Spark worker JVM acts as a remote client of the Cassandra JVM. Cassandra reads the SSTables through the FS cache, then deserializes, merges and serializes the rows; the Spark worker deserializes and processes them. Result: latency spikes!]
Cassandra on Spark: SSTables direct

[Diagram: the Spark worker JVM reads the SSTables directly, bypassing the FS cache, and does the deserialize and process steps itself; the Cassandra JVM and its remote clients are left untouched. Result: constant latency.]
Disadvantages
* Equivalent to CL.ONE
* Always runs tasks local to the data
* Doesn't read data in memtables
Performance results
Testing
* 4-node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on the Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CqlStorageHandler with SSTableStorageHandler
Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
  * Each node started with 9 GB of data
* No optimization or tuning
Tools
* codahale Metrics
* Ganglia
* Load generator using the DataStax Java driver
* Google spreadsheet
Result 1
* No Cassandra load
* Run the caching query:

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts WHERE country = 'BR';

* Takes 33 mins through CQL
* Takes 13 mins through SSTables
  * 130k records/s
* => SSTables is 2.5x faster
* Even better, since the CQL path had access to both cores
Using cached results
* Now the results are cached, we can run super-fast queries
* No I/O or extra memory
* Bounded number of cores
* Took 18 seconds:

SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
  AND email LIKE '%@c9%';
Result 2
* Add read load
  * Read-modify-write of accounts info
  * 200 ops/s
* Measure latency
* Slow down the SSTable loader to the same rate as CQL
[Chart: mean and 95th-percentile read latency for each loader, against the base (no-loader) latencies.]
Analysis
* Average latency 17% lower
  * Probably due to less CPU used by the query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use
Result 3
* Keep the read workload
* Measure the same latency
* Add an insert workload
  * Insert into a separate table
  * 2500 ops/s
[Chart: read latency under the combined workload, CQL loader vs SSTable loader.]
Analysis
* Lots of latency, but there is a lot anyway
Performance wrap-up
* 2.5x faster with less CPU
  * => uses fewer resources to do the same thing
* Lower, more predictable latencies when running at the same speed
  * => controlled resource usage lowers the latency impact
* Could limit further to make the impact unnoticeable
Summary
Summary
* Discussed an analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results
Future
* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute
Thank you!
Richard Low | @richardalow