#CASSANDRAEU CASSANDRASUMMITEU
Richard Low | @richardalow
Mixing Batch and Real-time: Cassandra with Shark
#CASSANDRAEU @richardalow
About me
* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previously: lead Cassandra and analytics dev at Acunu
Outline
* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work
Batch analytics on real-time databases
Batch and real-time analytics
* Wherever there is data, there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or they will impact your real-time SLA
Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on the same database
* Some are write-heavy
* A good fit for Cassandra!
Example data model

CREATE TABLE user_accounts (
  userid uuid PRIMARY KEY,
  username text,
  email text,
  password text,
  last_visited timestamp,
  country text
);
Example data model

SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email             | last_visited        | password | username
----------+---------+-------------------+---------------------+----------+----------
 a03dcf03 | UK      | [email protected] | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | [email protected] | 2013-08-17 13:07:36 | moh7eksn | jean88
Marketing walks in
Ad-hoc query
“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com.
I need the answer by Monday.”
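In Hive/Shark QL, that request maps to something like the following sketch. Table and column names come from the user_accounts model above; the July cutoff and the yahoo.com pattern are read straight from the request:

```sql
-- Sketch of the ad-hoc marketing query: users from Brazil who
-- haven't logged in since July and have a yahoo.com address.
SELECT userid, username, email, last_visited
FROM user_accounts
WHERE country = 'BR'
  AND unix_timestamp(last_visited) < unix_timestamp('2013-07-01 00:00:00')
  AND email LIKE '%@yahoo.com';
```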
Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from the data model
* It's on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning

* Mostly: some of the people who haven't visited for a while may suddenly come back
Why?
* Underrepresented use case in a plethora of tools
* Seen days of dev time wasted
* Want to see what can be done
Current solutions
Options
* Run Hive query on top of Cassandra
  * Will compete with Cassandra for I/O, memory, CPU and network
  * Will cause extra GC pressure on Cassandra
  * Could flush the filesystem cache
Options
* Write an ETL script and load into another DB
  * All custom code
  * Single-threaded
  * Unreliable
  * Will still flush the cache on Cassandra nodes
Options
* Clone the cluster
  * Worst possible network load
  * Manual import each time
  * No incremental update
  * Needs duplicate hardware
Options
* Add a 'batch analytics' DC and run Hive there
  * Initial copy is slow and affects real-time performance
  * Needs duplicate hardware
  * Will drop writes when really busy
Spark and Shark
Spark
* Developed by the AMPLab (UC Berkeley)
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables:

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts WHERE country = 'BR';
Shark on Cassandra
Shark on Cassandra
* CqlStorageHandler
* Can use the existing hive-cassandra storage handler
* Can work well: see Evan Chan's (Ooyala) talk from #cassandra13
* But suffers from the same problems as Hive + Hadoop on Cassandra
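Wiring Shark to an existing Cassandra table goes through a Hive storage handler declaration along these lines. This is only a sketch: the exact handler class name and SERDE property keys vary between hive-cassandra builds, so check the one you deploy:

```sql
-- Sketch only: the handler class and property names below are
-- illustrative and depend on the hive-cassandra build in use.
CREATE EXTERNAL TABLE user_accounts
  (userid string, username string, email string,
   password string, last_visited timestamp, country string)
STORED BY 'org.apache.hadoop.hive.cassandra.CqlStorageHandler'
WITH SERDEPROPERTIES ('cassandra.host' = '127.0.0.1',
                      'cassandra.ks.name' = 'mykeyspace');
```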
Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in a separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate-limiting raw disk access
* Skip the filesystem cache
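Because Shark keeps Hive's storage-handler interface, the direct reader would plug in the same way, with only the handler class swapped. The name below is hypothetical, since the project is still unnamed (see the Future slide):

```sql
-- Hypothetical declaration: same Hive DDL shape, different handler.
-- The class name is illustrative, not a released artifact.
CREATE EXTERNAL TABLE user_accounts_direct
  (userid string, username string, email string,
   password string, last_visited timestamp, country string)
STORED BY 'SSTableStorageHandler';
```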
Cassandra on Spark: through the CQL interface

[Diagram: the Spark worker JVM acts as a remote client of the Cassandra JVM. Cassandra reads the SSTables through the FS cache, then deserializes, merges and serializes the rows; the Spark worker deserializes and processes them. Result: latency spikes!]
Cassandra on Spark: SSTables direct

[Diagram: the Spark worker JVM reads the SSTables directly, bypassing the FS cache, and does the deserialize and process steps itself; the Cassandra JVM and its remote clients are left untouched. Result: constant latency.]
Disadvantages
* Equivalent to CL.ONE
* Always runs tasks local to the data
* Doesn't read data in memtables
Performance results
Testing
* 4-node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on the Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CqlStorageHandler with SSTableStorageHandler
Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
  * Each node started with 9 GB of data
* No optimization or tuning
Tools
* codahale Metrics
* Ganglia
* Load generator using the DataStax Java driver
* Google spreadsheet
Result 1
* No Cassandra load
* Run the caching query:

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts WHERE country = 'BR';

* Takes 33 mins through CQL
* Takes 13 mins through SSTables
  * 130k records/s
* => SSTables is 2.5x faster
* Even better, since the CQL path had access to both cores
Using cached results
* Now the results are cached, we can run super-fast queries
* No I/O or extra memory
* Bounded number of cores
* Took 18 seconds:

SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
  AND email LIKE '%@c9%';
Result 2
* Add read load
  * Read-modify-write of accounts info
  * 200 ops/s
* Measure latency
* Slow down the SSTable loader to the same rate as CQL
[Chart: mean and 95th-percentile read latency for each loader, against the base (no-loader) latencies.]
Analysis
* Average latency 17% lower
  * Probably due to less CPU used by the query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use
Result 3
* Keep the read workload
* Measure the same latency
* Add an insert workload
  * Insert into a separate table
  * 2500 ops/s
[Chart: read latency under the combined workload, CQL loader vs SSTable loader.]
Analysis
* Lots of latency, but there is a lot anyway
Performance wrap-up
* 2.5x faster with less CPU
  * => uses fewer resources to do the same thing
* Lower, more predictable latencies when running at the same speed
  * => controlled resource usage lowers the latency impact
* Could limit further to make the impact unnoticeable
Summary
Summary
* Discussed an analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results
Future
* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute
Thank you!
Richard Low | @richardalow