Cassandra/Hadoop Integration

OLTP + OLAP = Cassandra

- BigTable + Dynamo
- Semi-structured data model
- Decentralized – no special roles, no SPOF
- Horizontally scalable
- Ridiculously fast writes, fast reads
- Tunably consistent
- Cross-DC capable

Cassandra (basic overview)

Design your data model based on your query model

- Real-time ad-hoc queries aren’t viable
- Secondary indexes help
- What about analytics?

Querying with Cassandra

- Hadoop brings analytics
  - MapReduce
  - Pig/Hive and other tools built above
- MapReduce
  - Configurable data sources/destinations
  - Many already familiar with it
  - Active community

Enter Hadoop

- Basic Recipe
  - Overlay Hadoop on top of Cassandra
  - Separate server for name node and job tracker
  - Co-locate task trackers with Cassandra nodes
  - Data nodes for distributed cache
- Voilà
  - Data locality
  - Analytics engine scales with data

Cluster Configuration

- Always tune Cassandra to taste
- For Hadoop workloads you might
  - Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
  - Raise rpc_timeout_in_ms in cassandra.yaml
  - Tune cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)

Cluster Tuning
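To make the batch-size knob concrete, here is a minimal Python sketch (not Cassandra code) of paging through a key range in fixed-size batches. The in-memory key list and the `fetch_batch` helper are hypothetical stand-ins for one range query against a node. The only point is that a smaller batch means more round trips, each returning less data, which is why batch size and RPC timeout tend to be tuned together.

```python
from bisect import bisect_right


def fetch_batch(sorted_keys, after_key, batch_size):
    """Hypothetical stand-in for one range query.

    Returns up to batch_size keys strictly greater than after_key
    (or from the beginning when after_key is None).
    """
    start = 0 if after_key is None else bisect_right(sorted_keys, after_key)
    return sorted_keys[start:start + batch_size]


def scan(sorted_keys, batch_size):
    """Page through every key; return (keys_seen, round_trips)."""
    seen, round_trips, last = [], 0, None
    while True:
        batch = fetch_batch(sorted_keys, last, batch_size)
        round_trips += 1
        if not batch:
            break
        seen.extend(batch)
        last = batch[-1]
        # A short batch means the range is exhausted.
        if len(batch) < batch_size:
            break
    return seen, round_trips


if __name__ == "__main__":
    keys = sorted("row%04d" % i for i in range(10))
    for size in (4, 100):
        _, trips = scan(keys, size)
        print("batch size", size, "->", trips, "round trips")
```

A larger batch lowers the number of calls but makes each call heavier, so a very large batch on wide rows is the case where a higher timeout becomes relevant.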

All-in-one Configuration

Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)

JobTracker and NameNode

Separate Analytics Configuration

Nodes for real-time random access Separated nodes for analytics

A single Cassandra cluster with different virtual data centers

- Cassandra-specific InputFormat: ColumnFamilyInputFormat
- Configuration – ConfigHelper, Hadoop variables
- InputSplits over the data – tunable
- Example usage in contrib/word_count

MapReduce - InputFormat

- OutputFormat: ColumnFamilyOutputFormat
- Configuration – ConfigHelper, Hadoop variables
- Batches output – tunable
- Don’t have to use the Cassandra API
- Some optimizations (e.g. ConsistencyLevel.ONE)
- Uses Avro for output serialization (enables streaming)
- Example usage in contrib/word_count

MapReduce - OutputFormat

Visualizing

Take vertical slices of columns

Over the whole column family
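As a rough mental model of a "vertical slice", the Python sketch below treats a column family as a mapping of row key to columns and selects the same named columns from every row. The data, the column names, and the `vertical_slice` helper are made up for illustration. This is not how the InputFormat is implemented, only the shape of what a map task ends up seeing.

```python
def vertical_slice(column_family, column_names):
    """Yield (row_key, {name: value}) restricted to the requested columns.

    column_family is a plain dict: row key -> dict of column name -> value.
    Rows with none of the requested columns are skipped here, which is a
    simplification chosen for this sketch.
    """
    wanted = set(column_names)
    for row_key in sorted(column_family):
        columns = column_family[row_key]
        picked = {name: value for name, value in columns.items() if name in wanted}
        if picked:
            yield row_key, picked


if __name__ == "__main__":
    # Hypothetical rows; semi-structured, so rows need not share columns.
    users = {
        "alice": {"email": "a@example.com", "city": "Austin", "age": "31"},
        "bob": {"email": "b@example.com", "age": "27"},
        "carol": {"city": "Oslo"},
    }
    for key, cols in vertical_slice(users, ["email", "age"]):
        print(key, cols)
```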

- What about languages outside of Java?
- Build on what Hadoop uses - Streaming
- Output streaming as of 0.7.0
  - Example in contrib/hadoop_streaming_output
- Input streaming in progress, hoping for 0.7.2

Hadoop Streaming
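For readers new to Hadoop Streaming, the sketch below shows the generic contract in Python: a mapper and a reducer that read lines on stdin and write tab-separated key/value lines on stdout. It is a plain word count and does not cover the Cassandra-specific output serialization; see contrib/hadoop_streaming_output for that. The `map`/`reduce` command-line switch and the function names are choices made for this example only.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Reducer: sum counts per word.

    Assumes input is sorted by key, which is what the streaming
    framework delivers to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as: script.py map   or   script.py reduce
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Locally, the same script can be checked without a cluster by piping the mapper output through `sort` into the reducer.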

- Developed at Yahoo!
- PigLatin/Grunt shell
- Powerful scripting language for analytics
- Configuration – Hadoop/Env variables
- Uses Pig 0.7+
- Example usage in contrib/pig

Pig

rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
       AS (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
cols = FOREACH rows GENERATE flatten(cols) AS (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) AS count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
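For anyone who does not read Pig Latin, the script above can be paraphrased step by step in plain Python over in-memory data. This is only an illustration of the data flow: the sample rows are invented, `top_words` is a hypothetical helper, and `str.split` is a simplification of Pig's TOKENIZE.

```python
from collections import Counter


def top_words(rows, limit=10):
    """Mirror the Pig steps: flatten columns, tokenize values, count, order, limit.

    rows is a list of (key, [(column_name, column_value), ...]) tuples,
    matching the (key, cols) schema declared in the LOAD statement.
    """
    counts = Counter()
    for _key, cols in rows:              # FOREACH rows GENERATE flatten(cols)
        for _name, value in cols:
            for word in value.split():   # flatten(TOKENIZE(value))
                counts[word] += 1        # GROUP BY word + COUNT
    # ORDER counts BY count DESC, then LIMIT
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        ("k1", [("c1", "the quick fox"), ("c2", "the lazy dog")]),
        ("k2", [("c1", "the fox")]),
    ]
    for word, count in top_words(sample, limit=2):
        print(word, count)
```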

- ColumnFamilyInputFormat
- ColumnFamilyOutputFormat
- Hadoop Streaming Output
- Pig support – Cassandra LoadFunc

Summary of Integration

- Raptr.com
  - Home grown solution -> Cassandra + Hadoop
  - Query time: hours -> minutes
  - Pig obviated their need for multi-lingual MR
  - Speed and ease are enabling
- Imagini/Visual DNA
- The Dachis Group
- US Government (Digital Reasoning)
  - See http://github.com/digitalreasoning/PyStratus

Users of Cassandra + Hadoop

- Hive support in progress (HIVE-1434)
- Hadoop Input Streaming (hoping for 0.7.2 - 1497)
- Pig Storage Func (CASSANDRA-1828)
- Row predicates (pending CASSANDRA-1600)
- MapReduce et al over secondary indexes (1600)
- Performance improvements (though already good)

Future

- Performant OLTP + powerful OLAP
- Less need to shuttle data between storage systems
- Data locality for processing
- Scales with the cluster
- Can separate analytics load into virtual DC

Conclusion

- About Cassandra
  - http://www.datastax.com/docs
  - http://wiki.apache.org/cassandra
  - Search and subscribe to the user mailing list (very active)
  - #Cassandra on freenode (IRC): ~150-200+ users from around the world
  - Cassandra: The Definitive Guide
- About Hadoop Support in Cassandra
  - Check out various <source>/contrib modules: README/code
  - http://wiki.apache.org/cassandra/HadoopSupport

Learn More

http://www.mail-archive.com/[email protected]/

http://webchat.freenode.net/?channels=%23cassandra

About me:
- @jeromatron on Twitter
- jeromatron on IRC in #cassandra

Questions
