Cassandra/Hadoop Integration
DESCRIPTION
Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.
TRANSCRIPT
Cassandra/Hadoop Integration
OLTP + OLAP = Cassandra
BigTable + Dynamo
Semi-structured data model
Decentralized: no special roles, no SPOF
Horizontally scalable
Ridiculously fast writes, fast reads
Tunably consistent
Cross-DC capable
Cassandra (basic overview)
Design your data model based on your query model
Real-time ad-hoc queries aren’t viable
Secondary indexes help
What about analytics?
Querying with Cassandra
Hadoop brings analytics
  MapReduce
  Pig/Hive and other tools built above
MapReduce
  Configurable data sources/destinations
  Many already familiar with it
  Active community
Enter Hadoop
Basic Recipe
  Overlay Hadoop on top of Cassandra
  Separate server for name node and job tracker
  Co-locate task trackers with Cassandra nodes
  Data nodes for distributed cache
Voilà
  Data locality
  Analytics engine scales with data
Cluster Configuration
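The data locality point can be pictured with a toy scheduler. The sketch below is a simplification, not Hadoop's actual scheduling logic: it assumes each input split advertises the hosts that hold its data and that each task tracker is identified by its host name. The function name and the host names are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def pick_tracker(split_hosts: Iterable[str], tracker_hosts: Set[str]) -> Tuple[str, bool]:
    """Return (tracker_host, is_local) for one input split.

    Prefers a task tracker running on a host that already holds the split's data.
    """
    if not tracker_hosts:
        raise ValueError("no task trackers available")
    for host in split_hosts:
        if host in tracker_hosts:
            return host, True  # data-local: rows are read from the same machine
    # No co-located tracker: fall back to any tracker (deterministic for the example).
    return min(tracker_hosts), False


# Hypothetical cluster: task trackers co-located with three Cassandra nodes.
trackers = {"cass1", "cass2", "cass3"}
print(pick_tracker(["cass2", "cass3"], trackers))  # split stored on cass2 and cass3
print(pick_tracker(["other9"], trackers))          # split stored elsewhere
```

With task trackers on every Cassandra node, the first branch is the common case, which is what lets the analytics engine grow along with the data.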
Always tune Cassandra to taste
For Hadoop workloads you might
  Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
  Raise rpc_timeout_in_ms in cassandra.yaml
  Tune cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)
Cluster Tuning
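A back-of-the-envelope sketch of why the batch size matters. It assumes the record reader pages through each input split in batches of cassandra.range.batch.size rows, one range query per batch. The function name and the sample numbers are illustrative and not taken from the Cassandra source.

```python
import math


def range_queries_per_split(rows_in_split: int, batch_size: int) -> int:
    """Range queries needed to read one split, assuming one query per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(rows_in_split / batch_size)


# Hypothetical split of 65,536 rows, read with two different batch sizes.
for batch_size in (4096, 256):
    print(batch_size, range_queries_per_split(65536, batch_size))
```

Larger batches mean fewer round trips, but each call returns more data and takes longer, which is one reason a higher rpc_timeout_in_ms may go hand in hand with a larger batch size.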
All-in-one Configuration
Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
JobTracker and NameNode
Separate Analytics Configuration
Nodes for real-time random access
Separate nodes for analytics
A single Cassandra cluster with different virtual data centers
Cassandra-specific InputFormat
ColumnFamilyInputFormat
Configuration: ConfigHelper, Hadoop variables
InputSplits over the data: tunable
Example usage in contrib/word_count
MapReduce - InputFormat
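To make "InputSplits over the data" concrete, here is a simplified model that cuts one node's token range into contiguous sub-ranges, one per map task. The actual split computation in ColumnFamilyInputFormat is more involved; the even spacing, the integer tokens, and the function name are assumptions made for this sketch.

```python
from typing import List, Tuple


def split_token_range(start: int, end: int, num_splits: int) -> List[Tuple[int, int]]:
    """Cut the token range (start, end] into num_splits contiguous sub-ranges."""
    width = end - start
    if num_splits <= 0 or width < num_splits:
        raise ValueError("need 1 <= num_splits <= end - start")
    # Boundaries are spaced as evenly as integer arithmetic allows.
    bounds = [start + (width * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# A node owning tokens (0, 100], cut into four splits (one map task each).
for lo, hi in split_token_range(0, 100, 4):
    print(lo, hi)
```

More splits means more, smaller map tasks over the same data, which is the knob the "tunable" bullet refers to.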
OutputFormat
ColumnFamilyOutputFormat
Configuration: ConfigHelper, Hadoop variables
Batches output: tunable
Don’t have to use the Cassandra API
Some optimizations (e.g. ConsistencyLevel.ONE)
Uses Avro for output serialization (enables streaming)
Example usage in contrib/word_count
MapReduce - OutputFormat
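The "batches output" idea can be sketched as a small buffering writer. This is not the ColumnFamilyOutputFormat implementation: the class, the callback, and the default batch size are hypothetical, and the example only shows the general pattern of collecting mutations and sending them in groups.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, bytes]  # (row key, column name, value)


class BatchingWriter:
    """Buffers mutations and hands them to `send` in groups of `batch_size`."""

    def __init__(self, send: Callable[[List[Mutation]], None], batch_size: int = 32):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._send = send
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, column: str, value: bytes) -> None:
        self._pending.append((key, column, value))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(self._pending)
            self._pending = []  # start a fresh buffer for the next batch

    def close(self) -> None:
        self.flush()  # send whatever is left at the end of the task


sent = []  # stands in for the cluster; each element is one batch
writer = BatchingWriter(sent.append, batch_size=2)
for word, count in [("a", 3), ("b", 1), ("c", 2)]:
    writer.write(word, "count", str(count).encode())
writer.close()
print([len(batch) for batch in sent])
```

A reducer only calls write; how many mutations travel per request stays a configuration concern.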
Visualizing
Take vertical slices of columns
Over the whole column family
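A minimal sketch of a vertical slice, modelling a column family as a dict of rows where each row is a dict of columns. The sample data and the function name are invented for illustration; in a real job the slice would be expressed through the input format's configuration rather than in user code.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, bytes]


def slice_columns(column_family: Dict[str, Row], wanted: Iterable[str]) -> Iterator[Tuple[str, Row]]:
    """Yield (row_key, columns) for every row, keeping only the wanted column names."""
    wanted = list(wanted)
    for row_key, columns in column_family.items():
        # Rows are sparse: a row simply omits columns it does not have.
        yield row_key, {name: columns[name] for name in wanted if name in columns}


users = {
    "alice": {"email": b"a@example.org", "city": b"Austin", "age": b"31"},
    "bob": {"email": b"b@example.org", "age": b"27"},
}
for key, cols in slice_columns(users, ["city", "age"]):
    print(key, cols)
```

Every row key is visited, but only the selected columns are handed to the mapper.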
What about languages outside of Java?
Build on what Hadoop uses: Streaming
Output streaming as of 0.7.0
Example in contrib/hadoop_streaming_output
Input streaming in progress, hoping for 0.7.2
Hadoop Streaming
Developed at Yahoo!
PigLatin/Grunt shell
Powerful scripting language for analytics
Configuration: Hadoop/Env variables
Uses Pig 0.7+
Example usage in contrib/pig
Pig
rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
       as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
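For readers who do not know Pig, the same word count can be traced in plain Python over in-memory rows. This is only an illustration of what the script computes: the sample rows are invented, whitespace splitting is an approximation of Pig's TOKENIZE, and nothing here talks to Cassandra.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Column = Tuple[str, bytes]       # (name, value), mirroring the Pig schema
Row = Tuple[str, List[Column]]   # (key, cols)


def top_words(rows: Iterable[Row], limit: int = 10) -> List[Tuple[str, int]]:
    """Count words across all column values and return the most frequent ones."""
    counts: Counter = Counter()
    for _key, cols in rows:                 # FOREACH rows ... flatten(cols)
        for _name, value in cols:
            words = value.decode("utf-8").split()  # stand-in for TOKENIZE
            counts.update(words)            # GROUP BY word + COUNT
    return counts.most_common(limit)        # ORDER ... DESC + LIMIT


sample_rows = [
    ("k1", [("c1", b"to be or not to be")]),
    ("k2", [("c1", b"to do")]),
]
print(top_words(sample_rows, limit=2))
```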
ColumnFamilyInputFormat
ColumnFamilyOutputFormat
Hadoop Streaming Output
Pig support: Cassandra LoadFunc
Summary of Integration
Raptr.com
  Home grown solution -> Cassandra + Hadoop
  Query time: hours -> minutes
  Pig obviated their need for multi-lingual MR
  Speed and ease are enabling
Imagini/Visual DNA
The Dachis Group
US Government (Digital Reasoning)
  See http://github.com/digitalreasoning/PyStratus
Users of Cassandra + Hadoop
Hive support in progress (HIVE-1434)
Hadoop Input Streaming (hoping for 0.7.2 - 1497)
Pig Storage Func (CASSANDRA-1828)
Row predicates (pending CASSANDRA-1600)
MapReduce et al over secondary indexes (1600)
Performance improvements (though already good)
Future
Performant OLTP + powerful OLAP
Less need to shuttle data between storage systems
Data locality for processing
Scales with the cluster
Can separate analytics load into virtual DC
Conclusion
About Cassandra
  http://www.datastax.com/docs
  http://wiki.apache.org/cassandra
  Search and subscribe to the user mailing list (very active)
  #Cassandra on freenode (IRC)
    ~150-200+ users from around the world
  Cassandra: The Definitive Guide
About Hadoop Support in Cassandra
  Check out various <source>/contrib modules: README/code
  http://wiki.apache.org/cassandra/HadoopSupport
Learn More
About me:
  [email protected]
  @jeromatron on Twitter
  jeromatron on IRC in #cassandra
Questions