cassandra/hadoop integration

Upload: jeremy-hanna

Post on 06-May-2015

DESCRIPTION

Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

TRANSCRIPT

Page 1: Cassandra/Hadoop Integration

Cassandra/Hadoop Integration

OLTP + OLAP = Cassandra

Page 2: Cassandra/Hadoop Integration

Cassandra (basic overview)

- BigTable + Dynamo
- Semi-structured data model
- Decentralized: no special roles, no SPOF
- Horizontally scalable
- Ridiculously fast writes, fast reads
- Tunably consistent
- Cross-DC capable

Page 3: Cassandra/Hadoop Integration

Querying with Cassandra

- Design your data model based on your query model
- Real-time ad-hoc queries aren't viable
- Secondary indexes help
- What about analytics?

Page 4: Cassandra/Hadoop Integration

Enter Hadoop

- Hadoop brings analytics
  - MapReduce
  - Pig/Hive and other tools built on top of it
- MapReduce
  - Configurable data sources/destinations
  - Many already familiar with it
  - Active community

Page 5: Cassandra/Hadoop Integration

Cluster Configuration

- Basic recipe
  - Overlay Hadoop on top of Cassandra
  - Separate server for name node and job tracker
  - Co-locate task trackers with Cassandra nodes
  - Data nodes for distributed cache
- Voilà
  - Data locality
  - Analytics engine scales with data
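
The data-locality point can be illustrated with a small sketch. This is a toy model rather than Hadoop's actual scheduler: the function `assign_splits`, the host names, and the split names are all hypothetical. It assumes each input split knows which hosts hold its data, and it prefers a task tracker on one of those hosts.

```python
def assign_splits(splits, task_trackers):
    """Assign each split to a task tracker, preferring a host that holds the data.

    splits: dict of split id -> list of hosts holding a replica (hypothetical model).
    task_trackers: list of hosts running a task tracker.
    """
    load = {host: 0 for host in task_trackers}
    assignment = {}
    for split_id, replica_hosts in splits.items():
        # Local candidates are replica hosts that also run a task tracker.
        local = [h for h in replica_hosts if h in load]
        candidates = local if local else list(load)
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda h: (load[h], h))
        assignment[split_id] = chosen
        load[chosen] += 1
    return assignment


splits = {
    "split-0": ["node1", "node2"],
    "split-1": ["node2", "node3"],
    "split-2": ["node3", "node1"],
    "split-3": ["node9"],  # replica host without a task tracker: read is remote
}
assignment = assign_splits(splits, ["node1", "node2", "node3"])
print(assignment)
```

With task trackers co-located on every Cassandra node, the remote fallback in the last split should rarely be needed, which is the motivation for the recipe above.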

Page 6: Cassandra/Hadoop Integration

Cluster Tuning

- Always tune Cassandra to taste
- For Hadoop workloads you might
  - Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
  - Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
  - Tune the cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)
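
To see why the batch size is worth tuning, here is a rough sketch of reading a split's rows in fixed-size batches. It is not the Cassandra record reader: `read_in_batches` and the row keys are hypothetical, and the only assumption is that each batch costs one request to the cluster.

```python
def read_in_batches(sorted_keys, batch_size):
    """Yield successive batches of row keys, one batch per simulated request."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = 0
    while start < len(sorted_keys):
        batch = sorted_keys[start:start + batch_size]
        yield batch
        start += len(batch)


keys = [f"row{i:03d}" for i in range(10)]
batches = list(read_in_batches(keys, batch_size=4))
round_trips = len(batches)
print(round_trips, [len(b) for b in batches])  # 3 [4, 4, 2]
```

Larger batches mean fewer requests, but each request does more work. That trade-off is presumably one reason a higher rpc_timeout_in_ms is suggested alongside the batch size.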

Page 7: Cassandra/Hadoop Integration

All-in-one Configuration

Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)

JobTracker and NameNode

Page 8: Cassandra/Hadoop Integration

Separate Analytics Configuration

Nodes for real-time random access Separated nodes for analytics

A single Cassandra cluster with different virtual data centers

Page 9: Cassandra/Hadoop Integration

MapReduce - InputFormat

- Cassandra-specific InputFormat
  - ColumnFamilyInputFormat
  - Configuration: ConfigHelper, Hadoop variables
  - InputSplits over the data: tunable
  - Example usage in contrib/word_count

Page 10: Cassandra/Hadoop Integration

MapReduce - OutputFormat

- OutputFormat
  - ColumnFamilyOutputFormat
  - Configuration: ConfigHelper, Hadoop variables
  - Batches output: tunable
  - Don't have to use Cassandra API
  - Some optimizations (e.g. ConsistencyLevel.ONE)
  - Uses Avro for output serialization (enables streaming)
  - Example usage in contrib/word_count

Page 11: Cassandra/Hadoop Integration

Visualizing

Take vertical slices of columns

Over the whole column family
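
One way to picture a vertical slice is with a toy column family held as a dictionary. This is only an illustration: `slice_columns` and the sample data are hypothetical, and treating both ends of the column-name range as inclusive is an assumption of the sketch.

```python
def slice_columns(column_family, start, finish):
    """Return, for every row, only the columns with start <= name <= finish."""
    return {
        key: {
            name: value
            for name, value in sorted(cols.items())
            if start <= name <= finish
        }
        for key, cols in column_family.items()
    }


users = {
    "alice": {"age": "31", "city": "Austin", "email": "a@example.com"},
    "bob": {"city": "Dallas", "zip": "75001"},
}
sliced = slice_columns(users, "city", "email")
print(sliced)
```

Every row is visited, but only the selected band of columns is handed on, which is the shape of input a MapReduce job over a column family would see.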

Page 12: Cassandra/Hadoop Integration

Hadoop Streaming

- What about languages outside of Java?
- Build on what Hadoop uses: Streaming
- Output streaming as of 0.7.0
  - Example in contrib/hadoop_streaming_output
- Input streaming in progress, hoping for 0.7.2

Page 13: Cassandra/Hadoop Integration

Pig

- Developed at Yahoo!
- PigLatin/Grunt shell
- Powerful scripting language for analytics
- Configuration: Hadoop/Env variables
- Uses pig 0.7+
- Example usage in contrib/pig

Page 14: Cassandra/Hadoop Integration

-- Load each row as a key plus a bag of (name, value) columns.
rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage() as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
-- Flatten the columns, then split each column value into words.
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
-- Count occurrences of each word and keep the ten most frequent.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
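
For readers unfamiliar with Pig, the same word-count pipeline can be sketched in plain Python over in-memory rows. The function `top_words` and the sample rows are hypothetical. Two assumptions differ from the Pig script: values are split on whitespace only, and ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter


def top_words(rows, limit=10):
    """Count words across all column values and return the most frequent ones."""
    counts = Counter()
    for _key, cols in rows:            # one (key, columns) pair per row
        for _name, value in cols:      # flatten the bag of columns
            counts.update(value.split())   # tokenize and count
    # Order by count descending, then by word, and keep the top entries.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


rows = [
    ("row1", [("col1", "hadoop reads cassandra"), ("col2", "cassandra writes fast")]),
    ("row2", [("col1", "pig runs on hadoop"), ("col2", "cassandra scales")]),
]
top = top_words(rows, limit=3)
print(top)  # [('cassandra', 3), ('hadoop', 2), ('fast', 1)]
```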

Page 15: Cassandra/Hadoop Integration

Summary of Integration

- ColumnFamilyInputFormat
- ColumnFamilyOutputFormat
- Hadoop Streaming Output
- Pig support: Cassandra LoadFunc

Page 16: Cassandra/Hadoop Integration

Users of Cassandra + Hadoop

- Raptr.com
  - Home grown solution -> Cassandra + Hadoop
  - Query time: hours -> minutes
  - Pig obviated their need for multi-lingual MR
  - Speed and ease are enabling
- Imagini/Visual DNA
- The Dachis Group
- US Government (Digital Reasoning)
  - See http://github.com/digitalreasoning/PyStratus

Page 17: Cassandra/Hadoop Integration

Future

- Hive support in progress (HIVE-1434)
- Hadoop Input Streaming (hoping for 0.7.2 - 1497)
- Pig Storage Func (CASSANDRA-1828)
- Row predicates (pending CASSANDRA-1600)
- MapReduce et al over secondary indexes (1600)
- Performance improvements (though already good)

Page 18: Cassandra/Hadoop Integration

Conclusion

- Performant OLTP + powerful OLAP
- Less need to shuttle data between storage systems
- Data locality for processing
- Scales with the cluster
- Can separate analytics load into virtual DC

Page 19: Cassandra/Hadoop Integration

Learn More

- About Cassandra
  - http://www.datastax.com/docs
  - http://wiki.apache.org/cassandra
  - Search and subscribe to the user mailing list (very active)
  - #Cassandra on freenode (IRC)
    - ~150-200+ users from around the world
  - Cassandra: The Definitive Guide
- About Hadoop Support in Cassandra
  - Check out various <source>/contrib modules: README/code
  - http://wiki.apache.org/cassandra/HadoopSupport

Page 20: Cassandra/Hadoop Integration

Questions

- About me:
  - [email protected]
  - @jeromatron on Twitter
  - jeromatron on IRC in #cassandra