cassandra + hadoop = brisk
DESCRIPTION
An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.
TRANSCRIPT
Our sponsors:
Acunu
London
But first, a short back story…
NoSQL!
New Ad Targeting Product
Join
May 2010
16 node cluster running smoothly!
Start meetup group
Learn, learn, learn
GC HELL!
Hire a Java dev from Cassandra London!
No Hive support (CASSANDRA-913)
No streaming Jar support
Pig support
Analytics
Run out of speaker volunteers
0.8 arrives: counters, CQL
Not compatible with 0.6
Have to watch the sales pitch again!
Cassandra 0.7 released! Secondary indexes
Provide talks and beer!
More meetups…
Please volunteer if you would like to give a talk; Internet fame awaits
• My experience with Cassandra in production is positive
• Analytics is more difficult than it could be
• Welcome Brisk!
• Brisk combines Hadoop, Hive and Cassandra in a “distribution”
In a nutshell
• CassandraFS as HDFS compatible layer; no namenode, no SPOF
• Can split cluster for OLAP and OLTP workloads, scaling up either as required
Demonstrating brisk…
Building an Ad Network!
The plan:
• Simple data model – segment users into buckets
• System to put users in buckets via a pixel
• Real-time queries
• Analytics
We Have Your Kidneys
The ad-network for the paranoid generation
• Cookie-based identification
• API provides:
  • Add user to a bucket (including ability to define expiry time)
  • Get buckets a user belongs to
Set up Brisk
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
• Step-by-step guide with pictures!
• Ubuntu 10.10 image with RAID 0 ephemeral disks
• Jairam has been bug-fixing some minor issues
Data model
CF = users
[userUUID] [segmentID] = 1
CF = segments
[segmentID] [userUUID] = 1
create keyspace whyk
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = [{replication_factor:1}];
create column family users
    with comparator = 'AsciiType'
    and rows_cached = 5000;
create column family segments
    with comparator = 'AsciiType'
    and rows_cached = 5000;
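The two column families above mirror each other, so each question ("which segments is this user in?" and "which users are in this segment?") becomes a single-row read. A minimal in-memory sketch of that dual-write idea follows. It is written in Python rather than the demo's PHP so that it runs standalone, and the dictionaries and function names are hypothetical stand-ins, not part of Brisk or phpcassa.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two column families:
# row key -> {column name -> value}
users = defaultdict(dict)     # users[userUUID][segmentID] = 1
segments = defaultdict(dict)  # segments[segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id):
    """Write both directions so each lookup is a single-row read."""
    users[user_uuid][segment_id] = 1
    segments[segment_id][user_uuid] = 1


def segments_for_user(user_uuid):
    """Column names of the user's row, i.e. the segments they belong to."""
    return sorted(users[user_uuid])


def users_in_segment(segment_id):
    """Column names of the segment's row, i.e. the users in that segment."""
    return sorted(segments[segment_id])


add_user_to_segment("user-a", "sports")
add_user_to_segment("user-a", "travel")
add_user_to_segment("user-b", "sports")

print(segments_for_user("user-a"))
print(users_in_segment("sports"))
```

The cost of this layout is that every write happens twice; the benefit is that neither read needs a secondary index or a scan.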
Our pixel
http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
• We’ll use Cassandra’s expiring columns feature
PHP code – uses phpcassa
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$segments = new ColumnFamily($pool, 'segments');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,     // default TS
    $expires
);
$segments->insert(
    $segment,
    array($userUuid => 1),
    NULL,     // default TS
    $expires
);
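Cassandra handles column expiry on the server, so the demo only has to pass a TTL with each insert. To illustrate what that buys the application, here is a rough Python sketch of a row whose columns disappear after their TTL. The `ExpiringRow` class and its explicit `now` parameter are assumptions made for the example and do not reflect how Cassandra implements expiry internally.

```python
import time


class ExpiringRow:
    """Hypothetical model of one row whose columns carry a TTL in seconds."""

    def __init__(self):
        # column name -> (value, expiry timestamp or None for "never")
        self._columns = {}

    def insert(self, name, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires_at = None if ttl is None else now + ttl
        self._columns[name] = (value, expires_at)

    def get(self, now=None):
        """Return only the columns that have not yet expired."""
        now = time.time() if now is None else now
        return {
            name: value
            for name, (value, expires_at) in self._columns.items()
            if expires_at is None or expires_at > now
        }


row = ExpiringRow()
row.insert("sports", 1, ttl=60, now=1000)    # expires at t=1060
row.insert("travel", 1, ttl=3600, now=1000)  # expires at t=4600

live_at_1030 = row.get(now=1030)
live_at_1100 = row.get(now=1100)
print(sorted(live_at_1030), sorted(live_at_1100))
```

With the TTL stored next to the column, the application never needs a clean-up job to remove users from segments they have aged out of.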
Real-time access
http://wehaveyourkidneys.com/show.php
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets first 100!
$segments = $users->get($userUuid);
header('Content-Type: application/json');
echo json_encode(array_keys($segments));
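The `@todo` above flags that a single read returns only the first 100 columns. One common way around a limit like this is to page through the row, restarting each slice at the last column name seen. The Python sketch below shows the idea against an in-memory row; `get_slice` is a hypothetical stand-in for a column slice read with an inclusive start, not the phpcassa API.

```python
def get_slice(row, column_start="", column_count=100):
    """Hypothetical slice read: up to column_count columns, sorted by name,
    starting at column_start (inclusive)."""
    names = sorted(name for name in row if name >= column_start)
    return [(name, row[name]) for name in names[:column_count]]


def iter_all_columns(row, page_size=100):
    """Yield every column by re-issuing the slice from the last name seen."""
    if page_size < 2:
        # With an inclusive start, a page of 1 would never make progress.
        raise ValueError("page_size must be at least 2")
    start = ""
    first_page = True
    while True:
        page = get_slice(row, start, page_size)
        # The start column is inclusive, so later pages repeat one column.
        fresh = page if first_page else page[1:]
        for name, value in fresh:
            yield name, value
        if len(page) < page_size:
            break  # a short page means the row is exhausted
        start = page[-1][0]
        first_page = False


# A user who belongs to 250 segments, more than one page.
row = {"seg%04d" % i: 1 for i in range(250)}
all_names = [name for name, _ in iter_all_columns(row, page_size=100)]
print(len(all_names))
```

The guard on `page_size` matters because each page after the first overlaps the previous one by a single column.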
Analytics
How many users in each segment?
Launch HIVE (very easy!)
root@brisk-01:~# brisk hive
CREATE EXTERNAL TABLE whyk.users(userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");
select segmentId, count(1) as total
from whyk.users
group by segmentId
order by total desc;
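The mapping `:key,:column,:value` flattens each Cassandra column into one (userUuid, segmentId, value) row, which is what makes the group-by possible. As a small illustration of what the query computes, here is the same aggregation in Python over made-up sample rows; the data is hypothetical and only the shape matches the Hive table.

```python
from collections import Counter

# Hypothetical (userUuid, segmentId, value) rows, shaped like the Hive table.
rows = [
    ("user-a", "sports", "1"),
    ("user-a", "travel", "1"),
    ("user-b", "sports", "1"),
    ("user-c", "sports", "1"),
    ("user-c", "finance", "1"),
]

# Equivalent of: group by segmentId, count(1) as total, order by total desc
totals = Counter(segment_id for _, segment_id, _ in rows)
ranked = totals.most_common()

for segment_id, total in ranked:
    print(segment_id, total)
```

On a real cluster Hive would run this as a batch job over the whole column family, rather than over a list held in memory.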
Summary
http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/
Real-time access + batch analytics
Easy
Easy to set up
Easy to deploy mixed-mode clusters
Easy to query (Hive)
No single point of failure
Further reading…
Installing the Brisk AMI
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
Key advantages of Brisk – from Jonathan Ellis
http://hackerne.ws/item?id=2528271
Why I’m very excited about DataStax’s Brisk – by Nathan Milford
http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
The demo code on GitHub
https://github.com/davegardnerisme/we-have-your-kidneys