Cassandra + Hadoop = Brisk

Post on 11-May-2015


DESCRIPTION

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

TRANSCRIPT

Our sponsors:

Acunu

London

But first, a short back story…

NoSQL!

New Ad Targeting Product

Join

May 2010


16 node cluster running smoothly!

Start meetup group

Learn, learn, learn

GC HELL!


Hire a Java dev from Cassandra London!

No Hive support (CASSANDRA-913)

No streaming Jar support

Pig support

Analytics


Run out of speaker volunteers

0.8 arrives: counters, CQL

Not compatible with 0.6

Have to watch the sales pitch again!

Cassandra 0.7 released! Secondary indexes

Provide talks and beer!

More meetups…


Please volunteer if you would like to give a talk; Internet fame awaits

• My experience with Cassandra in production is positive

• Analytics is more difficult than it could be

• Welcome Brisk!

• Brisk combines Hadoop, Hive and Cassandra in a “distribution”

In a nutshell

• CassandraFS as an HDFS-compatible layer; no namenode, no SPOF

• Can split cluster for OLAP and OLTP workloads, scaling up either as required

Demonstrating Brisk…

Building an Ad Network!


The plan:

• Simple data model – segment users into buckets
• System to put users in buckets via a pixel
• Real-time queries
• Analytics

We Have Your Kidneys

The ad-network for the paranoid generation

• Cookie-based identification
• API provides:
  • Add user to a bucket (including ability to define expiry time)
  • Get buckets a user belongs to

Set up Brisk

http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

• Step-by-step guide with pictures!
• Ubuntu 10.10 image with RAID 0 ephemeral disks
• Jairam has been bug-fixing some minor issues

Data model

CF = users
[userUUID][segmentID] = 1

CF = segments
[segmentID][userUUID] = 1
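To make the two-way layout concrete, here is a minimal in-memory sketch in Python (not the talk's PHP, and not Cassandra itself). It assumes plain dictionaries can stand in for the two column families and that expiry is checked on read; the class and method names are hypothetical.

```python
import time


class SegmentStore:
    """In-memory stand-in for the users and segments column families."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so expiry can be exercised in tests
        self.users = {}       # userUUID -> {segmentID: expires_at or None}
        self.segments = {}    # segmentID -> {userUUID: expires_at or None}

    def add(self, user_uuid, segment_id, ttl_seconds=None):
        """Write the same fact into both rows, one per lookup direction."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self.users.setdefault(user_uuid, {})[segment_id] = expires_at
        self.segments.setdefault(segment_id, {})[user_uuid] = expires_at

    def _live(self, row):
        # Drop expired entries and return the remaining names in sorted order.
        now = self._clock()
        return sorted(k for k, exp in row.items() if exp is None or exp > now)

    def segments_for(self, user_uuid):
        return self._live(self.users.get(user_uuid, {}))

    def users_in(self, segment_id):
        return self._live(self.segments.get(segment_id, {}))
```

Storing each membership twice trades extra writes for the ability to answer both "which segments is this user in?" and "which users are in this segment?" with a single row read.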

Data model

create keyspace whyk
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = [{replication_factor:1}];

use whyk;

create column family users
  with comparator = 'AsciiType'
  and rows_cached = 5000;

create column family segments
  with comparator = 'AsciiType'
  and rows_cached = 5000;


Our pixel

http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>

• We’ll use Cassandra’s expiring columns feature
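As a rough illustration of what the pixel endpoint has to extract before writing anything, the Python sketch below parses the two query parameters from the URL shape shown above. The function name is hypothetical, and treating `expire` as optional (no TTL when absent) is an assumption; the talk does not show how add.php validates its input.

```python
from urllib.parse import parse_qs, urlparse


def parse_pixel_request(url):
    """Return (segment, ttl_seconds) from an add.php-style pixel URL."""
    query = parse_qs(urlparse(url).query)
    segment = query.get("segment", [""])[0]
    if not (segment.isascii() and segment.isalnum()):
        raise ValueError("segment must be a non-empty alphanumeric code")
    expire = query.get("expire", [None])[0]
    if expire is None:
        return segment, None       # assumption: no expire means no TTL
    ttl_seconds = int(expire)      # raises ValueError on non-numeric input
    if ttl_seconds <= 0:
        raise ValueError("expire must be a positive number of seconds")
    return segment, ttl_seconds
```

The returned TTL is the value that would be passed through to the expiring-column write.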

PHP code – uses phpcassa

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$segments = new ColumnFamily($pool, 'segments');

// Row per user: one column per segment the user belongs to
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,       // default TS
    $expires    // TTL in seconds
);

// Row per segment: one column per user in the segment
$segments->insert(
    $segment,
    array($userUuid => 1),
    NULL,       // default TS
    $expires    // TTL in seconds
);

Real-time access

http://wehaveyourkidneys.com/show.php

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets first 100!
$segments = $users->get($userUuid);

header('Content-Type: application/json');
echo json_encode(array_keys($segments));
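The @todo above points at a common wide-row issue: a single read returns only a bounded number of columns. One way to lift that limit is to page through the row in slices. The Python sketch below shows the general pattern against a fake in-memory row; it is not the talk's code and makes no claim about phpcassa's API. It assumes the slice start is inclusive, so every page after the first repeats the last column already seen. Both function names are hypothetical.

```python
def iter_all_columns(fetch_slice, page_size=100):
    """Yield every column name by requesting consecutive slices."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start = ""                 # empty start: begin at the first column
    first_page = True
    while True:
        page = fetch_slice(start, page_size)
        # Inclusive start: later pages begin with the column we stopped on.
        fresh = page if first_page else page[1:]
        yield from fresh
        if len(page) < page_size:
            return             # short page means the row is exhausted
        start = page[-1]
        first_page = False


def make_fake_row(names):
    """Hypothetical stand-in for a wide row: sorted names, inclusive start."""
    ordered = sorted(names)

    def fetch_slice(start, count):
        return [n for n in ordered if n >= start][:count]

    return fetch_slice
```

For example, `list(iter_all_columns(make_fake_row(names)))` returns every name in sorted order regardless of how many there are.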

Analytics

How many users in each segment?

Launch HIVE (very easy!)

root@brisk-01:~# brisk hive

CREATE EXTERNAL TABLE whyk.users(userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");

select segmentId, count(1) as total
from whyk.users
group by segmentId
order by total desc;
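For readers less familiar with Hive, the aggregation in that query can be sketched in a few lines of Python over the same (userUuid, segmentId, value) shape that the table mapping exposes. This is only an illustration of the logic, not how Brisk executes it; the function name is hypothetical, and the tie-break by segment name is an addition that the Hive query does not specify.

```python
from collections import Counter


def users_per_segment(rows):
    """rows: iterable of (user_uuid, segment_id, value) tuples."""
    counts = Counter(segment_id for _user, segment_id, _value in rows)
    # Highest count first; ties broken by segment name for a stable result.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Each column in a user's row becomes one input tuple, so counting tuples per segment gives the number of users in that segment.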

Summary

http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

Real-time access + batch analytics

Easy

Easy to set up
Easy to deploy mixed-mode clusters
Easy to query (Hive)

No single point of failure

Further reading…

Installing the Brisk AMI
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

Key advantages of Brisk – from Jonathan Ellis
http://hackerne.ws/item?id=2528271

Why I’m very excited about DataStax’s Brisk – by Nathan Milford
http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/

The demo code on GitHub
https://github.com/davegardnerisme/we-have-your-kidneys
