cassandra + hadoop = brisk

Our sponsors:

London

But first, a short back story…

NoSQL!

New Ad Targeting Product

May 2010

16 node cluster running smoothly!

Start meetup group

Learn, learn, learn

GC HELL!

Hire a Java dev from Cassandra London!

No Hive support CASSANDRA-913

No streaming Jar support

Pig support

Analytics

Run out of speaker volunteers

0.8 arrives: counters

Not compatible with 0.6

Have to watch the sales pitch again!

Cassandra 0.7 released! Secondary indexes

Provide talks and beer!

More meetups…

Please volunteer if you would like to give a talk. Internet fame awaits!

• My experience with Cassandra in production is positive

• Analytics is more difficult than it could be

• Welcome Brisk!

• Brisk combines Hadoop, Hive and Cassandra in a “distribution”

In a nutshell

• CassandraFS as an HDFS-compatible layer: no namenode, no SPOF

• Can split cluster for OLAP and OLTP workloads, scaling up either as required

Demonstrating brisk…

Building an Ad Network!

The plan:

• Simple data model – segment users into buckets
• System to put users in buckets via a pixel
• Real-time queries
• Analytics

We Have Your Kidneys
The ad network for the paranoid generation

• Cookie-based identification
• API provides:
  • Add user to a bucket (including ability to define expiry time)
  • Get buckets a user belongs to

Setup Brisk
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

• Step-by-step guide with pictures!
• Ubuntu 10.10 image with RAID 0 ephemeral disks
• Jairam has been bug-fixing some minor issues

Data model

CF = users
[userUUID][segmentID] = 1

CF = segments
[segmentID][userUUID] = 1
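The two column families are mirror images of each other, so every membership is written twice and either lookup becomes a single-row read. A minimal sketch of that idea, using plain Python dictionaries as hypothetical stand-ins for the column families (the function names are illustrative, not part of the demo code):

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two column families.
users = defaultdict(dict)     # users[userUUID][segmentID] = 1
segments = defaultdict(dict)  # segments[segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id):
    """Write both directions so either lookup is a single-row read."""
    users[user_uuid][segment_id] = 1
    segments[segment_id][user_uuid] = 1


def segments_for_user(user_uuid):
    """Column names of the user's row, in sorted order."""
    return sorted(users.get(user_uuid, {}))


def users_in_segment(segment_id):
    """Column names of the segment's row, in sorted order."""
    return sorted(segments.get(segment_id, {}))


add_user_to_segment("user-a", "sports")
add_user_to_segment("user-a", "travel")
add_user_to_segment("user-b", "sports")
```

The column value (1) carries no information here; the column name is the data.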

Data model

    create keyspace whyk
        with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
        and strategy_options = [{replication_factor:1}];
    use whyk;
    create column family users
        with comparator = 'AsciiType'
        and rows_cached = 5000;
    create column family segments
        with comparator = 'AsciiType'
        and rows_cached = 5000;

Our pixel

http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
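The pixel carries two parameters: an alphanumeric segment code and an expiry in seconds. A rough sketch of how such a request could be parsed and validated, assuming the expiry is optional and must be positive when present (both are assumptions; `parse_pixel_url` is a hypothetical helper, not taken from the demo code):

```python
import re
from urllib.parse import urlparse, parse_qs

SEGMENT_RE = re.compile(r"[A-Za-z0-9]+")
DIGITS_RE = re.compile(r"[0-9]+")


def parse_pixel_url(url):
    """Return (segment, expire_seconds); expire_seconds is None if absent.

    Raises ValueError when a parameter is malformed.
    """
    query = parse_qs(urlparse(url).query)
    segment = query.get("segment", [""])[0]
    if not SEGMENT_RE.fullmatch(segment):
        raise ValueError("segment must be an alphanumeric code")

    raw_expire = query.get("expire", [None])[0]
    if raw_expire is None:
        return segment, None  # assumption: no expiry requested
    if not DIGITS_RE.fullmatch(raw_expire) or int(raw_expire) <= 0:
        raise ValueError("expire must be a positive number of seconds")
    return segment, int(raw_expire)
```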

• We’ll use Cassandra’s expiring columns feature
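An expiring column is written with a time-to-live and stops being returned once that time has passed. The toy model below illustrates the behaviour only; it is not how Cassandra implements it, and the `ExpiringRow` class and its injectable clock are hypothetical:

```python
import time


class ExpiringRow:
    """Toy model of one row whose columns can carry a time-to-live."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._columns = {}  # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None):
        """Store a column; ttl is in seconds, None means never expire."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._columns[name] = (value, expires_at)

    def live_columns(self):
        """Return only the columns that have not expired yet."""
        now = self._clock()
        return {
            name: value
            for name, (value, expires_at) in self._columns.items()
            if expires_at is None or expires_at > now
        }
```

Passing a fake clock (for example `clock=lambda: fake_now[0]`) makes the expiry easy to exercise without waiting.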

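PHP code – uses phpcassa

    // $userUuid comes from the visitor's cookie; $segment and $expires
    // come from the pixel's query string.
    $pool = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    $segments = new ColumnFamily($pool, 'segments');

    // Write both directions: user -> segment and segment -> user.
    $users->insert(
        $userUuid,
        array($segment => 1),
        NULL,      // default TS
        $expires   // TTL in seconds
    );
    $segments->insert(
        $segment,
        array($userUuid => 1),
        NULL,      // default TS
        $expires   // TTL in seconds
    );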

Real-time access

http://wehaveyourkidneys.com/show.php

    $pool = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    // @todo this only gets first 100!
    $segments = $users->get($userUuid);

    header('Content-Type: application/json');
    echo json_encode(array_keys($segments));
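The @todo above is about wide rows: a single read returns only the first 100 columns. One common way around this is to page through the row, restarting each slice at the last column name seen. A sketch under stated assumptions: `fetch_slice(start, count)` is a hypothetical callable standing in for the client's slice read, and the row is assumed not to change while paging.

```python
def iter_all_columns(fetch_slice, page_size=100):
    """Yield every (name, value) pair of a wide row, page by page.

    fetch_slice(start, count) is a hypothetical callable returning up to
    `count` (name, value) pairs in column order, beginning at `start`
    inclusive ("" means the start of the row).
    """
    start = ""
    first_page = True
    while True:
        # Later pages ask for one extra column: the first one repeats
        # the last column of the previous page and is skipped.
        count = page_size if first_page else page_size + 1
        page = fetch_slice(start, count)
        if not first_page:
            page = page[1:]
        for name, value in page:
            yield name, value
        if len(page) < page_size:
            return  # a short page means the row is exhausted
        start = page[-1][0]
        first_page = False


def make_fake_row(n):
    """In-memory stand-in for a row with n columns, for demonstration."""
    columns = [("seg%04d" % i, 1) for i in range(n)]

    def fetch_slice(start, count):
        return [c for c in columns if c[0] >= start][:count]

    return fetch_slice


all_segments = [name for name, _ in iter_all_columns(make_fake_row(250))]
```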

Analytics

How many users in each segment?

Launch HIVE (very easy!)

root@brisk-01:~# brisk hive

    CREATE EXTERNAL TABLE whyk.users(userUuid string, segmentId string, value string)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");

    select segmentId, count(1) as total
    from whyk.users
    group by segmentId
    order by total desc;
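With the ":key,:column,:value" mapping, each column of a users row becomes one table row (userUuid, segmentId, value), so the query is a plain group-and-count. The same aggregation sketched in Python over made-up sample rows (the data and the `users_per_segment` name are illustrative only):

```python
from collections import Counter


def users_per_segment(rows):
    """rows: iterable of (user_uuid, segment_id, value), one per column."""
    totals = Counter(segment_id for _user, segment_id, _value in rows)
    # most_common() sorts by count, largest first, like ORDER BY total DESC.
    return totals.most_common()


# Made-up sample data in the shape the Hive table exposes.
sample_rows = [
    ("user-a", "sports", "1"),
    ("user-a", "travel", "1"),
    ("user-b", "sports", "1"),
    ("user-c", "sports", "1"),
    ("user-c", "travel", "1"),
    ("user-d", "music", "1"),
]

totals = users_per_segment(sample_rows)
```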

Summary

http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

Real-time access + batch analytics

• Easy to set up
• Easy to deploy mixed-mode clusters
• Easy to query (Hive)

No single point of failure

Further reading…

Installing the Brisk AMI
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

Key advantages of Brisk – from Jonathan Ellis
http://hackerne.ws/item?id=2528271

Why I’m very excited about DataStax’s Brisk – by Nathan Milford
http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/

The demo code on GitHub
https://github.com/davegardnerisme/we-have-your-kidneys
