cassandra + hadoop = brisk


Upload: dave-gardner

Post on 11-May-2015


DESCRIPTION

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

TRANSCRIPT

Page 1: Cassandra + Hadoop = Brisk

Our sponsors:

Acunu

London

Page 2: Cassandra + Hadoop = Brisk

But first, a short back story…

Page 3: Cassandra + Hadoop = Brisk

NoSQL!

New Ad Targeting Product

Join

May 2010

[Timeline figure: markers 1–12]

Page 4: Cassandra + Hadoop = Brisk

16 node cluster running smoothly!

Start meetup group

Learn, learn, learn

GC HELL!

[Timeline figure: markers 13–24]

Page 5: Cassandra + Hadoop = Brisk

Hire a Java dev from Cassandra London!

No Hive support CASSANDRA-913

No streaming Jar support

Pig support

Analytics

[Timeline figure: markers 25–36]

Page 6: Cassandra + Hadoop = Brisk

Run out of speaker volunteers

0.8 arrives: Counters, CQL

Not compatible with 0.6

Have to watch the sales pitch again!

Cassandra 0.7 released! Secondary indexes

Provide talks and beer!

More meetups…

[Timeline figure: markers 25–36]

Page 7: Cassandra + Hadoop = Brisk

Please volunteer if you would like to give a talk. Internet fame awaits!

Page 8: Cassandra + Hadoop = Brisk

• My experience with Cassandra in production is positive

• Analytics is more difficult than it could be

• Welcome Brisk!

Page 9: Cassandra + Hadoop = Brisk

• Brisk combines Hadoop, Hive and Cassandra in a “distribution”

Page 10: Cassandra + Hadoop = Brisk
Page 11: Cassandra + Hadoop = Brisk

In a nutshell

• CassandraFS as an HDFS-compatible layer: no NameNode, no single point of failure (SPOF)

• Can split cluster for OLAP and OLTP workloads, scaling up either as required

Page 12: Cassandra + Hadoop = Brisk

Demonstrating Brisk…

Building an Ad Network!

Page 13: Cassandra + Hadoop = Brisk


Page 14: Cassandra + Hadoop = Brisk

The plan:

• Simple data model – segment users into buckets
• System to put users in buckets via a pixel
• Real-time queries
• Analytics

Page 15: Cassandra + Hadoop = Brisk

We Have Your Kidneys
The ad-network for the paranoid generation

• Cookie-based identification
• API provides:
  • Add user to a bucket (including ability to define expiry time)
  • Get buckets a user belongs to

Page 16: Cassandra + Hadoop = Brisk

Set up Brisk
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

• Step-by-step guide with pictures!
• Ubuntu 10.10 image with RAID 0 ephemeral disks
• Jairam has been bug-fixing some minor issues

Page 17: Cassandra + Hadoop = Brisk
Page 18: Cassandra + Hadoop = Brisk

Data model

CF = users
    [userUUID][segmentID] = 1

CF = segments
    [segmentID][userUUID] = 1
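The two column families form a two-way index: every membership is written once per direction, so "which segments is this user in?" and "which users are in this segment?" are each a single-row read. A minimal in-memory sketch of that idea in Python follows. The dictionaries stand in for the column families and are not a Cassandra client; `add_user_to_segment` and the sample IDs are hypothetical names used only for illustration.

```python
from collections import defaultdict

# In-memory stand-ins for the two column families (not a Cassandra client).
users = defaultdict(dict)     # users[userUUID][segmentID] = 1
segments = defaultdict(dict)  # segments[segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id):
    """Write both directions so either lookup is a single-row read."""
    users[user_uuid][segment_id] = 1
    segments[segment_id][user_uuid] = 1


add_user_to_segment("user-a", "sports")
add_user_to_segment("user-a", "travel")
add_user_to_segment("user-b", "sports")

print(sorted(users["user-a"]))     # segments for one user
print(sorted(segments["sports"]))  # users in one segment
```

The cost of this layout is that every write happens twice; the benefit is that neither query needs a scan or a secondary index.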

Page 19: Cassandra + Hadoop = Brisk

Data model

    create keyspace whyk
    ... with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    ... and strategy_options = [{replication_factor:1}];

    create column family users
    ... with comparator = 'AsciiType'
    ... and rows_cached = 5000;

    create column family segments
    ... with comparator = 'AsciiType'
    ... and rows_cached = 5000;

Page 20: Cassandra + Hadoop = Brisk


Page 21: Cassandra + Hadoop = Brisk

Our pixel

http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>

• We’ll use Cassandra’s expiring columns feature
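To make the pixel's behaviour concrete, here is a rough Python sketch of the two steps involved: reading `segment` and `expire` from the pixel URL, and letting a column vanish once its time-to-live has passed. This is a toy model under stated assumptions, not the demo's code: `parse_pixel_request` and `ExpiringRow` are hypothetical names, the validation rules are a guess at what "alphanumeric code" and "number of seconds" imply, and real expiry is handled by Cassandra itself rather than by the application.

```python
import time
from urllib.parse import urlparse, parse_qs


def parse_pixel_request(url):
    """Extract (segment, ttl_seconds) from an add.php-style pixel URL."""
    query = parse_qs(urlparse(url).query)
    segment = query.get("segment", [""])[0]
    expire = query.get("expire", [""])[0]
    if not (segment.isascii() and segment.isalnum()):
        raise ValueError("segment must be an alphanumeric code")
    if not (expire.isascii() and expire.isdigit()) or int(expire) <= 0:
        raise ValueError("expire must be a positive number of seconds")
    return segment, int(expire)


class ExpiringRow:
    """Toy model of a row whose columns disappear after their TTL."""

    def __init__(self):
        self._columns = {}  # column name -> absolute expiry time

    def insert(self, column, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._columns[column] = now + ttl_seconds

    def live_columns(self, now=None):
        now = time.time() if now is None else now
        return sorted(c for c, expires_at in self._columns.items() if expires_at > now)


segment, ttl = parse_pixel_request(
    "http://wehaveyourkidneys.com/add.php?segment=sports01&expire=3600"
)
row = ExpiringRow()
row.insert(segment, ttl, now=0)
print(row.live_columns(now=10))    # still present shortly after the write
print(row.live_columns(now=3600))  # gone once the TTL has elapsed
```

Passing `now` explicitly keeps the example deterministic; in normal use both methods fall back to the current clock.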

Page 22: Cassandra + Hadoop = Brisk

PHP code – uses phpcassa

    $pool = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    $segments = new ColumnFamily($pool, 'segments');

    $users->insert(
        $userUuid,
        array($segment => 1),
        NULL, // default TS
        $expires
    );
    $segments->insert(
        $segment,
        array($userUuid => 1),
        NULL, // default TS
        $expires
    );

Page 23: Cassandra + Hadoop = Brisk

Real-time access

http://wehaveyourkidneys.com/show.php

    $pool = new ConnectionPool('whyk', array('localhost'));
    $users = new ColumnFamily($pool, 'users');
    // @todo this only gets first 100!
    $segments = $users->get($userUuid);

    header('Content-Type: application/json');
    echo json_encode(array_keys($segments));
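The `@todo` in the code notes that a single read returns only the first 100 columns. One common way around that limit is to page through the row: fetch a slice, then start the next slice at the last column name seen and skip the repeated first entry. The Python sketch below illustrates the loop against an in-memory row. `get_slice` is a hypothetical stand-in for a slice query with an inclusive start column, not a phpcassa call, and the row contents are made up.

```python
def get_slice(row, column_start="", column_count=100):
    """Toy slice query: sorted column names >= column_start, capped at column_count."""
    names = sorted(name for name in row if name >= column_start)
    return names[:column_count]


def iter_all_columns(row, page_size=100):
    """Yield every column name by repeatedly slicing from the last one seen."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start = ""
    first_page = True
    while True:
        page = get_slice(row, start, page_size)
        # The start column is inclusive, so later pages repeat the previous last column.
        fresh = page if first_page else page[1:]
        for name in fresh:
            yield name
        if len(page) < page_size:
            return
        start = page[-1]
        first_page = False


# A hypothetical user who belongs to 250 segments.
row = {f"seg{i:04d}": 1 for i in range(250)}
all_segments = list(iter_all_columns(row))
print(len(all_segments))
```

A short page signals the end of the row, so a row with exactly one page's worth of columns costs one extra, nearly empty, read.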

Page 24: Cassandra + Hadoop = Brisk

Analytics

How many users in each segment?

Launch HIVE (very easy!)

root@brisk-01:~# brisk hive

Page 25: Cassandra + Hadoop = Brisk

    CREATE EXTERNAL TABLE whyk.users(userUuid string, segmentId string, value string)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");

    select segmentId, count(1) as total
    from whyk.users
    group by segmentId
    order by total desc;
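For readers less familiar with Hive, the query treats each Cassandra column as one (userUuid, segmentId, value) row, groups those rows by segment, and sorts the counts in descending order. The same aggregation on a handful of made-up rows, as a plain Python sketch (the sample data and the `users_per_segment` name are illustrative assumptions; Hive runs the equivalent work as a distributed job):

```python
from collections import Counter

# Hypothetical rows as the Hive table would expose them: (userUuid, segmentId, value).
rows = [
    ("user-a", "sports", "1"),
    ("user-a", "travel", "1"),
    ("user-b", "sports", "1"),
    ("user-c", "sports", "1"),
    ("user-c", "travel", "1"),
    ("user-c", "finance", "1"),
]


def users_per_segment(rows):
    """Count rows per segmentId and return (segmentId, total) pairs, largest first."""
    totals = Counter(segment_id for _user, segment_id, _value in rows)
    return totals.most_common()


for segment_id, total in users_per_segment(rows):
    print(segment_id, total)
```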

Page 26: Cassandra + Hadoop = Brisk

Summary

http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

Page 27: Cassandra + Hadoop = Brisk

Real-time access + batch analytics

Page 28: Cassandra + Hadoop = Brisk

Easy

• Easy to set up
• Easy to deploy mixed-mode clusters
• Easy to query (Hive)

Page 29: Cassandra + Hadoop = Brisk

No Single Point of Failure

Page 30: Cassandra + Hadoop = Brisk

Further reading…

Installing the Brisk AMI
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami

Key advantages of Brisk – from Jonathan Ellis
http://hackerne.ws/item?id=2528271

Why I’m very excited about DataStax’s Brisk – by Nathan Milford
http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/

The demo code on Github
https://github.com/davegardnerisme/we-have-your-kidneys