nyc* 2013 - "advanced data processing: beyond queries and slices"

Before we get into the heavy stuff, let's imagine hacking around with C* for a bit...

You run a large video website

● CREATE TABLE videos (
    videoid uuid,
    videoname varchar,
    username varchar,
    description varchar,
    tags varchar,
    upload_date timestamp,
    PRIMARY KEY (videoid, videoname)
  );

● INSERT INTO videos (videoid, videoname, username, description, tags, upload_date)
  VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, 'My funny cat', 'tcodd',
          'My cat likes to play the piano! So funny.', 'cats,piano,lol', '2012-06-01 08:00:00');

You have a bajillion users

● CREATE TABLE users (
    username varchar,
    firstname varchar,
    lastname varchar,
    email varchar,
    password varchar,
    created_date timestamp,
    PRIMARY KEY (username)
  );

● INSERT INTO users (username, firstname, lastname, email, password, created_date)
  VALUES ('tcodd', 'Ted', 'Codd', '[email protected]',
          '5f4dcc3b5aa765d61d8327deb882cf99', '2011-06-01 08:00:00');

That's great! Then you ask yourself...

● Can I slice a slice (or sub query)?
● Can I do advanced where clauses?
● Can I union two slices server side?
● Can I join data from two tables without two request/response round trips?
● What about procedures?
● Can I write functions or aggregation functions?

Let's look at the APIs we have

http://www.slideshare.net/aaronmorton/apachecon-nafeb2013

But none of those APIs do what I want, and it seems simple enough to do...

IntraVert joins the “party” at the API layer

Why not just do it client side?

● Move processing close to data
– Idea borrowed from Hadoop

● Doing work close to the source can result in:
– Less network IO
– Less memory spent encoding/decoding 'throw away' data
– New storage and access paradigms

Vert.x + Cassandra

● What is Vert.x?
– Distributed event bus which spans the server and even penetrates into the client side for effortless 'real-time' web applications

● What are the cool features?
– Asynchronous
– Hot re-loadable modules
– Modules can be written in Groovy, Ruby, Java, JavaScript

http://vertx.io

Transport, payload, and batching

HTTP Transport

● HTTP is easy to use on firewalled networks
● Easy to secure
● Easy to compress
● The de facto way to do everything anyway
● IntraVert attempts to limit round-trips
– It does not try to provide a terse binary format

JSON Payload

● Simple nested types like list, map, String
● Request is composed of N operations
● Each operation has a configurable timeout
● Again, IntraVert attempts to limit round-trips
– It does not try to provide a terse message format

Why not use lightning-fast transport and serialization library X?

● X's language/code gen issues
● You probably cannot tcpdump X
● Net-admins don't like 90 jars for health checks
● IntraVert attempts to limit round-trips:
– Prepared statements
– Server side filtering
– Other cool stuff

Sample request and response

{"e": [
  {
    "type": "SETKEYSPACE",
    "op": { "keyspace": "myks" }
  }, {
    "type": "SETCOLUMNFAMILY",
    "op": { "columnfamily": "mycf" }
  }, {
    "type": "SLICE",
    "op": {
      "rowkey": "beers",
      "start": "Allagash",
      "end": "Sierra Nevada",
      "size": 9
    }
  }
] }

{
  "exception": null,
  "exceptionId": null,
  "opsRes": {
    "0": "OK",
    "1": "OK",
    "2": [{
      "name": "Founders",
      "value": "Breakfast Stout"
    }]
  }
}
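As a rough illustration of how several operations travel in one round trip, the sample request can be assembled on the client as a list of typed operations. The sketch below is not part of IntraVert: the class name `IntraRequestSketch` and its methods are hypothetical, and it hand-renders JSON only for the String and Number values that the sample uses. A real client would more likely use a JSON library and POST the result to the server.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical client-side helper for illustration; not part of IntraVert. */
final class IntraRequestSketch {
    private final List<String> ops = new ArrayList<>();

    /** Builds an insertion-ordered "op" body from alternating keys and values. */
    static Map<String, Object> op(Object... keysAndValues) {
        Map<String, Object> body = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            body.put(String.valueOf(keysAndValues[i]), keysAndValues[i + 1]);
        }
        return body;
    }

    /** Appends one operation; this sketch only renders String and Number values. */
    IntraRequestSketch add(String type, Map<String, Object> body) {
        String fields = body.entrySet().stream()
                .map(e -> quote(e.getKey()) + ":" + render(e.getValue()))
                .collect(Collectors.joining(","));
        ops.add("{\"type\":" + quote(type) + ",\"op\":{" + fields + "}}");
        return this;
    }

    /** Wraps all operations in the top-level "e" array used by the sample request. */
    String toJson() {
        return "{\"e\":[" + String.join(",", ops) + "]}";
    }

    private static String render(Object value) {
        return (value instanceof Number) ? value.toString() : quote(String.valueOf(value));
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Rebuilds the three-operation sample request from the slide. */
    static String sampleSlice() {
        return new IntraRequestSketch()
                .add("SETKEYSPACE", op("keyspace", "myks"))
                .add("SETCOLUMNFAMILY", op("columnfamily", "mycf"))
                .add("SLICE", op("rowkey", "beers", "start", "Allagash",
                        "end", "Sierra Nevada", "size", 9))
                .toJson();
    }

    public static void main(String[] args) {
        System.out.println(sampleSlice());
    }
}
```

The operations keep their position in the "e" array, which is what lets the response refer to each result by index ("0", "1", "2").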

Server side filter

Imagine your data looks like...

{ "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }

{ "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }

{ "rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA" }

Application requirement

● The user wants to know which beers are “Breakfast Stout”s
● Common “solutions”:
– Write a copy of the data sorted by type
– Request all the data and parse it on the client side

Using an IntraVert filter

● Send a function to the server
● Function is applied to subsequent get or slice operations
● Only results of the filter are returned to the client
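The three points above amount to: run a function over every column of a get or slice result, and keep only what it returns. A minimal Java sketch of that contract follows. It mimics the behavior described here and is not IntraVert's actual implementation; the `FilterSketch` class and its helpers are hypothetical, and each column is modeled as a simple name/value map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: mimics the filter contract described above. */
final class FilterSketch {

    /** Applies the filter to each row and keeps whatever non-null value it returns. */
    static List<Map<String, Object>> applyFilter(
            List<Map<String, Object>> rows,
            Function<Map<String, Object>, Map<String, Object>> filter) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> result = filter.apply(row);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    /** Builds one column as a name/value map (an assumed shape for this sketch). */
    static Map<String, Object> column(String name, String value) {
        Map<String, Object> col = new LinkedHashMap<>();
        col.put("name", name);
        col.put("value", value);
        return col;
    }

    /** Runs a "stouts" filter over the sample beers row. */
    static List<Map<String, Object>> demo() {
        List<Map<String, Object>> slice = List.of(
                column("Allagash", "Allagash Tripel"),
                column("Founders", "Breakfast Stout"),
                column("Dogfish Head", "Hellhound IPA"));
        return applyFilter(slice,
                row -> "Breakfast Stout".equals(row.get("value")) ? row : null);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the filter returns a row rather than a boolean, the same hook can also transform a row before it is sent back, not just drop it.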

Defining a filter: JavaScript

● Syntax to create a filter

{
  "type": "CREATEFILTER",
  "op": {
    "name": "stouts",
    "spec": "javascript",
    "value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }"
  }
},

Defining a filter: Groovy/Java

● We can define a Groovy closure or a Java filter

{
  "type": "CREATEFILTER",
  "op": {
    "name": "stouts",
    "spec": "groovy",
    "value": "{ row -> if (row[\"value\"] == \"Breakfast Stout\") return row else return null }"
  }
},

Filter flow

Common filter use cases

● Transform data
● Prune columns/rows like a where clause
● Extract data from complex fields (json, xml, protobuf, etc)

Some light relief

Server Side Multi-Processor

It's the cure for your “redis envy”

Imagine your data looks like...

● { "row key": "1", "name": "a", val... }
● { "row key": "1", "name": "b", val... }
● { "row key": "4", "name": "a", val... }
● { "row key": "4", "name": "z", val... }

Application Requirements

● User wishes to intersect the column names of two slices/queries
● Common “solutions”
– Pull all results to client and apply the intersection there

Server Side MultiProcessor

● Send a class that implements the MultiProcessor interface to the server
● public List<Map> multiProcess (Map<Integer,Object> input, Map params);
● Do one or more get/slice operations as input
● Invoke MultiProcessor on input
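To make the intersection requirement concrete, here is a sketch of a class with the `multiProcess` signature shown above. Several things are assumptions for illustration: the `MultiProcessor` interface is redeclared locally as a stand-in for the one that ships with IntraVert, the input map is assumed to hold one list of name/value column maps per prior get/slice operation (keyed by operation index), and the class name `IntersectNames` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Local stand-in for the interface on the slide; the real one ships with IntraVert. */
interface MultiProcessor {
    List<Map> multiProcess(Map<Integer, Object> input, Map params);
}

/** Keeps only the column names present in every input slice. */
final class IntersectNames implements MultiProcessor {

    @Override
    public List<Map> multiProcess(Map<Integer, Object> input, Map params) {
        Set<Object> common = null;
        for (Object slice : input.values()) {
            // Assumed shape: each slice is a list of maps carrying a "name" entry.
            Set<Object> names = new LinkedHashSet<>();
            for (Object column : (List<?>) slice) {
                names.add(((Map<?, ?>) column).get("name"));
            }
            if (common == null) {
                common = names;
            } else {
                common.retainAll(names);
            }
        }
        List<Map> out = new ArrayList<>();
        if (common != null) {
            for (Object name : common) {
                Map<String, Object> result = new LinkedHashMap<>();
                result.put("name", name);
                out.add(result);
            }
        }
        return out;
    }

    static Map<String, Object> col(String name) {
        Map<String, Object> column = new LinkedHashMap<>();
        column.put("name", name);
        return column;
    }

    /** Intersects the column names of the two sample rows ("1" and "4"). */
    static List<Map> demo() {
        Map<Integer, Object> input = new LinkedHashMap<>();
        input.put(0, List.of(col("a"), col("b"))); // slice of row key "1"
        input.put(1, List.of(col("a"), col("z"))); // slice of row key "4"
        return new IntersectNames().multiProcess(input, Map.of());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the sample data, only the column name shared by both rows survives, so the client receives one small result instead of two full slices.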

Multi-processor flow

Multi-processor use cases

● Union of N slices
● Intersection of N slices
● Some “Join” scenarios

Fat client becomes the 'Phat client'

Imagine you want to insert this data

● User wishes to enter this event for multiple column families
– 09/10/2011 11:12:13
– App=Yahoo
– Platform=iOS
– OS=4.3.4
– Device=iPad2,1
– Resolution=768x1024
– Events
  – videoPlayPercent=38
  – Taste=great

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Inserting the data

aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

aggregateKeys(KEYSPACE \ "ByMonth") = month // 201109
aggregateKeys(KEYSPACE \ "ByDay") = day // 20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour // 2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute // 201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
      val (columnFamily, row) = tuple
      if (row != null && row.size > 0)
        rows add (columnFamily -> row has columnName inc) // increment the counter
    }
  }
}

ccAppPlatformOSVersionDeviceResolution(r)

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Solution

● Send the data once and compute the N permutations on the server side

public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
  // Read the parameters the client sent once.
  JsonObject params = request.getObject("mpparams");
  String uid = params.getString("userid");
  String fname = params.getString("fname");
  String lname = params.getString("lname");
  String city = params.getString("city");

  // Build one column write per field on the server side.
  RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
  QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
  rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
  QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
  rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
  ...
  try {
    StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
    // Report success only when the mutation went through.
    response.putString("status", "OK");
  } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
    e.printStackTrace();
    response.putString("status", "FAILED");
  }
}
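For the counter example earlier, the "N permutations" are the time-bucketed row keys (ByMonth, ByDay, ByHour, ByMinute). As a sketch of what a server-side processor could derive from a single event timestamp, the code below maps each column family name to a row key using `java.time`. The class name `TimeBucketSketch` and the exact key patterns are assumptions for illustration; the original slides do not show how the keys are computed.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: derives one row key per time bucket from an event timestamp. */
final class TimeBucketSketch {

    /** Returns column family name -> row key, in month/day/hour/minute order. */
    static Map<String, String> bucketKeys(LocalDateTime eventTime) {
        // Assumed key layouts: digits only, coarsest unit first.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("ByMonth", "yyyyMM");
        patterns.put("ByDay", "yyyyMMdd");
        patterns.put("ByHour", "yyyyMMddHH");
        patterns.put("ByMinute", "yyyyMMddHHmm");

        Map<String, String> keys = new LinkedHashMap<>();
        patterns.forEach((columnFamily, pattern) ->
                keys.put(columnFamily, eventTime.format(DateTimeFormatter.ofPattern(pattern))));
        return keys;
    }

    public static void main(String[] args) {
        // One event in, four row keys out; the client only sends the event once.
        System.out.println(bucketKeys(LocalDateTime.of(2011, 9, 10, 11, 12, 13)));
    }
}
```

A processor could then issue one counter increment per returned key, so the fan-out happens next to the data rather than in the client.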

Service Processor Flow

IntraVert status

● Still pre 1.0
● Good docs
– https://github.com/zznate/intravert-ug/wiki/_pages
● Functionally equivalent to Thrift (most features ported)
● CQL support
● Virgil (coming soon)
● HBase-like scanners (coming soon)

Hack at it

https://github.com/zznate/intravert-ug

Questions?
