nyc* 2013 - "advanced data processing: beyond queries and slices"
TRANSCRIPT
Before we get into the heavy stuff, let's imagine hacking around with C* for a bit...
You run a large video website
● CREATE TABLE videos (videoid uuid,videoname varchar,username varchar,description varchar, tags varchar,upload_date timestamp,PRIMARY KEY (videoid,videoname) );
● INSERT INTO videos (videoid, videoname, username, description, tags, upload_date) VALUES ('99051fe9-6a9c-46c2-b949-38ef78858dd0','My funny cat','ctodd', 'My cat likes to play the piano! So funny.','cats,piano,lol','2012-06-01 08:00:00');
You have a bajillion users
● CREATE TABLE users (username varchar,firstname varchar,lastname varchar,email varchar,password varchar,created_date timestamp,PRIMARY KEY (username));
● INSERT INTO users (username, firstname, lastname, email, password, created_date) VALUES ('tcodd','Ted','Codd', '[email protected]','5f4dcc3b5aa765d61d8327deb882cf99','2011-06-01 08:00:00');
You can query up a storm
● SELECT firstname,lastname FROM users WHERE username='tcodd';
firstname | lastname
-----------+----------
Ted | Codd
● SELECT * FROM videos WHERE videoid = 'b3a76c6b-7c7f-4af6-964f-803a9283c401' and videoname>'N';
videoid | videoname | description | tags | upload_date | username
b3a76c6b-7c7f-4af6-964f-803a9283c401 | Now my dog plays piano! | My dog learned to play the piano because of the cat. | dogs,piano,lol | 2012-08-30 16:50:00+0000 | ctodd
That's great! Then you ask yourself...
● Can I slice a slice (or sub query)?
● Can I do advanced where clauses?
● Can I union two slices server side?
● Can I join data from two tables without two request/response round trips?
● What about procedures?
● Can I write functions or aggregation functions?
Let's look at the APIs we have
http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
But none of those APIs do what I want, and it seems simple enough to do...
Intravert joins the “party” at the API Layer
Why not just do it client side?
● Move processing close to data
– Idea borrowed from Hadoop
● Doing work close to the source can result in:
– Less network IO
– Less memory spent encoding/decoding 'throw away' data
– New storage and access paradigms
Vert.x + Cassandra
● What is Vert.x?
– Distributed Event Bus which spans the server and even penetrates into client side for effortless 'real-time' web applications
● What are the cool features?
– Asynchronous
– Hot re-loadable modules
– Modules can be written in Groovy, Ruby, Java, JavaScript
http://vertx.io
Transport, payload, and batching
HTTP Transport
● HTTP is easy to use on firewalled networks
● Easy to secure
● Easy to compress
● The de facto way to do everything anyway
● IntraVert attempts to limit round-trips
– It does not try to provide a terse binary format
JSON Payload
● Simple nested types like list, map, String
● Request is composed of N operations
● Each operation has a configurable timeout
● Again, IntraVert attempts to limit round-trips
– It does not try to provide a terse message format
Why not use lightning-fast transport and serialization library X?
● X's language/code gen issues
● You probably cannot tcpdump X
● Net-admins don't like 90 jars for health checks
● IntraVert attempts to limit round-trips:
– Prepared statements
– Server side filtering
– Other cool stuff
Sample request and response
{"e": [ {
"type": "SETKEYSPACE",
"op": { "keyspace": "myks" }
}, {
"type": "SETCOLUMNFAMILY",
"op": { "columnfamily": "mycf" }
}, {
"type": "SLICE",
"op": {
"rowkey": "beers",
"start": "Allagash",
"end": "Sierra Nevada",
"size": 9
} } ] }
{
"exception":null,
"exceptionId":null,
"opsRes": {
"0":"OK",
"1":"OK",
"2":[{
"name":"Founders",
"value":"Breakfast Stout"
}]
}}
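A rough sketch of how a client might assemble the batched request above and pick results out of the response by operation index. The helper names build_request and result_for are made up for illustration and are not part of IntraVert. The HTTP call itself is left out because the slides do not give the endpoint.

```python
import json

def build_request(*ops):
    """Wrap (type, op) pairs in the {"e": [...]} envelope from the sample request."""
    return {"e": [{"type": op_type, "op": op_body} for op_type, op_body in ops]}

def result_for(response, index):
    """Return the result of the operation at `index`.

    In the sample response, opsRes is keyed by the operation's position as a string.
    """
    if response.get("exception") is not None:
        raise RuntimeError(response["exception"])
    return response["opsRes"][str(index)]

# Three operations travel in one round trip.
request = build_request(
    ("SETKEYSPACE", {"keyspace": "myks"}),
    ("SETCOLUMNFAMILY", {"columnfamily": "mycf"}),
    ("SLICE", {"rowkey": "beers", "start": "Allagash",
               "end": "Sierra Nevada", "size": 9}),
)
body = json.dumps(request)  # the JSON payload that would be sent over HTTP

# The sample response from the slide, parsed as the client would see it.
sample_response = json.loads(
    '{"exception": null, "exceptionId": null, "opsRes": '
    '{"0": "OK", "1": "OK", "2": [{"name": "Founders", "value": "Breakfast Stout"}]}}'
)
slice_result = result_for(sample_response, 2)
```

The point of the envelope is that the slice result comes back alongside the two setup operations, so nothing here needs a second request.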
Server side filter
Imagine your data looks like...
{ "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
{ "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
{ "rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA" }
Application requirement
● The user wants to know which beers are “Breakfast Stout”s
● Common “solutions”:
– Write a copy of the data sorted by type
– Request all the data and parse on client side
Using an IntraVert filter
● Send a function to the server
● Function is applied to subsequent get or slice operations
● Only results of the filter are returned to the client
Defining a filter JavaScript
● Syntax to create a filter
{
"type": "CREATEFILTER",
"op": {
"name": "stouts",
"spec": "javascript",
"value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }"
}
},
Defining a filter Groovy/Java
● We can define a groovy closure or Java filter
{
"type": "CREATEFILTER",
"op": {
"name": "stouts",
"spec": "groovy",
"value": "{ row -> if (row[\"value\"] == \"Breakfast Stout\") return row else return null }"
}
},
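To make the filter semantics concrete, here is a small Python emulation of what the "stouts" filter does to a slice: the function sees one column at a time and anything it maps to null is dropped. This is only a sketch of the behaviour described on the slides, not IntraVert code. apply_filter is a hypothetical name.

```python
def stouts(row):
    """Same logic as the JavaScript/Groovy filter: keep Breakfast Stouts only."""
    return row if row["value"] == "Breakfast Stout" else None

def apply_filter(filter_fn, rows):
    """Run filter_fn over each column of a slice and drop the None results."""
    results = []
    for row in rows:
        kept = filter_fn(row)
        if kept is not None:
            results.append(kept)
    return results

# The example columns of the "beers" row.
beers = [
    {"name": "Allagash", "value": "Allagash Tripel"},
    {"name": "Founders", "value": "Breakfast Stout"},
    {"name": "Dogfish Head", "value": "Hellhound IPA"},
]

filtered = apply_filter(stouts, beers)
```

Only the Founders column survives, so only that column would cross the network.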
Filter flow
Common filter use cases
● Transform data
● Prune columns/rows like a where clause
● Extract data from complex fields (json, xml, protobuf, etc)
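The "extract data from complex fields" case could look something like the following, again emulated in Python. It assumes the column value holds a JSON document with an "abv" field. That schema is invented for the example and does not come from the slides.

```python
import json

def extract_abv(row):
    """Hypothetical filter: parse a JSON column value and return only its 'abv'.

    Columns whose value is not a JSON object, or has no 'abv', are pruned.
    """
    try:
        doc = json.loads(row["value"])
    except (TypeError, ValueError):
        return None
    if not isinstance(doc, dict) or "abv" not in doc:
        return None
    # Transform: replace the whole document with the single field we need.
    return {"name": row["name"], "value": doc["abv"]}

rows = [
    {"name": "Founders", "value": '{"style": "stout", "abv": 8.3}'},
    {"name": "Mystery", "value": "not json"},
    {"name": "NoAbv", "value": '{"style": "ipa"}'},
]

extracted = [r for r in (extract_abv(row) for row in rows) if r is not None]
```

The filter both transforms and prunes in one pass, which is why the same mechanism covers all three use cases in the list.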
Some light relief
Server Side Multi-Processor
It's the cure for your “redis envy”
Imagine your data looks like...
● { "row key": "1", name: "a", val... }
● { "row key": "1", name: "b", val... }
● { "row key": "4", name: "a", val... }
● { "row key": "4", name: "z", val... }
Application Requirements
● User wishes to intersect the column names of two slices/queries
● Common “solutions”
– Pull all results to client and apply the intersection there
Server Side MultiProcessor
● Send a class that implements the MultiProcessor interface to the server
● public List<Map> multiProcess (Map<Integer,Object> input, Map params);
● Do one or more get/slice operations as input
● Invoke MultiProcessor on input
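The intersection requirement can be sketched in Python with a function shaped like the multiProcess signature above. It assumes the input map is keyed by the index of each earlier get/slice operation and that each value is a list of name/value columns. The slides do not spell that out, so treat it as a guess.

```python
def intersect_column_names(inputs, params):
    """Return the column names that appear in every input slice.

    inputs: {operation index: [{"name": ..., "value": ...}, ...]}  (assumed shape)
    params: unused here, kept to mirror multiProcess(input, params)
    """
    slices = [inputs[index] for index in sorted(inputs)]
    if not slices:
        return []
    common = {column["name"] for column in slices[0]}
    for columns in slices[1:]:
        common &= {column["name"] for column in columns}
    return [{"name": name} for name in sorted(common)]

# Row "1" has columns a and b; row "4" has columns a and z.
inputs = {
    0: [{"name": "a", "value": "..."}, {"name": "b", "value": "..."}],
    1: [{"name": "a", "value": "..."}, {"name": "z", "value": "..."}],
}
shared = intersect_column_names(inputs, {})
```

Run on the server, this returns one small list instead of shipping both slices to the client.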
Multi-processor flow
Multi-processor use cases
● Union N slices
● Intersection N slices
● Some “Join” scenarios
Fat client becomes the 'Phat client'
Imagine you want to insert this data
● User wishes to enter this event for multiple column families
– 09/10/2011 11:12:13
– App=Yahoo
– Platform=iOS
– OS=4.3.4
– Device=iPad2,1
– Resolution=768x1024
– Events: videoPlayPercent=38, Taste=great
http://www.slideshare.net/charmalloc/jsteincassandranyc2011
Inserting the data
aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"
def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}
aggregateKeys(KEYSPACE \ "ByMonth") = month //201109
aggregateKeys(KEYSPACE \ "ByDay") = day //20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute //201109101213
def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
      val (columnFamily, row) = tuple
      if (row != null && row.size > 0)
        rows add (columnFamily -> row has columnName inc) //increment the counter
    }
  }
}
ccAppPlatformOSVersionDeviceResolution(r)
http://www.slideshare.net/charmalloc/jsteincassandranyc2011
Solution
● Send the data once and compute the N permutations on the server side
public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
  JsonObject params = request.getObject("mpparams");
  String uid = params.getString("userid");
  String fname = params.getString("fname");
  String lname = params.getString("lname");
  String city = params.getString("city");
  RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
  QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
  rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
  QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
  rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
  ...
  try {
    StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
    // Report success only if the write went through, so a failure is not overwritten
    response.putString("status", "OK");
  } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
    e.printStackTrace();
    response.putString("status", "FAILED");
  }
}
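For the event-counter example, "compute the N permutations on the server side" might amount to something like this Python sketch: one event fans out into one counter increment per time-bucketed column family. The bucket names follow the ByMonth/ByDay/ByHour/ByMinute keys in the Scala snippet. The function name and the tuple layout are assumptions made for illustration.

```python
from datetime import datetime

# Column family suffix -> strftime pattern for its row key (assumed mapping).
BUCKET_FORMATS = {
    "ByMonth": "%Y%m",
    "ByDay": "%Y%m%d",
    "ByHour": "%Y%m%d%H",
    "ByMinute": "%Y%m%d%H%M",
}

def counter_increments(event_time, column_name):
    """Expand one event into (column family, row key, column name) increments."""
    return [
        (column_family, event_time.strftime(pattern), column_name)
        for column_family, pattern in BUCKET_FORMATS.items()
    ]

def demo():
    """Fan a single made-up event out into its four bucket increments."""
    return counter_increments(datetime(2013, 3, 20, 9, 5, 0), "app+platform#Yahoo")
```

The client sends the event once. The four writes are derived where the data lives, which is the same trade the Java processor makes for its user columns.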
Service Processor Flow
IntraVert status
● Still pre 1.0
● Good docs
– https://github.com/zznate/intravert-ug/wiki/_pages
● Functionally equivalent to Thrift (most features ported)
● CQL support
● Virgil (coming soon)
● HBase-like scanners (coming soon)
Hack at it
https://github.com/zznate/intravert-ug
Questions?