precog & mongodb user group: skyrocket your analytics

Skyrocket your Analytics

MongoDB Meetup on December 10, 2012

www.precog.com

@precogio

Nov - Dec 2012

■ Welcome to the Precog & MongoDB Meetup!

■ Questions? Please ask away!

welcome & agenda

7:00 - 7:30  Overview of Precog for MongoDB by Derek Chen-Becker

7:30 - 7:45  Break (grab a beer, drink and snacks)

7:45 - 8:15  Analyzing Big Data with Quirrel by John A. De Goes

8:15 - 8:30  Precog Challenge Problems! Win some prizes!

■ Precog Team

Derek Chen-Becker, Lead Infrastructure Engineer

John A. De Goes, CEO/Founder

Kris Nuttycombe, Dir of Engineering

Nathan Lubchenco, Developer Evangelist

■ MongoDB Host

Clay Mcllrath

■ Thank you to Google for hosting us!

who we are

Current MongoDB Support for Analytics

Derek Chen-Becker, Precog Lead Infrastructure Engineer, @dchenbecker

■ Mongo has support for a small set of simple aggregation primitives

○ count - returns the count of a given collection's documents with optional filtering

○ distinct - returns the distinct values for given selector criteria

○ group - returns groups of documents based on given key criteria. Group cannot be used in sharded configurations

current mongodb support for analytics

> db.london_medals.group({
    key : { "Country" : 1 },
    reduce : function(curr, result) { result.total += 1 },
    initial : { total : 0, fullTotal : db.london_medals.count() },
    finalize : function(result) { result.percent = result.total * 100 / result.fullTotal }
  })

[

{"Country" : "Great Britain", "total" : 88, "fullTotal" : 1019, "percent" : 8.635917566241414},

{"Country" : "Dominican Republic", "total" : 2, "fullTotal" : 1019, "percent" : 0.19627085377821393},

{"Country" : "Denmark", "total" : 16, "fullTotal" : 1019, "percent" : 1.5701668302257115},

...

■ More sophisticated queries are possible, but require a lot of JS and you'll hit the limits pretty quickly

■ Group cannot be used in sharded configurations. For that you need...


■ Map/Reduce: Exactly what its name says.

■ You utilize JavaScript functions to map your documents' data, then reduce that data into a form of your choosing.


[Diagram: Input Collection ⇒ Mapping Function ⇒ Reducing Function ⇒ Result Document or Output Collection]

■ The mapping function redefines this to be the current document

■ Output mapped keys and values are generated via the emit function

■ Emit can be called zero or more times for a single document

function () { emit(this.Countryname, { count : 1 }); }

function () {
  for (var i = 0; i < this.Pupils.length; i++) {
    emit(this.Pupils[i].name, { count : 1 });
  }
}

function () {
  if ((this.parents.age - this.age) < 25) { emit(this.age, { income : this.income }); }
}


■ The reduction function is used to aggregate the outputs from the mapping function

■ The function receives two inputs: the key for the elements being reduced, and the values being reduced

■ The result of the reduction must be in the same format as the input values, and the reduction function must be idempotent

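function (key, values) {
  var count = 0;
  /* iterate by index: for...in would yield array indices, not the emitted values */
  for (var i = 0; i < values.length; i++) {
    count += values[i].count;
  }
  return { "count" : count };
}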
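To see why the reduce output has to match the shape of the emitted values, here is a minimal in-memory sketch in plain JavaScript. It does not touch MongoDB: `runMapReduce`, the module-level `emit`, and the sample documents are hypothetical stand-ins. The sketch reduces each key's values in chunks and then re-reduces the partial results, which is roughly what a server may do when it calls reduce more than once for the same key.

```javascript
// Module-level `emit`, standing in for the function MongoDB provides to map functions.
let emit;

// Hypothetical in-memory stand-in for a map/reduce run, for illustration only.
function runMapReduce(docs, mapFn, reduceFn, chunkSize = 2) {
  const emitted = new Map(); // key -> array of emitted values
  emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // The map function sees the current document as `this`.
  for (const doc of docs) mapFn.call(doc);

  const results = {};
  for (const [key, values] of emitted) {
    if (values.length === 1) {
      // A key with a single value is passed through without calling reduce.
      results[key] = values[0];
      continue;
    }
    // Reduce in chunks, then feed the partial results back into reduce.
    const partials = [];
    for (let i = 0; i < values.length; i += chunkSize) {
      partials.push(reduceFn(key, values.slice(i, i + chunkSize)));
    }
    results[key] = partials.length === 1 ? partials[0] : reduceFn(key, partials);
  }
  return results;
}

const mapByCountry = function () { emit(this.Countryname, { count: 1 }); };

const reduceCounts = function (key, values) {
  let count = 0;
  for (const item of values) count += item.count;
  return { count: count }; // same shape as the emitted values
};

// Hypothetical sample documents.
const medals = [
  { Countryname: "Great Britain" },
  { Countryname: "Denmark" },
  { Countryname: "Great Britain" },
  { Countryname: "Great Britain" },
];

const medalCounts = runMapReduce(medals, mapByCountry, reduceCounts);
console.log(medalCounts);
```

If `reduceCounts` returned a bare number instead of `{ count: ... }`, the re-reduce step would read `item.count` from a number and the totals would break, which is the practical reason for the same-format rule.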


■ Map/Reduce utilizes JavaScript to do all of its work

○ JavaScript in MongoDB is currently single-threaded (performance bottleneck)

○ Using external JS libraries is cumbersome and doesn't play well with sharding

○ No matter what language you're actually using, you'll be writing/maintaining JavaScript

■ Troubleshooting the Map/Reduce functions is primitive. 10gen's advice: "write your own emit function" (!)

■ Output options are flexible, but have some caveats

○ Output to a result document must fit in a BSON doc (16MB limit)

○ For an output collection: if you want indices on the result set, you need to pre-create the collection then use the merge output option
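The "write your own emit function" advice can be sketched as follows: define a stand-in `emit` that records its arguments, then call the map function with a sample document bound to `this`. The names `emittedPairs`, `mapPupils`, and `sampleDoc` are hypothetical, and the sample document is made up for illustration.

```javascript
// Stand-in emit that records every call so the map function can be inspected.
const emittedPairs = [];
function emit(key, value) {
  emittedPairs.push({ key: key, value: value });
}

// Map function under test, in the style of the Pupils example.
function mapPupils() {
  for (var i = 0; i < this.Pupils.length; i++) {
    emit(this.Pupils[i].name, { count: 1 });
  }
}

// Hypothetical sample document.
const sampleDoc = { Pupils: [{ name: "Ann" }, { name: "Bob" }] };

// Bind the document to `this`, as the map phase does.
mapPupils.call(sampleDoc);
console.log(JSON.stringify(emittedPairs));
```

The same pattern extends to the reduce function: collect the recorded values for one key and pass them to reduce by hand.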


■ The Aggregation Framework is designed to alleviate some of the issues with Map/Reduce for common analytical queries

■ New in 2.2

■ Works by constructing a pipeline of operations on data. Similar to M/R, but implemented in native code (higher performance, not single-threaded)


[Diagram: Input Collection ⇒ Match ⇒ Project ⇒ Group]

■ Filtering/paging ops

○ $match - utilize Mongo selection syntax to choose input docs

○ $limit

○ $skip

■ Field manipulation ops

○ $project - select which fields are processed. Can add new fields

○ $unwind - flattens a doc with an array field into multiple events, one per array value

■ Output ops

○ $group

○ $sort

■ Most common pipelines will be of the form $match ⇒ $project ⇒ $group
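To make the stage ordering concrete, here is a plain-JavaScript analogue of a $match ⇒ $project ⇒ $group pipeline built from array methods. It is a sketch of what each stage does to the stream of documents, not the Aggregation Framework itself, and the sample documents are hypothetical.

```javascript
// Hypothetical sample documents; field names follow the slides.
const athletes = [
  { Name: "Fred", Countryname: "Great Britain", Sportname: "Rowing" },
  { Name: "Anna", Countryname: "Denmark", Sportname: "Sailing" },
  { Name: "Lars", Countryname: "Denmark", Sportname: "Rowing" },
  { Name: "Luis", Countryname: "Dominican Republic", Sportname: "Athletics" },
];

// Stage 1, like { $match : { "Countryname" : { $ne : "Great Britain" } } }:
// filter first so later stages see fewer documents.
const matched = athletes.filter((doc) => doc.Countryname !== "Great Britain");

// Stage 2, like { $project : { "Countryname" : 1, "Sportname" : 1 } }:
// keep only the fields the later stages need.
const projected = matched.map((doc) => ({
  Countryname: doc.Countryname,
  Sportname: doc.Sportname,
}));

// Stage 3, like { $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } }:
// collate by the identity field and count.
const grouped = new Map();
for (const doc of projected) {
  grouped.set(doc.Countryname, (grouped.get(doc.Countryname) || 0) + 1);
}
const pipelineResult = Array.from(grouped, ([id, count]) => ({ _id: id, count: count }));

console.log(pipelineResult);
```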


■ $match is very important to getting good performance

■ Needs to be the first op in the pipeline, otherwise indices can't be used

■ Uses normal MongoDB query syntax, with two exceptions

○ Can't use a $where clause (this requires JavaScript)

○ Can't use Geospatial queries (just because)

{ $match : { "Name" : "Fred" } }

{ $match : { "Countryname" : { $ne : "Great Britain" } } }

{ $match : { "Income" : { $exists : 1 } } }


■ $project is used to select/compute/augment the fields you want in the output documents

{ $project : { "Countryname" : 1, "Sportname" : 1 } }

■ Can reference input document fields in computations via "$"

{ $project : { "country_name" : "$Countryname" } } /* renames field */

■ Computation of field values is possible, but it's limited and can be quite painful

{ $project : {
    "_id" : 0, "height" : 1, "weight" : 1,
    "bmi" : { $divide : ["$weight", { $multiply : ["$height", "$height"] }] }
} } /* omit "_id" field, inflict pain and suffering on future maintainers... */
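For comparison, the nested $divide/$multiply expression computes the following for each document, written here as ordinary JavaScript. `projectBmi` is a hypothetical helper, and the sample values assume weight in kilograms and height in metres, which the slides do not state.

```javascript
// Plain-JavaScript analogue of the $project stage:
// drop _id, keep height and weight, add the computed bmi field.
function projectBmi(doc) {
  return {
    height: doc.height,
    weight: doc.weight,
    bmi: doc.weight / (doc.height * doc.height),
  };
}

// Hypothetical sample document (units assumed: metres and kilograms).
const projectedDoc = projectBmi({ _id: 1, height: 2, weight: 80 });
console.log(projectedDoc);
```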


■ $group, like the group command, collates and computes sets of values based on the identity field ("_id"), and whatever other fields you want

{ $group : { "_id" : "$Countryname" } } /* distinct list of countries */

■ Aggregation operators can be used to perform computation ($max, $min, $avg, $sum)

{ $group : { "_id" : "$Countryname", "count" : { $sum : 1 } } } /* histogram by country */

{ $group : { "_id" : "$Countryname", "weight" : { $avg : "$weight" } } }

{ $group : { "_id" : "$Countryname", "weight" : { $sum : "$weight" } } }

■ Set-based operations ($addToSet, $push)

{ $group : { "_id" : "$Countryname", "sport" : { $addToSet : "$sport" } } }
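As a rough illustration of what these accumulators do per group, the sketch below applies counterparts of $sum, $avg, and $addToSet in plain JavaScript. `groupByCountry` and the sample documents are hypothetical; this shows the semantics only, not how MongoDB implements them.

```javascript
// Group documents by Countryname and apply several accumulators per group.
function groupByCountry(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.Countryname)) {
      groups.set(doc.Countryname, { count: 0, totalWeight: 0, sports: new Set() });
    }
    const group = groups.get(doc.Countryname);
    group.count += 1;                // like { $sum : 1 }
    group.totalWeight += doc.weight; // like { $sum : "$weight" }
    group.sports.add(doc.sport);     // like { $addToSet : "$sport" }
  }
  return Array.from(groups, ([country, group]) => ({
    _id: country,
    count: group.count,
    weightSum: group.totalWeight,
    weightAvg: group.totalWeight / group.count, // like { $avg : "$weight" }
    sport: Array.from(group.sports),
  }));
}

// Hypothetical sample documents.
const groupedAthletes = groupByCountry([
  { Countryname: "Denmark", weight: 70, sport: "Rowing" },
  { Countryname: "Denmark", weight: 80, sport: "Rowing" },
  { Countryname: "Great Britain", weight: 60, sport: "Sailing" },
]);
console.log(JSON.stringify(groupedAthletes));
```

Note that the set accumulator keeps "Rowing" once for Denmark; a $push counterpart would append to an array and keep both entries.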


■ Aggregation framework has a limited set of operators

○ $project limited to $add/$subtract/$multiply/$divide, as well as some boolean, string, and date/time operations

○ $group limited to $min/$max/$avg/$sum, plus the set-based $addToSet/$push

■ Some operators, notably $group and $sort, are required to operate entirely in memory

○ This may prevent aggregation on large data sets

○ Can't work around using subsetting like you can with M/R, because output is strictly a document (no collection option yet)


■ Even with these tools, there are still limitations

○ MongoDB is not relational. This means a lot of work on your part if you have datasets representing different things that you'd like to correlate. Clicks vs views, for example

○ While the Aggregation Framework alleviates some of the performance issues of Map/Reduce, it does so by throwing away flexibility

○ The best approach for parallelization (sharding) is fraught with operational challenges (come see me for horror stories)


Overview of Precog for MongoDB

Derek Chen-Becker, Precog Lead Infrastructure Engineer, @dchenbecker

■ Download file: http://www.precog.com/mongodb

■ Setup:

$ unzip precog.zip

$ cd precog

$ emacs -nw config.cfg   # adjust ports, etc.

$ ./precog.sh

overview of precog for mongodb

■ Precog for MongoDB allows you to perform sophisticated analytics utilizing existing mongo instances

■ Self-contained JAR bundling:

○ The Precog Analytics service

○ Labcoat IDE for Quirrel

■ Does not include the full Precog stack

○ Minimal authentication handling (single api key in config)

○ No ingest service (just add data directly to mongo)


■ Some sample queries

-- histogram by country
data := //summer_games/athletes
solve 'country { country: 'country, count: count(data where data.Countryname = 'country) }


Analyzing Big Data with Quirrel

John A. De Goes, Precog CEO/Founder, @jdegoes

Quirrel is a statistically-oriented query language designed for the analysis of large-scale, potentially heterogeneous data sets.

overview

● Simple

● Set-oriented

● Statistically-oriented

● Purely declarative

● Implicitly parallel

quirrel

pageViews := //pageViews
avg := mean(pageViews.duration)
bound := 1.5 * stdDev(pageViews.duration)
pageViews.userId where pageViews.duration > avg + bound

sneak peek
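For readers who do not yet read Quirrel, the sneak-peek query can be paraphrased in plain JavaScript: compute the mean and standard deviation of the durations, then keep the user IDs whose duration exceeds the mean by more than 1.5 standard deviations. The sample records are hypothetical, and using the population standard deviation is an assumption, since the slides do not say which variant Quirrel's stdDev uses.

```javascript
// Hypothetical page-view records.
const pageViews = [
  { userId: "a", duration: 10 },
  { userId: "b", duration: 10 },
  { userId: "c", duration: 10 },
  { userId: "d", duration: 10 },
  { userId: "e", duration: 100 },
];

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation (an assumption; see the note above).
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

const durations = pageViews.map((view) => view.duration);
const avg = mean(durations);
const bound = 1.5 * stdDev(durations);

// Same shape as: pageViews.userId where pageViews.duration > avg + bound
const slowUsers = pageViews
  .filter((view) => view.duration > avg + bound)
  .map((view) => view.userId);

console.log(slowUsers);
```

The Quirrel version states the same intent in four lines and leaves the iteration and parallelism to the engine.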

1

true

[[1, 0, 0], [0, 1, 0], [0, 0, 1]]

"All work and no play makes jack a dull boy"

{"age": 23, "gender": "female", "interests": ["sports", "tennis"]}

quirrel speaks json

-- Ignore me.

(- Ignore me, too -)

comments

2 * 4

(1 + 2) * 3 / 9 > 23

3 > 2 & (1 != 2)

false & true | !false

basic expressions

x := 2

square := x * x

named expressions

//pageViews

load("/pageViews")

//campaigns/summer/2012

loading data

pageViews := load("/pageViews")

pageViews.userId

pageViews.keywords[2]

drilldown

count(//pageViews)

sum(//purchases.total)

stdDev(//purchases.total)

reductions

pageViews := //pageViews

pageViews.userId where pageViews.duration > 1000

filtering

clicks with {dow: dayOfWeek(clicks.time)}

augmentation

import std::stats::rank

rank(//pageViews.duration)

standard library

ctr(day) := count(clicks where clicks.day = day) / count(impressions where impressions.day = day)

ctrOnMonday := ctr(1)

ctrOnMonday

user-defined functions

solve 'day {day: 'day, ctr: count(clicks where clicks.day = 'day) / count(impressions where impressions.day = 'day)}

grouping - implicit constraints
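One way to read the implicit-constraint `solve` is: for each value of day that satisfies the constraints, evaluate the body once. The plain-JavaScript sketch below does this for a click-through rate, assuming that only days present in both data sets produce a result; that reading of the semantics and the sample records are assumptions, not taken from the slides.

```javascript
// Hypothetical sample records.
const clicks = [{ day: 1 }, { day: 1 }, { day: 2 }];
const impressions = [
  { day: 1 }, { day: 1 }, { day: 1 }, { day: 1 },
  { day: 2 }, { day: 2 },
];

// Count records per distinct value of a field.
function countBy(records, field) {
  const counts = new Map();
  for (const record of records) {
    counts.set(record[field], (counts.get(record[field]) || 0) + 1);
  }
  return counts;
}

const clicksPerDay = countBy(clicks, "day");
const impressionsPerDay = countBy(impressions, "day");

// One result object per day that appears in both sets (assumed semantics).
const ctrByDay = [];
for (const [day, clickCount] of clicksPerDay) {
  if (impressionsPerDay.has(day)) {
    ctrByDay.push({ day: day, ctr: clickCount / impressionsPerDay.get(day) });
  }
}

console.log(ctrByDay);
```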

solve 'day = purchases.day {day: 'day, cummTotal: sum(purchases.total where purchases.day < 'day)}

grouping - explicit constraints
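With an explicit constraint, the set of days comes from `purchases.day`, and the body can then compare other records against each day with an inequality. A plain-JavaScript paraphrase of the cumulative total follows. The sample records are hypothetical, and the sketch reports 0 for the earliest day; how Quirrel treats the sum of an empty set is not covered by the slides.

```javascript
// Hypothetical sample records.
const purchases = [
  { day: 1, total: 10 },
  { day: 2, total: 5 },
  { day: 2, total: 7 },
  { day: 3, total: 1 },
];

// 'day ranges over the distinct values of purchases.day.
const distinctDays = Array.from(new Set(purchases.map((p) => p.day))).sort((a, b) => a - b);

// For each day, sum the totals of all purchases made strictly before it.
const cumulativeTotals = distinctDays.map((day) => ({
  day: day,
  cummTotal: purchases
    .filter((p) => p.day < day)
    .reduce((sum, p) => sum + p.total, 0),
}));

console.log(cumulativeTotals);
```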

http://quirrel-lang.org

questions?

Now, it's your turn! Win some cool prizes!

Precog Challenge Problems

■ Using the conversions data, find the state with the highest average income.

■ Variable names: conversions.customers.state and conversions.customers.income

precog challenge #1

■ Use Labcoat to display a bar chart of the clicks per month.

■ Variable names: clicks.timestamp

precog challenge #2

■ What product has the worst overall sales to women? To men?

■ Variable names: billing.product.ID, billing.product.price, billing.customer.gender

precog challenge #3

conversions := //conversions

results := solve 'state
  {state: 'state,
   aveIncome: mean(conversions.customer.income where
                   conversions.customer.state = 'state)}

results where results.aveIncome = max(results.aveIncome)

precog challenge #1 possible solution

clicks := //clicks

clicks' := clicks with {month: std::time::monthOfYear(clicks.timeStamp)}

solve 'month

{month: 'month, clicks: count(clicks'.product.price where clicks'.month = 'month)}


billing := //billing

results := solve 'product, 'gender
  {product: 'product,
   gender: 'gender,
   sales: sum(billing.product.price where
              billing.product.ID = 'product &
              billing.customer.gender = 'gender)}

worstSalesToWomen := results where results.gender = "female" &
  results.sales = min(results.sales where results.gender = "female")

worstSalesToMen := results where results.gender = "male" &
  results.sales = min(results.sales where results.gender = "male")

worstSalesToWomen union worstSalesToMen


Thank you!

Follow us on Twitter: @precogio, @jdegoes, @dchenbecker

Download Precog for MongoDB for FREE: www.precog.com/mongodb

Try Precog for free and get a free account: www.precog.com

Subscribe to our monthly newsletter: www.precog.com/about/newsletter
