joins and other mongodb 3.2 aggregation enhancements

MongoDB 3.2 – $lookup and Other Aggregation Enhancements

Andrew Morgan

@clusterdb

clusterdb.com

[email protected]

November 2015



DISCLAIMER: MongoDB's product plans are for informational purposes only. MongoDB's plans may change and you should not rely on them for delivery of a specific feature at a specific time.

Agenda

Document vs. Relational Model

Analytics on MongoDB data

60,000 feet – what is the aggregation pipeline

Aggregation pipeline operators

$lookup (Left Outer Equi Joins) in MongoDB 3.2

Other aggregation enhancements

Worked examples

Document vs. Relational Model

[Figure: RDBMS tables alongside the equivalent MongoDB document]

    {
      _id: ObjectId("4c4ba5e5e8aabf3"),
      employee_name: {First: "Billy", Last: "Fish"},
      department: "Engineering",
      title: "Aquarium design",
      pay_band: "C",
      benefits: [
        { type: "Health", plan: "PPO Plus" },
        { type: "Dental", plan: "Standard" }
      ]
    }

Existing Alternatives to Joins

orders

    {
      "_id": 10000,
      "items": [
        { "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23 },
        { "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276 }
      ],
      …
    }

• Option 1: Include all data for an order in the same document
  – Fast reads
    • One find delivers all the required data
  – Captures full description at the time of the event
  – Consumes extra space
    • Details of each product stored in many order documents
  – Complex to maintain
    • A change to any product attribute must be propagated to all affected orders

Existing Alternatives to Joins

orders

    { "_id": 10000, "items": [ 12345, 54321 ], ... }

products

    { "_id": 12345, "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23 }
    { "_id": 54321, "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276 }

• Option 2: Order document references product documents
  – Slower reads
    • Multiple trips to the database
  – Space efficient
    • Product details stored once
  – Lose point-in-time snapshot of full record
  – Extra application logic
    • Must iterate over product IDs in the order document and find the product documents
    • RDBMS would automate through a JOIN
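The extra application logic that Option 2 calls for can be sketched in plain JavaScript. This is an illustration only: the Map is a hypothetical in-memory stand-in for the products collection, and resolveOrder is an invented helper name, not a driver or MongoDB API. Against a real database, each product would need its own query (or one combined query), which is where the extra trips come from.

```javascript
// Hypothetical in-memory stand-in for the products collection, keyed by _id.
const products = new Map([
  [12345, { _id: 12345, productName: "laptop", unitPrice: 1000, weight: 1.2, remainingStock: 23 }],
  [54321, { _id: 54321, productName: "mouse", unitPrice: 20, weight: 0.2, remainingStock: 276 }],
]);

// An order that only references its products by ID (Option 2).
const order = { _id: 10000, items: [12345, 54321] };

// resolveOrder is an illustrative helper (an assumption, not a MongoDB API).
// It returns a copy of the order with each product ID replaced by the
// product document, or null when the ID is unknown.
function resolveOrder(order, products) {
  return {
    ...order,
    items: order.items.map((id) => (products.has(id) ? products.get(id) : null)),
  };
}

const fullOrder = resolveOrder(order, products);
console.log(JSON.stringify(fullOrder, null, 2));
```

The original order is left untouched; the resolved copy has the same shape as the embedded document from Option 1.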

The Winner?

• In general, Option 1 wins
  – Performance and containment of everything in same place beats space efficiency of normalization
  – There are exceptions
    • e.g. Comments in a blog post -> unbounded size
• However, analytics benefit from combining data from multiple collections
  – Keep listening...

Aggregation Pipeline

[Figure: animated build of a pipeline in which a set of documents flows through the $match, $project, $lookup and $group stages in turn]
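The idea of documents flowing through successive stages can be sketched in plain JavaScript. This is a conceptual model only, not how the server implements the pipeline: match, project and runPipeline are hypothetical helper names, and the sample documents are made up for the illustration.

```javascript
// Each stage is modelled as a function from an array of documents to a new array.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (reshape) => (docs) => docs.map(reshape);

// Run the stages in order, feeding each stage's output into the next one.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Made-up sample documents.
const sales = [
  { town: "MAIDENHEAD", amount: 9000 },
  { town: "MAIDENHEAD", amount: 3000000 },
  { town: "MAIDENHEAD", amount: 5425000 },
];

const result = runPipeline(sales, [
  match((doc) => doc.amount >= 3000000),   // filter early to shrink the data set
  project((doc) => ({ price: doc.amount })) // reshape what is left
]);

console.log(result);
```

Because each stage only sees the output of the previous one, filtering as early as possible reduces the work for every later stage.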

Aggregation Pipeline Stages

• $match: Filter documents
• $geoNear: Geospatial query
• $project: Reshape documents
• $lookup: New – Left-outer equi joins
• $unwind: Expand documents
• $group: Summarize documents
• $sample: New – Randomly selects a subset of documents
• $sort: Order documents
• $skip: Jump over a number of documents
• $limit: Limit number of documents
• $redact: Restrict documents
• $out: Sends results to a new collection

$lookup

• Left-outer join
  – Includes all documents from the left collection
  – For each document in the left collection, find the matching documents from the right collection and embed them

[Figure: Left Collection and Right Collection]

$lookup

    db.leftCollection.aggregate([{
      $lookup: {
        from: "rightCollection",
        localField: "leftVal",
        foreignField: "rightVal",
        as: "embeddedData"
      }
    }])

[Figure: leftCollection and rightCollection]
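The left-outer equi join semantics can be approximated in plain JavaScript. This is a simplified sketch, not the server's implementation: lookup is a hypothetical helper, the sample documents are made up, and the sketch only compares top-level fields with strict equality (no dotted paths, arrays or missing-field handling).

```javascript
// Hypothetical helper mimicking the shape of the $lookup stage:
// every left document is kept, and its matches from the right
// collection are embedded as an array under the `as` field.
function lookup(left, right, { localField, foreignField, as }) {
  return left.map((leftDoc) => ({
    ...leftDoc,
    [as]: right.filter((rightDoc) => rightDoc[foreignField] === leftDoc[localField]),
  }));
}

// Made-up sample collections.
const leftCollection = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "b" },
  { _id: 3, leftVal: "z" }, // no match on the right
];
const rightCollection = [
  { _id: 10, rightVal: "a" },
  { _id: 11, rightVal: "a" },
  { _id: 12, rightVal: "b" },
];

const joined = lookup(leftCollection, rightCollection, {
  localField: "leftVal",
  foreignField: "rightVal",
  as: "embeddedData",
});

console.log(JSON.stringify(joined, null, 2));
```

The third left document survives the join with an empty embeddedData array, which is what makes this a left-outer join rather than an inner join.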

New Aggregation Operators

• Array operations
  – $slice, $arrayElemAt, $concatArrays, $isArray, $filter, $min, $max, $avg and $sum
• Standard deviations
  – $stdDevSamp (sample) and $stdDevPop (entire population)
• Square root
  – $sqrt
• Absolute value (make positive)
  – $abs
• Rounding numbers
  – $trunc, $ceil, $floor
• Logarithms
  – $log, $log10, $ln
• Raise to power
  – $pow
• Natural exponent
  – $exp
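The difference between the sample and population standard deviations comes down to the divisor (n - 1 versus n). The sketch below shows the standard formulas in plain JavaScript; stdDev is a hypothetical helper written for this illustration, not a MongoDB function, and the input values are made up.

```javascript
// Standard deviation of an array of numbers.
// sample = true  -> divide by n - 1 (the formula behind a sample estimate)
// sample = false -> divide by n     (the formula for a complete population)
function stdDev(values, sample) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const sumOfSquares = values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
  return Math.sqrt(sumOfSquares / (sample ? n - 1 : n));
}

// Made-up data: mean is 5 and the squared deviations add up to 32.
const values = [2, 4, 4, 4, 5, 5, 7, 9];

const population = stdDev(values, false); // sqrt(32 / 8)
const sampled = stdDev(values, true);     // sqrt(32 / 7)

console.log(population, sampled);
```

Use the population form when the documents cover every data point of interest, and the sample form when they are a subset used to estimate a larger population.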

Worked Example – Data Set

    db.postcodes.findOne()
    {
      "_id": ObjectId("5600521e50fa77da54dfc0d2"),
      "postcode": "SL6 0AA",
      "location": {
        "type": "Point",
        "coordinates": [ 51.525605, -0.700974 ]
      }
    }

    db.homeSales.findOne()
    {
      "_id": ObjectId("56005dd980c3678b19792b7f"),
      "amount": 9000,
      "date": ISODate("1996-09-19T00:00:00Z"),
      "address": {
        "nameOrNumber": 25,
        "street": "NORFOLK PARK COTTAGES",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7DR"
      }
    }

Reduce Data Set First

    db.homeSales.aggregate([
      {$match: { amount: {$gte: 3000000}} }
    ])

    …
    {
      "_id": ObjectId("56005dda80c3678b19799e52"),
      "amount": 3000000,
      "date": ISODate("2012-04-19T00:00:00Z"),
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      }
    },
    …

Join (left-outer-equi) Results With Second Collection

    db.homeSales.aggregate([
      {$match: { amount: {$gte: 3000000}} },
      {$lookup: {
        from: "postcodes",
        localField: "address.postcode",
        foreignField: "postcode",
        as: "postcode_docs"}
      }
    ])

    ...
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "postcode_docs": [
        {
          "_id": ObjectId("560053e280c3678b1978b293"),
          "postcode": "SL6 5ND",
          "location": {
            "type": "Point",
            "coordinates": [ 51.549516, -0.80702 ]
          }
        }
      ]
    },
    ...

Refactor Each Resulting Document

    ...},
      {$project: {
        _id: 0,
        saleDate: "$date",
        price: "$amount",
        address: 1,
        location: {$arrayElemAt: ["$postcode_docs.location", 0]}
      }}
    ])

    {
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "saleDate": ISODate("2012-04-19T00:00:00Z"),
      "price": 3000000,
      "location": {
        "type": "Point",
        "coordinates": [ 51.549516, -0.80702 ]
      }
    },
    ...
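What the $project stage does to one joined document can be mimicked in plain JavaScript. This is a sketch of the reshaping only, under the assumption that the path "$postcode_docs.location" yields the array of location values from the embedded documents, from which $arrayElemAt picks element 0. The date is kept as a plain string here rather than an ISODate.

```javascript
// One document as it leaves the $lookup stage (values taken from the slide).
const joinedSale = {
  amount: 3000000,
  date: "2012-04-19T00:00:00Z",
  address: {
    nameOrNumber: "TEMPLE FERRY PLACE",
    street: "MILL LANE",
    town: "MAIDENHEAD",
    county: "WINDSOR AND MAIDENHEAD",
    postcode: "SL6 5ND",
  },
  postcode_docs: [
    { postcode: "SL6 5ND", location: { type: "Point", coordinates: [51.549516, -0.80702] } },
  ],
};

// Equivalent of the path "$postcode_docs.location": one location per embedded document.
const locations = joinedSale.postcode_docs.map((doc) => doc.location);

// Equivalent of the $project stage: rename fields and keep only the first location.
const reshaped = {
  address: joinedSale.address,
  saleDate: joinedSale.date,
  price: joinedSale.amount,
  location: locations[0], // undefined when the join found no postcode document
};

console.log(JSON.stringify(reshaped, null, 2));
```

Taking element 0 flattens the single-element array produced by the join into an ordinary sub-document, which is easier for downstream consumers to read.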

Sort on Sale Price & Write to Collection

    ...},
      {$sort: {price: -1}},
      {$out: "hotSpots"}
    ])

    …
    {
      "address": {
        "nameOrNumber": "2 - 3",
        "street": "THE SWITCHBACK",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7RJ"
      },
      "saleDate": ISODate("1999-03-15T00:00:00Z"),
      "price": 5425000,
      "location": {
        "type": "Point",
        "coordinates": [ 51.536848, -0.735835 ]
      }
    },
    ...

Aggregated Statistics

    db.homeSales.aggregate([
      {$group: {
        _id: {$year: "$date"},
        higestPrice: {$max: "$amount"},
        lowestPrice: {$min: "$amount"},
        averagePrice: {$avg: "$amount"},
        amountStdDev: {$stdDevPop: "$amount"}
      }}
    ])

    ...
    {
      "_id": 1995,
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059.35206869633,
      "amountStdDev": 81540.50490801703
    },
    {
      "_id": 1996,
      "higestPrice": 975000,
      "lowestPrice": 9000,
      "averagePrice": 118862,
      "amountStdDev": 79871.07569783277
    },
    ...

Clean Up Output

    ...,
      {$project: {
        _id: 0,
        year: "$_id",
        higestPrice: 1,
        lowestPrice: 1,
        averagePrice: {$trunc: "$averagePrice"},
        priceStdDev: {$trunc: "$amountStdDev"}
      }}
    ])

    ...
    {
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059,
      "year": 1995,
      "priceStdDev": 81540
    },
    {
      "higestPrice": 2200000,
      "lowestPrice": 10500,
      "averagePrice": 307372,
      "year": 2004,
      "priceStdDev": 199643
    },
    ...

Postal Code & Location for Each Year's Highest Priced Sale

    db.homeSales.aggregate([
      {$sort: {amount: -1}},
      {$group: {
        _id: {$year: "$date"},
        priciestPostCode: {$first: "$address.postcode"}
      }},
      {$lookup: {
        from: "postcodes",
        localField: "priciestPostCode",
        foreignField: "postcode",
        as: "locationData"
      }},
      {$sort: {_id: -1}},
      {$project: {
        _id: 0,
        Year: "$_id",
        PostCode: "$priciestPostCode",
        Location: {$arrayElemAt: ["$locationData.location", 0]}
      }}
    ])

    ...
    {
      "Year": 2014,
      "PostCode": "SL6 1UP",
      "Location": {
        "type": "Point",
        "coordinates": [ 51.51407, -0.704414 ]
      }
    },
    ...
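The "sort descending, then take $first within each group" technique used in this pipeline can be sketched in plain JavaScript. This is an illustration of the logic, not the server's implementation. The documents are flattened for brevity (postcode is a top-level field here, not address.postcode), dates are plain strings, and the fourth sale is made up so that one year has two candidates.

```javascript
const sales = [
  { amount: 9000, date: "1996-09-19T00:00:00Z", postcode: "SL6 7DR" },
  { amount: 3000000, date: "2012-04-19T00:00:00Z", postcode: "SL6 5ND" },
  { amount: 5425000, date: "1999-03-15T00:00:00Z", postcode: "SL6 7RJ" },
  { amount: 3500000, date: "2012-07-01T00:00:00Z", postcode: "SL6 0AA" }, // made-up sale
];

// Equivalent of {$sort: {amount: -1}}: most expensive sale first.
const byAmountDesc = [...sales].sort((a, b) => b.amount - a.amount);

// Equivalent of {$group: {_id: {$year: ...}, priciestPostCode: {$first: ...}}}:
// the first sale seen for a year is that year's priciest, so later ones are skipped.
const priciestPostCode = new Map();
for (const sale of byAmountDesc) {
  const year = new Date(sale.date).getUTCFullYear();
  if (!priciestPostCode.has(year)) {
    priciestPostCode.set(year, sale.postcode);
  }
}

// Equivalent of {$sort: {_id: -1}} followed by the renaming part of $project.
const result = [...priciestPostCode.entries()]
  .sort((a, b) => b[0] - a[0])
  .map(([Year, PostCode]) => ({ Year, PostCode }));

console.log(result);
```

The technique relies on the sort happening before the grouping; without it, the first document seen for each year would be arbitrary.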

Aggregation Options

    db.cData.aggregate([
      <pipeline stages>
    ], {
      'allowDiskUse': true,
      'cursor': { 'batchSize': 5 }
    })

• explain
  – Information on execution plan
• allowDiskUse
  – Enable use of disk to store intermediate results
• cursor.batchSize
  – Specify the size of the initial result set

Aggregation With a Sharded Database

• Workload split between shards
  – Client works through mongos as with any query
  – Shards execute pipeline up to a point
  – A single shard merges cursors and continues processing
  – Use explain to analyze pipeline split
  – Early $match on shard key may exclude shards
  – Potential CPU and memory implications for primary shard host
  – $lookup & $out performed within primary shard for the database


Tableau + MongoDB Connector for BI

Restrictions

• $lookup only supports equality for the match
• $lookup can only be used in the aggregation pipeline (e.g. not for find)
• The pipeline is linear; no forks. Can remove data at each stage and can only add new raw data through $lookup
• Right collection for $lookup cannot be sharded
• Indexes are only used at the beginning of the pipeline (and on the right collections of subsequent $lookups), before any data transformations
• $out can only be used in the final stage of the pipeline
• $geoNear can only be the first stage in the pipeline
• The BI Connector for MongoDB is part of MongoDB Enterprise Advanced
  – Not in community

Next Steps

• Documentation
  – https://docs.mongodb.org/manual/release-notes/3.2/#aggregation-framework-enhancements
• Not yet ready for production but download and try!
  – https://www.mongodb.org/downloads#development
• Detailed blog
  – https://www.mongodb.com/blog/post/joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2-part-1-of-3-introduction
• Webinars
  – Tomorrow: What's New in MongoDB 3.2 https://www.mongodb.com/webinar/whats-new-in-mongodb-3-2
  – Replay: 3.2 $lookup & aggregation https://www.mongodb.com/presentations/webinar-joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2
• Feedback
  – MongoDB 3.2 Bug Hunt
    • https://www.mongodb.com/blog/post/announcing-the-mongodb-3-2-bug-hunt
  – https://jira.mongodb.org/






MongoDB Days 2015

• France: October 6, 2015, https://www.mongodb.com/events/mongodb-days-france
• Germany: October 20, 2015, https://www.mongodb.com/events/mongodb-days-germany
• UK: November 5, 2015, https://www.mongodb.com/events/mongodb-days-uk
• Silicon Valley: December 2, 2015, https://www.mongodb.com/events/mongodb-days-siliconvalley
