Joins and Other MongoDB 3.2 Aggregation Enhancements
MongoDB 3.2 – $lookup and Other Aggregation Enhancements
Andrew Morgan
@clusterdb
clusterdb.com
[email protected]
November 2015
DISCLAIMER: MongoDB's product plans are for informational purposes only. MongoDB's plans may change and you should not rely on them for delivery of a specific feature at a specific time.
Agenda
Document vs. Relational Model
Analytics on MongoDB data
60,000 feet – what is the aggregation pipeline
Aggregation pipeline operators
$lookup (Left Outer Equi Joins) in MongoDB 3.2
Other aggregation enhancements
Worked examples
Document vs. Relational Model
RDBMS MongoDB
{ _id: ObjectId("4c4ba5e5e8aabf3"), employee_name: {First: "Billy", Last: "Fish"}, department: "Engineering", title: "Aquarium design", pay_band: "C", benefits: [ { type: "Health", plan: "PPO Plus" }, { type: "Dental", plan: "Standard" } ] }
Existing Alternatives to Joins
{ "_id": 10000, "items": [ { "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23 }, { "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276 } ],…}
orders
• Option 1: Include all data for an order in the same document
  – Fast reads
    • One find delivers all the required data
  – Captures full description at the time of the event
  – Consumes extra space
    • Details of each product stored in many order documents
  – Complex to maintain
    • A change to any product attribute must be propagated to all affected orders
Existing Alternatives to Joins
{ "_id": 10000, "items": [ 12345, 54321 ], ...}
orders
{ "_id": 12345, "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23}
{ "_id": 54321, "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276}
products
• Option 2: Order document references product documents
  – Slower reads
    • Multiple trips to the database
  – Space efficient
    • Product details stored once
  – Lose point-in-time snapshot of full record
  – Extra application logic
    • Must iterate over product IDs in the order document and find the product documents
    • RDBMS would automate through a JOIN
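The extra application logic that Option 2 calls for can be sketched roughly as follows. This is a plain JavaScript illustration, not driver code: the Map is an in-memory stand-in for the products collection, and findProductById and expandOrder are hypothetical helper names. In a real application each lookup would be a separate trip to the database.

```javascript
// In-memory stand-in for the products collection (assumption: a real
// application would query the database instead).
const products = new Map([
  [12345, { _id: 12345, productName: "laptop", unitPrice: 1000, weight: 1.2, remainingStock: 23 }],
  [54321, { _id: 54321, productName: "mouse", unitPrice: 20, weight: 0.2, remainingStock: 276 }],
]);

const order = { _id: 10000, items: [12345, 54321] };

// Hypothetical helper: stands in for one database round trip per product.
function findProductById(id) {
  return products.get(id) ?? null;
}

// Hypothetical helper: replace each product ID in the order with the
// full product document, leaving the original order untouched.
function expandOrder(orderDoc) {
  return { ...orderDoc, items: orderDoc.items.map(findProductById) };
}

const expanded = expandOrder(order);
console.log(expanded.items.map((p) => p.productName));
```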
The Winner?
• In general, Option 1 wins
  – Performance and containment of everything in same place beats space efficiency of normalization
  – There are exceptions
    • e.g. Comments in a blog post -> unbounded size
• However, analytics benefit from combining data from multiple collections
  – Keep listening...
Aggregation Pipeline
[Figure: animated build of a pipeline in which documents pass through the $match, $project, $lookup and $group stages in turn.]
Aggregation Pipeline Stages
• $match: Filter documents
• $geoNear: Geospatial query
• $project: Reshape documents
• $lookup: New – Left-outer equi joins
• $unwind: Expand documents
• $group: Summarize documents
• $sample: New – Randomly selects a subset of documents
• $sort: Order documents
• $skip: Jump over a number of documents
• $limit: Limit number of documents
• $redact: Restrict documents
• $out: Sends results to a new collection
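The idea of documents flowing through a sequence of stages can be illustrated with a small plain JavaScript sketch. It does not use MongoDB at all: each stage is just a function from an array of documents to an array of documents, loosely mimicking $match, $project, $sort and $limit. The sample documents and field values are invented for the illustration.

```javascript
// Hypothetical input documents (invented for this sketch).
const sales = [
  { town: "A", amount: 50 },
  { town: "B", amount: 300 },
  { town: "C", amount: 120 },
  { town: "D", amount: 700 },
];

// Each stage consumes the previous stage's output, as in the pipeline.
const pipeline = [
  (docs) => docs.filter((d) => d.amount >= 100),                  // like $match
  (docs) => docs.map((d) => ({ town: d.town, price: d.amount })), // like $project
  (docs) => [...docs].sort((a, b) => b.price - a.price),          // like $sort
  (docs) => docs.slice(0, 2),                                     // like $limit
];

// Run the stages in order, feeding each result into the next stage.
const result = pipeline.reduce((docs, stage) => stage(docs), sales);
console.log(result);
```

Filtering early, as the first stage does here, keeps the later stages working on fewer documents.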
$lookup
• Left-outer join
  – Includes all documents from the left collection
  – For each document in the left collection, find the matching documents from the right collection and embed them
Left Collection Right Collection
$lookup
db.leftCollection.aggregate([
  { $lookup: {
      from: "rightCollection",
      localField: "leftVal",
      foreignField: "rightVal",
      as: "embeddedData"
  }}
])
leftCollection rightCollection
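To make the left-outer equi join behaviour concrete, here is a simplified plain JavaScript model of what the stage produces. It is only a sketch: getPath and lookup are hypothetical helpers, the sample documents are invented, and the model compares values with strict equality, ignoring the server's handling of array-valued and null fields.

```javascript
// Hypothetical helper: read a dotted path such as "address.postcode".
function getPath(doc, path) {
  return path.split(".").reduce((v, key) => (v == null ? undefined : v[key]), doc);
}

// Simplified model of $lookup: every left document is kept, and the
// matching right documents are embedded as an array under `as`.
// Note: fields missing on both sides compare equal in this sketch.
function lookup(left, right, { localField, foreignField, as }) {
  return left.map((l) => ({
    ...l,
    [as]: right.filter((r) => getPath(r, foreignField) === getPath(l, localField)),
  }));
}

// Invented sample data.
const leftCollection = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "z" },
];
const rightCollection = [
  { _id: 10, rightVal: "a" },
  { _id: 11, rightVal: "a" },
  { _id: 12, rightVal: "b" },
];

const joined = lookup(leftCollection, rightCollection, {
  localField: "leftVal",
  foreignField: "rightVal",
  as: "embeddedData",
});
console.log(JSON.stringify(joined, null, 2));
```

The left document with no match is still present in the output, with an empty embeddedData array; that is what makes the join left-outer.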
New Aggregation Operators
• Array operations
  – $slice, $arrayElemAt, $concatArrays, $isArray, $filter, $min, $max, $avg and $sum
• Standard Deviations
  – $stdDevSamp (sample) and $stdDevPop (complete)
• Square Root
  – $sqrt
• Absolute (make +ve) value
  – $abs
• Rounding numbers
  – $trunc, $ceil, $floor
• Logarithms
  – $log, $log10, $ln
• Raise to power
  – $pow
• Natural Exponent
  – $exp
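The difference between the two standard deviation operators is the divisor: the population form divides by n, the sample form by n - 1. The following plain JavaScript sketch shows the arithmetic only; stdDev is a hypothetical helper and the input values are invented.

```javascript
// Hypothetical helper: standard deviation of an array of numbers.
// sample = false divides by n (as $stdDevPop treats the data as complete);
// sample = true divides by n - 1 (as $stdDevSamp treats it as a sample).
function stdDev(values, sample) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const sumSquares = values.reduce((a, v) => a + (v - mean) ** 2, 0);
  return Math.sqrt(sumSquares / (sample ? n - 1 : n));
}

// Invented sample values.
const amounts = [2, 4, 4, 4, 5, 5, 7, 9];
const popStdDev = stdDev(amounts, false);
const sampStdDev = stdDev(amounts, true);
console.log(popStdDev, sampStdDev);
```

For the same input the sample figure is always the larger of the two, and the gap shrinks as the number of values grows.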
Worked Example – Data Set
db.postcodes.findOne()
{ "_id": ObjectId("5600521e50fa77da54dfc0d2"), "postcode": "SL6 0AA", "location": { "type": "Point", "coordinates": [ 51.525605, -0.700974 ] }}
db.homeSales.findOne()
{ "_id": ObjectId("56005dd980c3678b19792b7f"), "amount": 9000, "date": ISODate("1996-09-19T00:00:00Z"), "address": { "nameOrNumber": 25, "street": "NORFOLK PARK COTTAGES", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 7DR" }}
Reduce Data Set First
db.homeSales.aggregate([ {$match: { amount: {$gte:3000000}} }])
… { "_id": ObjectId("56005dda80c3678b19799e52"), "amount": 3000000, "date": ISODate("2012-04-19T00:00:00Z"), "address": { "nameOrNumber": "TEMPLE FERRY PLACE", "street": "MILL LANE", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 5ND" } },…
Join (left-outer-equi) Results With Second Collection
db.homeSales.aggregate([
  {$match: { amount: {$gte:3000000}} },
  {$lookup: {
    from: "postcodes",
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"} }
])
... "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 5ND" }, "postcode_docs": [ { "_id": ObjectId("560053e280c3678b1978b293"), "postcode": "SL6 5ND", "location": { "type": "Point", "coordinates": [ 51.549516, -0.80702 ] }}]}, ...
Refactor Each Resulting Document
...},
  {$project: {
    _id: 0,
    saleDate: "$date",
    price: "$amount",
    address: 1,
    location: {$arrayElemAt: ["$postcode_docs.location", 0]}}}
])
{ "address": { "nameOrNumber": "TEMPLE FERRY PLACE", "street": "MILL LANE", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 5ND" }, "saleDate": ISODate("2012-04-19T00:00:00Z"), "price": 3000000, "location": { "type": "Point", "coordinates": [ 51.549516, -0.80702 ]}},...
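The location expression in that $project stage does two things: the path "$postcode_docs.location" collects the location of every embedded postcode document into an array, and $arrayElemAt then picks element 0. A rough plain JavaScript equivalent is sketched below. firstLocation is a hypothetical helper, and the unmatched document is invented to show the case where the join found nothing.

```javascript
// Hypothetical helper mirroring
// {$arrayElemAt: ["$postcode_docs.location", 0]} in simplified form.
function firstLocation(doc) {
  // Collect the location of each embedded postcode document.
  const locations = (doc.postcode_docs ?? []).map((d) => d.location);
  // Take the first one; an empty array yields undefined in this sketch.
  return locations[0];
}

// Shaped like the joined document shown in the slides.
const matched = {
  amount: 3000000,
  postcode_docs: [
    { postcode: "SL6 5ND", location: { type: "Point", coordinates: [51.549516, -0.80702] } },
  ],
};

// Invented document whose postcode had no match in the right collection.
const unmatched = { amount: 3500000, postcode_docs: [] };

console.log(firstLocation(matched));
console.log(firstLocation(unmatched));
```

Because the join is left-outer, documents with an empty postcode_docs array do reach this stage, so the consumer of the output should be prepared for sales without a location.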
Sort on Sale Price & Write to Collection
...},
  {$sort: {price: -1}},
  {$out: "hotSpots"}
])
…{"address": { "nameOrNumber": "2 - 3", "street": "THE SWITCHBACK", "town": "MAIDENHEAD", "county": "WINDSOR AND MAIDENHEAD", "postcode": "SL6 7RJ" }, "saleDate": ISODate("1999-03-15T00:00:00Z"), "price": 5425000, "location": { "type": "Point", "coordinates": [ 51.536848, -0.735835 ]}},...
Aggregated Statistics
db.homeSales.aggregate([
  {$group: {
    _id: {$year: "$date"},
    higestPrice: {$max: "$amount"},
    lowestPrice: {$min: "$amount"},
    averagePrice: {$avg: "$amount"},
    amountStdDev: {$stdDevPop: "$amount"} }}
])
... { "_id": 1995, "higestPrice": 1000000, "lowestPrice": 12000, "averagePrice": 114059.35206869633, "amountStdDev": 81540.50490801703 }, { "_id": 1996, "higestPrice": 975000, "lowestPrice": 9000, "averagePrice": 118862, "amountStdDev": 79871.07569783277 }, ...
Clean Up Output
...,
  {$project: {
    _id: 0,
    year: "$_id",
    higestPrice: 1,
    lowestPrice: 1,
    averagePrice: {$trunc: "$averagePrice"},
    priceStdDev: {$trunc: "$amountStdDev"} } }
])
... { "higestPrice": 1000000, "lowestPrice": 12000, "averagePrice": 114059, "year": 1995, "priceStdDev": 81540 }, { "higestPrice": 2200000, "lowestPrice": 10500, "averagePrice": 307372, "year": 2004, "priceStdDev": 199643 },...
Postal Code & Location for Each Year’s Highest Priced Sale
db.homeSales.aggregate([
  {$sort: {amount: -1}},
  {$group: {
    _id: {$year: "$date"},
    priciestPostCode: {$first: "$address.postcode"} } },
  {$lookup: {
    from: "postcodes",
    localField: "priciestPostCode",
    foreignField: "postcode",
    as: "locationData" } },
  {$sort: {_id: -1}},
  {$project: {
    _id: 0,
    Year: "$_id",
    PostCode: "$priciestPostCode",
    Location: {$arrayElemAt: ["$locationData.location", 0]} } }
])
... { "Year": 2014, "PostCode": "SL6 1UP", "Location": { "type": "Point", "coordinates": [ 51.51407, -0.704414 ] } },...
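The pipeline relies on a sort-then-group pattern: sorting by amount in descending order first means that $first, applied within each year's group, sees the most expensive sale. The plain JavaScript sketch below imitates that pattern on a handful of invented sales; the postcodes and amounts are made up and the variable names are hypothetical.

```javascript
// Invented sample sales (not from the real data set).
const homeSales = [
  { amount: 250000, date: new Date("2013-03-01T00:00:00Z"), address: { postcode: "XX1 1AA" } },
  { amount: 900000, date: new Date("2013-07-15T00:00:00Z"), address: { postcode: "XX1 2BB" } },
  { amount: 400000, date: new Date("2014-02-10T00:00:00Z"), address: { postcode: "XX2 3CC" } },
  { amount: 350000, date: new Date("2014-11-30T00:00:00Z"), address: { postcode: "XX2 4DD" } },
];

// Like {$sort: {amount: -1}}: most expensive sale first.
const byAmountDesc = [...homeSales].sort((a, b) => b.amount - a.amount);

// Like $group on the year with $first: keep only the first postcode
// encountered for each year, which is the priciest after the sort.
const priciestByYear = new Map();
for (const sale of byAmountDesc) {
  const year = sale.date.getUTCFullYear();
  if (!priciestByYear.has(year)) priciestByYear.set(year, sale.address.postcode);
}

// Like the final $sort on _id descending plus the $project renames.
const report = [...priciestByYear]
  .map(([year, postcode]) => ({ Year: year, PostCode: postcode }))
  .sort((a, b) => b.Year - a.Year);
console.log(report);
```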
Aggregation Options
db.cData.aggregate([
  <pipeline stages>
], {
  'allowDiskUse': true,
  'cursor': {
    'batchSize': 5
  }
})
• explain
  – Information on execution plan
• allowDiskUse
  – Enable use of disk to store intermediate results
• cursor.batchSize
  – Specify the size of the initial result set
Aggregation With a Sharded Database
• Workload split between shards
  – Client works through mongos as with any query
  – Shards execute pipeline up to a point
  – A single shard merges cursors and continues processing
  – Use explain to analyze pipeline split
  – Early $match on shard key may exclude shards
  – Potential CPU and memory implications for primary shard host
  – $lookup & $out performed within Primary shard for the database
?
Tableau + MongoDB Connector for BI
Restrictions
• $lookup only supports equality for the match
• $lookup can only be used in the aggregation pipeline (e.g. not for find)
• The pipeline is linear; no forks. Can remove data at each stage and can only add new raw data through $lookup
• Right collection for $lookup cannot be sharded
• Indexes are only used at the beginning of the pipeline (and right tables in subsequent $lookups), before any data transformations
• $out can only be used in the final stage of the pipeline
• $geoNear can only be the first stage in the pipeline
• The BI Connector for MongoDB is part of MongoDB Enterprise Advanced
  – Not in community
Next Steps
• Documentation
  – https://docs.mongodb.org/manual/release-notes/3.2/#aggregation-framework-enhancements
• Not yet ready for production but download and try!
  – https://www.mongodb.org/downloads#development
• Detailed blog
  – https://www.mongodb.com/blog/post/joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2-part-1-of-3-introduction
• Webinars
  – Tomorrow: What's New in MongoDB 3.2 https://www.mongodb.com/webinar/whats-new-in-mongodb-3-2
  – Replay: 3.2 $lookup & aggregation https://www.mongodb.com/presentations/webinar-joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2
• Feedback
  – MongoDB 3.2 Bug Hunt
    • https://www.mongodb.com/blog/post/announcing-the-mongodb-3-2-bug-hunt
  – https://jira.mongodb.org/
MongoDB Days 2015
• France: October 6, 2015
• Germany: October 20, 2015
• UK: November 5, 2015
• Silicon Valley: December 2, 2015