Data Modeling Deep Dive


TRANSCRIPT

Page 1: Data Modeling Deep Dive
Page 2: Data Modeling Deep Dive

Data Modeling:

Four use cases

Toji George, Solutions Architect, MongoDB Inc.

Page 3: Data Modeling Deep Dive

Agenda

• 4 Real World Schemas

– Inbox

– History

– Indexed Attributes

– Multiple Identities

• Conclusions

Page 4: Data Modeling Deep Dive

In MongoDB

Application Development requires Good Schema Design

Success comes from Proper Data Structure

“Schema-less”?

Page 5: Data Modeling Deep Dive

#1 – Message Inbox

Page 6: Data Modeling Deep Dive

Let's get social

Page 7: Data Modeling Deep Dive

Sending Messages


Page 8: Data Modeling Deep Dive

Design Goals

• Efficiently send new messages to recipients

• Efficiently read inbox

Page 9: Data Modeling Deep Dive

Reading My Inbox


Page 10: Data Modeling Deep Dive

Three (of many) Approaches

• Fan out on Read

• Fan out on Write

• Fan out on Write with Bucketing

Page 11: Data Modeling Deep Dive

Fan out on read

// Shard on "from"db.shardCollection( "mongodbdays.inbox", { from: 1 } )

// Make sure we have an index to handle inbox readsdb.inbox.ensureIndex( { to: 1, sent: 1 } )

msg = {from: "Joe",to: [ "Bob", "Jane" ],sent: new Date(), message: "Hi!",

}

// Send a messagedb.inbox.save( msg )

// Read my inboxdb.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )

Page 12: Data Modeling Deep Dive

Fan out on read – I/O

[Diagram: Send Message – one document written to a single shard (Shard 1, Shard 2, Shard 3 shown)]

Page 13: Data Modeling Deep Dive

Fan out on read – I/O

[Diagram: Send Message writes to one shard; Read Inbox is scatter-gathered across Shard 1, Shard 2 and Shard 3]

Page 14: Data Modeling Deep Dive

Considerations

• Write: One document per message sent

• Read: Find all messages with my own name in the recipient field

• Read: Requires scatter-gather on the sharded cluster (see the sketch below)

• A lot of random I/O on a shard to find everything
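A quick way to see that scatter-gather from the shell (a sketch, assuming the cluster is accessed through a mongos; the explain output itself is not reproduced here):

// Run the inbox read through explain() and check how many shards
// the query was dispatched to.
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } ).explain()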

Page 15: Data Modeling Deep Dive

Fan out on write

// Shard on "recipient" and "sent"
db.shardCollection( "mongodbdays.inbox", { recipient: 1, sent: 1 } )

msg = {
  from: "Joe",
  to: [ "Bob", "Jane" ],
  sent: new Date(),
  message: "Hi!"
}

// Send a message: one copy per recipient
for ( recipient in msg.to ) {
  msg.recipient = msg.to[recipient]
  delete msg._id   // the shell adds _id on save; clear it so each copy inserts as a new document
  db.inbox.save( msg )
}

// Read my inbox
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )

Page 16: Data Modeling Deep Dive

Fan out on write – I/O

[Diagram: Send Message – one document written per recipient, spread across Shard 1, Shard 2 and Shard 3]

Page 17: Data Modeling Deep Dive

Fan out on write – I/O

[Diagram: Send Message writes across the shards; Read Inbox is routed to a single shard (Shard 1, Shard 2, Shard 3 shown)]

Page 18: Data Modeling Deep Dive

Considerations

• Write: One document per recipient

• Read: Find all of the messages with me as the recipient

• Can shard on recipient, so inbox reads hit one shard

• But still lots of random I/O on the shard

Page 19: Data Modeling Deep Dive

Fan out on write with buckets

// Shard on "owner / sequence"

db.shardCollection( "mongodbdays.inbox",

{ owner: 1, sequence: 1 } )

db.shardCollection( "mongodbdays.users", { user_name: 1 } )

msg = {

from: "Joe",

to: [ "Bob", "Jane" ],

sent: new Date(),

message: "Hi!",

}

Page 20: Data Modeling Deep Dive

Fan out on write with buckets

// Send a message
for ( recipient in msg.to ) {
  count = db.users.findAndModify({
    query: { user_name: msg.to[recipient] },
    update: { "$inc": { "msg_count": 1 } },
    upsert: true,
    new: true
  }).msg_count;

  sequence = Math.floor( count / 50 );

  db.inbox.update(
    { owner: msg.to[recipient], sequence: sequence },
    { $push: { "messages": msg } },
    { upsert: true }
  );
}

// Read my inbox
db.inbox.find( { owner: "Joe" } )
        .sort( { sequence: -1 } )
        .limit( 2 )

Page 21: Data Modeling Deep Dive

Fan out on write with buckets

• Each “inbox” document is an array of messages

• Append a message onto the “inbox” of each recipient

• Bucket inboxes so there aren't too many messages per document

• Can shard on recipient, so inbox reads hit one shard

• 1 or 2 documents to read the whole inbox

Page 22: Data Modeling Deep Dive

Fan out on write with buckets – I/O

[Diagram: Send Message – one bucket update per recipient across Shard 1, Shard 2 and Shard 3]

Page 23: Data Modeling Deep Dive

Fan out on write with buckets – I/O

[Diagram: Send Message updates buckets across the shards; Read Inbox is routed to a single shard and touches only 1–2 documents]

Page 24: Data Modeling Deep Dive

#2 – History

Page 25: Data Modeling Deep Dive
Page 26: Data Modeling Deep Dive

Design Goals

• Need to retain a limited amount of history, e.g.

– Hours, Days, Weeks

– May be a legislative requirement (e.g. HIPAA, SOX, DPA)

• Need to query efficiently by

– match

– ranges

Page 27: Data Modeling Deep Dive

3 (of many) approaches

• Bucket by Number of messages

• Fixed size array

• Bucket by date + TTL collections

Page 28: Data Modeling Deep Dive

Bucket by number of messages

db.inbox.find()
{ owner: "Joe",
  sequence: 25,
  messages: [
    { from: "Joe",
      to: [ "Bob", "Jane" ],
      sent: ISODate("2013-03-01T09:59:42.689Z"),
      message: "Hi!"
    },
    …
  ] }

// Query with a date range
db.inbox.find( { owner: "friend1",
  messages: { $elemMatch: { sent: { $gte: ISODate("…") } } } } )

// Remove elements based on a date
db.inbox.update( { owner: "friend1" },
  { $pull: { messages: { sent: { $gte: ISODate("…") } } } } )

Page 29: Data Modeling Deep Dive

Considerations

• Shrinking documents: space can be reclaimed with

– db.runCommand( { compact: '<collection>' } )

• Remove the document once the last element in the array has been removed (see the sketch below), i.e. when it looks like:

– { "_id" : …, "messages" : [ ], "owner" : "friend1", "sequence" : 0 }

Page 30: Data Modeling Deep Dive

Fixed size array

msg = {from: "Your Boss",to: [ "Bob" ],sent: new Date(), message: "CALL ME NOW!"

}

// 2.4 Introduces $each, $sort and $slice for $pushdb.messages.update(

{ _id: 1 }, { $push: { messages: { $each: [ msg ],

$sort: { sent: 1 }, $slice: -50 }

}}

)

Page 31: Data Modeling Deep Dive

Considerations

• Need to compute the size of the array based on the retention period, for example:
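A back-of-the-envelope sketch; the retention window and message rate below are assumptions for illustration, not figures from the talk:

// Hypothetical sizing for the $slice bound
var retentionDays = 7;                            // keep one week of history
var msgsPerDay    = 200;                          // estimated per-user volume
var sliceSize     = retentionDays * msgsPerDay;   // = 1400 newest messages kept

db.messages.update(
  { _id: 1 },
  { $push: { messages: { $each: [ msg ], $sort: { sent: 1 }, $slice: -sliceSize } } }
)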

Page 32: Data Modeling Deep Dive

TTL Collections

// messages: one doc per user per day
db.inbox.findOne()
{
  _id: 1,
  to: "Joe",
  sequence: ISODate("2013-02-04T00:00:00.392Z"),
  messages: [ ]
}

// Auto-expire data after 31536000 seconds = 1 year
db.inbox.ensureIndex( { sequence: 1 },
  { expireAfterSeconds: 31536000 } )
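The slide shows only the document shape and the TTL index; a sketch of what the corresponding write might look like (my assumption of how the per-day bucket is filled):

// Append a message to the recipient's bucket for today; the upsert
// creates the bucket the first time a message arrives that day.
var day = new Date();
day.setHours( 0, 0, 0, 0 );   // truncate to the start of the day

db.inbox.update(
  { to: "Joe", sequence: day },
  { $push: { messages: msg } },
  { upsert: true }
)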

Page 33: Data Modeling Deep Dive

#3 – Indexed Attributes

Page 34: Data Modeling Deep Dive

Design Goal

• Application needs to store a variable number of attributes, e.g.

– User-defined forms

– Metadata tags

• Queries needed

– Equality

– Range-based

• Need to be efficient, regardless of the number of attributes

Page 35: Data Modeling Deep Dive

2 (of many) Approaches

• Attributes as Embedded Document

• Attributes as Objects in an Array

Page 36: Data Modeling Deep Dive

Attributes as a sub-document

db.files.insert( { _id: "local.0",

attr: { type: "text", size: 64,

created: ISODate("..." } } )

db.files.insert( { _id: "local.1",

attr: { type: "text", size: 128} } )

db.files.insert( { _id: "mongod",

attr: { type: "binary", size: 256,

created: ISODate("...") } } )

// Need to create an index for each item in the sub-document

db.files.ensureIndex( { "attr.type": 1 } )

db.files.find( { "attr.type": "text"} )

// Can perform range queries

db.files.ensureIndex( { "attr.size": 1 } )

db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )

Page 37: Data Modeling Deep Dive

Considerations

• Each attribute needs an Index

• Each time you extend, you add an index

• Lots and lots of indexes

Page 38: Data Modeling Deep Dive

Attributes as objects in array

db.files.insert( {_id: "local.0",

attr: [ { type: "text" },

{ size: 64 },

{ created: ISODate("...") } ] } )

db.files.insert( { _id: "local.1",

attr: [ { type: "text" },

{ size: 128 } ] } )

db.files.insert( { _id: "mongod",

attr: [ { type: "binary" },

{ size: 256 },

{ created: ISODate("...") } ] } )

db.files.ensureIndex( { attr: 1 } )
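The slide stops at the index; a sketch of how queries might look against this layout (my illustration, not shown in the deck). Equality is an exact element match, and range queries lean on BSON ordering of the embedded documents:

// Equality: match an element of the attr array exactly
db.files.find( { attr: { type: "text" } } )

// Range on size: compares whole { size: … } elements using BSON ordering,
// so only elements whose single field is "size" fall within the bounds
db.files.find( { attr: { $gt: { size: 64 }, $lte: { size: 16384 } } } )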

Page 39: Data Modeling Deep Dive

Considerations

• Only one index needed on attr

• Can support range queries, etc.

• Index can be used only once per query

Page 40: Data Modeling Deep Dive

#4 – Multiple Identities

Page 41: Data Modeling Deep Dive

Design Goal

• Ability to look up by a number of different identities, e.g.

– Username

– Email address

– FB handle

– LinkedIn URL

Page 42: Data Modeling Deep Dive

2 (of many) approaches

• Identifiers in a single document

• Separate Identifiers from Content

Page 43: Data Modeling Deep Dive

Single document by user

db.users.findOne()
{ _id: "joe",
  email: "[email protected]",
  fb: "joe.smith",    // facebook
  li: "joe.e.smith",  // linkedin
  other: { … }
}

// Shard collection by _id
db.shardCollection( "mongodbdays.users", { _id: 1 } )

// Create indexes on each key
db.users.ensureIndex( { email: 1 } )
db.users.ensureIndex( { fb: 1 } )
db.users.ensureIndex( { li: 1 } )

Page 44: Data Modeling Deep Dive

Read by _id (shard key)

find( { _id: "joe" } )

[Diagram: the query is routed to the single shard that owns "joe" (Shard 1, Shard 2, Shard 3 shown)]

Page 45: Data Modeling Deep Dive

Read by email (non-shard key)

find( { email: "[email protected]" } )

[Diagram: the query is scatter-gathered to all shards (Shard 1, Shard 2, Shard 3 shown)]

Page 46: Data Modeling Deep Dive

Considerations

• Lookup by shard key is routed to 1 shard

• Lookup by other identifiers is scatter-gathered across all shards

• Secondary keys cannot have a unique index (illustrated below)
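To illustrate the last point (my own sketch, not from the slides): on a collection sharded on _id, a unique index that is not prefixed by the shard key is rejected, so uniqueness of the secondary identifiers cannot be enforced in this design:

// Fails on the sharded users collection: unique indexes must be
// prefixed by the shard key ( _id here ), and { email: 1 } is not.
db.users.ensureIndex( { email: 1 }, { unique: true } )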

Page 47: Data Modeling Deep Dive

Document per identity

// Create a unique index on identifier
db.identities.ensureIndex( { identifier: 1 }, { unique: true } )

// Create an identity document for each of the user's identifiers
db.identities.save(
  { identifier: { hndl: "joe" }, user: "1200-42" } )

db.identities.save(
  { identifier: { email: "[email protected]" }, user: "1200-42" } )

db.identities.save(
  { identifier: { li: "joe.e.smith" }, user: "1200-42" } )

// Shard collection by identifier
db.shardCollection( "mydb.identities", { identifier: 1 } )

// Create a unique index on _id
db.users.ensureIndex( { _id: 1 }, { unique: true } )

// Shard collection by _id
db.shardCollection( "mydb.users", { _id: 1 } )

Page 48: Data Modeling Deep Dive

Read requires 2 reads

[Diagram: each of the two queries below is routed to a single shard (Shard 1, Shard 2, Shard 3 shown)]

db.identities.find({"identifier" : { "hndl" : "joe" }})

db.users.find( { _id: "1200-42"} )
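A small helper (hypothetical, not part of the deck) that wraps the two routed reads shown above:

// Resolve any identity (handle, email, LinkedIn id, …) to the user document.
// Two queries, but each one is routed to a single shard.
function findUserByIdentity( identifier ) {
  var identity = db.identities.findOne( { identifier: identifier } );
  if ( identity == null ) return null;
  return db.users.findOne( { _id: identity.user } );
}

// Example
findUserByIdentity( { hndl: "joe" } )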

Page 49: Data Modeling Deep Dive

Considerations

• Lookup to Identities is a routed query

• Lookup to Users is a routed query

• Unique indexes available

• Must do two queries per lookup

Page 50: Data Modeling Deep Dive

Conclusion

Page 51: Data Modeling Deep Dive

Summary

• Multiple ways to model a domain problem

• Understand the key use cases of your app

• Balance ease of query against ease of write

• Reduce random I/O where possible for better performance

Page 52: Data Modeling Deep Dive