Social Analytics on MongoDB at MongoNYC

Post on 28-Nov-2014


Social Analytics with MongoDB

@BuddyMedia

Disclaimer

+= maybe not the best deck in the world

What is MongoDB?

• Document Store.
• Schemaless.
• High performance.

Why MongoDB?

• Months of testing
  – Data Types
  – Horizontal Scaling
  – Replication
  – Querying
  – Atomicity
  – Concurrency

Everything in that last slide was a LIE.

Same reason most of you do.

• It’s new and cool and we wanted to check it out.

• We become cool by association.
• But mostly because we like learning new things.

That last slide was kind of a lie too.

• We started with Cassandra.
• Cassandra was written by Facebook and Facebook is really cool, we wanted to be as cool as them.

Why Not Cassandra?

• Thrift.
  – “Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.”

• Eff that. We’re a startup.

So MongoDB it Was.

Also, MongoDB Happened to be in NYC. We are in NYC. NYC is Cool.

Proof that NYC is cool.

What You Should Know

• MongoDB is not relational.
• It’s also not schemaless even though they love to say that (applications always have schemas/data models).
• Right tool for right job.
  – Logging
  – Queues
  – Aggregate Analytics
• Don’t get confused with ORM.
• Return what you need.
• Don’t worry about document size limits.

Aggregate Analytics

• Lots of “Stuff” happens at Buddy Media.
• Need to keep track of it all.
• Need it to be real time.
• Need to be able to group it by various levels and resolutions.
• Need to be able to create new metrics on the fly.
• Write heavy, read light.

What does it look like?

Event → Queue → Processor → Metric

Architecture

The Event Listener

• Node.js is the perfect event listener.
  – Evented IO like Twisted or Event Machine.
  – 2 days of development (maybe ~100 lines of JS).
  – 0 lost events.
  – 0 downtime.
  – Just don’t upgrade.

Raw Event

A Pageview

{
  "_id" : ObjectId("4d8d0df101cddf2e6e0027af"),
  "created_date" : "2010-07-26 20:15:01",
  "data" : {
    "client_id" : "1034",
    "page_id" : "175"
  },
  "status" : {
    "state" : 0,
    "updated" : "2011-04-12 10:15:15"
  },
  "type" : "pageview"
}

Processing

• 3 resolutions
  – Minute
  – Hour
  – Day
• 1 event = 3 metric updates * number of groupings.

"pageview": {
  "metrics": [
    { "name": "client.pageviews", "key": "client_id" },
    { "name": "page.pageviews", "key": "page_id" }
  ]
}
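The fan-out described above (one event, three resolutions, one update per configured grouping) can be sketched in plain Python. This is an illustrative sketch, not the actual Buddy Media processor: `METRIC_CONFIG`, `bucket_start`, and `updates_for_event` are hypothetical names, and the way each resolution truncates the timestamp is an assumption.

```python
from datetime import datetime

# Hypothetical config mirroring the slide: event type -> metrics to update.
METRIC_CONFIG = {
    "pageview": {
        "metrics": [
            {"name": "client.pageviews", "key": "client_id"},
            {"name": "page.pageviews", "key": "page_id"},
        ]
    }
}

RESOLUTIONS = ("minute", "hour", "day")
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def bucket_start(created_date, period):
    """Truncate an event timestamp to the start of its bucket (assumed rule)."""
    ts = datetime.strptime(created_date, DATE_FORMAT)
    if period == "minute":
        ts = ts.replace(second=0)
    elif period == "hour":
        ts = ts.replace(minute=0, second=0)
    elif period == "day":
        ts = ts.replace(hour=0, minute=0, second=0)
    else:
        raise ValueError("unknown period: %s" % period)
    return ts.strftime(DATE_FORMAT)


def updates_for_event(event, config=METRIC_CONFIG):
    """Return one (filter, update) pair per configured metric per resolution."""
    pairs = []
    for metric in config[event["type"]]["metrics"]:
        group_value = event["data"][metric["key"]]
        for period in RESOLUTIONS:
            spec = {
                "name": metric["name"],
                "period": period,
                "start_date": bucket_start(event["created_date"], period),
            }
            change = {"$inc": {"aggregates.%s" % group_value: 1}}
            pairs.append((spec, change))
    return pairs


event = {
    "created_date": "2010-07-26 20:15:01",
    "data": {"client_id": "1034", "page_id": "175"},
    "type": "pageview",
}

# Two groupings times three resolutions gives six upserts for this pageview.
for spec, change in updates_for_event(event):
    print(spec, change)
```

Each pair is shaped so it could be passed as the filter and update arguments of an upsert; adding a new metric is then only a config change.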

Creating a Metric

A pageview happened and I want to update metrics for the client the page belongs to.

metrics.update(
  {
    'name': 'client.pageview',
    'period': 'minute',
    'start_date': '2010-05-12 12:50:00'
  },
  { '$inc': { 'aggregates.1034': 1 } },
  upsert=True
);

Completed Metric

{
  "_id" : ObjectId("4da45cf6306a22719829b71b"),
  "aggregates" : {
    "1034" : 11
  },
  "end_date" : "2010-05-12 12:54:59",
  "name" : "client.pageview",
  "period" : "minute",
  "start_date" : "2010-05-12 12:50:00",
  "total" : 11
}

What about another client?

If a second pageview comes in for a different client, we end up updating the exact same record. Thus our last metric becomes:

{
  "_id" : ObjectId("4da45cf6306a22719829b71b"),
  "aggregates" : {
    "1034" : 11,
    "1213" : 1
  },
  "end_date" : "2010-05-12 12:54:59",
  "name" : "client.pageview",
  "period" : "minute",
  "start_date" : "2010-05-12 12:50:00",
  "total" : 12
}
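To see why pageviews for different clients land in the same record, here is a small in-memory imitation of an upsert with `$inc`, written in plain Python so it runs without a database. It is only a sketch: `upsert_inc` is a hypothetical helper, not a MongoDB or PyMongo function, and incrementing `total` in the same update is an assumption (the update on the slide only increments the per-client counter).

```python
def upsert_inc(collection, spec, inc):
    """Imitate update(spec, {'$inc': inc}, upsert=True) on a list of dicts."""
    for doc in collection:
        if all(doc.get(key) == value for key, value in spec.items()):
            break  # found the bucket document for this metric and period
    else:
        doc = dict(spec)  # upsert: no match, so start from the filter fields
        collection.append(doc)
    for path, amount in inc.items():
        # Walk dotted paths such as "aggregates.1034", creating sub-documents.
        *parents, leaf = path.split(".")
        target = doc
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = target.get(leaf, 0) + amount
    return doc


metrics = []
spec = {
    "name": "client.pageview",
    "period": "minute",
    "start_date": "2010-05-12 12:50:00",
}

# Eleven pageviews for client 1034, then one for client 1213.
for _ in range(11):
    upsert_inc(metrics, spec, {"aggregates.1034": 1, "total": 1})
upsert_inc(metrics, spec, {"aggregates.1213": 1, "total": 1})

print(len(metrics))
print(metrics[0]["aggregates"], metrics[0]["total"])
```

Because the filter only names the metric, period, and bucket start, every client in that bucket shares one document and gets its own key under `aggregates`. With a real server the equivalent call in current PyMongo would be `update_one(..., upsert=True)`, since `Collection.update` was removed in PyMongo 4.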

Some Queries

1. Get pageviews for all clients that occurred on May 12 between 12:50 and 12:51

db.metrics.find({
  name: "client.pageview",
  period: "minute",
  start_date: "2010-05-12 12:50:00"
});

1 Document, n entries.

2. Get pageviews for client 1034 that occurred on May 12 between 12:50 and 12:51

db.metrics.find({
  name: "client.pageview",
  period: "minute",
  start_date: "2010-05-12 12:50:00"
}, { "aggregates.1034": 1 });

1 Document, 1 entry.

More Queries

1. Get pageviews for all clients that occurred on May 12 and graph by hour.

db.metrics.find({
  name: "client.pageview",
  period: "hour",
  start_date: /2010-05-12/
});

24 Documents, n entries.

2. Get pageviews for client 1034 that occurred on May 12 and graph by minute.

db.metrics.find({
  name: "client.pageview",
  period: "minute",
  start_date: /2010-05-12/
}, { "aggregates.1034": 1 });

1440 Documents, 1 entry.
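The same queries can be assembled from Python. The sketch below only builds the filter and projection dictionaries, so it runs without a server; `metric_query` and `DOCS_PER_DAY` are hypothetical names introduced for illustration. With PyMongo, the resulting pair could be passed to `find(filter, projection)`, and a compiled `re` pattern in the filter is sent as a regular expression match.

```python
import re


def metric_query(name, period, start_date, client_id=None):
    """Build (filter, projection) arguments for a find() call."""
    spec = {"name": name, "period": period, "start_date": start_date}
    projection = None
    if client_id is not None:
        # Return only this client's counter instead of every entry.
        projection = {"aggregates.%s" % client_id: 1}
    return spec, projection


# Match every bucket whose start_date string begins with May 12.
day = re.compile(r"^2010-05-12")

by_hour = metric_query("client.pageview", "hour", day)
by_minute = metric_query("client.pageview", "minute", day, client_id="1034")

# Upper bound on documents returned for one full day at each resolution.
DOCS_PER_DAY = {"hour": 24, "minute": 24 * 60}

print(by_hour)
print(by_minute)
print(DOCS_PER_DAY)
```

Anchoring the pattern with `^` is a choice made here, not something the slides show; it keeps the match to dates that start with the given day.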

Let’s take a peek.

@patr1cks @buddymedia
