Implementing and Visualizing Clickstream Data with MongoDB
DESCRIPTION
Having recently implemented a new framework for the real-time collection, aggregation and visualization of web- and mobile-generated clickstream traffic (reaching daily volumes of 1M+ events), this walkthrough covers the motivations, thought process and key decisions made, as well as an in-depth look at how to build out a data-collection, analytics and visualization framework using MongoDB. Technologies covered in this presentation (alongside MongoDB) are Java, Spring, Django and PyMongo.
TRANSCRIPT
Implementing and Visualizing Click-Stream Data with MongoDB
Jan 22, 2013 - New York MongoDB User Group
Cameron Sim - LearnVest.com
Agenda • About LearnVest
• HL Application Architecture
• Data Capture
• Event Packaging
• MongoDB Data Warehousing
• Loading & Visualization
• Finishing up
• Next Steps
LearnVest Inc. - www.learnvest.com
Company Founded in 2008 by Alexa Von Tobel, CEO
50+ People and Growing rapidly
Based in NYC
Platforms Web & iPhone
Mission Statement Aiming to make financial planning as accessible as having a gym membership
Key Products Account Aggregation and Management
(Bank, Credit, Loan, Investment, Mortgage)
Original and Syndicated Newsletter Content
Financial Planning (tiered product offering)
Stack
Operational: WordPress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
Analytics: MongoDB 2.2.0 (3-node replica-set), Java 6, Spring 3, PyMongo, Django 1.4
High Level Architecture
[Architecture diagram: LearnVest.com Web and iPhone clients → Production Platform Delivery Services (HTTPS) → Event Collection → Event Packaging → Analytics Services → MongoDB Data Warehousing (MongoDB Java driver, MongoDB replication) → Loaders & Dashboards (pyMongo, JDBC) → Loading & Visualization]
Philosophy For Data Collection
Capture Everything
• User-driven events over web and mobile
• System-level exceptions
• Everything else
Temporary Data
• Be 'ok' with approximate data
• Operational databases are the system of record
Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time-dimensions (day, week-ending, month, year)
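The "aggregate as they come in" idea boils down to a filter/update pair handed to an upsert. A minimal Python sketch, using the field names that appear later in the deck (`user-context`, `eventType`, `date`, `count`); the helper name is ours:

```python
from datetime import datetime

def build_bucket_upsert(user_context, event_type, when, date_fmt):
    # The filter identifies one user/event/time-dimension bucket...
    query = {
        "user-context": user_context,
        "eventType": event_type,
        "date": when.strftime(date_fmt),
    }
    # ...and $inc bumps its counter; run with upsert=True so the bucket
    # is created on the first event and reads never re-aggregate.
    update = {"$inc": {"count": 1}}
    return query, update

q, u = build_bucket_upsert("00002b4f", "login", datetime(2013, 1, 22), "%m/%d")
# q == {"user-context": "00002b4f", "eventType": "login", "date": "01/22"}
```

The same pair would be passed to four collections, one per time dimension, with only `date_fmt` changing.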
Data Capture
iOS (Objective-C)
- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source {
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    if ([eventData count] > 0) [params setObject:eventData forKey:@"eventData"];
    [[LVNetworkEngine sharedManager] analytics_send:params];
}
Data Capture
WEB (JavaScript)
function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };
    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };
    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set request headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}
Bus Event Packaging
1. Spring 3 RESTful service layer; controller methods define the eventCode via a @Tracking annotation
2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST Service
Bus Event Packaging
1) Spring RestController Methods
Interface
@RequestMapping(value = "/user/login", method = RequestMethod.POST, headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request);

Concrete/Impl Class
@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request) {
    //Implementation
    return event;
}
Bus Event Packaging
2) Custom interceptor class extends HandlerInterceptorAdapter
protected void handleTracking(String trackingCode, Map<String, Object> modelMap, HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();
    // remove non-serializables & copy over data from modelMap
    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        log.error("Error tracking event '" + trackingCode + "' : " + ExceptionUtils.getStackTrace(e));
    }
}
Bus Event Packaging
3) EventPublisher normalizes the payload and forwards it
public void publish(String eventCode, Map<String, Object> eventData, HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
    //Normalize message
    payload.put("eventCode", eventCode);
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    . . .
    //Send to the Analytics Service for MongoDB persistence
}

public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}
Bus Event Packaging
The Serialized JSON (User Action)
{
  "eventCode" : "user.login",
  "eventType" : "login",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : {
    "" : "",
    "" : "",
    "" : ""
  },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}
Bus Event Packaging
The Serialized JSON (Generic Event)
{
  "eventCode" : "generic.ui",
  "eventType" : "pageView",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : {
    "page" : "/learnvest/moneycenter/inbox",
    "section" : "transactions",
    "name" : "view transactions",
    "object" : "page"
  },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}
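The normalization step that produces payloads of this shape can be sketched in Python. This is illustrative, not the production Java code: the function name is ours, and the default values are assumptions based on the example payloads.

```python
import uuid
from datetime import datetime

def normalize_event(event_code, event, request_headers):
    # Copy client-supplied fields and stamp server-side metadata,
    # mirroring the EventPublisher normalization shown earlier.
    return {
        "eventCode": event_code,
        "eventType": event.get("eventType"),
        "eventData": event.get("eventData", {}),
        "version": event.get("version", "1.0"),   # assumed default
        "eventId": str(uuid.uuid4()),             # server-generated unique id
        "eventTime": datetime.utcnow(),           # server receive time
        "request": request_headers,               # normalized HTTP headers
    }

payload = normalize_event(
    "generic.ui",
    {"eventType": "pageView",
     "eventData": {"page": "/learnvest/moneycenter/inbox"}},
    {"call-source": "WEB"})
```

Every event thus carries the same envelope regardless of source, which is what makes the downstream bucketing generic.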
MongoDB Data Warehousing
MongoDB Information
• v2.2.0
• 3-node replica-set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
• Each with a single 500GB EBS volume mounted to /opt/data
MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager
Volumes
• ~1M events daily on web, ~600K on mobile
• 2-3 GB per day at start, slowed to ~1GB per day
• Currently at 78GB (collecting since August 2012)
Future Scaling Strategy
• Set up a 2nd replica-set
• Shard replica-sets to n at 60% / 250GB per EBS volume
• Shard key probably based on a sequential mix of email_address & an additional string
MongoDB Data Warehousing
Approach
1. Persist all events, bucketed by source: WEB, MOBILE
2. Persist all events, bucketed by source, event code and time: WEB/MOBILE, user.login, time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Upsert into: e_web_user_login_day, e_web_user_login_week, e_web_user_login_month, e_web_user_login_year
5. Predictable model for scaling and measuring business growth
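The per-dimension collection names in step 4 follow a mechanical pattern, so they can be derived from the source and event code. A small sketch (the helper name is ours; the naming convention is taken from the collection list shown later):

```python
def bucket_collection_names(source, event_code):
    # e.g. ("web", "user.login") -> e_web_user_login_{day,week,month,year}
    base = "e_%s_%s" % (source, event_code.replace(".", "_"))
    return [base + "_" + dim for dim in ("day", "week", "month", "year")]

names = bucket_collection_names("web", "user.login")
# names == ['e_web_user_login_day', 'e_web_user_login_week',
#           'e_web_user_login_month', 'e_web_user_login_year']
```

Deriving the names rather than hard-coding them is what lets the warehouse grow a predictable set of collections for every new event code.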
MongoDB Data Warehousing
2. Persist all events, bucketed by source, event code and time:
//instantiate collections dynamically
DBCollection collection_day = mongodb.getCollection(eventCode + "_day");
DBCollection collection_week = mongodb.getCollection(eventCode + "_week");
DBCollection collection_month = mongodb.getCollection(eventCode + "_month");
DBCollection collection_year = mongodb.getCollection(eventCode + "_year");

BasicDBObject newDocument = new BasicDBObject().append("$inc",
    new BasicDBObject().append("count", 1));

//update day dimension
collection_day.update(new BasicDBObject().append("user-context", userContext)
    .append("eventType", eventType)
    .append("date", sdf_day.format(d)), newDocument, true, false);

//update week dimension
collection_week.update(new BasicDBObject().append("user-context", userContext)
    .append("eventType", eventType)
    .append("date", sdf_day.format(w)), newDocument, true, false);

//update month dimension
collection_month.update(new BasicDBObject().append("user-context", userContext)
    .append("eventType", eventType)
    .append("date", sdf_month.format(d)), newDocument, true, false);

//update year dimension
collection_year.update(new BasicDBObject().append("user-context", userContext)
    .append("eventType", eventType)
    .append("date", sdf_year.format(d)), newDocument, true, false);
MongoDB Data Warehousing
Persist all events, bucketed by source, event code and time:
> show collections
e_mobile
e_web
e_web_account_addManual_day
e_web_account_addManual_month
e_web_account_addManual_week
e_web_account_addManual_year
e_web_user_login_day
e_web_user_login_week
e_web_user_login_month
e_web_user_login_year
e_mobile_generic_ui_day
e_mobile_generic_ui_month
e_mobile_generic_ui_week
e_mobile_generic_ui_year
> db.e_web_user_login_day.find()
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 5, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6cfcb9a80a2b4ee21422"), "count" : 7, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6e51b9a80a2b4ee21427"), "count" : 2, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 3, "date" : "01/03", "user-context" : "50e49a561b36921910222c33" }
MongoDB Data Warehousing
Persist all events
> db.e_web.findOne()
{
  "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
  "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
  "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
  "request" : {
    "content-type" : "application/json",
    "connection" : "keep-alive",
    "accept-language" : "en-US,en;q=0.8",
    "host" : "localhost:8080",
    "call-source" : "WEB",
    "accept" : "*/*",
    "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
    "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "255",
    "accept-encoding" : "gzip,deflate,sdch"
  },
  "eventType" : "flick",
  "eventData" : {
    "object" : "button",
    "name" : "split transaction button",
    "page" : "#inbox/79876/",
    "section" : "transaction_river_details"
  }
}
MongoDB Data Warehousing
Indexing Strategy
• Indexes on core collections (e_web and e_mobile) come in under 3GB on a 7.5GB Large instance and 3.75GB on Medium instances
• Split the datetime into two fields and compound index on date with other fields like eventType and the user unique id (user-context)
• Heavy insertion rates, much lower read rates... so the fewer indexes the better
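The two-field split can be sketched in Python: keep the full timestamp for display, plus a midnight-truncated copy so the compound index groups all of a day's events under one value. Field names follow the `e_web` document shown earlier; the helper name is ours.

```python
from datetime import datetime

def split_timestamp(dt):
    # Full precision for auditing; midnight-truncated for indexed
    # range queries and daily grouping.
    return {
        "created_datetime": dt,
        "created_date": dt.replace(hour=0, minute=0, second=0, microsecond=0),
    }

fields = split_timestamp(datetime(2013, 1, 2, 21, 7, 55, 656000))

# The compound index specs from the slides, in PyMongo key-list form:
idx_user = [("request.user-context", 1), ("created_date", 1)]
idx_name = [("eventData.name", 1), ("created_date", 1)]
```

Because `created_date` has only one distinct value per day, the index stays small relative to one on the full timestamp, which matters when writes dominate reads.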
MongoDB Data Warehousing
Indexing Strategy
> db.e_web.getIndexes()
[
  {
    "v" : 1,
    "key" : { "request.user-context" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "request.user-context_1_created_date_1"
  },
  {
    "v" : 1,
    "key" : { "eventData.name" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "eventData.name_1_created_date_1"
  }
]
Loading & Visualization
Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?
Non-Functionals
• Intraday doesn't need to be "real-time"; polling is good enough for now
• The overnight batch job for historic data must scale horizontally
General Implementation Strategy
• Do all heavy lifting & object manipulation server-side; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load
Loading & Visualization
Java Batch Service
Java Mongo library to query key collections and return user counts and the sum of events
DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));

private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;
    int count = 0;
    DBObject obj;
    while (cursor.hasNext()) {
        obj = (DBObject) cursor.next();
        count++;
        sum = sum + (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    m.put("average", df.format(new Float(sum) / count)); // df: a DecimalFormat
    return m;
}
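The same sum/count/average reduction in Python, over the documents the cursor would yield. A sketch under the assumption that each document carries a `count` field as in the `*_day` collections; a real run would iterate a PyMongo cursor instead of a list.

```python
def sum_and_count(docs):
    # docs: iterable of {"count": n} documents from one *_day collection
    total = 0
    n = 0
    for doc in docs:
        n += 1
        total += doc["count"]
    return {
        "sum": total,                                     # total events
        "count": n,                                       # distinct buckets
        "average": round(total / float(n), 2) if n else 0.0,
    }

stats = sum_and_count([{"count": 5}, {"count": 7}, {"count": 2}])
# stats == {"sum": 14, "count": 3, "average": 4.67}
```

Guarding the division against an empty cursor avoids the batch job failing on days with no events.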
Loading & Visualization
Java Batch Service
Use the Aggregation Framework where required on core collections (e_web) and external data
//create aggregation objects
DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);

//create the fields to group by, in this case "number"
groupFields.put("number", new BasicDBObject("$sum", 1));

//create the group
DBObject group = new BasicDBObject("$group", groupFields);

//execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    . . .
}
Loading & Visualization
Java Batch Service
MongoDB command-line example of aggregation over a time period, e.g. a month
> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" },
                                 "month" : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    { $sort : { day_value : -1 } }
])
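The same pipeline can be built as a list of stage dicts for PyMongo's `aggregate`. Only the stage construction is shown here (no connection is opened); the function name is ours.

```python
from datetime import datetime

def daily_counts_pipeline(since):
    # Mirrors the shell example: match a date range, project a
    # {day, month} key, count documents per key, then sort.
    return [
        {"$match": {"created_date": {"$gt": since}}},
        {"$project": {"day_value": {"day": {"$dayOfMonth": "$created_date"},
                                    "month": {"$month": "$created_date"}}}},
        {"$group": {"_id": {"day_value": "$day_value"},
                    "number": {"$sum": 1}}},
        {"$sort": {"day_value": -1}},
    ]

pipeline = daily_counts_pipeline(datetime(2012, 10, 25))
# usage (against a live database): db.e_web.aggregate(pipeline)
```

Building the pipeline as plain data keeps it testable without a database and reusable across the per-dimension collections.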
Loading & Visualization
Java Batch Service
Persisting events into graph and table collections
> db.homeGraphs.find()
{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" : "15.13", "premium_rate" : "0", "str_date" : "2011,01,08", "upgrade_rate" : "0", "users_avg_linked" : "4.5", "users_linked" : 18 }
Loading & Visualization
Django and HighCharts
Extract data (PyMongo)
def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        conn = pymongo.Connection('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception as e:
        logger.error(e.message)
Return the graph object (as a list or a dict of lists) to the view that called the method:
pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
return render_to_response('home.html', {'pagedata': pagedata}, context_instance=RequestContext(request))
Loading & Visualization
Django and HighCharts
Populate the series... (JavaScript with Django templating)
seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %},{% endif %}
            [Date.UTC({{a.0}}), {{a.1}}]
        {% endfor %}
    ],
    tooltip: {
        valueDecimals: 2
    }
};
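On the Python side, the batch service can pre-shape `homeGraphs` documents into the (str_date, value) pairs this template loop expects. A sketch, assuming the template's `a.0`/`a.1` are exactly that pair; the helper name is ours, and `str_date` ("year,month,day") is passed through untouched since JavaScript's `Date.UTC` takes a zero-based month.

```python
def to_highcharts_series(docs, field):
    # Each item is (str_date, value); the template splats str_date
    # into Date.UTC(...) and pairs it with the value:
    #   [Date.UTC({{a.0}}),{{a.1}}]
    return [(doc["str_date"], doc[field]) for doc in docs]

series = to_highcharts_series(
    [{"str_date": "2011,01,06", "accounts_natural": 54},
     {"str_date": "2011,01,07", "accounts_natural": 144}],
    "accounts_natural")
```

Doing this shaping in the loader keeps the "UI should just display" strategy intact: the template iterates, it never computes.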
Loading & Visualization Django and HighCharts And Create the Charts and Tables...
Lessons Learned
• Date-time managed as two fields, Datetime and Date
• Aggregating and upserting documents as events are received works for us
• Real-time map-reduce in PyMongo - too slow, don't do this
• Django-nonrel - unstable; use Django and configure MongoDB as a datastore only
• Memcached on Django is good enough (at the moment) - use django-celery with RabbitMQ to pre-cache all data after data loading
• HighCharts is buggy - considering D3 & other libraries
• No need to retrieve data directly from MongoDB into Django; perhaps provide all data via a service layer (at the expense of ever-additional features in PyMongo)
Next Steps • A/B testing framework, experiments and variances
• Unauthenticated / Authenticated user tracking
• Provide data async over service layer
• Segmentation with graphical libraries like D3 & Crossfilter (http://square.github.com/crossfilter/)
• Saving Query Criteria, expanding out BI tools for internal users
• MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)
• Storm / Kafka for real-time analytics processing
• Shard the Replica-Set, looking into Gizzard as the middleware
Thanks & Questions
Cameron Sim, Director of Analytics Tech - [email protected]
Kevin Connelly, Director of Engineering - [email protected]
Hrishi Dixit, Chief Technology Officer
Jeremy Brennan, Director of UI/UX Technology - [email protected]
Will Larche, Lead iOS Developer - [email protected]
<your name here>, New Awesome Developer - [email protected] ... HIRED!