Implementing and Visualizing Click- Stream Data with MongoDB Jan 22, 2013 - New York MongoDB User Group Cameron Sim - LearnVest.com

Post on 15-Jan-2015

DESCRIPTION

Having recently implemented a new framework for the real-time collection, aggregation and visualization of web- and mobile-generated clickstream traffic (realizing daily clickstream volumes of 1M+ events), this walkthrough covers the motivations, thought process and key decisions made, as well as an in-depth look at how to build out a data-collection, analytics and visualization framework using MongoDB. Technologies covered in this presentation (as well as MongoDB) are Java, Spring, Django and pyMongo.

TRANSCRIPT

Page 1: Implementing and Visualizing Clickstream data with MongoDB

Implementing and Visualizing Click-Stream Data with MongoDB

Jan 22, 2013 - New York MongoDB User Group

Cameron Sim - LearnVest.com

Page 2: Implementing and Visualizing Clickstream data with MongoDB

Agenda

• About LearnVest

• High-Level Application Architecture

• Data Capture

• Event Packaging

• MongoDB Data Warehousing

• Loading & Visualization

• Finishing up

• Next Steps

Page 3: Implementing and Visualizing Clickstream data with MongoDB

LearnVest Inc. www.learnvest.com

Company: Founded in 2008 by Alexa Von Tobel, CEO

50+ people and growing rapidly

Based in NYC

Platforms: Web & iPhone

Mission Statement: Aiming to make financial planning as accessible as having a gym membership

Key Products:

• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)

• Original and Syndicated Newsletter Content

• Financial Planning (tiered product offering)

Stack

Operational: WordPress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x

Analytics: MongoDB 2.2.0 (3-node replica set), Java 6, Spring 3, pyMongo, Django 1.4

Page 4: Implementing and Visualizing Clickstream data with MongoDB

LearnVest.com Web

Page 5: Implementing and Visualizing Clickstream data with MongoDB

LearnVest.com iPhone

Page 6: Implementing and Visualizing Clickstream data with MongoDB

High Level Architecture (diagram)

Pipeline stages: Event Collection, Event Packaging, MongoDB Data Warehousing, Loading & Visualization

Component labels: Production, Platform, Delivery Services, Analytics, Services, Loaders & Dashboards

Connection labels: HTTPS, Java Conn, MongoDB Replication, pyMongo, JDBC

Page 7: Implementing and Visualizing Clickstream data with MongoDB

Philosophy For Data Collection

Capture Everything
•  User-driven events over web and mobile
•  System-level exceptions
•  Everything else

Temporary Data
•  Be ‘ok’ with approximate data
•  Operational databases are the system of record

Aggregate events as they come in
•  Remove the overhead of basic metrics (counts, sums) on core events
•  Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
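As a rough illustration of "increment counts per event, over time dimensions", the sketch below derives the four bucket keys for a date and bumps one in-memory counter per dimension. The function names, the key formats and the choice of Sunday as the week-ending day are assumptions made for this example; the slides do not specify them.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(d, week_ends_on=6):
    """Return day, week-ending, month and year keys for a date.

    week_ends_on follows date.weekday() numbering (Monday=0 ... Sunday=6).
    Sunday is an assumed default, not something stated in the talk.
    """
    days_to_week_end = (week_ends_on - d.weekday()) % 7
    week_ending = d + timedelta(days=days_to_week_end)
    return {
        "day": d.strftime("%Y-%m-%d"),
        "week": week_ending.strftime("%Y-%m-%d"),
        "month": d.strftime("%Y-%m"),
        "year": d.strftime("%Y"),
    }


def record(counts, user_id, event_type, d):
    """Increment one counter per time dimension for this user and event."""
    for dimension, key in time_buckets(d).items():
        counts[(dimension, user_id, event_type, key)] += 1


counts = Counter()
record(counts, "user-1", "login", date(2013, 1, 2))
record(counts, "user-1", "login", date(2013, 1, 3))
```

After the two calls, the day counters hold one event each while the week, month and year counters for the same user hold two, which is the kind of pre-aggregated metric the slide describes.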

Page 8: Implementing and Visualizing Clickstream data with MongoDB

Data Capture

iOS (Objective-C)

- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // params holds the top-level request body; eventData holds the event details
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
    [[LVNetworkEngine sharedManager] analytics_send:params];
}

Page 9: Implementing and Visualizing Clickstream data with MongoDB

Data Capture

WEB (JavaScript)

function internalTrackPageView() {
    // The user identity is read from a cookie and forwarded as a request header
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };

    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };

    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set Request Headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}

Page 10: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

1. Spring 3 RESTful service layer: controller methods define the eventCode via the @Tracking annotation

2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher

3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST Service
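The flow above (annotated handler, interceptor, publisher, bus with several subscribers) can be sketched in a few lines of Python. This is only an analogy to the Spring design: the EventBus class and the tracking decorator are hypothetical names, and the sketch publishes synchronously where the original uses Spring @Async.

```python
import functools


class EventBus:
    """Minimal in-process bus: every subscriber sees every published event."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event_code, data):
        for handler in self._subscribers:
            try:
                handler(event_code, data)
            except Exception as exc:
                # A failing subscriber must not break the request or the others
                print("Error tracking event %r: %s" % (event_code, exc))


def tracking(bus, event_code):
    """Decorator playing the role of the annotation plus the interceptor."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            bus.publish(event_code, result)  # runs after the handler returns
            return result
        return wrapper
    return decorator


bus = EventBus()
received = []
bus.subscribe(lambda code, data: received.append((code, data)))


@tracking(bus, "user.login")
def user_login(event):
    return event
```

Calling user_login({"eventType": "login"}) returns the event unchanged and leaves one ("user.login", {...}) tuple in received, so the handler itself stays free of tracking code.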

Page 11: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

1) Spring RestController Methods

Interface

@RequestMapping(value = "/user/login", method = RequestMethod.POST, headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request);

Concrete/Impl Class

@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request) {
    // Implementation
    return event;
}

Page 12: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

2) Custom Interceptor class extends HandlerInterceptorAdapter

protected void handleTracking(String trackingCode, Map<String, Object> modelMap, HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();
    // remove non-serializables & copy over data from modelMap
    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        // Tracking failures are logged and never propagated to the caller
        log.error("Error tracking event '" + trackingCode + "' : " + ExceptionUtils.getStackTrace(e));
    }
}

Page 13: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

3) EventPublisher normalizes the payload and posts it to the Analytics Service

public void publish(String eventCode, Map<String, Object> eventData, HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
    // Normalize message: copy each field from its own key in the incoming event
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    . . .
    // Send to the Analytics Service for MongoDB persistence
}

public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}

Page 14: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

The Serialized Json (User Action)

{
    "eventCode" : "user.login",
    "eventType" : "login",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : { "" : "", "" : "", "" : "" },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}

Page 15: Implementing and Visualizing Clickstream data with MongoDB

Bus Event Packaging

The Serialized Json (Generic Event)

{
    "eventCode" : "generic.ui",
    "eventType" : "pageView",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : {
        "page" : "/learnvest/moneycenter/inbox",
        "section" : "transactions",
        "name" : "view transactions",
        "object" : "page"
    },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}
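For readers more at home in Python than Java, here is a small sketch of how a payload of this shape might be assembled and serialized. The build_payload function and its parameters are hypothetical; the field names follow the serialized examples, and lower-casing the header names is an assumption based on how the request keys appear there.

```python
import json
import time
import uuid


def build_payload(event_code, event, request_headers, now_ms=None, event_id=None):
    """Assemble a normalized event payload shaped like the serialized examples."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "eventCode": event_code,
        "eventType": event.get("eventType"),
        "version": event.get("version", "1.0"),
        "eventTime": str(now_ms),  # the examples carry epoch millis as a string
        "eventId": event_id or str(uuid.uuid4()),
        "eventData": event.get("eventData", {}),
        # Header names are lower-cased so that keys are predictable downstream
        "request": {k.lower(): v for k, v in request_headers.items()},
    }


payload = build_payload(
    "generic.ui",
    {"eventType": "pageView", "eventData": {"page": "/learnvest/moneycenter/inbox"}},
    {"Call-Source": "WEB", "User-Context": "00002b4f1150249206ac2b692e48ddb3"},
    now_ms=1358603157746,
    event_id="example-id",
)
serialized = json.dumps(payload, sort_keys=True)
```

Passing now_ms and event_id explicitly keeps the example deterministic; in normal use both would be generated at publish time.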

Page 16: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

MongoDB Information
•  v2.2.0
•  3-node replica set
•  1 Large (primary), 2x Medium (secondary) AWS Amazon Linux machines
•  Each with a single 500GB EBS volume mounted to /opt/data

MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager

Volumes
•  ~1M events daily on web, ~600K on mobile
•  2-3 GB per day at start, slowed to ~1GB per day
•  Currently at 78GB (collecting since August 2012)

Future Scaling Strategy
•  Set up a 2nd replica set
•  Shard replica sets to n at 60% / 250GB per EBS volume
•  Shard key probably based on a sequential mix of email_address & an additional string
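A back-of-the-envelope way to see when the sharding threshold might be reached, using only the figures on this slide (78GB stored, about 1GB per day, 250GB threshold). The constant growth rate is an assumption; real growth would vary with traffic.

```python
import math


def days_until_threshold(current_gb, threshold_gb, growth_gb_per_day):
    """Whole days until storage reaches the threshold at a constant growth rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth_gb_per_day must be positive")
    remaining = threshold_gb - current_gb
    return max(0, math.ceil(remaining / growth_gb_per_day))


# Figures from the slide: 78GB today, ~1GB/day, shard at 250GB per volume
days_left = days_until_threshold(78, 250, 1.0)
```

With these inputs days_left is 172, so at the slowed rate the 250GB mark is roughly half a year away; at the initial 2-3GB per day it would have been a matter of two to three months.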

Page 17: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

Approach
1. Persist all events, bucketed by source: WEB, MOBILE
2. Persist all events, bucketed by source, event code and time: WEB/MOBILE, user.login, time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Upsert into: e_web_user_login_day, e_web_user_login_week, e_web_user_login_month, e_web_user_login_year
5. Predictable model for scaling and measuring business growth
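The naming scheme in steps 3 and 4 is regular enough to generate. The sketch below shows one way to derive the collection names from a source and an event code; the collection_names function is a hypothetical helper, and replacing dots with underscores is inferred from names such as e_web_user_login_day.

```python
DIMENSIONS = ("day", "week", "month", "year")


def collection_names(source, event_code):
    """Return the raw-event collection and the per-dimension counter collections."""
    base = "e_" + source.lower()                         # e.g. e_web, e_mobile
    prefix = base + "_" + event_code.replace(".", "_")   # e.g. e_web_user_login
    return base, [prefix + "_" + dim for dim in DIMENSIONS]


raw, counters = collection_names("WEB", "user.login")
```

Here raw is "e_web" and counters lists the four e_web_user_login_* collections, so adding a new event code needs no schema work: the collections appear on first write.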

Page 18: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

2. Persist all events, bucketed by source, event code and time:

// instantiate collections dynamically
DBCollection collection_day = mongodb.getCollection(eventCode + "_day");
DBCollection collection_week = mongodb.getCollection(eventCode + "_week");
DBCollection collection_month = mongodb.getCollection(eventCode + "_month");
DBCollection collection_year = mongodb.getCollection(eventCode + "_year");

// $inc creates the count field on the first upsert and increments it afterwards
BasicDBObject newDocument = new BasicDBObject().append("$inc", new BasicDBObject().append("count", 1));

// update day dimension (upsert = true, multi = false)
collection_day.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(d)), newDocument, true, false);

// update week dimension
collection_week.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(w)), newDocument, true, false);

// update month dimension
collection_month.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_month.format(d)), newDocument, true, false);

// update year dimension
collection_year.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_year.format(d)), newDocument, true, false);
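To make the upsert-with-$inc behaviour concrete without a running database, the sketch below builds the same query and update documents in Python and applies them to a plain list. apply_upsert is an in-memory stand-in written for this example only; with PyMongo 3 or later the two documents would instead be passed to collection.update_one(query, update, upsert=True).

```python
def counter_upsert(user_context, event_type, date_key):
    """Build the query and update documents for one counter bucket."""
    query = {"user-context": user_context, "eventType": event_type, "date": date_key}
    update = {"$inc": {"count": 1}}
    return query, update


def apply_upsert(docs, query, update):
    """In-memory imitation of an upsert with $inc, for illustration only."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in query.items()):
            break                      # found the existing bucket
    else:
        doc = dict(query)              # no match: insert a new bucket
        docs.append(doc)
    for field, amount in update["$inc"].items():
        doc[field] = doc.get(field, 0) + amount
    return doc


day_collection = []
query, update = counter_upsert("c4ca4238a0b923820dcc509a6f75849b", "login", "01/02")
apply_upsert(day_collection, query, update)
apply_upsert(day_collection, query, update)
```

Two events for the same user, event type and day leave a single document with count equal to 2, which is why the counter collections stay small relative to the raw event collections.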

Page 19: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

Persist all events, bucketed by source, event code and time:

> show collections
e_mobile
e_web
e_web_account_addManual_day
e_web_account_addManual_month
e_web_account_addManual_week
e_web_account_addManual_year
e_web_user_login_day
e_web_user_login_week
e_web_user_login_month
e_web_user_login_year
e_mobile_generic_ui_day
e_mobile_generic_ui_month
e_mobile_generic_ui_week
e_mobile_generic_ui_year

> db.e_web_user_login_day.find()
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 5, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6cfcb9a80a2b4ee21422"), "count" : 7, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6e51b9a80a2b4ee21427"), "count" : 2, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 3, "date" : "01/03", "user-context" : "50e49a561b36921910222c33" }

Page 20: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

Persist all events

> db.e_web.findOne()
{
    "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
    "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
    "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
    "request" : {
        "content-type" : "application/json",
        "connection" : "keep-alive",
        "accept-language" : "en-US,en;q=0.8",
        "host" : "localhost:8080",
        "call-source" : "WEB",
        "accept" : "*/*",
        "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
        "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "255",
        "accept-encoding" : "gzip,deflate,sdch"
    },
    "eventType" : "flick",
    "eventData" : {
        "object" : "button",
        "name" : "split transaction button",
        "page" : "#inbox/79876/",
        "section" : "transaction_river_details"
    }
}

Page 21: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

Indexing Strategy

• Indexes on core collections (e_web and e_mobile) come in under 3GB, which fits in memory on the 7.5GB Large instance and the 3.75GB Medium instances

• Split datetime into two fields and build compound indexes on date with other fields like eventType and user unique id (user-context)

•  Heavy insertion rates, much lower read rates, so the fewer indexes the better
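A short sketch of the "split datetime into two fields" idea and of a compound index specification on the date field. The helper names are hypothetical; the field names and the resulting index name match the e_web examples elsewhere in this deck. The list-of-tuples form is the one PyMongo's create_index accepts for compound keys.

```python
from datetime import datetime


def split_datetime(ts):
    """Store the exact timestamp plus a midnight-truncated date for cheap day lookups."""
    return {
        "created_datetime": ts,
        "created_date": ts.replace(hour=0, minute=0, second=0, microsecond=0),
    }


# Compound index: user first, then day (1 means ascending)
USER_BY_DAY_INDEX = [("request.user-context", 1), ("created_date", 1)]


def default_index_name(spec):
    """Reproduce MongoDB's default naming: field_direction joined by underscores."""
    return "_".join("%s_%d" % (field, direction) for field, direction in spec)


fields = split_datetime(datetime(2013, 1, 2, 21, 7, 55, 656000))
name = default_index_name(USER_BY_DAY_INDEX)
```

Because created_date has only one value per day, equality matches on it can use the compound index directly, whereas the full timestamp would need a range query for the same lookup.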

Page 22: Implementing and Visualizing Clickstream data with MongoDB

MongoDB Data Warehousing

Indexing Strategy

> db.e_web.getIndexes()
[
    {
        "v" : 1,
        "key" : { "request.user-context" : 1, "created_date" : 1 },
        "ns" : "moneycenter.e_web",
        "name" : "request.user-context_1_created_date_1"
    },
    {
        "v" : 1,
        "key" : { "eventData.name" : 1, "created_date" : 1 },
        "ns" : "moneycenter.e_web",
        "name" : "eventData.name_1_created_date_1"
    }
]

Page 23: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Objective
•  Show historic and intraday stats on core use cases (logins, conversions)
•  Show user funnel rates on conversion pages
•  Show general usability: how do users really use the Web and iOS platforms?

Non-Functionals
•  Intraday doesn’t need to be “real-time”; polling is good enough for now
•  Overnight batch job for historic data must scale horizontally

General Implementation Strategy
•  Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
•  Modularize the service to be able to regenerate any graphs/tables without a full load

Page 24: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Java Batch Service

Java Mongo library to query key collections and return user counts and sum of events

DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));

// requires java.text.DecimalFormat
private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;
    int count = 0;
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        count++;
        // Number cast tolerates counters stored as Integer or Long
        sum += ((Number) obj.get("count")).intValue();
    }
    m.put("sum", sum);
    m.put("count", count);
    // Format the average as a decimal, guarding against an empty cursor
    m.put("average", count == 0 ? "0" : new DecimalFormat("0.00").format((double) sum / count));
    return m;
}

Page 25: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Java Batch Service

Use Aggregation Framework where required on core collections (e_web) and external data

// create aggregation objects
DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);

// create the fields to group by, in this case "number"
groupFields.put("number", new BasicDBObject("$sum", 1));

// create the group
DBObject group = new BasicDBObject("$group", groupFields);

// execute
AggregationOutput output = mycollection.aggregate(project, group);

for (DBObject obj : output.results()) {
    . .
}

Page 26: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Java Batch Service

MongoDB Command Line example on aggregation over a time period, e.g. month

> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" }, "month" : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    // after $group the day value lives under _id, so sort on that path
    { $sort : { "_id.day_value" : -1 } }
])
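The same match, project and group steps can be written out in plain Python over a list of event dictionaries, which may help when reasoning about what the pipeline returns. This is an in-memory illustration only, not how the batch service ran; the events_per_day name and the (month, day) ordering of the result are choices made for this sketch.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since):
    """Count events per (month, day) for events created after `since`."""
    counts = Counter()
    for event in events:
        created = event["created_date"]
        if created > since:                             # $match with $gt
            counts[(created.month, created.day)] += 1   # $project + $group + $sum
    # Latest (month, day) first; the year is ignored, as in the pipeline
    return sorted(counts.items(), reverse=True)


sample = [
    {"created_date": datetime(2012, 10, 25)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 11, 1)},
]
per_day = events_per_day(sample, datetime(2012, 10, 25))
```

Note that grouping on day and month alone merges the same calendar day from different years, so a window longer than twelve months would also need the year in the group key.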

Page 27: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Java Batch Service

Persisting events into graph and table collections

> db.homeGraphs.find()
{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" : "15.13", "premium_rate" : "0", "str_date" : "2011,01,08", "upgrade_rate" : "0", "users_avg_linked" : "4.5", "users_linked" : 18 }

Page 28: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Django and HighCharts

Extract data (pyMongo)

import logging
import pymongo

logger = logging.getLogger(__name__)

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        # pymongo.Connection is the pre-3.0 client class; newer PyMongo releases use pymongo.MongoClient
        conn = pymongo.Connection('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception as e:
        logger.error(e)

Return the graph object (as a list or a dict of lists) to the view that called the method

# in the view; requires: from datetime import datetime, timedelta
dt_to = datetime.utcnow()
dt_from = dt_to - timedelta(days=30)
pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
return render_to_response('home.html', {'pagedata': pagedata}, context_instance=RequestContext(request))


Page 29: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization

Django and HighCharts

Populate the series.. (JavaScript with Django templating)

seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}),{{a.1}}]
        {% endfor %}
    ],
    tooltip: { valueDecimals: 2 }
};
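Each point in the series is a [timestamp, value] pair, with the timestamp produced in the browser by Date.UTC. JavaScript's Date.UTC takes a zero-based month, which is presumably why the stored str_date for 6 February reads "2011,01,06". An alternative is to compute epoch milliseconds on the server and hand the template plain numbers. The sketch below shows that variant; to_point and build_series are hypothetical helpers, not the deck's buildMetricsDict.

```python
import calendar
from datetime import datetime


def to_point(day, value):
    """Convert a naive UTC datetime and a value into a [epoch_ms, value] pair."""
    epoch_ms = calendar.timegm(day.timetuple()) * 1000
    return [epoch_ms, value]


def build_series(docs, field):
    """Build a date-ordered list of points for one metric field."""
    ordered = sorted(docs, key=lambda doc: doc["date"])
    return [to_point(doc["date"], doc[field]) for doc in ordered]


docs = [
    {"date": datetime(1970, 1, 3), "accounts_natural": 144},
    {"date": datetime(1970, 1, 2), "accounts_natural": 54},
]
series = build_series(docs, "accounts_natural")
```

With server-side timestamps the template loop reduces to printing two numbers per point, and the zero-based month detail no longer leaks into stored data.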

Page 30: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization Django and HighCharts And Create the Charts and Tables...

Page 31: Implementing and Visualizing Clickstream data with MongoDB

Loading & Visualization Django and HighCharts And Create the Charts and Tables...

Page 32: Implementing and Visualizing Clickstream data with MongoDB

Lessons Learned • Date Time managed as two fields, Datetime and Date

• Aggregating and upserting documents as events are received works for us

•  Real-time Map-Reduce in pyMongo - too slow, don’t do this.

• Django-noRel - Unstable, use Django and configure MongoDB as a datastore only

• Memcached on Django is good enough (at the moment) - use django-celery with rabbitmq to pre-cache all data after data loading

•  HighCharts is buggy - considering D3 & other libraries

• Don’t need to retrieve data directly from MongoDB to Django, perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)

Page 33: Implementing and Visualizing Clickstream data with MongoDB

Next Steps •  A/B testing framework, experiments and variances

•  Unauthenticated / Authenticated user tracking

•  Provide data async over service layer

• Segmentation with graphical libraries like D3 & Cross-Filter (http://square.github.com/crossfilter/)

• Saving Query Criteria, expanding out BI tools for internal users

• MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)

• Storm / Kafka for real-time analytics processing

• Shard the Replica-Set, looking into Gizzard as the middleware

Page 34: Implementing and Visualizing Clickstream data with MongoDB

Kevin Connelly Director of Engineering [email protected]

Cameron Sim Director of Analytics Tech [email protected]

Thanks & Questions

Hrishi Dixit Chief Technology Officer

[email protected]

Jeremy Brennan

Director of UI/UX Technology [email protected]

Will Larche Lead IOS Developer [email protected]

<your name here>

New Awesome Developer [email protected]

HIRED!