Data Science Stack with MongoDB and RStudio

Post on 19-Aug-2014

Building up an easy data science platform with RStudio server on top of your MongoDB

Winston Chen – Lead Software Engineer

What does Fliptop do?

• Predictive lead scoring, using data science
– Pull opportunity/lead/contact data from CRM
– Aggregate company data and social data from various data sources and the internet
– Over 3000 signals
– Build conversion/revenue model
– Predict lead conversion and revenue

Our Platform Stack

• Java/Scala
• Liftweb
• JMS/Storm
• MongoDB/MySQL

Our Machine Learning Stack

• Python
• Numpy/Scipy/Pandas
• Bottle (RESTful server)

So, where is R then?

• Problem:
– Data is stored in MongoDB
  • Sales lead data
  • Sales opportunity data
  • Sales contact data
– It’s hard to view/digest/process data on the fly using the MongoDB console
  • (X) Text processing for insight extraction?
  • (X) Prototype cool machine learning algorithms on the fly?
• Solution:
– R and RStudio Server
  • Why not Scala?
  • Why not Python/IPython?

MongoDB Console & Query

RStudio Server

Pull MongoDB data into R data frame

• rmongodb (https://github.com/gerald-lindsly/rmongodb)

Transform into an R data frame

1 – Get the total count of your data set

2 – Construct Vectors for each column

3 – Loop through the cursor and insert values

Where are my apply functions? Too bad. We are using a Mongo cursor :P

4 – Go into the nested BSON block to extract data (optional)

5 – Construct data frame and return

You are able to get the full example code here: http://goo.gl/tlyyXp

We now have a data frame to play with, built from MongoDB BSON.
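The five steps above can be sketched as follows. This is only an illustration of the pattern in Python, not the rmongodb code from the talk: `fake_cursor` is a hypothetical stand-in for a real MongoDB cursor, and the field names (`email`, `score`, `company.name`) are made up for the example.

```python
from typing import Any, Dict, Iterator, List, Optional


def fake_cursor(docs: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a MongoDB cursor: yields one document at a time."""
    for doc in docs:
        yield doc


def cursor_to_columns(docs: List[Dict[str, Any]]) -> Dict[str, List[Optional[Any]]]:
    # 1 - get the total count of the data set
    n = len(docs)

    # 2 - construct one pre-sized vector per column
    email: List[Optional[Any]] = [None] * n
    score: List[Optional[Any]] = [None] * n
    company: List[Optional[Any]] = [None] * n

    # 3 - loop through the cursor and insert values by position
    for i, doc in enumerate(fake_cursor(docs)):
        email[i] = doc.get("email")
        score[i] = doc.get("score")
        # 4 - go into the nested sub-document (optional; may be missing)
        sub = doc.get("company") or {}
        company[i] = sub.get("name")

    # 5 - assemble the columns into a frame-like structure and return it
    return {"email": email, "score": score, "company": company}


# Made-up sample documents, for illustration only.
leads = [
    {"email": "a@example.com", "score": 0.8, "company": {"name": "Acme"}},
    {"email": "b@example.com", "score": 0.3},
]
frame = cursor_to_columns(leads)
```

Pre-sizing the columns from the count avoids growing them one element at a time inside the loop, which is the reason step 1 comes first.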

This is NOT a BIG DATA Stack

• It takes around 1 min to process 900 MB+ of BSON from Mongo.
• NOT a big data stack
– Data should fit into RAM
• Most of the data in the business world is not big anyway.
• It works fine for us (m1.large machine in AWS)
– CRM data is never big, not even after we pull in 3000+ additional signals.
– The term ‘Big Data’ is seriously overrated; ‘Data Science’, however, is the key term here.
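As a rough back-of-envelope sketch, the figures on this slide imply a load rate of about 15 MB per second (900 MB in roughly 60 seconds). The helpers below are hypothetical. The in-memory overhead factor and the 7680 MB RAM figure used for an m1.large are assumptions for illustration, not numbers from the talk.

```python
# Implied by the slide: roughly 900 MB of BSON processed in about 60 seconds.
OBSERVED_MB_PER_SECOND = 900 / 60  # 15.0


def estimate_load_seconds(size_mb: float, rate: float = OBSERVED_MB_PER_SECOND) -> float:
    """Estimate load time, assuming throughput stays at the observed rate."""
    return size_mb / rate


def fits_in_ram(size_mb: float, ram_mb: float, overhead_factor: float = 3.0) -> bool:
    """Check whether a data set plausibly fits in memory.

    overhead_factor is an assumed multiplier for the in-memory copy
    compared with the raw BSON size; measure it for your own data.
    """
    return size_mb * overhead_factor <= ram_mb


seconds = estimate_load_seconds(900)
ok = fits_in_ram(900, ram_mb=7680)  # 7680 MB is an assumed RAM size
```

If the check fails for your data, that is the point at which this stack stops being a good fit.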

At Fliptop, we now use RStudio for

• Data insight extraction
• Algorithm prototyping

If you REALLY want BIG Data

• Look into: HDFS + Pig/Hive + Hue (any other suggestions from the audience?)

Q&A

• Winston Chen
– Personal blog: http://winston.attlin.com/
– Twitter: @wingchen83
– winston@fliptop.com
• Fliptop is hiring data scientists. Please email winston@fliptop.com
