June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer
TODAY
• High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
Understand our organizational needs for data science
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.
INFRASTRUCTURE, TALENT, & CAPABILITIES
Infrastructure: Hadoop, NoSQL, Analytics, SQL/MPP, Real Time
Talent: Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math
Capabilities: Visualization, Clustering, Categorization, Continuous Models, Text Analysis
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.
You will need toolkits to solve unique problems, but smart techniques make that easier.
Boxed solutions are limited, but can be a good source of early velocity.
ANALYTICS TOOLS
DATA
Presenter:
• Gigabytes from Stack Overflow
• Questions from users, with metadata
• Users have reputations
• Questions are open or closed
Audience:
• Follow along, thinking about your data, to learn in a familiar context and plan.
select
  count(1) as total,
  sum(has_code),
  avg(body_count),
  stddev_samp(body_count),
  corr(reputation, owner_questions),
  histogram_numeric(body_count, 10)
from questions;
STEP 1: EXPLORE
Patterns through Hive
Patterns through Tableau
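For readers without a Hive cluster at hand, the same kind of summary can be sketched in plain Python over a tiny in-memory sample. Only the column names (has_code, body_count, reputation, owner_questions) come from the query above; the rows and their values are invented for illustration. Hive's histogram_numeric uses non-uniformly spaced bins, while this sketch uses simple equal-width bins.

```python
import math

# Hypothetical sample rows; the values are invented for illustration.
questions = [
    {"has_code": 1, "body_count": 120, "reputation": 50, "owner_questions": 3},
    {"has_code": 0, "body_count": 40, "reputation": 10, "owner_questions": 1},
    {"has_code": 1, "body_count": 300, "reputation": 900, "owner_questions": 12},
    {"has_code": 0, "body_count": 80, "reputation": 200, "owner_questions": 5},
]


def mean(values):
    return sum(values) / len(values)


def stddev_samp(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def histogram(values, bins):
    """Counts per equal-width bin between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def summarize(rows, bins=4):
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "with_code": sum(r["has_code"] for r in rows),
        "avg_body": mean(body),
        "sd_body": stddev_samp(body),
        "corr_rep_questions": corr(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
        "body_histogram": histogram(body, bins),
    }


summary = summarize(questions)
```

On a real data set the point is the same as on the slide: a single pass gives counts, spread, correlation, and shape before any modeling starts.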
Summaries of unstructured data
Time-since metrics
select transform(…)
using ‘python …’
Clustering: Browsing cohorts
/bin/mahout canopy
STEP 2: FEATURE BUILDING
SQL Windowing: Cross-Record Features
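A script called through select transform(…) receives rows as tab-separated lines on standard input and writes tab-separated lines back on standard output. The sketch below shows what such a Python script might look like. The input columns (question id, creation date, body), the date format, the fixed reference date, and the function names are all assumptions for illustration, not the script used in the talk.

```python
import sys
from datetime import date

# Assumed reference date for the time-since metric.
REFERENCE_DATE = date(2013, 6, 1)


def build_features(line, today=REFERENCE_DATE):
    """Turn one tab-separated input row into one tab-separated feature row.

    Assumed input columns: question id, creation date (YYYY-MM-DD), body.
    Output columns: question id, days since creation, word count, has_code.
    """
    question_id, created, body = line.rstrip("\n").split("\t")
    year, month, day = (int(part) for part in created.split("-"))
    days_since = (today - date(year, month, day)).days  # time-since metric
    body_count = len(body.split())                      # summary of the text
    has_code = int("<code>" in body)                    # summary of the text
    return "\t".join(
        [question_id, str(days_since), str(body_count), str(has_code)]
    )


def transform(lines):
    """Yield a feature row for every non-blank input line."""
    for line in lines:
        if line.strip():
            yield build_features(line)


# As a streaming script, the file would end with something like:
#     for row in transform(sys.stdin):
#         sys.stdout.write(row + "\n")
```

Keeping the per-row logic in a plain function makes it easy to test on a laptop before running it over the full data.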
• Sample (don’t parallelize)
• Naturally parallel
  • SVD
  • Random Forests
• Estimators and Ensembles
  • Bootstrapping
  • Localizing
• Advanced Parallelization
  • Linear models with SGD
  • Neural networks
PARALLEL MODELS IN HADOOP
Single R model, run many times over samples and aggregated
m <- C5.0(status ~ …)
STEP 3: STRUCTURED MODEL (BAGGING)
Mapper 1:
  Define n reducer keys
  Send any record to reducer i with probability p
Reducer 1:
  Key: Id of sample
  Value: List of records
  Perform analysis over records
Reducer 2:
  Key: One
  Value: List of models
  Aggregate the models (e.g. average)
Bagging a Model
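The mapper and reducers above can be simulated in a single process to see how the pieces fit together. This is a sketch, not a Hadoop job: the per-sample model is a one-variable least-squares line standing in for the C5.0 model, the data is synthetic, and the function names and the default values of n_samples and p are illustrative choices.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n reducer keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_line(records):
    """Reducer 1 analysis: least-squares fit of y = a + b * x (stand-in model)."""
    n = len(records)
    mx = sum(x for x, _ in records) / n
    my = sum(y for _, y in records) / n
    sxx = sum((x - mx) ** 2 for x, _ in records)
    sxy = sum((x - mx) * (y - my) for x, y in records)
    b = sxy / sxx
    return my - b * mx, b


def aggregate(models):
    """Reducer 2: average the coefficients of all per-sample models."""
    n = len(models)
    return sum(a for a, _ in models) / n, sum(b for _, b in models) / n


def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)  # plays the role of the shuffle: group by key
    for sample_id, record in mapper(records, n_samples, p, rng):
        samples[sample_id].append(record)
    # Skip degenerate samples that cannot support a line fit.
    usable = [rs for rs in samples.values() if len({x for x, _ in rs}) >= 2]
    return aggregate([fit_line(rs) for rs in usable])


# Synthetic, noise-free data on the line y = 2 + 3x.
records = [(float(x), 2.0 + 3.0 * x) for x in range(50)]
intercept, slope = bag(records)
```

Sending each record to a reducer with probability p draws samples without replacement, so this approximates rather than reproduces the classic bootstrap.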
WHERE ARE WE?
We’ve created a structured model to flag questions that won’t be closed using Big Data.
But we haven’t used unstructured data.
TEXT ANALYSIS
• Is “the big dog” really different from “dog is big?”
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu?”
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
Bag of Words: Structure doesn’t matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
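The eggs-and-tofu pair above makes the difference concrete. In this small sketch, the two sentences have identical bags of words but different bigrams. The tokenizer is a deliberately naive lowercase split, chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """Token counts; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Counts of runs of n consecutive tokens; local order is kept."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)          # order ignored
same_bigrams = ngrams(a) == ngrams(b)                  # order matters
shared_bigrams = set(ngrams(a)) & set(ngrams(b))
```

A bag-of-words model cannot tell these two opinions apart, while a bigram model sees them as almost entirely different.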
STEP 4: UNSTRUCTURED MODEL
Similar to Hadoop’s Word Count
Create counts for token/category pairs
Use counts to calculate Information Gain
MR Job 1:
  Calculate information gain (IG) for all tokens.
MR Job 2:
  Select tokens with largest IG.
  Create structured data for record, tokens:
  question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
  Build a classifier over the newly structured data (prior slides)
Information Gain
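The first two jobs above can be sketched in one process: count token/category pairs, turn the counts into information gain, keep the highest-scoring tokens, and encode each record as a row of 0/1 flags. The tiny corpus, the tie-breaking rule, and the function names are assumptions for illustration. Information gain here is the entropy of the category minus its expected entropy after splitting on whether the token is present.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs):
    """docs: list of (tokens, category). Returns {token: information gain}."""
    n = len(docs)
    category_counts = Counter(category for _, category in docs)
    pair_counts = Counter()  # (token, category) -> number of documents
    for tokens, category in docs:
        for token in set(tokens):
            pair_counts[(token, category)] += 1
    base = entropy(category_counts.values())
    gains = {}
    for token in {t for t, _ in pair_counts}:
        with_t = [pair_counts[(token, c)] for c in category_counts]
        without_t = [category_counts[c] - pair_counts[(token, c)]
                     for c in category_counts]
        n_with = sum(with_t)
        n_without = n - n_with
        conditional = (n_with / n) * entropy(with_t)
        if n_without:
            conditional += (n_without / n) * entropy(without_t)
        gains[token] = base - conditional
    return gains


def top_tokens(gains, k):
    """Tokens with the largest gain; ties broken alphabetically."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def encode(tokens, vocabulary):
    """One 0/1 flag per selected token."""
    present = set(tokens)
    return [int(t in present) for t in vocabulary]


# Invented example corpus using the open/closed categories from the data set.
docs = [
    (["how", "parse", "json"], "open"),
    (["parse", "xml", "error"], "open"),
    (["best", "language", "ever"], "closed"),
    (["best", "editor", "ever"], "closed"),
]
gains = information_gain(docs)
vocabulary = top_tokens(gains, 3)
row = encode(["parse", "json", "best"], vocabulary)
```

The encoded rows are ordinary structured data, so the classifier from the earlier step can be reused unchanged.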
WHERE ARE WE?
We’ve created two models: one structured, one unstructured.
But they don’t work together.
STEP 5: ENSEMBLE MODEL
Join many models together by using their output as input to an ensemble model.
Best when models perform differently.
Exploit differences with nonlinearities, like interaction effects.
Ensembling
Mapper 1:
  Load multiple models
  Score the models per record and output
Reducer 1:
  Key: Id of record
  Value: List of model outputs
  Join model outputs to make new records
MR Job 2:
  Build a model over the output data as if it were raw data.
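The ensembling flow above can be sketched in a single process as follows. Everything concrete here is an assumption for illustration: the two base models are hand-written rules standing in for the structured and unstructured models, the records and labels are invented, and the second-stage model is a simple majority-vote lookup over each combination of base outputs. A lookup per combination is one way to capture interaction effects between the base models.

```python
from collections import Counter, defaultdict


# Hypothetical base models standing in for the two trained models.
def structured_model(record):
    return int(record["reputation"] >= 100)


def text_model(record):
    return int("error" in record["body"].lower().split())


BASE_MODELS = {"structured": structured_model, "text": text_model}


def score_records(records, models=BASE_MODELS):
    """Mapper 1: score every model per record, keyed by record id."""
    for record in records:
        for name, model in models.items():
            yield record["id"], (name, model(record))


def join_outputs(scored, model_names):
    """Reducer 1: join the model outputs for each record into one new row."""
    grouped = defaultdict(dict)
    for record_id, (name, output) in scored:
        grouped[record_id][name] = output
    return {rid: tuple(outputs[name] for name in model_names)
            for rid, outputs in grouped.items()}


def fit_ensemble(rows, labels):
    """MR Job 2: majority label for each combination of base outputs."""
    votes = defaultdict(Counter)
    for record_id, features in rows.items():
        votes[features][labels[record_id]] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in votes.items()}


# Invented records and labels.
records = [
    {"id": 1, "reputation": 500, "body": "Parse error in JSON"},
    {"id": 2, "reputation": 20, "body": "Best language ever"},
    {"id": 3, "reputation": 300, "body": "Which editor is best"},
    {"id": 4, "reputation": 5, "body": "Weird error on startup"},
]
labels = {1: "open", 2: "closed", 3: "open", 4: "open"}

rows = join_outputs(score_records(records), list(BASE_MODELS))
ensemble = fit_ensemble(rows, labels)
```

In this toy data neither base model alone separates the labels, but the combination of their outputs does.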
We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
WHERE ARE WE?
This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.
HOW DID WE GET HERE?
Questions?
www.thinkbiganalytics.com
@danmallinger