big data-science-oanyc

18
June 2013 BIG DATA SCIENCE: A PATH FORWARD

Upload: open-analytics

Post on 12-May-2015


TRANSCRIPT

Page 1: Big data-science-oanyc

June 2013

BIG DATA SCIENCE: A PATH FORWARD

Page 2: Big data-science-oanyc

CONFIDENTIAL | 2

linkedin.com/in/danmallinger/

@danmallinger

www.thinkbiganalytics.com

Data Science Lead @ Think Big

Product/Brand Obsessive

Teacher

Occasional Engineer

Page 3: Big data-science-oanyc

TODAY

A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

Page 4: Big data-science-oanyc

INFRASTRUCTURE, TALENT, & CAPABILITIES

Understand our organizational needs for data science

Infrastructure: Technological tools and platforms.

Talent: Staff hired and trained.

Capabilities: Data science techniques utilized.

Hadoop | NoSQL | Analytics | SQL/MPP | Real Time

Scripting | MapReduce | Data Exploration | Basic Modeling | PhD Math

Visualization | Clustering | Categorization | Continuous Models | Text Analysis

Page 5: Big data-science-oanyc

ANALYTICS TOOLS

Boxed Solutions: Mahout & Platform

Boxed solutions are limited but can be a good source of early velocity.

Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems, but smart techniques make that easier.

Page 6: Big data-science-oanyc

DATA

Presenter

Gigabytes from Stack Overflow

Questions from users, with metadata

Users have reputations

Questions open or closed

Audience

Follow along, thinking about your data, to learn in a familiar context and plan.

Page 7: Big data-science-oanyc

STEP 1: EXPLORE

Patterns through Hive

select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions;

Patterns through Tableau
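The Hive summary query on this slide can be prototyped on a small local sample before it is run over the full table. The following is a minimal sketch in Python, assuming hypothetical sample rows that carry the columns the query names (has_code, body_count, reputation, owner_questions). Hive's histogram_numeric returns an approximate histogram with non-uniform bin centers; this sketch swaps in plain equal-width bins, so the two outputs are not directly comparable.

```python
import math
from statistics import mean, stdev

# Hypothetical sample rows (not real Stack Overflow data):
# (has_code, body_count, reputation, owner_questions)
SAMPLE = [
    (1, 120, 50, 3),
    (0, 40, 10, 1),
    (1, 300, 400, 12),
    (0, 80, 20, 2),
    (1, 200, 150, 6),
]


def pearson(xs, ys):
    """Pearson correlation coefficient, the quantity Hive's corr() reports."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def equal_width_histogram(values, bins):
    """Count values per equal-width bin (a stand-in for histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def summarize(rows, bins=10):
    """Compute the same six summaries as the Hive query, column by column."""
    has_code, body_count, reputation, owner_questions = zip(*rows)
    return {
        "total": len(rows),
        "sum_has_code": sum(has_code),
        "avg_body_count": mean(body_count),
        "stddev_samp_body_count": stdev(body_count),  # sample standard deviation
        "corr_reputation_owner_questions": pearson(reputation, owner_questions),
        "histogram_body_count": equal_width_histogram(body_count, bins),
    }


summary = summarize(SAMPLE)
for name, value in summary.items():
    print(name, value)
```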

Page 8: Big data-science-oanyc

STEP 2: FEATURE BUILDING

Summaries of unstructured data

Time-since metrics

select transform(…) using 'python …'

Clustering: Browsing cohorts

/bin/mahout canopy

SQL Windowing: Cross-Record Features
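Hive's transform clause streams each input row to the script as a tab-separated line on standard input and reads tab-separated rows back from standard output. The following is a minimal sketch of such a script, not the presenter's actual code. The input layout (id, body), the function names, and the rule that a question "has code" when its body contains a <code> tag are all assumptions made for illustration; the output columns reuse the has_code and body_count names from the exploration query.

```python
import sys


def build_features(line):
    """Turn one tab-separated row (id, body) into (id, has_code, body_count)."""
    question_id, body = line.rstrip("\n").split("\t", 1)
    has_code = 1 if "<code>" in body else 0  # assumed definition of has_code
    body_count = len(body.split())           # whitespace-delimited token count
    return "\t".join([question_id, str(has_code), str(body_count)])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one feature row per non-empty input row."""
    for line in stdin:
        if line.strip():
            stdout.write(build_features(line) + "\n")


# To use this as a streaming script, end the file with a call to main().
```

Keeping the per-row logic in build_features makes it easy to test on a laptop with a few sample lines before the script is shipped to the cluster.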

Page 9: Big data-science-oanyc

PARALLEL MODELS IN HADOOP

• Sample (don’t parallelize)

• Naturally parallel

  • SVD

  • Random Forests

• Estimators and Ensembles

  • Bootstrapping

  • Localizing

• Advanced Parallelization

  • Linear models with SGD

  • Neural networks

Page 10: Big data-science-oanyc

STEP 3: STRUCTURED MODEL (BAGGING)

A single R model, run many times over samples and aggregated.

m <- C5.0(status ~ …)

Bagging a Model

Mapper 1:

Define n reducer keys

Send any record to reducer i with probability p

Reducer 1:

Key: Id of sample

Value: List of records

Perform analysis over records

Reducer 2:

Key: One

Value: List of models

Aggregate the models (e.g. average)
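The bagging flow on this slide can be simulated in a single process to check the logic before writing a real MapReduce job. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the records are toy (x, label) pairs, and a one-feature threshold classifier stands in for the C5.0 tree. Aggregation is done by majority vote, which is one possible reading of "aggregate the models".

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def shuffle(pairs):
    """Group mapper output by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def fit_stump(records):
    """Reducer 1: fit a threshold rule 'x >= t' (a stand-in for C5.0)."""
    best = None
    for threshold in sorted({x for x, _ in records}):
        errors = sum((x >= threshold) != bool(y) for x, y in records)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def aggregate(thresholds):
    """Reducer 2: combine all models under one key; here, a majority vote."""
    def predict(x):
        votes = sum(x >= t for t in thresholds)
        return int(votes * 2 > len(thresholds))
    return predict


def bag(records, n_samples=25, p=0.5, seed=0):
    """Run the mapper, both reducers, and return the aggregated predictor."""
    rng = random.Random(seed)
    samples = shuffle(mapper(records, n_samples, p, rng))
    thresholds = [fit_stump(sample) for sample in samples.values()]
    return aggregate(thresholds)


# Toy data: the label is 1 exactly when x is at least 50.
records = [(x, int(x >= 50)) for x in range(0, 100, 5)]
predict = bag(records)
print(predict(0), predict(95))
```

Because each sample is fitted independently in Reducer 1, the expensive step parallelizes naturally; only the small list of fitted models has to meet at the single key in Reducer 2.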

Page 11: Big data-science-oanyc

WHERE ARE WE?

We’ve created a structured model to flag questions that won’t be closed using Big Data.

But we haven’t used unstructured data.

Page 12: Big data-science-oanyc

TEXT ANALYSIS

• Is “the big dog” really different from “dog is big”?

• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?

• Language has lexical and syntactical features

• Different techniques leverage these in different ways

Bag of Words: Structure doesn’t matter

n-gram: Structure matters (but not that much)

Feature Extraction: BACON! BACON! BACON!
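The difference between bag of words and n-grams can be checked directly on the example sentences from this slide. The following is a minimal sketch with hypothetical helper names and a deliberately naive tokenizer (lowercase, split on whitespace). It shows that the two egg and tofu sentences have identical bags of words but different bigrams.

```python
from collections import Counter


def tokens(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is preserved."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = ngrams(a) == ngrams(b)
shared_bigrams = set(ngrams(a)) & set(ngrams(b))

print(same_bag)        # True: the same words appear in both sentences
print(same_bigrams)    # False: the word pairs differ
print(shared_bigrams)  # {('eggs', 'but')}
```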

Page 13: Big data-science-oanyc

STEP 4: UNSTRUCTURED MODEL

Information Gain

Similar to Hadoop’s Word Count

Create counts for token/category pairs

Use counts to calculate Information Gain

MR Job 1:

Calculate information gain (IG) for all tokens.

MR Job 2:

Select tokens with largest IG.

Create structured data for record, tokens:

question #4 | 0 | 1 | 0 | 1 | 1

MR Job 3:

Build a classifier over the newly structured data (prior slides)
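The three jobs can be prototyped in a single process. The following is a minimal sketch, assuming the standard definition of information gain for a binary "token present" feature: IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t), with entropy measured in bits. The function names and the four toy documents are hypothetical; a real job would compute the token/category counts with a word-count style MapReduce pass.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs):
    """docs: list of (tokens, category). Returns {token: IG in bits}."""
    n = len(docs)
    category_counts = Counter(cat for _, cat in docs)
    # Word-count style pass: one count per (token, category) pair per document.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in set(toks))
    base = entropy(category_counts.values())
    gains = {}
    for token in {tok for tok, _ in pair_counts}:
        with_tok = [pair_counts[(token, c)] for c in category_counts]
        without = [category_counts[c] - pair_counts[(token, c)] for c in category_counts]
        n_with = sum(with_tok)
        gains[token] = (base
                        - (n_with / n) * entropy(with_tok)
                        - ((n - n_with) / n) * entropy(without))
    return gains


def top_tokens(gains, k):
    """Select the k tokens with largest IG (ties broken alphabetically)."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def vectorize(doc_tokens, selected):
    """Structured record: one 0/1 column per selected token."""
    present = set(doc_tokens)
    return [int(t in present) for t in selected]


# Hypothetical toy corpus of tokenized questions with their status.
docs = [
    (["python", "error", "help"], "open"),
    (["python", "list"], "open"),
    (["homework", "please", "help"], "closed"),
    (["please", "urgent"], "closed"),
]
gains = information_gain(docs)
selected = top_tokens(gains, 2)
print(selected, vectorize(docs[0][0], selected))
```

In this toy corpus, "help" appears once in each category and therefore carries no information about the outcome, while "python" and "please" each separate the categories perfectly.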

Page 14: Big data-science-oanyc

WHERE ARE WE?

We’ve created two models: one structured, one unstructured.

But they don’t work together.

Page 15: Big data-science-oanyc

STEP 5: ENSEMBLE MODEL

Join many models together by using their output as input to an ensemble model.

Best when models perform differently.

Exploit differences with nonlinearities, like interaction effects.

Ensembling

Mapper 1:

Load multiple models

Score the models per record and output

Reducer 1:

Key: Id of record

Value: List of model outputs

Join model outputs to make new records

MR Job 2:

Build a model over the output data as if it were raw data.
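The ensembling flow can be sketched in a single process as well. The following is a minimal illustration, not the presenter's implementation: the two base models, the feature names (reputation, homework), and the four labeled records are hypothetical, and the second-stage model is a simple lookup table of majority labels per combination of base outputs. A lookup table is one way to let the ensemble capture interactions between the base models, because each combination of outputs gets its own prediction.

```python
from collections import Counter, defaultdict


def score_mapper(records, models):
    """Mapper 1: score every record with every model, keyed by record id."""
    for record_id, features, _label in records:
        for name, model in models.items():
            yield record_id, (name, model(features))


def join_reducer(pairs, model_names):
    """Reducer 1: join the model outputs for each record id into a new record."""
    by_id = defaultdict(dict)
    for record_id, (name, score) in pairs:
        by_id[record_id][name] = score
    return {rid: tuple(scores[n] for n in model_names)
            for rid, scores in by_id.items()}


def fit_table_model(stacked, labels):
    """MR Job 2: fit a model over the joined outputs as if they were raw data."""
    votes = defaultdict(Counter)
    for rid, outputs in stacked.items():
        votes[outputs][labels[rid]] += 1
    table = {outputs: counts.most_common(1)[0][0]
             for outputs, counts in votes.items()}
    return lambda outputs: table.get(outputs)


# Hypothetical base models: one over structured data, one over text.
models = {
    "structured": lambda f: int(f["reputation"] >= 100),
    "text": lambda f: int(not f["homework"]),
}
model_names = list(models)

# Hypothetical records: (id, features, label), where label 1 means "stays open".
records = [
    (1, {"reputation": 500, "homework": 0}, 1),
    (2, {"reputation": 500, "homework": 1}, 1),
    (3, {"reputation": 5, "homework": 0}, 1),
    (4, {"reputation": 5, "homework": 1}, 0),
]
labels = {rid: label for rid, _features, label in records}

stacked = join_reducer(score_mapper(records, models), model_names)
ensemble = fit_table_model(stacked, labels)
print(stacked)
print(ensemble((0, 0)), ensemble((1, 0)))
```

In practice the second-stage model would be fitted on records that the base models did not see during training, so that it learns how reliable each base model is rather than how well it memorized its training set.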

Page 16: Big data-science-oanyc

WHERE ARE WE?

We’ve created two models: one structured, one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.

Page 17: Big data-science-oanyc

HOW DID WE GET HERE?

This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.

Page 18: Big data-science-oanyc

Questions?

www.thinkbiganalytics.com

@danmallinger