data science workshop

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Data Science with Hadoop

Fall, 2014

Ajay Singh, Director, Technical Alliance


Agenda

•  Data Science

•  Machine Learning – quick overview

•  Data Science with Hadoop

•  Demo


Data Science


What is Data Science?

Data: Facts and statistics collected together for reference or analysis.

Science: The intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.

Data Science: The scientific exploration of data to extract meaning or insight, and the construction of software systems to utilize such insight in a business context.

Someone who does this … Data Scientist


Where Can We Use Data Science?

Healthcare • Predict diagnosis • Prioritize screenings • Reduce re-admittance rates

Financial services • Fraud Detection/prevention • Predict underwriting risk • New account risk screens

Public Sector • Analyze public sentiment • Optimize resource allocation • Law enforcement & security

Retail • Product recommendation • Inventory management • Price optimization

Telco/mobile • Predict customer churn • Predict equipment failure • Customer behavior analysis

Oil & Gas • Predictive maintenance • Seismic data management • Predict well production levels


Data Science is an Iterative Activity

[Diagram: an iterative cycle]

Formulate the question → Acquire Data → Clean Data → Visualize, Explore → Hypothesize; Model → Measure/Evaluate → Deploy


Data Science combines proficiencies…

Data Exploration

Feature Engineering

Raw Transforms

The data science process is comprised of three main tasks, requiring different skill types, including technical, analytical and programming.

Signal Processing

OCR

Geo-spatial

Normalize

Transform/aggregate

Sample

Dimensionality reduction

Feature Selection

NLP

Mutual Information

Data Modeling

Frequent Itemset

Anomaly Detection

Clustering

Collaborative Filter

Regression

Classification

Supervised Learning

Unsupervised Learning

Reporting Visualization Data Quality

technical analytical

A data scientist needs to be proficient in all these tasks.

Pre-processing


Data Science with Big Data…

Very large raw datasets are now available:

-  Log files

-  Sensor data

-  Sentiment information

With more raw data, we can build better models with improved predictive performance.

To handle the larger datasets we need a scalable processing platform like Hadoop and YARN


Data scientists master many skills Applied Science

•  Statistics, applied math •  Machine Learning •  Tools: Python, R, SAS, SPSS

Big data engineering •  Big data pipeline engineering

•  Statistics and machine learning over large datasets

•  Tools: Hadoop, PIG, HIVE, Cascading, SOLR, etc

Business Analysis •  Data Analysis, BI

•  Business/domain expertise

•  Tools: SQL, Excel, EDW

Data engineering •  Database technologies

•  Computer science

•  Tools: Java, Scala, Python, C++


Which makes them hard to find… Applied Science

•  Statistics, applied math •  Machine Learning •  Tools: Python, R, SAS, SPSS

Business Analysis •  Data Analysis, BI

•  Business/domain expertise

•  Tools: SQL, Excel, EDW

Data engineering •  Database technologies

•  Computer science

•  Tools: Java, Scala, Python, C++

Big data engineering •  Big data pipeline engineering

•  Statistics and machine learning over large datasets

•  Tools: Hadoop, PIG, HIVE, Cascading, SOLR, etc


The Data Science Team

Business Analyst

Data Engineer

Applied Scientist


Machine Learning Overview


What is Machine Learning?

WALL-E was a machine that learned how to feel emotions after 700 years of experiences on Earth collecting human artifacts.

Machine learning is the science of getting computers to learn from data and act without being explicitly programmed.

•  Machine learning is about the construction and study of systems that can learn from data.

•  The core of machine learning deals with representation and generalization so that the system will perform well on unseen data instances and predict unknown events.

•  There is a wide variety of machine learning tasks and successful applications.


Supervised vs. Unsupervised learning

Data Modeling

Frequent Itemset

Anomaly Detection

Clustering


Regression

Classification

Supervised Learning


Supervised learning: Applications in which the training data is a set of “labeled” examples: input vectors along with their corresponding target variables (labels).

Unsupervised learning: Applications in which the training data comprises examples of input vectors WITHOUT any corresponding target variables. The goal is to unearth “naturally occurring patterns” in the data, such as in clustering.

Collaborative filtering (recommendation engines): uses techniques from both the supervised and unsupervised worlds.


Supervised Learning: learn from examples

Labeled dataset (feature matrix and target):

Patient Age   Tumor Size   Clump Thickness   …   Malignant?
55            5            3                 …   TRUE
70            4            7                 …   TRUE
85            4            6                 …   FALSE
35            2            1                 …   FALSE
…             …            …                 …   …

Cancer model (target function): F(k1, k2, k3, k4)

Test data (feature vectors), where the label is to be predicted, f(V1, V2, V3, …) = ?:

Patient Age   Tumor Size   Clump Thickness   …   Malignant?
72            3            3                 …   ?
66            4            4                 …   ?


Classification: predicting a category

Some techniques: -  Naïve Bayes -  Decision Tree -  Logistic Regression -  SGD -  Support Vector Machines -  Neural Network -  Ensembles


Regression: predict a continuous value

Some techniques: -  Linear Regression / GLM -  Decision Trees -  Support vector regression -  SGD -  Ensembles


Example: Ad Click-Through Rates in Ad Search

Rank = bid * CTR

Predict CTR for each ad to determine placement, based on: -  Historical CTR -  Keyword match -  Etc…
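The ranking rule above can be sketched in a few lines of Python. The ad names, bids and predicted CTR values below are hypothetical and only illustrate how bid * CTR orders the candidates; in practice the CTR would come from a trained model.

```python
# Hypothetical ads: (name, bid in dollars, predicted click-through rate).
ads = [
    ("ad_a", 2.00, 0.010),
    ("ad_b", 0.50, 0.050),
    ("ad_c", 1.00, 0.015),
]

def rank_ads(candidates):
    """Sort ads by rank = bid * CTR, highest first."""
    return sorted(candidates, key=lambda ad: ad[1] * ad[2], reverse=True)

ranked = rank_ads(ads)
for name, bid, ctr in ranked:
    print(f"{name}: rank score = {bid * ctr:.4f}")
```

Note that the cheapest bid wins here because its predicted CTR is much higher, which is why an accurate CTR prediction matters for placement.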


Unsupervised Learning: detect natural patterns

Age   State   Annual Income   Marital status
25    CA      $80,000         M
45    NY      $150,000        D
55    WA      $100,500        M
18    TX      $85,000         S
…     …       …               …

No labels

Model → Naturally occurring (hidden) structure


Clustering: detect similar instance groupings

Some techniques: -  k-means -  Spectral clustering -  DBSCAN -  Hierarchical clustering
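As an illustration of the first technique in the list, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on 2-D points. The sample points and starting centroids are made up for the example; a real project would more likely use a library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```

The result depends on the starting centroids, which is why library implementations usually run several random initializations.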


Example: market segmentation


Outlier Detection: identify abnormal patterns

Example: identify engine anomalies Features: -  Heat generated -  Vibration of engine


Outlier Detection Target Function: outlier factor

Outlier factor (0…1)

ID    Total$   Age   City   OF
101   $200     25    SF     0.1
102   $350     35    LA     0.05
103   $25      15    LA     0.2
…     …        …     …      0.1

0.9 0.2 0.15 0.1

Some techniques: -  Statistical techniques -  Local outlier factor -  One-class SVM
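A simple example of the "statistical techniques" family is a z-score test, sketched below. This is not the outlier factor from the table above; it only flags values that lie far from the mean. The threshold of 2.0 and the purchase totals are assumptions chosen for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical purchase totals in dollars; one is far larger than the rest.
totals = [200, 350, 25, 180, 220, 5000]
print(zscore_outliers(totals))
```

A z-score test assumes a single roughly bell-shaped population; density-based methods such as local outlier factor are better suited to data with several clusters.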


Example: Credit Card Fraud Detection


Affinity Analysis: identifying frequent item sets

[Figure: transaction-by-item matrix. Rows are transactions (Tx 1, Tx 2, Tx 3, Tx 4, Tx 5, …), columns are items (Item 1, Item 2, Item 3, Item 4, Item 5, …), and each cell is Y or N depending on whether the item appears in the transaction.]

Goal: identify frequent item sets

Techniques: FP-Growth, Apriori
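To make the goal concrete, the sketch below counts the support of single items and item pairs by brute force. It is neither FP-Growth nor Apriori, which reach the same answer while avoiding the enumeration of every combination. The basket contents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting for itemsets of up to `max_size` items.

    Support = fraction of transactions that contain the itemset.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    total = len(transactions)
    return {combo: count / total
            for combo, count in counts.items()
            if count / total >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]
frequent = frequent_itemsets(baskets, min_support=0.6)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```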


Example: Affinity Analysis

Use affinity analysis for -  store layout design -  Coupons


Product recommendation: predicting “preference”

Collaborative Filtering Identify users with similar “taste”


Collaborative filtering -> matrix completion

[Figure: a sparse ratings matrix and its completed version. Users: 101, 102, 103, 104, 105, … Movies: Harry Potter, X-Men, Hobbit, Argo, Pirates.]

Observed ratings (? = unknown):

5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3

After matrix completion:

5 2 4 1 3
4 1 5 2 3
1 2 4 1 3


Example: Netflix


Hadoop and Data Science


Hadoop Improves Data Scientist Productivity

• Data Lake: all the data in one place
  – Ability to store ALL the data in raw format
  – Data silo convergence

• Data/compute capabilities available as shared asset
  – Data scientists can quickly prototype a new idea without an up-front request for funding
  – YARN enables multiple processing applications


“Schema on read” Accelerates Data Innovation

[Diagram: two timelines compared]

Without Hadoop: “I need new data” (Start) → “Schema change” project → “Finally, we start collecting” (6 months) → “Let me see… is it any good?” (9 months)

With Hadoop: “I need new data” → “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!” (3 months)


Hadoop is ideal for pre-processing

Feature Engineering

Raw Transforms

Signal Processing

OCR

Geo-spatial

Normalize

Transform/aggregate

Sample


Feature Selection

NLP

Mutual Information

Data Modeling

Frequent Itemset

Anomaly Detection

Clustering


Regression

Classification

Supervised Learning


Pre-processing

Build a better feature matrix -  More/new features -  More instances -  Faster and at more scale


Training a Supervised Learning model with Hadoop

•  Typically the “training set” is not that large
   –  In this case, it’s very common to train on a high-memory node
   –  Using existing tools: R, Python Scikit-learn or SAS

•  For really large training sets that don’t fit in memory
   –  SAS
   –  Spark ML-Lib is a promising (albeit new) solution
   –  Mahout is workable in some cases (but future is unclear)

•  Hadoop is also useful in parameter tuning:
   –  Grid-search: optimizing the model’s parameters


Scoring a Supervised Learning Model with Hadoop

•  Scoring of a single instance is usually fast

•  Some use-cases require frequent batch re-scoring of a large population (e.g., 20M customers):
   -  Use PMML scoring engine (e.g., Zementis, Pattern)
   -  Custom implementation with Python, R, Java, etc.


Unsupervised learning with Hadoop

•  Clustering:
   –  Many clustering algorithms are parallelizable
   –  Distributed K-means is popular and available in Spark ML-Lib & Mahout

•  Collaborative Filtering:
   –  Alternating Least Squares (ALS): very parallelizable
   –  ALS implemented in Mahout, Spark ML-Lib, others
   –  Item-based or user-based collaborative filtering available in Mahout


Deployment Considerations: Hadoop and Spark


•  User runs Spark (or ML-Lib) job directly from Edge Node
   –  Scala API or Java API
   –  Python API also good

•  Spark runs directly as a YARN job

•  No need to install anything else

[Diagram: Spark ML-Lib on the edge node; Spark tasks running on the cluster nodes under YARN]


Deployment Considerations: Hadoop and R


•  R and relevant packages installed on each node

•  User runs R on high-memory node
   –  RStudio or RStudio Server
   –  RCloud

•  Interfaces to Hadoop
   –  RMR: run map-reduce with R
   –  RHDFS: access HDFS files from R
   –  RHIVE: run Hive queries from R
   –  RHBase: HBase from R
   –  RODBC

[Diagram: RStudio, RCloud, RHadoop and RHive on the R high-memory node; R running on the cluster nodes under YARN]


Deployment Considerations: Hadoop and Python


•  Python and relevant packages installed on each node and high-memory nodes

•  User runs Python on high-memory node
   –  IPython notebook is a great UI

•  Interfaces to Hadoop
   –  PyDoop: access HDFS from Python
   –  Map-reduce jobs with Hadoop streaming
   –  Python UDFs with PIG
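The Hadoop streaming option works with any program that reads lines from standard input and writes tab-separated key/value lines to standard output. The word-count sketch below follows that contract but is written as plain functions and run locally, so the function names and sample input are illustrative only.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per key. Hadoop streaming hands the reducer
    the mapper output already sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> shuffle/sort -> reduce.
sample = ["Hadoop and Python", "Python and Pig"]
result = list(reducer(sorted(mapper(sample))))
for line in result:
    print(line)
```

In a real streaming job the mapper and reducer would be separate scripts that loop over sys.stdin and print their output, with the sort step performed by Hadoop between the two.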

[Diagram: IPython, Pandas, Scikit-learn, Numpy, Scipy, Matplotlib and PyDoop on the Python high-memory node; Python, Scikit-learn and Pandas on the cluster nodes under YARN]


Supervised Learning with Hadoop More details + demo


Supervised Learning Workflow

[Diagram: two pipelines]

Training: Raw Data (Train) + Labels → Feature Extraction → Feature Matrix → Train the Model → Model → Eval Model

Predicting: New Data → Feature Extraction → Feature Vector → Model → Predict → Labels


Close up: Feature Extraction

Raw Data

ID    Total$   Age   City   Target
101   200      25    SF
102   350      35    LA
103   25       15    LA
…     …        …     …

Feature Matrix Feature Engineering

Raw Transforms

Signal Processing

OCR

Geo-spatial

Normalize

Transform/aggregate

Sample


Feature Selection

NLP

Mutual Information

Raw data (TB, PB) → Feature Extraction → Feature matrix (MB, GB)


How Big is your Feature Matrix?

Example:
•  10M rows, 100 features
•  Each feature = 8 bytes (double)
•  Total memory = 8 × 10^9 bytes, or ~7.5GB
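The arithmetic behind this estimate is shown below. The function name is hypothetical, and the figure counts only the raw values, not any overhead added by the data structure that holds them.

```python
def feature_matrix_bytes(rows, features, bytes_per_value=8):
    """Memory needed for a dense matrix of fixed-width values."""
    return rows * features * bytes_per_value

size_bytes = feature_matrix_bytes(10_000_000, 100)
size_gib = size_bytes / 2**30  # 1 GiB = 2^30 bytes
print(f"{size_bytes:,} bytes = {size_gib:.2f} GiB")
```

A matrix of this size fits comfortably on a single high-memory node, which is why training is often done outside the cluster once feature extraction has shrunk the data.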


Close-Up: Training the Model

[Diagram: Training Set → Train the Model → Model → Eval Model (using the Validation Set) → Model Metric]

-  Feature matrix is randomly split into a “training” set (70%) and a “validation” set (30%)
-  Model is built using the training set, and the error measure is computed over the validation set
-  Iterative process or grid-search to determine the best algorithm and choice of parameters so that:
   -  We get optimal model accuracy
   -  We prevent over-fitting
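The split-and-search procedure can be sketched without any libraries. The "model" here is a toy single-threshold classifier and the data is synthetic, both assumptions made to keep the example short; with Scikit-learn the same steps would use its own splitting and grid-search utilities.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(rows, threshold):
    """Toy model: predict True when the single feature exceeds the threshold."""
    correct = sum((x > threshold) == label for x, label in rows)
    return correct / len(rows)

# Synthetic data: one numeric feature; the label is True when it is above 50.
data = [(x, x > 50) for x in range(100)]
train, validation = train_validation_split(data)

# Grid search: pick the parameter on the training set only,
# then report the metric on the held-out validation set.
grid = [10, 50, 90]
best = max(grid, key=lambda t: accuracy(train, t))
print("best threshold:", best)
print("validation accuracy:", accuracy(validation, best))
```

Keeping the validation rows out of the parameter search is what makes the reported accuracy an honest estimate and helps reveal over-fitting.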


Evaluating Performance of a Classifier

•  Determine “confusion matrix”
•  Compute metrics: precision, recall, accuracy and specificity

Confusion Matrix

                 Actual Yes        Actual No
Predicted Yes    True positives    False positives
Predicted No     False negatives   True negatives

From the confusion matrix, we can compute these metrics:
-  Precision = % of positive predictions that are correct
-  Recall = % of positive instances that were predicted as positive
-  F1 score = a measure of a test’s accuracy, combining precision and recall
-  Accuracy = % of correct classifications
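These metrics follow directly from the four confusion-matrix counts, as the sketch below shows. The counts are hypothetical, the function does not guard against zero denominators, and specificity is included using its standard definition (the share of actual negatives that were predicted negative).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classifier metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)    # correct share of positive predictions
    recall = tp / (tp + fn)       # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts for 100 scored instances.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```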


Demo overview

• Datasets:
  – Airline delay data (we’re using only the 2007 and 2008 years): http://stat-computing.org/dataexpo/2009/the-data.html
  – Weather data from http://ncdc.noaa.gov/

• Goal:
  – Predict delay (delayTime >= 15 mins) in flights
  – For simplicity, limited to flights originating from ORD

• Tools:
  – Pre-process: PIG or Spark on Hadoop
  – Modeling: Scikit-learn or Spark/ML-Lib or R
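The labeling rule in the goal can be sketched as a small filter over CSV rows. The column names (Origin, Dest, DepDelay), the use of departure delay as the delay time, and the "NA" marker for missing values are assumptions about the airline data, and the four sample rows are invented; the demo itself performs this step with PIG or Spark.

```python
import csv
import io

# Invented sample rows; the column names are assumed, not taken from the slides.
sample_csv = """Year,Month,DayofMonth,Origin,Dest,DepDelay
2007,1,1,ORD,LAX,20
2007,1,1,SFO,ORD,45
2007,1,2,ORD,JFK,3
2007,1,2,ORD,BOS,NA
"""

def ord_delay_labels(csv_file, origin="ORD", threshold_minutes=15):
    """Keep flights leaving `origin` and label each one as delayed or not."""
    rows = []
    for record in csv.DictReader(csv_file):
        if record["Origin"] != origin or record["DepDelay"] == "NA":
            continue  # other airports and rows with no recorded delay are skipped
        delay = int(record["DepDelay"])
        rows.append((record["Dest"], delay, delay >= threshold_minutes))
    return rows

labeled = ord_delay_labels(io.StringIO(sample_csv))
for row in labeled:
    print(row)
```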


Demo Flow

[Diagram: two pipelines]

Training: Airline, Weather (2007) raw data → Feature Extraction → ORD_2007 → Train the Model → Model

Prediction: Airline, Weather (2008) raw test data → Feature Extraction → ORD_2008 → Predict / Score (using the Model) → Labels


DEMO now!
