Machine Learning With Apache Spark
CodeMash, Sandusky, Ohio, Jan 5-8, 2016
David Taieb, STSM, IBM Cloud Data Services

Page 2: Machine Learning with Apache Spark

©2015 IBM Corporation

Introduction

David Taieb
Developer Advocate, IBM Cloud Data Services

Our mission: We are here to help developers realize their most ambitious projects.

https://developer.ibm.com/clouddataservices/connect/

Page 3: Machine Learning with Apache Spark

Big data, cloud and the rise of business analytics

‣ Data collected by enterprises is growing exponentially: ERP, embedded systems (IoT)
‣ Cloud, with high availability and huge capacity, makes more data available for analytics
‣ Big data and cloud create new opportunities:
  - Organizations: more effective decision-making process, richer client interactions
  - Business users: discover new insights, better decision-making process
  - Developers: access to diverse data sources and new tools that increase productivity

Page 4: Machine Learning with Apache Spark

Why Business Analytics with big data

“In God we trust. All others bring data”

W. Edwards Deming

‣Every day, companies make bet-the-business decisions about their customers, competitors and new products

‣Time available for decision-making is shrinking (sometimes real-time)

‣As more and more companies go digital, data becomes the world’s newest resource for competitive advantage

‣Decision making has moved from the elite few to the empowered many

‣Few organizations can keep pace with the appetite for data

Page 5: Machine Learning with Apache Spark

Business Analytics Types

Descriptive Analytics: Look at the reason for past success or failure
• Use interactive querying and visualization to explore and communicate data
• Discover insight and trends
• Correlation between 2 seemingly unrelated variables

Predictive Analytics: What is probably going to happen in the future?
• Data mining
• Generate hypothesis and models
• Predict occurrence of future events using probability (confidence)
• Product recommendations
• Classification

Prescriptive Analytics: What are my best actions?
• Help make the right decision based on the data
• Find optimal solution to a given problem

Page 6: Machine Learning with Apache Spark

Taking Analytics a step further with Cognitive Systems

‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data

Decide
• Ingest and analyze domain sources, info models
• Generate evidence-based decisions with confidence
• Learn with new outcomes and actions
• e.g. - Next generation Apps Probabilistic Apps

Ask
• Leverage vast amounts of data
• Ask questions for greater insights
• Natural language inquiries
• e.g. - Next generation Chat

Discover
• Find the rationale for given answers
• Prompt for inputs to yield improved responses
• Inspire considerations of new ideas
• e.g. - Next generation Search Discovery

IBM Watson

Page 7: Machine Learning with Apache Spark

IBM Cloud Data Services

Resources for developers to get, build, and analyze on the IBM Cloud

Page 8: Machine Learning with Apache Spark

What is Spark

Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes

Page 9: Machine Learning with Apache Spark

Spark Core Libraries

Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
Spark SQL: executes SQL statements
Spark Streaming: performs streaming analytics using micro-batches
MLlib (machine learning): common machine learning and statistical algorithms
GraphX (graph): distributed graph processing framework

Page 10: Machine Learning with Apache Spark

Key reasons for interest in Spark

Open Source
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc.)

Fast
• In-memory storage greatly reduces disk I/O
• Up to 100x faster in memory, 10x faster on disk

Distributed data processing
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: batch, streaming, interactive

Productive
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX

Web Scale
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem

Page 11: Machine Learning with Apache Spark

IBM is all-in on its commitment to Spark

Foster Community
• Educate 1M+ data scientists and engineers via online courses
• Sponsor AMPLab, creators and evangelists of Spark

Infuse the Portfolio
• Integrate Spark throughout portfolio
• 3,500 employees working on Spark-related topics
• Spark however customers want it: standalone, platform or products

Contribute to the Core
• Launch Spark Technology Cluster (STC), 300 engineers
• Open source SystemML
• Partner with Databricks

Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss

Page 12: Machine Learning with Apache Spark

Spark MLlib

‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms
‣ Highly scalable: leverages Spark's ability to work with massive amounts of data
‣ Fast: designed for parallel computing
‣ Covers common machine learning algorithms:
  - Regression
  - Classification
  - Clustering
  - Recommender Systems
  - Text Analytics

Page 13: Machine Learning with Apache Spark

What is Machine Learning and where is it used

‣ Subfield of computer science that focuses on getting computers to learn from data:
  - Recognize patterns
  - Make predictions
‣ Example uses:
  - Spam filters
  - Netflix recommendations
  - Self-driving cars
  - Watson
  - …

Page 14: Machine Learning with Apache Spark

Typical Machine Learning Flow diagram

[Diagram] Stages: Data Acquisition; Data Preparation (Cleansing, Shaping, Enrichment); Data Annotation (Ground Truth); Model Training; Model Testing
Data sets: Training Set, Test Set, Blind Set
Iterative loop: Train Model; Cross-Validation; Evaluate Performance and optimize model
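The diagram names three data sets (training, test and blind) without saying how they are produced. The following is a minimal plain-Python sketch of one way to split labeled records into those three sets. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions made for illustration; they are not taken from the talk.

```python
import random


def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test and blind sets.

    The fractions and the seed are illustrative assumptions. Whatever is
    left after the training and test slices becomes the blind set.
    """
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("train_frac + test_frac must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]
    return training, test, blind


training, test, blind = split_dataset(range(100))
print(len(training), len(test), len(blind))
```

The model is fitted on the training set, tuned against the test set, and the blind set is held back so that the final performance figures come from data the model has never seen.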

Page 15: Machine Learning with Apache Spark

MLlib Algorithm Overview

• Predictive analytics
• Recommendations
  • Collaborative Filtering
  • Matrix Factorization
• Feature extraction and Transformation
  • TF-IDF
  • HashingTF
  • Word2Vec
  • StandardScaler
  • Normalizer
• Model Evaluation/Metrics
  • Binary Classification Metrics
  • Multi Class Metrics
  • Regression Metrics

Page 16: Machine Learning with Apache Spark

Predictive analytics

Supervised Learning (requires Ground-Truth)
  Continuous Output:
  • Regression: Linear, Ridge, Lasso, Isotonic
  • Decision Tree
  • RandomForest
  • GradientBoostedTree
  Discrete Output:
  • Classification: Logistic Regression, SVM, NaiveBayes
  • Decision Tree
  • RandomForest
  • GradientBoostedTree
  • K-NN (available as add-on Spark package)

Unsupervised Learning (no Ground-Truth data required)
  Continuous Output:
  • Clustering: KMeans, Gaussian Mixture
  • Dimensionality Reduction: PCA, SVD
  Discrete Output:
  • FP-Growth

Page 17: Machine Learning with Apache Spark

Featured demo: Flight Delay Predictor

‣ Use training data collected from flight stats and enriched with weather observations from the “Insight for Weather” service on Bluemix
‣ Train a multi-class classifier that, given the weather observations for a flight's departure and arrival airports, can predict the flight delay class:
  - 0 = Canceled
  - 1 = On Time
  - 2 = Delay less than 2 hours
  - 3 = Delay between 2 and 4 hours
  - 4 = Delay more than 4 hours
‣ Provide metrics measurement for each algorithm:
  - Accuracy
  - Precision
  - Recall
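The five delay classes can be turned into a small labeling function, which is what produces the ground truth for each training record. The sketch below is hypothetical: the name `delay_class`, the use of minutes as the unit, treating any non-positive delay as on time, and putting the exact 2-hour and 4-hour boundaries into class 3 are all assumptions, since the slide does not specify them.

```python
# Class codes as listed on the slide.
CANCELED, ON_TIME, DELAY_UNDER_2H, DELAY_2_TO_4H, DELAY_OVER_4H = range(5)


def delay_class(delay_minutes, canceled=False):
    """Map a flight outcome to one of the five delay classes.

    Assumptions (not stated in the talk): delays are given in minutes,
    a flight with no positive delay counts as on time, and delays of
    exactly 2 or 4 hours fall into the "between 2 and 4 hours" class.
    """
    if canceled:
        return CANCELED
    if delay_minutes <= 0:
        return ON_TIME
    if delay_minutes < 120:
        return DELAY_UNDER_2H
    if delay_minutes <= 240:
        return DELAY_2_TO_4H
    return DELAY_OVER_4H


print(delay_class(45))                  # 2
print(delay_class(0, canceled=True))    # 0
```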

Page 18: Machine Learning with Apache Spark

Architecture

[Diagram] Sources: Weather; Flightstats (Airports, Flight Schedules, Flight Status)
Simple Data Pipes: custom connector, run every 24 hours
Cloudant: Metadata, Training Set, Test Set, Blind Set
Notebook

Notebook

Page 19: Machine Learning with Apache Spark

Get

‣ Identify data sources:
  - flightstats.com: https://developer.flightstats.com
    - Airport metadata: FS Code, geolocation, …
    - Flight Schedules
    - Flight Status
  - Weather Observations
    - Insight for Weather on Bluemix
‣ Storage:
  - Cloudant
‣ Tool used:
  - Simple Data Pipes custom connector to build Training, Test and Blind data sets
‣ Constraints:
  - Weather service provides past observations only as far as 24 hours back
  - Flightstats API key is a 30-day trial version, limited to 20,000 calls

Page 20: Machine Learning with Apache Spark

Custom Pipes Connector to build training data set

https://developer.ibm.com/clouddataservices/simple-data-pipe/

Page 21: Machine Learning with Apache Spark

Run every 24 hours

Because the Weather service doesn’t return observations older than 24 hours, the connector that builds the data set must be run every 24 hours

Page 22: Machine Learning with Apache Spark

Build: Explore the data with Notebook

Page 23: Machine Learning with Apache Spark

Loading training data set

Page 24: Machine Learning with Apache Spark

Build: Visualize and explore data set

Scatter plot of flight delays based on temperature at departure and arrival airports

Page 25: Machine Learning with Apache Spark

Build: Visualize and explore data set

Scatter plot of flight delays based on wind speed at departure and arrival airports

Page 26: Machine Learning with Apache Spark

Constraints

‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data:
  - Limit the number of features used to train the models to the intersection of the two
‣ Restrict the training data to weather forecast at departure and arrival airports
  - Would adding weather data from various points along the route increase the model performance?
‣ Difficult to get enough representative data because I was using a trial account on flightstats
  - Ideally, I would use more airports with better representative weather
‣ Didn’t use any categorical features
‣ For simplicity: use IPython Notebook as the user interface
  - Makes the experience less compelling for business users
  - To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library
  - The Python API doesn’t cover as much of the Spark API as Scala
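The first constraint, restricting features to those available in both past observations and forecasts, amounts to a set intersection. The sketch below illustrates the idea only; the field names are invented for the example and are not the actual fields returned by the weather service.

```python
# Hypothetical field names, for illustration only; the real service
# exposes its own schema.
observation_fields = {"temp", "wspd", "pressure", "dewpt", "vis", "uv_index"}
forecast_fields = {"temp", "wspd", "vis", "pop", "uv_index"}

# Only features present in both sources are usable: the model is trained
# on past observations but later queried with forecast data.
usable_features = sorted(observation_fields & forecast_fields)


def project(record, features):
    """Keep only the selected features of a record, in a fixed order."""
    return [record[name] for name in features]


print(usable_features)  # ['temp', 'uv_index', 'vis', 'wspd']
```

Sorting the intersection gives every record the same feature order, which matters because the classifiers see features by position rather than by name.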

Page 27: Machine Learning with Apache Spark

Load labeled data RDD

Page 28: Machine Learning with Apache Spark

Load labeled data RDD
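The notebook code shown on these slides is not part of the transcript. As a rough, Spark-free sketch of what "labeled data" means here, each record pairs a delay class (the label) with a numeric feature vector; in MLlib this role is played by the `LabeledPoint` type. The record layout, the `FEATURES` list and the `to_labeled` helper below are hypothetical.

```python
from typing import List, NamedTuple


class LabeledRecord(NamedTuple):
    """A class label together with its numeric feature vector."""
    label: int
    features: List[float]


# Hypothetical feature names; the demo's real fields are not in the transcript.
FEATURES = ["departure_temp", "departure_wspd", "arrival_temp", "arrival_wspd"]


def to_labeled(record):
    """Convert a raw record (a dict) into a LabeledRecord."""
    return LabeledRecord(
        label=int(record["delay_class"]),
        features=[float(record[name]) for name in FEATURES],
    )


raw = {"delay_class": 2, "departure_temp": 3, "departure_wspd": 25,
       "arrival_temp": 12, "arrival_wspd": 8}
print(to_labeled(raw))
```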

Page 29: Machine Learning with Apache Spark

Build: NaiveBayes Classification

Page 30: Machine Learning with Apache Spark

Build: Decision Tree classification

Page 31: Machine Learning with Apache Spark

Build: Random Forest classification

Page 32: Machine Learning with Apache Spark

Build: Performance measurements

Load blind data

Page 33: Machine Learning with Apache Spark

Build: Compare metrics between different models
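The comparison chart from this slide is not part of the transcript. As an illustration of the accuracy, precision and recall figures being compared, the sketch below computes them in plain Python from (predicted, actual) pairs. MLlib ships its own evaluation classes (for example `MulticlassMetrics`); the function here is a stand-in written for this example, not the code used in the demo.

```python
from collections import Counter


def classification_metrics(pairs):
    """Compute accuracy plus per-class precision and recall.

    `pairs` is an iterable of (predicted_class, actual_class) tuples.
    """
    pairs = list(pairs)
    correct = Counter()    # correct predictions per class
    predicted = Counter()  # how often each class was predicted
    actual = Counter()     # how often each class really occurred
    for pred, label in pairs:
        predicted[pred] += 1
        actual[label] += 1
        if pred == label:
            correct[pred] += 1
    accuracy = sum(correct.values()) / len(pairs) if pairs else 0.0
    classes = sorted(set(predicted) | set(actual))
    # A class that was never predicted (or never occurred) gets 0.0.
    precision = {c: correct[c] / predicted[c] if predicted[c] else 0.0
                 for c in classes}
    recall = {c: correct[c] / actual[c] if actual[c] else 0.0
              for c in classes}
    return accuracy, precision, recall


pairs = [(1, 1), (1, 1), (2, 1), (2, 2), (3, 2), (0, 0)]
accuracy, precision, recall = classification_metrics(pairs)
print(accuracy, precision, recall)
```

Running the same function over the predictions of each model on the blind set gives figures that can be compared side by side.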

Page 34: Machine Learning with Apache Spark

Naïve Bayes vs Decision Tree

Naïve Bayes
‣ Probabilistic: computes the probability of a data instance being in a specific class
‣ Assumes that each feature (variable) is independent of the others
‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
‣ Works well with a small amount of training data. Doesn’t need all the possibilities
‣ Doesn’t work with categorical features.

Decision Tree
‣ Non-probabilistic: partitions the data into subsets that best describe the variable
‣ The deeper the tree, the better the model fits the data
‣ Watch out for overfitting: need to prune the tree
‣ Can handle categorical or continuous features
‣ No need for input to be scaled or standardized: set your features and go!
‣ Requires a lot of data covering all possibilities
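To make the Naïve Bayes independence assumption concrete, here is a small plain-Python sketch of a Gaussian Naïve Bayes classifier for continuous features such as temperature and wind speed. It is an illustration only: it is not MLlib's `NaiveBayes` (which uses a different, count-based model), and the toy data and function names are assumptions.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples):
    """Fit a class prior and a per-feature mean/variance for each class.

    `samples` is a list of (label, [feature values]) pairs.
    """
    by_class = defaultdict(list)
    for label, features in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(samples)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # A small floor keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict(model, features):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        # Independence assumption: per-feature log likelihoods just add up.
        score = math.log(prior)
        for value, (mean, var) in zip(features, stats):
            score += (-0.5 * math.log(2 * math.pi * var)
                      - (value - mean) ** 2 / (2 * var))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: (delay class, [temperature, wind speed]); values are made up.
samples = [(1, [20, 5]), (1, [22, 7]), (1, [18, 6]),
           (2, [-5, 30]), (2, [-2, 35]), (2, [0, 28])]
model = train_gaussian_nb(samples)
print(predict(model, [19, 6]), predict(model, [-3, 31]))
```

Because every feature contributes its own term to the score, a feature with no predictive value still shifts the result, which is why non-predictive features hurt accuracy.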

Page 35: Machine Learning with Apache Spark

Analyze: Run model

Page 36: Machine Learning with Apache Spark

Code: Run Model

Page 37: Machine Learning with Apache Spark

If you want to know more

‣https://developer.ibm.com/clouddataservices/

‣https://github.com/ibm-cds-labs/pipes-connector-flightstats

‣http://spark.apache.org/docs/latest/mllib-guide.html

‣https://console.ng.bluemix.net/data/analytics/

Page 38: Machine Learning with Apache Spark

Thank you