machine learning by example - apache spark

Post on 20-Mar-2017


Service Symphony Ltd

Apache Spark Machine Learning by Example

Meeraj Kunnumpurath

25th of February 2017

Introduction

❖ Worked as a technologist and software architect for a couple of decades at a number of leading financial institutions in the UK

❖ Authored a number of books on Enterprise Java, Web Services and SOA

❖ Spoken at a number of technology conferences

❖ Founded Service Symphony Ltd in 2009, serving leading financial services customers building mission-critical middleware

❖ Engineer with a keen interest in ML, AI and Data Science

❖ Blog: http://www.servicesymphony.com/blog

❖ Email: meeraj@servicesymphony.com

❖ Presentation: https://www.slideshare.net/MeerajKunnumpurath/machine-learning-by-example-apache-spark

❖ GitHub: https://github.com/kunnum/sandbox/tree/master/notebooks

Agenda

❖ Introduction to using ML with Apache Spark

❖ Hands-on example driven approach

❖ Not a deep dive into Apache Spark Architecture

❖ Nor a deep dive into ML algorithms

❖ Examples built using Apache Zeppelin

❖ Some of the examples are from the Apache Spark documentation

Apache Spark - Overview

❖ Open source large scale distributed data processing fabric

❖ Offers multiple components addressing different facets of data science for big and fast data processing, ML, analytics and data ingestion

❖ Ability to process large amounts of data in memory spanning multiple process spaces

❖ Initially started as a research project at UC Berkeley

❖ Originally released under a BSD licence; a top level ASF project since 2014, licensed under ASL 2.0

❖ One of the most active open source projects, arguably the most active ASF project

❖ Adopted, extended and commercialised by multiple vendors playing in the data science realm

Apache Spark - Architecture

Scala - Spark Natural Transition

❖ Interest in Spark stemmed from deep interest in Scala and functional programming

❖ Data processing ecosystem built around Scala, with a strong synergy with Scala’s design motivations

❖ Extends Scala’s idiomatic functional programming model to transcend process boundaries

❖ Spark RDDs - Scala collections on steroids
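The "collections on steroids" point can be illustrated with a minimal sketch that runs the same filter/map/reduce chain on a plain Scala collection and on an RDD. The local-mode session, the app name and the sample range are assumptions made for illustration; in a Zeppelin paragraph or spark-shell the existing session would normally be reused.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local-mode session for illustration. In Zeppelin or
// spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .appName("rdd-vs-collection-sketch")
  .master("local[*]")
  .getOrCreate()

val numbers = 1 to 10

// Plain Scala collection: evaluated in the local JVM.
val localSum = numbers.filter(_ % 2 == 0).map(n => n * n).sum

// RDD: the same shape of code, but the data is split into partitions that
// can be processed by different executor processes.
val rdd = spark.sparkContext.parallelize(numbers)
val distributedSum = rdd.filter(_ % 2 == 0).map(n => n * n).reduce(_ + _)

println(s"local=$localSum distributed=$distributedSum")
```

The RDD transformations are lazy; nothing runs on the cluster until the `reduce` action is called.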

Spark - Scala Notebook

ML Components

❖ Data Structures

❖ Vectors and Matrices

❖ DataFrames

❖ Feature Extractors and Transformers

❖ Estimators

❖ Models

❖ Pipelines

❖ Evaluators

❖ Tuning Aids
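As a rough illustration of the data structures and transformers listed above, the sketch below builds a dense vector, a sparse vector and a small matrix, then uses a `VectorAssembler` transformer to turn numeric DataFrame columns into a feature vector column. The column names (`x1`, `x2`, `label`, `features`) and the sample values are made up for this example.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("ml-components-sketch")
  .master("local[*]")
  .getOrCreate()

// Dense and sparse representations of the same logical vector (1.0, 0.0, 3.0).
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A 2x2 matrix; values are supplied in column-major order.
val matrix = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0))

// A DataFrame with hypothetical raw numeric columns.
val raw = spark.createDataFrame(Seq(
  (1.0, 2.0, 0.0),
  (3.0, 4.0, 1.0)
)).toDF("x1", "x2", "label")

// Transformer: combines the raw columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val assembled = assembler.transform(raw)
assembled.show(false)
```

Estimators, models, pipelines, evaluators and tuning aids all operate on DataFrames shaped like `assembled`, with a vector-typed features column.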

ML Components - Notebook

Spark ML - Pipeline Architecture

❖ DataFrame

❖ Estimator

❖ Transformer

❖ Pipeline

❖ Parameter

Training time flow: Pipeline in estimator mode

Pipeline.fit() creates a PipelineModel

Test time flow: Pipeline in transformer mode

PipelineModel.transform() creates a DataFrame augmented with prediction columns
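The training-time and test-time flows can be sketched as follows. This is a small example in the spirit of the text-classification pipeline from the Spark documentation, not the notebook shown in the talk; the sample sentences, column names and parameter values are illustrative assumptions.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("pipeline-sketch")
  .master("local[*]")
  .getOrCreate()

// Made-up labelled training documents.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Two transformers followed by an estimator; parameters are set via setters.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol("words")
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Training time: the pipeline acts as an estimator and yields a PipelineModel.
val model: PipelineModel = pipeline.fit(training)

// Test time: the fitted model acts as a transformer on unlabelled data.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n")
)).toDF("id", "text")

val predictions = model.transform(test)
predictions.select("id", "text", "probability", "prediction").show(false)
```

The same `model` can be applied to any DataFrame that has a `text` column, since every stage reads and writes named columns.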

ML Pipeline Notebook

Regression

❖ Supervised learning algorithm for predicting continuous labels

❖ Multiple Algorithms

❖ Linear Regression

❖ Generalised Linear Regression

❖ Decision Tree Regression

❖ Random Forest Regression

❖ Gradient Boosted Tree Regression

❖ Survival Regression

❖ Isotonic Regression

❖ Works with input feature vectors and labelled points
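A minimal linear regression sketch, assuming a toy data set generated from y = 2x + 1 (not the data used in the talk's notebook). Labels are doubles and features are vectors, as the last bullet describes.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("linear-regression-sketch")
  .master("local[*]")
  .getOrCreate()

// Toy data following y = 2x + 1: (label, features).
val data = spark.createDataFrame(Seq(
  (3.0, Vectors.dense(1.0)),
  (5.0, Vectors.dense(2.0)),
  (7.0, Vectors.dense(3.0)),
  (9.0, Vectors.dense(4.0))
)).toDF("label", "features")

// Estimator with default parameters (no regularisation).
val lr = new LinearRegression()

// fit() returns a LinearRegressionModel, which is itself a transformer.
val model = lr.fit(data)
println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

// Adds a "prediction" column next to the original columns.
model.transform(data).show(false)
```

The other regressors in the list follow the same estimator pattern, so swapping `LinearRegression` for, say, `DecisionTreeRegressor` leaves the rest of the code unchanged.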

Linear Regression - Notebook

Classification

❖ Supervised learning for predicting discrete labels

❖ Multiple algorithms

❖ Binomial and multinomial logistic regression

❖ Decision tree classifier

❖ Random forest classifier

❖ Gradient boosted tree classifier

❖ Multi-layer neural network classifier

❖ Naive Bayes Classifier
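One of the listed classifiers can be sketched as below, using a decision tree on a tiny, clearly separable data set. The data points and the choice of `maxDepth` are assumptions for illustration; accuracy is measured on the training data only to keep the example short, which is not how a real model should be assessed.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("classification-sketch")
  .master("local[*]")
  .getOrCreate()

// Made-up two-class data: (label, features).
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (0.0, Vectors.dense(0.5, 1.5)),
  (1.0, Vectors.dense(5.0, 6.0)),
  (1.0, Vectors.dense(5.5, 6.5))
)).toDF("label", "features")

val dt = new DecisionTreeClassifier().setMaxDepth(2)
val model = dt.fit(data)

// transform() adds rawPrediction, probability and prediction columns.
val predictions = model.transform(data)

val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"training accuracy = $accuracy")
```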

Classification - Notebook

Clustering

❖ Unsupervised learning algorithm based on similarity vectors

❖ Multiple algorithms

❖ K-Means Clustering

❖ LDA - Latent Dirichlet Allocation

❖ Bisecting K-Means

❖ Gaussian Mixture Model
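A short K-Means sketch, assuming six made-up two-dimensional points that form two obvious groups. No labels are supplied; the estimator only needs a features column.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("kmeans-sketch")
  .master("local[*]")
  .getOrCreate()

// Unlabelled points: one group near (0, 0), another near (9, 9).
val points = spark.createDataFrame(Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(0.1, 0.1),
  Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.0),
  Vectors.dense(9.1, 9.1),
  Vectors.dense(9.2, 9.0)
).map(Tuple1.apply)).toDF("features")

// k is the number of clusters; the seed makes the initialisation repeatable.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(points)

// transform() assigns each point a cluster index in the "prediction" column.
val clustered = model.transform(points)
clustered.show(false)
model.clusterCenters.foreach(println)
```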

Clustering - Notebook

Collaborative Filtering

❖ Commonly used for recommender systems

❖ Uses ALS (Alternating Least Squares) to learn latent factors in user-to-item associations

❖ By default, matrix factorization assumes explicit feedback

❖ You can explicitly enable implicit preferences
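The ALS points above can be sketched as follows. The ratings, column names and hyperparameter values (rank, iterations, regularisation) are illustrative assumptions rather than values from the talk.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("als-sketch")
  .master("local[*]")
  .getOrCreate()

// Made-up explicit ratings: (userId, itemId, rating).
val ratings = spark.createDataFrame(Seq(
  (0, 0, 5.0f), (0, 1, 1.0f),
  (1, 0, 4.0f), (1, 2, 1.0f),
  (2, 1, 5.0f), (2, 2, 4.0f)
)).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(2)               // number of latent factors
  .setMaxIter(5)
  .setRegParam(0.1)
  .setImplicitPrefs(false)  // explicit feedback is the default; true enables implicit preferences
  .setSeed(42L)

val model = als.fit(ratings)

// Predicted rating for each (user, item) pair in the input.
val predictions = model.transform(ratings)
predictions.show(false)
```

With implicit preferences enabled, the rating column would be treated as a strength of observed interaction (for example, click counts) rather than as a score.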

Collaborative Filtering - Notebook

Model Tuning

❖ API to tune an individual estimator or the entire pipeline using a normalised parameter model

❖ API to support k-fold cross validation

❖ API to evaluate performance on regression, as well as binomial and multinomial classification

❖ API for performing train-validation split
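A minimal sketch of k-fold cross validation over a parameter grid, assuming a synthetic y = 2x + 1 data set and an arbitrary grid of regularisation values. The estimator here is a single `LinearRegression`, but a whole `Pipeline` can be passed to `setEstimator` in the same way.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: local-mode session; reuse the notebook's session if present.
val spark = SparkSession.builder()
  .appName("model-tuning-sketch")
  .master("local[*]")
  .getOrCreate()

// Synthetic data following y = 2x + 1: (label, features).
val data = spark.createDataFrame(
  (1 to 40).map(i => (2.0 * i + 1.0, Vectors.dense(i.toDouble)))
).toDF("label", "features")

val lr = new LinearRegression()

// 3 x 2 = 6 parameter combinations to try.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setSeed(7L)

// Fits every combination on every fold, then refits the best on all the data.
val cvModel = cv.fit(data)

// One averaged metric per parameter combination, in grid order.
grid.zip(cvModel.avgMetrics).foreach { case (params, rmse) =>
  println(s"rmse=$rmse for $params")
}
```

`TrainValidationSplit` has the same shape but replaces `setNumFolds` with `setTrainRatio`, evaluating each combination once instead of k times.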

Model Tuning - Notebook

Questions
