Machine Learning and Apache Mahout: An Introduction


Upload: varad-meru

Post on 08-Sep-2014


DESCRIPTION

An Introductory presentation on Machine Learning and Apache Mahout. I presented it at the BigData Meetup - Pune Chapter's first meetup (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).

TRANSCRIPT

Page 1: Machine Learning and Apache Mahout : An Introduction

Machine Learning and Apache Mahout

Varad Meru, Software Development Engineer, Orzota, Inc. (about.me/vrdmr)

© Varad Meru, 2013

Page 2: Machine Learning and Apache Mahout : An Introduction

Who Am I

• Orzota, Inc.
  • Making BigData Easy
  • Designing a cloud-based platform for ETL and analytics
• Past Work Experience
  • Persistent Systems Ltd.
  • Recommendation engines and user behavior analytics
• Areas of Interest
  • Machine Learning
  • Distributed Systems
  • Recommendation Engines

Page 3: Machine Learning and Apache Mahout : An Introduction

Outline

• Introduction
• Machine Learning
  • Introduction and History
  • Types of Learning Algorithms
  • Applications
  • What’s New
• Apache Mahout
  • History
  • Architecture
  • Applications and Examples
• Conclusion

Page 4: Machine Learning and Apache Mahout : An Introduction

Machine Learning: Rise of the Machine-Era

Page 5: Machine Learning and Apache Mahout : An Introduction

Introduction

• Term coined by Arthur Samuel: “Field of study that gives computers the ability to learn without being explicitly programmed”.
• Branch of Artificial Intelligence and Statistics
• Focuses on prediction based on known properties
• Used as a sub-process in Data Mining. Data Mining focuses on discovering new, unknown properties.
• “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”

Page 6: Machine Learning and Apache Mahout : An Introduction

Learning Algorithms

• Supervised Learning
  • Labelled input data.
  • Creating classifiers to predict the labels of unseen inputs.
• Unsupervised Learning
  • Unlabelled input data.
  • Creating a function that describes the relations hidden in the data.
• Semi-Supervised Learning
  • Combines supervised and unsupervised learning methodology.
• Reinforcement Learning
  • Reward-punishment based agent.

Page 7: Machine Learning and Apache Mahout : An Introduction

Supervised Learning: Introduction

• Learn from the data
• Data is already labelled
  • Expert, crowd-sourced or case-based labelling of data
• Applications
  • Handwriting Recognition
  • Spam Detection
  • Information Retrieval
  • Personalisation based on ranks
  • Speech Recognition

Page 8: Machine Learning and Apache Mahout : An Introduction

Supervised Learning: Algorithms

• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-layer Perceptrons
• Neural Networks
• SVM and Kernel estimation

Page 9: Machine Learning and Apache Mahout : An Introduction

Supervised Learning. Example: Naive Bayes Classifier

[Figure: President Obama’s Speech’s Word Map]

Page 10: Machine Learning and Apache Mahout : An Introduction

Supervised Learning. Example: Naive Bayes Classifier

[Figure: A Spam Document’s Word Map]

Page 11: Machine Learning and Apache Mahout : An Introduction

Supervised Learning. Example: Naive Bayes Classifier

Running a test on the Classifier

[Figure: the input text is passed to the Classifier, which places it in the Spam Bin]

Input: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”

Output: Spam Bin
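The test on this slide can be sketched in a few lines of Python. This is a minimal multinomial Naive Bayes with add-one smoothing, not Mahout's implementation; the training sentences and the function names `train` and `classify` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Smoothed log likelihood of each word given the label.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example (not from the talk).
training = [
    ("order a trial now and enjoy summer savings", "spam"),
    ("new savings welcome order daily", "spam"),
    ("the president spoke about the economy and jobs", "ham"),
    ("congress will vote on the new jobs bill", "ham"),
]
word_counts, doc_counts = train(training)
label = classify(
    "Order a trial Adobe chicken daily EAB-List new summer savings, welcome!",
    word_counts,
    doc_counts,
)
print(label)
```

With this toy data the slide's message shares many words with the spam examples and almost none with the others, so it is assigned to the spam label.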

Page 12: Machine Learning and Apache Mahout : An Introduction

Unsupervised Learning: Introduction

• Finding hidden structure in data
• Unlabelled data
• Subject-matter experts (SMEs) are needed in post-processing to verify, validate and use the output
• Used in exploratory analysis rather than predictive analytics
• Applications
  • Pattern Recognition
  • Groupings based on a distance measure
    • Groups of people, objects, ...

Page 13: Machine Learning and Apache Mahout : An Introduction

Unsupervised Learning: Algorithms

• Clustering
  • k-Means, MinHash, Hierarchical Clustering
• Hidden Markov Models
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)

Page 14: Machine Learning and Apache Mahout : An Introduction

Unsupervised Learning. Example: K-Means

Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
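As a rough illustration of what k-means does, here is a minimal single-machine sketch in Python. It is the textbook assign-then-average loop, not Mahout's code; the helper names, the sample points and the fixed iteration count are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(c) for c in zip(*points))


def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Start from k distinct points picked at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Two well-separated groups of points, invented for this example.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
sizes = sorted(len(c) for c in clusters)
print(sizes)
```

A fixed number of iterations keeps the sketch short; a real implementation would normally stop once the centroids move less than some threshold.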

Page 15: Machine Learning and Apache Mahout : An Introduction

Learning Problem: Cat and Dog Problem

• Humans can easily classify which is a cat and which is a dog.
• But how can a computer do that?
• Some attempts used clustering mechanisms to solve it: co-occurrence clustering, deep learning

Page 16: Machine Learning and Apache Mahout : An Introduction

Apache Mahout: Scalable Machine Learning Library

Page 17: Machine Learning and Apache Mahout : An Introduction

History and Etymology

• Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al.
• Written in Java. Apache License.
• Founders
  • Mahout: Isabel Drost, Grant Ingersoll, Karl Wettin
  • Taste: Sean Owen
• Mahout: keeper/driver of elephants.
• Current release: 0.8 (stable)

Page 18: Machine Learning and Apache Mahout : An Introduction

Need

• BigData
  • Ever-growing data
  • Yesterday’s methods to process tomorrow’s data
  • Cheap storage
• Scalable from the Ground Up
  • Should be built on top of any existing distributed systems framework
  • Should contain distributed versions of ML algorithms

Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout

Page 19: Machine Learning and Apache Mahout : An Introduction

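Mahout Modules

[Architecture diagram, listed from the top layer to the bottom layer]

• Applications
• Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM (Frequent Pattern Mining), Dimension Reduction
• Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (Primitives)
• Hadoop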

Page 20: Machine Learning and Apache Mahout : An Introduction

Recommender Systems

Page 21: Machine Learning and Apache Mahout : An Introduction

Recommender Systems: Introduction

• Types of Recommender Systems
  • Content-Based Recommendations
  • Collaborative Filtering Recommendations
    • User-User Recommendations
    • Item-Item Recommendations
  • Dimensionality Reduction (SVD) Recommendations
• Applications
  • Products you would like to buy
  • People you might want to connect with
  • Potential life-partners
  • Recommending songs you might like
  • ...

Page 22: Machine Learning and Apache Mahout : An Introduction

Recommender Systems: Collaborative Filtering in Action

[Figure: user-movie matrix. 1: seen, 0: not seen]

Assuming people have seen at least one movie. Cold start?

Page 23: Machine Learning and Apache Mahout : An Introduction

Collaborative Filtering in Action: Tanimoto Coefficient

$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$

• $N_A$ – Number of Customers who bought A
• $N_B$ – Number of Customers who bought B
• $N_C$ – Number of Customers who bought A and B
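A small worked example of the Tanimoto coefficient, written as a Python sketch. The function name `tanimoto` and the customer IDs are made up for illustration; each product is represented by the set of customers who bought it.

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto coefficient: N_C / (N_A + N_B - N_C)."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought both
    denominator = n_a + n_b - n_c
    return n_c / denominator if denominator else 0.0


# Hypothetical purchase data: 4 buyers of A, 3 buyers of B, 2 shared.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5"}
tanimoto_score = tanimoto(buyers_a, buyers_b)  # 2 / (4 + 3 - 2) = 0.4
print(tanimoto_score)
```

The denominator is the number of customers who bought at least one of the two products, so the score is 1 only when both products have exactly the same buyers.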

Page 24: Machine Learning and Apache Mahout : An Introduction

Collaborative Filtering in Action: Cosine Coefficient

$$C(a, b) = \frac{N_C}{\sqrt{N_A \times N_B}}$$

• $N_A$ – Number of Customers who bought A
• $N_B$ – Number of Customers who bought B
• $N_C$ – Number of Customers who bought A and B
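The cosine coefficient for purchase data can be sketched the same way in Python. This assumes binary "bought / not bought" vectors, for which the cosine reduces to the shared-buyer count divided by the square root of the product of the two buyer counts; the function name and customer IDs are illustrative only.

```python
import math


def cosine(buyers_a, buyers_b):
    """Cosine coefficient for binary purchase vectors: N_C / sqrt(N_A * N_B)."""
    n_a = len(buyers_a)             # customers who bought A
    n_b = len(buyers_b)             # customers who bought B
    n_c = len(buyers_a & buyers_b)  # customers who bought both
    denominator = math.sqrt(n_a * n_b)
    return n_c / denominator if denominator else 0.0


# Hypothetical purchase data: 4 buyers of A, 3 buyers of B, 2 shared.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5"}
cosine_score = cosine(buyers_a, buyers_b)  # 2 / sqrt(4 * 3)
print(round(cosine_score, 3))
```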

Page 25: Machine Learning and Apache Mahout : An Introduction

Apache Mahout: Recommender System Architecture

• Two Modes
  • Stand-alone, non-distributed (“Taste”)
  • Scalable, distributed algorithmic version for Collaborative Filtering
• Top-level Packages
  • Data Model
  • User Similarity
  • Item Similarity
  • User Neighbourhood
  • Recommender
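To show how these pieces fit together, here is a toy user-based recommender in Python that mirrors the same stages: data model, user similarity, user neighbourhood, recommender. It is only a sketch of the idea, not the Mahout (Taste) API; every function name and all of the data are hypothetical, and Tanimoto similarity over "seen" sets is an arbitrary choice.

```python
# Data model: user -> set of items the user has seen (invented data).
data_model = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}


def user_similarity(model, u, v):
    """Tanimoto similarity between the item sets of two users."""
    union = len(model[u] | model[v])
    return len(model[u] & model[v]) / union if union else 0.0


def user_neighbourhood(model, user, n):
    """The n users most similar to `user`."""
    others = [v for v in model if v != user]
    others.sort(key=lambda v: user_similarity(model, user, v), reverse=True)
    return others[:n]


def recommend(model, user, n_neighbours=2, how_many=3):
    """Score unseen items by the summed similarity of neighbours who saw them."""
    scores = {}
    for neighbour in user_neighbourhood(model, user, n_neighbours):
        similarity = user_similarity(model, user, neighbour)
        for item in model[neighbour] - model[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]


recommendations = recommend(data_model, "alice")
print(recommendations)
```

Keeping similarity, neighbourhood and recommender as separate steps means any one of them can be swapped, for example a different similarity measure, without touching the others.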

Page 26: Machine Learning and Apache Mahout : An Introduction

Naive Bayes Classifier

[Figure: the input text is passed to the Classifier]

Input: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”

Page 27: Machine Learning and Apache Mahout : An Introduction

Naive Bayes Classifier

Naive Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs.

• Training
  • Read the features
  • Calculate per-document statistics
  • Normalize across categories
  • Calculate the normalizing factor of each label
• Testing
  • Classification (fifth job, explicitly invoked)

Page 28: Machine Learning and Apache Mahout : An Introduction

K-Means Clustering: Iterations

Page 29: Machine Learning and Apache Mahout : An Introduction

K-Means Clustering: MapReduce Version
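One common way to express a k-means iteration in MapReduce terms is sketched below in plain Python: the map phase emits each point keyed by its nearest centroid, and the reduce phase averages the points for each key. This runs in a single process and is not Mahout's Hadoop job; the function names `map_phase` and `reduce_phase` and the data are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        yield nearest, p


def reduce_phase(pairs):
    """Reducer: average all points that share a centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return {
        idx: tuple(sum(c) / len(c) for c in zip(*pts))
        for idx, pts in groups.items()
    }


# One iteration over invented data with two starting centroids.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = reduce_phase(map_phase(points, centroids))
print(new_centroids)
```

A driver would feed `new_centroids` back in as the next iteration's centroids and repeat until they stop moving.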

Page 30: Machine Learning and Apache Mahout : An Introduction

Summary

• Machine Learning
  • Learning Algorithms
  • Varied Applications
• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source

Page 31: Machine Learning and Apache Mahout : An Introduction

More Info.

1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. RecSys 2012.

2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson. Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).

3. http://mahout.apache.org/ - Apache Mahout Project Page

4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout

5. [VIDEO] “Collaborative filtering at scale” by Sean Owen

6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.

Page 32: Machine Learning and Apache Mahout : An Introduction

Questions?

Page 33: Machine Learning and Apache Mahout : An Introduction

Thank You

Go BigData!!!

© Varad Meru, 2014