HPCC Systems Engineering Summit Presentation: Collaborative Research with FAU and LexisNexis (Deck 2)
Post on 13-Jul-2015
Developing Machine Learning Algorithms on HPCC/ECL Platform
Industry/University Cooperative Research
LexisNexis/Florida Atlantic University
Victor Herrera – FAU – Fall 2014
Developing ML Algorithms on the HPCC/ECL Platform
LexisNexis/Florida Atlantic University Cooperative Research
HPCC ECL-ML Library
• High-level, data-centric declarative language
• Scalability; dictionary approach
• Open source
LexisNexis HPCC Platform
• Big Data management
• Optimized code
• Parallel processing
Machine Learning Algorithms
FAU
Principal Investigator: Taghi Khoshgoftaar (Data Mining and Machine Learning)
Researchers: Victor Herrera, Maryam Najafabadi
LexisNexis
Flavio Villanustre, VP Technology & Product
Edin Muharemagic, Data Scientist and Architect
Project Team
[Figure: supervised learning flow. Training data (Constructor, Businessman, Doctor examples) feeds a machine learning algorithm, which produces a classification model; the model then classifies a new instance as Constructor.]
Project Scope
Implement machine learning algorithms in the HPCC ECL-ML library, focusing on:
• Supervised Learning for Classification
• Big Data
• Parallel Computing
ECL-ML Learning Library (ML)
Supervised Learning
• Classifiers: Naïve Bayes, Perceptron, Logistic Regression
• Regression: Linear Regression
Unsupervised Learning
• Clustering: KMeans
• Association Analysis: APrioriN, EclatN
Project starting point: April 2012
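The classifiers listed above share the same LearnC/ClassifyC pattern the deck later shows for Decision Trees. A minimal sketch of calling the Naïve Bayes classifier follows; the record layout is illustrative, and the exact module paths (ML.ToField, ML.Discretize, ML.Classify.NaiveBayes) are assumed from the ecl-ml library's conventions and may differ across versions:

```ecl
IMPORT ML;
// Illustrative raw training records: two features plus a class label
rawRec := RECORD
  REAL sepal_len;
  REAL sepal_wid;
  UNSIGNED1 class;
END;
raw := DATASET([{5.1, 3.5, 1}, {6.2, 2.9, 2}], rawRec);
// ML.ToField converts a record set into the library's field layout
ML.ToField(raw, fields);
indepData := fields(number <= 2);                           // independent features
depData   := ML.Discretize.ByRounding(fields(number = 3));  // class labels
// Same pattern as the other classifiers: LearnC, then ClassifyC
nbModel   := ML.Classify.NaiveBayes.LearnC(indepData, depData);
nbResults := ML.Classify.NaiveBayes.ClassifyC(indepData, nbModel);
OUTPUT(nbResults);
```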
ECL-ML Open Source Blueprint
• Community: uses and contributes to the ECL-ML library, supported by HPCC-ECL training, HPCC-ECL documentation, forums, and machine learning documentation.
• ML-ECL Library Contributor: commits changes and opens pull requests, then processes review comments.
• ML-ECL Library Administrator: reviews code, comments, and merges.
Experience Progress
Algorithm Complexity
• Naïve Bayes, Decision Trees
• KNN, Random Forest
Language Knowledge
• Transformations, aggregations
• Data distribution, parallel optimization
Implementation Complexity
• Understand existing code, fix existing modules
• Add functionality, design modules
Data
• Categorical (discrete), continuous
• Mixed (future work)
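The "data distribution, parallel optimization" point refers to ECL primitives such as DISTRIBUTE and LOCAL operations, which co-locate related records on the same Thor node so aggregations can run in parallel without cross-node traffic. A minimal sketch (the record layout and key are illustrative):

```ecl
// Illustrative only: distribute records across nodes by a key,
// then aggregate locally on each node.
rec := RECORD
  UNSIGNED id;
  REAL value;
END;
ds := DATASET([{1, 0.5}, {2, 1.5}, {1, 2.0}], rec);
distDs := DISTRIBUTE(ds, HASH32(id));  // co-locate records sharing an id
// LOCAL restricts the grouping to each node's slice of the data
localSums := TABLE(distDs, {id, s := SUM(GROUP, value)}, id, LOCAL);
OUTPUT(localSums);
```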
Example: Classifying Iris-Setosa Decision Tree vs. Random Forest
// Learning Phase
minNumObj := 2;
maxLevel := 5;  // DecTree parameters
learner1 := ML.Classify.DecisionTree.C45Binary(minNumObj, maxLevel);
tmodel := learner1.LearnC(indepData, depData);  // DecTree model
// Classification Phase
results1 := learner1.ClassifyC(indepData, tmodel);
// Comparing to Original
compare1 := ML.Classify.Compare(depData, results1);
OUTPUT(compare1.CrossAssignments, ALL);  // CrossAssignment results

// Learning Phase
treeNum := 25;
fsNum := 3;
Pure := 1.0;
maxLevel := 5;  // RndForest parameters
learner2 := ML.Classify.RandomForest(treeNum, fsNum, Pure, maxLevel);
rfmodel := learner2.LearnC(indepData, depData);  // RndForest model
// Classification Phase
results2 := learner2.ClassifyC(indepData, rfmodel);
// Comparing to Original
compare2 := ML.Classify.Compare(depData, results2);
OUTPUT(compare2.CrossAssignments, ALL);  // CrossAssignment results
[Figure: learned decision trees for the Iris data. Internal nodes split on attributes A3 and A4 with thresholds such as A4 <= 0.6, A3 <= 4.7, and A4 <= 1.5 / 1.7 / 1.8; leaves carry class labels 0, 1, and 2.]
Supervised Learning
• Naïve Bayes: maximum likelihood; probabilistic model.
• Decision Tree: top-down induction, recursive data partitioning; decision tree model.
• Case-based (KNN): KD-tree partitioning, nearest-neighbor search; no model (lazy KNN algorithm).
• Ensemble (Random Forest): data sampling (tree bagging), feature sampling (random subspace); ensemble tree model.
Capabilities: classification, class probability, classification with missing values, area under ROC.
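For reference, the "maximum likelihood, probabilistic model" characterization of Naïve Bayes corresponds to the standard textbook decision rule (this is a general formulation, not code from the library):

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c)\,\prod_{i} P(x_i \mid c)
```

where the prior P(c) and the per-feature conditionals P(x_i | c) are estimated from (typically smoothed) maximum-likelihood counts over the training data.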
Project Contribution
• Development of machine learning algorithms in the ECL language, focused on classification algorithms, Big Data, and parallel processing.
• Added functionality to the Naïve Bayes classifier and implemented Decision Tree, KNN, and Random Forest classifiers in the ECL-ML library.
• Currently working on Big Data learning and evaluation: experiment design and implementation of a Big Data case study, analysis of results, and classifier enhancements based on that analysis.
Overall Summary
Thank you.