HPCC Systems Engineering Summit Presentation: Collaborative Research with FAU and LexisNexis (Deck 2)
Post on 13-Jul-2015
Developing Machine Learning Algorithms on HPCC/ECL Platform
Industry/University Cooperative Research
LexisNexis/Florida Atlantic University
Victor Herrera – FAU – Fall 2014
Developing ML Algorithms on the HPCC/ECL Platform
LexisNexis/Florida Atlantic University Cooperative Research
HPCC ECL-ML Library
• High-level, data-centric declarative language
• Scalability; dictionary approach
• Open source
LexisNexis HPCC Platform
• Big Data management
• Optimized code
• Parallel processing
Machine Learning Algorithms
FAU
Principal Investigator: Taghi Khoshgoftaar (Data Mining and Machine Learning)
Researchers: Victor Herrera, Maryam Najafabadi
LexisNexis
Flavio Villanustre, VP Technology & Product
Edin Muharemagic, Data Scientist and Architect
Project Team
[Figure: supervised learning flow. Training data (Constructor, Businessman, Doctor examples) feeds a machine learning algorithm, which produces a classification model; the model then classifies a new instance as Constructor.]
Project Scope
Implement machine learning algorithms in the HPCC ECL-ML library, focusing on:
• Supervised Learning for Classification
• Big Data
• Parallel Computing
ECL-ML Learning Library (ML)
Supervised Learning
• Classifiers: Naïve Bayes, Perceptron, Logistic Regression
• Regression: Linear Regression
Unsupervised Learning
• Clustering: KMeans
• Association Analysis: APrioriN, EclatN
Project starting point: April 2012
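The classifiers listed above share the same LearnC/ClassifyC pattern the deck later shows for Decision Trees. A minimal sketch of calling the Naïve Bayes classifier follows; the record layout is illustrative, and the exact module paths (ML.ToField, ML.Discretize, ML.Classify.NaiveBayes) are assumed from the ecl-ml library's conventions and may differ across versions:

```ecl
IMPORT ML;
// Illustrative raw training records: two features plus a class label
rawRec := RECORD
  REAL sepal_len;
  REAL sepal_wid;
  UNSIGNED1 class;
END;
raw := DATASET([{5.1, 3.5, 1}, {6.2, 2.9, 2}], rawRec);
// ML.ToField converts a record set into the library's field layout
ML.ToField(raw, fields);
indepData := fields(number <= 2);                           // independent features
depData   := ML.Discretize.ByRounding(fields(number = 3));  // class labels
// Same pattern as the other classifiers: LearnC, then ClassifyC
nbModel   := ML.Classify.NaiveBayes.LearnC(indepData, depData);
nbResults := ML.Classify.NaiveBayes.ClassifyC(indepData, nbModel);
OUTPUT(nbResults);
```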
ECL-ML Open Source Blueprint
• Community: uses and contributes to the ECL-ML library, supported by HPCC-ECL training, HPCC-ECL documentation, forums, and machine learning documentation.
• ML-ECL Library Contributor: commits changes and opens pull requests, then processes review comments.
• ML-ECL Library Administrator: reviews code, comments, and merges.
Experience Progress
Algorithm Complexity
• Naïve Bayes, Decision Trees
• KNN, Random Forest
Language Knowledge
• Transformations, aggregations
• Data distribution, parallel optimization
Implementation Complexity
• Understand existing code, fix existing modules
• Add functionality, design modules
Data
• Categorical (discrete), continuous
• Mixed (future work)
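The "data distribution, parallel optimization" point refers to ECL primitives such as DISTRIBUTE and LOCAL operations, which co-locate related records on the same Thor node so aggregations can run in parallel without cross-node traffic. A minimal sketch (the record layout and key are illustrative):

```ecl
// Illustrative only: distribute records across nodes by a key,
// then aggregate locally on each node.
rec := RECORD
  UNSIGNED id;
  REAL value;
END;
ds := DATASET([{1, 0.5}, {2, 1.5}, {1, 2.0}], rec);
distDs := DISTRIBUTE(ds, HASH32(id));  // co-locate records sharing an id
// LOCAL restricts the grouping to each node's slice of the data
localSums := TABLE(distDs, {id, s := SUM(GROUP, value)}, id, LOCAL);
OUTPUT(localSums);
```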
Example: Classifying Iris-Setosa Decision Tree vs. Random Forest
// Learning Phase
minNumObj := 2;
maxLevel := 5;  // DecTree parameters
learner1 := ML.Classify.DecisionTree.C45Binary(minNumObj, maxLevel);
tmodel := learner1.LearnC(indepData, depData);  // DecTree model
// Classification Phase
results1 := learner1.ClassifyC(indepData, tmodel);
// Comparing to Original
compare1 := ML.Classify.Compare(depData, results1);
OUTPUT(compare1.CrossAssignments, ALL);  // CrossAssignment results

// Learning Phase
treeNum := 25;
fsNum := 3;
Pure := 1.0;
maxLevel := 5;  // RndForest parameters
learner2 := ML.Classify.RandomForest(treeNum, fsNum, Pure, maxLevel);
rfmodel := learner2.LearnC(indepData, depData);  // RndForest model
// Classification Phase
results2 := learner2.ClassifyC(indepData, rfmodel);
// Comparing to Original
compare2 := ML.Classify.Compare(depData, results2);
OUTPUT(compare2.CrossAssignments, ALL);  // CrossAssignment results
[Figure: learned decision trees for the Iris data. Internal nodes split on attributes A3 and A4 with thresholds such as A4 <= 0.6, A3 <= 4.7, and A4 <= 1.5 / 1.7 / 1.8; leaves carry class labels 0, 1, and 2.]
Supervised Learning
• Naïve Bayes: maximum likelihood; probabilistic model.
• Decision Tree: top-down induction, recursive data partitioning; decision tree model.
• Case-based (KNN): KD-tree partitioning, nearest-neighbor search; no model (lazy KNN algorithm).
• Ensemble (Random Forest): data sampling (tree bagging), feature sampling (random subspace); ensemble tree model.
Capabilities: classification, class probability, classification with missing values, area under ROC.
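For reference, the "maximum likelihood, probabilistic model" characterization of Naïve Bayes corresponds to the standard textbook decision rule (this is a general formulation, not code from the library):

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c)\,\prod_{i} P(x_i \mid c)
```

where the prior P(c) and the per-feature conditionals P(x_i | c) are estimated from (typically smoothed) maximum-likelihood counts over the training data.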
Project Contribution
• Development of machine learning algorithms in the ECL language, focused on classification algorithms, Big Data, and parallel processing.
• Added functionality to the Naïve Bayes classifier and implemented Decision Tree, KNN, and Random Forest classifiers in the ECL-ML library.
• Currently working on Big Data learning and evaluation: experiment design and implementation of a Big Data case study, analysis of results, and classifier enhancements based on that analysis.
Overall Summary
Thank you.