distributed logistic model trees

Distributed Logistic Model Trees, Stratio Intelligence Mateo Álvarez and Antonio Soriano @StratioBD

Upload: stratio

Post on 07-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Distributed Logistic Model Trees

Distributed Logistic Model Trees, Stratio Intelligence

Mateo Álvarez and Antonio Soriano

@StratioBD

Page 2: Distributed Logistic Model Trees

Aerospace Engineer, MSc in Propulsion Systems (UPM), Master in Data Science (URJC).

Working as a data scientist and Big Data developer at Stratio Big Data in the data science department

mateo-alvarez

Page 3: Distributed Logistic Model Trees

Ph.D. in Telecommunications, MSc in Electronic Systems Engineering and Telecommunication Technologies, Systems and Networks (UPV), and MSc “Big Data Expert” (UTAD).

Working as a data scientist and Big Data developer at Stratio Big Data in the data science department

@Phd_A_Soriano

Page 4: Distributed Logistic Model Trees

Why use interpretable algorithms instead of “black boxes”

Logistic Regression

Decision Trees

Variance-Bias tradeoff

Metrics

Demo

Logistic Model Trees

Distributed implementation

Cost function & configuration params

Demo

Page 5: Distributed Logistic Model Trees

• Why use interpretable algorithms instead of “black boxes”

• Logistic Regression

• Decision Trees

• Variance-Bias tradeoff


Page 6: Distributed Logistic Model Trees

Accuracy vs. Explainability

Medical studies, power management, financial environment, criminal activity

Page 7: Distributed Logistic Model Trees

[Figure: Probability (y-axis) vs. Feature (x-axis), with a Threshold marked]

Page 8: Distributed Logistic Model Trees

[Figure: Probability (y-axis) vs. Feature (x-axis), with a Threshold marked and a region annotated as a poor local fit]

Page 9: Distributed Logistic Model Trees

[Figure: Probability (y-axis) vs. Feature (x-axis), with a Threshold marked]
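The Probability vs. Feature curve with a Threshold can be illustrated with a minimal single-feature logistic regression sketch. The coefficients `WEIGHT` and `BIAS` and the function names are hypothetical, chosen only for illustration; they are not taken from the talk.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(weight * x + bias)


def predict_label(x, weight, bias, threshold=0.5):
    """Predict 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0


# Hypothetical coefficients, for illustration only.
WEIGHT, BIAS = 2.0, -4.0

for x in (0.0, 2.0, 4.0):
    print(x, round(predict_proba(x, WEIGHT, BIAS), 3), predict_label(x, WEIGHT, BIAS))
```

Raising the threshold makes the model more conservative about predicting the positive class; the fitted curve itself does not change.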

Page 10: Distributed Logistic Model Trees

[Figure: decision tree. The root node splits on Feature 1 into two leaf nodes]

Page 11: Distributed Logistic Model Trees

[Figure: decision tree. The root node splits on Feature 1; each branch splits again on Feature 2, giving four leaf nodes]

Page 12: Distributed Logistic Model Trees

[Figure: decision tree. The root node splits on Feature 1; each branch splits again on Feature 2, giving four leaf nodes. One region is annotated as a poor local fit]
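The two-level tree on these slides (a root split on Feature 1, then a split on Feature 2 in each branch, ending in four leaves) can be sketched as a nested dictionary. The split values and leaf labels below are hypothetical, as are the names `TREE` and `predict`.

```python
# Hypothetical depth-2 tree: index 0 is Feature 1, index 1 is Feature 2.
TREE = {
    "feature": 0, "split": 5.0,
    "left": {"feature": 1, "split": 1.0,
             "left": {"label": 0}, "right": {"label": 1}},
    "right": {"feature": 1, "split": 3.0,
              "left": {"label": 1}, "right": {"label": 0}},
}


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's constant label."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["split"] else "right"
        node = node[branch]
    return node["label"]


print(predict(TREE, [2.0, 0.5]))  # goes left, then left
print(predict(TREE, [7.0, 4.0]))  # goes right, then right
```

Each leaf returns one constant answer for its whole region, which is one way a plain decision tree can fit poorly inside a region.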

Page 13: Distributed Logistic Model Trees

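[Figure: mean error vs. model complexity, showing the training error and test error curves]

[Figure: error vs. model complexity, showing the Bias², Variance and Total error curves, from underfitting (low complexity) to overfitting (high complexity)]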

Page 14: Distributed Logistic Model Trees

Variance

Bias

Page 15: Distributed Logistic Model Trees

Missing important variables for the problem to make the predictions

Variance

Bias

Page 16: Distributed Logistic Model Trees

Overfitting to the sample/training data

Variance

Bias

Page 17: Distributed Logistic Model Trees

Irreducible error on prediction

Variance

Bias
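The three error sources named on these slides (bias, variance and irreducible error) correspond to the standard decomposition of the expected squared error. The notation is an assumption, not taken from the slides: the data follow y = f(x) + ε with noise variance σ², and f̂ is the fitted model.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```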

Page 18: Distributed Logistic Model Trees

• Logistic Model Trees

• Distributed implementation

• Cost function & configuration parameters

• Demo


Page 19: Distributed Logistic Model Trees

Root node

Performance metrics

Page 20: Distributed Logistic Model Trees

[Figure: tree with a root node split on Feature 1; performance metrics are shown at the nodes]

Page 21: Distributed Logistic Model Trees

[Figure: tree with a root node split on Feature 1; performance metrics are shown at the nodes of each level]

Page 22: Distributed Logistic Model Trees

Feature 1

Root node

Feature 2 Feature 2

Page 23: Distributed Logistic Model Trees

Feature 1

Root node

Feature 2

Page 24: Distributed Logistic Model Trees

Feature 1

Root node

Feature 2

Page 25: Distributed Logistic Model Trees

DISTRIBUTED IMPLEMENTATION

Spark’s Decision Tree (distributed implementation of random forests)

Spark’s Logistic Regression / Weka’s Logistic Regression on the nodes
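To show how the two pieces fit together at prediction time, here is a minimal single-machine sketch of a logistic model tree: a sample is routed through the tree splits and then scored by the logistic regression stored at the node it reaches. This is plain Python, not the Spark or Weka code the slide refers to. Placing the models on leaf nodes, and every split value, coefficient and name (`LMT`, `find_leaf`, `predict_proba`), is an assumption made for illustration.

```python
import math

# Hypothetical tree: one split on feature 0, a logistic model in each leaf.
LMT = {
    "feature": 0, "split": 5.0,
    "left": {"weights": [0.0, 2.0], "bias": -2.0},
    "right": {"weights": [0.0, -1.0], "bias": 3.0},
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def find_leaf(node, sample):
    """Follow the splits until a node that holds a logistic model."""
    while "weights" not in node:
        branch = "left" if sample[node["feature"]] <= node["split"] else "right"
        node = node[branch]
    return node


def predict_proba(tree, sample):
    """Score the sample with the logistic model of the leaf it falls into."""
    leaf = find_leaf(tree, sample)
    z = leaf["bias"] + sum(w * x for w, x in zip(leaf["weights"], sample))
    return sigmoid(z)


def predict_label(tree, sample, threshold=0.5):
    return 1 if predict_proba(tree, sample) >= threshold else 0


print(predict_proba(LMT, [2.0, 1.0]), predict_label(LMT, [7.0, 5.0]))
```

Unlike a plain decision tree, each region gets its own probability curve rather than a single constant label.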

Page 26: Distributed Logistic Model Trees

LMT cost function to fix the logistic regression threshold:

• AccuracyCostFunction

• ConfusionMatrix

• PrecisionCostFunction

• PrecisionRecallCostFunction

• RocCostFunction

The same cost function is used as the pruning criterion

[Figure: tree nodes annotated with performance metrics]
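One way to read "cost function to fix the logistic regression threshold" is: try several candidate thresholds and keep the one that scores best under the chosen cost function. The sketch below assumes that reading. The function names, scores and labels are hypothetical and do not reproduce the AccuracyCostFunction or PrecisionCostFunction classes named on the slide.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_cost(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision_cost(tp, fp, tn, fn):
    return tp / (tp + fp) if (tp + fp) else 0.0


def best_threshold(scores, labels, cost, candidates):
    """Pick the candidate threshold with the highest cost-function value."""
    return max(candidates, key=lambda t: cost(*confusion_counts(scores, labels, t)))


# Hypothetical model scores and true labels.
scores = [0.1, 0.4, 0.45, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 1, 1]
print(best_threshold(scores, labels, accuracy_cost, [0.2, 0.42, 0.7]))
```

Swapping `accuracy_cost` for another function changes which threshold is selected without touching the fitted model.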

Page 27: Distributed Logistic Model Trees

ADVANTAGES OF THIS IMPLEMENTATION

Big datasets: power of Spark to distribute building the tree and the logistic regressions

Medium datasets: distributed tree growth and Weka’s logistic regression

Small datasets: although it can be slow to distribute the data for the decision tree, cost functions can still be used, along with specific optimizations for particular cases

Page 28: Distributed Logistic Model Trees

Example of DLMT algorithm in a synthetic dataset

Page 29: Distributed Logistic Model Trees

• Metrics

• Demo


Page 30: Distributed Logistic Model Trees

                          PREDICTION
                          Positive          Negative
TRUE CONDITION  Positive  True Positives    False Negatives
                Negative  False Positives   True Negatives

True Positive Rate (Recall): TPR = TP/(TP+FN). Insensitive to class imbalance

False Positive Rate: FPR = FP/(FP+TN). Insensitive to class imbalance

Precision = TP/(TP+FP). Sensitive to class imbalance

Accuracy = (TP+TN)/(TP+TN+FP+FN). Sensitive to class imbalance
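The four formulas above can be checked numerically. The sketch below computes them from confusion-matrix counts, then multiplies the number of negative examples by ten while keeping the per-class error rates fixed. The counts and the name `metrics` are made up for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Compute the four slide metrics from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical balanced case: 100 positives, 100 negatives.
balanced = metrics(tp=80, fp=10, tn=90, fn=20)

# Same classifier behaviour, but ten times more negatives.
imbalanced = metrics(tp=80, fp=100, tn=900, fn=20)

for name in balanced:
    print(name, round(balanced[name], 3), round(imbalanced[name], 3))
```

TPR and FPR are each computed within a single true class, so they stay the same; precision and accuracy mix both classes and move when the class ratio changes.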

Page 31: Distributed Logistic Model Trees

[Confusion matrix as on the previous slide, highlighting True Positive Rate (Recall) and False Positive Rate]

AUROC (AUC): area under the TPR vs. FPR curve -> insensitive to class imbalance!

[Figure: ROC curve, TPR (y-axis) vs. FPR (x-axis), with the best-performance point marked]
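AUROC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counting as one half. The sketch below uses that pairwise definition. It is quadratic in the number of examples and meant only as an illustration; the data and the name `auroc` are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and labels.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A value of 1.0 means every positive outranks every negative; 0.5 is what random scoring would give.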

Page 32: Distributed Logistic Model Trees

[Confusion matrix as on the previous slides, highlighting True Positive Rate (Recall) and Precision]

AUPRC: area under the Precision vs. Recall (TPR) curve -> sensitive to class imbalance!

[Figure: precision-recall curve, Precision (y-axis) vs. Recall (x-axis), with the best-performance point marked]
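A common estimate of AUPRC is average precision: rank the examples by score and average the precision measured at the rank of each positive example. The sketch below assumes that estimator (the slides do not say which one is used) and does not treat tied scores specially. The data and the name `average_precision` are hypothetical.

```python
def average_precision(scores, labels):
    """Average of the precision values at the rank of each positive example."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive example")
    # Highest score first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` examples
    return precision_sum / total_positives


# Hypothetical scores and labels.
print(average_precision([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Because precision depends on how many negatives are ranked above each positive, this value shifts with the class ratio, which matches the slide's "sensitive" remark.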

Page 33: Distributed Logistic Model Trees

[Figure: Benchmark (ABF) diagram with Data and Algorithms f1 to fn]

Page 34: Distributed Logistic Model Trees

@StratioBD

Page 35: Distributed Logistic Model Trees

Accuracy vs. Explainability

Performance metrics: AUROC, AUPRC, accuracy

Automatic Benchmarking Framework (ABF)

[Figure: Benchmark (ABF) diagram with algorithms f1 to fn; items numbered 1 to 4]

Page 36: Distributed Logistic Model Trees

THANK YOU

UNITED STATES

Tel: (+1) 408 5998830

EUROPE

Tel: (+34) 91 828 64 73

[email protected]

www.stratio.com

@StratioBD

Page 37: Distributed Logistic Model Trees

[email protected] ARE HIRING

@StratioBD