[Women in Data Science Meetup ATX] Decision Trees

Decision Trees Nikolaos Vergos, Ph.D. Senior Data Scientist, Accordion Health

Posted on 14-Apr-2017

TRANSCRIPT

Decision Trees
Nikolaos Vergos, Ph.D., Senior Data Scientist, Accordion Health

Overview
- What are decision trees
- Building decision trees
- Purity metrics: entropy, information gain, Gini index
- Stopping growth of a decision tree
- Ensemble learning

Purity: an important concept in decision trees; it determines the decision we take at each step.

Ensemble: how to combine trees for better performance.

What are decision trees?
- A flowchart-like structure; a graph-based decision support model
- A non-parametric supervised learning technique that can be used for both categorical (classification) and continuous (regression) output
- Visually engaging and very easy to interpret
- An excellent model for someone transitioning into the world of data science:
  - Requires little data preparation
  - Able to handle multi-output problems

Very flexible; supports all kinds of features.

Mostly used for classification; good for both binary and multi-class problems.
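To make the classification/regression distinction concrete, here is a minimal sketch (not from the slides; the toy data is made up) fitting one tree of each kind with scikit-learn:

```python
# Minimal sketch: a classification tree and a regression tree on made-up toy data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy features: [age, fare] for four hypothetical passengers
X = [[22, 7.25], [38, 71.28], [26, 7.92], [35, 53.10]]
y_class = [0, 1, 1, 1]            # categorical target (classification)
y_reg = [7.0, 80.0, 10.0, 60.0]   # continuous target (regression)

clf = DecisionTreeClassifier().fit(X, y_class)
reg = DecisionTreeRegressor().fit(X, y_reg)

print(clf.predict([[30, 20.0]]), reg.predict([[30, 20.0]]))
```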

Surviving the Titanic

- Interconnected nodes act as a series of questions / test conditions
- Terminal nodes (leaves) show the output metric

Source: http://www.kdnuggets.com/2016/09/decision-trees-disastrous-overview.html

An upside-down tree:
- root at the top: all of our data
- thinning out at each step
- terminal nodes (leaves)
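To see this question-and-leaf structure directly, here is a hedged sketch (not from the slides) that fits a shallow tree on scikit-learn's built-in iris data and prints it as text; export_text assumes scikit-learn 0.21 or newer:

```python
# Sketch: print a fitted tree as a series of test conditions ending in leaves.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Each indented line is a test condition; lines marked "class:" are leaves.
print(export_text(tree, feature_names=list(iris.feature_names)))
```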

Questions
- How does the algorithm choose which variables to include in the tree?
- How does the algorithm choose where variables should be located on the tree?
- How does the algorithm decide to stop growing the tree?

- Growing an optimal decision tree for a training data set is computationally a very hard problem
- We can still grow a good-enough tree: greedy algorithms have good performance (they choose the immediately best option available at each step)
- Hunt's algorithm: a greedy, recursive algorithm that leads to a local optimum

Several algorithms exist for decision trees: ID3, CART.

Finding the optimal decision tree is NP-complete.

A different starting split might lead to a different tree.

Building a decision tree
- Recursively partition records into smaller and smaller subsets
- The partitioning decision depends on purity: different variables and split options are evaluated to determine which split will provide the greatest separation between classes
- Goal of a decision tree: to have nodes consisting entirely of members of a single class
- The "impurity" of a node (the extent to which that node is imbalanced) should be minimized
- Several metrics quantify impurity

Purity: for binary classification, we want as many data points as possible in each node to belong to the same category.
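The greedy, recursive recipe can be sketched in plain Python. This is an illustrative outline under simplifying assumptions (numeric features, binary splits, Gini impurity as the purity metric), not a faithful transcription of Hunt's algorithm:

```python
# Illustrative sketch of greedy recursive partitioning (not production code).
from collections import Counter

def gini(labels):
    # Impurity of a node: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or the depth limit is reached: return a leaf.
    if len(set(y)) == 1 or depth == max_depth:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    best = None
    for f in range(len(X[0])):                    # every feature...
        for t in sorted({row[f] for row in X}):   # ...and every candidate threshold
            left = [i for i, row in enumerate(X) if row[f] < t]
            right = [i for i, row in enumerate(X) if row[f] >= t]
            if not left or not right:
                continue
            # Greedy choice: minimize the weighted impurity of the two children.
            score = (len(left) * gini([y[i] for i in left]) +
                     len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, f, t, left, right = best
    return {"feature": f, "threshold": t,
            "left": build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth)}
```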

Entropy
Data set S, each member of which belongs to one of the classes c1, c2, ..., cn:

H(S) = − Σ_j p_j log2(p_j), where p_j is the proportion of elements in class j

- H = 0: all elements belong to the same class (zero entropy = total order)
- H = 1: even split between two classes (maximum entropy = total disorder)

Evaluate entropy within the PARENT node, then evaluate entropy in the potential CHILDREN.
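As a small sketch (not from the slides), the entropy of a list of class labels can be computed directly from the formula above:

```python
# Sketch: entropy of a list of class labels, H(S) = -sum(p_j * log2(p_j)).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0 -> total order
print(entropy(["a", "a", "b", "b"]))  # 1.0 -> total disorder (two classes)
```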

Information Gain
- Stems from entropy
- IG = H(parent) − weighted average of H(children)

[Figure: information gain evaluated for candidate splits X < 4 and X < 3 on the parent node. Source: ACM-SIGKDD Meetup]

How much more information have we gained with each potential split?

Which split reduces entropy (imbalance) the most?
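Continuing the sketch above (same entropy helper, made-up labels), information gain compares the parent's entropy with the weighted entropy of the children:

```python
# Sketch: information gain of a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # IG = H(parent) - weighted average of H(children)
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = [0, 0, 0, 1, 1, 1]
print(information_gain(parent, [[0, 0, 0], [1, 1, 1]]))  # 1.0: a perfect split gains the full bit
print(information_gain(parent, [[0, 0, 1], [0, 1, 1]]))  # ~0.08: children barely purer than the parent
```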

Gini Index
- Expected error rate: how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
- G(S) = 1 − Σ_j p_j², where p_j is the proportion of elements in class j
- Gini = 0: all elements are the same class (perfect separation, perfect purity)
- Gini = 0.5: even split between two classes (equal representation)
- Similar process:
  - Calculate the Gini gain for each potential split
  - Choose the split with the highest Gini gain

Misclassification frequency.

Gini Gain

[Figure: Gini gain evaluated for candidate splits X < 4 and X < 3 on the parent node]

Gini Gain: G(parent) − weighted average of G(children)
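A matching sketch (not from the slides) for Gini impurity and Gini gain:

```python
# Sketch: Gini impurity and Gini gain of a candidate split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, children):
    # G(parent) - weighted average of G(children)
    n = len(parent)
    return gini(parent) - sum(len(c) / n * gini(c) for c in children)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                               # 0.5: even split between two classes
print(gini_gain(parent, [[0, 0, 0], [1, 1, 1]]))  # 0.5: a perfect split removes all impurity
```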

When to use which?
- Only a ~2% performance difference in practice
- Entropy might be a bit slower to compute (due to the logarithm)
- Gini for continuous attributes, entropy for categorical
- Gini to minimize misclassification, entropy for exploratory analysis
- Gini will tend to isolate the largest class; entropy tends to find groups of classes that together make up ~50% of the data
- Default in scikit-learn: Gini. Entropy is also available.

In practice, there is no real difference between the two.

Historically (in different fields of science): Gini for numerical features, entropy for categorical features.

In the textbook, you can define everything from scratch; in real life, use scikit-learn or R.
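For instance, a minimal sketch (not from the slides) of switching the impurity criterion in scikit-learn's DecisionTreeClassifier:

```python
# Sketch: choosing the impurity criterion in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

gini_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)      # the default
entropy_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(gini_tree.score(X, y), entropy_tree.score(X, y))
```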


When to stop growing? How about overfitting?
- Pure leaves
- A pre-set depth of the tree (the length of the longest path from the root to a leaf) is reached
- The number of cases in a node falls below a set minimum
- The splitting criterion improves by less than a certain threshold

Decision trees are prone to overfitting
- Pre-pruning: set a minimum threshold on the gain, and stop when no split achieves a gain above this threshold
- Post-pruning: build the full tree, then perform pruning as a post-processing step (not currently supported in scikit-learn 0.18)

Decision trees overfit: they learn the training data set so closely that they fail to generalize.

Threshold: similar to a gradient descent optimizer, we can decide to stop when the information gain / Gini gain from node to node becomes insignificant (convergence).
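A hedged sketch of pre-pruning in scikit-learn (not from the slides; min_impurity_decrease and the get_depth/get_n_leaves helpers were added in releases after the 0.18 version mentioned above):

```python
# Sketch: pre-pruning via scikit-learn's stopping parameters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pruned = DecisionTreeClassifier(
    max_depth=3,                  # pre-set depth of the tree
    min_samples_leaf=5,           # minimum number of cases per leaf
    min_impurity_decrease=0.01,   # stop when the gain from a split is too small
).fit(X, y)

print(pruned.get_depth(), pruned.get_n_leaves())
```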

Ensemble Learning
- Decision trees can be weak learners with a tendency to overfit the training data
- We can combine several weak learners into an overall strong learner
- Averaging methods for reducing variance:
  - Bagging (Bootstrap Aggregating): use random subsets of the training set
  - Random Forest Classifier: build multiple decision trees and let them vote on how to classify inputs (scikit-learn); only a subset of features is considered when splitting a node
- Boosting methods for reducing bias:
  - Base estimators (individual trees) are built sequentially; the subset creation is not random and depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models
  - AdaBoost, Gradient Boosting, XGBoost
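A short sketch (not from the slides; hyperparameter values are arbitrary) of one averaging ensemble and one boosting ensemble in scikit-learn:

```python
# Sketch: a bagging-style ensemble (random forest) and a boosting ensemble (AdaBoost).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")  # vote of many trees, feature subsets per split
boost = AdaBoostClassifier(n_estimators=100)                            # trees built sequentially, focusing on past mistakes

print(cross_val_score(forest, X, y, cv=5).mean())
print(cross_val_score(boost, X, y, cv=5).mean())
```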

References & Further Reading
- ACM-SIGKDD Meetup: Advanced Machine Learning with Python
- Kevin Markham: Introduction to Decision Trees (slides, PDF)
- Pang-Ning Tan et al.: Introduction to Data Mining, Chapter 4
- scikit-learn documentation
- Analytics Vidhya: A Complete Tutorial on Tree Based Modeling from Scratch

Thank you for your attention!
[email protected]

@nvergos

Nikos Vergos