A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees Wei Fan, IBM T.J.Watson Joe McCloskey, US Department of Defense Philip Yu, IBM T.J.Watson


Post on 30-Jan-2016


Page 1: Wei Fan,  IBM T.J.Watson Joe McCloskey,  US Department of Defense Philip Yu,  IBM T.J.Watson

A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees

Wei Fan, IBM T.J.Watson

Joe McCloskey, US Department of Defense

Philip Yu, IBM T.J.Watson

Page 2:

Three DM Problems

Classification: the label comes from a given set of labels in the training data.

Probability Estimation: similar to the classification setting, but the task is to estimate the probability that x is an example of class y. Difference: no truth is given, i.e., no true probability.

Regression: the target value is continuous.

Page 3:

Model Approximation

True model or correct model: generates y for each x with probability P(y|x). Normally never known in reality.

Perfect model: never makes mistakes, or has the same prediction as the true model.

Not always possible due to: the stochastic nature of the problem; noise in the training data; insufficient data.

Page 4:

Optimal Model

Loss function L(t,y) to evaluate performance.

Optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly: y* = argmin_y E[L(t,y)], where the true label t is drawn from P(t|x).

Examples

0-1 loss: y* is the label that appears the most often, i.e., if P(fraud|x) > 0.5, predict fraud.

Cost-sensitive loss: y* is the label that minimizes the “empirical risk”. If P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.

MSE or mean squared error: predict the average.
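The three examples on this slide can be sketched as one expected-loss minimization. This is a minimal illustration, not the authors' code: the function names, the example posterior, and the cost model in `fraud_loss` (predicting fraud always costs the overhead, missing a fraud loses the transaction amount) are assumptions chosen to match the $1000 / $90 figures above.

```python
from statistics import mean

def optimal_label(posterior, loss):
    """Return the label y minimizing the expected loss sum_t P(t|x) * loss(t, y)."""
    def expected_loss(y):
        return sum(p * loss(t, y) for t, p in posterior.items())
    return min(posterior, key=expected_loss)

def zero_one_loss(t, y):
    # 0-1 loss: a wrong label costs 1, a correct label costs 0.
    return 0.0 if t == y else 1.0

def fraud_loss(t, y, amount=1000.0, overhead=90.0):
    # Assumed cost model: predicting fraud always costs the overhead;
    # predicting normal loses the transaction amount when it was a fraud.
    if y == "fraud":
        return overhead
    return amount if t == "fraud" else 0.0

# Hypothetical posterior for one transaction x.
posterior = {"fraud": 0.1, "normal": 0.9}

label_01 = optimal_label(posterior, zero_one_loss)   # picks the most probable label
label_cost = optimal_label(posterior, fraud_loss)    # 0.1 * 1000 exceeds the 90 overhead
mse_prediction = mean([100.0, 150.0, 200.0])         # the mean minimizes squared error
```

With the same posterior, the 0-1 loss and the cost-sensitive loss pick different labels, which is why the loss function has to be stated before a prediction can be called optimal.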

Page 5:

How do we look for optimal models?

Don’t impose “exact forms”: decision trees, classification based on association rules, production rules.

The learner estimates structure as well as parameters.

NP-hard for most “model representations”.

Impose “exact forms”: logistic regression functions, linear regression models, etc.

Learners estimate parameters ONLY. The structure is pre-fixed: inductive bias.

The decision tree is a rather flexible, efficient, yet powerful representation.

Page 6:

Consider Decision Tree

Compromise between accuracy and model complexity.

We think that the simplest-structured hypothesis that fits the data is the best.

We employ all kinds of heuristics to look for it: info gain, gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning.

Reality: tractable, but still pretty expensive.

Truth: none of the purity check functions guarantees accuracy over testing data.

Page 7:

Random Decision Tree - classification, regression, probability estimation

Key characteristics: the structure is randomly picked; statistics are summarized from the training data.

At each node, an unused feature is chosen randomly.

A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.

A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

Page 8:

Continued

We stop when one of the following happens: a node becomes too small, or the total height of the tree exceeds some limit, such as the total number of features.
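The construction rules and stopping conditions can be sketched as a short recursive builder. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dict-based node layout, the `min_size` default, and drawing a continuous threshold uniformly between the node's minimum and maximum values are all choices made here; the slides only say that a different threshold is chosen each time.

```python
import random

def build_random_tree(rows, kinds, rng, used=frozenset(), depth=0,
                      min_size=2, max_depth=None):
    """rows: list of feature tuples; kinds[i] is 'discrete' or 'continuous'."""
    if max_depth is None:
        max_depth = len(kinds)  # height limit: the total number of features
    # Continuous features are always eligible; discrete ones only if unused on this path.
    candidates = [i for i, kind in enumerate(kinds)
                  if kind == "continuous" or i not in used]
    if len(rows) < min_size or depth >= max_depth or not candidates:
        return {"leaf": True, "rows": rows}
    f = rng.choice(candidates)  # random pick, no purity check
    if kinds[f] == "continuous":
        values = [r[f] for r in rows]
        threshold = rng.uniform(min(values), max(values))  # assumed threshold rule
        groups = {True: [], False: []}
        for r in rows:
            groups[r[f] < threshold].append(r)
        next_used = used
    else:
        threshold = None
        groups = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r)
        next_used = used | {f}  # a discrete feature is used at most once per path
    children = {key: build_random_tree(part, kinds, rng, next_used,
                                       depth + 1, min_size, max_depth)
                for key, part in groups.items()}
    return {"leaf": False, "feature": f, "threshold": threshold, "children": children}

def iter_leaves(tree, depth=0):
    """Yield (depth, rows) for every leaf of the tree."""
    if tree["leaf"]:
        yield depth, tree["rows"]
    else:
        for child in tree["children"].values():
            yield from iter_leaves(child, depth + 1)

# Hypothetical toy data: one continuous and one discrete feature.
rows = [(0.1, "a"), (0.4, "b"), (0.8, "a"), (0.9, "b"), (0.3, "a"), (0.6, "b")]
kinds = ["continuous", "discrete"]
tree = build_random_tree(rows, kinds, random.Random(0))
```

No split-quality function is evaluated anywhere in the builder, which is what keeps training cheap compared with info-gain or gini-based induction.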

Page 9:

Node Statistics

Classification and Probability Estimation: each node of the tree keeps the number of examples belonging to each class.

Regression: each node of the tree keeps the mean value of the examples sorted into the node.
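A minimal sketch of the two kinds of node statistics follows. The function names and the sample labels and target values are hypothetical; only the idea (class counts for classification, a mean for regression) comes from the slide.

```python
from collections import Counter
from statistics import mean

def classification_statistics(labels):
    # Number of training examples of each class that reached the node.
    return Counter(labels)

def regression_statistics(targets):
    # Mean target value of the training examples that reached the node.
    return mean(targets)

counts = classification_statistics(["P1", "P2", "P2", "P1", "P2"])
leaf_mean = regression_statistics([90_000.0, 100_000.0, 110_000.0])
```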

Page 10:

Classification/Prob Estimation

During classification, each tree outputs a posterior probability:

[Figure: example random tree. The root tests B1 < 0.5; lower nodes test B2 > 0.7 and B1 > 0.3, with Y/N branches. One leaf holds the class counts P1: 200, P2: 10; another holds P1: 30, P2: 70, which gives P(P1|x) = 0.3.]
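The posterior in the figure is the leaf's class counts normalized to sum to one. A small sketch (the function name is an assumption) using the counts shown in the figure:

```python
def posterior_from_counts(counts):
    """Turn the per-class example counts kept at a leaf into a posterior."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

leaf_a = posterior_from_counts({"P1": 30, "P2": 70})
leaf_b = posterior_from_counts({"P1": 200, "P2": 10})
```

Here `leaf_a["P1"]` is 30 / (30 + 70), the 0.3 shown on the slide.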

Page 11:

Regression

During prediction, each tree outputs the average value of the training examples that fall within the node reached by x.

[Figure: example random tree for regression. The root tests Age > 30; lower nodes test Capt > 70% and Edu = PhD, with Y/N branches. The two leaves shown store AvgAGI = 100K and AvgAGI = 150K.]

Page 12:

Classification

The predictions from multiple random trees are averaged as the final output.

Classification: a loss function is needed.
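Averaging across trees can be sketched as follows. The per-tree posteriors and regression outputs are made-up values, and the function names are assumptions; the final `max` implements the 0-1 loss decision, which is one possible choice of loss function.

```python
def average_posteriors(per_tree):
    """Average the posterior dicts produced by each tree for one example x."""
    labels = set().union(*per_tree)
    n = len(per_tree)
    return {label: sum(p.get(label, 0.0) for p in per_tree) / n
            for label in labels}

def average_regression(per_tree_values):
    """Average the leaf means produced by each tree for one example x."""
    return sum(per_tree_values) / len(per_tree_values)

# Hypothetical outputs of three random trees for the same example.
per_tree = [{"P1": 0.3, "P2": 0.7}, {"P1": 0.5, "P2": 0.5}, {"P1": 0.1, "P2": 0.9}]
averaged = average_posteriors(per_tree)

# Under 0-1 loss the decision is the label with the highest averaged probability.
decision = max(averaged, key=averaged.get)

avg_value = average_regression([100_000.0, 150_000.0, 125_000.0])
```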

Page 13:

A few words about some of its advantages

Training can be very efficient, particularly for very large datasets.

Natural multi-class probability.

Natural multi-label classification and probability estimation.

Imposes very little on the structure of the model.

Page 14:

Number of trees

Sampling theory: the random decision tree can be thought of as sampling from a large (infinite when continuous features exist) population of trees.

Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 are enough.

Worst scenario: only one feature is relevant. All the rest are noise. Probability:

Variance Deduction:
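The formulas for this slide did not survive in the transcript, so the following is not a reconstruction of them. It is a generic sampling-theory sketch under two explicit assumptions: each tree independently tests the relevant feature with some probability `p_single`, and the per-tree estimates are independent with a common variance `var_single`. Both parameter names are hypothetical.

```python
def prob_relevant_feature_seen(p_single, n_trees):
    """Chance that at least one of n_trees independent trees tests the relevant feature."""
    return 1.0 - (1.0 - p_single) ** n_trees

def variance_of_average(var_single, n_trees):
    """Variance of the mean of n_trees independent estimates with equal variance."""
    return var_single / n_trees

# Illustrative values only.
p_10_trees = prob_relevant_feature_seen(0.5, 10)
var_30_trees = variance_of_average(0.09, 30)
```

Under these assumptions the miss probability shrinks geometrically and the variance shrinks as 1/N, which is consistent with the slide's remark that a few tens of trees are usually sufficient.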

Page 15:

Donation Dataset - classification and prob estimation

Decide to whom to send a charity solicitation letter.

It costs $0.68 to send a letter.

Loss function
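The loss function itself is not preserved in the transcript. One plausible decision rule, stated here as an assumption rather than as the authors' formula, is to mail a letter only when the expected donation exceeds the $0.68 mailing cost. The function name and the example probabilities and amounts are hypothetical.

```python
LETTER_COST = 0.68  # mailing cost per letter, from the slide

def should_solicit(p_donate, expected_donation, letter_cost=LETTER_COST):
    # Assumed rule: mail only if the expected return exceeds the mailing cost.
    return p_donate * expected_donation > letter_cost

likely_donor = should_solicit(0.05, 20.0)    # expected return of about $1.00
unlikely_donor = should_solicit(0.02, 20.0)  # expected return of about $0.40
```

A rule of this shape needs both a probability estimate and an estimate of the donation amount, which is where probability estimation and regression meet on this dataset.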

Page 16:

Result

Page 17:

Credit Card Fraud - classification and prob estimation

Detect if a transaction is a fraud.

There is an overhead to detect a fraud: {$60, $70, $80, $90}.

Loss function

Page 18:

Result

Page 19:

Comparing with Boosting

Does not handle multi-class problems naturally (ECOC).

Does not output probabilities.

Inefficient.

Choosing the number of boosting rounds is tricky. Sometimes, more rounds can lead to overfitting.

Implementation needs careful numerical manipulation.

Page 20:

Comparing with Bagging

Could be very inefficient, particularly for very large datasets, i.e., bootstrap sampling needs a linear scan of the data.

Does not output reliable probabilities.

Page 21:

Probability Estimation

Page 22:

Probability Estimation

Page 23:

Overfitting

Page 24:

Non-overfitting of RDT

Page 25:

Selectivity

Page 26:

Tolerance to data insufficiency

Page 27:

GUIDE

[Figure: example GUIDE tree. The root tests Age > 30; lower nodes test Capt > 70% and Edu = PhD, with Y/N branches. Each leaf holds a multiple linear regression (MLR) model.]

MLR: y = a + a1*x1 + a2*x2 + … + ak*xk
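To make the contrast with an RDT leaf concrete: a GUIDE leaf evaluates a linear model on the example's independent variables, while an RDT leaf returns a stored mean. The sketch below only evaluates the MLR formula from the slide; the function name and the coefficient values are hypothetical, and fitting the coefficients is out of scope.

```python
def mlr_predict(intercept, coefficients, x):
    """Evaluate y = a + a1*x1 + ... + ak*xk for one example x."""
    if len(coefficients) != len(x):
        raise ValueError("need one coefficient per independent variable")
    return intercept + sum(a * v for a, v in zip(coefficients, x))

# Hypothetical leaf model with k = 2 independent variables.
y_hat = mlr_predict(1.0, [2.0, 0.5], [3.0, 4.0])
```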

Page 28:

Regression: single independent variable

Page 29:

RDT

Page 30:

Depend on combination of 5 independent variables

Page 31:

RDT

Page 32:

It grows like …

Page 33:

Comparing with GUIDE

Need to decide grouping variables and independent variables: a non-trivial task.

If all variables are categorical, GUIDE becomes a single CART regression tree.

Strong assumption and greedy-based search. Sometimes this can lead to very unexpected results, like the one given earlier.

Page 34:

Conclusion

Imposing a particular form of model is not a good idea for training highly accurate models.

It may not even be efficient for some forms of models.

RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.

Page 35:

Selected Bibliography of RDT

ICDM’03: “Is random model better? On its accuracy and efficiency” (Fan, Wang, Yu, and Ma)

AAAI’04: “On the Optimality of Posterior Probability Estimation by Random Decision Tree” (Fan)

ICDM’05: “Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches” (Fan, Greengrass, McCloskey, Yu, and Drummey)

ICDM’05: “Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees” (Zhang, Buckles, Peng, and Xu)

Master’s thesis by Tony Liu, supervised by Kai Ming Ting: “The Utility of Randomness in Decision Tree Construction”, Monash University, 2005

KDD’06: “A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees”