Random Forests Ujjwol Subedi

Posted on 19-Jan-2016

Page 1

Random Forests

Ujjwol Subedi

Page 2

Introduction

What is a Random Tree?

◦ A tree constructed at random from the set of possible trees having K random features at each node.
◦ "At random" means the trees have a uniform distribution: each possible tree has an equal chance of being sampled.
◦ Random trees can be generated efficiently, and the combination of large sets of random trees generally leads to accurate models.

Page 3

Decision trees

Decision trees are predictive models that use a set of binary rules to calculate a target value.

There are two types of decision trees:
◦ Classification trees, used to predict a categorical target value.
◦ Regression trees, used to predict a continuous target value.

Page 4

Here is a simple example of a decision tree.
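The slide's tree diagram is not reproduced here. As a stand-in, a minimal sketch of a decision tree written out as binary rules, using a hypothetical play-ball weather example (the attribute names and values are illustrative, not taken from the slide):

```python
# A decision tree expressed as nested binary rules (hypothetical weather data).
def play_ball(outlook, humidity, wind):
    """Classify whether to play ball from three weather attributes."""
    if outlook == "sunny":
        # Sunny days: the decision hinges on humidity.
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"  # Overcast days are always playable.
    # Rainy days: the decision hinges on wind.
    return "no" if wind == "strong" else "yes"

print(play_ball("sunny", "high", "weak"))       # -> no
print(play_ball("overcast", "high", "strong"))  # -> yes
```

Each `if` is one binary rule; a path from the root to a `return` is one branch of the tree.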

Page 5

Definition

Random forests were first developed by Leo Breiman. A random forest is a group of un-pruned classification or regression trees built from random samples of the training data. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of overcoming the over-fitting problem of individual decision trees. In other words, random forests are an ensemble learning method for classification and regression that operates by constructing many decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.

Page 6

Breiman reported that random forests do not over-fit as more trees are added: you can run as many trees as you want. The method is also fast: running on a data set with 50,000 cases and 100 variables, it produced 100 trees in 11 minutes on an 800 MHz machine. For large data sets, the major memory requirement is the storage of the data itself, plus three integer arrays with the same dimensions as the data. If proximities are calculated, storage requirements grow as the number of cases times the number of trees.

Page 7

How does a random forest work?

Each tree is grown as follows:

1. Random record selection: each tree is trained on roughly 2/3 of the total training data. Cases are drawn at random with replacement from the original data; this bootstrap sample is the training set for growing the tree.

2. Random variable selection: some number m of predictor variables is selected at random out of all the predictor variables, and the best split on these m variables is used to split the node.

3. For each tree, the left-over data are used to calculate its misclassification rate, the out-of-bag (OOB) error rate; the errors from all trees are aggregated to determine the overall OOB error rate for the classification.

Page 8

4. Each tree gives a classification, and we say that the tree "votes" for that class. The forest chooses the classification having the most votes.

For example: if 500 trees are grown, 400 of them predict that a particular pixel is forest, and 100 predict that it is grass, then the predicted output for that pixel will be forest.
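The vote count in this example takes only a few lines; `collections.Counter` simply picks the modal class:

```python
from collections import Counter

# 500 trees vote on a pixel's class: 400 say "forest", 100 say "grass".
votes = ["forest"] * 400 + ["grass"] * 100
prediction = Counter(votes).most_common(1)[0][0]  # class with the most votes
print(prediction)  # -> forest
```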

Page 9

Algorithm

Let the number of training cases be N and the number of variables in the classifier be M. A number m of input variables is used to determine the decision at a node of the tree; m << M.

Choose the training set for a tree by sampling N times with replacement from all N available training cases. Use the remaining cases to estimate the error of the tree by predicting their classes.

For each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split based on these variables in the training set.

Each tree is fully grown and not pruned.
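The algorithm can be sketched end-to-end in pure Python. This is a deliberately tiny illustration, not Breiman's implementation: each "tree" here is grown only one level deep (a stump), m = 1, and the data set is made up:

```python
import random
from collections import Counter

def train_forest(X, y, n_trees=25, m=1, seed=0):
    """Grow a forest of one-split trees (stumps) on bootstrap samples."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Choose the training set by sampling N times with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # Randomly choose m of the M variables to base the decision on.
        feats = rng.sample(range(n_features), m)
        best = None
        for f in feats:
            for v in set(row[f] for row in Xb):
                left = [yb[i] for i, row in enumerate(Xb) if row[f] == v]
                right = [yb[i] for i, row in enumerate(Xb) if row[f] != v]
                # Score: samples correctly labeled by each side's majority class.
                score = (max(Counter(left).values()) + max(Counter(right).values())
                         if left and right else 0)
                if best is None or score > best[0]:
                    lab_l = Counter(left).most_common(1)[0][0] if left else yb[0]
                    lab_r = Counter(right).most_common(1)[0][0] if right else yb[0]
                    best = (score, f, v, lab_l, lab_r)
        forest.append(best[1:])  # (feature, split value, label if ==, label if !=)
    return forest

def predict(forest, row):
    """Each tree votes; the forest returns the class with the most votes."""
    votes = [l if row[f] == v else r for f, v, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data: both features perfectly separate the two classes.
X = [[0, 0], [1, 1]] * 5
y = ["no", "yes"] * 5
forest = train_forest(X, y)
print(predict(forest, [1, 1]))  # -> yes
print(predict(forest, [0, 0]))  # -> no
```

A real implementation grows each tree fully (recursing on each side of the split, re-sampling m variables at every node) rather than stopping at one split, but the bootstrap-then-random-subspace structure is the same.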

Page 10

Pros and Cons

The advantages of random forests:

◦ It is one of the most accurate learning algorithms available; for many data sets it produces a highly accurate classifier.
◦ It runs efficiently on large data sets.
◦ It can handle thousands of input variables without variable deletion.
◦ It gives estimates of which variables are important in the classification.
◦ It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
◦ It computes proximities between pairs of cases that can be used in clustering and in locating outliers.

Page 11

Pros and Cons contd.

Disadvantages:

◦ Random forests have been observed to over-fit on some data sets with noisy classification/regression tasks.
◦ For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels.

Page 12

Parameters

When running random forests, a number of parameters must be specified. The most common are:

◦ The input training data, including the predictor variables.
◦ The number of trees to build.
◦ The number of predictor variables used to create the binary rule at each split.
◦ Parameters controlling the calculation of error and variable-importance information.

Page 13

Terminologies related to random forest algorithm

Bagging (Bootstrap Aggregating)

◦ Generates m new training data sets, each picking a sample of observations with replacement. Then m models are fitted on the m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B: sample, with replacement, n training examples from X, Y; call these Xb, Yb, and train a classification or regression tree fb on Xb, Yb.

After training, the prediction for an unseen sample x' is made by averaging the predictions of the individual regression trees, f̂(x') = (1/B) Σb fb(x'), or by taking the majority vote in the case of classification trees.
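The b = 1, ..., B loop can be sketched directly. The "tree" learner below is a trivial stand-in that just predicts the mean of its bootstrap sample; the data and names are illustrative:

```python
import random

def bagging_fit(X, y, B, fit_model, seed=0):
    """Fit B models; model b is trained on a bootstrap sample (Xb, Yb)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit_model([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Regression: average the B individual predictions on x'."""
    preds = [f(x) for f in models]
    return sum(preds) / len(preds)

# Stand-in learner: "fits" a constant predictor equal to the sample mean of y.
mean_learner = lambda Xb, yb: (lambda x: sum(yb) / len(yb))

models = bagging_fit([[0], [1], [2], [3]], [1.0, 3.0, 5.0, 7.0],
                     B=10, fit_model=mean_learner)
print(bagging_predict(models, [2]))
```

Swapping `mean_learner` for a tree learner that also samples m features at each node turns plain bagging into a random forest.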

Page 14

Terminologies contd.

Out-of-Bag error rate

◦ As the forest is built, each tree is tested on the roughly 1/3 of the samples not used in building that tree. This is the out-of-bag (OOB) error estimate, an internal error estimate of a random forest.

Bootstrap sample

◦ A sample drawn at random with replacement from the training data.

Proximities

◦ These are one of the most useful tools in random forests. The proximities form an N×N matrix. After a tree is grown, put all of the data, both training and out-of-bag, down the tree. If cases k and n are in the same terminal node, increase their proximity by one. At the end, normalize the proximities by dividing by the number of trees.
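A sketch of this bookkeeping, assuming we already know which terminal node each case lands in for each tree (the leaf ids below are made up):

```python
def proximities(leaf_ids_per_tree, n_cases):
    """leaf_ids_per_tree[t][k] = terminal node of case k in tree t.
    Returns the N x N proximity matrix, normalized by the number of trees."""
    P = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids_per_tree:
        for k in range(n_cases):
            for j in range(n_cases):
                if leaves[k] == leaves[j]:   # same terminal node
                    P[k][j] += 1.0
    T = len(leaf_ids_per_tree)
    return [[v / T for v in row] for row in P]

# Two trees, three cases: cases 0 and 1 share a terminal node in both trees.
P = proximities([[0, 0, 1], [2, 2, 3]], 3)
print(P[0][1])  # -> 1.0
print(P[0][2])  # -> 0.0
```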

Page 15

Missing Values

Missing data imputation:

Fast way: replace missing values for a given variable with the median of the non-missing values (or the most frequent value, if categorical).

Better way (using proximities):
1. Start with the fast way.
2. Get proximities.
3. Replace missing values in case i by a weighted average of the non-missing values, with weights proportional to the proximity between case i and the cases with non-missing values.
Repeat steps 2 and 3 a few times (5 or 6).
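Step 3, the proximity-weighted average, can be sketched as follows; the proximity matrix here is made up for illustration:

```python
def impute(values, prox, i):
    """Fill case i's missing value with a proximity-weighted average of
    the non-missing values (values[j] is None when case j is missing)."""
    num = sum(prox[i][j] * v for j, v in enumerate(values)
              if v is not None and j != i)
    den = sum(prox[i][j] for j, v in enumerate(values)
              if v is not None and j != i)
    return num / den if den else None

# Case 2 is missing; it is close to case 0 (proximity 0.75) and far from case 1.
prox = [[1.0, 0.2, 0.75],
        [0.2, 1.0, 0.25],
        [0.75, 0.25, 1.0]]
print(impute([10.0, 50.0, None], prox, 2))  # -> 20.0
```

The imputed value leans toward case 0's value because their proximity is three times larger.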

Page 16

Variable importance

RF computes two measures of variable importance: one based on a rough-and-ready impurity measure (Gini, for classification) and the other based on permutations.
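The permutation measure can be sketched generically: shuffle one feature's column and see how much a fitted model's accuracy drops (the model and data below are stand-ins, not part of the slide):

```python
import random

def permutation_importance(model, X, y, feature, seed=0):
    """Accuracy drop when one feature's column is randomly shuffled."""
    def accuracy(rows):
        return sum(model(r) == t for r, t in zip(rows, y)) / len(y)
    rng = random.Random(seed)
    col = [row[feature] for row in X]
    rng.shuffle(col)  # break the feature's association with the target
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, col)]
    return accuracy(X) - accuracy(X_perm)

# Toy model that uses only feature 0; feature 1 is pure noise.
model = lambda r: "yes" if r[0] == 1 else "no"
X = [[0, 5], [1, 2], [0, 9], [1, 7], [0, 1], [1, 4]]
y = ["no", "yes", "no", "yes", "no", "yes"]
print(permutation_importance(model, X, y, 0))  # shuffling feature 0 can hurt
print(permutation_importance(model, X, y, 1))  # -> 0.0 (model ignores it)
```

In a random forest, the same idea is applied to each tree's out-of-bag cases and averaged over the trees.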

Page 17

Example

This tree advises us, based on weather conditions, whether to play ball.

Page 18

Example contd.

The random forest takes this notion to the next level by combining it with the notion of an ensemble.

Page 19

Results and Discussion

Here, classification results are compared between J48 and the random forest.

Page 20

Results and discussion contd.

The table shows the precision, recall, and F-measure for the random forest and J48 on the 20 data sets.

Page 22