Random Forests: The Vanilla of Machine Learning - Anna Quach
TRANSCRIPT
Welcome to my talk!
I’m currently a PhD student at Utah State University working under Dr. Adele Cutler.
Do we need hundreds of classifiers to solve real-world classification problems?
The study evaluated 179 classifiers on 121 data sets from the University of California, Irvine (UCI) database (excluding large-scale problems), plus the authors’ own data. Overall, Random Forests performed the best in terms of accuracy!
See the paper here: http://www.jmlr.org/papers/v15/delgado14a.html
Random Forests wins Kaggle competitions
http://blog.kaggle.com/2012/05/01/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon/
Random Forests: A seminal paper!
https://scholar.google.com/citations?user=mXSv_1UAAAAJ&hl=en&oi=ao
Random Forests
The (very theoretical) paper can be found here:
https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
The inventors of Classification and Regression Trees (CART)
(a) Leo Breiman
(b) Jerome Friedman
(c) Charles J. Stone
(d) Richard A. Olshen
CART is actually published as a book.
https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418
Building a Classification Tree - Predicting Fake Likers
[Scatterplot: “Predict Fake Facebook Likes” - Category Entropy vs. Average Verified Page Likes, points colored by Facebook Like class (Real, Fake)]
Download the data here: http://digital.cs.usu.edu/~kyumin/data/likers.html
First split
Category Entropy: −∑_{i=1}^{k} (n_i / N) log(n_i / N), where n_i is the number of liked pages under category i, and N is the total number of pages liked by user u.
Average Verified Page Likes: the average proportion of verified pages liked out of the total number of pages liked by a user.
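As a small worked example of the entropy formula above (the counts and the function name are hypothetical, not from the talk’s data set), category entropy can be computed in R as:

```r
# Category entropy for one user: likes[i] = n_i, the number of liked
# pages in category i; N = sum(likes) is the user's total page likes.
category_entropy = function(likes) {
  N = sum(likes)
  p = likes / N
  -sum(p * log(p))
}

likes = c(News = 5, Sports = 3, Music = 2)   # made-up counts for one user
category_entropy(likes)   # ≈ 1.0297
```

Higher values mean the user’s likes are spread more evenly across categories; a user who piles all likes into one category has entropy 0.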
Second split
Third split
Code to build a classification tree
library(rpart)
library(rpart.plot)

data = read.csv("FakeLiker-dataset.csv")
colnames(data)[9] = "Entropy"
colnames(data)[10] = "Average"

levels(data$Class) = c("Real", "Fake")
cols = c("lightblue", "orange")[data$Class]

ctree = rpart(Class ~ Entropy + Average, data)
prp(ctree,
    extra = 1,
    box.palette = c("lightblue", "orange"))
References on recursive partitioning (rpart) and rpart.plot
A good reference for understanding how CART works, along with example code for the rpart R package, can be found here:
https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
and a guide with plenty of examples for plotting nice trees can be found here:
http://www.milbo.org/rpart-plot/prp.pdf
A visual introduction to a decision tree (one of the 10 award-winning science visualizations from the 2016 Vizzies)
Learn how a classification tree is built interactively here: http://www.popsci.com/how-machine-learning-works-interactive
Bagging (Bootstrap Aggregating)
Fit each tree to a bootstrap sample (a random sample with replacement) from the data, and combine the trees by voting (classification) or averaging (regression).
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
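A minimal sketch of bagging with rpart trees follows. The toy two-class data built from iris is an assumption for illustration; note that Random Forests goes further than plain bagging by also sampling candidate variables at each split.

```r
library(rpart)

# Toy two-class data (setosa vs. the rest), standing in for a real problem.
set.seed(1)
toy = iris
toy$Class = factor(ifelse(iris$Species == "setosa", "A", "B"))
toy$Species = NULL

# Fit each tree to a bootstrap sample (random sample with replacement).
n_trees = 25
trees = lapply(1:n_trees, function(b) {
  boot = toy[sample(nrow(toy), replace = TRUE), ]
  rpart(Class ~ ., boot)
})

# Combine by voting: each tree predicts, and the majority class wins.
votes = sapply(trees, function(t) as.character(predict(t, toy, type = "class")))
bagged = apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == toy$Class)   # training accuracy of the bagged ensemble
```

For regression, the voting step would be replaced by averaging the trees’ numeric predictions.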
The powers of Random Forests!
Random Forests is applicable to a wide variety of problems. Here are some of its features:
- Classification and regression
- Ranking important features (most widely used)
- Imputing missing values
- Local variable importance (underused)
- Handles unbalanced classes
- Naturally fits interactions
- Does not overfit as you add more trees
- Detects patterns using proximities (underused)
- Requires little tuning! There are two possible parameters to tune: mtry, and depth for regression
Original Implementation of Random Forests is in Fortran
(a) Leo Breiman (b) Adele Cutler
Good documentation of the capabilities of Random Forests, along with the Fortran code, can be found here: https://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests is a trademark
The commercial version of Random Forests, as well as videos about Random Forests, can be found here: https://www.salford-systems.com/products/randomforests
Salford Systems provides a user guide on how to use Random Forests in their software, Salford Predictive Modeler (SPM). Find the user guide here: http://media.salford-systems.com/pdf/spm7/RandomForestsModelBasics.pdf
randomForest - the first Random Forests package in R
https://cran.r-project.org/web/packages/randomForest/
Variable Importance
[Variable importance plots titled “Rank of Important Features”: variables ranked by MeanDecreaseAccuracy (left) and MeanDecreaseGini (right); Entropy and Years_Active rank at the top of both]
Variable Importance Definition
Random Forests computes two measures of variable importance:
1. Permutation Importance (Mean Decrease in Accuracy) is permutation based.
For each tree, randomly permute the values of a variable for the out-of-bag cases and pass the permuted data down the tree. The permutation importance of a variable is the average of (error rate with the variable permuted) - (error rate with no permutation) over all the trees.
2. Gini Importance (Mean Decrease in Gini) is Gini based, for classification.
For each variable, it is the average of (Gini impurity of the parent node) - (Gini impurity of the child nodes) over all the trees in the forest.
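The permutation idea can be sketched model-agnostically. For brevity this uses a single rpart tree on iris and measures error on the full training set, which is an assumption of the sketch; the forest does this per tree, using only each tree’s out-of-bag cases.

```r
library(rpart)

set.seed(1)
fit = rpart(Species ~ ., iris)
base_err = mean(predict(fit, iris, type = "class") != iris$Species)

# Permute one variable to break its association with the response,
# then measure how much the error rate grows.
permuted = iris
permuted$Petal.Width = sample(permuted$Petal.Width)
perm_err = mean(predict(fit, permuted, type = "class") != iris$Species)

perm_err - base_err   # a larger increase means a more important variable
```

Permuting a variable the model never uses leaves the error essentially unchanged, which is why unimportant variables score near zero.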
randomForest code
The Random Forest can be built and the important variables displayed using the following code in R:
library(randomForest)
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  ntree = 500)

varImpPlot(rf,
           scale = FALSE,
           main = "Rank of Important Features")
Determining how many trees to use
Plotting the fitted forest shows the out-of-bag error rate as trees are added; once the error curve flattens out, adding more trees gains little.

plot(rf,
     main = "",
     ylim = c(0.05, 0.25),
     col = c("black", "lightblue", "orange"))
Local Variable Importance
For each tree, randomly permute the values of a variable for the out-of-bag cases and pass the permuted data down the tree. The local variable importance for case i and variable j is the average of (error rate with the variable permuted) - (error rate with no permutation) over all the trees.
Local Variable Importance example on detecting fake Facebook likes
[Parallel coordinate plot of local importance for the top five variables: Entropy, Years_Active, About_Count, Average, and Post_Frequency_per_day, colored by class]
Code to extract local variable importance
library(MASS)          # for parcoord()
library(randomForest)

rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  localImp = TRUE,
                  proximity = TRUE,
                  ntree = 500)

# Top five variables by mean decrease in accuracy (column 3 of the
# importance matrix for a two-class problem).
impv = names(sort(rf$importance[, 3],
                  decreasing = TRUE))[1:5]

parcoord(t(rf$localImportance)[, impv],
         col = cols,
         var.label = TRUE)
Proximities
Proximity in Random Forests is defined as the proportion of the time two observations (both in the out-of-bag sample) end up in the same terminal node. The proximity measures can be visualized using Multidimensional Scaling (MDS) plots. Using the MDS plot we can learn more about our data:
- identify characteristics of unusual points
- find clusters within classes
- see which classes are overlapping
- see which classes differ
- see which variables are locally important
Visualizing the Proximities
[MDS plots of the proximities: MDS 1 vs. MDS 2, MDS 1 vs. MDS 3, and MDS 2 vs. MDS 3]
Code to extract the proximities
# Classical MDS on 1 - proximity, keeping the first three coordinates.
scalerf = cmdscale(1 - rf$prox, eig = TRUE, k = 3)$points

plot(scalerf[, 1], scalerf[, 2], col = cols,
     xlab = "MDS 1", ylab = "MDS 2",
     xlim = c(-0.5, 0.5),
     ylim = c(-0.5, 0.5),
     xaxt = "n",
     yaxt = "n")
Local Variable Importance in interactive plots
We can find interesting patterns using an interactive plot.
Read more about irfplot (interactive random forests plots) here: http://digitalcommons.usu.edu/gradreports/134/
Brushing in interactive plots
randomForest
A short paper on the randomForest package can be found here: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf
Random Forests presentations by Dr. Adele Cutler
A more comprehensive set of notes on Random Forests by Dr. Adele Cutler can be found here:
http://www.math.usu.edu/adele/RandomForests/UofU2013.pdf
http://www.math.usu.edu/adele/RandomForests/Ovronnaz.pdf
Current Research
http://www.amstat.org/meetings/wsds/2016/onlineprogram/AbstractDetails.cfm?AbstractID=303499
Current Research - Improving the interpretation of Random Forests
https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314849
Remembering Leo Breiman
1928 – 2005
Read more about Leo Breiman’s life’s work in the article written by Dr. Adele Cutler: https://arxiv.org/pdf/1101.0917.pdf
Contact Information
Additional questions regarding Random Forests can be emailed to me at [email protected].