[Course roadmap slide: multivariate techniques organized by objective – reduce complexity; analyze relationships (predictions direct/indirect vs. constrained); classification; analyze treatment effects; prediction. Classic techniques: PCA, PCA + 2nd set of vectors, CANCOR, CANDISC, DISCRIM, MANOVA. Modern counterparts: NMDS, NMDS + 2nd set of vectors, CCA, RDA, dbRDA*, MRT, CLUSTER, MRPP, permMANOVA/permANOVA, CART, RF. This lecture covers the prediction techniques: CART, MRT, and Random Forest.]
* Alternative technique not covered in this class
Classification and Regression Trees (CART)
Multivariate Fundamentals: Prediction
[Example regression tree: successive splits on MCMT (>= / < -30.85, >= / < -25.8) and MWMT (< / >= 16.45); each node and leaf is labelled with its mean species frequency and sample size (root: 0.674, n=103; leaves range from 0.172, n=64 to 4.15, n=4). Error: 0.214, CV Error: 0.412, SE: 0.122]
Objective: Determine what drives relationships between response and predictor variables in more detail (unimodal or bimodal relationships)
CART is Univariate
MRT is Multivariate
We aim to answer: “What distinguishes my groups within my predictor variables?”
Classification and Regression Trees can use both categorical and continuous numeric response variables
– If the response is categorical, a classification tree is used to identify the "class" into which a target observation would likely fall
– If the response is continuous, a regression tree is used to predict its value
– We cover both in Lab 8
Also referred to as decision trees
If we can specifically determine what drives a relationship, we can use that information to predict a response under new conditions
The math behind CART (and MRT in multivariate space)
Consider: “What drives species frequency?”
MAT: first, try looking at an ordination
When the relationship is not linear, ordinations do not work out cleanly. E.g. two species both have low frequency but very different MAT thresholds – so where do I draw the arrow to capture this information?
[Scatterplot: species frequency vs. MAT (°C, roughly 1–7)]
The math behind CART
Alternatively we can build a decision tree to better define and illustrate the species frequency-temperature relationship
Think of this as a cluster analysis where splits are constrained by environmental variables (like in Constrained Gradient Analysis)
[Decision-tree illustration on the same data: a first split at MAT = 2°C and a second at MAT = 6°C partition the plots into Low / High / Low frequency groups. Internal split points are nodes; terminal groups are leaves. Axes: species frequency vs. MAT (°C)]
The math behind CART
CART is an iterative top-down process that aims to minimize within-group variation
To start the tree, CART empirically investigates various thresholds in various predictor variables to find the first split of the response dataset that minimizes variation within groups (like Cluster Analysis)
However, unlike Cluster Analysis, the external variables (the predictors) are imposed as a constraint to create the clusters
E.g. Using environmental thresholds to create clusters of inventory plots with similar species composition
The process then repeats for the two sub-groups, until no significant amount of additional variation can be explained by further splits. A minimal sketch of this split search is shown below.
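To make the split criterion concrete, here is a minimal sketch (not mvpart's actual algorithm) that brute-forces the best single threshold in one predictor by minimizing the within-group sum of squares; the variable names and toy data are hypothetical:

# Brute-force search for the best split threshold in one predictor:
# try every observed value as a cut point and keep the one that
# minimizes the within-group sum of squares (toy example)
best_split <- function(x, y) {
  thresholds <- sort(unique(x))[-1]  # drop the minimum so both sides are non-empty
  sse <- sapply(thresholds, function(t) {
    left  <- y[x <  t]
    right <- y[x >= t]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  thresholds[which.min(sse)]
}

set.seed(1)
MAT  <- runif(50, 1, 7)                               # toy mean annual temperature
freq <- ifelse(MAT >= 2 & MAT < 6, 8, 1) + rnorm(50)  # unimodal species frequency
best_split(MAT, freq)                                 # recovers a threshold near 2 or 6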
CART in R
There are other R packages that build univariate Classification (categorical) and Regression (numeric) Trees – e.g. tree and rpart
To simplify for this class we will use the package mvpart which is primarily designed to execute Multivariate Regression Trees (MRT), but can handle CART as well
CART in R: library(mvpart)
mvpart(ResponseVariable ~ EquationOfPredictors, data = predictorData, xv = "p", all.leaves = TRUE) (mvpart package)
ResponseVariable: vector of the response variable (univariate)
Equation of predictors: Variable1 to include a single predictor; Variable1 + Variable2 to include multiple predictors
To run CART you need to install the mvpart package
predictorData: data table of your predictor variables, e.g. environmental variables
Turn on the option all.leaves=TRUE to print the number of observations and the average frequency at each node and leaf
Specifying xv="p" allows you to interactively pick the tree size you want to generate
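A minimal example of the call, assuming a hypothetical data frame plots holding a numeric species frequency (SpeciesFreq) and climate predictors:

library(mvpart)  # install first (archived on CRAN)

cart_model <- mvpart(SpeciesFreq ~ MAT + MCMT + MWMT + MSP,
                     data = plots,
                     xv = "p",          # interactively pick the tree size
                     all.leaves = TRUE) # show n and the mean at every node/leaf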
CART in R
Picking the tree size is a good option to specify because it allows you to pick the best tree – one that includes well supported splits explaining a significant portion of the variation
By specifying xv="p", R will generate a scree plot for decision guidance:
– Size of tree: the number of splits
– Green line: equivalent to the "variance explained by each split" statistic
– Blue line: cross-validated prediction performance associated with the splits
– Red line: minimum cross-validated relative error plus one standard error
– Orange mark: well supported splits that explain sufficient variation
– Red mark: reasonably well supported splits explaining some additional variation
You should pick a tree size under the red line, between the orange and red marks
The bigger the tree, the finer the breakdown among data points – you have to decide how far you want to break down your data (you might go too far and split apart groupings that you want to keep)
If you don't specify xv="p" in your mvpart statement, the tree size at the orange mark will be used
CART in R: R will output a regression tree
If you have a big tree the image will be crowded (this is a problem in mvpart), so save the image as an enhanced metafile (an option in the save dialog)
You can then import the .emf file into PowerPoint, ungroup it twice, and move the labels around to make the image more legible and publishable
[Same regression tree as above: splits on MCMT and MWMT; each node and leaf labelled with its mean species frequency and n. Error: 0.214, CV Error: 0.412, SE: 0.122]
We build a model to look at a single species frequency with 5 predictor variables
The number of data points that fall into this group (e.g. n = 64 data points)
The average species frequency for the group (e.g. 4.15%)
Predictor variable associated with data split
Errors associated with the tree size:
Error: residual error – how much variation is not explained by the tree
CV Error: summarized cross-validated relative error across all predictors (near zero for a perfect predictor, close to one for a poor predictor)
You want small values for both!
CART in R: We build a model to look at a single species frequency with 5 predictor variables
The variation explained by each split
For each node (and leaf) details about the split are provided:
• Number of observations used
• Mean of the group
• Mean square error of the group
• How many observations are divided into each side of the split
• Primary splits (potential alternative predictors)
Improvement values indicate how much variation would be explained if that split were based on an alternative variable
If the improvement value at a split is the same for a different predictor variable, the alternative predictor variable could equally be used to explain the groupings
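These split details live in the model summary; a sketch assuming the cart_model object fitted earlier (mvpart builds on rpart, so printcp() should also work):

summary(cart_model)  # node-by-node means, MSE, primary (alternative) splits with improvement values
printcp(cart_model)  # complexity table: error and CV error for each tree size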
Multivariate Regression Trees (MRT)
Multivariate Fundamentals: Prediction
[Multivariate regression tree figure for six species (ABIELAS, PICEENG, PINUCON, PICEGLA, POPUTRE, PINUBAN): splits on MCMT (>= / < -14.85), MAT (0.6, -1.3, -0.4), MWMT (16, 15.9), and MSP (304.5, 332); each node and leaf is labelled value : n (root: 810 : n=136; leaves range from 0.0026 : n=8 to 13.8 : n=24). Error: 0.053, CV Error: 0.102, SE: 0.0313]
Objective: Determine what drives relationships between multiple response and predictor variables in more detail (unimodal or bimodal relationships)
Just like CART, but multivariate space
We aim to answer: “What distinguishes my groups within my predictor variables?”
Like CART, MRT can use both categorical and continuous numeric response variables
If we can specifically determine what drives a relationship, we can use that information to predict a response under new conditions
MRT in R
MRT in R: library(mvpart)
mvpart(ResponseMatrix ~ EquationOfPredictors, data = predictorData, xv = "p", all.leaves = TRUE) (mvpart package)
ResponseMatrix: matrix of response variables, e.g. frequencies for multiple species
Equation of predictors: Variable1 to include a single predictor; Variable1 + Variable2 to include multiple predictors
To run MRT you need to install the mvpart package
predictorData: data table of your predictor variables, e.g. environmental variables
Turn on the option all.leaves=TRUE to print the number of observations and the average frequency at each node and leaf
Specifying xv="p" allows you to interactively pick the tree size you want to generate
To make the MRT output easier to interpret, response variables should be normalized prior to conducting the MRT analysis
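A sketch of the multivariate call, assuming a hypothetical species-frequency matrix species and predictor data frame plots:

library(mvpart)

species_norm <- scale(species)  # normalize so abundant species do not dominate the splits

mrt_model <- mvpart(data.matrix(species_norm) ~ MAT + MCMT + MWMT + MSP,
                    data = plots,
                    xv = "p",
                    all.leaves = TRUE)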
MRT tree size (same as CART)
Picking the tree size is a good option to specify because it allows you to pick the best tree – one that includes well supported splits explaining a significant portion of the variation
By specifying xv="p", R will generate a scree plot for decision guidance:
– Size of tree: the number of splits
– Green line: equivalent to the "variance explained by each split" statistic
– Blue line: cross-validated prediction performance associated with the splits
– Red line: minimum cross-validated relative error plus one standard error
– Orange mark: well supported splits that explain sufficient variation
– Red mark: reasonably well supported splits explaining some additional variation
You should pick a tree size under the red line, between the orange and red marks
The bigger the tree, the finer the breakdown among data points – you have to decide how far you want to break down your data (you might go too far and split apart groupings that you want to keep)
If you don't specify xv="p" in your mvpart statement, the tree size at the orange mark will be used
MRT in R
Each leaf on the tree has a barplot associated with it that shows how the response variables respond together in each group, e.g. species frequencies in each community group
Because the data are normalized, the bars above the line are the species driving the split
Unfortunately you CANNOT change the colours of these barplots in R – you will have to modify everything in PowerPoint
[Same MRT figure as above, with a species-frequency barplot (ABIELAS, PICEENG, PINUCON, PICEGLA, POPUTRE, PINUBAN) at each leaf. Error: 0.053, CV Error: 0.102, SE: 0.0313]
The number of data points that fall into this group (e.g. n = 40 data points)
Predictor variable associated with data split
Errors associated with the tree size:
Error: residual error – how much variation is not explained by the tree
CV Error: summarized cross-validated relative error across all predictors (near zero for a perfect predictor, close to one for a poor predictor)
You want small values for both!
MRT in R
[Same MRT figure as above. Error: 0.053, CV Error: 0.102, SE: 0.0313]
These 8 plots have similar species communities with high frequencies of ABIELAS (subalpine fir) and PICEENG (Engelmann spruce) and a moderate frequency of PINUCON (lodgepole pine) compared to the average on all sites
Interpretation: Due to climate (Cool temperatures), and species composition, these are likely high elevation sites
These 16 plots have similar species communities with high frequency of POPUTRE (aspen) and moderate frequency of PICEGLA (white spruce) compared to the average on all sites
Interpretation: Due to climate (Cold winters, wet summers), and species composition, these are likely boreal mixedwood sites
These 24 plots have similar species communities with moderate frequency of PINUCON (lodgepole pine) and moderately low frequency of PINUBAN (jack pine) compared to the average on all sites
Interpretation: Due to climate (Cold winters, cool summers), and species composition, these are likely boreal highland sites
These 8 plots have similar species communities with a very high frequency of POPUTRE (aspen), a moderate frequency of PICEGLA (white spruce), and moderately low frequencies of the other species compared to the average on all sites
Interpretation: Due to climate (Cold winters, warm summers), and species composition, these are likely northern boreal sites
MRT in R: We build a model to look at multiple species frequencies with 5 predictor variables
The variation explained by each split
For each node (and leaf) details about the split are provided:
• Number of observations used
• Mean of the group
• Mean square error of the group
• How many observations are divided into each side of the split
• Primary splits (potential alternative predictors)
Improvement values indicate how much variation would be explained if that split were based on an alternative variable
If the improvement value at a split is the same for a different predictor variable, the alternative predictor variable could equally be used to explain the groupings. See the full output in Lab 8.
Random Forest
Multivariate Fundamentals: Prediction
[Maps: predicted response under Current Climate vs. the 2050s]
Objective: Determine what drives relationships between response and predictor variables in more detail, then use this relationship to predict the response at a new location
A bootstrapped version of CART (univariate) that allows you to better investigate whether a relationship between response and predictors exists and which predictor variables drive that relationship (more reliable)
Like CART & MRT, Random Forest can use both categorical (Classification technique) and continuous numeric (Regression technique) response variables
Leo Breiman (1928–2005) and Adele Cutler (1950–present)
The math behind Random Forest
MANY trees are iteratively built (bootstrapped) from subsets of the data (typically ~70% of the data); the remaining portion of the data is then used to test the tree that was built
Think of The Price is Right game Plinko
For each tree, you use a subset of the data to "set up the pegs" on the board, e.g. the environmental predictors
The remaining portion of the data then represents the "disks" you will play (a.k.a. the out-of-bag sample)
The "slots" at the bottom ($) represent the group classes (categorical response) or numeric values (numeric response) in our data, e.g. species classes, ecosystem classes
When you slide your disks down the board, the set-up of the pegs determines what slot each disk falls into
If the pegs are set up well (i.e. there is a strong relationship between response and predictors), the disks will fall into the correct slots
Random Forest in R
Random Forest in R: library(randomForest)
randomForest(ResponseVector ~ EquationOfPredictors, data = predictorData, ntree = n, importance = TRUE) (randomForest package)
ResponseVector: vector of the response variable (univariate), e.g. species frequency (numeric): regression; ecosystem classes (categorical factor): classification
Equation of predictors: Variable1 to include a single predictor; Variable1 + Variable2 to include multiple predictors
To run Random Forest you need to install the randomForest package
predictorData: data table of your predictor variables, e.g. environmental variables
Turn on the option importance=TRUE to generate a statistic for how much each predictor variable contributed to achieving correct answers
ntree: the number of trees you want to generate. The fewer trees you use, the faster the program runs, but the less reliable the prediction.
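A minimal example, again assuming the hypothetical plots data frame; the response must be numeric for regression and a factor for classification:

library(randomForest)

set.seed(42)  # bootstrapping is random, so set a seed for reproducibility
rf_model <- randomForest(SpeciesFreq ~ MAT + MCMT + MWMT + MSP,
                         data = plots,
                         ntree = 500,       # number of trees to grow
                         importance = TRUE) # permutation importance for each predictor

# classification version with a hypothetical factor response:
rf_class <- randomForest(EcosystemClass ~ MAT + MCMT + MWMT + MSP,
                         data = plots, ntree = 500, importance = TRUE)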
Random Forest in R – Regression (numeric response variable)
You should look at a summary of the error in your predictions (disks that fell into the wrong slot) as a function of the number of trees you use
The fewer the trees you use the faster the Random Forest program will run, but the less reliable the predictions
More trees give Random Forest a better chance to establish a strong relationship between response and predictors
You want to pick your number of trees where your error asymptotes
In this example we should increase the number of trees from 100 to either 200 or 300
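Plotting the fitted forest shows the (out-of-bag) error against the number of trees; assuming the rf_model object from above:

plot(rf_model)  # pick ntree where this curve flattens out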
Random Forest in R – Classification (categorical response variable)
For classification trees (categorical response variable) curves will be generated for each class included
Again you want to pick your number of trees where your error asymptotes for all classes
In this example 100 trees seems sufficient
Random Forest in R – Regression (numeric response variable)
[Variable importance plots: mean decrease in accuracy (left panel) and mean decrease in node MSE (right panel)]
Importance measures show how much the MSE or impurity increases when that variable is randomly permuted during tree construction
If you randomly permute a variable that gains you nothing in prediction, you will only see small changes in impurity and MSE
Important variables will change the predictions by quite a bit when randomly permuted, so you will see bigger changes
The further a predictor variable sits to the right in the importance plot, the MORE important that predictor is within the forest
Random Forest in R – Classification (categorical response variable)
Importance will give you values of mean decrease in accuracy for each predictor for each class as well as overall for all classes
Additionally you will get a measure of mean decrease in Gini – which represents how often each predictor variable contributed to a correct classification
The further a predictor variable sits to the right in the importance plot, the MORE important that predictor is within the forest
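Both the numeric table and the dot plots come straight from the randomForest package; assuming the rf_model object fitted earlier:

importance(rf_model)  # importance values for each predictor
varImpPlot(rf_model)  # dot plots; further right = more important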
For classification random forest ONLY we can also look at the classification error rate for all categories
Random Forest in R – Classification (categorical response variable)
Rows represent the actual Group classes
Columns represent the predicted Group classes based on the random forest analysis
The class.error column indicates the proportion of each row's Group class that was misidentified
E.g. a value of 0.125 for Group 16 indicates 12.5% of data points known to belong to Group 16 were misclassified across all trees under random forest
Ideally we want all classification errors to be small as that indicates Random Forest is a GOOD Prediction Model
For regression (numeric) you can use mean(out4$rsq) to get a pseudo-R² value as an indication of the goodness-of-fit for the random forest model
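A sketch of both goodness-of-fit checks, using the hypothetical rf_class and rf_model objects from earlier (the lab's own object is named out4):

rf_class$confusion  # confusion matrix with the class.error column
mean(rf_model$rsq)  # pseudo-R² for a regression forest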
Random Forest Predictions – Regression (numeric response variable)
[Maps: predicted species frequency under Current Climate vs. the 2050s]
We can simply apply the output from random forest to a new set of environmental variables to see how the modelled variable (e.g. species frequency) will respond
Random Forest Predictions – Classification (categorical response variable)
[Maps: predicted ecosystem classes under Current Climate vs. the 2050s]
We can simply apply the output from random forest to a new set of environmental variables to see how the modelled variable (e.g. ecosystem class) will respond
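In R this is a single predict() call on a new predictor table; climate2050 is a hypothetical data frame with the same predictor columns used to fit the forest:

freq_2050  <- predict(rf_model, newdata = climate2050)  # predicted species frequencies
class_2050 <- predict(rf_class, newdata = climate2050)  # predicted ecosystem classes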