Boosted Regression Trees A method to explore biology-environment relationships Sophie Mormede, Matt Pinkerton National Institute of Water and Atmospheric Research, Wellington, NZ May 2010


Page 1: Boosted Regression Trees A method to explore biology-environment relationships

Boosted Regression Trees

A method to explore biology-environment relationships

Sophie Mormede, Matt Pinkerton

National Institute of Water and Atmospheric Research, Wellington, NZ

May 2010

Page 2

Two main uses of BRT

• to investigate the ecological dependence of a species on the environment
• to determine "habitat preference" in order to extrapolate patchy biological data to a larger domain

Page 3

An example

• WHAT: Predict toothfish and bycatch species distributions over the Ross Sea (88.1 & 882A–B)
• WHY:
  – layers for bioregionalisation
  – input to systematic conservation planning
  – to investigate overlap of TOA and prey species
  – to consider potential changes in species distribution under climate change scenarios
  – to help in estimating biomass from the small number of research trawls (WGR)
• HOW: GLM / GAM (not very satisfactory), BRT, Generalised Dissimilarity Modelling, …

Page 4

Project outcomes so far

• Predictions seem to make sense, and so do the confidence intervals
• Quality of depth data is critical (use GEBCO08, modified with fishing depth)
• Still need to validate models on a different area (882E?, Kerguelen?)

Page 5

BRT – what is it all about then?

• Regression tree:
  – Recursive binary splits
  – Stopping criterion
  – Allows interactions natively if wanted (tree complexity)
• Boosting = forward stagewise model fitting:
  – Fit a truncated tree (1-10 splits)
  – Compute the fitted values and residuals
  – Fit a new tree to the residuals and add it to the model, repeating many times (number of trees > 1000)
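The forward stagewise idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the project: it uses a single predictor, trees with one split ("stumps"), squared-error loss, and a learning rate (shrinkage) that the slides do not mention. All function names and the data are made up.

```python
from statistics import fmean

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold separates identical values
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    if best is None:  # all x equal: fall back to a constant
        mean = fmean(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=200, learning_rate=0.1):
    """Forward stagewise fitting: each stump is fitted to the current residuals."""
    base = fmean(y)
    fitted = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        fitted = [fi + learning_rate * predict_stump(stump, xi)
                  for fi, xi in zip(fitted, x)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

# Made-up data: a step in the response at x = 10.
x = [float(i) for i in range(20)]
y = [1.0 if xi < 10 else 3.0 for xi in x]
model = boost(x, y)
```

Each stump only corrects a fraction of the remaining residual, which is why many trees are needed. Deeper trees (more splits) would let the model capture interactions between predictors.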

Page 6

More about BRT

• Boosting with stochasticity:
  – At each step a randomly selected proportion of the dataset (the bag fraction) is used for fitting, which improves model performance
• Cross validation (CV):
  – To avoid overfitting, test the model on withheld parts of the data; this also estimates the degree of overfitting
• You can bootstrap BRTs (I used 1000 bootstraps)
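The sampling behind the bag fraction and cross validation can be sketched as index bookkeeping. This is a minimal illustration with hypothetical helper names. It assumes the bag is drawn without replacement and that CV uses k folds of near-equal size; the slides state neither.

```python
import random

def bag_sample(n, bag_fraction, rng):
    """Indices of the random subset used to fit one tree (no replacement)."""
    size = max(1, round(bag_fraction * n))
    return rng.sample(range(n), size)

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(n, k, rng):
    """Yield (train, withheld) index lists, one pair per fold."""
    for withheld in k_fold_indices(n, k, rng):
        held = set(withheld)
        train = [i for i in range(n) if i not in held]
        yield train, withheld

rng = random.Random(1)          # fixed seed so the sketch is repeatable
bag = bag_sample(100, 0.5, rng)  # half of 100 records for one boosting step
splits = list(train_test_splits(10, 3, rng))
```

In use, a model would be fitted on each `train` list and scored on the matching `withheld` list. The gap between training and withheld deviance gives the estimate of overfitting.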

Page 7

Pros of BRT

• Copes with NAs
• Copes with non-normally distributed environmental variables (no transforms)
• Copes with outliers
• Allows multiple levels of interactions
• Unlikely to overfit as much as GLM, and quantifies the overfitting
• 20-30% improvement of fits compared with GLM / GAM
• Runs in R

Page 8

Cons of BRT

• Cons of BRT
  – Does not give smooth / monotonic responses
  – Still some overfitting – need to be careful
  – Slow when using bootstrapping
• Cons of any prediction method
  – Only as good as the environmental layers
  – Predicts only in the domain we have data for (need to mask other areas)
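One simple way to mask areas outside the sampled domain is a per-variable range check. The sketch below is an assumption about how masking could be done, not the method used in the project. The variable names (`depth_m`, `temp_c`) and all values are invented for illustration.

```python
def envelope(samples):
    """Per-variable (min, max) over the sampled sites; samples is a list of dicts."""
    names = samples[0].keys()
    return {n: (min(s[n] for s in samples), max(s[n] for s in samples))
            for n in names}

def inside(env, cell):
    """True if every variable of the cell lies within the sampled range."""
    return all(lo <= cell[n] <= hi for n, (lo, hi) in env.items())

def mask_predictions(env, cells, predictions):
    """Keep predictions inside the envelope; replace the others with None."""
    return [p if inside(env, c) else None for c, p in zip(cells, predictions)]

# Invented sampled sites and grid cells.
sampled = [{"depth_m": 600.0, "temp_c": -1.0}, {"depth_m": 1800.0, "temp_c": 1.5}]
env = envelope(sampled)
cells = [{"depth_m": 1000.0, "temp_c": 0.0}, {"depth_m": 3000.0, "temp_c": 0.0}]
masked = mask_predictions(env, cells, [0.2, 0.4])
```

A rectangular envelope is crude: it ignores combinations of variables that were never sampled together, so it should be read as a lower bar for masking rather than a complete check.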

Page 9

BRT process

• Optimise BRT setup (which variables, how many interactions, based on deviance)
• Run full models and bootstraps
• Run reduced models with only the variables that were significant
• Bootstrap predictions based on the reduced model, and calculate CI
• Plot
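The bootstrap step can be sketched as follows. The slides do not say how the CI was calculated, so the percentile interval here is an assumption. `fit_predict` is a hypothetical stand-in for "refit the reduced BRT on the resample and predict at one location". In the usage lines a plain mean plays that role so the sketch runs on its own.

```python
import random
from statistics import fmean

def bootstrap_ci(data, fit_predict, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from n_boot resamples with replacement."""
    rng = random.Random(seed)
    n = len(data)
    preds = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        preds.append(fit_predict(resample))
    preds.sort()
    lo = preds[int((alpha / 2) * n_boot)]
    hi = preds[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Made-up data; the mean stands in for a fitted model's prediction.
data = [float(i) for i in range(1, 21)]
lo, hi = bootstrap_ci(data, fmean, n_boot=1000)
```

With a real BRT, each of the 1000 refits is a full boosting run, which is where the slowness noted earlier comes from.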

Page 10

Back to the example: environmental variables we used

• Bathymetry (GEBCO 2008, modified for fishing depth)
• Chlorophyll a, summer (remote sensing)
• Ice15 and ice85 (satellite data) – not used
• Rugosity (GEBCO08)
• Near-bottom current speed, temperature and salinity (HIGEM circulation model)
• Use only variables that make biological sense!

Page 11

Response variables

• For each species, predict the proportion of hooks that caught a fish
  – Akin to binomial per hook
• Transform to normalise data
  – Y = arcsin [ sqrt (fish per hook) ]
• Predict with BRT using a Gaussian error distribution
• Also predict binomial for all but toothfish (only 5% null catch)
• Could also do fish per line
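The transform and its inverse are short enough to write out. This is a sketch with hypothetical function names. Clamping before back-transforming is an added assumption, because a Gaussian model can predict values outside the range [0, π/2] that the transform produces.

```python
import math

def catch_rate(n_fish, n_hooks):
    """Proportion of hooks that caught a fish."""
    return n_fish / n_hooks

def arcsine_sqrt(p):
    """Y = arcsin(sqrt(p)) for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))

def back_transform(y):
    """Invert the transform: p = sin(y)^2, after clamping y to [0, pi/2]."""
    y = min(max(y, 0.0), math.pi / 2)
    return math.sin(y) ** 2

# Made-up line: 50 fish on 1000 hooks.
y = arcsine_sqrt(catch_rate(50, 1000))
p = back_transform(y)
```

Predictions made on the transformed scale need `back_transform` before they can be read as fish per hook.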

Page 12

Example - TOA prediction (preliminary results)

Page 13

Other example – Oithona similis

Pinkerton et al. (2010)

Oithona similis: the most abundant animal in the world?

[Figure: BRT prediction of relative abundance, scale 10 / 30 / 100; data from the CPR database]

Page 14

Last example – species richness

Leathwick et al. (2006)

Page 15

Other methods to consider: Generalised Dissimilarity Modelling

• Generalised Dissimilarity Modelling (GDM): multivariate response variable
• Pros
  – predicts communities based on environmental variables (multiple species analysed)
  – Classification is part of the process
• Cons
  – No bootstrapping
  – How many species??

Page 16

Classification

• Classifications (clusters): separate areas based on layers (environment, biology etc.)
• Options
  – Use biology layers from BRT?
  – Use environmental layers too? (double-dipping?)
  – Use GDM directly for predictions and classifications?
• Number of classes…