A Review of: "Analysis of Phylogenetics and Evolution with R"



694 SYSTEMATIC BIOLOGY VOL. 56



Syst. Biol. 56(4):694-696, 2007
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150701475589

Analysis of Phylogenetics and Evolution with R. Emmanuel Paradis. 2006. Springer-Verlag, Heidelberg. 223 pp. ISBN 978-0-387-32914-7. €39.95, US $49.95 (paperback).

When doing phylogenetic analyses, biologists usually restrict themselves to "canned" packages such as PAUP*, PAML, or MrBayes. Canned packages are typically fairly easy to learn and use, and they offer a range of analysis methods. The great advantage of these packages is that they allow the practitioner to focus on evolutionary questions without needing to understand the mathematics and computer science behind the algorithms they are applying. Developers of phylogenetic methods, on the other hand, have no choice but to learn programming languages. This might seem like an obvious division, but relatively new programming environments like Python and R mean that the boundary between biologist and programmer is becoming much blurrier. These languages are built up from packaged modules, which can be combined in any way the user desires, and it is straightforward for users to create new modules of their own. Languages like R take longer to learn than canned packages, but their nature allows the user to "mix and match" a wide range of methods that have been developed for phylogenetic analysis, or to develop their own methods. The real question is whether the advantages of doing phylogenetics in an environment like R outweigh the increased learning cost. Emmanuel Paradis would say an emphatic "yes," and he pursues this line of thought in the book being reviewed here.

Paradis is one of the authors of the R package "APE: Analyses of Phylogenetics and Evolution"; in his new book he sets out how this package and other features of R can be used to perform phylogenetic analyses. R, according to its Internet homepage, is "a language and environment for statistical computing and graphics ... that provides a wide variety of statistical and graphical techniques, and is highly extensible." Importantly, R is free as well as open source. The fact that modern phylogenetics is a statistically based discipline makes R a natural choice for developing phylogenetic tools, and so a book that describes the current state of play is timely. This book will appeal to two main audiences: phylogenetic practitioners seeking to perform analyses without worrying about the intricacies of how different R modules get things done, and developers of phylogenetic algorithms.


Doing phylogenetics in R is a fundamentally different approach from using packaged software such as PAUP* or MrBayes. In the latter programs you load your data, perform some analysis that the designer of the program has predefined (e.g., a search for the maximum-likelihood tree under some explicit model), and then output the result (e.g., a tree with branch lengths). Certainly, the program will manipulate the data along the way, but the user has no access to the data or the manipulations. The use of scripts (e.g., PAUP blocks) makes these programs potentially very powerful, but you are still limited to the set of commands provided. In R the data set is stored in active memory, and the ways you can manipulate it are limited only by your imagination, programming skills, and the methods that others have made available. In some cases these limitations may be severe!

The downside of this flexibility is that there is going to be a long learning curve for users unfamiliar with R: an R newbie is not going to begin by designing new packages. A more likely scenario is that they will use some package that has already been developed. After learning to use that package effectively, they will gain confidence to use other packages and perhaps eventually design their own analyses using the general features of R. The book by Paradis seems to have been written with this scenario in mind; it is filled with examples and case studies illustrating how to use features of the existing packages. It is the sort of book where you sit in front of your computer and read a bit, tinker a bit, read a bit more, and gradually increase your confidence with the R environment.

The book begins with a section on why Paradis thinks R is an effective environment for phylogenetic analysis. He makes a good case for using R as a platform, emphasizing R's flexibility and the opportunity to do highly integrated analyses. Some of the reasons for using R are forward-looking in the sense that, although R doesn't currently do as many different phylogenetic analyses as people might like, in time a greater range of methods will be implemented. This introductory chapter includes details on how to find and install the basic R system and the packages required for phylogenetics. Starting from scratch on a computer without R, I found it fairly straightforward to get up and running.

Chapter 2 gives a whirlwind introduction to the general features of R. Most people would want to look at some other tutorial material in addition to what is presented here. This is not a criticism, and indeed Paradis suggests where to look for such tutorials.

Chapter 3 introduces us to the two main R packages that have been developed for doing phylogenetics, APE and ADE4, and to the phylogenetic data structures they use to represent trees and sequences. Along with simple reading, writing, and file-format conversion, R has a number of useful features for manipulating trees, including converting between dichotomous and multichotomous trees, removing particular taxa, and rooting and unrooting trees. These seem like little things, but they rapidly become tedious and error prone if you attempt to do them by manually editing Newick-format files.
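
To make the point about hand-editing Newick files concrete, here is a minimal sketch of removing one taxon from a tree programmatically. It is written in Python purely for illustration and is not APE's API or the book's code; the function names (parse_newick, drop_tip, to_newick) are hypothetical, and the parser assumes a topology-only Newick string with no branch lengths, no quoted labels, and no whitespace before parentheses.

```python
def parse_newick(text):
    """Parse a topology-only Newick string such as '((A,B),(C,D));'.

    Internal nodes become lists of children; leaves are plain strings.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            pos += 1  # consume ')'
            return children
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].strip()

    return parse_node()


def drop_tip(node, tip):
    """Return a copy of the tree without `tip`, or None if nothing is left."""
    if isinstance(node, str):
        return None if node == tip else node
    kept = [c for c in (drop_tip(child, tip) for child in node) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse an internal node left with a single child
    return kept


def to_newick(node):
    """Serialize the nested-list tree back to a Newick topology (no ';')."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


tree = parse_newick("((A,B),(C,D));")
pruned = to_newick(drop_tip(tree, "B")) + ";"
print(pruned)  # (A,(C,D));
```

The step that is easy to get wrong by hand is the collapse of the internal node that is left with a single child, which is exactly the kind of bookkeeping a tree data structure handles for you.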

I found Chapter 4 particularly interesting; it deals with plotting phylogenies. Here we are playing to one of R's strengths in an area that has, as Paradis points out, been neglected in the phylogenetics discipline. R provides very flexible tools for plotting phylogenies. The book gives examples of how to produce trees that are rooted, unrooted, or radial, with nodes annotated with text, colored circles, or even mini-barcharts. There are also facilities for exploring and displaying very large phylogenies by zooming in on subtrees. This chapter will appeal to anyone who has had to prepare anything but the most basic tree for publication. Originally I thought it was odd to have the chapter on plotting trees before the chapter on phylogeny estimation, but having read it I think that the plotting capabilities of R could well be the hook that gets many phylogeneticists to start using it.

Chapter 5 covers phylogeny estimation in R. To date, there are some distance-based methods available (NJ and simple clustering algorithms) as well as maximum likelihood. Paradis suggests that Bayesian methods, although not currently available in R, will be straightforward to implement: because R was designed with statistical estimation in mind, many of the required ingredients already exist. A section of the chapter is devoted to models of DNA substitution. The DNAmodel function in APE allows very flexible model specification, including both partitions and mixtures. There is an example of implementing a partitioned model, but I would have liked an example showing how to create a mixture model as well. Other topics covered include model testing, bootstrap, consensus, and molecular dating.
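
For readers unfamiliar with distance-based clustering, the following is a minimal sketch of one such method, UPGMA. UPGMA is chosen here as an assumption; the review does not say which clustering algorithms the book covers. The code is illustrative Python, not APE's implementation; the function name upgma and the distance values are hypothetical, and only the topology is returned, not branch lengths.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    labels: list of taxon names.
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in labels}  # cluster -> number of leaves
    d = dict(dist)                        # working copy of the distances
    while len(sizes) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Made-up pairwise distances for four taxa.
distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 8.0,
    frozenset(("A", "D")): 8.0,
    frozenset(("B", "C")): 8.0,
    frozenset(("B", "D")): 8.0,
}
topology = upgma(["A", "B", "C", "D"], distances)
print(topology)  # (('A', 'B'), ('C', 'D'))
```

Neighbor joining follows the same merge-and-update loop but corrects the pair-selection criterion for unequal rates among lineages, which UPGMA does not.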

For most biologists, finding the best phylogenetic tree to describe some set of species is not an end in itself, but rather a stepping stone to answering some question about evolution. Chapter 6 shows how methods in the APE and ADE4 packages use R to help answer macroevolutionary questions. This is obviously a key interest of the developers of these packages, as there is a wide range of methods available for comparing observations across species with respect to an underlying phylogeny.

Chapter 7 is specifically aimed at people wanting to develop algorithms in R. It doesn't aim to be comprehensive but rather gives a number of useful suggestions, strategies, and pointers to the appropriate literature.

I think people will find Analysis of Phylogenetics and Evolution with R to be a very useful reference book. It doesn't try to be everything to both the phylogenetic practitioner and the algorithm developer, but instead takes the approach of providing enough to get everyone interested and some way down the road to using R productively. There are plenty of signposts throughout the book to let you know where to look for further information.

Supporting the use of a language like R has the potential to be very useful to the phylogenetics community. Firstly, it helps the developers of new algorithms by providing an environment where it is relatively easy to develop code without having to reinvent the wheel to do basic things like input and output of phylogenetic data, or even less basic things like defining models of DNA substitution. As an algorithm developer, working in R is likely to speed up your ability to turn an idea for an algorithm into something that can be used by others. The corollary of this is that R has the potential to be very good for practitioners in that, once they have learned to navigate the R environment, they get great flexibility and access to an ever-increasing number of specialist packages. Certainly, the cost in time of learning a language like R exceeds the cost of learning a program like PAUP*, but the long-term benefits are also likely to be greater, and the more people who get involved in doing phylogenetics in R, the more modules will become available. So overcome that activation-energy hump, buy the book, and get involved!

Barbara R. Holland, Institute of Fundamental Sciences, Massey University, Private Bag 11222, Palmerston North 5301, New Zealand; E-mail: [email protected]

Syst. Biol. 56(4):696-698, 2007
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150701475605

The Tree of Life: A Phylogenetic Classification. Guillaume Lecointre and Hervé Le Guyader. (Illustrated by Dominique Visset, translated by Karen McCoy.) 2006. Belknap Press of Harvard University Press, Cambridge MA. 560 pp. ISBN 978-0-674-02183-9 (ISBN-10 0-674-02183-5). US$39.95, €34, £25.95 (hardback).

As the book review editor, it is obvious that I would request from the publisher a copy of any book with this title, with the intention of then sending it on to a reviewer. However, I just couldn't make myself do it in this case: I loved the book too much to be able to part with it. So, I am writing the review myself, and all of you will have to buy your own copies.

There are, of course, a number of books around that try to provide an overview of biodiversity, listing the different taxonomic groups, along with an illustrated description of their distinguishing features, the best known being that of Margulis and Schwarz (1997). If nothing else, all introductory biology textbooks also do this, although they never seem to present the other parts of biology in the same style. This is a poor way to explain biodiversity, because it smacks too much of natural history. Natural history is a very valuable thing, as centuries of human study will attest, but it should never be confounded with science. Science is about testing hypotheses, and in our business these hypotheses are about evolutionary history. It has taken biologists a long time to explicate the simple idea that it is the study of biodiversity that makes biology different from other sciences: the nature and scale of the interrelationships among organisms have never been conceived of within physics and chemistry. Evolutionary history is our explanation for the origin of that biodiversity, and so the best way to present biodiversity is in the context of phylogenetic trees. That is what this new book does.

So, what makes this book different is its strictly (almost relentlessly) phylogenetic perspective. It is based on trees, not only as metaphors but as the primary means of communication; the text and the many excellent line drawings are purely adjuncts to the trees. The book itself is arranged as a set of nested clades, with each section introduced by a tree along with a list of synapomorphies. The clades within that tree then follow on the subsequent pages, presented in the same manner, until the limit of resolution of the book is reached. All of the putatively monophyletic groups are numbered, for ease of locating them within the book, and almost all of them are named. Paraphyly and polyphyly are strictly excluded, no matter how familiar the names of such groups might be.

This book thus demonstrates the art of making something very difficult look surprisingly easy. This is not a feature of all books, because it requires great expertise from the authors, but it would surely be a useful synapomorphy if we could achieve it. In this case, the straightforward arrangement, the simple writing style (translated well), and the direct presentation of phylogenetic information all make the book accessible to expert and non-expert readers alike. In short, the book is unique. It not only represents the first thorough attempt to portray life from a purely phylogenetic perspective, it is an excellent implementation of that idea.

As an added bonus, there is a 35-page introduction to phylogenetic systematics. This is among the best such introductions in any language. The candid and unadorned writing style comes to the fore, so that the ideas and information are comprehensible to the uninitiated without alienating the experts by oversimplification. None of the complications in phylogeny reconstruction are avoided (although the methodology concentrates on parsimony analysis), and yet the concepts are presented in a straightforward and logical manner, with suitable illustrated examples. The strengths and weaknesses of our current methods are presented without excuses, and the consequent classifications are presented without apologies to the traditional approaches. Only long experience in this line of work could produce a synthesis of this type. If you ever need to explain phylogenetic systematics to an interested member of the general public, then you can safely direct them to this essay.
