Phylogenetics in the cloud
Brian O’Meara
http://www.brianomeara.info
http://xkcd.com/287/

Upload: andrew-osgood

Post on 14-Jan-2016



Learning objectives:
Understand what phylogenetics is and its utility for life scientists (briefly)
Know some of the computational pitfalls
Identify some of the available resources
Become OK with just saying yes to being a user (sometimes)

Doug Stone


[Figure: phylogeny annotated with host type (angiosperm, conifer), 7 origins of agriculture, and 8 origins of inbreeding; the numeric labels are not recoverable from the transcript]

Number of papers per year in Scopus with "phylogen*" in any field. Markers show the frequency of "phylogen*" relative to "evolution": papers matching "phylogen*" reach up to 38% of the number matching "evolution".

David Cannatella
Ryan & Rand, 1995

H5N1 bird flu: phylogeography & evolution
Wallace et al., 2007

Organ et al. Origin of avian genome size and structure in non-avian dinosaurs. Nature (2007) vol. 446 (7132) pp. 180-4

Here we use a novel bayesian comparative method to show that bone-cell size correlates well with genome size in extant vertebrates, and hence use this relationship to estimate the genome sizes of 31 species of extinct dinosaur, including several species of extinct birds. Our results indicate that the small genomes typically associated with avian flight evolved in the saurischian dinosaur lineage between 230 and 250 million years ago, long before this lineage gave rise to the first birds. By comparison, ornithischian dinosaurs are inferred to have had much larger genomes, which were probably typical for ancestral Dinosauria. Using comparative genomic data, we estimate that genome-wide interspersed mobile elements, a class of repetitive DNA, comprised 5-12% of the total genome size in the saurischian dinosaur lineage, but was 7-19% of total genome size in ornithischian dinosaurs, suggesting that repetitive elements became less active in the saurischian lineage.

Alfaro et al. Nine exceptional radiations plus high turnover explain species diversity in jawed vertebrates. Proc Natl Acad Sci USA (2009) vol. 106 (32) pp. 13410-13414

Workflow:
Get sequence data
Build tree
Calibrate tree to time
Look at tree
Get cool data
Answer question

R packages: seqinr, ape, rentrez, ...
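To make the "get sequence data" step concrete, here is a minimal Python sketch of the kind of GenBank search that a package such as rentrez wraps. It only builds a query URL for NCBI's public E-utilities `esearch` endpoint and does not send it. The function name, the example taxon and gene, and the default record limit are illustrative assumptions, not anything taken from the slides.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_genbank_search_url(organism: str, gene: str, max_records: int = 20) -> str:
    """Return an esearch URL for nucleotide records of one gene in one taxon.

    Hypothetical helper for illustration: it builds the URL but never calls it.
    """
    params = {
        "db": "nucleotide",                               # GenBank nucleotide records
        "term": f"{organism}[Organism] AND {gene}[Gene]",  # Entrez field tags
        "retmax": max_records,                            # cap on returned record IDs
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example taxon and gene are assumptions chosen for illustration.
    print(build_genbank_search_url("Quercus", "rbcL"))
```

Fetching the URL returns record identifiers; downloading the sequences themselves takes a second request, which is the bookkeeping these packages handle for you.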

PhyLoTA Browser: automatically clusters GenBank sequences for you, so you can just download the genes

[Figure: number of possible trees (log scale) versus number of taxa N; annotations: "Number of atoms in the universe", ">1 million species"]

Remember this is a log scale, so the number of trees rises faster than exponentially with the number of taxa. This becomes a problem when searching for trees. I'm involved with a project (iPlant) attempting to deal with trees of tens of thousands to hundreds of thousands of taxa.

http://xkcd.com/287/

Smith et al. 2009
13,533 species
Needed 32 GB of RAM to run

Adding two more species to the 13,533 species dataset shown earlier means increasing the search space of possible topologies by 183 million-fold. That is the same order of magnitude as expanding the search for an object from around campus (about a square mile) to anywhere on the planet, and it's just adding two species.

Reuse!
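The growth in tree space can be checked with a short calculation. The sketch below assumes one common counting convention, unrooted and fully bifurcating trees, of which there are (2n-5)!! for n taxa; the slides do not say which convention they use, and the exact multiplier depends on it. Under this assumption, each added taxon multiplies the count by (2m-5), where m is the new number of taxa, so going from 13,533 to 13,535 species gives a multiplier in the hundreds of millions, the same order of magnitude as the figure quoted above. The 10^80 value for atoms in the universe is a commonly cited rough estimate, also an assumption.

```python
from math import log10

# Assumption: a commonly cited rough estimate of ~10^80 atoms in the universe.
ATOMS_IN_UNIVERSE_LOG10 = 80


def log10_unrooted_trees(n_taxa: int) -> float:
    """log10 of (2n-5)!!, the number of unrooted bifurcating trees for n >= 3 taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    # (2n-5)!! = 1 * 3 * 5 * ... * (2n-5); sum logs to avoid huge integers.
    return sum(log10(k) for k in range(3, 2 * n_taxa - 4, 2))


def growth_factor(n_taxa: int, n_added: int) -> int:
    """Exact factor by which tree space grows when n_added taxa join n_taxa."""
    factor = 1
    for m in range(n_taxa + 1, n_taxa + n_added + 1):
        factor *= 2 * m - 5  # each new taxon can attach to any existing branch
    return factor


def smallest_n_exceeding(log10_threshold: float) -> int:
    """Smallest number of taxa whose tree count exceeds 10**log10_threshold."""
    n, total = 3, 0.0  # 3 taxa have exactly one unrooted tree
    while total <= log10_threshold:
        n += 1
        total += log10(2 * n - 5)
    return n


if __name__ == "__main__":
    print("log10(trees) for 13,533 taxa:", round(log10_unrooted_trees(13_533)))
    print("growth from adding 2 taxa:", f"{growth_factor(13_533, 2):,}")
    print("taxa needed to exceed atoms in universe:",
          smallest_n_exceeding(ATOMS_IN_UNIVERSE_LOG10))
```

Working in log10 keeps the numbers printable; the exact integer for 13,533 taxa has tens of thousands of digits.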

treebase, rdryad
OpenTree is coming soon

Coming soon

Magritte, "The Treachery of Images": this is not a pipe. It's a picture of a pipe. You can't actually use it for anything.

96% of published trees are available only as pictures of trees
4% of published trees are available as actually reusable trees
Stoltzfus et al. 2013
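To illustrate the difference between a picture of a tree and a reusable one: a reusable tree is machine-readable text, for example in the widely used Newick format, which software can read directly. The minimal Python sketch below pulls the tip names out of a Newick string. The example tree, its branch lengths, and the helper name are made up for illustration, and the regular expression assumes simple unquoted labels; a real analysis would use an established parser such as the one in ape.

```python
import re

# A tip label directly follows "(" or ","; internal-node labels follow ")".
# Assumption: labels are unquoted and contain no parentheses, commas, or colons.
_TIP_LABEL = re.compile(r"[(,]\s*([^():,;\s]+)")


def tip_labels(newick: str) -> list[str]:
    """Return the tip (leaf) names of a simple Newick tree, in order."""
    return _TIP_LABEL.findall(newick)


if __name__ == "__main__":
    # Illustrative tree only: topology and branch lengths are invented.
    tree = "((Homo_sapiens:6.4,Pan_troglodytes:6.4):2.2,Gorilla_gorilla:8.6);"
    print(tip_labels(tree))
```

Nothing comparable is possible with a figure in a PDF short of re-digitizing the image by hand.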

241,465 TeraGrid/XSEDE jobs submitted by 8,598 unique users.
An average of 171 new XSEDE users registered every month.
5,216 CPU years of TeraGrid/XSEDE time.

You can use the CIPRES Science Gateway. No R interface yet.

Copyright 2005-2014. All Rights Reserved. Substantial duplication is not permitted. We encourage wide use of this resource, but until it is complete it should not be used to represent a synthesis for any taxonomic group. Currently large scale, automated, data-mining is not permitted. Consult the authors if you have any questions about appropriate use, or if you plan to publish results from the database.

Radically open, but only a tiny number of sources

HD TV: 1920 × 1080

13,533 names

Largest computer monitors: 3280 × 2048 (can be tiled)

Laser printer: effectively 3600 × 4725 (can be tiled)

Even looking at trees can be challenging
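The display problem is simple arithmetic. The sketch below takes the resolutions and the 13,533-name tree from the slide and assumes a layout with one tip name per row, then works out how many vertical pixels each name would get and how many screens or pages would have to be tiled. The 8-pixel minimum legible text height is an assumption for illustration.

```python
from math import ceil

N_NAMES = 13_533  # tips in the tree mentioned on the slide

# (width, height) in pixels, as listed on the slide.
DISPLAYS = {
    "HD TV": (1920, 1080),
    "Largest computer monitor": (3280, 2048),
    "Laser printer page": (3600, 4725),
}

# Assumption: text shorter than about 8 pixels is not legible.
MIN_LEGIBLE_PX = 8


def pixels_per_name(height_px: int, n_names: int = N_NAMES) -> float:
    """Vertical pixels available per tip name if all names share one display."""
    return height_px / n_names


def tiles_needed(height_px: int, n_names: int = N_NAMES,
                 min_px: int = MIN_LEGIBLE_PX) -> int:
    """Displays stacked vertically so each name gets at least min_px pixels."""
    return ceil(n_names * min_px / height_px)


if __name__ == "__main__":
    for name, (_, height) in DISPLAYS.items():
        print(f"{name}: {pixels_per_name(height):.2f} px per name, "
              f"{tiles_needed(height)} tiles needed")
```

On every device listed, a single screen or page leaves well under one pixel per name, which is why viewers that zoom, such as OneZoom, are useful.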

OneZoom
Video from Imperial College London: https://www.youtube.com/watch?v=LZ3n3mV4uVc

We are used to repositories of GenBank data, but there are also repositories of geographic localities, datasets from publications, fossil data, etc. There are rOpenSci packages to access all of this data.

iPlant Discovery Environment: also runs on XSEDE and can run many comparative methods

Coming soon

Learning objectives:
Understand what phylogenetics is and its utility for life scientists (briefly)
Know some of the computational pitfalls
Identify some of the available resources
Become OK with just saying yes to being a user (sometimes)