Biodiversity informatics: why aren't we there yet?

Post on 10-May-2015

DESCRIPTION

Talk given at CISA 2013, Barcelona, 26 September 2013

TRANSCRIPT

Biodiversity informatics: why aren’t we there yet?

@rdmpage

http://iphylo.blogspot.com

I’ve often said I want a Google for biodiversity data…

…turns out what I should have asked for was an NSA for biodiversity

• There are known knowns, things we know that we know

• There are known unknowns, things we now know we don’t know

• But there are also unknown unknowns, things we do not know we don't know

[Diagram: 2×2 grid labelled known / unknown by knowns / unknowns]

What do these diagrams tell us?

Implications

• Sequencing is cheap

• The flood of sequences is only going to increase

• How much of this is relevant to biodiversity?

Numbers of new animal names (chart annotated with 1923, WWI and WWII)

Implications

• Rate of new taxa being described is relatively constant

• Suggests taxonomists are working at capacity

• Most taxonomic work is in the past

• Compare this to the exponential growth of sequencing

Mammals in GenBank (chart comparing proper Linnaean names with “Aus sp.” names)

Mammals and “Invertebrates” (charts comparing proper Linnaean names with “Aus sp.” names)

BOLD

Dark taxa

• Disconnect between taxonomy and genomics

• How much of this comprises taxa we already know about versus new diversity?

• Do we need taxonomic names?
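One rough way to put a number on the "dark taxa" problem is to count how many names in a list look like proper Linnaean names and how many look like "Aus sp." placeholders. The sketch below is a heuristic only: the function names `is_dark_taxon` and `dark_fraction`, the regular expression, and the list of informal markers are all assumptions made for illustration, not the method behind the charts in this talk.

```python
import re

# Heuristic pattern (an assumption): "Genus species" with an optional
# third epithet, letters only, genus capitalised.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+(?:-[a-z]+)?(?: [a-z]+)?$")

# Tokens that usually mark an informal or provisional name (illustrative list).
INFORMAL = {"sp", "sp.", "spp", "spp.", "cf", "cf.", "aff", "aff.", "nr", "nr."}


def is_dark_taxon(name: str) -> bool:
    """Return True if the name does not look like a proper Linnaean name."""
    tokens = name.strip().split()
    if any(token.lower() in INFORMAL for token in tokens):
        return True
    return BINOMIAL.match(" ".join(tokens)) is None


def dark_fraction(names) -> float:
    """Fraction of names in an iterable that look like dark taxa."""
    names = list(names)
    if not names:
        return 0.0
    return sum(is_dark_taxon(name) for name in names) / len(names)
```

A check like this will misclassify some names (hybrids, names with authorship strings, viruses), so the result is better read as an estimate of scale than as an exact count.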

100,000 articles from http://biostor.org (BHL) (chart annotated with 1923 and today)

Scanned legacy

• BHL is more than pre-1923 literature

• The real gap is post-1923 to pre-open access (2003)

• Most of the 20th century taxonomic literature is “dark”

Size of Wikipedia articles on mammals

Few, large articles

Many, small articles (“long tail”)

Power law

• We know a lot about a few species

• For most species we know very little (even in well-known groups)

• For poorly known species we need to go to the legacy literature

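The "few large, many small" pattern can be checked informally by ranking article sizes and fitting a straight line on log-log axes. The sketch below does this with an ordinary least-squares fit. The function name `rank_size_slope` is hypothetical, and the input is assumed to be a plain list of article sizes (for example, in bytes). It is not the analysis used for the chart in this talk.

```python
import math


def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank).

    Sizes are ranked from largest (rank 1) to smallest. A slope close to a
    constant negative value across the whole range is consistent with a
    power law, though it does not prove one.
    """
    ordered = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive sizes")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A straight line on a log-log plot is only suggestive. Other heavy-tailed distributions can look similar, so a fit like this is a first look rather than a test.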

PanTHERIA (2009) (chart annotated with 1923 and 2003)

Legacy literature

• Legacy literature matters (even for well-studied taxa)

• Much of this will be in digitally “dark” period

Publishers of taxonomy (# articles)

http://bionames.org

Publishers

• BioStor (BHL) is the single largest source of taxonomic literature

• Lots of tiny publishers (long tail)

• Commercial publishers are important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne)

• Who do we talk to about data mining?

Taxonomic journals (articles/decade)

Implications

• Zootaxa is indeed a “mega journal”

• If we had to pick one journal to data mine, it would be Zootaxa

GBIF

• The Global Biodiversity Information Facility is not evenly “global”

• GBIF data tell us as much about sampling as about the distribution of diversity

Flickr EOL group

Crowd sourcing

• Where is the “crowd”?

• It’s where the iPhones are…

GenBank animal sequences

GenBank host records

Implications

• GenBank is about more than genes

• GenBank has a wealth of information on location and ecological interactions

Implications

• Phylogenetic data is not being archived (why not?)

• Makes it hard to reproduce studies

• Does data matter?

• What level of granularity should be citable?

What do these diagrams tell us?
