The biodiversity informatics landscape: a systematics perspective
Post on 27-Jan-2015
Vince Smith
The biodiversity informatics landscape: a systematics perspective
Biodiversity Informatics Horizons
Rome, 3-6 Sept 2013
Overview
1. Background – the biodiversity informatics domain
• The problem (i.e. why are we here)
• Representations of the domain (data, infrastructures, projects…)
• Toward an integrated view (strategy)
2. Social challenges
• Openness
• Collaboration and communities
• Standards, identifiers & protocols
3. (Big) data challenges
• Mobilising existing data (metadata, literature, collections)
• New forms of data ([meta]genomics & observatories)
4. Synthetic challenges
• Data aggregation & linking
• Visualisation
• Modelling
5. Next steps (data infrastructures & funding)
• Lessons learned: new informatics opportunities in H2020
1. Background
The problem – integrating biodiversity research
How do we join up these activities? How do we use this as a tool?
• Species conservation & protected areas
• Impacts of human development
• Biodiversity & human health
• Impacts of climate change
• Food, farming & biofuels
• Invasive alien species
What infrastructures do we need? (technologies, tools, standards…)
What processes do we need? (modelling, workflows…)
What data do we need? (genes, localities…)
Natural History – the foundation
“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.”
C. Darwin, “On the Origin of Species”, 1859
Darwin’s “tangled bank”… Systematics, a foundational “law”
Ecological interactions
A granular understanding of biodiversity
Figure labels: Genes (sequence data); Individuals; Populations; Local populations; Species; Global biodiversity; Interactions (species interaction matrix); Biological networks; GenBank
Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)
Figure adapted from Peterson et al 2010
An informatician's view of biodiversity
A project-centric view of biodiversity
A snapshot from 2009, “the dance of the initiatives”
The strategic view: community informatics challenges
GBIF GBIC Report (coming soon)
EU Biodiversity Strategy (2011)
Biodiv. Inf. Challenges (2013)
Grand Challenges for Biodiversity Informatics (integrating activities for H2020)
2. Social challenges
- Openness
- Collaboration and communities
- Standards, identifiers & links
Openness in biodiversity informatics
E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels, 2004-2011, June 2013, Science-Metrix Inc.
“One-half of all papers are now freely available within a year or two of publication”
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike.” http://opendefinition.org/
Many kinds of openness:
• Open Access
• Open Data
• Open Science
• Open Source
• Sharing data is a foundation for our activities
• Normal practice in some communities (molecular)
• Mandated by some funders & governments
Need to continue to incentivise openness
Incentivise through credit via citation (e.g. BDJ)
What are Scratchpads? (http://scratchpads.eu)
Scratchpads for: taxa, projects, regions, societies
• 544 Scratchpad communities
• 6,644 active registered users
• 91,631 taxa covered
• 535,317 pages
• 81 paper citations in 2012
• More than 1,300,000 visitors in total
e.g., Scratchpad Virtual Research Communities
Collaboration & communities
Making taxonomy a team sport
Our infrastructures need to facilitate collaboration
Standards, identifiers & protocols
Standards can’t be developed in isolation – they must be used
Key requirements:
• Need to be inclusive, practical & extensible
• Readable by humans & machines
• Widely used
Good examples:
• Darwin Core
• CrossRef & DataCite DOIs
• ORCID author identifiers
Gaps / problems
• Reuse & persistence of identifiers
• Vocabularies & ontologies (time consuming / little reward)
Potential solutions
• Build them into our credit systems
• Show semantic reasoning potential (LOD & RDF demonstrators)
A foundation for integration
Facilitating data sharing across communities
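To make "readable by humans & machines" concrete, here is a minimal Python sketch of one occurrence record keyed by Darwin Core term names. The term names (occurrenceID, basisOfRecord, scientificName, eventDate, country, decimalLatitude, decimalLongitude) are standard Darwin Core terms; the values, the identifier and the helper functions are invented for illustration and do not come from the talk.

```python
import csv
import io
import json

# Keys are Darwin Core terms; all values are made up for illustration.
record = {
    "occurrenceID": "urn:example:occurrence:0001",  # hypothetical identifier
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Pediculus humanus Linnaeus, 1758",
    "eventDate": "2013-09-03",
    "country": "Italy",
    "decimalLatitude": 41.9028,
    "decimalLongitude": 12.4964,
}


def to_csv(records):
    """Serialise records as a flat table: one header row, one row per occurrence."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise the same records as JSON for programmatic consumers."""
    return json.dumps(records, indent=2)


print(to_csv([record]))
print(to_json([record]))
```

Because both outputs share one agreed vocabulary, a curator can read the table and a harvester can parse the JSON without a per-project field mapping.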
3. (Big) data challenges
- Mobilising existing data
- New forms of data
Mobilising existing data
Collections
• 1.5-3B specimens in collections worldwide
• Fragmented efforts / heterogeneity of process
• Needs ambition (NHM: 20M in 5 yrs.) & coordination
Literature
• >300M pages of biodiversity literature
• BHL (41M pp.) an example of what can be done
• Needs sustainability & article metadata
Metadata registries
• Data about data (cheaper & scalable)
• e.g. bibliographic data, dataset portals
Informatics challenges
• Storage & persistence
• Automation & annotation
• Incentives to digitise & fitness for use
Collections, literature & metadata
How can we quickly, efficiently and cost effectively mobilise biological data at scale?
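A rough back-of-envelope sketch of the scale involved, using only the figures on the collections slide (1.5-3B specimens worldwide; the NHM ambition of 20M in 5 years). Applying the NHM rate to the global total, and the number of institutions working in parallel, are assumptions made purely for illustration.

```python
def years_to_digitise(total_specimens, specimens_per_year, institutions=1):
    """Years needed if `institutions` programmes each run at the same annual rate."""
    return total_specimens / (specimens_per_year * institutions)


# NHM ambition from the slide: 20M specimens in 5 years, i.e. 4M per year.
nhm_rate = 20_000_000 / 5

for total in (1.5e9, 3e9):
    alone = years_to_digitise(total, nhm_rate)
    # Assumption for illustration: 50 institutions digitising at the NHM rate.
    shared = years_to_digitise(total, nhm_rate, institutions=50)
    print(f"{total:.1e} specimens: {alone:.0f} years alone, {shared:.1f} years with 50 institutions")
# 1.5e+09 specimens: 375 years alone, 7.5 years with 50 institutions
# 3.0e+09 specimens: 750 years alone, 15.0 years with 50 institutions
```

Even an ambitious single-institution rate leaves centuries of work at global scale, which is why the coordination point matters.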
Bibliography of Life (RefFinder & RefBank)
BHL literature
NHM Digitisation
Mobilising & managing new forms of data
New molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (environmental sequencing) commonplace
• Becoming the primary route to understanding biodiversity
Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity
Informatics challenges
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map well to existing data infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?
3-4 June 2013, NHM
22 July, 2013
Metagenomics & ecological observatories
These new data types do not depend on traditional taxonomy & systematics
4. Synthetic challenges
- Data aggregation & linking
- Visualisation
- Modelling
Aggregation & linking
Portals bringing together distributed & diverse forms of data
Giving consistent and comprehensive access to all biological data
Several approaches, with different advantages
• Tightly coupled to a few data sources (e.g. eMonocot, CDM)
• Loosely coupled to many sources (e.g. BioNames, Wikipedia)
• Hybrid forms (e.g. Canadensys, EOL, GBIF)
Informatics challenges
• Portals are hard to sustain
• New methods of data discovery & access
• Create new windows (views) on content
• New data structures, new types of database
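As a hedged illustration of the loosely coupled approach, the Python sketch below links records from several independent sources through a crudely normalised taxon name. The source names, records, URL and the `normalise_name` heuristic are all hypothetical; this is not how any of the portals named above is implemented.

```python
from collections import defaultdict


def normalise_name(name):
    """Crude matching key: lower-case, keep only the first two words (genus + species)."""
    parts = name.strip().lower().split()
    return " ".join(parts[:2])


def aggregate(sources):
    """Group records from every source under a shared, normalised name key."""
    merged = defaultdict(list)
    for source_name, records in sources.items():
        for rec in records:
            key = normalise_name(rec["scientificName"])
            merged[key].append({"source": source_name, **rec})
    return dict(merged)


# Hypothetical records from three unrelated sources.
sources = {
    "specimens": [{"scientificName": "Pediculus humanus Linnaeus, 1758", "catalogNumber": "X-001"}],
    "literature": [{"scientificName": "pediculus humanus", "title": "Hypothetical article"}],
    "images": [{"scientificName": "Pthirus pubis (Linnaeus, 1758)", "url": "https://example.org/img/1"}],
}

linked = aggregate(sources)
for name, records in linked.items():
    print(name, [r["source"] for r in records])
```

String matching like this extends cheaply to many sources, but it misses synonyms and can conflate homonyms; a tightly coupled system would instead resolve names against a curated taxonomy.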
Scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles)
BioNames
Selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
eMonocot
Visualisation
Visually synthesizing large, linked biodiversity datasets
Making biodiversity data accessible & understandable
NHM specimen records
http://data.nhm.ac.uk/globe/
Research opportunities
• Tools integration (e.g. GeoCat, CartoDB)
• Span multiple audiences
Outreach opportunities
• Visually compelling storytelling
• Crowdsourcing tools (e.g. Notes From Nature)
Exploiting new technologies
• Touch screens
• Mobile
• Location awareness
Informatics challenges
• Very specific to individual use cases
• Sustainability issues
Modeling the biosphere: a (the) 30 year goal?
Conceptually has many potential uses
• Identifying trends
• Explaining patterns
• Making predictions
• Real-time alerts (when data contradict current knowledge)
• The ultimate policy tool
Major informatics challenges
• Technically very difficult (many years off)
• Needs effective prototypes & platforms
• Some first steps, e.g. OBOE, LEFT
Nature 2013, doi:10.1038/493295a
Reasoning across large, linked biodiversity datasets
A clear, singular, long-term vision, which biodiversity data can contribute to
5. Next steps
Lessons learned: new opportunities in H2020
PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges)
• Break out of discipline-, technical- & project-centric activities (they are unsustainable, inefficient & bad for science)
• Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities)
• Bridge the disconnect between informaticians & users (make the users informaticians & the informaticians users)
• Our products are well suited to address these challenges
• Use H2020 as a mechanism to achieve integration
How do we join up these activities?
QUESTIONS
Possible biodiversity informatics design principles*
1. Start with needs - focus on real user needs (not just the ‘official process’)
2. Do less - if someone else is doing it, link to it or use it
3. Design with data - prototype and test with real users on the live website
4. Do the hard work to make it simple - let the computer take the strain
5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
6. Build for inclusion – it’s easier in the long run
7. Understand context - we are designing for people, not a screen or a brand
8. Build digital services, not websites - there is life beyond the website
9. Be consistent, not uniform - every circumstance is different
10. Make things open: it makes things better - it’s more sustainable
= experience from 7 years with the Scratchpads
= lessons for infrastructures in H2020?
*https://www.gov.uk/designprinciples
Mobilising existing data: how to prioritise
Nick Poole, UK Collections Trust
Figure labels: CONTENT; METADATA; A LITTLE; A LOT
• Digitise a few things & invest in depth, description & promotion
• Digitise lots of things, put little effort into description & promotion
Figure labels (continued): FUN; OUTREACH; LEARNING; RESEARCH; AGGREGATION; DATA MINING; COLLECTIONS MANAGEMENT
Collaboration & communities
• Very few recent single-author papers
• Most (fundable) science is cross-disciplinary
• Need to incentivise data curation & annotation
• Need mechanisms to share annotations
Our infrastructures need to facilitate collaboration
Joppa et al, 2011
Figure: average dates when increasing numbers of taxonomists were involved in describing species (panels: cone snails, birds, mammals, amphibians, spiders, plants)
Making taxonomy a team sport