HIGHLIGHTING FITNESS-FOR-USE OF PUBLISHED BIODIVERSITY DATA
Javier Otegui, Arturo H. Ariño
University of Navarra
JAVIER OTEGUI & ARTURO H. ARIÑO: HIGHLIGHTING FITNESS-FOR-USE OF PUBLISHED BIODIVERSITY DATA. TDWG2012, BEIJING, 22-X-2012
PUBLISHING DATA
• Data published in papers
• Data papers published
• Data published
WHAT, WHERE, WHEN
OUR TARGET DATA
Primary Biodiversity Data Record (PBR)
WHAT, WHERE, WHEN
PBR
Megaptera novaeangliae
Adult female, live
Off North Truro, MA, USA
42.101 N, 70.169 W
2010.09.29 21:47 GMT
Arturo H. Ariño
Aboard Dolphin VI
Canon EOS 450D, 200 mm lens
un
GBIF GEOREFERENCED DATA
237,348,923 animal data records by Oct. 2012 (total georeferenced records: 327,048,532)
THE CASE FOR DIRECT DATA PUBLICATION
• Access to massive data is increasingly commonplace: GBIF
• Spectrum of possible uses increasing: new science, new paradigms
• Data-Intensive Science
  – Reliance on good data: Opportunity for discovery
  – Reliance on bad data: Risk of “undiscoveries”
WHAT, WHERE, WHEN
PBR
Nautilus pompilius
4 specimens
Off Palau Islands
1921
Legit: unknown
Det.: J.A. Salinas
Collection: JDR at MZNA
un
BARRIERS TO DIRECT DATA PUBLICATION
• Data availability
• Data sharing mechanisms
• Data publication incentives
• Data quality
DATA AVAILABILITY INCREASE
GBIF, October 2012
Ariño, 2010. Biodiv. Informat. 7: 15-26
ESTIMATED DATA IN MISSING COLLECTIONS
[Figure labels: BCI, GBIF, GSAP –DNHC survey, unknown, est. CI; Cexp = 8.37K, Nexp = 2.01G]
WHAT, WHERE, WHEN
PBR
Saccharomyces cerevisiae
TCP1-beta
Cask in wreck of ARGO
2 km E of Akta Képhalos
Stratum 2000 BC
Legit: Homer S.
Det.: LoScanSQ-X
Collection: Museum of Beer History (MBH)
un??
THE TRUST PARADOX
• Papers are generally more trusted than raw/downloadable data
  – Papers have gone through peer review
• Published data have common sources:
  – Experiments,
  – Observations,
  – Digitizations
• Raw data in published papers can go unreviewed
  – Review focuses on soundness, methods, conclusions
  – Data assumed to be true & correct
• Direct publication of data, in fact, should facilitate revision
  – Enforcing rules
  – Filtering
  – Pattern detection
FITNESS-FOR-USE
• FFU defines whether data can be used for a specific purpose
• Useful compromise for publishing data
• FFU not equal to data quality

Quality | Fitness-for-use
Intrinsic to data | Depends on intended use
Conceptual | Pragmatic
Good quality predicts good FFU | Good FFU does not predict good quality
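The dependence of FFU on the intended use can be sketched in a few lines. In this minimal illustration, fitness is reduced to a yes/no check of one record against the fields a given use needs. The record layout, the three example uses and the `fit_for_use` helper are assumptions made for the example; they are not part of GBIF or BIDDSAT.

```python
# Hypothetical record: "what" is known, "where" is a vague locality,
# "when" is a year only.
record = {
    "scientific_name": "Nautilus pompilius",
    "locality": "Off Palau Islands",
    "latitude": None,
    "longitude": None,
    "year": 1921,
    "month": None,
    "day": None,
}

# Hypothetical uses and the fields each one needs (assumed for illustration).
REQUIRED_FIELDS = {
    "regional checklist": ["scientific_name", "locality"],
    "niche modelling": ["scientific_name", "latitude", "longitude"],
    "phenology study": ["scientific_name", "year", "month", "day"],
}


def fit_for_use(rec: dict, use: str) -> bool:
    """Return True when every field the use requires is present in the record."""
    return all(rec.get(field) is not None for field in REQUIRED_FIELDS[use])


for use in REQUIRED_FIELDS:
    verdict = "fit" if fit_for_use(record, use) else "not fit"
    print(f"{use}: {verdict}")
```

The same record passes for one use and fails for the others, while its intrinsic quality stays the same.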
FFU ASSESSMENT
• In 2006 AHA started analyzing our own DB for FFU, creating pattern-detection visualizations
  – First reported in TDWG-2006 (St. Louis)
• In 2008 we started to analyze raw & processed GBIF data (2.4G records by 2012) (JOT’s thesis)
  – Building on works by Chapman, Yesson, Wieczorek, etc., changing scope and perspective
• Started producing reports in 2009, 2010
• Teamed up with GBIF-Sec, 2011
• Created BIDDSAT, 2012
Latitude, Longitude
30.37, -87.1765
48.5584, -123.463
42.3487, -123.78
41.7866, -100.061
-73.9071, 42.7028
38.8749, -104.88
44.3964, -75.6668
42.1927, -89.106
32.693, -79.9606
44.2124, -88.42
41.6637, -81.3782
39.6992, -121.778
46.13, -72.7196
38.7231, -77.0674
27.7349, -82.6479
36.0852, -121.616
39.0901, -77.5203
-83.1662, 43.0622
41.2956, -74.5956
45.5146, -73.8131
42.0755, -122.759
41.1047, -81.4944
42.4792, -89.0333
40.6956, -74.8913
...
Otegui & Ariño, 2009. Proceedings of the TDWG 2009 Annual Conference, Montpellier, FR
DATA INDEXING
[Diagram labels: Provider A, Provider B, Provider C, Provider D, GBIF index, ?]
DATA QUERYING
[Diagram labels: Provider A, Provider B, Provider C, Provider D, GBIF index, ?]
Detailed assessment of GBIF
Bad data Good data
Otegui, Ariño, Gaiji & Chavan, in press
CONTROL AND FFU TOOLS AT INDEXING
Gaiji et al., 2011-2012 – EMBARGOED DECEMBER 2012
CORRECTING MECHANISMS: EXAMPLES
• GBIF has implemented many georeferencing correction algorithms, such as coordinate/country matching
• This removes many bogus data points, for example by correcting reversed lat/long when serving data
• Still, original data need to be corrected: GBIF cannot alter original data (only tag them)
David Remsen, TDWG-2011. In ViBRANT, http://vbrant.eu/content/gbif-integration
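A coordinate/country match of this kind can be sketched as follows. This is only an illustration, not GBIF's algorithm: the country is approximated by a rough bounding box (the values in `EXPECTED` are assumed, roughly the contiguous USA and southern Canada), and the `BoundingBox` class and `check_coordinates` function are hypothetical names. A pair that falls outside the box as given but inside it when reversed is flagged as swapped and reversed only in the served values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, assumed box for the region the records claim to come from.
EXPECTED = BoundingBox(min_lat=24.0, max_lat=50.0, min_lon=-125.0, max_lon=-66.0)


def check_coordinates(lat: float, lon: float, box: BoundingBox = EXPECTED):
    """Classify a pair as 'ok', 'swapped' or 'suspect'.

    Returns (status, served_lat, served_lon). The original pair is never
    modified; a swapped pair is only reversed in the returned values.
    """
    if box.contains(lat, lon):
        return "ok", lat, lon
    if box.contains(lon, lat):
        return "swapped", lon, lat
    return "suspect", lat, lon


pairs = [(30.37, -87.1765), (-73.9071, 42.7028), (-83.1662, 43.0622), (0.0, 0.0)]
for lat, lon in pairs:
    print((lat, lon), "->", check_coordinates(lat, lon))
```

A bounding box is a crude stand-in for a country outline. It also shows why a range check alone is not enough: -73.9071 is a valid latitude, so only the mismatch with the expected region reveals the swap.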
FILTER MECHANISMS: EXAMPLE
2010.04.28
2011.14.09 ErrCode: 10
10: Fields “month” and “day” probably swapped
• Original data unchanged
• Index entry corrected
• Error entry generated in issue log
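The three steps of this example can be sketched as a small filter. The sketch assumes the dates are written year.month.day, as 2010.04.28 suggests; the `index_date` function and the shape of the issue-log entries are hypothetical, and only the error code 10 and its message come from the slide.

```python
import datetime as dt

ERR_MONTH_DAY_SWAPPED = 10  # error code shown on the slide


def index_date(year: int, month: int, day: int, issue_log: list):
    """Return the date to store in the index, or None if it cannot be repaired.

    The provider's values are only read, never modified.
    """
    try:
        return dt.date(year, month, day)
    except ValueError:
        pass
    # Month out of range but day usable as a month: try the swapped reading.
    if month > 12 and 1 <= day <= 12:
        try:
            corrected = dt.date(year, day, month)
        except ValueError:
            return None
        issue_log.append({
            "original": (year, month, day),
            "err_code": ERR_MONTH_DAY_SWAPPED,
            "message": "Fields 'month' and 'day' probably swapped",
        })
        return corrected
    return None


log = []
print(index_date(2010, 4, 28, log))   # valid date, indexed as given
print(index_date(2011, 14, 9, log))   # repaired in the index, issue logged
print(log)
```

A date such as 2011.05.09 would pass unnoticed even if its month and day were swapped, since both readings are valid.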
Otegui et al., 2012
FILTERS CANNOT GET ALL
Otegui, Ariño, Gaiji & Chavan, in press
FILTERS CANNOT SOLVE ALL
Modified from Otegui et al., in press
All GBIF data
Some date element wrong
Some date element missing
All date elements missing
BIDDSAT
• Tool to detect space-time and other patterns
• Applicable to data publishers sharing data through GBIF
• Uses tailored visualizations
• http://www.unav.es/unzyec/mzna/biddsat/
• Open source: https://github.com/jotegui/BIDDSAT
• Bioinformatics, DOI: 10.1093/bioinformatics/BTS359
DATA COMPLETENESS
[Histogram: number of collections (0 to 60) by percentage of completeness (0 to 100). Source: BIDDSAT]
DATA COMPLETENESS
[Histogram: number of collections (0 to 60) by percentage of completeness (0 to 100). Source: BIDDSAT]
• Wrong implementation of exchange standards (DwC) – solvable
• Data loss – not solvable
• Limited room for improvement
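One way to arrive at a histogram like this one is sketched below. The definition of completeness used here (share of filled slots over a fixed set of fields) is an assumption for illustration and may differ from the one BIDDSAT applies; the field names are Darwin Core terms, while the `completeness` and `histogram` helpers and the sample collections are hypothetical.

```python
from collections import Counter

# Darwin Core terms chosen for the example; the selection itself is an assumption.
CORE_FIELDS = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def is_filled(value) -> bool:
    return value is not None and str(value).strip() != ""


def completeness(records, fields=CORE_FIELDS) -> float:
    """Percentage (0 to 100) of field slots that are filled across the records."""
    if not records:
        return 0.0
    filled = sum(is_filled(rec.get(f)) for rec in records for f in fields)
    return 100.0 * filled / (len(records) * len(fields))


def histogram(collections, bin_width=10):
    """Count collections per completeness bin; 100% falls in the last bin."""
    counts = Counter()
    for records in collections.values():
        pct = completeness(records)
        lower = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


collections = {
    "collection_a": [
        {"scientificName": "Apis mellifera", "decimalLatitude": 42.8,
         "decimalLongitude": -1.6, "eventDate": "1998-05-12"},
        {"scientificName": "Apis mellifera", "decimalLatitude": None,
         "decimalLongitude": None, "eventDate": "1998-05-13"},
    ],
    "collection_b": [
        {"scientificName": "Bombus terrestris", "decimalLatitude": "",
         "decimalLongitude": "", "eventDate": ""},
    ],
}

for name, records in collections.items():
    print(name, completeness(records))
print(histogram(collections))
```

A field left empty because the exchange standard was mapped wrongly and a field that is empty because the information was never recorded look identical in this count, which is why the two causes have to be told apart at the source.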
Data Provider LEONIDAS, Resource SHIELD
GBIF 2008/05 Version
Data Provider LEONIDAS, Resource SHIELD
GBIF 2009/09 Version
[Figure: day-of-year scale from 1/Jan to 31/Dec with monthly marks (1/Feb to 1/Dec), seasons (Winter, Spring, Summer, Fall), year scale from 1750 to 2012, color scale from - to +]
Chronhorogram. Introduced by Ariño & Otegui, 2008, TDWG
Source: BIDDSAT
- +
Hebdogram. Introduced by Ariño & Otegui, 2008. Proceedings of TDWG
Ariño, Otegui & Robles, 2009
Provider 180
All datasets
2008/05
Data Provider Codename: BORODIN
Data Provider Codename: BORODIN
2009/09
2008/05
[Treemap labels: Animalia, Chordata, Actinopterygii]
Cell surface:
• Number of species
• Number of records (PBR)
Treemap by Google Charts API on authors’ data
INDEX TAXONOMY
[Figure labels: Animalia, Plantae, Fungi]
Gaiji et al., in press
• Patchy data publishing… also in papers
• Opportunistic behavior: “Low-hanging fruit”
• Data can (and will) evolve
• The human factor still counts
PATTERNS OF PATTERNS
PAPER WOES: PBR FROM LITERATURE
[Plot: imprecision class (0 to 25) against distance class (0 to 15)]
[Bar chart, scale 0 to 40000, by taxon: Chordata, Orthoptera, Lepidoptera, Hymenoptera, Diptera, Coleoptera, Thysanoptera, Collembola, Acari, Polychaeta, Oligochaeta, Nematoda. Categories: Georeferenced, Locality without coordinates, No locality]
Publisher: Swedish
Publisher: German
Publisher: French
Publisher: British
Publisher: Norwegian
A MATTER OF CONVENIENCE
Otegui, Robles & Ariño, 2009. eBiosphere, London, UK.
Publisher: Parisien
Publisher: Spanish
CLASSIFICATION ACCORDING TO:
Ariño, Otegui & Robles, 2009
PROVIDER
SP2K
GBIF RECORDS
SAMPLE
EVOLUTIONARY DATA
T H E E N D
THANK YOU
WITH SPECIAL THANKS TO:
VISHWAS CHAVAN, SAMY GAIJI, ANDREA HAHN, TIM ROBERTSON, AND THE DIGIT SCIENCE SUBCOMMITTEE AND THE GSAP-NHC AND CNA TASK GROUPS
THE GBIF SECRETARIAT (COPENHAGEN) AND THE SPANISH COORDINATION NODE (GBIF.ES)
ESTRELLA ROBLES AND THE PEOPLE AT THE DEPARTMENT OF ZOOLOGY AND ECOLOGY (UNZYEC),
THE UNIVERSITY OF NAVARRA
No bytes were seriously harmed while preparing this PPTX. (And copies exist of those who actually were anyway).
This file used 328 watt-hours, offset by forfeiting Cantonese roast duck for far too long.
All images, plots and analyses by the authors except where otherwise noted
PPTX © 2012 A.H. Ariño, University of Navarra
www.unav.es/unzyec
BIDDSAT, WWW.UNAV.ES/UNZYEC/MZNA/BIDDSAT/, WWW.NCBI.NLM.NIH.GOV/PUBMED/22730433. SOON IN A PDF NEAR YOU.