HIGHLIGHTING FITNESS-FOR-USE OF PUBLISHED BIODIVERSITY DATA

Javier Otegui, Arturo H. Ariño
University of Navarra

JAVIER OTEGUI & ARTURO H. ARIÑO: HIGHLIGHTING FITNESS-FOR-USE OF PUBLISHED BIODIVERSITY DATA. TDWG2012, BEIJING, 22-X-2012

PUBLISHING DATA

• Data published in papers
• Data papers published
• Data published

WHAT, WHERE, WHEN

OUR TARGET DATA

Primary Biodiversity Data Record (PBR)

WHAT, WHERE, WHEN

PBR

Megaptera novaehollandiae
Adult female, live

Off North Truro, MA, USA
42.101 N, 70.169 W

2010.09.29 21:47 GMT

Arturo H. Ariño
Aboard Dolphin VI
Canon Eos 450D, 200 mm lens

un

GBIF GEOREFERENCED DATA

237,348,923 animal data records by Oct. 2012 (total georeferenced records: 327,048,532)

THE CASE FOR DIRECT DATA PUBLICATION

• Access to massive data is increasingly common: GBIF

• Spectrum of possible uses increasing: new science, new paradigms

• Data-Intensive Science
  – Reliance on good data: Opportunity for discovery
  – Reliance on bad data: Risk of “undiscoveries”

WHAT, WHERE, WHEN

PBR

Nautilus pompilus
4 specimens

Off Palau Islands

1921

Legit: unknown
Det.: J.A. Salinas
Collection: JDR at MZNA

un

BARRIERS TO DIRECT DATA PUBLICATION

• Data availability
• Data sharing mechanisms
• Data publication incentives
• Data quality

DATA AVAILABILITY INCREASE

GBIF, October 2012

Ariño, 2010. Biodiv. Informat. 7: 15-26

ESTIMATED DATA IN MISSING COLLECTIONS

[Figure labels: BCI; GBIF; GSAP –DNHC survey; unknown; est. CI]

Cexp = 8.37K

Nexp = 2.01G

WHAT, WHERE, WHEN

PBR

Saccharomyces cerevisae
TCP1-beta

Cask in wreck of ARGO
2 km E of Akta Képhalos

Stratum 2000 BC

Legit: Homer S.
Det.: LoScanSQ-X
Collection: Museum of Beer History (MBH)

un??

THE TRUST PARADOX

• Papers are generally more trusted than raw/downloadable data
  – Papers have gone through peer review

• Published data have common sources:
  – Experiments,
  – Observations,
  – Digitizations

• Raw data in published papers can go unreviewed
  – Review focuses on soundness, methods, conclusions
  – Data assumed to be true & correct

• Direct publication of data, in fact, should facilitate revision
  – Enforcing rules
  – Filtering
  – Pattern detection

FITNESS-FOR-USE

• FFU defines whether data can be used for a specific purpose

• Useful compromise for publishing data
• FFU not equal to data quality

Quality                            Fitness-for-use
Intrinsic to data                  Depends on intended use
Conceptual                         Pragmatic
Good quality predicts good FFU     Good FFU does not predict good quality

FFU ASSESSMENT

• In 2006 AHA started analyzing our own DB for FFU, creating pattern-detection visualizations
  – First reported in TDWG-2006 (St. Louis)

• In 2008 we started to analyze raw & processed GBIF data (2.4G records by 2012) (JOT’s thesis)
  – Building on works by Chapman, Yesson, Wieczorek, etc., changing scope and perspective

• Started producing reports in 2009, 2010
• Teamed up with GBIF-Sec, 2011
• Created BIDDSAT, 2012

Latitude, Longitude

30.37, -87.1765
48.5584, -123.463
42.3487, -123.78
41.7866, -100.061
-73.9071, 42.7028
38.8749, -104.88
44.3964, -75.6668
42.1927, -89.106
32.693, -79.9606
44.2124, -88.42
41.6637, -81.3782
39.6992, -121.778
46.13, -72.7196
38.7231, -77.0674
27.7349, -82.6479
36.0852, -121.616
39.0901, -77.5203
-83.1662, 43.0622
41.2956, -74.5956
45.5146, -73.8131
42.0755, -122.759
41.1047, -81.4944
42.4792, -89.0333
40.6956, -74.8913
...

[Figure; scale from - to +]

Otegui & Ariño, 2009. Proceedings of the TDWG 2009 Annual Conference, Montpellier, FR

DATA INDEXING

[Diagram labels: Provider A, Provider B, Provider C, Provider D; GBIF index; ?]

DATA QUERYING

[Diagram labels: Provider A, Provider B, Provider C, Provider D; GBIF index; ?]

Detailed assessment of GBIF

Bad data Good data

Otegui, Ariño, Gaiji & Chavan, in press

CONTROL AND FFU TOOLS AT INDEXING

Gaiji et al., 2011-2012 – EMBARGOED DECEMBER 2012

CORRECTING MECHANISMS: EXAMPLES

• GBIF has implemented many georeferencing correction algorithms, such as coordinate/country matching

• This removes many bogus data points, for example by correcting reversed lat/long when serving data

• Still, original data need to be corrected: GBIF cannot alter original data (only tag them)

David Remsen, TDWG-2011. In ViBRANT, http://vbrant.eu/content/gbif-integration
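
The coordinate/country match described above can be sketched as follows. This is a minimal illustration, not GBIF's actual algorithm: the bounding box, the `Record` class and the tag names are all hypothetical, and the example assumes the record is declared as coming from the contiguous USA. The original latitude and longitude are left untouched; only a tag is added and the corrected pair is returned for serving.

```python
from dataclasses import dataclass, field

# Hypothetical, very rough bounding box for the contiguous USA:
# (min_lat, max_lat, min_lon, max_lon). An assumption for illustration only.
USA_BBOX = (24.0, 50.0, -125.0, -66.0)


@dataclass
class Record:
    lat: float
    lon: float
    tags: list = field(default_factory=list)


def in_bbox(lat, lon, bbox):
    """True if the point falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def served_coordinates(record, bbox):
    """Return the (lat, lon) pair to serve, or None if no match is found.

    record.lat and record.lon are never modified; problems are only tagged.
    """
    if in_bbox(record.lat, record.lon, bbox):
        return record.lat, record.lon
    # Try the reversed pair before giving up.
    if in_bbox(record.lon, record.lat, bbox):
        record.tags.append("LAT_LON_PROBABLY_SWAPPED")  # hypothetical tag name
        return record.lon, record.lat
    record.tags.append("COORDINATE_COUNTRY_MISMATCH")  # hypothetical tag name
    return None
```

A pair such as -73.9071, 42.7028 is a valid coordinate on its own, so a plain range check would not catch it; only the comparison with the declared country reveals the probable swap.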

FILTER MECHANISMS: EXAMPLE

2010.04.28

2011.14.09 ErrCode: 10

10: Fields “month” and “day” probably swapped

• Original data unchanged
• Index entry corrected
• Error entry generated in issue log
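
The filter example above can be sketched in a few lines. This is an illustration under stated assumptions, not the GBIF indexing code: the function names are hypothetical, dates are assumed to arrive as `YYYY.MM.DD` strings, and only error code 10 and its message come from the slide. The sketch does not check the day against the length of the month.

```python
def check_date(raw):
    """Return (index_entry, issue) for a 'YYYY.MM.DD' string.

    The raw string is never modified; a corrected copy is returned instead.
    """
    year, month, day = (int(part) for part in raw.split("."))
    if 1 <= month <= 12:
        return raw, None
    # Month is out of range; if the day would be a valid month, assume a swap.
    if 1 <= day <= 12 and 1 <= month <= 31:
        corrected = f"{year:04d}.{day:02d}.{month:02d}"
        issue = {"code": 10, "message": 'Fields "month" and "day" probably swapped'}
        return corrected, issue
    # Hypothetical fallback: no code is given in the source for this case.
    return None, {"code": None, "message": "Date not interpretable"}


def index_dates(raw_dates):
    """Build the index entries and the issue log, leaving raw_dates as is."""
    index, issue_log = [], []
    for raw in raw_dates:
        entry, issue = check_date(raw)
        index.append(entry)
        if issue is not None:
            issue_log.append({"original": raw, **issue})
    return index, issue_log
```

Calling `index_dates(["2010.04.28", "2011.14.09"])` leaves the first date as is, indexes the second as `2011.09.14`, and logs one issue with code 10 against the original string.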

FILTERS CANNOT GET ALL

[Figure; scale from - to +]

Otegui et al., 2012

Otegui, Ariño, Gaiji & Chavan, in press

FILTERS CANNOT SOLVE ALL

Modified from Otegui et al., in press

All GBIF data

Some date element wrong

Some date element missing

All date elements missing
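
The categories above suggest a simple per-record classification. The sketch below is one possible reading, not the authors' method: the category names, the validity ranges (year 1 to 2012, month 1 to 12, day 1 to 31) and the rule that "wrong" takes precedence over "missing" are all assumptions made for illustration.

```python
from collections import Counter


def classify_date(year, month, day, max_year=2012):
    """Classify one record by the state of its date elements (None = missing)."""
    elements = (year, month, day)
    if all(e is None for e in elements):
        return "all_missing"
    wrong = (
        (year is not None and not 1 <= year <= max_year)
        or (month is not None and not 1 <= month <= 12)
        or (day is not None and not 1 <= day <= 31)
    )
    if wrong:
        return "some_wrong"
    if any(e is None for e in elements):
        return "some_missing"
    return "complete"


def summarize(records):
    """Count records per category; records are (year, month, day) tuples."""
    return Counter(classify_date(*rec) for rec in records)
```

A record dated only "1921" would fall under `some_missing`, while one with month 14 would fall under `some_wrong`.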

BIDDSAT

• Tool to detect space-time and other patterns
• Applicable to data publishers sharing data through GBIF
• Uses tailored visualizations
• http://www.unav.es/unzyec/mzna/biddsat/
• Open source: https://github.com/jotegui/BIDDSAT
• Bioinformatics, DOI: 10.1093/bioinformatics/BTS359

DATA COMPLETENESS

[Histogram: number of collections (0 to 60) against percentage of completeness (0 to 100)]

Source: BIDDSAT

DATA COMPLETENESS

[Histogram: number of collections (0 to 60) against percentage of completeness (0 to 100)]

• Wrong implementation of exchange standards (DwC) – solvable

• Data loss – not solvable

• Limited room for improvement

Source: BIDDSAT
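
A completeness histogram of this kind could be computed along the following lines. This is a sketch only: BIDDSAT's actual definition of completeness is not given here, so the sketch assumes it is the share of non-empty values over a chosen set of fields. The field names are Darwin Core terms picked for illustration, and the function names are hypothetical.

```python
# Darwin Core terms chosen for illustration; the real set of fields is an assumption.
FIELDS = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def completeness(records, fields=FIELDS):
    """Percentage of non-empty field values across all records of one collection."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for rec in records for name in fields if rec.get(name) not in (None, "")
    )
    return 100.0 * filled / total


def histogram(percentages, bin_width=10):
    """Count collections per completeness bin; 100% falls in the last bin."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for p in percentages:
        counts[min(int(p // bin_width), n_bins - 1)] += 1
    return counts
```

A field that is absent because the exchange standard was mapped incorrectly can be recovered by the publisher, whereas a value lost at the source stays empty, which is why the histogram can only improve up to a point.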

Data Provider LEONIDAS, Resource SHIELD
GBIF 2008/05 Version

Data Provider LEONIDAS, Resource SHIELD
GBIF 2009/09 Version

[Figure: chronhorogram. Day-of-year axis (1/Jan to 31/Dec, with seasons Winter, Spring, Summer, Fall); year axis 1750 to 2012; scale from - to +]

Chronhorogram. Introduced by Ariño & Otegui, 2008, TDWG

Source: BIDDSAT

[Figure: hebdogram; scale from - to +]

Hebdogram. Introduced by Ariño & Otegui, 2008. Proceedings of TDWG

Ariño, Otegui & Robles, 2009

Provider 180
All datasets

2008/05

Data Provider Codename: BORODIN

Data Provider Codename: BORODIN
2009/09
2008/05

[Treemap labels: Actinopterygii, Chordata, Animalia]

Cell surface:
• Number of species
• Number of records (PBR)

Treemap by Google Charts API on authors’ data

INDEX TAXONOMY

[Figure labels: Animalia, Plantae, Fungi]

Gaiji et al. in press

• Patchy data publishing… also in papers

• Opportunistic behavior: “Low-hanging fruit”

• Data can (and will) evolve

• The human factor still counts

PATTERNS OF PATTERNS

PAPER WOES: PBR FROM LITERATURE

[Plot: imprecision class (0 to 25) against distance class (0 to 15)]

[Bar chart, scale 0 to 40000, by taxon: Chordata, Orthoptera, Lepidoptera, Hymenoptera, Diptera, Coleoptera, Thysanoptera, Collembola, Acari, Polychaeta, Oligochaeta, Nematoda. Legend: georeferenced; locality without coordinates; no locality]

Publisher: Swedish

Publisher: German

Publisher: French

Publisher: British

Publisher: Norwegian

A MATTER OF CONVENIENCE

Otegui, Robles & Ariño, 2009. eBiosphere, London, UK.

Publisher: Parisian

Publisher: Spanish

CLASSIFICATION ACCORDING TO:

Ariño, Otegui & Robles, 2009

PROVIDER

SP2K

GBIF RECORDS

SAMPLE

EVOLUTIONARY DATA

THE END

THANK YOU

WITH SPECIAL THANKS TO:

VISHWAS CHAVAN, SAMY GAIJI, ANDREA HAHN, TIM ROBERTSON, AND THE DIGIT SCIENCE SUBCOMMITTEE AND THE GSAP-NHC AND CNA TASK GROUPS

THE GBIF SECRETARIAT (COPENHAGEN) AND THE SPANISH COORDINATION NODE (GBIF.ES)

ESTRELLA ROBLES AND THE PEOPLE AT THE DEPARTMENT OF ZOOLOGY AND ECOLOGY (UNZYEC),

THE UNIVERSITY OF NAVARRA

No bytes were seriously harmed while preparing this PPTX. (And copies exist of those that actually were anyway.)

This file used 328 watt-hours, offset by forfeiting Cantonese roast duck for far too long.

All images, plots and analyses by the authors except where otherwise noted

PPTX © 2012 A.H. Ariño, University of Navarra

www.unav.es/unzyec

BIDDSAT, WWW.UNAV.ES/UNZYEC/MZNA/BIDDSAT/, WWW.NCBI.NLM.NIH.GOV/PUBMED/22730433. SOON IN A PDF NEAR YOU.