GBIF Checklist Bank and the Backbone

GBIF Checklist Bank Indexing & Backbone

Upload: markus-doering

Post on 23-Jan-2018


TRANSCRIPT

Page 1: GBIF Checklist bank and the backbone

GBIF Checklist Bank
Indexing & Backbone

Page 2: GBIF Checklist bank and the backbone

Checklist Scope

• 1,846 datasets registered
• 18 million name records
• Plazi (1,131), Pensoft (178), CoL GSDs (156)

Page 3: GBIF Checklist bank and the backbone

Denormalized Checklist

Page 4: GBIF Checklist bank and the backbone

Normalized Checklist

Page 5: GBIF Checklist bank and the backbone

Checklist Challenges

• Highly relational taxonomic data, almost all records linked in tree & basionym
  • Wrong or missing records destroy dataset integrity, not just a single record!
  • Different to flat, unrelated occurrence records

• Data Quality
  • broken referential integrity
  • bad names or placeholders (e.g. «Unallocated Family»)
  • missing or unused controlled vocabularies, e.g. «art» for rank species

• Name strings can be published in several ways
  • ScientificName
  • ScientificName + Authorship
  • Genus + SpecificEpithet + Rank + InfraspecificEpithet + Authorship

• Classifications can be published in several ways
  • Normalised via parentNameUsageID
  • Normalised via parentNameUsage
  • Denormalised via Kingdom, Phylum, Class, Order, Family, Genus
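The three ways of publishing a name string can be reduced to one full name string before any parsing happens. The following is a minimal sketch, not the Checklist Bank implementation; the record keys (`scientificName`, `authorship`, `genus`, `specificEpithet`, `rank`, `infraspecificEpithet`) and the function `build_name` are hypothetical and only loosely follow the labels on the slide.

```python
def build_name(rec: dict) -> str:
    """Assemble one full name string from whichever fields a record provides.

    The field names are assumptions made for this sketch.
    """
    author = (rec.get("authorship") or "").strip()
    name = (rec.get("scientificName") or "").strip()
    if not name:
        # Atomized form: genus + epithet [+ rank marker + infraspecific epithet]
        parts = [rec.get("genus"), rec.get("specificEpithet")]
        if rec.get("infraspecificEpithet"):
            parts += [rec.get("rank"), rec["infraspecificEpithet"]]
        name = " ".join(p.strip() for p in parts if p and p.strip())
    # Append the authorship only if the name does not already end with it
    if author and not name.endswith(author):
        name = f"{name} {author}"
    return name
```

A record such as `{"genus": "Agoseris", "specificEpithet": "apargioides", "rank": "var.", "infraspecificEpithet": "maritima", "authorship": "(E. Sheld.) Baird"}` and a record that carries the whole string in `scientificName` both end up in the same shape, so later steps only have to deal with one form.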

Page 6: GBIF Checklist bank and the backbone

Checklist Indexing

• Basic archive validation
  • unique ids

• Checklist Normalizer
  • resolve relations
  • create implicit taxa from denormalised classification
  • interpret controlled vocabularies, e.g. rank
  • match to backbone
  • match to previous version to keep GBIF ids stable

• Checklist Importer
  • inserts data into the Postgres database and the Solr index for searches

• Checklist Analyser
  • generate dataset metrics
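The normalizer step "create implicit taxa from denormalised classification" can be illustrated with a small sketch. It assumes each input row is a dict with optional higher-rank columns plus a `scientificName`; the function `normalize`, the key names and the sequential ids are hypothetical and chosen only for this example.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]

def normalize(rows):
    """Turn denormalised rows into parent-child taxa.

    Each higher taxon named in a column is created once (an "implicit taxon")
    and reused by every later row that mentions it under the same parent.
    """
    ids = {}   # (rank, name, parent_id) -> id
    out = []   # normalized taxa: id, name, rank, parent_id

    def get_or_create(rank, name, parent_id):
        key = (rank, name, parent_id)
        if key not in ids:
            ids[key] = len(out) + 1
            out.append({"id": ids[key], "name": name, "rank": rank, "parent_id": parent_id})
        return ids[key]

    for row in rows:
        parent = None
        for rank in RANKS:
            if row.get(rank):  # missing ranks are simply skipped
                parent = get_or_create(rank, row[rank], parent)
        get_or_create(row.get("rank", "species"), row["scientificName"], parent)
    return out
```

Two species rows that share `kingdom` and `family` but differ in `genus` produce six taxa in total: the kingdom and family once, then one genus and one species per row.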

Page 7: GBIF Checklist bank and the backbone

Organizing Occurrences

• GBIF needs a single, consistent taxonomy
  • for metrics, search, maps
  • considerable variation in higher taxa
  • synonymies can be very large

• Catalog of Life is largest single source
  • ~90% of GBIF occurrence records (thanks to birds)
  • ~50% of GBIF occurrence names (35% in 2010)

• GBIF needs to assemble a taxonomy
  • originally merged (noisy) names found in occurrences. Resulted in lots of duplicates
  • improved by stitching together checklist datasets

Cronquist classification
  Mimosaceae: 3,200 species
  Caesalpiniaceae: 2,000 species
  Fabaceae: 14,000 species

“Modern” classification
  Fabaceae: 19,200 species
    Mimosoideae: 3,200 species
    Caesalpinioideae: 2,000 species
    Faboideae: 14,000 species

Page 8: GBIF Checklist bank and the backbone

Current Backbone Issues

• Far too many accepted species (figures are accepted names, with synonyms in parentheses)
  • Cactaceae: GBIF 12,062 (342 syn), The Plant List (TPL) 2,233 (5,422 syn) + 5,500 unknown
  • Genus Weingartia: GBIF 129 (0 syn), TPL 8 (26 syn) + 68 unknown

• Many accepted names based on the same basionym
  • Sulcorebutia breviflora Backeb.
  • Weingartia breviflora (Backeb.) Hentzschel & K.Augustin

• No synonyms with different authors possible
  • Poa pubescens R.Br. synonym of Eragrostis pubescens (R.Br.) Steud.
  • Poa pubescens Lej. synonym of Poa pratensis L.
  • merged all names with exact same canonical name
  • list of known homonym genera (IRMNG) used to disambiguate between larger groups
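Why merging on the canonical name loses synonyms can be shown with the two Poa pubescens names from this slide. The sketch below is an illustration only; `index_names` and its tuple layout are hypothetical, and "first record wins" is an assumption about how a merge by key would behave.

```python
NAMES = [
    # (canonical name, author, accepted name it is a synonym of)
    ("Poa pubescens", "R.Br.", "Eragrostis pubescens (R.Br.) Steud."),
    ("Poa pubescens", "Lej.", "Poa pratensis L."),
]

def index_names(records, with_author):
    """Index synonyms either by canonical name alone or by canonical name + author."""
    index = {}
    for canonical, author, accepted in records:
        key = (canonical, author) if with_author else canonical
        # First record wins; a later record with the same key is silently dropped
        index.setdefault(key, accepted)
    return index
```

With `with_author=False` only one of the two homonyms survives, so one synonymy is lost; with `with_author=True` both are kept and point to their different accepted names.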

Page 9: GBIF Checklist bank and the backbone

Backbone Building

• Overlay ordered sources
  • Start with Catalog of Life
  • Primary source defines status
  • Create new name if kingdom, canonical name & authorship do not exist in current nub

• Ignore source name if …
  • not a major Linnean rank (infraspecific ranks are included)
  • higher ranks above family (configurable per source)
  • status conflicts with already existing status
  • hybrid formula, cultivar, candidatus or placeholder names !!!

[Diagram labels: Catalogue of Life, Fauna Europaea, GRIN, Mammal Species of the World (10s of taxonomic resources); Observations, Specimens, 8000 Species Lists (5M+ names in Primary Data Index); Merged; NUB; Match]
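The overlay rules on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual nub builder: the record keys, the `name_type` values and the function `build_nub` are hypothetical, the set of allowed ranks is a guess at "major Linnean ranks plus infraspecific ranks", and the per-source cut-off above family is left out.

```python
# Assumed rank and name-type vocabularies for this sketch only
ALLOWED_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus",
                 "species", "subspecies", "variety", "form"}
IGNORED_TYPES = {"hybrid formula", "cultivar", "candidatus", "placeholder"}

def build_nub(sources):
    """Overlay sources given in priority order, most trusted first."""
    nub = {}
    for records in sources:
        for rec in records:
            if rec["rank"] not in ALLOWED_RANKS:
                continue  # e.g. a tribe is ignored
            if rec.get("name_type") in IGNORED_TYPES:
                continue
            key = (rec["kingdom"], rec["canonical"], rec.get("authorship", ""))
            # An existing entry is never overwritten, so the first (primary)
            # source that supplies a name also defines its status
            nub.setdefault(key, rec)
    return nub
```

Calling `build_nub([col_records, other_records])` keeps every Catalogue of Life record that passes the filters and adds only those names from the second list whose kingdom, canonical name and authorship are not yet present.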

Page 10: GBIF Checklist bank and the backbone

Backbone Assembling

• Nub build starts with 8 kingdoms (plus incertae sedis)

Animalia
Archaea
Bacteria
Chromista
Fungi
Plantae
Protozoa
Viruses
incertae sedis

Page 11: GBIF Checklist bank and the backbone

Backbone Assembling

• Catalog of Life is added
  • Defines higher classification

Backbone:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.

Source:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.

Page 12: GBIF Checklist bank and the backbone

Backbone Assembling

• Missing genera are created
• Tribe is ignored

Backbone:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium
            Cichorium intybus L.

Source:
Asteraceae
  Cichorieae Lam & DC. [tribe]
    Cichorium intybus L.

Page 13: GBIF Checklist bank and the backbone

Backbone Assembling

• Synonyms respect authors
• Author match very loose
• Existing genus author updated

Backbone:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium Linneaus
            Cichorium intybus L.
              = C. balearicum Porta
              = C. byzantinum Clementi

Source:
Plantae
  Asteraceae
    Cichorium Linneaus
      Cichorium intybus Linneaus
        = Cichorium balearicum Porta
        = Cichorium byzantinum Clem.
        = Cichorium byzantinum Clementi
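A "very loose" author match, in which "L." and "Linneaus" or "Clem." and "Clementi" count as the same author, could look roughly like the sketch below. The actual comparison rules are not given on the slide, so everything here is an assumption: authors are compared token by token, a token matches if one is a prefix of the other, and a missing author is treated as compatible with any author. The function names are hypothetical.

```python
import re

def author_tokens(author):
    """Lower-case the author string and keep only its runs of letters."""
    return re.findall(r"[^\W\d_]+", author.lower())

def authors_match(a, b):
    """Loose comparison: abbreviations match the names they abbreviate."""
    ta, tb = author_tokens(a), author_tokens(b)
    if not ta or not tb:
        return True   # assumption: a missing author contradicts nothing
    if len(ta) != len(tb):
        return False
    # Each token must be a prefix of its counterpart, in either direction
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(ta, tb))
```

Under these rules `authors_match("Clem.", "Clementi")` and `authors_match("L.", "Linneaus")` hold, while `authors_match("R.Br.", "Lej.")` does not, so the two Cichorium byzantinum records collapse into one synonym while true homonyms stay apart. A prefix rule this loose would also equate unrelated authors that share initial letters, which is the price of looseness.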

Page 14: GBIF Checklist bank and the backbone

Backbone Assembling

• Prefer authors from nomenclators

Backbone:
Plantae
  Magnoliophyta
    Magnoliopsida
      Asterales
        Asteraceae
          Helianthus L.
            Helianthus annuus L.
          Cichorium L.
            Cichorium intybus L.
              = C. balearicum Porta
              = C. byzantinum Clem.

Source:
Asteraceae
  Cichorium L.
    Cichorium byzantinum Clem.

Page 15: GBIF Checklist bank and the backbone

Backbone Assembling

• Infraspecifics are included

Backbone:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
      A. a. var. maritima (E. Sheld.) Baird
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.

Source:
Asteraceae
  Agoseris apargioides (Less.) Greene
    = A. maritima Eastw.
    A. a. var. eastwoodiae (Fedde) Munz
    A. a. var. maritima (E. Sheld.) Baird

Page 16: GBIF Checklist bank and the backbone

Backbone Assembling

• Other source treats them as species
• Same canonical maritima allowed twice - author different

Backbone:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
      A. a. var. maritima (E. Sheld.) Baird
    Agoseris eastwoodiae Fedde
    Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.

Source:
Asteraceae
  Agoseris eastwoodiae Fedde
  Agoseris maritima E. Sheld.

Page 17: GBIF Checklist bank and the backbone

Final Cleanup - Basionyms

• Finally basionyms are detected
  • by terminal epithet & author within a family
• Only 1 accepted per group
  • the most trusted first stays

Backbone:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. eastwoodiae (Fedde) Munz
        = Agoseris eastwoodiae Fedde
      A. a. var. maritima (E. Sheld.) Baird
        = Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
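The basionym cleanup can be sketched as a grouping step followed by a demotion step. The sketch assumes names are already parsed into dicts with hypothetical keys (`family`, `epithet` for the terminal epithet, `bracket_author` for the author in parentheses, `author`, `status`) and that the input list is ordered most trusted first; it is an illustration, not the backbone code.

```python
from collections import defaultdict

def basionym_groups(names):
    """Group names that share family, terminal epithet and original author."""
    groups = defaultdict(list)
    for n in names:
        # A recombination carries the original author in parentheses;
        # the basionym itself carries it as its plain author
        original_author = n["bracket_author"] or n["author"]
        groups[(n["family"], n["epithet"], original_author)].append(n)
    return groups

def consolidate(names):
    """Keep the first accepted name per group and turn the others into synonyms."""
    for group in basionym_groups(names).values():
        accepted = [n for n in group if n["status"] == "accepted"]
        for n in accepted[1:]:
            n["status"] = "synonym"
            n["accepted"] = accepted[0]["name"]
    return names
```

Given "A. a. var. eastwoodiae (Fedde) Munz" ahead of "Agoseris eastwoodiae Fedde", both fall into the group (Asteraceae, eastwoodiae, Fedde) and the second becomes a synonym of the first, which matches the tree on this slide. "A. maritima Eastw." stays in its own group because its author differs.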

Page 18: GBIF Checklist bank and the backbone

Final Cleanup - Autonyms

• Create missing autonyms

Backbone:
Asteraceae
  Helianthus L.
    Helianthus annuus L.
  Agoseris
    Agoseris apargioides (Less.) Greene
      = A. maritima Eastw.
      A. a. var. apargioides
      A. a. var. eastwoodiae (Fedde) Munz
        = Agoseris eastwoodiae Fedde
      A. a. var. maritima (E. Sheld.) Baird
        = Agoseris maritima E. Sheld.
  Cichorium L.
    Cichorium intybus L.
      = C. balearicum Porta
      = C. byzantinum Clem.
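Creating a missing autonym amounts to repeating the species epithet at the infraspecific rank whenever a species has infraspecific taxa but no autonym yet. A minimal sketch, assuming infraspecific names are available as already-parsed tuples; the tuple layout and the function `missing_autonyms` are hypothetical.

```python
def missing_autonyms(infraspecifics):
    """Return autonyms that have to be created.

    infraspecifics: list of (genus, species_epithet, rank_marker, infra_epithet)
    """
    existing = set(infraspecifics)
    created = []
    for genus, species, marker, _ in infraspecifics:
        # The autonym repeats the species epithet and carries no author
        autonym = (genus, species, marker, species)
        if autonym not in existing:
            existing.add(autonym)  # avoid creating the same autonym twice
            created.append(" ".join(autonym))
    return created
```

For the two varieties of Agoseris apargioides on this slide the function proposes a single new name, "Agoseris apargioides var. apargioides", which is the entry that appears in the tree as "A. a. var. apargioides".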

Page 19: GBIF Checklist bank and the backbone

Backbone Building Rules

• Create missing genus or species in classification
  • only for accepted taxa

• Create missing autonyms for infraspecific taxa

• Detect basionyms based on terminal epithet & authorship
  • Assumes epithet & authorship in family is unique
  • Converts all but one accepted to synonyms

• Flag taxa as doubtful
  • genus or higher taxon without any species (IRMNG)
  • species (or infrasp.) with a parent genus (or species) considered to be a synonym
    • moved to newly accepted genus (or species)
    • the case for potential children of synonymised basionym combination

Page 20: GBIF Checklist bank and the backbone

Backbone Sources

• GBIF Backbone Patch

• Catalogue of Life

• World Register of Marine Species

• Dyntaxa - Svensk taxonomisk databas

• GRIN Taxonomy

• Fauna Europaea

• Integrated Taxonomic Information System

• Euro+Med Plantbase

• Interim Register of Marine and Nonmarine Genera

• The Clements Checklist

• IOC World Bird Names

• Mammal Species of the World

• Paleobiology Database

• Nomenclators

• International Plant Names Index

• Index Fungorum

• ZooBank

• Prokaryotic Nomenclature Up-to-date

• ICTV Master Species List

• Organisations

• Species Files

• Biodiversity Data Journal (Pensoft)

• ZooKeys (Pensoft)

• PhytoKeys (Pensoft)

• Plazi ???

Page 21: GBIF Checklist bank and the backbone

Backbone Matching

• Occurrence
  • fuzzy name match
  • classification match
  • allow higher rank matches

• Checklist
  • match kingdom
  • require straight canonical match
  • incl. authorship comparison
  • no webservice yet, only embedded
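The difference between the two matching modes can be illustrated with a toy backbone. This sketch does not reproduce GBIF's matching algorithm: the fuzzy match uses Python's `difflib` as a stand-in, the 0.85 cutoff is an arbitrary assumption, the classification match is omitted, and `BACKBONE`, `match_occurrence` and `match_checklist` are hypothetical names.

```python
import difflib

# Toy backbone: (kingdom, canonical name) -> authorship
BACKBONE = {
    ("Plantae", "Helianthus annuus"): "L.",
    ("Plantae", "Cichorium intybus"): "L.",
}

def match_occurrence(name, cutoff=0.85):
    """Lenient match: tolerate misspellings, else fall back to the genus."""
    candidates = [canonical for _, canonical in BACKBONE]
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    if hits:
        return hits[0]
    # Higher rank match: accept the genus if the backbone knows it
    genus = name.split()[0]
    return genus if any(c.split()[0] == genus for c in candidates) else None

def match_checklist(kingdom, canonical, author):
    """Strict match: same kingdom, exact canonical name, same authorship."""
    known_author = BACKBONE.get((kingdom, canonical))
    if known_author is None or known_author != author:
        return None
    return canonical
```

A misspelled occurrence name such as "Helianthus anuus" still resolves to "Helianthus annuus", and an unknown species in a known genus resolves to the genus, whereas the checklist matcher rejects both and only accepts an exact hit within the right kingdom.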

Page 22: GBIF Checklist bank and the backbone

Schema

• Checklists & Nub same structure
• Parent-child hierarchy
  • normalized classification
  • flexible ranks
• synonym to accepted relation
• Dataset metrics as timeseries
• Basionym relation

[Diagram labels: NameUsage, Parsed Name, Backbone Match, Citation, Dataset Metrics, Verbatim Record, Metrics, Extensions]

Page 23: GBIF Checklist bank and the backbone

CLB Supported Extensions

• Description: human paragraphs about some topic
• Distribution: area ranges with statuses
• Identifier: additional identifier for the record
• Multimedia: image, video, sound
• Literature references: bibliography
• Occurrence (indexed via occurrence workflows)
• Species Profile: extinct, marine, freshwater, terrestrial flags
• Types and specimens: (overlaps with Occurrence)
• Vernacular names: name with language & region

http://rs.gbif.org/extension/gbif/1.0/

Page 24: GBIF Checklist bank and the backbone

Normalizing Classifications