A generic data import layer for the Berlin Taxonomic Information Model

Post on 28-Jan-2016

A generic data import layer for the Berlin Taxonomic Information Model

Anton Güntsch, Andreas Müller & Walter G. Berendsohn
Botanic Garden and Botanical Museum Berlin-Dahlem
Dept. of Biodiversity Informatics and Laboratories

A. Güntsch: A generic data import layer for the Berlin Model

The Berlin Taxonomic Information Model

[Diagram: Name, Concept, Reference, Relation, „Factual Data“]

• Concepts as name-reference pairs

• Explicit representation of relations between concepts

• Mechanisms for calculating factual data
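The first two bullets can be illustrated with a small data-structure sketch. This is not the Berlin Model schema itself: the class and field names, the relation type string and the second reference ("Checklist B") are hypothetical and chosen only to show a concept as a name-reference pair with an explicit relation between two concepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A scientific name as a plain string, carrying no taxonomic opinion."""
    full_name: str


@dataclass(frozen=True)
class Reference:
    """The source (publication, database, ...) in which a name is used."""
    title: str


@dataclass(frozen=True)
class Concept:
    """A concept is a name as used in one particular reference."""
    name: Name
    reference: Reference


@dataclass(frozen=True)
class Relation:
    """An explicit, typed link between two concepts."""
    source: Concept
    target: Concept
    rel_type: str  # hypothetical vocabulary, e.g. "is congruent to"


def same_name_different_concept(a: Concept, b: Concept) -> bool:
    """True if both concepts share a name but come from different references."""
    return a.name == b.name and a.reference != b.reference


name = Name("Acrodon bellidiflorus")
in_aizoaceae = Concept(name, Reference("Aizoaceae"))    # reference string from the slides
in_checklist = Concept(name, Reference("Checklist B"))  # hypothetical second source
link = Relation(in_aizoaceae, in_checklist, "is congruent to")
```

Keeping the reference inside the concept means two sources can use the same name without being forced to mean the same taxon; whether they agree is stated by a relation rather than assumed.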

Berlin Model used by

• Euro+Med
• Med-Checklist
• IOPI Species Plantarum Initiative
• Algaterra
• Dendroflora of El Salvador
• German Standard List of Vascular Plants and Ferns
• Reference List of the German Mosses
• EDIT WP6

Data imports (1)

Heterogeneous sources (e.g. text files, printer-formatted data, spreadsheets, databases)

Complex target model

Imports consume a substantial fraction of project costs, and that fraction is often substantially underestimated.

Data imports (2)

• Analyse source
• Identify semantic units
• Transform into appropriate processable format
• Parse to format close to target model
• Duplicate detection and import
• Testing

Some of these steps need a great deal of human input; others can be automated.

Step-by-step transformation of taxonomic information: preparation

[Diagram: Source → XML → XML Soft Schema → XML Strict Schema → Target Berlin Model Database, in Phase I, Phase II and Phase III. Labels: largely not automatable, largely automated, fully automated; feedback.]

• Identify patterns

• Communicate problems

• Export to simple XML

Step-by-step transformation of taxonomic information: preparation

<Aizoaceae xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID>
      <Genus>Acrodon</Genus>
      <Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
      <SubSpeciesEpi>v</SubSpeciesEpi>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon bellidiflorus</SpeciesName>
    </Taxon>
    <Taxon>
      <ID>8566</ID>
      <Genus>Acrodon</Genus>
      <Epithet>subulatus</Epithet>
      <AllAuthorsString>(Miller) N.E.Br.</AllAuthorsString>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon subulatus</SpeciesName>
    </Taxon>
  </AcceptedTaxa>
  <SynonymTaxa> […] </SynonymTaxa>
</Aizoaceae>
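Once the source is available as simple XML like the export above, a short script can help with the "identify patterns" step. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function names `read_taxa` and `field_usage` are made up for this illustration, and the elided `SynonymTaxa` content is left out rather than guessed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The two accepted taxa from the slide; the elided SynonymTaxa part is omitted.
SIMPLE_XML = """<Aizoaceae xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID><Genus>Acrodon</Genus><Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
      <SubSpeciesEpi>v</SubSpeciesEpi><AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon bellidiflorus</SpeciesName>
    </Taxon>
    <Taxon>
      <ID>8566</ID><Genus>Acrodon</Genus><Epithet>subulatus</Epithet>
      <AllAuthorsString>(Miller) N.E.Br.</AllAuthorsString>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon subulatus</SpeciesName>
    </Taxon>
  </AcceptedTaxa>
</Aizoaceae>"""


def read_taxa(xml_text, group="AcceptedTaxa"):
    """Return one dict per <Taxon> in the given group, mapping tag to text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in taxon}
        for taxon in root.findall(f"./{group}/Taxon")
    ]


def field_usage(taxa):
    """Count in how many records each field is actually filled."""
    usage = Counter()
    for taxon in taxa:
        usage.update(tag for tag, text in taxon.items() if text)
    return usage


taxa = read_taxa(SIMPLE_XML)
usage = field_usage(taxa)
```

A field that is filled in only a few records, such as `SubSpeciesEpi` here, is the kind of pattern that would be taken back to the data owner before the mapping is written.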

Step-by-step transformation of taxonomic information: phase I

• Transform into soft schema XML

• Re-arrange, lump and split elements

• Don't check "taxonomic integrity"

• Tools: XSLT, Taxonomic Transformation Library (TTL), and others

Step-by-step transformation of taxonomic information: phase I

<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/s0.7"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.bgbm.org/schemas/BMI/s0.7 P:\XMLSchema\ImportSchicht\BMISoft0.7.xsd">
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategory>database</RefCategory>
    <RefString>Aizoaceae</RefString>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <Rank>species</Rank>
        <GenusEpi>Acrodon</GenusEpi>
        <SpeciesEpi>bellidiflorus</SpeciesEpi>
        <AllAuthors>N.E.Br.</AllAuthors>
      </TaxonName>
      <TaxonStatus>Accepted</TaxonStatus>
      <IdInSource>7814</IdInSource>
      <RelatedTaxon ref="21" relType="basionym"/>
    </PTaxon>
    […]
  </PotentialTaxa>
</BMIDataSource>
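The slides name XSLT and the TTL as the Phase I tools. The sketch below uses Python's `xml.etree.ElementTree` instead, purely as a stand-in, to show the kind of re-arranging involved when one source `<Taxon>` becomes one soft-schema `<PTaxon>`. The element names and the namespace are taken from the two XML samples; the helper names are hypothetical, and treating every record as rank "species" is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET

BMI_SOFT_NS = "http://www.bgbm.org/schemas/BMI/s0.7"  # namespace shown on the slide


def q(tag):
    """Qualify a tag name with the soft-schema namespace."""
    return f"{{{BMI_SOFT_NS}}}{tag}"


def add_text(parent, tag, text):
    """Append a namespaced child element holding the given text."""
    element = ET.SubElement(parent, q(tag))
    element.text = text
    return element


def taxon_to_ptaxon(taxon, status):
    """Map a source <Taxon> to a soft-schema <PTaxon>; no integrity checks here."""
    ptaxon = ET.Element(q("PTaxon"))
    name = ET.SubElement(ptaxon, q("TaxonName"))
    add_text(name, "Rank", "species")  # assumption: every record is a species
    add_text(name, "GenusEpi", taxon.findtext("Genus", default=""))
    add_text(name, "SpeciesEpi", taxon.findtext("Epithet", default=""))
    add_text(name, "AllAuthors", taxon.findtext("AllAuthorsString", default=""))
    add_text(ptaxon, "TaxonStatus", status)
    add_text(ptaxon, "IdInSource", taxon.findtext("ID", default=""))
    return ptaxon


source = ET.fromstring(
    "<Taxon><ID>7814</ID><Genus>Acrodon</Genus><Epithet>bellidiflorus</Epithet>"
    "<AllAuthorsString>N.E.Br.</AllAuthorsString></Taxon>"
)
# Records under <AcceptedTaxa> in the source get the status "Accepted".
ptaxon = taxon_to_ptaxon(source, "Accepted")
```

Missing values are passed through as empty strings rather than rejected, in line with the Phase I rule that taxonomic integrity is not checked at this stage.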

Step-by-step transformation of taxonomic information: phase II

• Transform into strict schema XML

• Check data integrity

• Report malformed data

• Tool: TTL

Step-by-step transformation of taxonomic information: phase II

<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/0.7" […]>
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategoryAbbrev>BK</RefCategoryAbbrev>
    <RefString>refString</RefString>
    <DatabaseID>4</DatabaseID>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <SpeciesName>
          <GenusEpi>Acrodon</GenusEpi>
          <SpeciesEpi>bellidiflorus</SpeciesEpi>
          <AuthorTeam>
            <AuthorTeamCache>N.E.Br.</AuthorTeamCache>
          </AuthorTeam>
        </SpeciesName>
      </TaxonName>
      <TaxonStatusAbbrev>A</TaxonStatusAbbrev>
      <IdInSource>7814</IdInSource>
      <RelatedTaxa> […] </RelatedTaxa>
    </PTaxon>
  </PotentialTaxa>
</BMIDataSource>
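Phase II checks integrity and reports malformed data while converting to the strict format. The sketch below shows one way such a step could look; it is not the TTL. Records are plain dicts for brevity, the set of required fields is an assumption, and only the status mapping "Accepted" to "A" is taken from the samples (other status codes are not shown on the slides and are therefore left out). The second record has its epithet blanked on purpose to trigger a report entry.

```python
# "Accepted" -> "A" is the only status mapping visible in the samples.
STATUS_ABBREV = {"Accepted": "A"}
REQUIRED_FIELDS = ("GenusEpi", "SpeciesEpi", "IdInSource")  # assumed minimum


def check_record(record):
    """Return a list of problems found in one soft-schema record."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS
        if not record.get(field, "").strip()
    ]
    status = record.get("TaxonStatus", "")
    if status not in STATUS_ABBREV:
        problems.append(f"unknown TaxonStatus {status!r}")
    return problems


def to_strict(records):
    """Convert valid records to strict form; collect the rest in a report."""
    strict, report = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            report.append((record.get("IdInSource", ""), problems))
        else:
            strict.append({
                "GenusEpi": record["GenusEpi"],
                "SpeciesEpi": record["SpeciesEpi"],
                "AuthorTeamCache": record.get("AllAuthors", ""),
                "TaxonStatusAbbrev": STATUS_ABBREV[record["TaxonStatus"]],
                "IdInSource": record["IdInSource"],
            })
    return strict, report


soft_records = [
    {"GenusEpi": "Acrodon", "SpeciesEpi": "bellidiflorus",
     "AllAuthors": "N.E.Br.", "TaxonStatus": "Accepted", "IdInSource": "7814"},
    # Epithet deliberately blanked for this illustration.
    {"GenusEpi": "Acrodon", "SpeciesEpi": "",
     "AllAuthors": "(Miller) N.E.Br.", "TaxonStatus": "Accepted", "IdInSource": "8566"},
]
strict_records, malformed = to_strict(soft_records)
```

Rejected records are returned with their source ID so that the report can be fed back to whoever maintains the source data.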

Step-by-step transformation of taxonomic information: phase III

• Import into database

• Duplicate detection and resolution

• No user interaction required

• Tool: Berlin Model Object Layer (BMOL)
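The slides do not describe how duplicates are detected, so the following is only a sketch of one common approach: compare records on a normalised key and reuse the existing entry when the key is already known. The class `NameStore`, the normalisation rule (ignore case and whitespace) and the in-memory dict standing in for the database are all assumptions.

```python
def name_key(genus, epithet, authors):
    """Build a comparison key that ignores case and all whitespace."""
    def norm(text):
        return "".join(text.split()).lower()
    return (norm(genus), norm(epithet), norm(authors))


class NameStore:
    """In-memory stand-in for the target database (hypothetical)."""

    def __init__(self):
        self._ids = {}      # normalised key -> internal id
        self._next_id = 1

    def import_name(self, genus, epithet, authors):
        """Return (id, created); an existing id is reused for duplicates."""
        key = name_key(genus, epithet, authors)
        if key in self._ids:
            return self._ids[key], False
        new_id = self._next_id
        self._ids[key] = new_id
        self._next_id += 1
        return new_id, True


store = NameStore()
first = store.import_name("Acrodon", "bellidiflorus", "N.E.Br.")
again = store.import_name("Acrodon", "bellidiflorus", "N.E. Br.")  # spacing differs
other = store.import_name("Acrodon", "subulatus", "(Miller) N.E.Br.")
```

Because the decision is made from the key alone, the import can run without user interaction; how strict the normalisation should be is a project decision.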

Berlin Model Object Layer (BMOL)

• Hides the database key system
• Duplicate detection
• Core module provides objects corresponding to database entities
• Mapper module interfaces with database
• Persistence module manages data flow between core module and mapper module
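The division of labour between the three modules can be sketched as follows. This is not the BMOL API, which the slides do not show: the class names, the single `name` table and the use of an in-memory SQLite database are assumptions made to keep the example self-contained.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonName:
    """Core module: a plain object for a database entity, without any key."""
    genus: str
    epithet: str
    authors: str


class NameMapper:
    """Mapper module: the only place that knows SQL and primary keys."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS name ("
            "id INTEGER PRIMARY KEY, genus TEXT, epithet TEXT, authors TEXT)"
        )

    def find_key(self, name):
        row = self.conn.execute(
            "SELECT id FROM name WHERE genus = ? AND epithet = ? AND authors = ?",
            (name.genus, name.epithet, name.authors),
        ).fetchone()
        return row[0] if row else None

    def insert(self, name):
        cursor = self.conn.execute(
            "INSERT INTO name (genus, epithet, authors) VALUES (?, ?, ?)",
            (name.genus, name.epithet, name.authors),
        )
        return cursor.lastrowid

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM name").fetchone()[0]


class Persistence:
    """Persistence module: moves objects between core and mapper."""

    def __init__(self, mapper):
        self.mapper = mapper

    def save(self, name):
        """Store the object unless an identical row exists; no key is returned."""
        if self.mapper.find_key(name) is None:
            self.mapper.insert(name)


mapper = NameMapper(sqlite3.connect(":memory:"))
persistence = Persistence(mapper)
persistence.save(TaxonName("Acrodon", "bellidiflorus", "N.E.Br."))
persistence.save(TaxonName("Acrodon", "bellidiflorus", "N.E.Br."))  # duplicate
persistence.save(TaxonName("Acrodon", "subulatus", "(Miller) N.E.Br."))
```

Import code written against `Persistence` never sees a primary key, so supporting another database would mean writing another mapper rather than changing the import logic.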

Outlook

• Method has been successfully tested for import of Med-Checklist I, II & IV
• Further imports planned for 2006
• Programming of additional mapper modules desirable

www.bgbm.org/biodivinf/
