Post on 28-Jan-2016
A generic data import layer for the Berlin Taxonomic Information Model
Anton Güntsch, Andreas Müller & Walter G. Berendsohn
Botanic Garden and Botanical Museum Berlin-Dahlem
Dept. of Biodiversity Informatics and Laboratories
A. Güntsch: A generic data import layer for the Berlin Model
The Berlin Taxonomic Information Model
[Diagram: Name, Concept, Reference, "Factual Data", Relation]
• Concepts as name-reference pairs
• Explicit representation of relations between concepts
• Mechanisms for calculating factual data
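The first two bullets can be illustrated with a minimal Python sketch. It is not the Berlin Model schema itself: the class names, the reference citations "Checklist A" and "Checklist B", and the relation label are hypothetical and only show how the same name used in two references yields two distinct concepts that can then be related explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    full_name: str


@dataclass(frozen=True)
class Reference:
    citation: str


@dataclass(frozen=True)
class Concept:
    """A concept is a name as used in one particular reference."""
    name: Name
    reference: Reference


@dataclass(frozen=True)
class Relation:
    """An explicit, typed relation between two concepts."""
    source: Concept
    target: Concept
    rel_type: str


name = Name("Acrodon bellidiflorus")
# Hypothetical references: the same name used in two different works.
c1 = Concept(name, Reference("Checklist A"))
c2 = Concept(name, Reference("Checklist B"))
# The relation label is an illustrative assumption, not a Berlin Model term.
rel = Relation(c1, c2, "is congruent to")
```

Because a concept is identified by the pair rather than by the name alone, `c1` and `c2` compare as different objects even though they share one name.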
Berlin Model used by
• Euro+Med
• Med-Checklist
• IOPI Species Plantarum Initiative
• Algaterra
• Dendroflora of El Salvador
• German Standard List of Vascular Plants and Ferns
• Reference List of the German Mosses
• EDIT WP6
Data imports (1)
Heterogeneous sources (e.g. text files, printer-formatted data, spreadsheets, databases)
Complex target model
Imports consume a substantial fraction of project costs, and this fraction is often considerably underestimated.
Data imports (2)
• Analyse source
• Identify semantic units
• Transform into appropriate processable format
• Parse to format close to target model
• Duplicate detection and import
• Testing
Needs a great deal of human input
Can be automated
Step-by-step transformation of taxonomic information: preparation
[Diagram: XML Source → XML Soft Schema → XML Strict Schema → Target Berlin Model Database. Phase I: largely not automatable; Phase II: largely automated; Phase III: fully automated; with feedback.]
• Identify patterns
• Communicate problems
• Export to simple XML
Step-by-step transformation of taxonomic information: preparation
<Aizoaceae xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID>
      <Genus>Acrodon</Genus>
      <Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
      <SubSpeciesEpi>v</SubSpeciesEpi>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon bellidiflorus</SpeciesName>
    </Taxon>
    <Taxon>
      <ID>8566</ID>
      <Genus>Acrodon</Genus>
      <Epithet>subulatus</Epithet>
      <AllAuthorsString>(Miller) N.E.Br.</AllAuthorsString>
      <AllAuthorsStringSubSpecies/>
      <SpeciesName>Acrodon subulatus</SpeciesName>
    </Taxon>
  </AcceptedTaxa>
  <SynonymTaxa> […] </SynonymTaxa>
</Aizoaceae>
Step-by-step transformation of taxonomic information: phase I
• Transform into soft schema XML
• Re-arrange, lump and split elements
• Don't check "taxonomic integrity"
• Tools: XSLT, Taxonomic Transformation Library (TTL), and others
Step-by-step transformation of taxonomic information: phase I
<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/s0.7"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.bgbm.org/schemas/BMI/s0.7 P:\XMLSchema\ImportSchicht\BMISoft0.7.xsd">
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategory>database</RefCategory>
    <RefString>Aizoaceae</RefString>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <Rank>species</Rank>
        <GenusEpi>Acrodon</GenusEpi>
        <SpeciesEpi>bellidiflorus</SpeciesEpi>
        <AllAuthors>N.E.Br.</AllAuthors>
      </TaxonName>
      <TaxonStatus>Accepted</TaxonStatus>
      <IdInSource>7814</IdInSource>
      <RelatedTaxon ref="21" relType="basionym"/>
    </PTaxon>
    […]
  </PotentialTaxa>
</BMIDataSource>
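The re-arranging step of phase I could be sketched as follows. The slides name XSLT and TTL as the actual tools; this Python version with `xml.etree.ElementTree` is a stand-in for illustration only. The function name `to_soft_schema` is hypothetical, the namespace and metadata are omitted for brevity, and the rank is hard-coded to "species", which is an assumption that holds only for records without a subspecies epithet.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the simple source XML from the preparation step.
SOURCE = """
<Aizoaceae>
  <AcceptedTaxa>
    <Taxon>
      <ID>7814</ID>
      <Genus>Acrodon</Genus>
      <Epithet>bellidiflorus</Epithet>
      <AllAuthorsString>N.E.Br.</AllAuthorsString>
    </Taxon>
  </AcceptedTaxa>
</Aizoaceae>
""".strip()


def to_soft_schema(source_xml: str) -> ET.Element:
    """Re-arrange accepted taxa into soft-schema-like PTaxon elements."""
    src = ET.fromstring(source_xml)
    root = ET.Element("BMIDataSource")  # namespace left out in this sketch
    potential = ET.SubElement(root, "PotentialTaxa")
    for taxon in src.iterfind("./AcceptedTaxa/Taxon"):
        ptaxon = ET.SubElement(potential, "PTaxon")
        tname = ET.SubElement(ptaxon, "TaxonName")
        # Assumption: every record handled here is at species rank.
        ET.SubElement(tname, "Rank").text = "species"
        ET.SubElement(tname, "GenusEpi").text = taxon.findtext("Genus")
        ET.SubElement(tname, "SpeciesEpi").text = taxon.findtext("Epithet")
        ET.SubElement(tname, "AllAuthors").text = taxon.findtext("AllAuthorsString")
        # Taxa under AcceptedTaxa get the status "Accepted".
        ET.SubElement(ptaxon, "TaxonStatus").text = "Accepted"
        ET.SubElement(ptaxon, "IdInSource").text = taxon.findtext("ID")
    return root


soft = to_soft_schema(SOURCE)
print(ET.tostring(soft, encoding="unicode"))
```

No integrity checks are made at this stage, in line with the phase I bullets: values are copied and renamed, nothing more.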
Step-by-step transformation of taxonomic information: phase II
• Transform into strict schema XML
• Check data integrity
• Report malformed data
• Tool: TTL
Step-by-step transformation of taxonomic information: phase II
<BMIDataSource xmlns="http://www.bgbm.org/schemas/BMI/0.7" […]>
  <MetaData> […] </MetaData>
  <ConceptReference>
    <RefCategoryAbbrev>BK</RefCategoryAbbrev>
    <RefString>refString</RefString>
    <DatabaseID>4</DatabaseID>
  </ConceptReference>
  <PotentialTaxa>
    <PTaxon>
      <TaxonName>
        <SpeciesName>
          <GenusEpi>Acrodon</GenusEpi>
          <SpeciesEpi>bellidiflorus</SpeciesEpi>
          <AuthorTeam>
            <AuthorTeamCache>N.E.Br.</AuthorTeamCache>
          </AuthorTeam>
        </SpeciesName>
      </TaxonName>
      <TaxonStatusAbbrev>A</TaxonStatusAbbrev>
      <IdInSource>7814</IdInSource>
      <RelatedTaxa> […] </RelatedTaxa>
    </PTaxon>
  </PotentialTaxa>
</BMIDataSource>
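The "check data integrity" and "report malformed data" steps of phase II might look roughly like the sketch below. It is not TTL: the function `check_ptaxon`, the rule set, and the record with ID 99 are hypothetical. Only the mapping of "Accepted" to "A" is taken from the slides; the "Synonym" entry is an assumption.

```python
import xml.etree.ElementTree as ET

# Only "Accepted" -> "A" appears on the slides; "Synonym" -> "S" is assumed.
STATUS_ABBREV = {"Accepted": "A", "Synonym": "S"}

# Two soft-schema-like records; the second one is deliberately malformed.
SOFT = """
<PotentialTaxa>
  <PTaxon>
    <TaxonName><GenusEpi>Acrodon</GenusEpi><SpeciesEpi>bellidiflorus</SpeciesEpi></TaxonName>
    <TaxonStatus>Accepted</TaxonStatus>
    <IdInSource>7814</IdInSource>
  </PTaxon>
  <PTaxon>
    <TaxonName><GenusEpi>Acrodon</GenusEpi><SpeciesEpi/></TaxonName>
    <TaxonStatus>Doubtful</TaxonStatus>
    <IdInSource>99</IdInSource>
  </PTaxon>
</PotentialTaxa>
""".strip()


def check_ptaxon(ptaxon: ET.Element) -> list[str]:
    """Return human-readable problems found in one PTaxon element."""
    problems = []
    source_id = ptaxon.findtext("IdInSource") or "?"
    # Required name parts must be present and non-empty.
    for field in ("GenusEpi", "SpeciesEpi"):
        if not (ptaxon.findtext(f"TaxonName/{field}") or "").strip():
            problems.append(f"taxon {source_id}: missing {field}")
    # The status must be one that can be abbreviated for the strict schema.
    status = ptaxon.findtext("TaxonStatus")
    if status not in STATUS_ABBREV:
        problems.append(f"taxon {source_id}: unknown status {status!r}")
    return problems


root = ET.fromstring(SOFT)
report = [p for pt in root.iterfind("PTaxon") for p in check_ptaxon(pt)]
for line in report:
    print(line)
```

Collecting problems into a report instead of stopping at the first error matches the feedback loop in the workflow: the list goes back to whoever prepared the source data.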
Step-by-step transformation of taxonomic information: phase III
• Import into database
• Duplicate detection and resolution
• No user interaction required
• Tools: Berlin Model Object Layer (BMOL)
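Duplicate detection during import can be sketched with an in-memory store. This is not BMOL code: the class `NameStore`, its matching rule (whitespace-normalised, case-insensitive comparison of genus, epithet and authors) and the integer keys are assumptions chosen to keep the example small.

```python
class NameStore:
    """Toy stand-in for the target database, keyed by normalised name parts."""

    def __init__(self):
        self._ids = {}
        self._next_id = 1

    @staticmethod
    def _key(genus, epithet, authors):
        # Collapse runs of whitespace and ignore case before comparing.
        return tuple(
            " ".join(part.split()).casefold()
            for part in (genus, epithet, authors)
        )

    def import_name(self, genus, epithet, authors):
        """Return (internal_id, created); created is False for duplicates."""
        key = self._key(genus, epithet, authors)
        if key in self._ids:
            return self._ids[key], False
        new_id = self._next_id
        self._ids[key] = new_id
        self._next_id += 1
        return new_id, True


store = NameStore()
first = store.import_name("Acrodon", "bellidiflorus", "N.E.Br.")
# Same name with stray whitespace: resolved to the existing record.
again = store.import_name("Acrodon ", "bellidiflorus", "N.E.Br.")
other = store.import_name("Acrodon", "subulatus", "(Miller) N.E.Br.")
```

Resolving a duplicate to the existing key without asking anyone is what allows this phase to run with no user interaction.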
Berlin Model Object Layer (BMOL)
• Hides the database key system
• Duplicate detection
• Core-Module provides objects corresponding to database entities
• Mapper-Module interfaces with database
• Persistence-Module manages data flow between core-module and mapper-module
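The division of labour between the three modules can be pictured with the following sketch. None of the names come from BMOL: `TaxonName`, `Mapper`, `InMemoryMapper` and `Persistence` are hypothetical, and the in-memory mapper stands in for a real database backend.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TaxonName:
    """Core module: a plain object corresponding to a database entity."""
    genus: str
    epithet: str
    authors: str


class Mapper(Protocol):
    """Mapper module: the only layer that talks to a concrete database."""

    def insert(self, name: TaxonName) -> int: ...


class InMemoryMapper:
    """Hypothetical mapper that stores rows in a dict instead of a database."""

    def __init__(self):
        self.rows = {}

    def insert(self, name: TaxonName) -> int:
        key = len(self.rows) + 1
        self.rows[key] = (name.genus, name.epithet, name.authors)
        return key


class Persistence:
    """Persistence module: moves core objects to the mapper, hides the keys."""

    def __init__(self, mapper: Mapper):
        self._mapper = mapper
        self._keys = {}  # core object -> database key, never exposed

    def save(self, name: TaxonName) -> None:
        # Saving the same object twice does not create a second row.
        if name not in self._keys:
            self._keys[name] = self._mapper.insert(name)

    def is_saved(self, name: TaxonName) -> bool:
        return name in self._keys


mapper = InMemoryMapper()
persistence = Persistence(mapper)
name = TaxonName("Acrodon", "bellidiflorus", "N.E.Br.")
persistence.save(name)
persistence.save(name)
```

With this split, supporting another database would only require a new class with the same `insert` method; the core objects and the persistence logic stay untouched.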
Outlook
• Method has been successfully tested for import of Med Checklist I, II & IV
• Further imports planned for 2006
• Programming of additional mapper modules desirable
www.bgbm.org/biodivinf/