Literature & interoperability: a working example using ants
Donat Agosti, Terry Catapano, Guido Sautter, Christiana Klingenberg & Christie Stephenson
TDWG 2007, Bratislava
Participating organization
Main support by US-NSF, German DFG
Biodiversity monitoring, or what's out there?
Measuring and monitoring biodiversity means standard, repetitive samples. Access to taxonomic data is the main impediment to running successful surveys and to integrating surveys into mainstream conservation, potentially one of the biggest users of taxonomic data.
The question is: what is the fastest way to provide this content? What is doable, and what is not?
http://www.blsalptransit.ch/en/frameset_e.htm
A report from a breakthrough in a long tunnel....
For the first time, the entire production chain is in place: OCR-ing, marking up, and adding all the GUIDs to produce a valid TaxonX document.
We can provide a stable body of encoded data and metadata which other applications can utilize (e.g. semant/iSpecies).
Plazi.org
• Sandbox and data provider
• The principle: community involvement
• Develop tools and solutions to access literature, both retrospective and prospective
• Make content available through exporting data into dedicated databases
• Provide an example of an input facility for ZooBank
• Get around copyright by focusing on content by marking up documents
• Explore digital taxonomic literature "Arxiv"
• Drupal based with underlying DSpace repository and handle server
Plazi workflow
Plazi products
• OCR-ed texts (dirty, clean)
• ABBYY training files for fonts
• ABBYY training files for journals
• ABBYY custom dictionary
GoldenGATE interactions
- Get GUID from Hymenoptera Name Server for names
- Add new names
Terminology follows ITIS; currently upload into Hymenoptera Name Server; query via HTML.
GoldenGATE interactions
- Get GUID from Hymenoptera Name Server (HNS) for names; ZooBank?
- Add new names
- Get bibliographic metadata from HNS (MODS)
- Get bibliographic GUIDs from bioguid
- Get geographic long/lat from geonames.org
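The lookups listed above can be sketched as a small enrichment step: resolve each name to a GUID, register the name if it is new, and attach coordinates for a locality. This is only a minimal illustration. `NameRegistry`, `resolve_name`, `annotate`, the `urn:example:` identifiers and the gazetteer dictionary are all hypothetical stand-ins; none of them is the actual interface of HNS, ZooBank, bioguid or geonames.org, and the coordinates are placeholder values.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class NameRegistry:
    """In-memory stand-in for a name server (hypothetical, not the HNS API)."""
    prefix: str = "urn:example:name:"
    ids: Dict[str, str] = field(default_factory=dict)

    def lookup(self, name: str) -> Optional[str]:
        # Return the GUID for a known name, or None if it is not registered.
        return self.ids.get(name)

    def register(self, name: str) -> str:
        # Mint a new identifier for a name that was not found.
        guid = f"{self.prefix}{len(self.ids) + 1}"
        self.ids[name] = guid
        return guid


def resolve_name(registry: NameRegistry, name: str) -> Tuple[str, bool]:
    """Return (guid, was_new): look the name up first, add it only if missing."""
    guid = registry.lookup(name)
    if guid is not None:
        return guid, False
    return registry.register(name), True


# Placeholder gazetteer: locality -> (latitude, longitude). Values are made up.
GAZETTEER: Dict[str, Tuple[float, float]] = {"Example locality": (0.0, 0.0)}


def annotate(registry: NameRegistry, name: str, locality: str) -> dict:
    """Attach a name GUID and, where known, coordinates to one record."""
    guid, was_new = resolve_name(registry, name)
    return {
        "name": name,
        "name_guid": guid,
        "new_name": was_new,
        "locality": locality,
        "lat_lon": GAZETTEER.get(locality),  # None when the place is unknown
    }


registry = NameRegistry()
first = annotate(registry, "Formica rufa", "Example locality")
second = annotate(registry, "Formica rufa", "Unknown place")
```

The second call reuses the GUID minted by the first, which is the point of checking the name server before adding a new name.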
Products (1): documents
pdf, xslt-html, xml
Get one with pdf, xml
• PDF (original or scanned)
• HTML via XSLT
• XML (TaxonX)
All documents with GUIDs: minimally names and MODS; maximally bibliographic references, specimens, localities
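To make the idea of one marked-up XML source feeding an HTML view concrete, here is a minimal sketch. It uses Python's standard `xml.etree.ElementTree` as a stand-in for the XSLT step, since XSLT is not part of the Python standard library. The element and attribute names (`document`, `treatment`, `name`, `guid`) and the `urn:example:` identifier are assumptions made for illustration; they are not the TaxonX schema.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical marked-up document; NOT the real TaxonX vocabulary.
SAMPLE = """<document>
  <treatment>
    <name guid="urn:example:name:1">Formica rufa</name>
    <locality lat="0.0" lon="0.0">Example locality</locality>
  </treatment>
</document>"""


def to_html(xml_text: str) -> str:
    """Render every <name> element as an HTML list item showing its GUID."""
    root = ET.fromstring(xml_text)
    items = []
    for name in root.iter("name"):
        label = escape((name.text or "").strip())
        guid = escape(name.get("guid", ""))
        items.append(f"<li><i>{label}</i> <code>{guid}</code></li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_view = to_html(SAMPLE)
print(html_view)
```

Because the GUID travels with the name in the XML, any rendering derived from it (HTML here) can expose the same identifier to other applications.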
Plazi workflow
Products (2): Search and Retrieval Server
Search and Retrieval Server: Output
Products: What content do we have in store?
• Gold standard: 120+ taxonomic publications from Madagascar, ranging from 1758 to 2007 (70% completed) (vertical)
• Recent publications continually added (horizontal standard)
• Series of publications describing elements of TaxonX, GoldenGATE, and name-finding algorithms (FindIT, FAT), and comparing approaches
• Increasing library of training files for ABBYY and analyzers for GoldenGATE
Additional products
• Training course for literature mark-up to get the community involved
• Creating a neotropical catalogue of the ants using the mark-up approach
• Development of metrics to measure mark-up production to optimize output for users (ecologists, taxonomists, etc.)
[Chart: Time (min, axis 0 to 7) to produce clean OCR using ABBYY, per publication identified by HNS ID; publications in chronological order. Label: Ann. Soc. Entomol. Belg.]
Producing metrics to measure effort and compare various approaches and algorithms
[Chart: mark-up time based on number of characters per document; logarithmic axis (1 to 1000); series: time/character x 1000 and characters/1000]
Time used to mark up documents in TaxonX in comparison to the number of pages per volume, in chronological order.
Producing metrics to measure effort and compare various approaches and algorithms to mark up documents
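The normalised measure plotted here, mark-up time per 1000 characters, is simple to compute once the time spent and the document length are recorded. The sketch below shows one way to do so. The record structure, the function name and the figures are invented for illustration only; they are not measurements from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MarkupRecord:
    doc_id: str       # hypothetical document identifier
    characters: int   # length of the document text
    minutes: float    # time spent marking it up


def minutes_per_1000_chars(rec: MarkupRecord) -> float:
    """Mark-up time normalised to 1000 characters of text."""
    if rec.characters <= 0:
        raise ValueError("document must contain at least one character")
    return rec.minutes * 1000 / rec.characters


# Made-up example values, listed in chronological order of publication.
records: List[MarkupRecord] = [
    MarkupRecord("doc-a", characters=20000, minutes=60.0),
    MarkupRecord("doc-b", characters=5000, minutes=30.0),
]

for rec in records:
    print(rec.doc_id, rec.characters / 1000, minutes_per_1000_chars(rec))
```

Normalising by length makes short and long documents comparable, so differences between tools or between older and newer publications are not masked by document size.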
Additional products
• Training course for literature mark-up to get the community involved
• Creating a neotropical catalogue of the ants using the mark-up approach
• Development of metrics to measure mark-up production to optimize output for users (ecologists, taxonomists, etc.)
• Experience: mark-up is expensive....
[Diagram: value for scientist plotted against costs for different digitization products: print, pdf, print + catalogue, image, pdf/ocr, ocr dirty, ocr clean, struct. xml, semant xml, semant xml high, s-xml linked database; marked with "?"]
How best to invest in the digitization of legacy publications?
Names marked up
Treatments marked up
Finer-grained mark-up
[Diagram: publication workflow. Manuscript stages: analysis & ms preparation; ms submission ("Taxon-x-version"); new ms alert; posting for review; edited ms; revised ms; accepted ms; new taxon alert; publication: pdf; publication: hard copy; publication database ("taxon-x-version"). Linked resources: ontology, bibliography, ZooBank / NS, Taxon DB, Character DB, Specimen DB, Description DB, Distribution DB, Char. Matrix DB, Phyl. Tree DB, Char-state Im., Specimen Im., Habitat Image, Leg. Publicat. Other labels: New Data, feedback.]
….. to the Future of Publication: publication as a version control instrument