eBankII Workshop
Making Scientific Data Openly Available
Simon Coles
School of Chemistry,
University of Southampton
Scientific Data Overload!
[Figure: chemical structure diagrams]
CombeChem: eScience testbed
[Diagram components: X-Ray e-Lab, Properties e-Lab, Analysis, Properties, Simulation, Video, Diffractometer, Structures Database, Grid Middleware]
Chemistry Publications
Ideas and interpretations
Hooks into the literature
Results & derived data
Raw data!
Establishing common ground
• Understand the data creation process
• Terminology and definitions
  – Data
  – Metadata
  – Datafile
  – Dataset
  – Data holding
• Different views
  – Digital library researchers, computer scientists, chemists
  – Generic vs specific
  – Modeller vs practitioner
• Aim for a common ontology
• Modelling the domain
• Creating a metadata schema
Crystallography workflow
RAW DATA → DERIVED DATA → RESULTS DATA
Crystallography datasets
Within a data holding are the following datasets:
• Initialisation: mount new sample on diffractometer & set up data collection
• Collection: collect data
• Processing: process and correct images
• Solution: solve structure
• Refinement: refine structure
• CIF: produce CIF (Crystallographic Information File format)
• Report: generate Crystal Structure Report
• Validation: generate report from structure checks
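The relationship between a data holding and its datasets can be sketched as a small data structure. This is only an illustration: the class names `Stage` and `DataHolding`, the `sample_id` field, and the file names are assumptions made for the example, not part of the eBank schema or the eCrystals software.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Stage(Enum):
    """Dataset stages listed on the slide, in workflow order."""
    INITIALISATION = "Initialisation"
    COLLECTION = "Collection"
    PROCESSING = "Processing"
    SOLUTION = "Solution"
    REFINEMENT = "Refinement"
    CIF = "CIF"
    REPORT = "Report"
    VALIDATION = "Validation"


@dataclass
class DataHolding:
    """Hypothetical container: one holding per sample, files grouped by stage."""
    sample_id: str
    datasets: Dict[Stage, List[str]] = field(default_factory=dict)

    def add_file(self, stage: Stage, filename: str) -> None:
        # Create the dataset for this stage on first use, then append the file.
        self.datasets.setdefault(stage, []).append(filename)

    def missing_stages(self) -> List[Stage]:
        # Stages with no dataset yet, returned in workflow order.
        return [stage for stage in Stage if stage not in self.datasets]


# Illustrative values only.
holding = DataHolding(sample_id="sample-001")
holding.add_file(Stage.COLLECTION, "frames.tar")
holding.add_file(Stage.CIF, "structure.cif")
print([stage.value for stage in holding.missing_stages()])
```

Keeping the stages in an ordered enumeration makes it easy to report which datasets an entry still lacks.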
Publishing, Informatics & Schemas
• Current schema is for publishing / advertising only
• eCrystals publishing requires lightweight schema only
• eBank harvesting requires lightweight schema only
• Aggregation and Linking requires a comprehensive schema
• Data management, Information delivery and Searching services require a very rich schema
Deposition into the archive
An Archive entry
ecrystals.chem.soton.ac.uk
Access to the underlying data
Some metadata issues
• Using simple and qualified Dublin Core
• Additional chemical information in schema for harvesting e.g. empirical formula
• Schema contains International Chemical Identifier (InChI)
• Specifies which ‘datasets’ are present in an entry
• Links to ePrints (and other published literature) derived from the data
• Using vocabularies specific to crystallography
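A record combining Dublin Core elements with chemistry-specific fields might look roughly like the sketch below. The Dublin Core namespace is the real one; the `ebank` namespace URI, the element names `empiricalFormula`, `inchi` and `dataset`, the `record` wrapper, and all values are placeholders chosen for illustration, not the actual eBank schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
EBANK_NS = "http://example.org/ebank/terms/"  # placeholder, NOT the real eBank namespace

ET.register_namespace("dc", DC_NS)
ET.register_namespace("ebank", EBANK_NS)


def build_record(title, creator, formula, inchi, datasets, related_url):
    """Build an illustrative metadata record as an ElementTree element."""
    record = ET.Element("record")
    # Simple Dublin Core elements.
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    ET.SubElement(record, f"{{{DC_NS}}}creator").text = creator
    # Additional chemical information (hypothetical element names).
    ET.SubElement(record, f"{{{EBANK_NS}}}empiricalFormula").text = formula
    ET.SubElement(record, f"{{{EBANK_NS}}}inchi").text = inchi
    # Which datasets are present in the entry.
    for name in datasets:
        ET.SubElement(record, f"{{{EBANK_NS}}}dataset").text = name
    # Link to a publication derived from the data.
    ET.SubElement(record, f"{{{DC_NS}}}relation").text = related_url
    return record


# Illustrative values only; water is used because its InChI is short.
record = build_record(
    title="Example crystal structure",
    creator="A. Researcher",
    formula="H2O",
    inchi="InChI=1S/H2O/h1H2",
    datasets=["Collection", "Refinement", "CIF"],
    related_url="https://example.org/eprints/123",
)
print(ET.tostring(record, encoding="unicode"))
```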
Harvesting: OAIster
Linking and aggregating
Embedded in a science portal
Current situation
• Version 2.0 eBank metadata schema
• Pilot institutional e-data repository for harvesting (raw, derived, results data) using EPrints software
• Exports records as ebank_dc and oai_dc
• Validation of schema & discussion with International Union of Crystallography for developments (and wider deployment)
• Pilot eBank UK aggregator service
• Developing search interface Version 1.0
• Testing with PSIgate physical sciences portal – embedding eBank UK
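A harvester would typically ask for these two export formats through OAI-PMH `ListRecords` requests. The sketch below only builds the request URLs. The base URL is a made-up placeholder (the repository's real OAI endpoint is not given here), and `list_records_url` is a helper invented for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# One request per export format named on the slide.
for prefix in ("oai_dc", "ebank_dc"):
    print(list_records_url(BASE_URL, prefix))
```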
What’s next?
• Generic metadata schema vs Subject specific schema
• Validation against other schema (CCLRC Model)
• (Eprints.org software: allow for more generic scientific data and schemas?)
• Metadata enhancement: keywords based on knowledge of keywords in related publications?
• Investigate identifiers: International Chemical Identifier
• Explore context sensitive linking
• Embedding into chemical and crystallographic research and publishing
• e-Learning embedding and pedagogic evaluation
• Feasibility study in related domains
Crystallography Schema Breakout
• Describing non dc: terms
  – METS
  – SET container
• Rights
  – IPR
  – Copyright
  – Publisher
  – Funder
• Linking
  – DOI
  – Keyword ontology
  – Identifiers
• Data validation
  – Add validation dataset
  – Other forms of validation: Mogul
• Chemical representation
  – Naming conventions
  – Empirical formula representation
• Relationship between repositories and harvesters
  – Registration / subscription
• Syndication
  – FRIENDS container
  – RSS feeds