supporting dataset descriptions in the life sciences
Post on 11-Apr-2017
Supporting Dataset Descriptions in the Life Sciences
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
FAIR Data Principles
5 April 2017 @gray_alasdair www.macs.hw.ac.uk/~ajg33 2
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18
Degrees of FAIRness
Open PHACTS Explorer
Figure labels: Core Platform; Data Cache (Triple Store); Semantic Workflow Engine; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Identity Resolution Service; Identifier Management Service. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data sources and versions shown: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD; v13, v12; v2 or v8.
Which ChEMBL version?
Historic Use Case
~January 2012
Open PHACTS v2.1
ChEMBL 20
http://tiny.cc/ops-datasets
Open PHACTS Provenance
Open PHACTS FAIR Data
Data Reuse Challenges
• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information
Scientists require data provenance!
Goal: To be FAIR
Open PHACTS Dataset Description Guidelines
Challenging for Publishers:
• Datasets are complex
• Evolve over time
• Another publishing burden
• Requires RDF knowledge
• Descriptions are complex
• Metadata precision
Tooling support required!
Open PHACTS Dataset Description Model
Open PHACTS Dataset Description Guidelines
Help me describe my data!
No! Use the Open PHACTS VoID Editor
Thanks for converting my data to RDF, can you help me make it findable by creating a VoID dataset description?
Dataset description Metadata Boring
Here are the guidelines, just write the terms in a text document.
Characters reproduced from Piled Higher and Deeper by Jorge Cham, http://phdcomics.com
Open PHACTS VoID Editor
Open PHACTS VoID Editor
Open PHACTS VoID Editor
Open PHACTS Validator
(Some) Life Sciences Metadata Specifications
Figure (axes: Depth, Reach): HCLS DataDesc
Bioschemas
Schema.org for biology
Minimum properties for
• Finding data
• Presenting search results
<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts: <span>144 kcal</span>,
  </div>
  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .
Structured data markup for web pages
Without markup
Labels: Recipe, Title, Nutrition, Calories, Ingredients
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
Structured data markup for web pages
RDFa, JSON-LD, Microdata
With markup
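The same recipe can also be written in JSON-LD, one of the syntaxes named on the slide. The following is a minimal Python sketch that builds such a document from the values visible in the markup example. The variable names `recipe` and `jsonld_text` are illustrative, and nothing beyond the slide's values is filled in.

```python
import json

# Schema.org description of the recipe from the slide, as a JSON-LD document.
# Only the values visible in the markup example are included.
recipe = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    "recipeIngredient": ["800g small new potato", "3 shallot"],
}

# Serialise for embedding in a <script type="application/ld+json"> element.
jsonld_text = json.dumps(recipe, indent=2)
print(jsonld_text)
```

Unlike Microdata, the JSON-LD form sits alongside the page content rather than being woven into the HTML elements.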
Minimum information
Controlled vocabularies
Cardinality
Data model
New properties
The ELIXIR Implementation Study
1. Data Repositories
2. Datasets
3. Beacons
4. Samples
5. Plant Phenotypes
6. Protein Annotations
7. Bioschemas registry
8. Validation
Henning Hermjakob
Susanna A Sansone
Serena ScollenHelen Parkinson
Rafa Jimenez
???
Maria Martin
Audald Lloret
Alasdair Gray
Phase 1: Planning (March-April 2017)
Phase 2: Agreement (May-June 2017)
Phase 3: Adoption (July-Oct 2017)
Phase 4: Application (Nov-Feb 2018)
(Some) Life Sciences Metadata Specifications
Figure (axes: Depth, Reach): HCLS DataDesc
W3C HCLS Group
Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI:10.7717/peerj.2331
Use Case Requirements
Standard metadata requirements plus:
1. Resolvable identifiers for metadata
2. Descriptions of data identifiers
3. Data provenance
4. Data statistics
HCLS Dataset Descriptions
61 Metadata properties from 18 vocabularies
5 Modules: Core, Identifiers, Provenance, Distributions, Stats
                                                                   Prescribed Usage
Element             Property         Value                               Summary Level  Version Level  Distribution Level
Core Metadata
Type declaration    rdf:type         dctypes:Dataset                     MUST           MUST           SHOULD
Type declaration    rdf:type         void:Dataset or dcat:Distribution   MUST NOT       MUST NOT       MUST
Title               dct:title        rdf:langString                      MUST           MUST           MUST
Alternative titles  dct:alternative  rdf:langString                      MAY            MAY            MAY
Description         dct:description  rdf:langString                      MUST           MUST           MUST
…                   …                …                                   …              …              …
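The prescribed-usage table lends itself to a simple lookup structure. Below is a minimal Python sketch, assuming only the five rows visible on the slide. The names `LEVELS`, `USAGE` and `properties_with_usage` are illustrative and not part of the HCLS profile or any tool.

```python
# Prescribed usage per description level, copied from the rows shown on the slide.
# The elided rows of the full profile are not reproduced here.
LEVELS = ("summary", "version", "distribution")

USAGE = {
    ("rdf:type", "dctypes:Dataset"): ("MUST", "MUST", "SHOULD"),
    ("rdf:type", "void:Dataset or dcat:Distribution"): ("MUST NOT", "MUST NOT", "MUST"),
    ("dct:title", "rdf:langString"): ("MUST", "MUST", "MUST"),
    ("dct:alternative", "rdf:langString"): ("MAY", "MAY", "MAY"),
    ("dct:description", "rdf:langString"): ("MUST", "MUST", "MUST"),
}


def properties_with_usage(level, keyword):
    """Return the (property, value) pairs whose usage at `level` equals `keyword`."""
    column = LEVELS.index(level)
    return [key for key, usage in USAGE.items() if usage[column] == keyword]


print(properties_with_usage("summary", "MUST"))
print(properties_with_usage("distribution", "SHOULD"))
```

Keeping the level as a column index mirrors the slide's layout, so a further row can be added without touching the lookup function.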
ChEMBL: Summary Level
Implementations
RDF Platform
More coming…
(Some) Life Sciences Metadata Specifications
Figure (axes: Depth, Reach): HCLS DataDesc
Layered Descriptions
Minimal dataset description
More detailed description
Dataset
Sketch of content
HCLS DataDesc
Configurable Tooling
Creation Validation
✗ ✓
Configurable Tooling
Creation Validation
✓ ✓
Constraint Languages
                       ShEx                 SHACL              JSON Schema
Status                 W3C Draft CG Report  W3C Working Draft  IETF Internet-Draft v5
Notation               Concise notation     Extended SPARQL    JSON
Data model             RDF                  RDF                JSON (JSON-LD?)
Open/closed            Supported            Supported          Closed
Result format          Defined              Defined
Constraint types supported
• Domain               ✓                    ✓                  ✓
• Values               ✓                    ✓                  ✓
• Cardinality          ✓                    ✓                  ✓
• Vocabulary           ✓                    ✓                  ✗
• Recursion            ✓                    ✗                  ✗
• Conformance levels   Extension            Fixed              ✗
Example Constraint
• A Dataset
  – MUST be declared to be of type dctypes:Dataset
  – MUST have a dcterms:title as a language-typed string
  – MUST NOT have a dcterms:created date
• Shape (diagram): <Dataset> with a title of type rdf:langString and a prohibited (✗) creation date
Dates are associated with versions in HCLS
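The three constraints can be checked directly against a set of triples. The following is a minimal Python sketch, not the Validata implementation. Triples are plain tuples, a language-tagged literal is modelled by a hypothetical `LangString` named tuple, and the `:chembl` subject and its values are invented for illustration. Title cardinality beyond "at least one" is not checked.

```python
from typing import NamedTuple


class LangString(NamedTuple):
    """Hypothetical stand-in for an rdf:langString literal."""
    text: str
    lang: str


def check_dataset(subject, triples):
    """Return a list of violations of the three example constraints."""
    # Group the objects of every triple about `subject` by property.
    props = {}
    for s, p, o in triples:
        if s == subject:
            props.setdefault(p, []).append(o)

    problems = []
    if "dctypes:Dataset" not in props.get("rdf:type", []):
        problems.append("MUST be of type dctypes:Dataset")
    titles = props.get("dcterms:title", [])
    if not titles or not all(isinstance(t, LangString) for t in titles):
        problems.append("MUST have a dcterms:title as a language typed string")
    if "dcterms:created" in props:
        problems.append("MUST NOT have dcterms:created")
    return problems


good = [
    (":chembl", "rdf:type", "dctypes:Dataset"),
    (":chembl", "dcterms:title", LangString("ChEMBL", "en")),
]
bad = good + [(":chembl", "dcterms:created", "2017-04-05")]

print(check_dataset(":chembl", good))  # no violations
print(check_dataset(":chembl", bad))   # reports the prohibited dcterms:created
```

Hand-written checks like this have to be rewritten for every profile, which is the motivation for declaring the constraints in a shape language instead.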
Example Validation
• Shape
• Data
Valid
Example Validation
• Shape
• Data
Not Valid
Example Validation
• Shape
• Data
Valid
Shape Expressions (ShEx)
Shape
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
ShEx: Validation
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
Validator can’t warn of missing property
Example data
Requirement Levels
Shape
<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}
Validator can warn of missing property
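One way to read the annotated shape is that the requirement level decides how serious a failed constraint is. The following is a minimal Python sketch of that idea, assuming a description is a dict from property to a list of values. The `SHAPE` list, the `SEVERITY` mapping and the `report` function are illustrative assumptions and do not describe how Validata behaves.

```python
# Requirement level for each property in the annotated shape on the slide.
# `forbidden` marks the negated constraint (!dct:created).
SHAPE = [
    {"level": "MUST", "property": "rdf:type", "forbidden": False},
    {"level": "MUST", "property": "dct:title", "forbidden": False},
    {"level": "MAY", "property": "dct:alternative", "forbidden": False},
    {"level": "MUST", "property": "dct:created", "forbidden": True},
]

# Assumed mapping from requirement level to severity when a constraint fails.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}


def report(description):
    """Return (severity, message) pairs for every constraint that fails."""
    messages = []
    for constraint in SHAPE:
        name = constraint["property"]
        present = bool(description.get(name))
        severity = SEVERITY[constraint["level"]]
        if constraint["forbidden"] and present:
            messages.append((severity, name + " is not allowed"))
        elif not constraint["forbidden"] and not present:
            messages.append((severity, name + " is missing"))
    return messages


# Invented example: no title, no alternative title, and a prohibited date.
description = {"rdf:type": ["dctypes:Dataset"], "dct:created": ["2017-04-05"]}
for severity, message in report(description):
    print(severity, message)
```

Without the level annotations, every failed constraint would look the same to the validator. With them, a missing optional property can be reported differently from a missing mandatory one.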
Implementation

Validata
• Web app front end
• JavaScript + HTML
• Relies on ShEx-validator
  – Validates documents
  – Returns report
https://github.com/HW-SWeL/Validata

ShEx-validator
• Validation system
• Validation API
• JavaScript
  – nodejs engine
• Reuses
  – n3: RDF Library
  – ShExParser
https://github.com/HW-SWeL/ShEx-validator
http://hw-swel.github.io/Validata/ VALIDATA DEMO
(Some) Life Sciences Metadata Specifications
Figure (axes: Depth, Reach): HCLS DataDesc
Configurable Tooling
Creation Validation
✓ ✓
Acknowledgements
BioSchemas
• Carole Goble
• Rafael Jimenez
FAIR Data
• FAIRdom project
• Jun Zhao
Open PHACTS
• Christian Brenninkmeijer
• Lefteris Tatakis
• Andra Waagmeester
Validata (MEng 2015)
• Andrew Beveridge
• Jacob Baungard Hansen
• Johnny Val
• Leif Gehrmann
• Roisin Farmer
• Sunil Khutan
• Tomas Robertson

• Eric Prud’hommeaux
Questions
Validata https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
  – Configurable to any profile
• Shape Expression (ShEx) constraints
Dumontier, M. et al. The health care and life sciences community profile for dataset descriptions. PeerJ 4, e2331 (2016). DOI:10.7717/peerj.2331
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair