next generation cancer data discovery, access, and integration using prizms and nanopublications
DESCRIPTION
To encourage data sharing in the life sciences, supporting tools need to minimize effort and maximize incentives. We have created infrastructure that makes it easy to create portals that support dataset sharing and simplified publishing of the datasets as high-quality linked data. We report here on our infrastructure and its use in the creation of a melanoma dataset portal. This portal is based on the Comprehensive Knowledge Archive Network (CKAN) and Prizms, an infrastructure to acquire, integrate, and publish data using Linked Data principles. In addition, we introduce an extension to CKAN that makes it easy for others to cite datasets from within both publications and subsequently derived datasets using the emerging nanopublication and World Wide Web Consortium provenance standards.
TRANSCRIPT
![Page 1: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/1.jpg)
Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications
Jim McCusker (@jpmccu), Timothy Lebo (@timrdf), Michael Krauthammer, and Deborah McGuinness (@dlmcguinness)
![Page 2: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/2.jpg)
What we’re trying to fix
From: Data Sharing and Management SNAFU in 3 Acts
![Page 3: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/3.jpg)
What we’re trying to fix
What is the content of the field called “SAM1”?
Ah yes, SAM1 is the level of CXCR4 expression.
From: Data Sharing and Management SNAFU in 3 Acts
![Page 4: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/4.jpg)
What we’re trying to fix
That is logical if you think about it.
And what is the content of the field called “SAM2”?
From: Data Sharing and Management SNAFU in 3 Acts
![Page 5: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/5.jpg)
What we’re trying to fix
… What is the content of the field called “SAM2”?
I don’t remember.
From: Data Sharing and Management SNAFU in 3 Acts
![Page 6: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/6.jpg)
Life Science data seems to start its life very scruffy.
![Page 7: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/7.jpg)
5 Levels of Data Sharing, from scruffy to neat
Level 1: Basic data sharing (who, what, when, where, why)
Level 2: Automated Conversion (computable RDF representations)
Level 3: Semantic enhancement (human-enhanced RDF representations)
Level 4: Semantic eScience (use of vocabularies with formal semantics)
Level 5: Community-Based Standards (consensus use of preferred ontologies)
![Page 8: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/8.jpg)
The Prizms Architecture
![Page 9: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/9.jpg)
Prizms User Interactions
![Page 10: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/10.jpg)
Provenance of Prizms
[Diagram: Linking Open Govt. Data, Prizms, healthdata.tw.rpi.edu, and lod.melagrid.org, connected by three prov:wasDerivedFrom edges]
More Prizms Nodes: https://github.com/timrdf/prizms/wiki/Prizms-Nodes
![Page 11: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/11.jpg)
5 Levels in Prizms
Level 1: Basic data sharing (CKAN dataset metadata + datapubs)
Level 2: Automated Conversion (Prizms raw conversions)
Level 3: Semantic Conversion (Prizms enhanced conversions)
Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)
Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)
![Page 12: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/12.jpg)
Level 1: Basic Data Sharing
CKAN (Comprehensive Knowledge Archive Network) and Datapubs
![Page 13: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/13.jpg)
What is CKAN?
• A data portal for all kinds of data
• Link or upload
• Linked Data-friendly
• Link to:
o Files
o APIs
o SPARQL endpoints
o Metadata
o Publications
o Visualizations…
![Page 14: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/14.jpg)
data.melagrid.org: a portal for melanoma data
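Because the portal is a CKAN instance, its dataset metadata should be reachable through CKAN's Action API. The sketch below builds a `package_show` request URL and pulls the linked resources out of a response. It assumes the portal exposes the standard `/api/3/action/` endpoints; the helper names and the sample response are illustrative and were written by hand, not captured from the portal.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode, urljoin


def package_show_url(portal: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL that returns one dataset's metadata."""
    base = portal if portal.endswith("/") else portal + "/"
    return urljoin(base, "api/3/action/package_show") + "?" + urlencode({"id": dataset_id})


def resource_links(response_text: str) -> List[Tuple[str, str]]:
    """Extract (format, url) pairs from a package_show JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    resources = payload["result"].get("resources", [])
    return [(r.get("format", ""), r.get("url", "")) for r in resources]


# Hand-written sample response (an assumption for illustration, not portal output).
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "name": "exome-variants-in-melanoma",
        "resources": [
            {"format": "xls", "url": "http://example.org/exome_aa_variants_final.xls"}
        ],
    },
})

if __name__ == "__main__":
    print(package_show_url("http://data.melagrid.org", "exome-variants-in-melanoma"))
    print(resource_links(SAMPLE))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.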
![Page 15: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/15.jpg)
What is a Datapub?
[Graph view (VRAER: Viewing Relations, Attributes, and Entities in RDF) of dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma.ttl]
exome-variants-in-melanoma a Nanopublication
hasAssertion: assertion a Assertion
hasProvenance: provenance a Provenance
hasAttribution: attribution a Attribution
hasSupporting: supporting a Supporting
Groth et al., 2010
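The four-part structure of a datapub can be sketched as a small data model. This is a minimal illustration, not the schema itself: the `np:` prefix, the `#part` naming of the parts, and the choice to attach all four links directly to the nanopublication node are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical prefix for illustration; consult the nanopublication schema
# for the real namespace and property names.
NP = "np:"


@dataclass
class Datapub:
    """A dataset nanopublication with its four named parts."""
    name: str
    assertion: List[Triple] = field(default_factory=list)
    provenance: List[Triple] = field(default_factory=list)
    attribution: List[Triple] = field(default_factory=list)
    supporting: List[Triple] = field(default_factory=list)

    def head(self) -> List[Triple]:
        """Triples that type the nanopublication and link it to each part.

        Assumption: every link starts at the nanopublication node itself.
        """
        links = [(self.name, "a", NP + "Nanopublication")]
        for part in ("assertion", "provenance", "attribution", "supporting"):
            predicate = NP + "has" + part.capitalize()
            links.append((self.name, predicate, f"{self.name}#{part}"))
        return links


if __name__ == "__main__":
    pub = Datapub("exome-variants-in-melanoma")
    pub.attribution.append(("exome-variants-in-melanoma", "dcterms:rights", "cc-by"))
    for subject, predicate, obj in pub.head():
        print(subject, predicate, obj, ".")
```

Keeping the parts as separate lists mirrors the idea that the assertion (what the dataset is) can be cited independently of who contributed it.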
![Page 16: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/16.jpg)
Anatomy of a Datapub: Assertion
[Graph view (VRAER: Viewing Relations, Attributes, and Entities in RDF) of http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-assertion.ttl]
Edge labels: IMT, homepage, distribution
exome_aa_variants_final.xls a Distribution
accessURL: exome_aa_variants_final.xls
xls
value: xls
Variant data from "Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma" a Dataset
description: Variant data from M. Krauthammer, Y. Kong, B. Ha, P. Evans, A. Bacchiocchi, J.P. McCusker, E. Cheng, M.J. Davis, G. Goh, M. Choi, S. Ariyan, D. Narayan, K. Dutton-Regester, A. Capatana, E.C. Holman, M. Bosenberg, M. Sznol, H.M. Kluger, D.E. Brash, D.F. Stern, M.A. Materin, R.S. Lo, S. Mane, S. Ma, K.K. Kidd, N.K. Hayward, R.P. Lifton, J. Schlessinger, T.J. Boggon, and R. Halaban, Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nature Genetics, 2012. in press.
**Tab 1: Description** This worksheet contains a description of the variant calling method.
**Tab 2: SNVs** This worksheet contains automatically called somatic non-silent SNVs in matched melanoma samples. Annotations from MU2A.
**Tab 3: InDels** This worksheet contains automatically called somatic InDels in matched melanoma samples. Annotations from VEP.
**Tab 4: Splice Site Variants** This worksheet contains automatically called somatic splice site variants in matched melanoma samples. Annotations from VEP.
**Tab 5: Additional mutations** This worksheet contains additional somatic mutations. These mutations are either inferred in unmatched samples (see Methods overview above), or have been Sanger-validated via PCR amplified products, after manual inspections of sequencing reads. Annotations from MU2A/VEP.
Nomenclature --------
**SNV:** Single Nucleotide Variant
**DNV:** Dinucleotide Variant
**DNV*:** Two SNVs affecting the same codon, at positions 1 and 3 of the codon
**TNV:** Trinucleotide Variant
**Parentheses in genotype calls:** Nucleotides that appear in parentheses are true variant calls in tumor which have not been called somatic by the automatic pipeline. These variants are shown if another position in the same codon has a somatic call. The corresponding SNP position, if known, is also shown.
**InDel:** Insertions and Deletions
**HGVS:** Human Genome Variation Society variant format
**COSMIC:** Catalogue of Somatic Mutations - http://www.sanger.ac.uk/perl/genetics/CGP/cosmic/
**SNP:** This column provides SNP-IDs if available for any of the mutated positions in tumors
**PhyloP:** Computation of p-values for conservation or acceleration (http://compgen.bscb.cornell.edu/phast/faq.php). Data from UCSC genome browser.
References ------
**MU2A:** Garla V, Kong Y, Szpakowski S, Krauthammer M. MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics. 2011 Feb 1;27(3):416-8. Epub 2010 Dec 12. PubMed PMID: 21149339; PubMed Central PMCID: PMC3031033.
**VEP:** McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18. PubMed PMID: 20562413; PubMed Central PMCID: PMC2916720.
keyword: exome-sequencing, homo-sapiens
identifier: exome-variants-in-melanoma
![Page 17: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/17.jpg)
Anatomy of a Datapub: Attribution, Evidence
[Graph view (VRAER: Viewing Relations, Attributes, and Entities in RDF) of http://dl.dropboxusercontent.com/u/9752413/dils2013/exome-variants-in-melanoma-attribution.ttl]
Edge labels: contributor, creator
exome-variants-in-melanoma
rights: cc-by
James McCusker
mbox: mailto:[email protected]
Michael Krauthammer
mbox: mailto:[email protected]
Attribution
Evidence
![Page 18: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/18.jpg)
Citing a Dataset using Datapubs
![Page 19: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/19.jpg)
Citing a Dataset using Datapubs
![Page 20: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/20.jpg)
Levels 2-3: Automated Conversion, Semantic Conversion
Prizms raw conversions, enhanced conversions
![Page 21: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/21.jpg)
Prizms RDF Converter
smart, naïve bootstrap
"Hawaii","Alii Garden Market Place", "75-6129 Alii Drive", "Kailua-Kona", "96740", "-155.9819183", "19.61436844"
ds4383:thing_1367 raw:column_1 "Hawaii"; raw:column_2 "Alii Garden Market Place"; raw:column_3 "75-6129 Alii Drive"; raw:column_4 "Kailua-Kona"; raw:column_5 "96740"; raw:column_6 "-155.9819183"; raw:column_7 "19.61436844" .
ds4383:thing_1367 con:preferredURI ds4383:farmersMarket_1367 .
ds4383:farmersMarket_1367 a ds4383_vocab:FarmersMarket; con:address :address_1367; dcterms:title "Alii Garden Market Place"; wgs:lat 19.6; wgs:long -155.9 .
:address_1367 a con:Address; con:stateOrProvince typed_state:Hawaii; con:street "75-6129 Alii Drive"; con:city "Kailua-Kona"; con:zip "96740" .
typed_state:Hawaii a ds4383_vocab:State; dcterms:identifier "Hawaii"; rdfs:label "Hawaii"; owl:sameAs <http://sws.geonames.org/5855797/>, govtrackusgov:HI, dbpedia:Hawaii .
[Figure labels: enhancement, Time, Domain Expertise, SemWeb Expertise]
Lebo et al., 2012
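The naive "raw" step shown on this slide can be sketched in a few lines: every cell becomes a literal on a generated `raw:column_N` property, with no knowledge of what the columns mean. This is an illustration of the idea only, not the Prizms converter; the function names and the `first_row` parameter are assumptions.

```python
import csv
import io
from typing import List


def _escape(value: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def raw_convert(csv_text: str, dataset_prefix: str = "ds4383", first_row: int = 1) -> List[str]:
    """Turn each CSV row into one Turtle statement with a raw:column_N property per cell."""
    statements = []
    # skipinitialspace lets quoted cells follow ", " as in the slide's sample row.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for offset, row in enumerate(reader):
        subject = f"{dataset_prefix}:thing_{first_row + offset}"
        pairs = [f'raw:column_{i} "{_escape(cell)}"' for i, cell in enumerate(row, start=1)]
        statements.append(f"{subject} " + "; ".join(pairs) + " .")
    return statements


if __name__ == "__main__":
    row = ('"Hawaii","Alii Garden Market Place", "75-6129 Alii Drive", '
           '"Kailua-Kona", "96740", "-155.9819183", "19.61436844"')
    for statement in raw_convert(row, first_row=1367):
        print(statement)
```

An enhanced conversion would then replace the generic `raw:column_N` properties with chosen vocabulary terms, types, and links, which is where domain and Semantic Web expertise come in.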
![Page 22: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/22.jpg)
Prizms Benefits
Prizms has worked with:
• BFO/IAO/OBI
• SIO
• RDF Data Cube Vocabulary
• PROV
• VoID
• FOAF
• etc.
For free, you get:
• Provenance at dataset and triple levels
• Automatic source/dataset/version URI generation
• Automated conversion as data changes
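Source/dataset/version URI generation can be pictured as nesting three identifiers under a base URI. The sketch below assumes a `/source/…/dataset/…/version/…` path layout; that layout, the function name, and the example identifiers are assumptions for illustration and should be checked against the Prizms documentation.

```python
from typing import Dict
from urllib.parse import quote


def dataset_uris(base: str, source: str, dataset: str, version: str) -> Dict[str, str]:
    """Derive nested source, dataset, and version URIs from three identifiers.

    Assumed layout: {base}/source/{source}/dataset/{dataset}/version/{version}
    """
    base = base.rstrip("/")
    source_uri = f"{base}/source/{quote(source, safe='')}"
    dataset_uri = f"{source_uri}/dataset/{quote(dataset, safe='')}"
    version_uri = f"{dataset_uri}/version/{quote(version, safe='')}"
    return {"source": source_uri, "dataset": dataset_uri, "version": version_uri}


if __name__ == "__main__":
    # Hypothetical identifiers, chosen only to show the nesting.
    uris = dataset_uris("http://lod.melagrid.org/", "melagrid-org",
                        "exome-variants-in-melanoma", "2013-Jul-11")
    for level, uri in uris.items():
        print(level, uri)
```

Deriving all three URIs from the same inputs means a new version of a dataset gets a new, predictable address while the dataset and source URIs stay stable.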
![Page 23: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/23.jpg)
Future Work: Supporting Levels 4-5
Level 1: Basic data sharing (CKAN dataset metadata + datapubs)
Level 2: Automated Conversion (Prizms raw conversions)
Level 3: Semantic Conversion (Prizms enhanced conversions)
Level 4: Semantic eScience (Level 3 + NCBO ontology recommender + similar tools)
Level 5: Community-Based Standards (Level 4 + vocabulary reuse analysis)
✔ ✔ ✔ ✔ ✔
![Page 24: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/24.jpg)
Publishing Custom Linked Data Using LODSPeaKr
• Custom templates for RDF and HTML
• Templates driven by rdf:type
• Web-based template editor
• Embed easy-to-generate visualizations
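"Templates driven by rdf:type" means the page shown for a resource is chosen by looking up its type. The sketch below shows that dispatch pattern with Python's standard `string.Template`; it is not LODSPeaKr's template language or API, and the template registry and type names are hypothetical.

```python
from html import escape
from string import Template
from typing import Dict, Iterable

# Hypothetical registry mapping an rdf:type to an HTML template.
TEMPLATES: Dict[str, Template] = {
    "dcat:Dataset": Template("<h1>$label</h1><p>Dataset: $uri</p>"),
    "np:Nanopublication": Template("<h1>$label</h1><p>Nanopublication: $uri</p>"),
}
DEFAULT = Template("<h1>$label</h1><p>$uri</p>")


def render(uri: str, label: str, rdf_types: Iterable[str]) -> str:
    """Render a resource with the first template registered for one of its types."""
    values = {"uri": escape(uri), "label": escape(label)}
    for rdf_type in rdf_types:
        template = TEMPLATES.get(rdf_type)
        if template is not None:
            return template.substitute(values)
    # No registered type matched, so fall back to a generic page.
    return DEFAULT.substitute(values)


if __name__ == "__main__":
    print(render("http://example.org/d1", "Exome variants", ["dcat:Dataset"]))
    print(render("http://example.org/p1", "A person", ["foaf:Person"]))
```

Adding support for a new kind of resource then only requires registering one more template, without touching the rendering code.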
![Page 25: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/25.jpg)
Conclusions
• Prizms is an infrastructure for sharing data on many levels of sophistication
• Good support for Level 1-3 Data Sharing
• Initial support for Level 4-5 Data Sharing
• Didn't just make life science data better, it made future Linked Data better!
• More to be done, but lots of progress
![Page 26: Next Generation Cancer Data Discovery, Access, and Integration Using Prizms and Nanopublications](https://reader036.vdocument.in/reader036/viewer/2022062405/554e81ecb4c9054a698b5528/html5/thumbnails/26.jpg)
Thanks!
• Rensselaer Polytechnic (Tetherless World):
o Alvaro Graves
o John Erickson
o The LOGD Team
• The Open Knowledge Foundation Network (OKFN)
• Yale University:
o Ruth Halaban
o Tobias Kuhn
• Grant support from:
o Yale SPORE in Skin Cancer
o Semantic Sea Ice Interoperability Initiative