semantic (web) technologies for translational research in life sciences
TRANSCRIPT
Knowledge Enabled Information and Services Science
Semantic (Web) Technologies for Translational Research in Life Sciences
Ohio State University, June 16, 2011
Amit P. ShethOhio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Thanks to Kno.e.sis team (Satya, Priti, Rama, and Ajith);Collaborators at CTEGD UGA (Dr. Tarleton, Brent Weatherly),
NLM (Olivier Bodenreider), CCRC, UGA (Will York), NCBO/Stanford, CITAR/WSU
Knowledge Enabled Information and Services Science
Kno.e.sis: Ohio Center of Excellence in Knowledge-enabled Computing
Knowledge Enabled Information and Services Science
Evolution of Web & Semantic Computing
Web of pages - text, manually created links
- extensive navigation
2007
1997
Web of databases - dynamically generated pages
- web query interfaces
Web of resources - data, service, data, mashups
Web of people
- social networks, user-created casual content
Sem
antic
Tec
hnol
ogy
Use
d
Keywords
Patterns
Objects
Situations,Events
Tech assimilated in life
Web 1.0
Web 2.0
Web 3.0
Web of Sensors, Devices/IoT- 40 billion sensors, 5 billion mobile connections
Knowledge Enabled Information and Services Science
Outline
• Semantic Web – very brief intro
• Scenarios to demonstrate the applications and benefit of semantic web technologies– Health Care– Biomedical Research– Translational
Knowledge Enabled Information and Services Science
Biomedical Informatics...
Medical Informatics Bioinformatics
Etiology PathogenesisClinical findingsDiagnosisPrognosisTreatment
GenomeTranscriptome
ProteomeMetabolome
Physiome...ome
Genbank
Uniprot
Pubmed
Clinical Trials.gov
...needs a connection
Biomedical Informatics
Hypothesis ValidationExperiment design
PredictionsPersonalized medicine
Semantic Web research aims atproviding this connection!
More advanced capabilities for search, integration, analysis, linking to new insights and discoveries!
Knowledge Enabled Information and Services Science
Biochemistry
Molecular Biology
Cellular biology
Anatomy,
Physiology
Neuroscience
Cognitive Science,
Psychology
Health &
Performance
Data and Facts
Knowledge and Understanding
Decision Making, Insights, InnovationsHuman Performance
ACATATGGGTACTATTTACTATTCATGGGTACTATTTAT
GGCATATGGCGTACTATTCTAATCCTATATCCGTCTAATCTATTTACTATTATCTATTA
CTATACCTTTTGGGGAAAAAAATTCTATACCGTCTAAT
CCTATAAATCAAGCCG
Knowledge Enabled Information and Services Science
Semantic Web standards @ W3C
• Semantic Web is built in a layered manner• Not everybody needs all the layers
Encoding characters : Unicode
Encoding structure: XML
Uniform metamodel: RDF + URI
Simple data models & taxonomies: RDF Schema
Rich ontologies: OWL
Queries: SPARQL, Rules: RIF
…
Semantic Web
Knowledge Enabled Information and Services Science
Linked Data: Semantic Web “diluted”
• Achieve for data what Web did to documents• Relationship with the original Semantic Web vision: no AI,
no agents, no autonomy• Interoperability is still very important
– interoperability of formats– interoperability of semantics
• Enables interchange of large data sets– (thus very useful in, say, collaborative research)
• Semantic Web vision is largely predicated on the availability of data– Linked Data is a movement that gets us there
Thanks – Ora Lassila
Knowledge Enabled Information and Services Science
Opportunity: exploiting clinical and biomedical data
text
Health Information
Services
Elsevier iConsult
Scientific Literature
PubMed300 Documents
Published Online each day
User-contributed Content (Informal)
GeneRifsWikiGene
NCBI Public Datasets
Genome, Protein DBs
new sequencesdaily
Laboratory Data
Lab tests, RTPCR,
Mass spec
Clinical Data
Personal health history
Search, browsing, complex query, integration, workflow, analysis, hypothesis validation, decision support.
Knowledge Enabled Information and Services Science
Major Community Efforts
• W3C Semantic Web Health Care & Life Sciences Interest Group: http://www.w3.org/2001/sw/hcls/
• Clinical Observations Interoperability: EMR + Clinical Trials: http://esw.w3.org/HCLS/ClinicalObservationsInteroperability
• National Center for Biomedical Ontologies: http://bioportal.bioontology.org/
Knowledge Enabled Information and Services Science
Major SW Projects
• OpenPHACTS: A knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA). http://www.openphacts.org/
• LarKC: develop the Large Knowledge Collider, a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. http://www.larkc.eu/
• NCBO: contribute to collaborative science and translational research. http://bioportal.bioontology.org/
Knowledge Enabled Information and Services Science
• Ontology: Agreement with Common Vocabulary & Domain Knowledge; Schema + Knowledge base
• Semantic Annotation (meatadata Extraction): Manual, Semi-automatic (automatic with human verification), Automatic
• Semantic Computation: semantics enabled search, integration, complex queries, analysis (paths, subgraph), pattern finding, mining, inferencing, reasoning, hypothesis validation, discovery, visualization
Semantic Web Enablers and Techniques
Knowledge Enabled Information and Services Science
Drug Ontology Hierarchy (showing is-a relationships)
owl:thing
prescription_drug
_ brand_na
me
brandname_unde
clared
brandname_comp
osite
prescription_drug
monograph_ix_cla
ss
cpnum_ group
prescription_drug
_ property
indication_
property
formulary_
property
non_drug_
reactant
interaction_proper
ty
property
formulary
brandname_indivi
dual
interaction_with_prescriptio
n_drug
interaction
indication
generic_ individua
l
prescription_drug_ generic
generic_ composit
e
interaction_ with_non_ drug_react
ant
interaction_with_monograph_ix_class
Knowledge Enabled Information and Services Science
N-Glycosylation metabolic pathway
GNT-Iattaches GlcNAc at position 2
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=>
UDP + N-Acetyl-$beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-$R2
GNT-Vattaches GlcNAc at position 6
UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
N-acetyl-glucosaminyl_transferase_VN-glycan_beta_GlcNAc_9N-glycan_alpha_man_4
Knowledge Enabled Information and Services Science
Maturing capabilites and ongoing research
• Ontology Creation• Semantic Annotation & Text mining: Entity recognition,
Relationship extraction• Semantic Integration & Provenance:
– Integrating all types of data used in biomedical research: text, experimetal data, curated/structured/public and multimedia
– Semantic search, browsing, analysis
• Clinical and Scientific Workflows with semantic web services
• Semantic Exploration of scientific literature, Undiscovered public knowledge
Knowledge Enabled Information and Services Science
Project 1: ASEMR
• Why: Improve Quality of Care and Decision Making without loss of Efficiency in active Cardiology practice.
• What: Use of semantic Web technologies for clinical decision support
• Where: Athens Heart Center & its partners and labs
• Status: In use continuously since 01/2006
Knowledge Enabled Information and Services Science
Operational since January 2006
Details: http://knoesis.org/library/resource.php?id=00004
Knowledge Enabled Information and Services Science
Annotate ICD9s
Annotate Doctors
Lexical Annotation
Level 3 Drug
Interaction
Insurance Formular
y
Drug Allergy
Active Semantic EMR
Demo at: http://knoesis.org/library/demos/
Knowledge Enabled Information and Services Science
Project 2: Glycomics
• Why: To help in the treatment of certain kinds of cancer and Parkinson's Disease.
• What: Semantic Annotation of Experiment Data
• Where: Complex Carbohydrate Research Center, UGA
• Status: Research prototype in use– Workflow with Semantic Annotation of
Experimental Data already in use
Knowledge Enabled Information and Services Science
N-Glycosylation Process (NGP)
Cell Culture
Glycoprotein Fraction
Glycopeptides Fraction
extract
Separation technique I
Glycopeptides Fraction
n*m
n
Signal integrationData correlation
Peptide Fraction
Peptide Fraction
ms data ms/ms data
ms peaklist ms/ms peaklist
Peptide listN-dimensional arrayGlycopeptide identificationand quantification
proteolysis
Separation technique II
PNGase
Mass spectrometry
Data reductionData reduction
Peptide identificationbinning
n
1
Knowledge Enabled Information and Services Science
Storage
Standard FormatData
Raw Data
Filtered Data
Search Results
Final Output
Agent Agent Agent Agent Biological Sample Analysis
by MS/MS
Raw Data to
Standard Format
DataPre-
process
DB Search
(Mascot/Sequest)
Results Post-
process
(ProValt)
O I O I O I O I O
Biological Information
SemanticAnnotationApplications
Scientific workflow for proteome analysis
Knowledge Enabled Information and Services Science
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
parent ion m/z
fragment ion m/z
ms/ms peaklist data
fragment ionabundance
parent ionabundance
parent ion charge
Semantic Annotation of Experimental Data
Mass Spectrometry (MS) Data
Knowledge Enabled Information and Services Science
<ms-ms_peak_list><parameter
instrument=“micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer” mode=“ms-ms”/>
<parent_ion m-z=“830.9570” abundance=“194.9604” z=“2”/><fragment_ion m-z=“580.2985” abundance=“0.3592”/><fragment_ion m-z=“688.3214” abundance=“0.2526”/><fragment_ion m-z=“779.4759” abundance=“38.4939”/><fragment_ion m-z=“784.3607” abundance=“21.7736”/><fragment_ion m-z=“1543.7476” abundance=“1.3822”/><fragment_ion m-z=“1544.7595” abundance=“2.9977”/><fragment_ion m-z=“1562.8113” abundance=“37.4790”/><fragment_ion m-z=“1660.7776” abundance=“476.5043”/>
</ms-ms_peak_list>
OntologicalConcepts
Semantic Annotation of Experimental Data
Semantically Annotated MS Data
Knowledge Enabled Information and Services Science
Project 3:
• Why: To associate genotype and phenotype information for knowledge discovery
• What: integrated data sources to run complex queries– Enriching data with ontologies for integration,
querying, and automation– Ontologies beyond vocabularies: the power of
relationships• Where: NCRR (NIH) • Status: Completed
Knowledge Enabled Information and Services Science
Use data to test hypothesis
gene
GO
PubMed
Gene name
OMIM
Sequence
Interactions
Glycosyltransferase
Congenital muscular dystrophy
Link between glycosyltransferase activity and congenital muscular dystrophy?
Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
Knowledge Enabled Information and Services Science
In a Web pages world…
Congenital muscular dystrophy,type 1D
(GeneID: 9215)
has_associated_disease
has_molecular_function
Acetylglucosaminyl-transferase activity
Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
Knowledge Enabled Information and Services Science
With the semantically enhanced data
MIM:608840Muscular dystrophy,
congenital, type 1D
GO:0008375
has_associated_phenotype
has_molecular_function
EG:9215LARGE
acetylglucosaminyl-transferase
GO:0016757glycosyltransferase
GO:0008194isa
GO:0008375acetylglucosaminyl-
transferase
GO:0016758
From medinfo paper.Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
SELECT DISTINCT ?t ?g ?d { ?t is_a GO:0016757 . ?g has molecular function ?t . ?g has_associated_phenotype ?b2 . ?b2 has_textual_description ?d .FILTER (?d, “muscular distrophy”, “i”) . FILTER (?d, “congenital”, “i”) }
Knowledge Enabled Information and Services Science
Project 4: Nicotine Dependence
• Why: For understanding the genetic basis of nicotine dependence.
• What: Integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.
• How: Semantic Web technologies (especially RDF, OWL, and SPARQL) support information integration and make it easy to create semantic mashups (semantically integrated resources).
• Where: NLM (NIH) • Status: Completed research
Knowledge Enabled Information and Services Science
Motivation
• NIDA study on nicotine dependency• List of candidate genes in humans• Analysis objectives include:
o Find interactions between geneso Identification of active genes – maximum
number of pathwayso Identification of genes based on anatomical
locations• Requires integration of genome and biological
pathway information
Knowledge Enabled Information and Services Science
Entrez Gene
ReactomeKEGG
HumanCyc
GeneOntology HomoloGene
Genome and pathway information integration
• pathway
• protein
• pmid
• pathway
• protein
• pmid
• pathway
• protein
• pmid
• GO ID • HomoloGene
ID
Knowledge Enabled Information and Services Science
JBI
Knowledge Enabled Information and Services Science
BioPAXontology
EntrezKnowledge
Model(EKoM)
Knowledge Enabled Information and Services Science
Results: Gene Pathway network and Hub Genes involved with Nicotine Dependence
Knowledge Enabled Information and Services Science
Project 5: T. cruzi SPSE
• Why: For Integrative Parasite Research to help expedite knowledge discovery
• What: Semantics and Services Enabled Problem Solving Environment (PSE) for Trypanosoma cruzi
• Where: Center for Tropical and Emerging Global Diseases (CTEGD), UGA
• Who: Kno.e.sis, UGA, NCBO (Stanford)• Status: Research prototype – in regular lab
use
Knowledge Enabled Information and Services Science
Project Outline
• Data Sources Internal Lab Data
• Gene Knockout• Strain Creation• Microarray• Proteome
External Database• Ontological
Infrastructure Parasite Lifecycle Parasite Experiment
• Query processing Cuebee
• Results
Knowledge Enabled Information and Services Science
Provenance in Parasite Research
*T.cruzi Semantic Problem Solving Environment Project, Courtesy of D.B. Weatherly and Flora Logan, Tarleton Lab, University of Georgia
SequenceExtraction
PlasmidConstruction
Transfection
DrugSelection
CellCloning
Gene Name
3‘ & 5’Region
Knockout Construct Plasmid
Drug Resistant Plasmid
Transfected Sample
Selected Sample
ClonedSample
T.Cruzi sample
Cloned Sample
Gene Name
?
Gene Knockout and Strain Creation*
Related Queries from Biologists• List all groups in the lab that used a
Target Region Plasmid?• Which researcher created a new
strain of the parasite (with ID = 66)?• An experiment was not successful –
has this experiment been conducted earlier? What were the results?
Knowledge Enabled Information and Services Science
Research Accomplishments
• SPSE Integrated internal data with external databases, such as KEGG,
GO, and some datasets on TriTrypDB Developed semantic provenance framework and influence W3C
community SPSE supports complex biological queries that help find gene
knockout, drug and/or vaccination targets. For example:Show me proteins that are downregulated in the
epimastigote stage and exist in a single metabolic pathway.Give me the gene knockout summaries, both for plasmid
construction and strain creation, for all gene knockout targets that are 2-fold upregulated in amastigotes at the transcript level and that have orthologs in Leishmania but not in Trypanosoma brucei.
Knowledge Enabled Information and Services Science
Knowledge driven query formulation
Complex queries can also include:- on-the-fly Web services execution to retrieve additional data- inference rules to make implicit knowledge explicit
Knowledge Enabled Information and Services Science
Project 6: HPCO
• Why: collaborative knowledge exploration over scientific literature
• What: An up-to-date knowledge based literature search and exploration framework
• How: Using information extraction, conventional IR, and semantic web technologies for collaborative literature exploration
• Where: AFRL• Status: Completed research
Knowledge Enabled Information and Services Science
Focused KB Work Flow (Use case: HPCO)
Doozer: Base Hierarchy from
WikipediaFocused
Pattern based extraction
HPC keywords
SenseLab Neuroscience
Ontologies
Initial KB Creation
NLM: Rule based BKR Triples
Knoesis: Parsing based NLP Triples
Meta Knowledgebase
PubMed Abstracts
Enrich Knowledge Base
Final Knowledge Base
Knowledge Enabled Information and Services Science
Triple Extraction Approaches
• Open Extraction– No fixed number of predetermined entities and
predicates– At Knoesis – NLP (parsing and dependency
trees)• Supervised Extraction
– Predetermined set of entities and predicates– At Knoesis – Pattern based extraction to
connect entities in the base hierarchy using statistical techniques
– At NLM – NLP and rule based approaches
Knowledge Enabled Information and Services Science
Mapping Triples to Base Hierarchy
• Entities in both subject and object must contain at least one concept from the hierarchy to be mapped to the KB
• Preliminary synonyms based on anchor labels and page redirects in Wikipedia
– Prolactostatin redirects to Dopamine
• Predicates (verbs) and entities are subjected to stemming using Wordnet
Knowledge Enabled Information and Services Science
Scooner: Full Architecture
Knowledge Enabled Information and Services Science
Scooner Features
• Knowledge-based browsing: Relations window, inverse relations, creating trails
• Persistent projects: Work bench, browsing history, comments, filtering
• Collaboration: comments, dashboard, exporting (sub)projects, importing projects
Knowledge Enabled Information and Services Science
Scooner Screenshot
Knowledge Enabled Information and Services Science
New Knowledge/hypothesis Example
• VIP Peptide – increases – Catecholamine Biosynthesis• Catecholamines – induce – β-adrenergic receptor activity• β-adrenergic receptors – are involved – fear conditioning
Three triples from different abstracts
New implicit knowledgeVIP Peptide – affects – fear conditioning
Caveat: Each triple above was observed in a different organism (cows, mice, humans), but still interesting hypothesis. Scooner’s contextual browsing makes this clear to the user.
Knowledge Enabled Information and Services Science
Project 7: Drug Abuse
• Why: To study social trends in pharmaceutical opioid abuse
• What: – Describe drug user’s knowledge, attitudes, and
behaviors related to illicit use of OxyContin® – Describe temporal patterns of non-medical use of
OxyContin® tablets as discussed on Web-based forums • Where: CITAR (Center for Interventions, Treatment and
Addictions Research) at Wright State Univ.• Status: In-progress (Recently funded from NIDA)
Knowledge Enabled Information and Services Science
Knowledge Enabled Information and Services Science
Project 8: NMR
• Why: Streamline the NMR data processing tasks. Processing NMR experimental data is complex and time consuming.
• What: Providing biologists with tools to effectively process and manage Nuclear Magnetic Resonance (NMR) experimental data.
• How: Use Domain Specific Languages (DSL) to create scientist-friendly abstractions for complex statistical workflows. Use semantics based techniques to store and manage data.
• Where: Air Force Research Lab• Status: In progress
Knowledge Enabled Information and Services Science
Motivation
NMR spectroscopy data is complex and require significant statistical processing before interpreting- Writing these processes is hard
- They have to run on many different computational platforms
- The data collected has to be shared among multiple parties
A simple NMR spectrum, highlighting peaks that
correspond to the presence of specific chemical compounds
Knowledge Enabled Information and Services Science
A complex NMR spectrum, marked with chemical compound identifiers by human observers.
Knowledge Enabled Information and Services Science
Project Outline
Identify fundamental operators required for common NMR processing tasks
Use a DSL to provide abstractions for the operators (named SCALE)
Build compilers to generate multiple, cloud-enabled applications
Knowledge Enabled Information and Services Science
Real time Healthcare Information
• Matching medical requirements with availability of medical resources (Mumbai, India)– Project HERO Helpline for Emergency Response Operations – For patients seeking for immediate medical help
• Medical awareness in rural India– mMitra, info. service during pregnancy and childhood
emergency
Information bridge
Medical Emergency
MedicalResourc
es
Knowledge Enabled Information and Services Science
Future Interoperability Challenge:360 degree health
Clinical CareInsurance, Financial Aspects
Genetic Tests… Profiles
Follow up,Lifestyle
Clinical Trials Social Media
Knowledge Enabled Information and Services Science
• For each component in 360-degree health care, we have data, processes, knowledge and experience. Interoperability solutions need to encompass all these!– Possibly largest growth in data will be in
sensors (eg Body Area Networks, Biosensors) and social content. Extensive use of mobile phones.
Credit: ece.virginia.edu
Knowledge Enabled Information and Services Science
Summary
• Semantic Web is an “interoperability technology”• Semantic Web provides the needed
interoperability, and can accommodate all necessary “points of view”
• Linked Data as a way of sharing data is highly promising
• Many examples of viable usage of Semantic Web technologies
• Words of warning about deployment• Significant research challenges remain as Health
presents the most complex domain
Knowledge Enabled Information and Services Science
Representative References
1. A. Sheth, S. Agrawal, J. Lathem, N. Oldham, H. Wingate, P. Yadav, and K. Gallagher, Active Semantic Electronic Medical Record, Intl Semantic Web Conference, 2006.
2. Satya Sahoo, Olivier Bodenreider, Kelly Zeng, and Amit Sheth, An Experiment in Integrating Large Biomedical Knowledge Resources with RDF: Application to Associating Genotype and Phenotype Information WWW2007 HCLS Workshop, May 2007.
3. Satya S. Sahoo, Kelly Zeng, Olivier Bodenreider, and Amit Sheth, From "Glycosyltransferase to Congenital Muscular Dystrophy: Integrating Knowledge from NCBI Entrez Gene and the Gene Ontology, Amsterdam: IOS, August 2007, PMID: 17911917, pp. 1260-4
4. Satya S. Sahoo, Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner , Amit P. Sheth, An ontology-driven semantic mash-up of gene and biological pathway information: Application to the domain of nicotine dependence, Journal of Biomedical Informatics, 2008.
5. Cartic Ramakrishnan, Krzysztof J. Kochut, and Amit Sheth, "A Framework for Schema-Driven Relationship Discovery from Unstructured Text", Intl Semantic Web Conference, 2006, pp. 583-596
6. Satya S. Sahoo, Christopher Thomas, Amit Sheth, William S. York, and Samir Tartir, "Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies", 15th International World Wide Web Conference (WWW2006), Edinburgh, Scotland, May 23-26, 2006.
7. Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, Amit Sheth and Krishnaprasad Thirunarayan, 'Provenance Context Entity (PaCE): Scalable provenance tracking for scientific RDF data.’ SSDBM, Heidelberg, Germany 2010.
• Papers: http://knoesis.org/library• Demos at: http://knoesis.wright.edu/library/demos/