Ontologies & Terminologies in Biomedicine
Methodological Approaches to the Population, Selection, and Evaluation of Conceptual Knowledge Collections
Philip R.O. Payne, Ph.D.
The Ohio State University Medical Center
Department of Biomedical Informatics



Objectives
• Define knowledge types
• Knowledge engineering
  – Acquisition
  – Representation
• Ontologies and terminologies
  – Differentiation
  – Uses
  – Challenges and opportunities

Grand Challenge for Translational Research: Integration, Modeling and Analysis
[Figure labels: DNA Sequences; Gene Expression; Pathways; Literature; Protein Structure; Proteomics; Genomic Databases; Lead Compounds; Databases (Structured Data, Text, Images); Biomedical Knowledge]

Defining Knowledge
• Procedural knowledge
  – Rules, algorithms, and procedures
• Conceptual knowledge
  – Interconnection of concepts by a network of relationships
  – Concepts are “the knowledge possessed by an individual about an object or event”
• Strategic knowledge
  – Relates procedural and conceptual knowledge

Importance of Conceptual Knowledge

• Conceptual knowledge structures or collections
  – Allow for translation of domain knowledge into computational forms amenable to generalization and inference
  – Enable efficient and effective development, maintenance, and dissemination of knowledge-based systems

Knowledge Engineering

• “…integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise…”
• Four major steps:
  – Knowledge Acquisition (KA)
  – Knowledge Representation (KR)
  – System Implementation and Refinement
  – Verification and Validation

Essential Theories of Knowledge Engineering

T: Description

T1
• Rejects declarative representations
• Focuses on frame-based representations

T2
• No explicit representation of meta-knowledge
• Focuses on axioms
• Applies rigid controls as to how axioms may be asserted

T3
• Active focus on ontology creation
• Ontologies not always used for execution, but rather for domain analysis

T4
• Strong commitment to customizable inference procedures

T5
• Hybrid of T3 and T5 approach in which PSMs are used to structure the analysis discussions, but are converted to T5 during implementation

T6
• Catalogues libraries of PSMs or explores a single PSM within such a library
• Extensive use of ontologies
• At runtime, a general inference engine may be employed

Knowledge Engineering & Expertise Transfer

Conceptual Knowledge Acquisition Techniques

Formal Concept Analysis (FCA)
• Formal Objects + Formal Attributes = Formal Context
• Closed Relation: can no longer enlarge the attribute or object set
• Formal Concept: Pairing of a Formal Object and Attributes in a Closed Relation

Attributes / Objects | Cartoon | Real | Tortoise | Dog | Cat | Mammal
Garfield             |    X    |      |          |     |  X  |   X
Snoopy               |    X    |      |          |  X  |     |   X
Socks                |         |  X   |          |     |  X  |   X
Greyfriar’s Bobby    |         |  X   |          |  X  |     |   X
Harriet              |         |  X   |    X     |     |     |
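The FCA definitions can be illustrated with a small Python sketch. It assumes the cross-table assigns Garfield to Cartoon/Cat/Mammal, Snoopy to Cartoon/Dog/Mammal, Socks to Real/Cat/Mammal, Greyfriar’s Bobby to Real/Dog/Mammal, and Harriet to Real/Tortoise. The function names are illustrative and do not come from the slides. The sketch closes every subset of objects, which is only practical for very small contexts.

```python
from itertools import combinations

# Formal context: each formal object maps to its set of formal attributes.
# The cell assignments are an assumption about the slide's cross-table.
context = {
    "Garfield": {"Cartoon", "Cat", "Mammal"},
    "Snoopy": {"Cartoon", "Dog", "Mammal"},
    "Socks": {"Real", "Cat", "Mammal"},
    "Greyfriar's Bobby": {"Real", "Dog", "Mammal"},
    "Harriet": {"Real", "Tortoise"},
}
all_attributes = set().union(*context.values())


def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes):
    """Objects that carry every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate closed (extent, intent) pairs by closing each subset of objects."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)  # cannot be enlarged any further
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "<->", sorted(intent))
```

Under these assumptions the context yields 13 formal concepts, including the pairing of Socks, Greyfriar’s Bobby, and Harriet with the single attribute Real.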

Concept Lattice
• Used to visualize the connection between “formal concepts”
• Allows for the application of graph theoretic evaluation metrics to the resulting conceptual knowledge structure

FCA in Multidimensional Data Sets

Use “situations”
  – Decompose the n-dimensional matrix into 2-dimensional matrices
  – Associate each 2-dimensional matrix with the set of remaining attributes (e.g., axes) that comprise the “situation” for the resulting “formal context”
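As a rough illustration of the decomposition, the Python sketch below splits a 3-dimensional relation into one 2-dimensional formal context per “situation”. The patients, findings, and time points are made-up data, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical 3-dimensional relation: (object, attribute, time point) triples that hold.
relation = {
    ("patient_1", "fever", "day_1"),
    ("patient_1", "cough", "day_1"),
    ("patient_2", "fever", "day_1"),
    ("patient_1", "cough", "day_2"),
    ("patient_2", "cough", "day_2"),
}


def split_into_situations(triples):
    """Group (object, attribute) pairs by the value of the remaining axis."""
    contexts = defaultdict(set)
    for obj, attribute, situation in triples:
        contexts[situation].add((obj, attribute))
    return dict(contexts)


contexts = split_into_situations(relation)
for situation in sorted(contexts):
    # Each value is an ordinary 2-dimensional formal context.
    print(situation, sorted(contexts[situation]))
```

With more than three dimensions, the key would become a tuple of all the remaining axis values rather than a single value.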

Conceptual Knowledge Discovery and Data Analysis (CKDD)
• Derivative of the field of conceptual knowledge processing
• Based on the mathematical theory of FCA
• Operationalization of FCA that elicits expert knowledge from existing sources rather than from experts directly
  – In databases, this is known as Constructive Induction
• Often requires human intervention to arbitrate conflicts or ambiguity in resulting conceptual knowledge structures
• Software tools:
  – TOSCANA: comprehensive CKDD environment
  – CHIANTI: in-line CKDD algorithm that can be integrated with other data mining applications

Constructive Induction: Ontology-anchored Knowledge Discovery in Databases

Knowledge Representation (KR)

“The process and the result of formalization of knowledge in such a way, that it can be used automatically for problem solving.”
• Five defining principles:
  – Medium for human expression
  – Set of ontological commitments
  – Surrogate
  – Fragmentary theory of intelligent reasoning
  – Medium for pragmatically efficient computation

KR in Biomedicine - Terminologies
• Segregation of collections into terminology and assertions
• Surface-form representation(s) versus conceptual knowledge

SNOMED-CT Code                                    | Textual Description
D5-46210 01                                       | Acute Appendicitis, NOS
D5-46100 01; G-A231 01                            | Appendicitis, NOS; Acute
M-41000 01; G-C006 01; T-59200 01                 | Acute Inflammation, NOS; In; Appendix, NOS
G-A231 01; M-40000 01; G-C006 01; T-59200 01      | Acute; Inflammation, NOS; In; Appendix, NOS
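The gap between surface forms and conceptual knowledge can be sketched in Python. The `DEFINITIONS` table below is an assumption: it is inferred by treating the four coded expressions on this slide as equivalent, and it is not taken from SNOMED itself. The function name is likewise illustrative.

```python
# Hypothetical definitions of composite codes in terms of other codes,
# inferred from the slide on the assumption that all four rows mean the same thing.
DEFINITIONS = {
    "M-41000 01": {"G-A231 01", "M-40000 01"},                 # Acute Inflammation
    "D5-46100 01": {"M-40000 01", "G-C006 01", "T-59200 01"},  # Appendicitis
    "D5-46210 01": {"D5-46100 01", "G-A231 01"},               # Acute Appendicitis
}


def expand(codes):
    """Rewrite a coded expression into its set of primitive (undefined) codes."""
    primitives = set()
    stack = list(codes)
    while stack:
        code = stack.pop()
        if code in DEFINITIONS:
            stack.extend(DEFINITIONS[code])
        else:
            primitives.add(code)
    return frozenset(primitives)


surface_forms = [
    ["D5-46210 01"],
    ["D5-46100 01", "G-A231 01"],
    ["M-41000 01", "G-C006 01", "T-59200 01"],
    ["G-A231 01", "M-40000 01", "G-C006 01", "T-59200 01"],
]
canonical_forms = {expand(form) for form in surface_forms}
print(len(canonical_forms))
```

Comparing the expanded sets rather than the raw code lists is what lets a system recognize the four surface forms as one concept.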

KR in Biomedicine
• Critical representation issues:
  – Anatomic location and temporal relationships
  – “Meaning”, which can take multiple forms

Conceptual Knowledge Representation Structures
• Logical Models
• Ontologies
  – Mark-up languages
  – Sharing of knowledge
• Terminologies
• Database Schemas

Logical Model for Clinical Data (1)

• Formal (Predicate) logic
  – Topic neutral
  – Allows for representation of formal relationships between concepts that is computationally tractable for inference

∀x PLEURAL-EFFUSION(x) ≡ EFFUSION(x) ∧ ∃y PLEURAL-CAVITY(y) ∧ ∃z DISEASE(z) ∧ LOCATED-IN(x,y) ∧ CAUSED-BY(x,z)
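One way to see why such a definition is usable for inference is to evaluate it over a small fact base. The Python sketch below does this with made-up individuals (`e1`, `pc1`, and so on), and it reads the existential quantifiers as ranging over the whole conjunction.

```python
# Hypothetical fact base: unary predicates as sets, binary predicates as sets of pairs.
effusions = {"e1", "e2"}
pleural_cavities = {"pc1"}
diseases = {"d1"}
located_in = {("e1", "pc1"), ("e2", "knee1")}
caused_by = {("e1", "d1"), ("e2", "d1")}


def is_pleural_effusion(x):
    """EFFUSION(x) and some pleural cavity y with LOCATED-IN(x, y)
    and some disease z with CAUSED-BY(x, z)."""
    return (
        x in effusions
        and any((x, y) in located_in for y in pleural_cavities)
        and any((x, z) in caused_by for z in diseases)
    )


for individual in sorted(effusions):
    print(individual, is_pleural_effusion(individual))
```

Here `e1` satisfies the definition, while `e2` does not because it is located somewhere other than a pleural cavity.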

Logical Model for Clinical Data (2)
• Conceptual graph notation
  – Ability to represent complete first order, modal, or higher-order logics in a human and computer readable format

• [EFFUSION x] - (LOCATED-IN) → [PLEURAL-CAVITY] (CAUSED-BY) → [DISEASE]

Ontologies
• Explicit specification of a conceptualization
  – Abstract, simplified view of the world that we wish to represent for some purpose
  – Basis of formally represented knowledge
    • Objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them
• Modern knowledge organization frameworks
  – RDF/S
  – OKBC
  – KIF
  – OWL
  – OML/CKML

Ontology Design Criteria
• Clarity
• Coherence
• Extendibility
• Minimal encoding bias
• Ontological commitment
  – Ensuring that observable actions of a knowledge-based system are consistent with the definition of the ontology

Semantic Web Initiative
• Goal: support “deep” reasoning about the contents of distributed knowledge sources
• Major components:
  – URI: Uniform resource identifier
  – RDF/RDF Schema: XML-based description of knowledge sources (data models comprised of objects)
  – OWL: Web Ontology Language (adds semantics/meta-data to RDF, as well as ability to define logical assertions)
  – SPARQL: Semantic web query language
  – RIF: Rule interchange format
  – N-stores: Repositories for semantic web components (e.g., RDF/OWL)
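The division of labour between these components can be hinted at with a toy example. The Python sketch below is a plain in-memory stand-in, not a real RDF store or SPARQL engine: statements are (subject, predicate, object) tuples, a pattern with `None` acts like a query variable, and following `rdfs:subClassOf` links gives a simple form of inference. The `ex:` names are invented for illustration.

```python
# Toy triple store mirroring the RDF data model; the "ex:" terms are made up.
triples = {
    ("ex:PleuralEffusion", "rdfs:subClassOf", "ex:Effusion"),
    ("ex:PleuralEffusion", "ex:locatedIn", "ex:PleuralCavity"),
    ("ex:Effusion", "rdfs:subClassOf", "ex:Finding"),
}


def match(pattern, store):
    """Return the triples that fit a pattern; None plays the role of a query variable."""
    return sorted(
        triple for triple in store
        if all(p is None or p == value for p, value in zip(pattern, triple))
    )


def superclasses(term, store):
    """Follow rdfs:subClassOf links transitively."""
    found, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match((current, "rdfs:subClassOf", None), store):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(match((None, "ex:locatedIn", None), triples))
print(sorted(superclasses("ex:PleuralEffusion", triples)))
```

A production system would use a dedicated RDF library and store instead of Python sets; the sketch only shows the shape of the data and of a pattern query.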

Semantic Web Component “Stack”

Conceptual Knowledge Mark-up Language (CKML)
• OML: Representation language for ontological and schematic structure which allows for the assertion of constraints
• Simple OML:
  – Representation language for functions, reification, cardinality constraints, inverse relations, and collections
  – Closely related to the resource description framework with schemas (RDF/S) and XML-based Ontology Exchange Language (XOL)

Ontologies & Knowledge Sharing

Ontologies At Varying Levels of Resolution

[Figure: columns labeled Measure (meters, 10^0 down to 10^-9), Biology, and Technology. Biological levels: Organism, organs, Tissue, Cells, Organelles, Virus, DNA, bases. Resources shown: SNOMED (diseases), NCBI Taxonomy, SNOMED (organs), Mammalian Phenotype, SNOMED (morphology), ATCC (cell lines), Cell Ontology (cell types), Gene Ontology (subcellular), Gene Nomenclature, Quaternary code]

SNOMED
• Systematized Nomenclature of Medicine and Veterinary Medicine
• A Semantic Network between granular phenotypes and diseases
• > 500,000 clinical medicine concepts
• Licenses
  – Free for perpetual use in USA for any field of use
  – Included in free international UMLS license for research

SNOMED Information Model: Compositional, Multiaxial, Multiple Hierarchies
Axes: T M F C D P G L
H. pylori associated haemorrhagic Gastric Ulcer = (4) D5-32220 Gastric (1) Ulcer (2) with haemorrhage (3); G-C002 associated with (5); L-13551 H. pylori (6)
[Figure: the numbered parts (1) to (6) of the expression placed on the axes]

Terminologies
• Represent elemental concepts and relationships among them
• Knowledge representation closely related to that of ontologies
• Terminological and relational knowledge can be partitioned into a terminology construct and a semantic network
• Complexities in designing terminology representation models:
  – Representation of surface-level and conceptual entities
  – Poly-hierarchies
  – Inheritance
  – Maintenance and growth

Desiderata (Cimino, 1998)
• Content/domain coverage
• Concept-orientation
• Concept permanence
• Non-semantic concept identifiers
• Polyhierarchy
• Formal definitions
• Reject “not elsewhere classified”
• Multiple granularities
• Multiple, consistent views
• Representation of context
• Graceful evolution
• Recognize redundancy

Example Controlled Terminology: LOINC

• LOINC = Logical Observation Identifiers, Names and Codes
• 32,000 conceptual entities
  – 20,000 laboratory-associated
• LOINC laboratory codes are composed of six attributes:
  1. Component or analyte
  2. Property
  3. Time
  4. System or specimen or sample
  5. Scale of precision
  6. Method

Example LOINC Laboratory Code

LOINC Code: 2159-2
LOINC Name: CREATININE:MCNC:PT:AMN:QN:
Component / Analyte: Creatinine
Property: Mass Concentration
Time: Point In Time
System/Specimen: Amniotic Fluid
Scale: Quantitative
Method: <NULL>
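The six-part structure of the name can be unpacked mechanically. The following Python sketch splits the example name on colons; the function name and the dictionary keys are illustrative choices, and an empty part is reported as `None` to mirror the <NULL> method in the example.

```python
LOINC_PARTS = ("component", "property", "time", "system", "scale", "method")


def parse_loinc_name(name):
    """Split a colon-delimited six-part LOINC name into labeled parts."""
    fields = name.split(":")
    if len(fields) != len(LOINC_PARTS):
        raise ValueError(f"expected {len(LOINC_PARTS)} parts, got {len(fields)}")
    # An empty string means the part is unspecified (shown as <NULL> on the slide).
    return {part: (value or None) for part, value in zip(LOINC_PARTS, fields)}


parsed = parse_loinc_name("CREATININE:MCNC:PT:AMN:QN:")
for part, value in parsed.items():
    print(f"{part}: {value}")
```

Translating abbreviations such as MCNC or AMN into their display names would need a lookup table, which the sketch leaves out.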

Example Controlled Terminology: MED

• MED = Medical Entities Dictionary (CUMC)
• Concept-oriented
• Directed acyclic graph
• Frame-based
• Incorporates UMLS, ICD9-CM, and LOINC codes
• Currently contains over 69,000 conceptual entities
• Hierarchical and semantic relationships
  – Hierarchical = “has ancestor/descendant”
  – Semantic = “part of”, “has part”, “measured by”…

Example Controlled Terminology: MED

Mapping Between Terminologies: UMLS

• UMLS = Unified Medical Language System (NLM)
• Composed of:
  – Metathesaurus
  – Semantic Network
  – Lexicon
• Contains approximately 5 million codes representing 1 million concepts derived from 100 source terminologies

UMLS Semantic Network

UMLS Metathesaurus

UMLS Metathesaurus

Evaluating Knowledge Collections

• Verification is the process of determining if the resulting knowledge system meets the specific requirements of the end-users
• Validation is the process of determining if the resulting knowledge system meets the actual requirements of the end-users

Key Metrics of Interest

• Multi-expert agreement
• Degree of relationships between concepts
  – Distance
• Consistency of resulting knowledge structures
  – Structural
  – Axiomatic

Multi-expert Agreement
• The measurement of multi-expert agreement can be divided into four major sub-types:

                       | Same Terminology                                               | Different Terminology
Same Distinctions      | Consensus: Experts use the same terms and distinctions         | Correspondence: Experts use different terminology but the same distinctions
Different Distinctions | Conflict: Experts use the same terminology for different distinctions | Contrast: Experts use different terms and distinctions
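The four sub-types amount to a two-by-two lookup on terminology and distinctions. A minimal Python sketch follows; it assumes that whether two experts share terminology and distinctions has already been judged by some other means, and the function name is hypothetical.

```python
def agreement_type(same_terminology: bool, same_distinctions: bool) -> str:
    """Name the multi-expert agreement sub-type for one pair of expert responses."""
    if same_distinctions:
        return "consensus" if same_terminology else "correspondence"
    return "conflict" if same_terminology else "contrast"


for terms in (True, False):
    for distinctions in (True, False):
        print(terms, distinctions, agreement_type(terms, distinctions))
```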

Verification & Validation of Knowledge Collections

Information Theoretic Metrics

Graph-Theoretic Metrics
• Upon translation of matrix data into a graph construct, a number of metrics may be applied, including:
  – Distance
  – Adjacency
  – Transitive closure
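The three metrics can be sketched on a small directed graph. In the Python example below the graph is a made-up adjacency list, distance is measured as the number of edges on a shortest path, and the function names are illustrative.

```python
from collections import deque

# Hypothetical concept graph: each node maps to the nodes it points to.
graph = {
    "Garfield": ["Cat"],
    "Cat": ["Mammal"],
    "Dog": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": [],
}


def adjacent(graph, a, b):
    """True if there is a direct edge from a to b."""
    return b in graph.get(a, [])


def distance(graph, start, goal):
    """Fewest edges from start to goal (breadth-first search), or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


def transitive_closure(graph):
    """Map each node to every node reachable from it through one or more edges."""
    closure = {}
    for node in graph:
        reachable, stack = set(), list(graph[node])
        while stack:
            current = stack.pop()
            if current not in reachable:
                reachable.add(current)
                stack.extend(graph.get(current, []))
        closure[node] = reachable
    return closure


print(adjacent(graph, "Cat", "Mammal"))
print(distance(graph, "Garfield", "Animal"))
print(sorted(transitive_closure(graph)["Cat"]))
```

Because the edges are directed, `distance(graph, "Dog", "Cat")` returns `None`; treating the graph as undirected would be a different design choice.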

Logic-based Metrics: Evaluating the Gene Ontology (GO)

Statistical Metrics
• Comparison between expert generated reference standard and system under consideration
• One key issue in generating an expert reference standard is aggregating multiple expert responses
  – Voting
  – Statistics (average, median, min/max)
  – Heuristics

Statistical Metrics: Using Expert/Reference Standards

Statistical Metrics
• System performance can be compared to the reference standard using some combination of the following measures:
  – Simple accuracy/agreement
  – Sensitivity and Specificity
  – Area under the ROC-curve (C-statistic)
  – Recall and Precision
  – Specific agreement
  – Chance corrected agreement
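Most of these measures follow from a two-by-two confusion table. The Python sketch below assumes binary judgments (True/False) for both the reference standard and the system, uses Cohen's kappa as the chance-corrected agreement, and uses positive specific agreement as the specific-agreement measure. The area under the ROC curve is omitted because it needs ranked scores rather than binary labels. The example data are made up.

```python
def agreement_metrics(reference, predicted):
    """Compare binary system output with a binary reference standard.

    Assumes both classes occur in each list; otherwise a division by zero is possible.
    """
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    n = len(pairs)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal rates of each rater.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "positive_specific_agreement": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }


reference = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
for name, value in agreement_metrics(reference, predicted).items():
    print(f"{name}: {value:.3f}")
```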

Challenges
• Philosophical approach
• Resources
  – Curation (magnitude of domain concepts ≥ 10^6)
• Scalability of KE methods
• Content/domain coverage of available knowledge sources
  – Available expertise
• Tools to apply knowledge collections to practical problems

Philosophical Approach
• Realism: There exists a singular, factual truth that can be used to characterize all phenomena, which can be ascertained given sufficient time and effort
• Instrumentalism: It is not necessarily possible to ascertain or measure all the characteristics of a phenomenon such that a singular truth can be defined; therefore, all knowledge is an approximation (“snapshot”) of the truth

Solving Practical Problems: The Translational Ontology-anchored Knowledge-discovery Engine (TOKEn)

Induced Conceptual Knowledge Constructs (CKCs)

Relationship Pattern: Gene → Diagnosis
Induced Relationship: [ZAP70 gene]-gene_plays_role_in_process-[Ligand Binding]-biological_process_involves_gene_product-[LTB4R protein, human]-gene_product_expressed_in_tissue-[Lymphoid Tissue]-is_normal_tissue_origin_of_disease-[Chronic lymphocytic leukaemia refractory]

Relationship Pattern: Chromosomal Abnormality → Clinical Laboratory Value/Finding
Induced Relationship: [del(17p13)]-may_be_cytogenetic_abnormality_of_disease-[Chronic Lymphocytic Leukemia with Unmutated Immunoglobulin Heavy Chain Variable-Region Gene]-disease_has_molecular_abnormality-[Clonal Immunoglobulin Heavy Chain Gene Rearrangement]-may_be_molecular_abnormality_of_disease-[Pyothorax-Associated Lymphoma]-disease_may_have_finding-[Lactic acid dehydrogenase raised]

Conclusion

• Questions/Comments?

• [email protected]
• http://bmi.osu.edu/~payne