Ontologies & Terminologies in Biomedicine
Methodological Approaches to the Population, Selection, and Evaluation of Conceptual Knowledge Collections
Philip R.O. Payne, Ph.D.
The Ohio State University Medical Center
Department of Biomedical Informatics
Objectives
• Define knowledge types
• Knowledge engineering
  – Acquisition
  – Representation
• Ontologies and terminologies
  – Differentiation
  – Uses
  – Challenges and opportunities
Grand Challenge for Translational Research: Integration, Modeling and Analysis
[Figure labels: DNA Sequences; Gene Expression; Pathways; Literature; Protein Structure; Proteomics; Genomic Databases; Lead Compounds; Databases (Structured Data, Text, Images); Biomedical Knowledge]
Defining Knowledge
• Procedural knowledge
  – Rules, algorithms, and procedures
• Conceptual knowledge
  – Interconnection of concepts by a network of relationships
  – Concepts are “the knowledge possessed by an individual about an object or event”
• Strategic knowledge
  – Relates procedural and conceptual knowledge
Importance of Conceptual Knowledge
• Conceptual knowledge structures or collections
  – Allow for translation of domain knowledge into computational forms amenable to generalization and inference
  – Enable efficient and effective development, maintenance, and dissemination of knowledge-based systems
Knowledge Engineering
• “…integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise…”
• Four major steps:
  – Knowledge Acquisition (KA)
  – Knowledge Representation (KR)
  – System Implementation and Refinement
  – Verification and Validation
Essential Theories of Knowledge Engineering
T    Description
T1   Rejects declarative representations; focuses on frame-based representations
T2   No explicit representation of meta-knowledge; focuses on axioms; applies rigid controls as to how axioms may be asserted
T3   Active focus on ontology creation; ontologies not always used for execution, but rather for domain analysis
T4   Strong commitment to customizable inference procedures
T5   Hybrid of T3 and T5 approach in which PSMs (problem-solving methods) are used to structure the analysis discussions, but are converted to T5 during implementation
T6   Catalogues libraries of PSMs or explores a single PSM within such a library; extensive use of ontologies; at runtime, a general inference engine may be employed
Formal Concept Analysis (FCA)
• Formal Objects + Formal Attributes = Formal Context
• Closed Relation: can no longer enlarge the attribute or object set
• Formal Concept: Pairing of a Formal Object and Attributes in a Closed Relation
Attributes / Objects   Cartoon  Real  Tortoise  Dog  Cat  Mammal
Garfield               X                             X    X
Snoopy                 X                        X         X
Socks                           X                    X    X
Greyfriar’s Bobby               X               X         X
Harriet                         X     X
Concept Lattice
• Used to visualize the connection between “formal concepts”
• Allows for the application of graph theoretic evaluation metrics to the resulting conceptual knowledge structure
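As a rough illustration of how formal concepts and their lattice order can be derived from a formal context, here is a minimal Python sketch. The object-to-attribute assignments are an assumed reading of the example cross-table, the function names are illustrative only, and the brute-force enumeration is suitable only for very small contexts.

```python
from itertools import combinations

# Formal context: object -> set of attributes.
# ASSUMPTION: values reflect one reading of the slide's example cross-table.
CONTEXT = {
    "Garfield": {"Cartoon", "Cat", "Mammal"},
    "Snoopy": {"Cartoon", "Dog", "Mammal"},
    "Socks": {"Real", "Cat", "Mammal"},
    "Greyfriar's Bobby": {"Real", "Dog", "Mammal"},
    "Harriet": {"Real", "Tortoise"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(ALL_ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def common_objects(attributes):
    """Objects that carry every attribute in `attributes`."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset)
            extent = common_objects(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def is_subconcept(lower, upper):
    """Lattice order: a concept lies below another when its extent is a subset."""
    return lower[0] <= upper[0]


CONCEPTS = formal_concepts()
```

Under these assumptions the context yields 13 formal concepts, for example the pair ({Garfield, Socks}, {Cat, Mammal}); `is_subconcept` supplies the ordering from which a lattice diagram could be drawn.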
FCA in Multidimensional Data Sets
Use “situations”
  – Decomposition of the n-dimensional matrix into 2-dimensional matrices
  – Associate with set of remaining attributes (e.g., axes) that comprise a “situation” for resulting “formal context”
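The decomposition into “situations” might look roughly like the following Python sketch, which slices a 3-dimensional data set into one 2-dimensional formal context per value of the remaining axis. The patient, finding, and visit names are hypothetical and serve only to show the shape of the data.

```python
from collections import defaultdict

# HYPOTHETICAL 3-dimensional data: (object, attribute, situation) triples that hold.
OBSERVATIONS = [
    ("patient_1", "fever", "visit_1"),
    ("patient_1", "cough", "visit_1"),
    ("patient_2", "fever", "visit_1"),
    ("patient_1", "cough", "visit_2"),
    ("patient_2", "rash", "visit_2"),
]


def contexts_by_situation(observations):
    """Split (object, attribute, situation) triples into one 2-D formal
    context per situation, each mapping object -> set of attributes."""
    contexts = defaultdict(lambda: defaultdict(set))
    for obj, attribute, situation in observations:
        contexts[situation][obj].add(attribute)
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {situation: dict(ctx) for situation, ctx in contexts.items()}


CONTEXTS = contexts_by_situation(OBSERVATIONS)
```

Each entry of `CONTEXTS` is an ordinary formal context, so the usual 2-dimensional FCA machinery can be applied per situation.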
Conceptual Knowledge Discovery and Data Analysis (CKDD)
• Derivative of the field of conceptual knowledge processing
• Based on the mathematical theory of FCA
• Operationalization of FCA that elicits expert knowledge from existent sources rather than from experts directly
  – In databases, this is known as Constructive Induction
• Often requires human intervention to arbitrate conflicts or ambiguity in resulting conceptual knowledge structures.
• Software tools:
  – TOSCANA: comprehensive CKDD environment
  – CHIANTI: in-line CKDD algorithm that can be integrated with other data mining applications
Knowledge Representation (KR)
“The process and the result of formalization of knowledge in such a way, that it can be used automatically for problem solving.”
• Five defining principles:
  – Medium for human expression
  – Set of ontological commitments
  – Surrogate
  – Fragmentary theory of intelligent reasoning
  – Medium for pragmatically efficient computation
KR in Biomedicine - Terminologies
• Segregation of collections into terminology and assertions
• Surface-form representation(s) versus conceptual knowledge
Textual Description                            SNOMED-CT Code
Acute Appendicitis, NOS                        D5-46210 01
Appendicitis, NOS; Acute                       D5-46100 01; G-A231 01
Acute Inflammation, NOS; In; Appendix, NOS     M-41000 01; G-C006 01; T-59200 01
Acute; Inflammation, NOS; In; Appendix, NOS    G-A231 01; M-40000 01; G-C006 01; T-59200 01
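One way to see why several surface forms can denote the same concept is to expand each coded expression into primitive codes and compare the results. The sketch below assumes decomposition rules inferred from the four encodings in the example; they are hypothetical, are not taken from a SNOMED release, and omit the trailing `01` shown with each code.

```python
# HYPOTHETICAL decomposition rules inferred from the example encodings;
# they are not taken from any SNOMED distribution.
DEFINITIONS = {
    "D5-46210": {"D5-46100", "G-A231"},            # acute appendicitis = appendicitis + acute
    "D5-46100": {"M-40000", "G-C006", "T-59200"},  # appendicitis = inflammation + in + appendix
    "M-41000": {"M-40000", "G-A231"},              # acute inflammation = inflammation + acute
}


def expand(codes):
    """Rewrite pre-coordinated codes into primitive codes until none remain."""
    result = set()
    for code in codes:
        if code in DEFINITIONS:
            result |= expand(DEFINITIONS[code])
        else:
            result.add(code)
    return frozenset(result)


# The four surface forms from the example table.
SURFACE_FORMS = [
    ["D5-46210"],
    ["D5-46100", "G-A231"],
    ["M-41000", "G-C006", "T-59200"],
    ["G-A231", "M-40000", "G-C006", "T-59200"],
]
CANONICAL = {expand(form) for form in SURFACE_FORMS}
```

With these assumed rules all four forms expand to the same set of primitive codes, so `CANONICAL` holds a single element.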
KR in Biomedicine
• Critical representation issues:
  – Anatomic location and temporal relationships
  – “Meaning”, which can take multiple forms
Conceptual Knowledge Representation Structures
• Logical Models
• Ontologies
  – Mark-up languages
  – Sharing of knowledge
• Terminologies
• Database Schemas
Logical Model for Clinical Data (1)
• Formal (Predicate) logic
  – Topic neutral
  – Allows for representation of formal relationships between concepts that is computationally tractable for inference.
∀x PLEURAL-EFFUSION(x) ≡ EFFUSION(x) ∧ ∃y PLEURAL-CAVITY(y) ∧ ∃z DISEASE(z) ∧ LOCATED-IN(x,y) ∧ CAUSED-BY(x,z)
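To make the definition concrete, the following Python sketch evaluates its right-hand side over a tiny, hypothetical model. It assumes that each existential quantifier scopes over the remainder of the formula, and all individuals (`e1`, `pc1`, and so on) are invented for illustration.

```python
# HYPOTHETICAL toy model: unary predicates as sets, binary relations as sets of pairs.
EFFUSION = {"e1", "e2"}
PLEURAL_CAVITY = {"pc1"}
DISEASE = {"d1"}
LOCATED_IN = {("e1", "pc1"), ("e2", "knee1")}
CAUSED_BY = {("e1", "d1"), ("e2", "d1")}
DOMAIN = {"e1", "e2", "pc1", "knee1", "d1"}


def is_pleural_effusion(x):
    """Right-hand side of the definition: x is an effusion located in some
    pleural cavity y and caused by some disease z."""
    return (
        x in EFFUSION
        and any((x, y) in LOCATED_IN for y in PLEURAL_CAVITY)
        and any((x, z) in CAUSED_BY for z in DISEASE)
    )


# Individuals in the toy domain that satisfy the definition.
PLEURAL_EFFUSIONS = {x for x in DOMAIN if is_pleural_effusion(x)}
```

In this model only `e1` qualifies: `e2` is an effusion caused by a disease, but it is not located in a pleural cavity.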
Logical Model for Clinical Data (2)
• Conceptual graph notation
  – Ability to represent complete first order, modal, or higher-order logics in a human and computer readable format.
• [EFFUSION x] -
    (LOCATED-IN) → [PLEURAL-CAVITY]
    (CAUSED-BY) → [DISEASE]
Ontologies
• Explicit specification of a conceptualization
  – Abstract, simplified view of the world that we wish to represent for some purpose
  – Basis of formally represented knowledge
    • Objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
• Modern knowledge organization frameworks
  – RDF/S
  – OKBC
  – KIF
  – OWL
  – OML/CKML
Ontology Design Criteria
• Clarity
• Coherence
• Extendibility
• Minimal encoding bias
• Ontological commitment
  – Ensuring that observable actions of a knowledge-based system are consistent with the definition of the ontology
Semantic Web Initiative
• Goal: support “deep” reasoning about the contents of distributed knowledge sources
• Major components:
  – URI: Uniform resource identifier
  – RDF/RDF Schema: XML-based description of knowledge sources (data models comprised of objects)
  – OWL: Web Ontology Language (adds semantics/meta-data to RDF, as well as ability to define logical assertions)
  – SPARQL: Semantic web query language
  – RIF: Rule interchange format
  – N-stores: Repositories for semantic web components (e.g., RDF/OWL)
Conceptual Knowledge Mark-up Language (CKML)
• OML: Representation language for ontological and schematic structure which allows for the assertion of constraints
• Simple OML:
  – Representation language for functions, reification, cardinality constraints, inverse relations, and collections
  – Closely related to the resource description framework with schemas (RDF/S) and XML-based Ontology Exchange Language (XOL)
Ontologies At Varying Levels of Resolution
[Figure: columns labeled Measure (meters), Biology, and Technology. Scale: 10^0, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7, 10^-8, 10^-9. Levels: Organism, organs, Tissue, Cells, Organelles, Virus, DNA, bases. Resources: SNOMED (diseases), NCBI Taxonomy, SNOMED (organs), Mammalian Phenotype, SNOMED (morphology), ATCC (cell lines), Cell Ontology (cell types), Gene Ontology (subcellular), Gene Nomenclature, Quaternary code]
SNOMED
• Systematized Nomenclature of Medicine and Veterinary Medicine
• A semantic network between granular phenotypes and diseases
• > 500,000 clinical medicine concepts
• Licenses
  – Free for perpetual use in USA for any field of use
  – Included in free international UMLS license for research
SNOMED Information Model: Compositional, Multiaxial, Multiple Hierarchies
Axes: T M F C D P G L
H. pylori associated haemorrhagic gastric ulcer = D5-32220 (Gastric Ulcer with haemorrhage) + G-C002 (associated with) + L-13551 (H. pylori)
Terminologies
• Represent elemental concepts and relationships among them.
• Knowledge representation closely related to that of ontologies
• Terminological and relational knowledge can be partitioned into a terminology construct and a semantic network
• Complexities in designing terminology representation models:
  – Representation of surface-level and conceptual entities
  – Poly-hierarchies
  – Inheritance
  – Maintenance and growth
Desiderata (Cimino, 1998)
• Content/domain coverage
• Concept-orientation
• Concept permanence
• Non-semantic concept identifiers
• Polyhierarchy
• Formal definitions
• Reject “not elsewhere classified”
• Multiple granularities
• Multiple, consistent views
• Representation of context
• Graceful evolution
• Recognize redundancy
Example Controlled Terminology: LOINC
• LOINC = Logical Observation Identifiers, Names and Codes
• 32,000 conceptual entities
  – 20,000 laboratory-associated
• LOINC laboratory codes are composed of six attributes:
  1. Component or analyte
  2. Property
  3. Time
  4. System or specimen or sample
  5. Scale of precision
  6. Method
Example LOINC Laboratory Code
LOINC Code            2159-2
LOINC Name            CREATININE:MCNC:PT:AMN:QN:
Component / Analyte   Creatinine
Property              Mass Concentration
Time                  Point In Time
System/Specimen       Amniotic Fluid
Scale                 Quantitative
Method                <NULL>
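The six-part structure of the name can be illustrated with a short Python sketch that splits the colon-delimited example `CREATININE:MCNC:PT:AMN:QN:` into its attributes. The `LoincName` type and `parse_loinc_name` function are hypothetical helpers, not part of any LOINC tooling.

```python
from typing import NamedTuple, Optional


class LoincName(NamedTuple):
    """Hypothetical container for the six attributes of a LOINC laboratory name."""
    component: str
    prop: str
    time: str
    system: str
    scale: str
    method: Optional[str]


def parse_loinc_name(name: str) -> LoincName:
    """Split a colon-delimited six-part name; an empty method becomes None."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated parts, got {len(parts)}")
    component, prop, time, system, scale, method = parts
    return LoincName(component, prop, time, system, scale, method or None)


EXAMPLE = parse_loinc_name("CREATININE:MCNC:PT:AMN:QN:")
```

The trailing colon in the example leaves the sixth part empty, which corresponds to the `<NULL>` method in the table.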
Example Controlled Terminology: MED
• MED = Medical Entities Dictionary (CUMC)
• Concept-oriented
• Directed acyclic graph
• Frame-based
• Incorporates UMLS, ICD9-CM, and LOINC codes
• Currently contains over 69,000 conceptual entities
• Hierarchical and semantic relationships
  – Hierarchical = “has ancestor/descendant”
  – Semantic = “part of”, “has part”, “measured by”…
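A directed acyclic graph with multiple parents per concept, plus separate semantic links, could be sketched in Python as follows. The concept names and links are hypothetical and are not drawn from the MED itself.

```python
# HYPOTHETICAL fragment of a concept-oriented DAG: child -> set of parents.
PARENTS = {
    "Serum Creatinine Test": {"Creatinine Test", "Serum Chemistry Test"},
    "Creatinine Test": {"Laboratory Test"},
    "Serum Chemistry Test": {"Laboratory Test"},
    "Laboratory Test": {"Medical Entity"},
    "Medical Entity": set(),
}

# HYPOTHETICAL semantic (non-hierarchical) relationships as triples.
SEMANTIC_LINKS = {("Creatinine", "measured by", "Serum Creatinine Test")}


def ancestors(concept):
    """All concepts reachable by following parent links (multiple parents allowed)."""
    found = set()
    stack = list(PARENTS.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, ()))
    return found


def related(concept, relation):
    """Targets linked to `concept` by the given semantic relation."""
    return {t for s, r, t in SEMANTIC_LINKS if s == concept and r == relation}
```

Keeping hierarchical links and semantic links in separate structures mirrors the distinction drawn above: `ancestors` walks only the hierarchy, while `related` consults only the semantic triples.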
Mapping Between Terminologies: UMLS
• UMLS = Unified Medical Language System (NLM)
• Composed of:
  – Metathesaurus
  – Semantic Network
  – Lexicon
• Contains approximately 5 million codes representing 1 million concepts derived from 100 source terminologies
Evaluating Knowledge Collections
• Verification is the process of determining if the resulting knowledge system meets the specified requirements of the end-users
• Validation is the process of determining if the resulting knowledge system meets the actual requirements of the end-users
Key Metrics of Interest
• Multi-expert agreement
• Degree of relationships between concepts
  – Distance
• Consistency of resulting knowledge structures
  – Structural
  – Axiomatic
Multi-expert Agreement
• The measurement of multi-expert agreement can be divided into four major sub-types:
  – Consensus (same terminology, same distinctions): Experts use the same terms and distinctions
  – Correspondence (different terminology, same distinctions): Experts use different terminology but the same distinctions
  – Conflict (same terminology, different distinctions): Experts use the same terminology for different distinctions
  – Contrast (different terminology, different distinctions): Experts use different terms and distinctions
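The four sub-types form a two-by-two grid over terminology and distinctions, which a small Python sketch can make explicit. The function name and boolean parameters are illustrative; how one decides that two experts share terminology or distinctions is left open here.

```python
def agreement_type(same_terminology: bool, same_distinctions: bool) -> str:
    """Map two yes/no judgements about a pair of experts onto the
    four agreement sub-types."""
    if same_distinctions:
        return "consensus" if same_terminology else "correspondence"
    return "conflict" if same_terminology else "contrast"


# Enumerate the full grid for reference.
GRID = {
    (terms, distinctions): agreement_type(terms, distinctions)
    for terms in (True, False)
    for distinctions in (True, False)
}
```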
Graph-Theoretic Metrics
• Upon translation of matrix data into a graph construct, a number of metrics may be applied, including:
  – Distance
  – Adjacency
  – Transitive closure
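The three metrics named here can be sketched in plain Python over a small directed graph. The graph and its node names are hypothetical, and the functions are one straightforward way to compute each metric rather than a prescribed method.

```python
from collections import deque

# HYPOTHETICAL directed concept graph as an adjacency mapping.
GRAPH = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}


def adjacent(graph, source, target):
    """Adjacency: is there a direct edge from source to target?"""
    return target in graph.get(source, set())


def distance(graph, source, target):
    """Distance: number of edges on a shortest path, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbour in graph.get(node, set()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def transitive_closure(graph):
    """Transitive closure: every (u, v) pair where v is reachable from u
    by a path of at least one edge."""
    closure = set()
    for start in graph:
        stack = list(graph[start])
        reached = set()
        while stack:
            node = stack.pop()
            if node not in reached:
                reached.add(node)
                stack.extend(graph.get(node, set()))
        closure |= {(start, node) for node in reached}
    return closure
```

For the sample graph, `distance(GRAPH, "A", "D")` is 2 and the closure contains five pairs, including the indirect pair `("A", "D")`.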
Statistical Metrics
• Comparison between expert generated reference standard and system under consideration
• One key issue in generating an expert reference standard is aggregating multiple expert responses
  – Voting
  – Statistics (average, median, min/max)
  – Heuristics
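Voting and simple statistics for aggregating expert responses could be sketched as below. The ratings are hypothetical, and returning `None` on a tied vote is just one possible tie-handling choice; heuristics are not shown.

```python
from collections import Counter
from statistics import mean, median

# HYPOTHETICAL responses from three experts for two candidate relationships.
CATEGORICAL = {
    "rel_1": ["valid", "valid", "invalid"],
    "rel_2": ["invalid", "invalid", "valid"],
}
NUMERIC = {"rel_1": [4, 5, 2], "rel_2": [1, 2, 3]}


def majority_vote(labels):
    """Voting: most frequent label in a non-empty list, or None on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def summarize(scores):
    """Statistics: average, median, and min/max of numeric ratings."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Reference standard built by majority vote over the categorical responses.
REFERENCE = {item: majority_vote(labels) for item, labels in CATEGORICAL.items()}
```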
Statistical Metrics
• System performance can be compared to the reference standard using some combination of the following measures:
  – Simple accuracy/agreement
  – Sensitivity and Specificity
  – Area under the ROC-curve (C-statistic)
  – Recall and Precision
  – Specific agreement
  – Chance corrected agreement
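Most of these measures follow directly from a two-by-two confusion table, as the Python sketch below shows for hypothetical binary labels. Cohen's kappa is used here as one common form of chance-corrected agreement, and positive specific agreement as one form of specific agreement; both choices are assumptions. The area under the ROC curve is omitted because it needs ranked scores rather than binary labels, and zero denominators are not handled.

```python
def confusion_counts(reference, predicted):
    """Count true/false positives and negatives for paired boolean labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    return tp, tn, fp, fn


def measures(tp, tn, fp, fn):
    """Derive agreement measures from the four confusion counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (for Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "positive_specific_agreement": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }


# HYPOTHETICAL reference standard and system output for ten items.
REFERENCE_LABELS = [True, True, True, True, False, False, False, False, False, False]
SYSTEM_LABELS = [True, True, True, False, True, False, False, False, False, False]
RESULTS = measures(*confusion_counts(REFERENCE_LABELS, SYSTEM_LABELS))
```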
Challenges
• Philosophical approach
• Resources
  – Curation (magnitude of domain concepts ≥ 10^6)
• Scalability of KE methods
• Content/domain coverage of available knowledge sources
  – Available expertise
• Tools to apply knowledge collections to practical problems
Philosophical Approach
• Realism: There exists a singular, factual truth that can be used to characterize all phenomena, which can be ascertained given sufficient time and effort
• Instrumentalism: It is not necessarily possible to ascertain or measure all the characteristics of a phenomenon such that a singular truth can be defined; therefore, all knowledge is an approximation (“snapshot”) of the truth
Induced Conceptual Knowledge Constructs (CKCs)
Relationship Pattern: Gene → Diagnosis
Induced Relationship: [ZAP70 gene]-gene_plays_role_in_process-[Ligand Binding]-biological_process_involves_gene_product-[LTB4R protein, human]-gene_product_expressed_in_tissue-[Lymphoid Tissue]-is_normal_tissue_origin_of_disease-[Chronic lymphocytic leukaemia refractory]
Relationship Pattern: Chromosomal Abnormality → Clinical Laboratory Value/Finding
Induced Relationship: [del(17p13)]-may_be_cytogenetic_abnormality_of_disease-[Chronic Lymphocytic Leukemia with Unmutated Immunoglobulin Heavy Chain Variable-Region Gene]-disease_has_molecular_abnormality-[Clonal Immunoglobulin Heavy Chain Gene Rearrangement]-may_be_molecular_abnormality_of_disease-[Pyothorax-Associated Lymphoma]-disease_may_have_finding-[Lactic acid dehydrogenase raised]