Information Artifact Ontology: General Background
Barry Smith
1
Slides
http://ncorwiki.buffalo.edu/index.php/STIDS_2013
2
Barry Smith – who am I?
Director: National Center for Ontological Research (Buffalo)
Founder: Ontology for the Intelligence Community (OIC, now STIDS) conference series
Ontology work for
• NextGen (Next Generation) Air Transportation System
• National Nuclear Security Administration, DoE
• Joint-Forces Command Joint Warfighting Center
• Army Net-Centric Data Strategy Center of Excellence
• Army Intelligence and Information Warfare Directorate (I2WD)
and for many national and international biomedical research and healthcare agencies
3
I2WD Ontology Team
Ron Rudnicki, CUBRC, University at Buffalo
Dr. Tatiana Malyuta, NY City College of Technology of CUNY, Data Tactics Corp.
David Salmen, Data Tactics Corp.
LCOL Dr. William Mandrick, Data Tactics Corp.
4
In the olden days
people measured lengths using inches, ulnas, perches, king’s feet, Swiss feet, leagues of Paris, etc., etc.
5
On June 22, 1799, in Paris, everything changed
6
International System of Units (SI)
7
Making data (re-)usable through standard terminologies
• Standards provide
– common structure and terminology
– single data source for review (less redundant data)
• Standards allow
– use of common tools and techniques
– common training
– single validation of data
8
One successful part of the solution to this problem = Ontologies
controlled vocabularies (nomenclatures) plus definitions of terms in a logical language
Standardized (logically defined) terms in an ontology are the equivalent of standardized units in the SI
9
Ontologies
• are computer-tractable representations of types in specific areas of reality
• are more and less general (upper and lower ontologies)
– upper = organizing ontologies
– lower = domain ontology modules
10
Linked Open Data are not enough
11
Links are inconsistently defined; ontologies are full of redundancies
12
Towards coordination of modular non-redundant ontologies
13
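The Open Biomedical Ontologies (OBO) Foundry
[Table: rows = GRANULARITY; columns = RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)
14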
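Population-level ontologies
[Table: rows = GRANULARITY; columns = RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)]
COMPLEX OF ORGANISMS
– Independent continuant: Family, Community, Deme, Population (PCO)
– Dependent continuant: Population Phenotype
– Occurrent: Population Process
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)
15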
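Environment Ontology (EnvO)
[Table: rows = GRANULARITY; columns = RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT); an additional "Environments" band is covered by the Environment Ontology (EnvO)]
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)
16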
OBO Foundry approach extended into other domains
17
NIF Standard: Neuroscience Information Framework
IDO Consortium: Infectious Disease Ontology
cROP: Common Reference Ontologies for Plants
MilPortal.org: Military Ontology
AIRS Ontology Suite: Intelligence Ontology Suite
18
19
20
slide from Margaret Storey 21
Horizontal Integration of Big Intelligence Data
The Role of Ontology in the Era of Big Data
T. Malyuta, Ph. D New York City College of Technology, NY, NY
B. Smith, Ph. D University at Buffalo, Buffalo, NY
R. Rudnicki CUBRC, Buffalo, NY
23
http://ncorwiki.buffalo.edu/index.php/Main_Page#Documents
Big Data Problem
• Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”
• Gartner defines Big Data with three ‘V’s:
– Volume
– Velocity (of production and analysis)
– Variety
– Recently a fourth ‘V’, Veracity, was added
• This means that Big Data are beyond our control (as opposed to those complex and big systems with diverse and changing data where the complexity is known)
24
Big Data Solution – Agility
• Dimensions of agility
– Storage paradigms that accommodate massive volumes of heterogeneous data
– Data processing paradigms that can deal with the massive volumes of heterogeneous data coming onstream
– Dynamic data stores that can easily accommodate diverse and a priori unknown data types and semantics
– Methods and tools that leverage dynamic and diverse content
25
Agile Data Management
• New highly distributed computing (MapReduce) and data processing (Bigtable) paradigms and technologies based on them (hadoop.apache.org/, hbase.apache.org/) help in solving data management problems:
– Store and process Volume
– Keep up with Velocity
– Represent Variety
• These technologies are not meant (and never were meant) to provide data interpretation
– In data systems we have been dealing with data the meaning of which we knew (usually via data applications)
– These technologies do not help in solving the problem of data integration and interoperability of systems
26
Agile Utilization
• Today, the main problem of Big Data is how to use it
– Utilization of ‘Variety’ – diverse and a priori unknown types and semantics
– Ability of Big Data systems to interoperate
– Ability to integrate Big Data
– The last two problems are inherently difficult and could not be properly addressed by the data technology itself
• Traditional data utilization and integration approaches fail
• Relying on legacy data models and mappings (linked open data) fails – creates forking and mapping degradation
• Agile utilization and integration paradigms are needed
27
The Problem of Horizontal Integration of Big Intelligence Data
• HI =Def. the ability to exploit multiple data sources as if they are one
• Recognized issues for HI with existing approaches
– Data silos
– Lexicon/semantics silos
• Requirement for HI of Big Intelligence Data
– Agile Semantic Interoperability
A strategy for HI must be agile in the sense that it can be quickly extended to new zones of emerging data according to need
Ontology allows an incremental approach – a big bang from the very first buck (as we showed on the project described below)
Ontology can provide the needed agility
28
Agile Semantic Interoperability
• A good solution has to be
– Able to grow incrementally
– Able to be developed in a distributed manner
– Without losing consistency
– Independent of particular implementations, and data producers and consumers
– Applicable to data in an agile manner
• We call our solution: ‘semantic enhancement’ (SE) of data
29
Explication vs. Annotation
SE Types
• Explication of general terms used in source intelligence artifacts and in data models, terminologies and doctrinal publications which provide typologies of intelligence-related IAs to semantically enhance data in a way that enables computational integration and reasoning
• Annotation of the instance-level information captured by such IAs to aid retrieval of information about specific persons, groups, events, documents, images, and so forth
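The two SE types can be told apart by what they attach to: explication links a general term in a data model to an ontology term, while annotation links an individual record to the specific entity it is about. The following is a minimal Python sketch of that distinction only. The source names, table and column names, term identifiers (`ex:Skill`, `ex:Person`) and helper functions are all hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explication:
    """Type-level link: a general term in a data model -> an ontology term."""
    source: str
    table: str
    column: str
    ontology_term: str


@dataclass(frozen=True)
class Annotation:
    """Instance-level link: one record -> the specific entity it is about."""
    source: str
    record_id: str
    ontology_term: str  # type of the entity
    about: str          # identifier of the specific person, event, document, ...


# Hypothetical sources, columns and terms, for illustration only.
EXPLICATIONS = [
    Explication("Db1", "PersonSkill", "Name", "ex:Skill"),
    Explication("Db2", "Staff", "Expertise", "ex:Skill"),
]
ANNOTATIONS = [
    Annotation("Db1", "row-111", "ex:Person", "person:111"),
    Annotation("Reports", "doc-7", "ex:Person", "person:111"),
]


def columns_for(term):
    """All source columns explicated by the given ontology term."""
    return sorted(
        (e.source, e.table, e.column)
        for e in EXPLICATIONS
        if e.ontology_term == term
    )


def records_about(entity):
    """All records, in any source, annotated as being about the given entity."""
    return sorted(
        (a.source, a.record_id) for a in ANNOTATIONS if a.about == entity
    )
```

In this sketch `columns_for("ex:Skill")` supports integration and reasoning at the level of types, and `records_about("person:111")` supports retrieval of information about one specific person across sources.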
SE
• SE is realized with the help of ontologies that are used to explicate data models and annotate data instances
– Vocabulary of ontologies used for explications and annotations provides agile horizontal integration
– Ontologies, by virtue of their nature and organization, provide semantic enhancement of data
PersonID   Name   Description
111        Java   Programming
222        SQL    Database
[Figure: Skill hierarchy: Skill > ComputerSkill > ProgrammingSkill > SQL, Java, C++]
[Figure: Education hierarchy: Education > Technical Education]
32
The Meaning of ‘Enhancement’
• Semantic enhancement/enrichment of data = arm’s length approach (no change to data)
– through simple explication we associate an entire knowledge system with a database field
– enables analytics to process data, e.g. about computer skills, “vertically” along the Skill hierarchy, as well as “horizontally” via relations between Skill and Education
– and further… while data in the database does not change, its analysis can be richer and richer as our understanding of the reality changes
• For this richness to be leveraged by different communities, persons, and applications it needs to have the properties mentioned above and be constructed in accordance with the principles of the SE
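The arm's-length idea can be sketched in a few lines of Python: the rows stay exactly as they are, and the hierarchy is consulted only at query time. This is an illustrative sketch under assumptions. The placement of SQL, Java and C++ under ProgrammingSkill is one reading of the slide figure, and the `ACQUIRED_THROUGH` relation between Skill and Education is a hypothetical name, since the slide does not name the relation.

```python
# Source rows, left untouched by semantic enhancement.
ROWS = [
    {"PersonID": 111, "Name": "Java", "Description": "Programming"},
    {"PersonID": 222, "Name": "SQL", "Description": "Database"},
]

# Assumed is_a hierarchy, following the Skill and Education figure.
IS_A = {
    "Java": "ProgrammingSkill",
    "SQL": "ProgrammingSkill",
    "C++": "ProgrammingSkill",
    "ProgrammingSkill": "ComputerSkill",
    "ComputerSkill": "Skill",
    "Technical Education": "Education",
}

# Hypothetical "horizontal" relation from a skill type to an education type.
ACQUIRED_THROUGH = {"ComputerSkill": "Technical Education"}


def ancestors(term):
    """Walk the is_a chain upward and return every more general term."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def people_with(skill_type):
    """Vertical query: persons whose skill is, or falls under, skill_type."""
    return [
        row["PersonID"]
        for row in ROWS
        if skill_type == row["Name"] or skill_type in ancestors(row["Name"])
    ]


def education_for(person_id):
    """Horizontal query: education types related to the person's skills."""
    found = set()
    for row in ROWS:
        if row["PersonID"] != person_id:
            continue
        for term in [row["Name"], *ancestors(row["Name"])]:
            if term in ACQUIRED_THROUGH:
                found.add(ACQUIRED_THROUGH[term])
    return sorted(found)
```

Extending `IS_A` or `ACQUIRED_THROUGH` later makes the same unchanged rows answer richer questions, which is the sense in which the analysis grows while the data stays fixed.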
33
SE Principles
– Create a Shared Semantic Resource (SSR) of ontologies to be used for explication and annotation
– Establish an agile strategy for building ontologies within this SSR, and apply and extend these ontologies to explicate and annotate new source data as they come onstream
– Problem: Given the immense and growing variety of data sources, the development methodology must be applied by multiple different groups
– How to manage collaboration?
34
Achieving the Goal
• Methodology of incremental distributed ontology development
• A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)
• A shared governance and change management process
• A simple, repeatable process for ontology development
• An ontology registry
• A process of intelligence data capture through explication of source data models
35
Main Methodological Points
• Ontological realism
– Based on Doctrine / Science
– Involves SMEs in label selection and definition
– Thoroughly tested in many projects
• Arms-length process, with minimal disturbance to existing data and data semantics
• Reference ontologies – capture generic content and are designed for aggressive reuse in multiple different types of context
– Single reference ontology for each domain of interest
• Application ontologies – are tied to specific local applications
– An application ontology is created by combining local content with generic content taken from relevant reference ontologies
– Still interoperable because based on common set of reference ontologies
* Barry Smith and Werner Ceusters, “Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies”, Applied Ontology, 5 (2010), 139–188.
36
Arms-length Process
• Focusing on the terms (labels, acronyms, codes) used in our source data
• Where multiple distinct terms {t1, …, tn} are used in separate data sources with one and the same meaning, they are associated with a single preferred label drawn from a standard set of such labels
• All the separate data items associated with the {t1, …, tn} are thereby linked together through the corresponding preferred labels
• Preferred labels form the basis for the ontologies we build
[Figure: heterogeneous contents (ABC, KLM, XYZ) linked through SE ontology labels]
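The linking step described above can be sketched as a lookup from source terms to preferred labels, followed by grouping. This is a minimal illustration, not the project's tooling. The mapping of ABC, KLM and XYZ to the label "vehicle", the extra term "TRK", the source names and the item identifiers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping: distinct source terms with the same meaning
# share one preferred label from the standard set.
PREFERRED_LABEL = {
    "ABC": "vehicle",
    "KLM": "vehicle",
    "XYZ": "vehicle",
    "TRK": "tractor",
}

# Hypothetical data items as (source, item id, term used in that source).
DATA_ITEMS = [
    ("source1", "item-1", "ABC"),
    ("source2", "item-2", "KLM"),
    ("source3", "item-3", "XYZ"),
    ("source2", "item-4", "TRK"),
    ("source3", "item-5", "???"),
]


def link_by_preferred_label(items, mapping):
    """Group data items under preferred labels without modifying the items.

    Returns (linked, unmapped): linked maps each preferred label to the
    items that use any of its source terms; unmapped lists items whose
    term has no preferred label yet.
    """
    linked = defaultdict(list)
    unmapped = []
    for source, item_id, term in items:
        label = mapping.get(term)
        if label is None:
            unmapped.append((source, item_id, term))
        else:
            linked[label].append((source, item_id))
    return dict(linked), unmapped
```

Calling `link_by_preferred_label(DATA_ITEMS, PREFERRED_LABEL)` groups the three differently labelled items under "vehicle" and reports the item with the unknown term separately, so that unmapped terms can be handled incrementally.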
37
Reference and Application Ontologies
Reference Ontology
vehicle =def. an object used for transporting people or goods
tractor =def. a vehicle that is used for towing
crane =def. a vehicle that is used for lifting and moving heavy objects
vehicle platform =def. means of providing mobility to a vehicle
wheeled platform =def. a vehicle platform that provides mobility through the use of wheels
tracked platform =def. a vehicle platform that provides mobility through the use of continuous tracks
Application Definitions
artillery vehicle =def. vehicle designed for the transport of one or more artillery weapons
wheeled tractor =def. a tractor that has a wheeled platform
tracked tractor =def. a tractor that has a tracked platform
artillery tractor =def. an artillery vehicle that is a tractor
wheeled artillery tractor =def. an artillery tractor that has a wheeled platform
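The definitions above share a pattern: each application term is built from reference terms plus a differentiating condition. A rough Python sketch of that composition follows. It is a toy illustration under assumptions: the `Thing` feature flags are a hypothetical simplification of the definitions, crane and the platform terms are omitted for brevity, and real ontologies would express this in a logical language rather than in predicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thing:
    """A toy instance described by the features the definitions rely on."""
    is_vehicle: bool = False
    used_for_towing: bool = False
    platform: Optional[str] = None  # "wheeled" or "tracked"
    transports_artillery: bool = False


# Reference terms: generic content, reusable in many contexts.
REFERENCE = {
    "vehicle": lambda t: t.is_vehicle,
    "tractor": lambda t: t.is_vehicle and t.used_for_towing,
}

# Application terms: each is composed from reference (or other
# application) terms plus one extra condition.
APPLICATION = {
    "artillery vehicle":
        lambda t: REFERENCE["vehicle"](t) and t.transports_artillery,
    "wheeled tractor":
        lambda t: REFERENCE["tractor"](t) and t.platform == "wheeled",
    "tracked tractor":
        lambda t: REFERENCE["tractor"](t) and t.platform == "tracked",
    "artillery tractor":
        lambda t: APPLICATION["artillery vehicle"](t) and REFERENCE["tractor"](t),
    "wheeled artillery tractor":
        lambda t: APPLICATION["artillery tractor"](t) and t.platform == "wheeled",
}


def classify(thing):
    """Return, in alphabetical order, every term whose definition holds."""
    terms = {**REFERENCE, **APPLICATION}
    return sorted(name for name, holds in terms.items() if holds(thing))
```

Because every application term bottoms out in the same reference terms, two applications that define different local terms can still agree on what counts as a vehicle or a tractor.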
38
Illustration of Ontology Types (Toy Example)
Vehicle
Tractor
Wheeled Tractor
Artillery Tractor
Wheeled Artillery Tractor
Artillery Vehicle
Black – reference ontologies
Red – application ontologies
39
Role of Reference Ontologies
• Normalized
– Maintains a set of consistent ontologies
– Eliminates redundancy
• Modular
– A set of plug-and-play ontology modules
– Enables distributed consistent development
• Surveyable
40
SE Architecture
• The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)
• The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of types which arise in successively narrower domains until we reach the Lowest Level Ontologies (LLOs)
• The LLOs are maximally specific representations of the entities in a particular one-dimensional domain
41
Architecture Illustration
42
Challenges to HI
• Too many lexicons
• The scope of the domain: signal, sensor, image, … intelligence about … the whole world
• Difficult to conduct governance and management of ontology development to ensure consistent evolution
• Lack of expertise
• Complexity of the ontology development and application process
43
Preventing Failure
• The method we use offers solutions to some of the common reasons for failure
• Lack of Consensus
– Realism offers an objective standard for settling disputes over terminology. Ontology development becomes an empirical science instead of an exercise in the publication of dialects
– Governance helps to resolve conflicts and achieve consensus
• High Maintenance
– Arm’s length implementation places no additional overhead onto applications
• Parochialism
– Architecture and methodology prevent development of vocabularies that apply only to a single perspective
• Poor Quality
– Experience prevents common mistakes in vocabularies that cause downstream problems with search and analytics
44
Preventing Failure (cont.)
• Agile ontology development
– Methodology and architecture
– Growing SSR
• Agile ontology application
– Incremental
– Semi-automated where possible
– Even if not as fast as some want it to be
• It is still faster than creating a physical store, which will be just another silo and will still need to be integrated with the rest of the data
• Once a data collection is semantically enhanced, it is integrated with all data that have been and will be semantically enhanced, without any additional effort
45
What is Next…
– IAO-Intel: An Information Artifact Ontology for the Intelligence Community (BS)
– A Survey of DSGS-A Ontology Work and Explicating and Annotating Processes (R. Rudnicki)
– Email Ontology – illustration of the methodology of ontology design and of the IAO-Intel (D. Salmen and W. Mandrick)
46
References
• Barry Smith, Tatiana Malyuta, William S. Mandrick, Chia Fu, Kesny Parent, Milan Patel, “Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community”, STIDS Conference, 2012.
• Barry Smith, Tatiana Malyuta, David Salmen, William Mandrick, Kesny Parent, Shouvik Bardhan, Jamie Johnson, “Ontology for the Intelligence Analyst”, Crosstalk: The Journal of Defense Software Engineering, 2012.
• David Salmen, Tatiana Malyuta, Alan Hansen, Shaun Cronen, Barry Smith, “Integration of Intelligence Data through Semantic Enhancement”, STIDS Conference, 2011.
47