SC 32/WG 2 Tutorial
Metadata Registry Standards
July 16, 2007
Bruce Bargmeyer
University of California, Berkeley and Lawrence Berkeley National Laboratory
Tel: +1 [email protected]
JTC1 SC32 N1649
Topics
Standards development: OMG, ISO (TC 37 & JTC 1/SC 32), W3C, OASIS
Align, Coordinate, Integrate: Standards, Recommendations, Specifications
Semantics Challenges and Future Directions
Align, Coordinate, Integrate Standards
WG 2 doing OK internally: 24707, 11179 E3, 19763, 20944
Align, Coordinate, Integrate Standards
WG 1
WG 2
WG 3
WG 4
SC 32?
Clearwater meeting: a step forward
Align, Coordinate, Integrate Standards/Recommendations/Specifications
for Semantic Computing
[Figure: standards bodies and what each contributes to semantic computing]
ISO/IEC JTC 1/SC 32: ISO/IEC 11179 Metadata Registries (Metadata Registry, Terminology, Thesaurus, Taxonomy, Data Standards, Ontology, Structured Metadata), serving Users
ISO TC 37: Terminology (CONCEPT, Referent; relations: Refers To, Symbolizes, Stands For; symbols: “Rose”, “ClipArt Rose”)
W3C: Semantic Web (RDF Graph: Node, Edge, Node; Subject, Predicate, Object)
OMG: Object Management (MOF, ODM, CWM, IMM)
Standards Development
Semantics Management and Semantics Services: Semantic Computing
OMG
W3C
ISO/IEC JTC 1 SC 32
Align, Co-develop, Fast Track, PAS Submission …
ISO TC 37
Standards Development
Semantics Management and Semantics Services: Semantic Computing
OMG
W3C
ISO/IEC JTC 1 SC 32
Align, integrate, co-develop, Fast Track, PAS Submission …
Can we coordinate content?
W3C
A Success
OMG
ISO/IEC JTC 1 SC 32
Some text and figures are identical in the two standards.
ISO/IEC 24707 and OMG ODM
ISO/IEC 24707: Common Logic
OMG ODM: Ontology Definition Metamodel
Standards Development
Semantics Management and Semantics Services: Semantic Computing
ISO/IEC 11179 (Edition 3)
ISO/IEC JTC 1 SC 32
Ongoing effort
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Possible effort
11179 E3 proposals
OMG
Standards Development
Semantics Management and Semantics Services: Semantic Computing
ISO/IEC 11179 (Edition 3)
ISO/IEC JTC 1 SC 32
Hopeful?
OMG
IMM &
Other Possibilities
OASIS ebXML Registry
W3C Semantic Web Deployment WG
TC 37
Getting the information that we need, when we need it, without afflicting the excellent minds of humans with toil and drudgery
The litany:
  Too much or too little, irrelevant, not authoritative, out of date
  Unknown quality, not trustable, lacks provenance, no certainty measures
  Difficult to find, difficult to access, difficult to use
  Meaning not clear, relationship to other information not clear
  Data creators do not have the same understanding of the data as end users
  Recorded data loses much real world meaning, context, relationships
  Much of the meaning of data is buried in the processes used to manipulate the data (e.g., in computer code)
Need improvements in efficiency and effectiveness
Every time we solve it, we re-create it.
The Ageless Information Problem
cf: Data, Information, Knowledge, Wisdom
Improve traditional data management/data administration
  Use stronger semantics management and semantics services capabilities
Enable something new
  Semantic computing
New Semantics Capabilities Proposed for ISO/IEC 11179 MDR (Edition 3)
Processing that takes “meaning” into account
  Makes use of concept systems, e.g., thesauri and/or ontologies
  Moves some of the “meaning” of data from computer code to managed semantics
Processing that uses (e.g., reasons across) the relations between things, not just computing about the things themselves
Processing that helps to take people out of the computation, reducing the human toil
Semantics “grounding” for data, data discovery, extraction, mapping, translation, formatting, validation, inferencing, …
Delivering higher-level results that are more helpful for the user’s thought and action
Semantic Computing: The Nub of It
In The Epic Information Struggle We Have Made Heroic Progress
Files
Machine Processing
Computer Processing Cards Tape Disk
In structuring data and text:
Structured Data
  Columns on cards & tape (possibly comma separated)
  Hierarchical (DBMS)
  Network
  Table (relational DBMS)
  Hierarchy (XML)
  Graph (RDF)
Semi-structured text
  nroff, troff, LaTeX …
  SGML
  HTML
  XML
In The Epic Information Struggle We Have Made Heroic Progress
In documenting data and text (e.g., semantics management):
Data Standards
  Code sets
(Meta)Data Standards
  Data element definitions, valid values, value meanings
  Metadata registries (MDR, ISO/IEC 11179)
  Other standards as presented at this conference
Concept systems (or KOS)
  Glossaries
  Dictionaries
  Thesauri
  Taxonomies
  Ontologies
  Graphs
In The Epic Information Struggle We Have Made Heroic Progress
Improve data management through use of stronger semantics management
  Databases
  XML data
  Other “traditional” data
Enable new wave of semantic computing
  Take meaning of data into account
  Process across relations as well as properties
  May use reasoning engines, e.g., to draw inferences
Semantic Management
Proposals for 11179 Edition 3
Semantic Computing Application: Find and process non-explicit data
Analgesic Agent
Non-Narcotic Analgesic
Acetaminophen
Nonsteroidal Antiinflammatory Drug
Analgesic and Antipyretic
Datril
Anacin-3
Tylenol
For example…
Patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril,…);
However, want to study patients taking analgesic agents
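A minimal Python sketch of the idea: walk "is a" links in a concept system so that records holding only brand names can be selected by a broader concept. The concept names come from the slide, but the parent/child arrangement, the patient records, and the helper name `is_a` are assumptions made for illustration.

```python
# Hypothetical concept system: child -> parent ("is a") links.
# Concept names are from the slide; the arrangement is assumed.
PARENT = {
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or lies below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)  # None once we leave the hierarchy
    return False


# Hypothetical patient records that store only brand names.
patients = [
    {"patient": "p1", "drug": "Tylenol"},
    {"patient": "p2", "drug": "Datril"},
    {"patient": "p3", "drug": "Amoxicillin"},  # not in the concept system
]

# Select patients taking any analgesic agent, although no record says so.
analgesic_patients = [
    record["patient"]
    for record in patients
    if is_a(record["drug"], "Analgesic Agent")
]
print(analgesic_patients)
```

The selection logic lives in the managed concept system rather than in hard-coded lists of brand names, so adding a new brand only requires a new link.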
A Semantics Application: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
Semantics Application: Combine Data, Metadata & Concept Systems
Metadata:
Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter
Data:
ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78
Inference Search Query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”
Concept system:
Contamination
  Biological
  Radioactive
  Chemical
    lead
    cadmium
    mercury
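The following Python sketch shows, under stated assumptions, how the three layers could work together: the concept system expands "Chemical" to mercury, the metadata identifies which column measures mercury and in which units, and the data supplies the values. The data rows, definitions, units, and concept names are from the slide; the `concept` link in the metadata and the function names are assumptions. The "downstream from" and date-range parts of the query are not modeled here.

```python
# Data rows from the slide (station ID, date, temperature, mercury).
data = [
    {"ID": "A", "Date": "06-09-13", "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": "06-09-13", "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": "06-09-13", "Temp": 6.7, "Hg": 78},
]

# Metadata from the slide, plus an assumed "concept" link per column.
metadata = {
    "Temp": {
        "definition": "Temperature (to 0.1 degree C)",
        "units": "degrees Celsius",
        "concept": "temperature",
    },
    "Hg": {
        "definition": "Mercury contamination",
        "units": "micrograms per liter",
        "concept": "mercury",
    },
}

# Concept system from the slide: broader concept -> narrower concepts.
narrower = {
    "Contamination": ["Biological", "Radioactive", "Chemical"],
    "Chemical": ["lead", "cadmium", "mercury"],
}


def expand(concept):
    """Return the concept together with everything narrower than it."""
    found = {concept}
    for child in narrower.get(concept, []):
        found |= expand(child)
    return found


def stations_over(concept, threshold, units):
    """Find stations where any column linked to `concept` exceeds `threshold`."""
    wanted = expand(concept)
    columns = [
        name
        for name, meta in metadata.items()
        if meta["concept"] in wanted and meta["units"] == units
    ]
    return sorted(
        {row["ID"] for row in data for name in columns if row[name] > threshold}
    )


print(stations_over("Chemical", 10, "micrograms per liter"))
```

The query never mentions "Hg" or "mercury"; the link from "Chemical" to the right column is derived from the registered concept system and metadata.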
Challenge: Use data from systems that record the same facts with different terms
[Figure: registries and repositories with overlapping "Common Content" and differing coverage: OASIS/ebXML Registries, ISO 11179 Registries, Ontological Registries, CASE Tool Repositories, UDDI Registries, Software Component Registries, Database Catalogs, Dublin Core Registries. Labels shown for the shared item "Country Identifier": Data Element, XML Tag, Term Hierarchy, Attribute, Business Specification, Table Column, Business Object.]
Data Elements
ISO 3166      ISO 3166      ISO 3166        ISO 3166      ISO 3166
2-Alpha Code  3-Alpha Code  3-Numeric Code  English Name  French Name
DZ            DZA           012             Algeria       L'Algérie
BE            BEL           056             Belgium       Belgique
CN            CHN           156             China         Chine
DK            DNK           208             Denmark       Danemark
EG            EGY           818             Egypt         Egypte
FR            FRA           250             France        La France
. . .         . . .         . . .           . . .         . . .
ZW            ZWE           716             Zimbabwe      Zimbabwe
Data Element attributes:
  Name:
  Context:
  Definition:
  Unique ID: 4572
  Value Domain:
  Maintenance Org.
  Steward:
  Classification:
  Registration Authority:
  Others
Same Fact, Different Terms
Data Element Concept (Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe)
  Name: Country Identifiers
  Context:
  Definition:
  Unique ID: 5769
  Conceptual Domain:
  Maintenance Org.:
  Steward:
  Classification:
  Registration Authority:
  Others
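As a rough illustration of "same fact, different terms", the Python sketch below treats each ISO 3166 representation as an alternative term for the same country and converts between them. The rows are the ones shown on the slide; the names `COUNTRIES`, `FORMS`, `INDEX`, and `convert` are hypothetical, and a real registry would hold the full code set.

```python
# Rows from the slide: (2-alpha, 3-alpha, 3-numeric, English name, French name).
COUNTRIES = [
    ("DZ", "DZA", "012", "Algeria", "L'Algérie"),
    ("BE", "BEL", "056", "Belgium", "Belgique"),
    ("CN", "CHN", "156", "China", "Chine"),
    ("DK", "DNK", "208", "Denmark", "Danemark"),
    ("EG", "EGY", "818", "Egypt", "Egypte"),
    ("FR", "FRA", "250", "France", "La France"),
    ("ZW", "ZWE", "716", "Zimbabwe", "Zimbabwe"),
]

# Position of each representation form within a row.
FORMS = ("alpha2", "alpha3", "numeric", "english", "french")

# Index every representation back to its full row, so any term finds the country.
INDEX = {value: row for row in COUNTRIES for value in row}


def convert(value, to_form):
    """Translate any known representation of a country into another form."""
    row = INDEX[value]  # raises KeyError for an unregistered term
    return row[FORMS.index(to_form)]


print(convert("DZ", "numeric"))
print(convert("Chine", "alpha3"))
print(convert("716", "english"))
```

Two systems that record "DK" and "208" can be joined by normalizing both to a single form before comparison.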
Challenge: Draw information together from a broad range of studies, databases, reports, etc.
A semantics application: Information Extraction and Use
Segment
Classify
Associate
Normalize
Deduplicate
Discover patterns
Select models
Fit parameters
Inference
Report results
Actionable Information
Decision Support
Extraction Engine
11179-3 (E3)
XMDR
Metadata Registries are Useful
Registered semantics
  For “training” extraction engines
  The “Normalize” function can make use of standard code sets that have mapping between representation forms.
  The “Classify” function can interact with pre-established concept systems.
Provenance
  High precision for proper nouns, less precision (e.g., 70%) for other concepts -> impacts downstream processing
  Need to track precision
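One way to picture "need to track precision" is to attach provenance and an estimated precision to every extracted item so that downstream steps can decide what to trust. This Python sketch is only an assumption about how that might look: the 0.70 figure echoes the slide's example, while the 0.95 value for proper nouns, the class `ExtractedFact`, and the source label are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    text: str         # what the extraction engine found
    kind: str         # "proper_noun" or "other"
    source: str       # provenance: where the text came from
    precision: float  # estimated precision of this kind of extraction


# Assumed precision per kind: 0.70 follows the slide's example,
# 0.95 is a stand-in for "high precision".
ASSUMED_PRECISION = {"proper_noun": 0.95, "other": 0.70}


def tag(text, kind, source):
    """Attach provenance and estimated precision to an extracted item."""
    return ExtractedFact(text, kind, source, ASSUMED_PRECISION[kind])


def reliable(facts, minimum):
    """Keep only facts whose estimated precision meets the minimum."""
    return [fact for fact in facts if fact.precision >= minimum]


facts = [
    tag("Fletcher Creek", "proper_noun", "report-1"),
    tag("mercury contamination", "other", "report-1"),
]

for fact in reliable(facts, 0.9):
    print(fact.text, fact.source, fact.precision)
```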
Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
[Figure: data creation and use across organizations (EEA, USGS, DoD, EPA, others). Each organization's information system holds text and numeric data labeled with its own terms, and users draw on several systems. English terms shown: environ, agriculture, climate, human health, industry, tourism, soil, water, air. Spanish terms shown: ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aero.]
A common interpretation of what the data represents
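A small Python sketch of how a shared concept identifier could give data creators and users a common interpretation of differently labeled data. The term lists follow the slide (with the "salud humana" spelling corrected); pairing the English and Spanish terms by position and the concept IDs such as `C001` are assumptions.

```python
# Terms as listed on the slide for two information systems.
ENGLISH = ["environ", "agriculture", "climate", "human health", "industry",
           "tourism", "soil", "water", "air"]
SPANISH = ["ambiente", "agricultura", "tiempo", "salud humana", "industria",
           "turismo", "tierra", "agua", "aero"]

# Assumption: terms in the same position name the same concept.
# The concept identifiers are hypothetical registry IDs.
CONCEPT_OF = {}
for number, (english_term, spanish_term) in enumerate(zip(ENGLISH, SPANISH), start=1):
    concept_id = f"C{number:03d}"
    CONCEPT_OF[english_term] = concept_id
    CONCEPT_OF[spanish_term] = concept_id


def same_concept(term_a, term_b):
    """Return True if both terms are registered to the same concept."""
    concept = CONCEPT_OF.get(term_a)
    return concept is not None and concept == CONCEPT_OF.get(term_b)


print(same_concept("water", "agua"))
print(same_concept("soil", "agua"))
```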
Vocabulary Management is essential for use of semantic technologies
  Define concepts and relationships
  Harmonize terminology, resolve conflicts
  Collaborate with stakeholders
An approach
  Select a domain of interest
  Enter core concepts and relationships
  Engage community in vocabulary review
  Harmonize, validate and vet the vocabulary
  Enter metadata describing enterprise data
  Link concept system to metadata
Practical Vocabulary Management
For vocabulary repository
  Register, harmonize, validate, and vet definitions and relations
To register mappings between multiple vocabularies
To register mappings of concepts to data
To provide semantics services
To register and manage the provenance of data
11179-3 (E3) is part of the infrastructure for semantics and data management.
These capabilities are proposed for ISO/IEC 11179 Edition 3
Use eXtended MDR Capabilities
Upside
  Collaborative
    Supports interaction with community of interest
    Shared evolution and dissemination
    Enables Review Cycle
  Standards-based: don’t lock semantics into proprietary technology
  Foundation for strategic data centric applications
  Lays the foundation for Ontology-based Information Management
  Content is reusable for many purposes
Downside
  Managing semantics is HARD WORK, no matter how friendly the tools
  Needs integration with other components
11179 (E3) Use
Data management and metadata management must evolve to address more complex data structures (relational, object, hierarchies, graphs)
Query capabilities
  More than SQL, XQuery, SPARQL
Discovery mechanisms
  More than Google
Access, mining, extraction
We need stronger semantics management
Some Challenges
Metadata Registry Support for:
  Registering and mapping ontologies
  Ontology Evolution
  Registering Process Ontologies
Thank You
Acknowledgements
  Karlo Berket, LBNL
  Kevin Keck, LBNL
  John McCarthy, LBNL
  Harold Solbrig, Apelon
This material is based upon work supported by the National Science Foundation under Grant No. 0637122, USEPA and USDOD. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, USEPA or USDOD.
Bruce Bargmeyer
Lawrence Berkeley National Laboratory & Berkeley Water Center
University of California, Berkeley
Tel: +1 [email protected]