Knowledge-Based e-Science
Nigel ShadboltThe University of Southampton
The work of many people…
Harith Alani Steve Harris Nick Gibbins Yannis Kalfoglou Kieron O’Hara David Dupplaw Bo Hu Paul Lewis Srinandan Dashamapatra Duncan Macrae-Spencer Hugh Glaser
Les Carr David de Roure Wendy Hall Mike Brady David Hawkes Yorick Wilks Enrico Motta Carole Goble Simon Cox Andy Keane :
Advanced Knowledge Technologies IRC
www.aktors.org
Structure
A little history Shared Meaning and Content Services Collaboration Mapping Science Lessons Learnt
A Little History
Knowledge Engineering: Evolution
Knowledge engineering is not about transfer but about modelling aspects of human knowledge
The knowledge level principle: first concentrate on the conceptual structure of knowledge and leave the programming details for later
Knowledge has a stable internal structure that can be analysed by distinguishing specific knowledge types and roles
Knowledge Engineering: Principles
McBrien, A.M., Madden, J and Shadbolt, N.R. (1989). Artificial Intelligence Methods in Process Plant Layout. Proceedings of the 2nd International Conference on Industrial and Engineering Applications of AI and Expert Systems, pp364-373, ACM Press
Knowledge-Based System: PLS
Bull, H.T, Lorrimer-Roberts, M.J., Pulford, C.I., Shadbolt, N.R., Smith, W. and Sunderland, P. (1995) Knowledge Engineering in the Brewing Industry. Ferment vol.8(1) pp.49-54.
Knowledge-Based System: Brewing
Many systems deployed and in routine use Often embedded
Understood aspects of
Representing knowledge and dealing with uncertaintyHow to reason with knowledgeHow to acquire the knowledgeHow to run in real real time
Knowledge-Based Systems: Achievements
How to disseminate capability?
How to reuse hard won knowledge?
How to maintain the knowledge-bases?
How to support communities of users and practitioners?
Knowledge-Based Systems: Problems
The Semantic Web and e-Science…
Fundamentally changed the way we think about KA, knowledge engineering and knowledge management
Suggested a different way in which knowledge intensive components could be deployed
Also brought together a community unencumbered by close attention either to AI or Knowledge Engineering
New funding opportunities…
Making the Web Semantic…
Via meta content…
This is a type of object event and this is its title
This is the URL of the web page for the event
This is a type of object photograph and the photograph is of Tim Berners-Lee
Tim Berners-Lee is an invited speaker at the event
In a way that machines can deal with….
<owl:Class rdf:ID="Conference"><rdfs:subClassOf rdf:resource="#Meeting-Taking-Place"/><rdfs:subClassOf rdf:resource="#Publication-Type-Event"/>-<rdfs:subClassOf>-<owl:Restriction><owl:onProperty rdf:resource="#published-proceedings"/><owl:allValuesFrom rdf:resource="#Conference-Proceedings-Reference"/></owl:Restriction>
Can Annotate Anything
DatabasesScientific
structuresWorkflowPublicationsPeople
Web data set (XHTML)
The Semantic Web is about Sharing Meaning
Linkage of heterogeneous information web content databases meta-data
repository multimedia
Via ontologies to share meaning
Using Semantic Web languages
Oncogene(MYC): Found_In_Organism(Human). Gene_Has_Function(Transcriptional_Regulation). Gene_Has_Function(Gene_Transcription). In_Chromosomal_Location(8q24). Gene_Associated_With_Disease(Burkitts_Lymphoma).
NCI Cancer Ontology (OWL)
<meta> <classifications> <classification type="MYC” subtype="old_arx_id">bcr-2-1-059</classification> </classifications></meta>
BioMedCentral Metadata (XML)
Web data set (XHTML)
Vocabulary (RDFS)
Sharing Meaning on the Web: Ontologies
A shared conceptualisation of a domain
Provides the semantic backbone
Lightweight and is deployed using W3C recommended standards
A Philosophical Digression
Realist Stance
We engage with a reality directly Reality consists of pre existing objects with attributes Our engagement may be via reflection, perception or
language Philosophical exponents
Aristotle Leibnitz the early Wittgenstein :
Language and logic pictures the world Seen as a way of accounting for common
understanding Promised a language for science
There is no simple mapping into external objects and their attributes in the world
We construct objects and their attributes This construction may be via intention and perception, it
may be culturally and species specific Philosophical exponents
Husserl Heidegger Later Wittgenstein :
Language as games, complex procedures, contextualised functions that construct a view of the world
Constructivist Stance
Ontologies - Current Context
The large metaphysical questions remain What is the essence of being and being in the world
Our science and technology is moving questions that were originally only philosophical in character into practical contexts Akin to what happened with natural philosophy from the 17th
century – chemistry, physics and biology
As our science and technology evolves new philosophical possibilities emerge Particularly when we look at knowledge and semantic based
processing
Shared Meaning and Content:The Role of Ontologies
The AKT Ontology
Designed as a learning case for AKT
Adopted for our own Semantic Web experiments
Uses a number of Upper Ontology fragments
Reused in many contexts
data sources
?
Mediate and Aggregate: UK Research Councils
A Proposed Solution
data sources
gatherers Ontology knowledge repository
(triplestore)
applications
Raw CSV dataHeterogeneous tables
Processed RDF informationUniform format for files
Mediate and Aggregate: Ontologies
An Application Service
Relatively simple could yield real information integration and interoperability benefits
Reuse was real but again lightweight
Ontology winnowing needed
Content harvested and published from multiple Heterogeneous Sources
Higher Education directories 2001 RAE submissions UK EPSRC project database Info on personnel, projects
and publications harvested for 5 or 5* CS departments in the UK
Automatic NL mining: Armadillo
All UK administrative areas (from ISO3166-2)
All UK settlements listed in the UN LOCODE service
Currently constructing spaces for other domains
Mediate and Aggregate: CS AKTive
Classification: Ontologies in Proteomics
Work of Brass, Stephens and colleagues at Manchester
Classifying protein function
Large areas of biology where this is a knowledge-intensive human function
Mass of sequencing data demands assistive technologies
Auto classification using OWL ontologies and reasoners
From Amino Acids to Protein Domains
>1A5Y:_ PROTEIN TYROSINE PHOSPHATASE 1B
MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRIKLHQEDNDYINASLIKMEEAQRSYILTQGPLPNTCGHFWEMVWEQKSRGVVMLNRVMEKGSLKCAQYWPQKEEKEMIFEDTNLKLTLISEDIKSYYTVRQLELENLTTQETREILHFHYTTWPDFGVPESPASFLNFLFKVRESGSLSPEHGPVVVHXSAGIGRSGTFCLADTCLLLMDKRKDPSSVDIKKVLLDMRKFRMGLIATAEQLRFSYLAVIEGAKFIMGDSSVQDQWKELSHEDLEPPPEHIPPPPRPPKRILEPHNGKCREFFPN
Ontology Mediated Classification
Proteins carry out function in a cell
Broad functional classes are “Protein Families”
Families sub-divided to family classifications
Class membership determined by “protein features”, such as domains, etc.
Classification enables comparative genomics - what is already known in other organisms
The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology
Ontological Mediated Classification: Results
The ontology system refined classification- DUSC contains zinc finger domain Characterised and conserved – but not in classification- DUSA contains a disintegrin domainpreviously uncharacterised – evolutionarily conserved
A new kind of phosphatase? Extending to fungi much less studied
A Standards Digression
Standards are fundamental
HTML XML + Name Space + XML Schema
Topic Maps
SMIL
RDF(S)XOL
OWL
RDF
Unicode URI
What is wrong with XML?
Nothing but… Data/Information oriented
applications are : Less concerned about serial order. e.g., patient records vs
poems
More concerned with aggregation. e.g., merging information
about a patient from different sources vs merging musical scores
Viewpoints on RDF and XML…
The difference between the RDF proponents and the XML proponents is fairly simple. In the XML-centric world parties can utilize whatever internal formats and data sources they want but exchange XML documents that conform to an agreed upon format, in cases where the agreed upon format conflicts with internal formats then technologies like XSLT come to the rescue.
RDF vocabulary, allows free composition with other RDF language… assertions can be mixed with assertions about weather, physics, business processes, weblogs and syndication, genealogy, politics, and so on, without the need for prior coordination between the designers of languages for these domains.
The Semantic Web is not an all-or-nothing proposition; it is a rubric describing a set of distinct (though related) technologies - RDF, FOAF, OWL, RSS, XML - all of which are designed to improve machine-to-machine communication [...].
[...] the many smaller victories already taking place, impact of RDF, RSS, and Web services.
The RDF position is that it is too difficult to agree on interchange formats so instead of going down this route we should use KR schema to map between formats. [...]
Knowledge-Based Services
Diverse and heterogeneous content Clinical examination
Notes
Imaging X-ray, Ultrasound MRI
Microscopy Histopathology
Treatment Protocol Records Re-assessment
Medical Records Case sets Individual patient records
Published background Epidemiology Medical Abstracts
Collaborative Medical Decision Making
The Services and Framework
Image Analysis Services
Classification Services
Patient Data Retrieval Services
Image Registration
Natural Language Report Generation
Patient Records through web-service
Underlying Representation of Patients
<rdf:Description rdf:about='#g1p78_patient'> <rdf:type rdf:resource='#Patient'/> <NS2:has_date_of_birth>01.01.1923</NS2:has_date_of_birth> <NS2:involved_in_ta rdf:resource='#ta_soton_000130051992'/></rdf:Description>
<rdf:Description rdf:about='#ta_soton_000130051992'> <rdf:type rdf:resource='#Multi_Disciplinary_Meeting_TA'/> <NS2:involve_patient rdf:resource='#g1p78_patient'/> <NS2:consist_of_subproc rdf:resource='#oe_00103051992'/> <NS2:consist_of_subproc rdf:resource='#hp_00117051992'/> <NS2:consist_of_subproc rdf:resource='#ma_00127051992'/> <NS2:has_overall_impression rdf:resource='#assessment_b5_malignant'/> <NS2:has_overall_diagnosis>invasive carcinoma</NS2:has_overall_diagnosis></rdf:Description>
<rdf:Description rdf:about='#oe_00103051992'> <rdf:type rdf:resource='#Physical_Exam'/> <NS2:has_date>03.05.1992</NS2:has_date> <NS2:produce_result rdf:resource='#oereport_glp78_1'/> <NS2:carried_out_on rdf:resource='#g1p78_patient'/></rdf:Description>
<rdf:Description rdf:about='#oereport_glp78_1'> <NS2:type rdf:resource='#Lateral_OE_Report'/> <NS2:contains_roi rdf:resource='#oe_roi_00103051992'/> <NS2:has_lateral rdf:resource='#lateral_left'/></rdf:Description>
Shared meaning: People Institutions Norms
Are concepts and classes are end-products of epistemological and/or decision-making procedures
One needs to “recognise” instances of a particular class as such
Information indexed against an ontology and be treated declaratively (Tarskian view, OWL), but …
… they come into being procedurally against social and institutional norms,
Institutional Norms
Common false-positives in FNAC is misdiagnosis of apocrine cells as malignant condition (pleomorphic appearance signals malignancy; morphological characteristics trad. distinguishing classification criteria for pathologists)
For KR support, need to record not just the label relevant for diagnosis (“apocrine cells”) but also the means by which such a labelling was achieved
• NHS guidelines suggests for identification of apocrine cells (common false positive):
“Recognition of the dusty blue cytoplasm, with or without cytoplasmic granules with Giemsa stains or pink cytoplasm on Papanicolaou or haematoxylin and eosin stains coupled with a prominent central nucleolus is the key to identifying cells as apocrine.”
Formalised Procedures
For laboratory practice L(x, t) that specimen x is subjected to in context t (time, state variables for exptal/clinical conditions) a predicative attribute P(x) is identified with behavioural response B(x, t) leading to an implicit definition of P(x)
Procedures for Reproducibility
Specific criteria for identifying histopathological slides as instances of particular lesions – rule following props – make concept labelling reproducible
For Ductal Carcinoma in situ, Atypical ductal hyperplasia, procedural criteria reducesinter-expert variability
Criteria of Page et al (Cancer 1982; 49:751-758; Cancer 1985; 55:2698-2708), reported by Fechner in MJ Silverstein (1997). Ductal Carcinoma In Situ of the Breast
Norms and Rule-following
Concept use in medical practice requires the recognition of instances as instances of appropriate classes
Classes are assigned as proxies of groups of instances to respond in coherent ways to patterns of questioning
Class ascription needs to be reproducible Reproducibility is enhanced by rule-
following
http://www.geodise.org
Nacelle Design
Computational fluid
dynamics
(1)specify the geometry,(2)generate a mesh(3)decide which code to
use for the analysis,(4)decide the
optimisation schedule(5)execute the
optimisation run
Design Optimisation of a Nacelle
Knowledge Engineering in Geodise
Techniques: CommonKADS knowledge
engineering methodologies, PCPACK
Knowledge models: Organization, agent & task templates, domain schema & inferences.
Representation:HTML (knowledge web), XML, CML, UML, flowchart.
Knowledge models: ontologies, knowledge bases
Tools: Protégé & OilEd
Representation: RDF, Rule Sets
Knowledge Engineering in Geodise
WF Advisor
Semantic Service Description
A KB Approach to Semantic Service Composition
Lessons Learnt
Ontology mediated invocation of services Services as Problem Solving Methods or
Task Achieving Components Semantic service description is
indispensable for resource sharing and reuse Composition and workflow support will vary
from domain to domain Knowledge acquisition & conceptualisation
may still be the bottleneck
Knowledge-Based Collaboration
Simon Buckingham Shum (Open)David De Roure (So’ton)Marc Eisenstadt (Open)Nigel Shadbolt (So’ton)
Austin Tate (Edin)
AKT Integrated Feasibility Demonstrator
collective sensemaking /
discussion capture
following through decisions/coordinating activities
synthesising artifacts
recovering information from meetings
awareness of agent status /
sense of ‘presence’virtual meetings
Integrating multiple methods of communicating
mapping real time discussions/ group
sensemaking
following through decisions/coordinating activities
synthesising artifacts
recovering information from meetings
awareness ofcolleagues’
availability/ sense of ‘presence’virtual meetings
Testbed: Scientific project management (AKT PI NetMeeting, March 2003)
recovering information from meetings
NetMeeting
BuddySpace
Compendium
I-X Process panels
Meeting replay by time, speaker, contribution to discussion, and decision
CoAKTinG @ NASA
Distributed Mars-Earth planning and data analysis toolsfor Mars Habitat field trial in Utah desert, supported from US+UK
CoAKTinG NASA testbed
Compendium activity plans for surface exploration, constructed by scientists,
interpreted by software agents
CoAKTinG NASA testbed
Compendium science data map, generated by software agents, for interpretation by Mars+Earth scientists
CoAKTinG NASA testbed
Compendium-based photo analysis by geologists on ‘Mars’
CoAKTinG NASA testbed
Compendium scientific feedback map from Earth scientists to Mars colleagues
CoAKTinG NASA testbed
Compendium ‘portal map’ for Earth scientists, with the key links they need to prepare for a virtual meeting to analyse Mars scientists’
data
CoAKTinG NASA testbed
Meeting Replay tool for Earth scientists, synchronising
video of Mars crew’s discussion with their Compendium maps
Lessons from CoAKTinG
Initially achived a lot with light weight integration via XML based architectures like Jabber
Recently large scale 3 store RDF repository with taxonomic capabilities and RDQL provided a deeper integration
Substantial amounts of integration via the common representation of RDF in the repository
Real time services for annotation of streaming media particularly challenging
Achieved collaborative virtual team support – and follow on VRE project
Knowledge Mapping for e-Science
The Network Effect
We need tools to assist our understanding of emerging and established scientific work
Social Network Analysis Communities of
Interest/Practice Who is doing what,
where With what impact
Existing Citation Ratings…
Use of the literature Citeseer’s ‘most
cited author’ page. People use it to rate
themselves and others.
25 x ‘D Johnson’ in Citeseer: inaccurate!
Pure citation count not enough to prove ‘impact’?
Hubs and Authorities
Begin with existing measures: document count and citation count.
Apply Kleinberg (1998) ‘hubs/authorities’ analysis to data.
Note that higher citation count may not mean higher authority rating: quality citations are what count.
Turning Points and Centrality
Next step: include Chen’s Citespace/Pathfinder algorithm.
Allows us to find turning points in scientific development: Kuhn’s “paradigm shift” moment.
‘Centrality’ measure to be applied to same Citeseer data.
Lessons Learnt
The content is primary It needs rich semantic
annotation via ontologies Services emerge/designed
to exploit the content
Ontologies Mediation and
Aggregation Lightweight works Mapping and alignment
Socio technical issues Ontologies and KA Usability and
requirements
Importance of Reference A fundamental problem
Push versus Pull Important to understand
the balance here
Fidelity Need for trust, provenance
and quality services