The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities
Semantic Network Analysis 11.07.05
Analyzing Semantic Interoperability in Bioinformatic Database Networks
Philippe Cudré-Mauroux, EPFL
Joint work with: Julien Gaugaz, Adriana Budura and Karl Aberer
Overview
1. Peer Data Management Systems (PDMS)
2. Semantic Interoperability in the Large
   • Generatingfunctionologic framework
3. The Sequence Retrieval System
   • Degree distribution
   • Analysis of giant component
   • Weighted analysis
4. Conclusions
Beyond Keyword Search
searching semantically richer objects in large scale heterogeneous networks
[Figure: a "date?" query posed against sources that encode dates under different tags and formats]
<xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>
<xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>
<es:DofCreation> 05/08/2004 </es:DofCreation>
<myRDF:Date> Jan 1, 2005 </myRDF:Date>
Decentralized Data Integration
• Large Scale Information Systems (e.g., WWW)
  – Number of sources > 100
  – Unreliable data
    • Autonomy
  – Semi-structured data
    • E.g., XML/RDF
  – No integrity constraints
  – No transactions
  – Simple SP queries
    • E.g., triple patterns, ranking
  – Schemas created by end users
  – Network churn

VS

• Distributed Databases
  – Number of sources < 100
  – Consistent data
    • Coordination
  – Structured data
    • E.g., relational data model
  – Integrity constraints
  – Transactions
  – Powerful queries
    • E.g., SQL, aggregation
  – Schemas created by administrators
  – Relatively fixed topology
Data Integration: LAV/GAV
• Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources
• Not applicable to our context
  – Scale (upper ontologies?)
  – Churn
  – Autonomy
• How can we foster semantic interoperability in decentralized settings?
[Figure: a central attribute Date mapped to the local attributes myDate and yourDate, with m(Date) = myDate and m(Date) = yourDate]
Semantic Interoperability
Photoshop (own schema):
  <Photoshop_Image>
    <GUID>178A8CD8865</GUID>
    <Creator>Robinson</Creator>
    <Subject>
      <Bag>
        <Item>Tunbridge Wells</Item>
        <Item>Royal Council</Item>
      </Bag>
    </Subject>
    …
  </Photoshop_Image>

Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"

WinFS (known schema):
  <WinFSImage>
    <GUID>178A8CD8866</GUID>
    <Author>
      <DisplayName>Henry Peach Robinson</DisplayName>
      <Role>Photographer</Role>
    </Author>
    <Keyword>Tunbridge</Keyword>
    <Keyword>Council</Keyword>
    …
  </WinFSImage>

T12 = <Photoshop_Image>
        <GUID>$fs/GUID</GUID>
        <Creator>$fs/Author/DisplayName</Creator>
      </Photoshop_Image>
      FOR $fs IN /WinFSImage

Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
Extending semantic interoperability techniques to decentralized settings
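The rewriting of Q1 into Q2 through the mapping T12 can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the function names `t12` and `query_guids` are hypothetical, the standard-library `xml.etree.ElementTree` module stands in for a real XML query engine, and `LIKE "%Robi%"` is approximated by a substring test.

```python
import xml.etree.ElementTree as ET

# A WinFS record following the "known schema" of the slide.
WINFS_DOC = """
<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName>Henry Peach Robinson</DisplayName>
    <Role>Photographer</Role>
  </Author>
  <Keyword>Tunbridge</Keyword>
  <Keyword>Council</Keyword>
</WinFSImage>
"""


def t12(winfs_image):
    """Mapping T12 (sketch): present a WinFSImage as a Photoshop_Image."""
    photoshop = ET.Element("Photoshop_Image")
    ET.SubElement(photoshop, "GUID").text = winfs_image.findtext("GUID")
    ET.SubElement(photoshop, "Creator").text = winfs_image.findtext(
        "Author/DisplayName"
    )
    return photoshop


def query_guids(photoshop_images, needle="Robi"):
    """Q1/Q2 (sketch): GUIDs of images whose Creator contains `needle`."""
    return [
        image.findtext("GUID")
        for image in photoshop_images
        if needle in (image.findtext("Creator") or "")
    ]


# Q2: the same query as Q1, evaluated over the view produced by T12.
winfs_images = [ET.fromstring(WINFS_DOC)]
q2_result = query_guids([t12(image) for image in winfs_images])
print(q2_result)
```

The query itself is unchanged; only its input is replaced by the translated view, which is what lets a peer answer queries posed against a schema it does not use.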
1. Peer Data Management Systems
• Pairwise mappings
  – Peer Data Management Systems (PDMS)
• Local mappings overcome global heterogeneity
  – Iterative query rewriting
<xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>
<xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>
<es:cDate> 05/08/2004 </es:cDate>
<myRDF:Date> Jan 1, 2005 </myRDF:Date>
[Figure: a "date?" query propagated between peers over pairwise mappings among es:cDate, xap:CreateDate, xap:ModifyDate and myRDF:Date; node labels: article, weather]
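Iterative query rewriting can be sketched as a breadth-first traversal over pairwise attribute mappings. The mapping table below is hypothetical (the directions of the mappings are an assumption made for illustration), as is the function name `rewrite_iteratively`.

```python
from collections import deque

# Hypothetical pairwise mappings: attribute -> attributes it can be rewritten to.
MAPPINGS = {
    "es:cDate": ["xap:CreateDate"],
    "xap:CreateDate": ["myRDF:Date"],
    "myRDF:Date": [],
}


def rewrite_iteratively(attribute, mappings):
    """Return every attribute reachable from `attribute` with its hop count."""
    hops = {attribute: 0}
    queue = deque([attribute])
    while queue:
        current = queue.popleft()
        for target in mappings.get(current, []):
            if target not in hops:  # rewrite each attribute only once
                hops[target] = hops[current] + 1
                queue.append(target)
    return hops


reach = rewrite_iteratively("es:cDate", MAPPINGS)
print(reach)
```

No peer needs a global schema here: each hop uses only a local mapping, yet the query eventually reaches attributes two mappings away.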
Semantic Mediation Layer
[Figure: three layers (“Physical” layer, Overlay Layer, Semantic Mediation Layer), with two annotations reading Correlated / Uncorrelated]
Schema-to-Schema Graph
Inter-organization of the different schemas used by the peers
– Logical model
– Directed
– Weighted
– Redundant
The Semantic Connectivity Graph
• Definition (Semantic Interoperability)
  Two peers are said to be semantically interoperable if they can forward queries to each other in the Schema-to-Schema graph, potentially through a series of semantic translation links
• Idea
  – As for physical network analyses, create a connectivity layer to account for semantic interoperability
• The Semantic Connectivity Graph S
  – Unweighted, irreflexive and non-redundant version of the Schema-to-Schema graph
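Deriving the semantic connectivity graph from the Schema-to-Schema graph amounts to dropping weights, self-loops and duplicate links. A minimal sketch, assuming the Schema-to-Schema graph is given as a list of `(source, target, weight)` triples (this representation and the function name are assumptions):

```python
def semantic_connectivity_graph(schema_to_schema_links):
    """Return the set of directed edges (source, target) of S.

    Weights are dropped (unweighted), self-loops are removed (irreflexive),
    and parallel links collapse into one edge (non-redundant).
    """
    edges = set()
    for source, target, _weight in schema_to_schema_links:
        if source != target:
            edges.add((source, target))
    return edges


links = [
    ("EMBL", "SwissProt", 0.9),
    ("EMBL", "SwissProt", 0.4),   # redundant mapping between the same schemas
    ("SwissProt", "EMBL", 0.7),   # opposite direction: a distinct edge
    ("Prosite", "Prosite", 1.0),  # self-loop, discarded
]
s_edges = semantic_connectivity_graph(links)
print(sorted(s_edges))
```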
Observations
• Theorem
  Peers in a set Ps are semantically interoperable iff Ss is strongly connected, with Ss ≡ {s | ∃ p ∈ Ps, p ↔ s}
• Observation 1
  A set of peers Ps cannot be semantically interoperable if |Es| < |Vs|
• Observation 2
  A set of peers Ps is semantically interoperable if |Es| > |Vs| (|Vs|-1) - (|Vs|-1)
  (Vs and Es: the vertices and edges of Ss)
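The theorem and the two edge-count observations can be checked on small graphs. The sketch below assumes a simple directed graph (no self-loops, no duplicate edges, all edge endpoints inside the vertex set) with at least two vertices; the function names are hypothetical.

```python
def _reachable(start, adjacency):
    """Vertices reachable from `start` by following `adjacency`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def is_strongly_connected(vertices, edges):
    """True iff every vertex reaches every other vertex."""
    vertices = set(vertices)
    forward, backward = {}, {}
    for source, target in edges:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
    root = next(iter(vertices))
    # The root must reach everyone, and everyone must reach the root.
    return (_reachable(root, forward) >= vertices
            and _reachable(root, backward) >= vertices)


def edge_count_verdict(n_vertices, n_edges):
    """Observations 1 and 2: False, True, or None when the count is inconclusive."""
    if n_edges < n_vertices:
        return False  # too few edges to be strongly connected
    if n_edges > n_vertices * (n_vertices - 1) - (n_vertices - 1):
        return True   # too few missing edges to disconnect the graph
    return None


cycle = [("a", "b"), ("b", "c"), ("c", "a")]
print(is_strongly_connected("abc", cycle), edge_count_verdict(3, len(cycle)))
```

The edge counts only settle the extreme cases; in between, as for the three-node cycle above, strong connectivity has to be tested on the graph itself.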
2. Semantic Interoperability in the Large
• Question
  – How can we analyze semantic interoperability in large-scale PDMS?
• Idea: use percolation theory to detect the emergence of a strongly connected component in S
  – Necessary condition for vertex-strong connectivity
  – Necessary condition for semantic interoperability
The Model
• Adaptation of a recent graph-theoretic framework
  – Newman, Strogatz, Watts 2001
• Large-scale semantic graphs as random graphs with arbitrary degree distribution
  – Exponentially distributed, small-world, scale-free… graphs
• Specificities of our model
  – Strong clustering (clustering coefficient cc)
  – Bidirectionality (bidirectionality coefficient bc) (for directed networks)
• Based on generatingfunctionology
  – Connectivity indicator ci
• Percolation: ci > 0
Size of the giant component
With u the smallest non-negative solution of
And G1 the distribution of edges from first to second-order neighbors:
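For orientation, in the standard Newman-Strogatz-Watts formulation for undirected random graphs (without the clustering and bidirectionality corrections this model adds), the relative size of the giant component is S = 1 - G0(u), where G0 is the generating function of the degree distribution, G1(x) = G0'(x) / G0'(1), and u is the smallest non-negative solution of u = G1(u). The sketch below implements only that textbook version; the function name and the Poisson test distribution are assumptions made for illustration.

```python
import math


def giant_component_fraction(degree_probs, iterations=500):
    """Relative giant-component size for a degree distribution.

    degree_probs[k] is the probability that a node has degree k.
    """
    mean_degree = sum(k * pk for k, pk in enumerate(degree_probs))

    def g0(x):
        return sum(pk * x ** k for k, pk in enumerate(degree_probs))

    def g1(x):
        total = sum(k * pk * x ** (k - 1)
                    for k, pk in enumerate(degree_probs) if k > 0)
        return total / mean_degree

    # Iterating from 0 climbs monotonically to the smallest non-negative
    # fixed point of u = G1(u).
    u = 0.0
    for _ in range(iterations):
        u = g1(u)
    return 1.0 - g0(u)


# Poisson degree distribution with mean 2, truncated at degree 59.
mean = 2.0
poisson = [math.exp(-mean) * mean ** k / math.factorial(k) for k in range(60)]
fraction = giant_component_fraction(poisson)
print(round(fraction, 3))
```

For a Poisson distribution the result can be cross-checked against the classical relation S = 1 - exp(-mean * S), which gives roughly 0.80 for a mean degree of 2.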
3. The Sequence Retrieval System (SRS)
• Commercial information indexing and retrieval system
• Bioinformatic libraries
  – EMBL
  – SwissProt
  – Prosite
  – Etc.
• Schemas described in a custom language (Icarus)
• Mappings (links) from one database to others
Why is SRS interesting?
• Applying our heuristics on a real large-scale corpus of interconnected databases
  – More than 380 databanks
  – More than 500 (undirected) links
  – Data used by professionals on a daily basis
Crawling the SRS schema-to-schema graph
• Custom crawler
• As of May 2005 (EBI repository)
  – 388 nodes
  – 518 edges
  – Giant connected component: 187 nodes
  – Power-law distribution of node degrees
  – Clustering coefficient = 0.32
  – Diameter = 9
Results
• Connectivity indicator ci = 25.4
  – Super-critical state
• Size of the giant component
  – 0.47 (derived)
  – 0.48 (observed)
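The observed value is consistent with the crawl figures: 187 of 388 nodes is about 0.48. On any crawled graph, the observed fraction can be measured as below. This is a sketch on a toy undirected graph, not the SRS data, and the function name is hypothetical.

```python
def largest_component_fraction(nodes, undirected_edges):
    """Fraction of nodes that belong to the largest connected component."""
    adjacency = {node: set() for node in nodes}
    for a, b in undirected_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal of the component containing `start`.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        largest = max(largest, size)
    return largest / len(adjacency)


toy_fraction = largest_component_fraction(
    [1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]
)
srs_observed = 187 / 388  # giant component size over node count from the crawl
print(toy_fraction, round(srs_observed, 2))
```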
Graphs with the same power-law degree distribution
• Varying number of edges
10x Bigger Graph
Analyzing weighted networks
• Do we have a sufficient number of good mappings?
• Introducing quality measures from the mappings
  – Weights
  – Attribute / schema level
  – Cf. Chatty Web (WWW03)
• Semantic query forwarding
  – Per-hop forwarding behaviors
  – Only forward if wi >= τ
    • τ = 0: flooding
    • τ = 1: exact answers
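Per-hop forwarding with a weight threshold can be sketched as a graph traversal that ignores links below the threshold. The threshold name `tau`, the link table and the function name are assumptions made for illustration; weights are taken to lie between 0 and 1.

```python
def peers_reached(start, weighted_links, tau):
    """Peers a query reaches when a link is used only if its weight >= tau.

    weighted_links maps a peer to a list of (neighbour, weight) pairs.
    """
    reached = {start}
    stack = [start]
    while stack:
        peer = stack.pop()
        for neighbour, weight in weighted_links.get(peer, []):
            if weight >= tau and neighbour not in reached:
                reached.add(neighbour)
                stack.append(neighbour)
    return reached


# Hypothetical mapping qualities between four peers.
LINKS = {
    "A": [("B", 1.0), ("C", 0.4)],
    "C": [("D", 0.9)],
}

flooding = peers_reached("A", LINKS, tau=0.0)
selective = peers_reached("A", LINKS, tau=0.5)
exact = peers_reached("A", LINKS, tau=1.0)
print(sorted(flooding), sorted(selective), sorted(exact))
```

Raising the threshold prunes links, so the set of reachable peers can only shrink; D is lost at a threshold of 0.5 even though its own incoming link is good, because the only path to it crosses a weak mapping.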
Weighted Results
• Same degree distribution (388 nodes)
• Uniformly distributed weights between 0 and 1
4. Conclusions
• Analyzing a real network of bioinformatic databases
  – Accurate results (even for relatively small networks)
  – Weighted / unweighted
• Current work
  – Compositions of weights along a path
  – Semantic random walkers
  – Public domain simulator
• Future work
  – Analyzing other forwarding behaviors
  – Implementation in a real PDMS (self-organizing mappings)
    • GridVine
References
A Necessary Condition for Semantic Interoperability in the Large. Philippe Cudré-Mauroux and Karl Aberer. ODBASE 2004.
GridVine: Building Internet-Scale Semantic Overlay Networks. Karl Aberer, Philippe Cudré-Mauroux and Tim van Pelt. ISWC 2004.
Semantic Overlay Networks (Tutorial). Karl Aberer and Philippe Cudré-Mauroux. VLDB 2005.
… complete reference list at http://lsirpeople.epfl.ch/pcudre/