Network comparison – Lipari International Summer School – July 3-10, 2010
Exact and Inexact Graph Matching with Applications in Biology
Bioinformatica 27-05-2011
Network comparison – Lipari International Summer School – July 3-10, 2010
Optimization, Machine Learning and Bioinformatics – Centre "Ettore Majorana" – Erice – September 8-16, 2010
BIBLIOGRAPHY
Di Natale R, Ferro A, Giugno R, Mongiovì M, Pulvirenti A, Shasha D.
SING: Subgraph search In Non-homogeneous Graphs.
BMC Bioinformatics, vol. 11:96, 2010.
Mongiovì M, Di Natale R, Giugno R, Pulvirenti A, Ferro A, Sharan R.
A set-cover-based approach for inexact graph matching.
Journal of Bioinformatics and Computational Biology, vol. 8, 199-218, 2010.
Outline
• Motivation
• Exact matching and Graph Indexing
• Indexing large graphs
• Indexing for inexact matching
• A Set-cover based approach
• Multiset multi-cover and a greedy algorithm
• A tight lower bound for the optimal cover
• Experimental analysis
• Application on protein complexes
• Conclusion and future work
Searching on molecular compounds
[Figure: a query substructure and the molecular compounds that match it]
Searching on protein complexes
Query a complex of one species against a database of complexes of another species.
Exact Graph Matching
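Given two graphs $G_1 = (V_1, E_1, \Sigma, l)$ and $G_2 = (V_2, E_2, \Sigma, l)$, an isomorphism (that respects the labels) between $G_1$ and $G_2$ is a bijection $\varphi : V_1 \to V_2$ such that:
$(v, u) \in E_1 \iff (\varphi(v), \varphi(u)) \in E_2$
$l(u) = l(\varphi(u)), \quad \forall u \in V_1$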
A subgraph isomorphism between G1 and G2 is an isomorphism between G1 and a subgraph of G2.
We say that a graph G1 admits an exact match in G2 if there exists a subgraph isomorphism between G1 and G2.
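The definition above can be illustrated with a small backtracking search. This is only a sketch under assumptions that are not in the slides: graphs are undirected and stored as a pair (vertex-to-label dictionary, set of frozenset edges), the function name find_exact_match is hypothetical, and the match is not required to be induced (extra edges in the target are allowed). It is a plain exhaustive search, not Ullmann, VF2 or SING.

```python
def find_exact_match(query, target):
    """Return one label-preserving subgraph isomorphism query -> target, or None.

    Each graph is (labels, edges): labels maps vertex -> label,
    edges is a set of frozenset({u, v}) pairs (undirected, no self-loops).
    """
    q_labels, q_edges = query
    t_labels, t_edges = target
    q_vertices = list(q_labels)

    def extend(mapping):
        # All query vertices mapped: a match has been found.
        if len(mapping) == len(q_vertices):
            return dict(mapping)
        v = q_vertices[len(mapping)]
        for u in t_labels:
            # The mapping must be injective and must respect the labels.
            if u in mapping.values() or t_labels[u] != q_labels[v]:
                continue
            # Every query edge between v and an already mapped vertex
            # must also be present between their images in the target.
            consistent = all(
                frozenset((u, mapping[w])) in t_edges
                for w in mapping
                if frozenset((v, w)) in q_edges
            )
            if consistent:
                mapping[v] = u
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[v]  # backtrack
        return None

    return extend({})
```

The search is exponential in the worst case, which is consistent with the hardness of the problem and is the reason filtering is applied before this verification step.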
Subgraph Isomorphism
The subgraph isomorphism problem is NP-hard. Several algorithms (Ullmann, Nauty, VF2) and tools (NetMatch) have been proposed.
Searching for a query in a database of graphs may therefore take a long time. For this reason, indexing systems have recently been proposed to obtain a reasonable response time.
Graph Indexing Systems
Feature-based graph indexing systems: they consider a set of "features" F and filter out every database graph that lacks at least one feature of F contained in the query. They use an inverted index to organize the features.
E.g. gIndex, TreePi, GraphFind
Non-feature-based graph indexing systems: the graphs of the database are usually arranged in a tree (R-tree or B-tree like). These systems are more suitable for frequent updates.
E.g. CTree, GCoding
Features
Each system defines its own set of features. Some examples of features are:
• Small graphs (gIndex, FGIndex): to limit the number of features, they consider only the set of frequent subgraphs.
• Trees (TreePi): since trees have a center, the filtering phase can be improved by considering the distances between centers.
• Paths (SING): paths have a starting point. This information can be used to improve filtering and matching. Moreover, finding paths is more efficient than finding subgraphs.
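To make the path features concrete, the following sketch enumerates label sequences of simple paths and counts their occurrences. The representation (a vertex-to-label dictionary plus an adjacency dictionary), the function name path_features, and the choice to count each path once per starting vertex and to skip zero-length paths are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter


def path_features(labels, adjacency, max_len=4):
    """Count label sequences of simple paths with 1..max_len edges.

    labels: vertex -> label; adjacency: vertex -> iterable of neighbours.
    A path is counted once for each vertex it starts from, so an
    undirected path contributes two label sequences (one per direction).
    """
    counts = Counter()

    def walk(path):
        if len(path) > 1:
            counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return  # the path already has max_len edges
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # keep the path simple
                walk(path + [nxt])

    for start in labels:
        walk([start])
    return counts
```

For a three-vertex chain labeled A, B, A and max_len=2, this yields the sequences (A, B), (B, A) and (A, B, A), each with two occurrences.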
Example
[Figure: query Q and graph G with their feature sets FQ and FG and the occurrence counts of each feature; some features of Q are missing in G, with 2 missing occurrences]
Consider as features all paths of length 2
Graph Indexing Schema
The basic scheme considers three phases:
1. Preprocessing: each graph of the database is examined to extract all the features it contains. The features are organized in an inverted index.
2. Filtering: the query is examined to extract the set of features it contains, and a candidate graph set is computed by comparing the set of features of the query with the sets of features of the graphs.
3. Matching: each candidate graph is examined to verify whether it contains matches.
Feature types: subgraphs, trees, paths
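The preprocessing and filtering phases can be sketched with a dictionary-based inverted index. The names build_inverted_index and candidate_graphs are hypothetical, features are treated as opaque hashable values, and only presence (not occurrence counts) is used here; the matching phase would then run a subgraph isomorphism test on each surviving graph.

```python
from collections import defaultdict


def build_inverted_index(db_features):
    """Preprocessing: map each feature to the set of graphs containing it.

    db_features: graph id -> set of features extracted from that graph.
    """
    index = defaultdict(set)
    for graph_id, features in db_features.items():
        for feature in features:
            index[feature].add(graph_id)
    return index


def candidate_graphs(index, query_features, all_graph_ids):
    """Filtering: keep only graphs that contain every feature of the query."""
    candidates = set(all_graph_ids)
    for feature in query_features:
        candidates &= index.get(feature, set())
    return candidates
```

A graph missing even one query feature cannot contain the query, so the filter never discards a true match; it may still keep graphs that fail the later verification.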
Example
[Figure: preprocessing builds an inverted index (features f1-f6, each with the list of graphs containing it) from the graph DB; filtering uses the index and the query Q to produce the set of candidates]
SING
Consider edges as features. Note that AB and AC are contained in both g1 and g2 but only g1 contains the query.
How can we distinguish these cases?
Both features AB and AC start from a single vertex A in g1 and q but not in g2.
SING index
We consider as features all the paths of length up to lp (by default lp = 4)
We consider a global inverted index and a local index for each graph
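[Figure: the global index lists, for each feature f1-f6, the graphs containing it together with occurrence counts (e.g. g1 3, g2 1, g4 3); the local index of g1 stores one bit array per feature (f1, f2, f4, f5), e.g. 10010000, annotated with vertices v1 and v4]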
Query processing
1. For each feature f of the query, take the set of graphs in which f occurs at least as many times as it occurs in the query. Compute the intersection of all these sets.
2. For each graph in the resulting set, use the local index to compute a mapping between vertices of the query and vertices of the graph.
3. Discard every graph for which at least one vertex of the query has no corresponding vertex in the graph.
4. Assign new labels to the vertices based on the mapping. The new labels make the verification phase faster.
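Steps 1 to 3 can be sketched as follows. The data layout is an assumption chosen for readability rather than the bit-array layout of the SING index: the global index maps a feature to a dictionary of per-graph occurrence counts, and each local index maps a feature to the set of vertices from which it starts. The function names are hypothetical, and step 4 (relabeling) is not shown.

```python
def count_filter(global_index, query_counts, graph_ids):
    """Step 1: keep graphs where each query feature occurs often enough."""
    survivors = set(graph_ids)
    for feature, needed in query_counts.items():
        postings = global_index.get(feature, {})
        survivors = {g for g in survivors if postings.get(g, 0) >= needed}
    return survivors


def vertex_candidates(local_index, query_starts):
    """Step 2: for each query vertex, the graph vertices that start
    every feature starting at that query vertex."""
    candidates = {}
    for q_vertex, features in query_starts.items():
        if not features:
            continue  # no feature starts here, so no constraint is imposed
        start_sets = [set(local_index.get(f, set())) for f in features]
        candidates[q_vertex] = set.intersection(*start_sets)
    return candidates


def sing_filter(global_index, local_indexes, query_counts, query_starts):
    """Steps 1-3: return surviving graphs with their vertex candidates."""
    result = {}
    for g in count_filter(global_index, query_counts, local_indexes):
        candidates = vertex_candidates(local_indexes[g], query_starts)
        # Step 3: discard g if some query vertex has no candidate at all.
        if all(candidates.values()):
            result[g] = candidates
    return result
```

With edge features AB and AC that both start from one vertex in the query, a graph where AB and AC start from two different A vertices passes the count filter but is discarded in step 3.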
Comparison – molecules (AIDS dataset)
Comparison – TRN of E. coli annotated with gene expression data
• 22 copies of the Transcriptional Regulatory Network of E. coli
• Gene expression profiles of 22 strains of E. coli K12
• Each network is labeled with the gene expression profile of a different sample. 5 labels: very low, low, medium, high, very high.
• Motifs (by Uri Alon) as queries
Comparison – Single graph (synthetic)
• Scale-free network
• 2000 nodes
• 4000 edges
• 8 labels
• Queries extracted at random
The importance of inexact matching
In certain application domains, exact matching is too restrictive because it misses partial matches, which can give useful information. In these cases inexact matching is preferable.
E.g. molecular compounds: partially matching substructures can preserve important chemical properties.
E.g. protein complexes: we want to look for a protein complex of one species in a database of protein complexes of another species, in order to identify conserved complexes. The topology is rarely fully conserved.
Indexing for Inexact matching
GRAFIL: transforms edge deletions into feature misses and computes the maximum number of feature misses allowed. To improve the results, it applies a multi-filter strategy that considers several groups of features separately.
SIGMA: given a maximum number of edge deletions, it transforms the filtering problem into a variant of Set-cover.
SAGA: handles deletions and mismatches. It compares fragments (groups of nodes satisfying a maximum distance constraint) of the query with fragments of each target graph and builds a compatibility graph among matching fragments. A clique in the compatibility graph is a candidate match. SAGA uses a different notion of distance between graphs, so its applicability is limited in domains that require control over the number of deletions.
CTree: finds the subgraphs whose edit distance from the query is low. The distance computation is approximate, so it can produce false negatives.
Inexact matching – edge deletions
• Some edges of the query may be missing in the graph (deletions)
• GRAFIL and SIGMA fix a maximum number of deletions d and look for all matches obtained by deleting at most d edges from the query
[Figure: query Q, graph G and the deletions]
Managing edge deletions
[Figure: query Q with edges 1-4 and the associated feature sets F1-F4]
• Each edge is associated with the set of features that contain it.
• GRAFIL: how many features of Q can be missing in a target graph? Maximum coverage problem.
• SIGMA: given the set of features of a target graph, is it "consistent" with Q and a maximum number of deletions d? Multiset multi-cover problem.
Feature count vs identity
[Figure: query Q and target graph G over labels A and B, annotated with the occurrence counts of their path features]
• Search for Q with 1 allowed edge deletion
• The maximum number of feature misses is 3 (counting all occurrences)
• G has 2 feature misses, so it cannot be discarded on the count alone
• If we look at the identity of the features, we note that G misses 2 features of kind AAB, which is sufficient to assert that Q cannot be contained in G
SIGMA- admitting one deletion
[Figure: query Q with edges 1-4 and the feature sets F1, F2, ... associated with its edges, within the full feature set F]
• Given a graph G, if Q is completely contained in G, all features of F must be contained in G.
• If edge 1 is missing, the features in $F_1$ may be missing in G.
• If edge 2 is missing, the features in $F_2$ may be missing in G, and so on.
• In general, if we admit at most one deletion, all features of $F \setminus F_i$ must be contained in G for some $i \in E$.
• Equivalently, the features missing in G must be contained in $F_i$ for some $i \in E$.
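The one-deletion condition reduces to a subset test, as in this minimal sketch. It works on feature identities only and ignores multiplicities; the function name and the dictionary from edge to feature set are assumptions made for illustration.

```python
def consistent_with_one_deletion(missing, features_by_edge):
    """Return True if the features missing in G can be explained by
    deleting at most one edge of the query.

    missing: set of query features absent from the target graph.
    features_by_edge: query edge -> set of features containing that edge.
    """
    if not missing:
        return True  # nothing is missing: no deletion is needed
    # One deletion suffices only if a single edge accounts for all misses.
    return any(missing <= feats for feats in features_by_edge.values())
```

If the function returns False, the graph can be discarded without running the matching phase.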
Generalizing to more deletions
Given a graph G, find the minimum-size set of edges $E'$ such that:
$$F_Q \setminus F_G \subseteq \bigcup_{e \in E'} F_e$$
• This corresponds to finding the minimum number of edges that have to be deleted for G to be a candidate match
• The problem defined this way is the classical Set-cover problem
• Since a feature can occur several times, we consider instead the Multiset multi-cover problem, with the further constraint that each set can be taken only once (Vazirani)
[Figure: query Q with edges 1-4 and their feature sets F1-F4, the feature set FG of graph G, and the set FQ-FG to be covered]
Multiset multi-cover
• We have multisets (each element has a multiplicity)
• Find the minimum-size subfamily of S whose union contains Y (respecting multiplicities)
• E.g. {X2, X3, X4} is a cover for Y
[Figure: the multiset Y and the family S = {X1, X2, X3, X4, X5}]
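A cover can be checked with Python's collections.Counter as the multiset type. One assumption is made explicit here: the multiplicities of the chosen sets are added up, as in the usual multiset multi-cover formulation, so an element required k times is covered when the chosen sets supply it at least k times in total. The function name covers is hypothetical.

```python
from collections import Counter


def covers(subfamily, target):
    """Return True if the chosen multisets cover the target multiset.

    subfamily: iterable of Counter (each chosen set is used once).
    target: Counter giving the required multiplicity of each element.
    """
    supplied = Counter()
    for multiset in subfamily:
        supplied.update(multiset)  # add multiplicities element by element
    return all(supplied[element] >= need for element, need in target.items())
```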
Multiset multi-cover
• Multiset multi-cover, like Set-cover, is NP-hard but…
• There is a greedy algorithm which can solve it in polynomial time with bounded error
• We can compute a lower bound for the size of the cover, which we can use to prune the database of graphs. For the filtering to be effective we need a tight lower bound.
• Given a graph G, if the computed lower bound for the cover is greater than the maximum number of allowed deletions then G can be discarded
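A greedy heuristic of the kind mentioned above can be sketched as follows: at every step take the unused set that covers the largest number of still-uncovered occurrences. This is a generic greedy for multiset multi-cover with each set usable once, written under the assumption that multiplicities add up; the function name and the tie-breaking rule (first set in insertion order) are illustrative choices, not details from the talk.

```python
from collections import Counter


def greedy_multicover(family, target):
    """Greedy cover of the multiset `target` using each set at most once.

    family: name -> Counter; target: Counter of required multiplicities.
    Returns the list of chosen names in pick order, or None if the
    family cannot cover the target.
    """
    residual = Counter({f: n for f, n in target.items() if n > 0})
    remaining = dict(family)
    chosen = []

    def gain(name):
        # Occurrences of still-needed elements that this set would cover.
        return sum(min(m, residual[f])
                   for f, m in remaining[name].items() if f in residual)

    while residual:
        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            return None  # no remaining set makes progress
        for f, m in remaining.pop(best).items():
            if f in residual:
                residual[f] -= m
                if residual[f] <= 0:
                    del residual[f]
        chosen.append(best)
    return chosen
```

The size of the greedy cover is an upper bound on the optimal cover size, so on its own it cannot justify discarding a graph; that is why a lower bound is needed for pruning.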
A tight lower bound
• Y is the multiset to cover and S is the input family of multisets
• When a set $X \in S$ is taken, assign a cost to each element instance of X, spreading a unit cost over all the newly covered feature occurrences
• Number the occurrences of each feature in the order in which they are covered, and let cost(f, i) be the cost assigned to the i-th occurrence of f
• Let $S^*$ be the exact (optimal) cover, $m_X(f)$ and $m_Y(f)$ the multiplicities of f in X and Y, and $r_X(f) = \min(m_X(f), m_Y(f))$
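The cost assignment can be sketched as follows, given the sets in the order in which the greedy algorithm picked them. The function name assign_costs and the use of Fraction for exact arithmetic are assumptions made for illustration; the returned list for a feature f holds cost(f, 1), cost(f, 2), and so on, in covering order.

```python
from fractions import Fraction


def assign_costs(picked, target):
    """Spread a unit cost per picked set over its newly covered occurrences.

    picked: list of dicts/Counters (feature -> multiplicity) in pick order.
    target: dict/Counter giving the multiplicity of each feature in Y.
    Returns feature -> list of costs, one per covered occurrence.
    """
    residual = dict(target)
    costs = {f: [] for f in target}
    for multiset in picked:
        # Occurrences of each feature that this set covers for the first time.
        newly = {f: min(m, residual.get(f, 0)) for f, m in multiset.items()}
        total_new = sum(newly.values())
        if total_new == 0:
            continue  # the set covers nothing new, so no cost is spread
        for f, k in newly.items():
            if k:
                costs[f].extend([Fraction(1, total_new)] * k)
                residual[f] -= k
    return costs
```

Since each set that covers something new spreads exactly one unit, the costs sum to the number of such sets, which is the size of the greedy cover.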
Lower bound proof
Proof. We prove that the optimal cover $S^*$ satisfies the condition under the min operator.
[Equation]
The thesis then follows directly, since $S^*$ is one of the subfamilies $S' \subseteq S$ over which the minimum is taken.
Computing the lower bound
• During the execution of the greedy algorithm we compute a per-feature weight and, for each set X, the quantity obtained by summing $r_X(f)$ times the weight of f over all $f \in X$.
• The minimum-size $S'$ is obtained by taking the sets with the greatest values of this quantity.
• More precisely, the sets of S are ranked by this quantity in descending order, then taken one by one until the running total reaches the threshold required by the condition of the bound.
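The ranking step itself is simple to sketch once the per-set values and the threshold are known. Both are inputs here: this sketch does not compute the quantities used in the talk, it only shows why taking the largest values first yields the fewest sets. The function name is hypothetical.

```python
def min_sets_to_reach(values, threshold):
    """Smallest number of sets whose values sum to at least `threshold`.

    values: one non-negative number per set; threshold: required total.
    Returns None if even all sets together fall short.
    """
    if threshold <= 0:
        return 0
    total = 0
    # Taking the largest values first minimises the number of sets needed.
    for count, value in enumerate(sorted(values, reverse=True), start=1):
        total += value
        if total >= threshold:
            return count
    return None
```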
Query processing
1. Extract the features from the query.
2. Build a family S of sets of features (each set associated with an edge of the query).
3. For each graph:
   a) Compute the set of missing features Y
   b) Apply the greedy algorithm for multiset multi-cover on (S, Y)
   c) Compute the lower bound
   d) If the lower bound is less than or equal to the maximum number of allowed deletions, check whether there is a match
   e) Otherwise discard the graph
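The filtering loop of step 3 can be sketched as below. An important caveat: the lower bound used here is a simple stand-in, not the tight bound of the talk, and it skips the greedy step (b). It relies only on the fact that a set X can cover at most the sum over f of min(m_X(f), m_Y(f)) occurrences of Y, so any cover needs enough sets for these capacities to add up to |Y|. All function names are hypothetical, and features are counted with collections.Counter.

```python
from collections import Counter


def simple_lower_bound(family, missing):
    """A valid (but loose) lower bound on the size of any cover of `missing`.

    family: list of Counter, one per query edge.
    Returns None when the family cannot cover `missing` at all.
    """
    need = sum(missing.values())
    if need == 0:
        return 0
    capacities = sorted(
        (sum(min(m, missing[f]) for f, m in X.items()) for X in family),
        reverse=True,
    )
    covered = 0
    for k, capacity in enumerate(capacities, start=1):
        covered += capacity
        if covered >= need:
            return k
    return None


def filter_candidates(query_counts, family, db_counts, max_deletions):
    """Keep the graphs whose lower bound does not exceed max_deletions."""
    kept = []
    for graph_id, counts in db_counts.items():
        missing = query_counts - counts  # Counter difference keeps positives
        bound = simple_lower_bound(family, missing)
        if bound is not None and bound <= max_deletions:
            kept.append(graph_id)  # still needs the exact matching check
    return kept
```

A tighter bound discards more graphs before the expensive matching step, which is the motivation for the bound described in the talk.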
Experimental analysis - molecules
• Comparison of our approach (SIGMA) against GRAFIL and a naive baseline (Edge), over a database of 40,000 molecular compounds
• All methods use paths of length up to 4 as features
Experimental analysis – query time
Application on protein complexes
Protein complexes cross-comparison: yeast vs. human
Find all yeast protein complexes that contain a human protein complex with up to 4 deletions
Material
• 785 Human complexes from CORUM
• 284 Yeast complexes from SGD
• The topology was inferred from the PPI networks (BioGRID)
• The vertices were labeled according to the BLAST score (similar proteins are assigned the same label)
All-pair BLAST on yeast and human proteins. Average-linkage hierarchical clustering with score cutoff 40 and maximum cluster size 100. Proteins in the same cluster receive the same label.
Experimental analysis - complexes
Experimental analysis - complexes
Small nucleolar ribonucleoprotein complex
LSm2-8 complex
Conclusion
Exact matching: SING
• Use node locality information to improve filtering
• Identify and filter nodes of the target network that cannot belong to a match
• Reassign labels to improve the matching phase
Inexact matching: SIGMA
• Efficient filtering based on Multiset multi-cover
• Greedy algorithm
• A tight lower bound for the optimal cover
Applications
• Molecular compounds
• Transcription Regulation Networks
• Protein complexes
Future directions
• Multi-label management
  - Support generic associations between query nodes and target nodes (e.g. all-pair BLAST)
  - Support labels that have a hierarchical structure (e.g. GO)
  - Manage wildcards
• Managing bounded and unbounded paths
  - Distance and reachability queries with label constraints
• Inexact matching on large graphs
  - Methods for exact matching do not work well
  - Manage matches sharing a large common component
Future directions
• Find high-scoring matches (with respect to a scoring function)
  - Edge weights
  - Node similarity
• Secondary memory management
The Jacob T. Schwartz International School for Scientific Research (LIPARI SCHOOL)
http://lipari.cs.unict.it/
School Director
Professor Alfredo Ferro, Ph.D.
Department of Mathematics & Computer Science
University of Catania
Viale A. Doria, 6 - 95125 Catania - ITALY
Tel: +39 095 7383071
Fax: +39 095 330094
E-mail: [email protected]
Jacob T. Schwartz International School for Scientific Research
Biological Sequence Analysis and High Throughput Technologies
Lipari July 2 – July 9, 2011
Speakers
Soren Brunak, Center for Biological Sequence Analysis, Technical University of Denmark
Bud Mishra, New York University
Itzik Peer, Columbia University in the City of New York
Paola Sebastiani, Boston University
Guest Lecturers
Carlo Croce, Ohio State University
Gene Myers, HHMI
Roded Sharan, Tel Aviv University
School Directors
* Prof. Alfredo Ferro (University of Catania)
* Prof. Raffaele Giancarlo (University of Palermo)
* Prof. Concettina Guerra (University of Padova and Georgia Tech.)
* Prof. Michael Levitt (Stanford University)
* Dr. Rosalba Giugno (co-director, University of Catania)
* Dr. Alfredo Pulvirenti (co-director, University of Catania)
Jacob T. Schwartz International School for Scientific Research
Game Theoretic approach to Computational Complex Systems
Lipari July 9 – July 16, 2011
Doyne Farmer, Santa Fe Institute – LUISS Rome: The complex dynamics of complicated games
Herbert Gintis, Santa Fe Institute - Central European University - Collegium Budapest: The Dynamics of Market Economies
Dirk Helbing, ETH Zurich, Swiss Federal Institute of Technology Zurich: Social cooperation, norms and conflicts: A game-theoretical approach
Tim Roughgarden, Stanford University: Reward and punishment in Public good Games.
Karl Sigmund, University of Vienna: Reward and punishment in Public good Games.
School Directors
* Prof. Alfredo Ferro (University of Catania)
* Prof. Dirk Helbing (ETH Zurich)
* Prof. Andrea Rapisarda (University of Catania)
* Prof. V.S. Subrahmanian (University of Maryland)
4th International Conference on Similarity Search and Applications
Lipari June 30 – July 1, 2011
Invited Speakers
Roded Sharan, Tel Aviv University
Paolo Ferragina, Università di Pisa
http://www.sisap.org/
THANK YOU!
http://ferrolab.dmi.unict.it/