Statistical approaches to protein matching in Bioinformatics
Vysaul B. Nyirongo
Submitted in accordance with the requirements for the degree
of Doctor of Philosophy
The University of Leeds
Department of Statistics
January 2006
The candidate confirms that the work submitted is his own and that appropriate
credit has been given where reference has been made to the work of others. This
copy has been supplied on the understanding that it is copyright material and that
no quotation from the thesis may be published without proper acknowledgement.
Dedication
To my father W.C. Nyirongo, the kindest
and
my mother Nee Dorothy Nyirenda, the finest.
Acknowledgements
I am deeply thankful to my supervisor, Prof. K.V. Mardia, for his guidance,
discussions, helpful comments and inspiring interest in this research. I am also deeply
indebted to Prof. P.J. Green for kindly providing the source code for Bayesian
alignment using hierarchical models.
I wish to thank Dr. C. Xu for his many helpful comments on spatial point processes
and for kindly allowing me to use his program for analysing spatial point processes.
I am also grateful to Dr. D.R. Westhead and Dr. N.D. Gold for their many helpful
discussions and comments, and for access to the functional sites database (SITESDB).
Last but not least, I would like to express my gratitude for financial support
from Universities UK, the University of Leeds and the Department of Statistics at the
University of Leeds. My research studies were financed by Universities UK through an
ORS scholarship and by the University of Leeds through a Tetley and Lupton scholarship.
During this research, I was financially supported by the Department of Statistics,
University of Leeds.
Abstract
Structural genomics projects aim to provide structural data or accurate models
for uncharacterised proteins (Brenner and Levitt, 2000). The motivation for these
initiatives is the knowledge that similarity between protein structures can provide
evidence of common evolutionary ancestry (and hence possible functional similarity)
even where sequence similarity lies undetectable because structure is conserved for
longer in evolution than sequence (Chothia and Lesk, 1986). Recent advances in
high-throughput protocols for structural determination of structural genomics target
proteins have produced an explosion in volume of structural data prior to knowledge
of protein biochemical function. With these advances has come the need to rapidly
predict functions for proteins based on structure.
We present statistical matching of functional sites. In particular, we use
the EM algorithm in a mixture model formulation to solve for correspondence and
alignment in matching two configurations of functional sites. We extend the EM
algorithm of Kent et al. (2004) to incorporate concomitant information in matching
functional sites. We also extend the method of Green and Mardia (2006) to matching
configurations of coupled points using hierarchical models for Bayesian alignment.
We also present goodness-of-fit statistics for matching two functional sites under
the Gaussian error model. We consider the Procrustes statistic for matching of
forms. The Procrustes statistic is related to the RMSD except for a divisor. P-values
are used to indicate goodness-of-fit. Related but harder is the problem of finding
the distribution of the minimum Procrustes statistic when the points are unlabelled.
We first discuss this problem and its inherent difficulty. For illustrative
purposes, we use Gaussian configurations on a line.
Key words: active site, binding site, Bayesian, Bioinformatics, correspondence
and alignment, EM algorithm, functional site, hierarchical models, Markov chain
Monte Carlo, mixture model, Procrustes, Root mean square deviation.
Contents
Abstract
Abbreviations and Acronyms
About this Thesis
Overview and Organisation
Research Goals and Aims
Contributions
Conference Papers
1 Introduction and Literature Review
1.1 Introduction
1.1.1 Mathematical Abstraction of the Problem
1.1.2 Motivation
1.1.3 Proteins
1.1.4 Functional Sites
1.1.5 The SITESDB Database
1.1.6 Structure comparisons
1.1.7 Objectives in matching protein structures
1.2 Literature Review
1.2.1 Matching and Superposition Algorithms
1.2.2 Extreme Values in Bioinformatics
2 Exploratory Analysis of Protein Geometry
2.1 Background
2.1.1 Inter-event distances
2.1.2 Point to nearest event distances
2.1.3 The K-function
2.2 Functional Sites
2.3 Protein Structures
3 Simulation Design and Evaluation of Algorithms
3.1 Simulations
3.1.1 Functional Sites Simulations
3.1.2 Whole Structure Simulations
3.1.3 Appropriateness of Simulated Data
3.2 Evaluation
3.2.1 Correct Correspondence
3.2.2 Scores
4 Match Statistics
4.1 Goodness-of-fit Statistics for Rigid Body Superpositions
4.1.1 Minimum RMSD Distribution
4.1.2 Distribution of Size-and-shape Distance
4.1.3 Simulations for RMSD Distribution
4.1.4 Application
4.1.5 Summary
5 EM Algorithm Alignment
5.1 Mixture Model
5.1.1 Soft Matching of Forms
5.1.2 Model Likelihood
5.1.3 The EM Algorithm
5.1.4 Hardening of Soft Matches
5.2 Concomitant Information in the Mixture Model
5.2.1 Concomitant Information Model
5.2.2 Colour Weighting
5.2.3 Evaluation
5.2.4 Application on Matching Functional Sites
5.2.5 Using Amino Acid Group Information
5.2.6 Summarising Comments
5.3 Distance Constraints
5.3.1 Results
5.4 Multiple Transformations
5.4.1 Soft Matching Model
5.4.2 Model Likelihood
5.4.3 The EM Algorithm
5.4.4 Simulations
6 Bayesian Alignment
6.1 Bayesian Hierarchical Model
6.1.1 Point Process Model, with Geometrical Transformation and Random Thinning
6.1.2 Formulation of Poisson Process Prior
6.1.3 Data Likelihood
6.1.4 Prior Distributions and Computations
6.1.5 Inference
6.1.6 Using Concomitant Information
6.1.7 Results for Graph Theoretic and MCMC
6.1.8 Sensitivity of Poisson Prior Assumption
6.2 Using Two Atoms for each Amino Acid
6.2.1 Prior Distributions and Computations
6.2.2 Updating M
6.2.3 Results
6.2.4 Comments
7 Bayesian Refinement of Graph Solutions
7.1 Introduction
7.2 Motivation
7.3 Method
7.3.1 Representation and Matching
7.3.2 Graph Theoretic Step
7.3.3 MCMC Refinement Step
7.3.4 Accounting for Physico-chemistry Properties
7.3.5 Assessing Quality of Matches
7.4 Applications
7.4.1 Case 1: Alcohol Dehydrogenase and Family
7.4.2 Case 2: 17-β Hydroxysteroid Dehydrogenase and Family
7.4.3 Case 3: Alcohol Dehydrogenase and Superfamily
7.4.4 Case 4: Alcohol Dehydrogenase and FAD/NAD(P)-binding Domain
7.4.5 Assessing MCMC Refinement
7.5 Comments
8 Conclusions and Further Work
8.1 Conclusions
8.1.1 Functional Sites
8.1.2 Simulating Random Protein Structures
8.1.3 Matching Algorithms
8.1.4 Concomitant Information
8.1.5 Hardening Soft Matches
8.1.6 Assessing Significance of Matches
8.1.7 Application: Matching NAD Binding Functional Sites
8.2 Further Work
8.2.1 Simulating Random Protein Structures
8.2.2 Matching Statistics
8.2.3 Matching Algorithms
8.2.4 Application: Matching NAD Binding Functional Sites
Bibliography
A Computational Cost
A.1 Processor Times
A.2 Comments
List of Figures
1.1 Peptide bond.
1.2 Functional site in 5-aminolaevulinate dehydratase protein structure.
1.3 RasMol ball representation of Cα for functional sites of 17-β hydroxysteroid
dehydrogenase and carbonyl reductase.
2.1 The K-function and inter-point distance distribution for a functional
site of 17-β hydroxysteroid dehydrogenase.
2.9 The K-function and inter-point distance distribution for Cα atoms in
17-β hydroxysteroid dehydrogenase structure.
2.13 The K-function and inter-point distance distribution for all atoms in
17-β hydroxysteroid dehydrogenase.
3.1 Virtual distances and angles in a protein backbone.
3.2 Distance constraints in a protein virtual backbone.
3.3 Orientation of Cα atoms in simulating protein short chains.
3.4 Typical chain realisations in short protein chain simulations without
hydrophobic effects.
3.5 Chain realisations in short protein chain simulations with hydrophobic
effects.
3.6 The K-function and inter-point distance distribution for a simulated
hardcore configuration.
3.7 The K-function and inter-point distance distribution for a simulated
short chain configuration.
4.1 RMSD against number of corresponding points.
4.2 RMSD histogram, approximate and empirical distribution functions.
5.1 Correct correspondence proportions for different hardening methods.
5.2 Illustrative example of data-driven weights for matching.
5.3 Correct correspondence proportions for various weighting schemes.
Bayesian: simple prior conditional probabilities.
5.4 Correct correspondence proportions for various α levels.
5.5 Convergence regions of starting values for EM algorithm.
5.6 Superposition of carbonyl reductase and 17-β hydroxysteroid dehydrogenase
sites when matching with EM algorithm.
5.7 Match scores and RMSD when using weights.
6.1 Acyclic graph for Bayesian hierarchical model.
6.2 Corresponding amino acids found by MCMC method where the graph
theoretic method gives worse solutions.
6.3 True correspondence proportions for MCMC, graph and EM algorithm
methods for hardcore and Poisson model data.
6.4 True correspondence proportions for MCMC and graph for hardcore
data with large variance.
6.5 Corresponding amino acids in matching functional sites of 17-β
hydroxysteroid dehydrogenase and carbonyl reductase using Cα and
Cβ atoms in MCMC and graph theoretic methods.
6.6 Corresponding amino acids in matching functional sites of 17-β
hydroxysteroid dehydrogenase and carbonyl reductase using Cα atoms
only in MCMC and graph theoretic methods.
7.1 RMSD against number of corresponding amino acids for matching
alcohol dehydrogenase NAD-binding site against NAD(P)(H) binding
sites of SCOP alcohol dehydrogenase-like family proteins with/without
amino acid property information.
7.2 Effect of MCMC refinement on graph matches of the NAD-binding
functional site of alcohol dehydrogenase against NAD(P)(H) binding
sites of SCOP alcohol dehydrogenase-like family proteins.
7.3 Corresponding amino acids between the NAD-binding site of alcohol
dehydrogenase and NADP-binding site of quinone oxidoreductase
before and after MCMC refinement.
7.4 Corresponding amino acids between the NAD-binding site of alcohol
dehydrogenase and NADP-binding site of hypothetical protein YhdH
before and after MCMC refinement step.
7.5 Effect of MCMC refinement on graph matches of 17-β hydroxysteroid
dehydrogenase NADP-binding site against NAD(P)(H) binding
sites of SCOP tyrosine dependent oxidoreductase family proteins.
7.6 RMSD against number of corresponding amino acids for matching
17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H)
binding sites of SCOP tyrosine dependent oxidoreductase family proteins
with/without amino acid property information.
7.7 Superposition of matching amino acids between alcohol dehydrogenase
and glyceraldehyde-3-phosphate dehydrogenase binding sites after
MCMC refinement.
List of Tables
3.1 Frequencies of amino acids.
3.2 Relative mutabilities of amino acids.
3.3 Amino acid substitution matrix.
3.4 Target (desired) distances between Cα atoms in simulated short chains
of proteins.
4.1 Best fitting functional sites in the database when matched against
5-aminolaevulinate dehydratase functional site.
5.3 Example functional sites for comparing results when using or not
using concomitant information in the EM algorithm.
5.4 Groups of amino acids.
5.5 Comparison of with and without colour matching results.
5.6 Matching results using Gold (2003) method.
5.7 Matching statistics for 17-β hydroxysteroid dehydrogenase and
representative functional sites with/without colour information in EM
algorithm and graph methods.
5.8 Proportions of correct correspondence and rotation errors when
matching forms with two transformations.
6.1 Matching statistics for 17-β hydroxysteroid dehydrogenase and
families representative functional sites using graph and MCMC methods
(cases with MCMC doing better).
6.4 Matching statistics for 17-β hydroxysteroid dehydrogenase and
families representative functional sites using graph and MCMC methods
(cases with graph doing better).
6.8 Number of same pairs in MCMC and graph solutions when using Cα
atoms only and when using both Cα and Cβ atoms.
7.1 Assessment of statistical significance of functional site matching
before and after MCMC refinement step with/without amino acid property
information.
7.2 RMSD(A) before and after MCMC refinement step without amino
acid property.
7.3 The number of matched amino acids before and after MCMC refinement
step without amino acid property.
7.4 RMSD(A) before and after MCMC refinement step with amino acid
property.
7.5 The number of matched amino acids before and after MCMC refinement
step with amino acid property.
A.1 Database search times.
Abbreviations and Acronyms
CATH: Class, Architecture, Topology and Homologous superfamily
CE: Combinatorial Extension
DALI: Distance (matrix) alignment
DP: Dynamic programming
eF-site: electrostatic surface of Functional site
FAD: Flavin Adenine Dinucleotide
FSSP: Families of Structurally Similar Proteins
LA: Linear assignment
LP: Linear programming
MCMC: Markov chain Monte Carlo
NAD(P)(H): Nicotinamide Adenine Dinucleotide (Phosphate)
PDB: Protein Data Bank
PINTS: Patterns In Non-homologous Tertiary Structures
pvSoar: pocket and void surfaces of amino acid residues
RMSD: Root mean square deviation
SCOP: Structural Classification of Proteins
SitesBase: Database of ligand binding site similarities
SITESDB: Sites Database
About this Thesis
This thesis concerns statistical approaches to matching proteins. We consider matching
functional sites of proteins using atom coordinates, with the types of amino acids as
concomitant information.
We also consider matching configurations of coupled points. By coupled points
we mean two spatially dependent points, i.e. points which are, say, physically
connected, e.g. two atoms from the same amino acid. This arises in the Bioinformatics
application of aligning functional sites, where amino acids are matched using two atoms
from each amino acid to take into account the relative orientation of the amino acid.
We are also interested in statistics which measure the quality of the matching.
Overview and Organisation
The thesis is divided into eight chapters. We introduce the problem and review
the current literature in Chapter 1. In particular, we briefly describe the
graph theoretic formulation of the structural similarity problem (Gold et al., 2003) in
section 1.2.1.
Exploratory analysis of functional sites is presented in Chapter 2. Simulation
of proteins and functional sites is considered in Chapter 3. Work on matching
statistics is presented in Chapter 4. We consider the EM algorithm for the mixture
model formulation of the problem (Taylor et al., 2003) in Chapter 5, section 5.1. In
section 5.2 we investigate the added value of using concomitant information in the
mixture model. An application of the graph theoretic method and the EM algorithm
to representatives of functional sites from the tyrosine dependent oxidoreductase family
is discussed in section 5.2.4. We give some of the results for the graph theoretic and
MCMC (Green and Mardia, 2006) methods in section 6.1.7. An extension of the
Bayesian alignment method of Green and Mardia (2006) to matching coupled points
is presented in section 6.2. In Chapter 7 we undertake a Bayesian refinement of graph
solutions for matching protein functional sites.
Chapter 8 gives conclusions on the work in this thesis and outlines possible future work.
Research Goals and Aims
The two main aims of this research are to investigate:
(a) Optimising correspondence and alignment between two configurations using
concomitant information.
(b) Statistics and their distributions for matching configurations.
Correspondence and alignment
We aim to investigate and develop methods for solving correspondence and alignment
between configurations with concomitant information. This aim is multi-objective:
we want a maximal number of corresponding points that are concordant with
respect to the concomitant information, and also a close geometrical alignment of
the configurations. We investigate effective ways of optimising correspondence and
alignment with regard to these objectives.
Optimisation of multi-objective problems
When faced with a multi-objective problem, the traditional approach is to construct
a composite objective function that incorporates all the individual objectives. This
is called preference-based multi-objective optimisation. The other approach is
multi-objective optimisation per se.
Preference-based multi-objective optimisation
The composite objective is usually a weighted sum of all the objectives. This way of
handling multi-objective optimisation is much simpler. Its disadvantage
is the subjectivity in choosing the weights (Deb, 2001). This subjectivity is
particularly acute in our problem of protein matching: there is no clear indication
of how to weight the two objectives. Protein 3-dimensional
structure comparisons are mainly by root mean square deviation (RMSD). Sometimes
a score derived from amino acid type matches alone is used (Gold, 2003). There is no
universally accepted score function combining amino acid type matches and RMSD.
Multi-objective optimisation
Recently, with the advent of evolutionary algorithms (EAs), there has been a rise in
interest in multi-objective optimisation per se. This approach does not lead to one
solution but to a set of optimal solutions called Pareto-optimal solutions. Users choose
one of the obtained solutions using higher-level information. Refer to Deb (2001)
for a detailed discussion of multi-objective optimisation. This approach
would be highly relevant to protein 3-dimensional structure matching.
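The contrast between the two approaches can be illustrated with a small sketch. It is
not taken from the thesis: the candidate matches, their values, the weights and the
names Match, dominates, pareto_front and weighted_best are all hypothetical. Each
candidate is described by two objectives only, the number of corresponding points (to
be maximised) and the RMSD (to be minimised).

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """A hypothetical candidate match described by two objectives."""
    name: str
    n_pairs: int   # number of corresponding points (larger is better)
    rmsd: float    # RMSD after superposition (smaller is better)


def dominates(a: Match, b: Match) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.n_pairs >= b.n_pairs and a.rmsd <= b.rmsd
    better = a.n_pairs > b.n_pairs or a.rmsd < b.rmsd
    return no_worse and better


def pareto_front(matches: List[Match]) -> List[Match]:
    """Multi-objective optimisation per se: keep every non-dominated candidate."""
    return [m for m in matches if not any(dominates(o, m) for o in matches)]


def weighted_best(matches: List[Match], w_pairs: float, w_rmsd: float) -> Match:
    """Preference-based optimisation: a single weighted-sum score picks one candidate."""
    return max(matches, key=lambda m: w_pairs * m.n_pairs - w_rmsd * m.rmsd)


# Illustrative values only; they are not results from the thesis.
candidates = [
    Match("A", 5, 0.4),
    Match("B", 8, 1.1),
    Match("C", 7, 1.5),   # dominated by B: fewer pairs and a larger RMSD
    Match("D", 10, 2.0),
]

front = pareto_front(candidates)
print([m.name for m in front])
for w_rmsd in (1.0, 3.0, 10.0):
    print(w_rmsd, weighted_best(candidates, 1.0, w_rmsd).name)
```

On these illustrative values the Pareto front contains A, B and D, while the weighted
sum selects D, B or A as the weight on RMSD increases from 1 to 3 to 10. The single
answer returned by the preference-based approach therefore depends entirely on the
chosen weights, which is the subjectivity discussed above.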
Other objective functions
Energy function derivatives are also popular choices for optimisation in protein 3-
dimensional structure matching.
Statistics and distributions
The second aim of this research is to investigate and develop statistics of matching
3-dimensional configurations in general and protein structures in particular. We in-
vestigate these statistics and their distributions under “random” and “non-random”
configuration hypotheses.
Contributions
This thesis makes several contributions to matching geometrical configurations
in general and functional sites in particular.
(a) Exploration of the functional sites using spatial statistics tools. In Chapter 2,
we explore the spatial characteristics of functional sites using tools from spatial
statistics. Here we learn from point pattern analysis that functional sites tend
to be elongated tubular structures rather than isolated points in space. The
author’s contributions include doing the computations and statistical analyses.
(b) In Chapter 3, section 3.1.2, we propose an alternative to the method of Aszodi
and Taylor (1994) for simulating short virtual peptide chains. Their
method iterates between “distance space” and “Euclidean space”
(coordinates). We experiment with a similar but simplified method based on
coordinates only. The author’s contributions are
• Derivation of the formula for coordinates given the conformational angle.
• Strategy for modelling the hydrophobic effect.
(c) Goodness-of-fit statistics in section 4.1, Chapter 4, are used for comparing the
quality of matches with different numbers of matched points. Using p-values
we arrive at similar conclusions to those obtained with the score proposed by Gold (2003).
The author’s contribution was to derive the approximate distribution for RMSD,
starting from the size-and-shape distribution (Dryden and Mardia, 1998) under
the isotropic Gaussian error model for matching configurations.
(d) In section 5.2 of Chapter 5 we formulate a mixture model with concomitant
information for matching functional sites. This is an extension of the method in
Kent et al. (2004). In section 5.3, we constrain matches in order to get better
solutions. Another contribution on matching is the framework for allowing
multiple transformations (section 5.4) in alignment. The author’s contributions
are
• Formulating a model and likelihood with concomitant information, where
concomitant information is assumed to be independent of geometrical
information.
• Introducing techniques to constrain matches in order to get better matching
solutions.
• Formulating the model and likelihood for multiple transformations.
(e) In section 6.2 we extend the method of Green and Mardia (2006) to matching
configurations of coupled points. The author’s contribution in this work is a way of
accounting for the dependence between atoms from the same amino acid.
(f) We present a new method in Chapter 7 for matching protein functional sites,
based on initial graph matching followed by refinement using a Markov chain
Monte Carlo (MCMC) procedure in a Bayesian hierarchical modelling framework.
The author’s contributions in this work include
• Formulation to account for side chains.
• The meta algorithm (adding a refinement step to the graph theoretic method).
• Extending the software implementation to account for two atoms in MCMC.
• Modifying the graph theoretic software to account for physico-chemical
properties.
• Computations and statistical analysis of the results.
Conference Papers
• Mardia, K.V., Green, P.J., Nyirongo, V.B., Gold, N.D. and Westhead, D.R.
(2006). Bayesian refinement of protein functional site matching. Submitted.
• Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2005). EM algorithm,
Bayesian and distance approaches to matching active sites. Mathematical and
Statistical Annual Meeting in Bioinformatics, Abstracts pp. 13-14. Rothamsted.
• Mardia, K.V. and Nyirongo, V. (2004). Procrustes statistics for unlabelled
points and applications. In R.G. Aykroyd, S. Barber, and K.V. Mardia (Eds.),
Bioinformatics, Images, and Wavelets, p. 137. Department of Statistics,
University of Leeds.
• Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2003). Protein Matching
Using Amino Acids Information. In R.G. Aykroyd, K.V. Mardia and M.J.
Langdon (Eds.), Stochastic Geometry, Biological Structure and Images, p. 147.
Department of Statistics, University of Leeds.
Chapter 1
Introduction and Literature Review
Matching and aligning 3-dimensional protein structures is an active area
of research in Bioinformatics. This involves developing algorithms for matching, as
well as statistics and distributions of measures for quantifying the quality of matching
and alignment. In this chapter we give some background to the research problem
being addressed. In section 1.2 we review the current literature on the topic.
1.1 Introduction
The main subject of this research is the use of statistical approaches to matching
configurations of points in three dimensions. The research problem is mathematically
formulated in section 1.1.1.
1.1.1 Mathematical Abstraction of the Problem
We have two point configurations, {µ_i} and {x_j} in ℜ^d, for i = 1, . . . , m and
j = 1, . . . , n. Without loss of generality we can assume n ≤ m (see section 5.1.1).
In addition to coordinates, the points have some attributes, e.g. colour. The colours
of the points are the concomitant information. We wish to match these
configurations in some defined optimal way. Matching in this case means finding the
corresponding points and the rigid body motion required to bring the configurations into
registration or superimposition.
An optimal matching is one with (in no particular order of importance):
(a) a maximised number of corresponding points q ≤ n ≤ m;
(b) a minimised average distance between the corresponding points under a rigid body
motion transformation of the coordinates; and
(c) a maximised number of corresponding points with similar or the same colour
(attribute).
It is not certain how much importance to attach to each requirement in the
above vaguely defined optimality criteria. Informally, a match is regarded as best if
as many corresponding points as possible are as close as possible geometrically and
as many corresponding points as possible are similarly coloured.
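As a minimal sketch of how the three criteria could be evaluated for one given
correspondence, the following computes q, the RMSD after an optimal rigid body
superposition, and the number of same-coloured pairs. It assumes NumPy and uses the
standard SVD-based least-squares superposition (often called the Kabsch algorithm)
as a stand-in; it is not the matching method developed in this thesis, and the names
kabsch_rmsd and match_summary, as well as the example data, are hypothetical.

```python
import numpy as np


def kabsch_rmsd(mu, x):
    """RMSD between paired rows of mu and x after optimal rigid superposition of x."""
    mu_c = mu - mu.mean(axis=0)          # remove translation by centring
    x_c = x - x.mean(axis=0)
    # The SVD of the cross-covariance matrix yields the least-squares rotation.
    u, _, vt = np.linalg.svd(x_c.T @ mu_c)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    s = np.eye(mu.shape[1])
    s[-1, -1] = np.sign(np.linalg.det(u @ vt))
    rot = u @ s @ vt
    diff = x_c @ rot - mu_c
    return float(np.sqrt((diff ** 2).sum() / len(mu)))


def match_summary(mu, x, col_mu, col_x, pairs):
    """Return (q, RMSD, number of same-coloured pairs) for pairs of indices (i, j)."""
    i_idx = [i for i, _ in pairs]
    j_idx = [j for _, j in pairs]
    rmsd = kabsch_rmsd(mu[i_idx], x[j_idx])
    same = sum(col_mu[i] == col_x[j] for i, j in pairs)
    return len(pairs), rmsd, same


# Hypothetical data: x is a rotated and shifted copy of four of the five mu points.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0],
               [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
x = mu[[2, 0, 1, 4]] @ rot_z.T + np.array([5.0, -2.0, 1.0])
col_mu = ["r", "g", "b", "r", "g"]
col_x = ["b", "r", "g", "r"]             # the last colour deliberately disagrees
pairs = [(2, 0), (0, 1), (1, 2), (4, 3)]

q, rmsd, same = match_summary(mu, x, col_mu, col_x, pairs)
print(q, rmsd, same)
```

Here q = 4, the RMSD is zero up to rounding error because x is an exact rigid
transformation of the matched points, and three of the four pairs share a colour.
The sketch only scores a correspondence that is already given; finding the
correspondence itself is the harder problem addressed in later chapters.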
1.1.2 Motivation
This work is particularly motivated by an application in Structural Bioinformatics,
where pair-wise or multiple matching of 3-dimensional structures of proteins is of
interest. Sometimes matching just the functional parts of proteins (functional sites) is
of importance. We give a brief introduction to proteins and functional sites in the
following sections.
1.1.3 Proteins
Proteins are essential for the functioning of living organisms (Branden and Tooze,
1999; Lesk, 2000). They perform a wide variety of functions in an organism. For
convenience, proteins can be divided into several major classes, including but not
limited to structural proteins, transport proteins, messenger proteins and enzymes.
The most familiar of the structural proteins are probably the keratins, which form
the protective covering of all land vertebrates: skin, fur, hair, wool, claws, nails,
hooves, horns, scales, beaks and feathers. Equally widespread are the actin and myosin
proteins of muscle tissue. Another group of structural proteins are the silks and
insect fibres. In addition, there are the collagens of tendons and hides, which form
connective ligaments within the body and give extra support to the skin where
needed.
Transport proteins include serum albumin, haemoglobin and myoglobin. Serum
albumin transports water-insoluble lipids in the bloodstream. Haemoglobin carries
oxygen from the lungs to the tissue. Myoglobin performs a similar function in muscle
tissue, taking oxygen from the haemoglobin in the blood and storing it or carrying
it around until needed by the muscle cells.
Messenger proteins are one of the means by which cells in one part of the body
communicate with cells in another part of the body. They are generally quite small
for proteins. Many are hormones, although not all hormones are proteins. Two
examples are oxytocin, which occurs in females and stimulates uterine contractions
during childbirth, and vasopressin, whose major function is as an anti-diuretic.
Each function or use demands its own protein structure, and their interaction
depends on the 3-dimensional configuration which is the set of all 3-dimensional
coordinates of all atoms. However, there are four different levels of protein structure.
• primary:
the sequence of amino acids.
• secondary:
repeated patterns of local three-dimensional structure in the amino acids (α-
helix, β-sheet/β-strands).
• tertiary:
the full three-dimensional structure of a peptide chain, described as atomic
coordinates or conformational angles (φ and ψ).
• quaternary:
one or more peptide chains which together form the fully functional protein.
The main challenge is how to infer the structure components as well as the
function from the primary level of the amino acids. These are the problems of
protein structure and function prediction. There are various approaches to studying
proteins:
(a) Biophysical approach:
simulate the action of the physical laws that operate when the polypeptide
chain folds into the 3-dimensional structure. Consider all possible combinations
and look among them for those with the lowest energy.
(b) Sequence based approach:
use the information from the sequence of amino acids to match directly.
(c) Homology approach:
proteins with homologous structure have a similar 3-dimensional structure and
function, but serious exceptions exist.
(d) Combination:
combine (a), (b) and (c) with physico-chemical properties/evolutionary
relationships.
Following Mardia et al. (2003), here we will simply define
Protein = {C1, C2, . . . , Ck}
as an unordered set of k peptide chains Ci, where Ci = {si1, . . . , siNi} is an ordered
sequence of amino acid residues sij ∈ {P1, . . . , P20}, j = 1, . . . , Ni, and Pl is the lth
amino acid type, l = 1, . . . , 20. Note that typically Ni lies between 200 and 2000.
An amino acid residue is a set of atoms (and covalent bonds). This atom set can
be partitioned into backbone atoms, B (same for every residue type) and side-chain
atoms Rl (differing between residue types). Figure 1.1 shows two amino acids (si
and si+1) joined by a “peptide bond”.
Thus Pl = {B, Rl}. The peptide chain may be known only at the sequence level, where
the identities sij of the amino acid residues are known but there is no information
about three-dimensional structure. This is commonly the type of information that
emerges from genome sequencing projects. In a minority of cases, three-dimensional
structure information may be available in the form of x, y and z Cartesian coordinates
for all the protein atoms. Information about the association of peptide chains
into complete proteins (quaternary structure) may be available in some cases.
Figure 1.1: Peptide bond joining two amino acids. [Figure labels: residues Si and
Si+1, side chains Ri and Ri+1, Cα atoms, angles φ and ψ, peptide bond, H2O.]
The amino acids can be labelled by the side chain (Ri), which takes one of 20
types. For example, with Ri = H we have glycine and with Ri+1 = CH3 we have alanine.
These are sometimes also referred to as peptide units. Each peptide unit can only
rotate around the N-Cα and Cα-C′ bonds; the corresponding angles φ and ψ are also
of interest.
Amino acids have different physico-chemical properties and can be grouped
according to shared properties, e.g. hydrophobic or hydrophilic (see Table 5.4 for
one possible grouping). Hydrophobic amino acids are those with side chains
that do not like to reside in an aqueous (i.e. water) environment. For this reason,
these amino acids are generally buried within the hydrophobic core of a protein. On
the other hand, non-hydrophobic or hydrophilic amino acids tend to interact with
the aqueous environment and are predominantly found on the exterior surfaces of
proteins or in the reactive centres. This property is more important for transport
proteins. These proteins are often globular structures and are generally tightly
packed (compact) with hydrophilic (polar) side chains on the outside to enhance
their solubility in water. They typically have hydrophobic (non-polar) side chains
folded to the inside to keep water from getting in and unfolding them. In section
3.1.2 we take into account hydrophobic/hydrophilic properties of the side chains in
order to simulate globular, compact structures. We also take into account
physico-chemical properties in matching functional parts of proteins (see section 1.1.4) in
Chapters 5, 6 and 7.
The Swiss-Prot data bank contained sequence data for more than 212,425 proteins
as of 21st March, 2006. Protein 3-dimensional structures derived from X-ray diffraction
and neutron-diffraction studies of crystallised proteins are housed at the Protein
Data Bank (PDB). As of 28th March, 2006, the PDB held about 35,813 structures,
which can be accessed at http://www.rcsb.org.
1.1.4 Functional Sites
Although proteins are large molecules, in many cases only a small part of the
structure, a functional site (e.g. in Figure 1.2), is functional, the rest existing only
to create and fix the spatial relationship among amino acids of the functional site.
The term functional site refers to both active sites and binding sites. An active
site is a part of the protein where chemical reactions occur, while a binding site is
a region which binds specific ligands (smaller molecules). For example, Figure 1.2
shows a functional site in the 5-aminolaevulinate dehydratase protein structure.
1.1.5 The SITESDB Database
In this thesis all functional sites were taken from a database of known sites (SITESDB)
(Gold, 2003). SITESDB had 91,441 entries (functional sites) as of 28th March, 2006.
The median and mean number of amino acids per site were 10 and 16 respectively.
The lower and upper quartiles were 10 and 19. The range was from 1 to 120.
SITESDB entries were automatically formed from the PDB (Berman et al., 2000)
by locating the local protein environment (amino acids within 5 Å) around bound
ligands (identified by PDB HETATM records) and author annotated active sites
(identified by PDB SITE records). A protein may contain multiple functional sites,
so unique identifiers for SITESDB entries were generated from the four letter PDB
identifier with an extra integer to distinguish sites from the same protein. For
example, the identifiers 1hdx_0 and 1hdx_1 were separate sites from the protein
with PDB identifier 1hdx.
Figure 1.2: Functional site in 5-aminolaevulinate dehydratase protein structure.
The automatic extraction of sites results in multiple and incomplete representa-
tions of functional sites containing more than one bound ligand, or sites that are
both annotated with SITE records and contain bound ligands. In these cases a
better biochemical description of the site was obtained by merging component sites
without duplication of their amino acid contents. Sites were merged if ligand atoms
occurred within 5 Å of atoms in a second ligand (cf. the 5-5 rule in Park et al., 2001;
Dafas et al., 2004; Gong et al., 2005). In the absence of bound ligands, sites were
merged if they were found to contain common amino acid residues.
Availability
SITESDB is accessible at http://www.bioinformatics.leeds.ac.uk (hosted by
the Institute of Molecular and Cellular Biology, University of Leeds). The database
currently contains more than 90,000 functional sites.
1.1.6 Structure comparisons
The 3-dimensional structure of a protein is very important in understanding how
proteins function, as other proteins with similar 3-dimensional structures are likely to
have related functions. Therefore comparing 3-dimensional protein structures is
very important. A newly determined protein structure with 3-dimensional structure
similar to a protein with a known function is likely to have a similar function. This
would facilitate predicting the function of a newly determined protein structure.
The other useful application is protein homology detection. Structure comparison
can complement sequence similarity which is commonly used for homology mod-
elling. Homology refers to proteins having descended from a common ancestor. The
importance of 3-dimensional comparisons cannot be overemphasised as these are
more conserved than the amino acid sequences in homologous proteins.
Although overall protein structure comparisons are widely done and very useful in
some applications (see the literature review in section 1.2), they sometimes have
difficulty identifying situations where proteins share similar structures and are clearly
related in evolution, yet have different functions. The reverse, i.e. proteins with
functional similarity but differences in their structure, also presents difficulties,
e.g. overall fold comparison misses the functional similarity of subtilisin and
chymotrypsin (Blow et al., 1969; Wright et al., 1969). Thus, to complement fold
comparisons, we consider comparing functional sites of proteins.
1.1.7 Objectives in matching protein structures
To appreciate the difficulty involved in matching protein structures or parts thereof,
consider the configurations of functional site Cα atoms in Figure 1.3 from 17-β
hydroxysteroid-dehydrogenase and carbonyl reductase proteins. These functional
sites are related, but which and how many atoms correspond is unknown. Moreover,
it is not always known a priori whether the functional sites are related or not. Our
aim is to match atoms of these configurations. Functional site matching has two objectives:
(a) To match the proteins geometrically so as to minimise the root mean square
deviation (RMSD),
\[ r(x, y) = \left[ \sum_{i=1}^{q} \|x_i - y_i\|^2 / q \right]^{1/2}, \]
where we are given q points of configuration {x} and the corresponding q
points of configuration {y}. The matched proteins should come as close as
possible (minimal RMSD) when the configurations are superimposed on each other.
(b) The second objective is to maximise the matches of similar residues.
These objectives are often conflicting. Hence the question is how to optimise
this multi-objective matching problem.
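As a minimal sketch of objective (a), the RMSD of two configurations with a known correspondence can be computed as follows. The function name `rmsd` and the toy coordinates are illustrative assumptions, and the configurations are assumed to be already superimposed.

```python
import math

def rmsd(x, y):
    """Root mean square deviation between q paired points.

    x[i] is assumed to correspond to y[i]; both configurations are
    assumed to be already superimposed (no alignment is done here).
    """
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty configurations of equal size")
    q = len(x)
    total = sum(math.dist(xi, yi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(total / q)

# Hypothetical toy data: every matched pair is exactly one unit apart.
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
y = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(x, y))
```

Objective (b) would require an additional score over residue types, which is why the two objectives can pull in different directions.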
a) 1a27_0 (63 atoms) b) 1cyd_0 (40 atoms)
Figure 1.3: RasMol (Sayle and Milner-White, 1995) ball representation of Cα atoms for
functional sites of 17-β hydroxysteroid-dehydrogenase (1a27_0) and carbonyl
reductase (1cyd_0).
1.2 Literature Review
Whole domain structural comparison methods such as CE (Shindyalov and Bourne,
1998) and DALI (Holm and Sander, 1993) and databases such as FSSP (Holm et al.,
1992), CATH (Orengo et al., 1997) and SCOP (Hubbard et al., 1997) provide valu-
able insight into the functions of newly determined proteins. However, discovery of
proteins adopting similar folds but exhibiting a variety of functions i.e. superfolds
(Orengo et al., 1994) and proteins showing similar functions without common an-
cestry (Blow et al., 1969; Wright et al., 1969) poses problems for comparisons at the
fold level. Note that SCOP hierarchical classification consists of class, (super)fold
and (super)family.
Protein function is usually carried out by relatively small parts of protein surfaces
at ligand binding or catalytic sites and hence new structural comparison methods
focus on the precise structural nature of these functional sites (Artymiuk et al., 1994;
Binkowski et al., 2003; Kinoshita et al., 2002; Kleywegt, 1999; Shulman-Peleg et al.,
2004; Stark et al., 2003b; Wallace et al., 1997). These methods are based on the
idea that geometrically similar sites are likely to have similar functions since their
amino acids are conserved in precise orientations in order to perform their chemistry
or their similar shapes and physico-chemical properties may be selective for similar
small molecules such as substrates, inhibitors or cofactors. Hence, finding structural
similarity to functional sites of known and characterised proteins may facilitate
function prediction for newly determined protein structures even in the absence of
overall fold or sequence similarity.
Functional site comparison methods essentially fall into one of two categories.
The first category provides known templates of specific motifs of conserved amino
acids or atoms often involved in enzyme catalysis (Artymiuk et al., 1994; Kley-
wegt, 1999; Wallace et al., 1997). These are knowledge-based methods which aim
to discover new proteins with the same catalytic function. The second category
consists of similarity searching algorithms (Binkowski et al., 2003; Schmitt et al.,
2002; Shulman-Peleg et al., 2004; Stark et al., 2003b; Kinoshita et al., 1999) where
prior knowledge of motifs is not required and site similarity is assessed by how
closely the sites align and/or the proportion of overlap. Partial similarity between
sites can be detected and hence much larger sites such as ligand binding sites can
be compared. Methods addressing this problem generally represent functional sites
or functional site surfaces as mathematical graphs for graph-theoretic or geomet-
ric hashing comparisons where graph vertex positions are placed using a variety of
methods. CavBase (Schmitt et al., 2002), SiteEngine (Shulman-Peleg et al., 2004)
and PINTS (Stark et al., 2003b) for example use positions of pseudo-centres whereas
eF-site (Kinoshita et al., 1999) uses electrostatic potentials and surface curvature.
pvSoar (Binkowski et al., 2003) and SitesBase (Gold and Jackson, 2006) use alpha-
shapes and an all-atom model respectively. Recently, Green and Mardia (2006)
proposed a Bayesian hierarchical modelling approach using Cα atoms.
1.2.1 Matching and Superposition Algorithms
Finding the correspondence is intrinsically a combinatorial problem. Without
geometric constraints there are
\[ \sum_{q=1}^{\min(n,m)} q! \binom{n}{q} \binom{m}{q} \]
ways of choosing corresponding pairs
from two configurations with m and n points. However, with geometric constraints
the solution space is tremendously reduced. Matching methods exploit geometric
constraints to solve for correspondence.
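To get a feel for how quickly the unconstrained solution space grows, the count above can be evaluated directly. This is a small illustrative sketch; the function name `n_correspondences` is an assumption, not part of any cited method.

```python
from math import comb, factorial

def n_correspondences(n, m):
    """Number of ways to choose q >= 1 corresponding pairs, summed over q."""
    return sum(factorial(q) * comb(n, q) * comb(m, q)
               for q in range(1, min(n, m) + 1))

# Two points against two points: 4 single pairs plus 2 full matchings.
print(n_correspondences(2, 2))
# Three against three: 9 + 18 + 6.
print(n_correspondences(3, 3))
# The count for even small sites is astronomically large.
print(n_correspondences(10, 10))
```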
To show how geometric constraints make the correspondence problem feasible,
Kuhl et al. (1984) presented a naive brute force approach for matching a molecule
to a functional site.
A naive brute force method (Kuhl et al., 1984)
With the requirement that matching pairs are geometrically as close as possible, all
degrees of freedom are expended when three pairs of matches are made. Thus after
making three matches, simply check the coincidence of other points. Suppose two
configurations are {xj} and {µi}, j = 1, . . . , n and i = 1, . . . , m. Kuhl et al. (1984) in
their “DOCK” algorithm proceed as follows:
(a) For each unique set of three pairings ({i1, j1}, {i2, j2}, {i3, j3}) of points from the two
configurations:
i. Choose the first pair and superpose it by a translation b.
ii. Find the rotation A1 that brings the second pair into optimal superposition.
iii. Find the rotation A2 that superposes the third pair.
iv. This gives $x_{j_l} = A_1 A_2 \mu_{i_l} + b$, l = 1, 2, 3. Matching pairs are the coinciding
(closest and within a defined distance of each other) points of {x} and
{A1A2µ + b}.
v. Calculate the number of matched pairs and the RMSD.
(b) The solution is the combination which gives the largest number of matches. In
the case of several solutions with the same number of matching pairs, the one
with the smallest Procrustes distance may be taken.
The algorithm of Kuhl et al. (1984) goes through mn(m − 1)(n − 1)(m − 2)(n − 2)
combinations, i.e. mn ways to choose b, (m − 1)(n − 1) ways to choose A1 and
(m − 2)(n − 2) ways to choose A2. Some of these combinations are unnecessary because
the order of the three pairings is not important. Only mn(m − 1)(n − 1)(m − 2)(n − 2)/3!
combinations are needed.
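The reduction by a factor of 3! can be checked by direct enumeration for small configurations. The sketch below is only a counting illustration with hypothetical sizes m = 4 and n = 3; it does not perform any superposition.

```python
from itertools import combinations, permutations

def triple_pairings(m, n):
    """Yield each unordered set of three pairings (i1,j1),(i2,j2),(i3,j3).

    Fixing the i-indices in increasing order removes the 3! orderings of
    the same three pairings.
    """
    for i_triple in combinations(range(m), 3):
        for j_triple in permutations(range(n), 3):
            yield tuple(zip(i_triple, j_triple))

m, n = 4, 3
ordered = m * n * (m - 1) * (n - 1) * (m - 2) * (n - 2)
unordered = sum(1 for _ in triple_pairings(m, n))
print(ordered, unordered)
```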
The literature contains a few more efficient approaches to the problem of matching and
superimposing in Bioinformatics applications. These efficient matching
methods mainly fall into two categories:
(a) Algorithms iterating between solving for alignment and correspondence. Alignment
and correspondence support each other, making the problem solvable in a
reasonable time. These algorithms include the EM algorithm considered
by Kent et al. (2004), which is presented in section 5.1. Wu et al. (1998) also
use an iterative algorithm.
Also in this category is the approach of Green and Mardia (2006), who take
a Bayesian approach whereby they formulate a joint
model for alignment and correspondence. Conditional models for alignment and
correspondence are updated in turn. This framework is presented
in section 6.1.
(b) Combinatorial algorithms which utilise inter-point distance constraints. These
distance-based methods use graph theoretic algorithms to solve for correspon-
dence. Kuhl et al. (1984) proposed to use a graph algorithm of Bron and Ker-
bosch (1973) for matching a molecule binding to a functional site. Gold (2003)
implemented a parallelised database search tool on a Beowulf system, using ei-
ther Bron and Kerbosch (1973) or Carraghan and Pardalos (1990) graph clique
detecting algorithms to match functional sites.
Below we briefly describe the graph theoretic approach taken by Gold (2003). For
the iterative algorithm category, we briefly describe the approach of Wu et al. (1998).
Graph method (Gold et al., 2003)
The principles of graph theory have been applied to matching biomolecular config-
urations for some time e.g. Kuhl et al., 1984 and Artymiuk et al., 1994. Consider
points as representing amino acid positions. These points could have attributes
(concomitant information) representing amino acid groups or types. We need to match
two configurations of points {xj} and {yk} for j = 1, 2, . . . , m and k = 1, 2, . . . , n.
• Each configuration is represented by a mathematical graph.
• Vertices are placed at point positions.
• Each vertex is connected by an edge to every other vertex in the same graph.
• Each edge is labelled with the inter-point distance.
A search for the maximum similarity between two graphs G1 and G2, representing
configurations {x} and {y} respectively, corresponds to finding the maximal
common subgraph or a clique within the vertex product graph for G1 and G2
(Hv = G1 ◦v G2). The vertex product graph is defined as follows:
Definition 1.2.1. Let V1 and V2 be the sets of vertices for G1 and G2 respectively.
The vertex product graph Hv = G1 ◦v G2 includes the vertex set VH = V1 × V2, in
which the vertex pairs (xj, yk) with xj ∈ V1 and yk ∈ V2 have the same attribute.
An edge between two vertices vh = (xj, yk), vh′ = (xj′, yk′) ∈ VH exists for j ≠ j′
and k ≠ k′ if the absolute difference between the distances |xj − xj′| and |yk − yk′|
is less than some threshold, say δ = 1.5 Å.
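The definition can be illustrated with a short sketch that builds the vertex product graph and lists its maximal cliques using the basic Bron and Kerbosch recursion without pivoting. This is an illustrative sketch only, not the implementation of Gold (2003); the function names, the attribute labels and the toy coordinates are assumptions.

```python
from itertools import combinations
from math import dist

def product_graph(xs, ys, x_attr, y_attr, delta=1.5):
    """Adjacency sets of the vertex product graph.

    Vertices are index pairs (j, k) whose points share an attribute; an edge
    joins (j, k) and (j2, k2) when j != j2, k != k2 and the two inter-point
    distances differ by less than delta.
    """
    vertices = [(j, k) for j in range(len(xs)) for k in range(len(ys))
                if x_attr[j] == y_attr[k]]
    adj = {v: set() for v in vertices}
    for (j, k), (j2, k2) in combinations(vertices, 2):
        if j != j2 and k != k2 and \
                abs(dist(xs[j], xs[j2]) - dist(ys[k], ys[k2])) < delta:
            adj[(j, k)].add((j2, k2))
            adj[(j2, k2)].add((j, k))
    return adj

def maximal_cliques(adj):
    """Basic Bron-Kerbosch recursion, yielding every maximal clique."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# Hypothetical data: ys contains a translated copy of xs plus one distant point.
xs = [(0, 0, 0), (6, 0, 0), (0, 8, 0)]
ys = [(10, 0, 0), (16, 0, 0), (10, 8, 0), (30, 30, 30)]
adj = product_graph(xs, ys, ["a"] * 3, ["a"] * 4)
best = max(maximal_cliques(adj), key=len)
print(sorted(best))
```

Because only inter-point distances enter the edge rule, a clique found this way does not distinguish a configuration from its mirror image.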
Graph matches based on inter-point distances are not necessarily superimposable
(e.g. mirror image sites). Subsequently, a Procrustes algorithm (Kabsch, 1978) is
used to check that matched configurations are geometrically superimposable. The
Procrustes algorithm minimises the size-and-shape squared (least squares) distance
between two structures, say X1 and X2. The size-and-shape squared distance is
\[ d_S^2(X_1, X_2) = \inf_{A \in SO(d),\, b} \| X_2 - A X_1 - b \|^2 . \]
Here d = 3, SO(d) denotes the set of all d × d rotation matrices (orthogonal
matrices with determinant equal to +1) and b is the translation vector.
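A minimal NumPy sketch of this superposition check is given below, using the usual singular value decomposition route to the least squares rotation. It assumes that rows (not columns) of the coordinate arrays are corresponding points; the function names and test data are hypothetical.

```python
import numpy as np

def kabsch(X1, X2):
    """Rotation A (det +1) and translation b minimising
    sum_i ||X2[i] - (A @ X1[i] + b)||^2; rows are corresponding points."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    H = (X1 - c1).T @ (X2 - c2)            # 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last singular direction if needed so A is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    A = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    b = c2 - A @ c1
    return A, b

def size_and_shape_distance(X1, X2):
    """Squared least squares distance after optimal rotation and translation."""
    A, b = kabsch(X1, X2)
    return float(np.sum((X2 - (X1 @ A.T + b)) ** 2))

# Hypothetical data: X2 is X1 rotated about the z-axis and shifted.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((6, 3))
theta = 0.7
A_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
X2 = X1 @ A_true.T + np.array([1.0, -2.0, 0.5])
mirror = X1 * np.array([1.0, 1.0, -1.0])    # mirror image of X1

print(size_and_shape_distance(X1, X2))      # numerically zero
print(size_and_shape_distance(X1, mirror))  # stays positive
```

The second call shows why the check is needed: the mirror image has identical inter-point distances but cannot be superimposed by a rotation.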
Basically the algorithm is a three step process:
(a) Construct a vertex product graph.
(b) Find a maximal clique within the product graph.
(c) Check the 3-dimensional superimposition using Kabsch (1978).
In the least restrictive case all vertices (points) are assumed to have the same
attribute and hence matching can occur between any two points and depends
only on inter-point distances. Alternatively, points can be labelled with colours
(concomitant information) to restrict matching to points with the same colour, i.e.
colour is treated as an attribute. Although concomitant information can be incorporated as
attributes, this approach is very rigid. When matching functional sites, Gold (2003)
takes into account the amino acid type by introducing a score (presented in section
3.2.2). The algorithm of Bron and Kerbosch (1973) finds all common subgraphs in
addition to the clique. Concomitant information can be used to score all the common
subgraphs in order to give preference to the solution with the highest score. Gold (2003)
scores all complete subgraphs found by the algorithm of Bron and Kerbosch (1973) and
takes the one with a maximal score. However, the algorithm of Carraghan and Pardalos (1990)
finds just the clique, so it is not possible to use concomitant information with it.
Gold (2003) uses the algorithm of Carraghan and Pardalos
(1990) because it is faster; optionally, the algorithm of Bron and Kerbosch (1973)
can be used in order to account for concomitant information.
Iterative algorithm (Wu et al., 1998)
This method is for analysing multiple protein structures. It allows superposition
and averaging to be performed. The algorithm iterates between solving for
correspondence and superposition. Correspondence is solved by dynamic program-
ming and superposition by least squares regression.
Dynamic programming
Dynamic programming is used to align two sequences; specifically, it finds the
correspondence between two structures that minimises the overall distance between the
structures. Let i and j be the sequence indices of atoms in structures {µi} and
{xj} respectively, for i = 1, 2, . . . , m and j = 1, 2, . . . , n. Let d(j, i) be some distance
metric between atoms xj and µi. Then we can find two collinear sequences of
atoms, 1 ≤ j(1) < j(2) < · · · < j(q) ≤ n and 1 ≤ i(1) < i(2) < · · · < i(q) ≤ m, that
minimise the function
\[ \sum_{r=1}^{q} d(j(r), i(r)) + g(0, j(1)) + \sum_{r=1}^{q-1} h(j(r), j(r+1)) + g(j(q), n+1)
   + g(0, i(1)) + \sum_{r=1}^{q-1} h(i(r), i(r+1)) + g(i(q), m+1), \qquad (1.1) \]
where g(r, s) is the gap penalty for skipping from r to s at the end of either sequence,
and h(r, s) is the gap penalty for skipping from r to s in the middle of either sequence.
The algorithm makes two attempts to find a correspondence within each iteration.
First it uses the curvature at each Cα as the distance metric for matching. Then it
matches using the coordinates of Cα as the distance metric.
The premise behind
the algorithm is that an optimal correspondence can be constructed by adding two
aligned elements to a previously obtained optimal alignment. This insight means
that it is not necessary to search all possible alignments in order to obtain the optimal
one (given two n-length sequences, such a search takes time proportional to $n 4^n$). Rather,
dynamic programming sequentially adds elements to an optimal alignment that has
already been constructed. This reduces the time cost to just $O(n^2)$.
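The recursion can be sketched in a few lines. The sketch below is a simplification and not the scheme of Wu et al. (1998): it replaces the general penalties g and h of (1.1) by a single constant cost `gap` per skipped atom, and the function name and one-dimensional toy profiles are assumptions.

```python
def align(d, m, n, gap=1.0):
    """Order-preserving correspondence between sequences of length m and n.

    d(i, j) is the distance between atom i of the first structure and atom j
    of the second (0-based). Each skipped atom costs `gap` (an assumed,
    simplified penalty). Returns (total cost, list of matched index pairs).
    """
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap
    for j in range(1, n + 1):
        cost[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + d(i - 1, j - 1),  # match
                             cost[i - 1][j] + gap,                  # skip i
                             cost[i][j - 1] + gap)                  # skip j
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + d(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return cost[m][n], pairs

# Hypothetical one-dimensional profiles (e.g. a curvature-like summary).
mu = [0.0, 1.0, 5.0, 2.0]
x = [0.1, 1.1, 2.1]
total, pairs = align(lambda i, j: abs(mu[i] - x[j]), len(mu), len(x))
print(total, pairs)
```

The table has (m + 1)(n + 1) cells and each is filled in constant time, which is the source of the quadratic cost.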
Mechanics of the iterative algorithm (Wu et al., 1998)
In general the algorithm allows superposition of multiple proteins. Let Xj be a
coordinate matrix of corresponding atoms in the jth protein structure. Each column
in Xj represents an atom in the protein structure. In the least-squares formulation,
they find an affine model X and transformation matrices Aj (for j = 1, . . . , J) that
minimise the objective function
\[ \sum_{j=1}^{J} \| A_j X_j - X \|^2 . \qquad (1.2) \]
The algorithm consists of three steps:
(a) Compute a curvature function κ for each protein structure Sj. Find corresponding
landmarks $X_j^{(1)}$ by matching curvatures to a reference structure and
obtain the affine model $X^{(1)}$ and transformation matrices $A_j^{(1)}$ for j = 1, . . . , J.
(b) Find corresponding landmarks $X_j^{(2)}$ by matching coordinates to a reference
structure, and obtain the affine model $X^{(2)}$ and transformation matrices $A_j^{(2)}$.
(c) Find corresponding landmarks Xj by matching coordinates iteratively to the
evolving affine model, and obtain the affine model X and transformation
matrices Aj.
The iterative algorithm of Wu et al. (1998) assumes and uses sequence order
information in addition to spatial information in terms of point coordinates, while
the graph method of Gold (2003) uses spatial information in terms of inter-point
distances. A common problem with these approaches is that they do not take
concomitant information on amino acids into account in their matching and alignment in
a way flexible enough to model the amino acid substitution phenomenon taking place in
proteins.
1.2.2 Extreme Values in Bioinformatics
Minimum RMSD in protein matching
Extreme values of RMSD in Bioinformatics protein matching applications arise at
two levels. The first level is each pair-wise matching of two configurations, say
{µ} and {x} with m and n points respectively. In this case an RMSD that is optimal in
some sense is sought. The optimal RMSD could be defined to satisfy:
(a) it is the minimum RMSD among the $q! \binom{m}{q} \binom{n}{q}$ values, where q ∈ {2, . . . , min(n, m)} is the
number of matched points;
(b) after alignment, distances between matching points are within
a specified tolerance limit;
(c) q is maximal.
In Chapters 5 and 6 we look for the corresponding points and alignment that give
the minimum RMSD.
The second level is when searching the database for a match. Here the interest
is in matches with smaller RMSD. For example, best fitting matches are analysed
in Chapter 4, where we develop a method to rank best matches. In Chapter 7 we
follow Stark et al. (2003b) in using the Extreme Value Distribution (EVD) to quantify the
probability of matching by chance. In this set-up the null hypothesis is that the
matching configurations are random and the matching is due to mere chance (random
matches), that is, the matching configurations are not related in any way whatsoever.
Extreme value distribution as null distribution
In Bioinformatics applications, e.g. sequence matching or structure matching, the
sample space under the random matching hypothesis is practically infinite and
difficult to specify and calculate. Consider the question of two random 3-dimensional
structures. These can be of any sizes, say m, n = 1, 2, . . . , and give matches of size
q = 2, 3, . . . . Each structure with m > 2 or n > 2 points can take an infinite number
of configurations. Following Stark et al. (2003b), a practical way to specify the
random distribution is by collecting a large enough database of non-redundant and
non-homologous configurations. The database distribution is then used as the null
distribution under the random hypothesis. The database has to be non-homologous
and non-redundant to correctly control the Type I error rate. Ideally the background
database should be as large as possible.
Because the interest is in the extremes from an infinitely large database, limiting
Extreme Value Distributions (EVD) are used to model the background database
distribution. For example, let the distribution for the RMSD, r, be F(r) and denote
its limiting distribution by G(r). Because the limiting EVD depends only weakly on the
data-generating distribution function F(r), the null distribution can be modelled
reliably even in cases where F(r) is difficult to calculate, let alone specify.
All that is required is to know how F(r) depends on m, n and q.
Limiting extreme value distributions
Extremal types distributions are limiting distributions used to model extreme
deviations from the mean of probability distributions for stochastic processes.
Two approaches exist today:
(a) The tail fitting approach, currently the most common, based on the second
theorem in extreme value theory (Theorem II; Pickands, 1975; Balkema and de
Haan, 1974).
(b) The basic theory approach as described by Burry (1975).
In general this conforms to the first theorem in extreme value theory (Theorem
I; Fisher and Tippett, 1928; Gnedenko, 1943). The difference between the two
theorems is due to the nature of the data generation. For Theorem I the data are
generated over the full range, while for Theorem II data are only generated when they
surpass a certain threshold (Peak Over Threshold, or POT, models).
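There are three classes of limiting distributions for extreme values:
Gumbel
\[ G(r) = \exp\{-\exp(-r)\} \quad \text{for } -\infty < r < \infty ; \]
Fréchet
\[ G(r) = \begin{cases} 0, & r \le 0; \\ \exp(-r^{-\alpha}), & r > 0,\ \alpha > 0. \end{cases} \]
Negative Weibull
\[ G(r) = \begin{cases} \exp[-(-r)^{\alpha}], & r < 0,\ \alpha > 0; \\ 1, & r \ge 0. \end{cases} \]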
These classes are unified by re-parametrisation to give the Generalised Extreme
Value distribution, GEV(µ, σ, ξ), with distribution function
\[ G(r) = \exp\left\{ -\left[ 1 + \xi \left( \frac{r - \mu}{\sigma} \right) \right]_+^{-1/\xi} \right\} \qquad (1.3) \]
where $x_+ = \max(x, 0)$ and σ > 0, so up to type the GEV distribution is
\[ G(r) = \exp\left[ -(1 + \xi r)_+^{-1/\xi} \right]. \qquad (1.4) \]
• Gumbel corresponds to ξ = 0 (taken as the limit ξ → 0), i.e. GEV(0, 1, 0) =
Gumbel;
• Fréchet corresponds to ξ > 0, i.e. GEV(1, α^{-1}, α^{-1}) = Fréchet(α);
• Negative Weibull corresponds to ξ < 0, i.e. GEV(−1, α^{-1}, −α^{-1}) = Negative
Weibull(α).
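The unification can be checked numerically. The sketch below evaluates the GEV distribution function (1.3) and confirms two of the correspondences: the Gumbel case as the limit ξ → 0, and the Fréchet case up to type, i.e. after the change of variable s = 1 + ξr in (1.4). The function names are illustrative assumptions.

```python
import math

def gev_cdf(r, mu=0.0, sigma=1.0, xi=0.0):
    """Distribution function of GEV(mu, sigma, xi) as in (1.3)."""
    z = (r - mu) / sigma
    if xi == 0.0:                      # Gumbel case, the limit xi -> 0
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:                       # outside the support: [.]_+ is zero
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gumbel_cdf(r):
    return math.exp(-math.exp(-r))

def frechet_cdf(s, alpha):
    return math.exp(-s ** (-alpha)) if s > 0 else 0.0

alpha = 2.5
xi = 1.0 / alpha
for r in (-1.0, 0.0, 0.5, 3.0):
    # Small positive xi approaches the Gumbel distribution function.
    print(r, gev_cdf(r, xi=1e-6), gumbel_cdf(r))
    # With xi = 1/alpha, (1.4) is a Frechet(alpha) law in s = 1 + xi * r.
    print(r, gev_cdf(r, xi=xi), frechet_cdf(1.0 + xi * r, alpha))
```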
Type identification
For particular, well-known F(r), the type of limiting distribution can be derived.
For example, normal and log-normal distributions give rise to Gumbel, while Student's t
and uniform distributions give Fréchet and Negative Weibull respectively. In general,
exponentially tailed distributions give the Gumbel type, while algebraically tailed
distributions and distributions with a finite end-point give the Fréchet and Negative
Weibull types respectively. The Fréchet distribution is for positive
random variables while the Negative Weibull is for negative random variables. This
classification facilitates an easy identification of the right type of EVD to model the
scores or measures. For example, the Fréchet distribution is clearly the right type for
RMSD (Stark et al., 2003b). RMSD values are positive and have a heavy tailed
distribution attenuated at zero.
Adjusting for database size
Because of the max-stability property of the GEV distribution, the modelled random
distribution can be used for searches in a database of a different size by correcting the
normalising constants. In general, if for an extreme Mn(ri, i = 1, . . . , n),
\[ \frac{M_n - b_n}{a_n} \xrightarrow{D} \mathrm{EVD}(\mu, \sigma, \xi) \quad \text{for a random database of size } n, \qquad (1.5) \]
then, using the domains of attraction principle, the normalising constants for searching a
database of size n′ are given by $1 - F(b'_n) = 1/n'$ and $a'_n = h(b'_n) = [1 - F(b'_n)]/f(b'_n)$,
where f is the density of F. However, in
Chapter 7 we use an ad hoc method (Stark et al., 2003b; Torrance et al., 2005) to
adjust for sample space since F(r) is unknown.
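The max-stability property itself can be illustrated numerically: the maximum of k independent GEV(µ, σ, ξ) variables is again GEV with the same ξ and shifted location and scale. The sketch below checks this identity for maxima; it is not the ad hoc adjustment used in Chapter 7, and the function names and parameter values are assumptions.

```python
import math

def gev_cdf(r, mu, sigma, xi):
    """Distribution function of GEV(mu, sigma, xi)."""
    z = (r - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def rescale_for_block(mu, sigma, xi, k):
    """Parameters of the law of the maximum of k independent
    GEV(mu, sigma, xi) draws, which follows from G(r)**k."""
    if xi == 0.0:
        return mu + sigma * math.log(k), sigma, xi
    return mu + sigma * (k ** xi - 1.0) / xi, sigma * k ** xi, xi

# Hypothetical fitted parameters and a database ten times larger.
mu, sigma, xi, k = 2.0, 0.5, 0.3, 10
new_params = rescale_for_block(mu, sigma, xi, k)
for r in (2.0, 3.0, 5.0):
    print(r, gev_cdf(r, mu, sigma, xi) ** k, gev_cdf(r, *new_params))
```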
Chapter 2
Exploratory Analysis of Protein
Geometry
In this Chapter we are interested in learning some properties of proteins in general
and functional sites in particular. We consider spatial information in terms of only
point coordinates for functional sites and protein structure atoms. We explore spa-
tial arrangement of atoms in both functional sites and whole proteins using spatial
statistics tools.
2.1 Background
We are interested in point patterns or spatial positions of points. We would like
to characterise, say, whether the points in the configurations are clustered, regular or
random, or whether there are variations in intensity in different regions.
2.1.1 Inter-event distances
Consider a configuration of points, $\{x_j\}$, $j = 1, \ldots, n$. Inter-event or inter-point
distances are $\|x_i - x_j\|$ for $i, j = 1, \ldots, n$ and $i \neq j$. The inter-event distribution is
\[ H(t) = P(\|X_i - X_j\| \le t). \]
Conditional on the number of events N(A) = n of a spatial point process N in
the region of observation A, where N(A) = N ∩ A, the empirical inter-event distance
distribution function (EDF) is written
\[ \hat{H}(t) = \frac{1}{n(n-1)} \sum_{i \neq j} I\{\|x_i - x_j\| \le t\}, \]
where $x_i$ are the events in the observed spatial point pattern and $I\{\cdot\}$ is an
indicator function. If the theoretical inter-event distance distribution function, say
$H_0(t)$, of a theoretical spatial point process is known, deviations of $\hat{H}(t)$ from $H_0(t)$
can be used to test the hypothesis that an observed point pattern is a realisation
from the theoretical spatial point process.
In section 2.2 we visually (informally) compare $\hat{H}(t)$ for functional site Cα atoms to
$H_0(t)$ for the complete spatial randomness (CSR) model, i.e. a uniform distribution of N(A)
points in the region A where N(A) ∼ Poisson(λ).
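As an illustration, the empirical inter-event distance distribution can be computed directly from a list of coordinates. The Python sketch below is not the program used for the analyses in this chapter; the function name `inter_event_edf` and the toy coordinates are assumptions made for the example.

```python
import itertools
import math


def inter_event_edf(points, t):
    """Empirical inter-event distance distribution at distance t."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two events")
    # Count unordered pairs within distance t; each pair appears twice
    # in the sum over ordered pairs i != j.
    close_pairs = sum(
        1 for p, q in itertools.combinations(points, 2) if math.dist(p, q) <= t
    )
    return 2 * close_pairs / (n * (n - 1))


# Toy configuration (hypothetical coordinates, in angstroms).
events = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 5.5, 0.0), (0.0, 0.0, 10.0)]
h_at_6 = inter_event_edf(events, 6.0)  # two of the six pairs are within 6 units
```

Evaluating the function on a grid of t values gives the curve that would be compared with the theoretical distribution under CSR.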
2.1.2 Point to nearest event distances
Another statistical tool for characterising spatial point processes is “point to nearest
event distance”. While for inter-event distance, we consider all the events in the
region, this type of analysis uses distances ti from each of m sample points in A to
the nearest of the n events. Thus point to nearest event distance summarises local
characteristics of the spatial point process.
The empirical distribution function (EDF), $\hat{F}(t) = m^{-1}\,\#(t_i \le t)$, measures the “empty
spaces” in A, in the sense that $1 - \hat{F}(t)$ is an estimate of the volume (area) $|B_t|$, as a
proportion of $|A|$, of the region $B_t$ consisting of all points in A at a distance of at least t
from every one of the n events in A. Again $\hat{F}(t)$ can be compared to the theoretical
distribution of a particular spatial point process of interest.
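The point to nearest event EDF can be sketched in the same way. In the Python example below the function name `nearest_event_edf`, the single event and the three sample points are hypothetical; in practice the sample points would be drawn or gridded over the region A.

```python
import math


def nearest_event_edf(events, sample_points, t):
    """Proportion of sample points whose nearest event is within distance t."""
    nearest = [min(math.dist(s, e) for e in events) for s in sample_points]
    return sum(1 for d in nearest if d <= t) / len(nearest)


# Hypothetical toy data: one event and three sample points.
events = [(5.0, 5.0, 5.0)]
samples = [(5.0, 5.0, 5.0), (5.0, 5.0, 8.0), (0.0, 0.0, 0.0)]
f_at_4 = nearest_event_edf(events, samples, 4.0)  # distances are 0, 3 and about 8.66
```

One minus this value is the proportion of sample points lying in the "empty space" at distance t.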
2.1.3 The K-function
The K-function was introduced by Bartlett (1964) and its potential and importance
for analysing point patterns was realised and developed by Ripley (1976, 1977).
For a stationary isotropic process, the K-function can be defined as
\[ K(t) = \lambda^{-1} E(\text{number of points within distance } t \text{ of a randomly chosen point}), \quad t > 0, \]
where λ is the mean number of points per unit region.
A K-function provides a summary of spatial dependence over a wide range of
scales of a pattern, including all event-event distances, not just the nearest neigh-
bour distances. Since theoretical forms of the function are known for various possible
spatial point process models, the K-function can be used to explore spatial depen-
dence, in addition to suggesting specific models to represent the observed spatial
point process and to estimate the parameters of such models.
The estimator for K(t) is
\[ \hat{K}(t) = n^{-2} |A| \sum_{i \neq j} \omega(x_i, u_{ij})^{-1} I_t(u_{ij}), \]
where $u_{ij}$ denotes the distance between the ith and the jth events in A, and $\omega(x_i, u_{ij})$ is
the proportion of the surface of the sphere with centre $x_i$ and radius $u_{ij}$ which lies
within A, so that its reciprocal corrects for edge effects. $I_t(u_{ij})$ is an indicator
function taking the value 1 if $u_{ij} \le t$ and 0 otherwise.
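A minimal sketch of the K-function estimator follows. It assumes that every edge weight equals one, i.e. no edge correction is applied, which is a simplification and not the estimator as defined above; the function name `k_function`, the toy coordinates and the region volume are hypothetical.

```python
import math


def k_function(points, t, volume):
    """Estimate K(t) with all edge weights set to 1 (no edge correction)."""
    n = len(points)
    count = 0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            # Sum over ordered pairs i != j of the indicator u_ij <= t.
            if i != j and math.dist(p, q) <= t:
                count += 1
    return volume * count / n ** 2


# Toy configuration in a region of hypothetical volume 1000 cubic units.
events = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 5.5, 0.0), (0.0, 0.0, 10.0)]
k_at_6 = k_function(events, 6.0, 1000.0)
```

An edge-corrected version would divide each term by the proportion of the sphere surface lying inside the region, which requires the geometry of A.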
We consider the K-function and inter-event distance for Cα atoms.
2.2 Functional Sites
We consider spatial distribution of Cα atoms of functional sites. We evaluate the
first and second order statistics for spatial processes:
(a) Three dimensional plot of an estimated density field.
(b) Point to nearest event distance frequency plot.
(c) The K-function. Plotted are normalised K-functions: $K(t) - \pi t^2$.
(d) Inter-event distance cumulative function, H(t).
(e) Inter-event distance frequency plot.
Figures 2.1 - 2.8 depict these statistics for a random sample of 9 functional
sites. From the estimated density fields, we observe that there are elongated tubes,
hence dependence in the spatial positions of Cα atoms. This is confirmed when we
compare the data with a homogeneous Poisson process. The points tend to fall in
elongated strings, so the residues of functional sites probably tend to come from
conserved motifs in a sequence.
The departure from a homogeneous Poisson process is again apparent when we
compare with first and second order statistics for a homogeneous Poisson process.
The empirical estimates (red graph) in the inter-event distance frequency
plots clearly show an inhibition distance of about 4.0 Å (adjacent Cα atoms
in a protein chain are in fact about 3.8 Å apart). There is also a peak in the inter-event
distance histograms between 5 Å and 6 Å. This peak reflects the closest Cα atoms that are
not adjacent in sequence order, i.e. Cα atoms which have come close because
two parts of the chain fold close to each other. This inhibition distance for
atoms not forming a chemical bond is due to what is called the van der Waals radius¹.
Aszodi and Taylor (1994) found an average value of 5.5 Å for this distance between
Cα atoms in proteins. For simulations in Chapter 3, section 3.1.1, we conservatively
use 5 Å as the inhibition distance to model the van der Waals radius between Cα atoms.
Inter-event distance cumulative function estimates, H(t) are outside the 95% C.I.
envelope except for functional sites from carbonyl reductase (1cyd 1), 5-aminolaevulinate
dehydratase (1b4e 0) and aspartate aminotransferase (1ajr 0). The same observa-
tion is made for K-functions of these functional sites.
2.3 Protein Structures
We also consider the spatial distribution of all Cα atoms in the structure for some
functional sites considered in section 2.2 above. Plotted in Figures 2.9 - 2.12 are
first and second order statistics for the spatial arrangement of Cα atoms in protein
structures (parts a, c, d, e and f). These are a three dimensional plot of an estimated
density field, a point to nearest event distance frequency plot, the K-function, the
inter-event distance cumulative function and an inter-event distance frequency plot.
Parts b are ribbon representations of secondary structures in RasMol (Sayle and
Milner-White, 1995). Figure 2.13 is a plot of first and second order statistics for the
spatial arrangement of all atoms in 17-β hydroxysteroid dehydrogenase (1a27).
¹The van der Waals radius of an atom is the radius of an imaginary hard sphere which can be
used to model the atom for many purposes. Van der Waals radii are determined from measurements
of atomic spacing between pairs of nonbonding atoms in molecules.
Figure 2.1: The K-function and inter-point distance distribution for Cα atoms in
17-β hydroxysteroid dehydrogenase functional site (1a27 0).
Figure 2.2: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase (from E. coli) functional site (1b4e 0).
Figure 2.3: The K-function and inter-point distance distribution for Cα atoms in
subtilisin functional site (1bfk 0).
Figure 2.4: The K-function and inter-point distance distribution for Cα atoms in
carbonyl reductase functional site (1cyd 1).
Figure 2.5: The K-function and inter-point distance distribution for Cα atoms in
1,3,8-trihydroxynaphthalene reductase functional site (1g0n 0).
Figure 2.6: The K-function and inter-point distance distribution for Cα atoms in
mannitol dehydrogenase functional site (1h5q 0).
Figure 2.7: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase (from Baker’s yeast) functional site (1h7o 0).
Figure 2.8: The K-function and inter-point distance distribution for Cα atoms in
glutaminase-asparaginase functional site (3pga 0).
Figure 2.9: The K-function and inter-point distance distribution for Cα atoms in
17-β hydroxysteroid dehydrogenase structure (1a27).
Figure 2.10: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase structure (1aw5).
Figure 2.11: The K-function and inter-point distance distribution for Cα atoms in
5-aminolaevulinate dehydratase structure from E. coli (1b4e).
Figure 2.12: The K-function and inter-point distance distribution for Cα atoms in
carbonyl reductase structure (1cyd).
Figure 2.13: The K-function and inter-point distance distribution for all atoms in
17-β hydroxysteroid dehydrogenase (1a27).
Chapter 3
Simulation Design and Evaluation
of Algorithms
We will consider simulations to evaluate the performance of our approach and some
other known algorithms. Simulations are used to evaluate the correct correspondence
rate for matching methods in Chapters 5 and 6. We cover the simulation scheme
in section 3.1 while 3.2 covers topics on evaluation. Highlighted in section 3.1.2
are simulations of Aszodi and Taylor (1994), producing compact random structures
with a hydrophobic core.
3.1 Simulations
In this section we are concerned with how we simulate functional sites and proteins
to evaluate the performance of different algorithms.
3.1.1 Functional Sites Simulations
Functional site pairs with varying sizes were simulated. Each pair consisted of {µ}
and {x}. The size of {x}, n, varied from 4 to 64 in steps of 4 (i.e. n = 4, 8, . . . , 64).
The size of {µ} was taken to be m = ⌈1.1n⌉, with the additional 10% of the points in {µ}
having no corresponding points in {x}. The choice of 10% should provide enough
noise in the system to evaluate our matching algorithms. Luo and Hancock (2001)
added up to about 10% of non-corresponding points to evaluate different matching
algorithms.
Point set configurations.
For size, n = 4, 8, . . . , 64
(a) Hardcore: simulate a configuration of m = ⌈1.1n⌉ points constituting {µ} in a
35³ cube. We uniformly sample points inside the cube and reject a point if it is within
5 units of any other point. The inhibition distance of 5 Å is to model the van
der Waals radius in molecules. Aszodi and Taylor (1994) observed an average
value of about 5.5 Å for the van der Waals radius between Cα atoms in protein
molecules. We also observed an inhibition distance of between 5 Å and 6 Å in inter-event
distance histograms for functional sites in section 2.2. The distance of
5 Å was chosen as this is a conservative threshold for interaction between two
atoms, where the atoms are either Cα atoms or atoms in side chains (Park
et al., 2001).
(b) We then randomly generate colours for these points according to frequencies
of amino acids in Table 3.1.
(c) Choose randomly (without replacement) n points from {µ}.
(d) From each of the chosen n points of {µ} simulate a point, $x \sim MN(\mu, 0.5 I_3)$.
There is no preferred direction for the x, y and z coordinates hence we assume
isotropic Gaussian. It is also biologically plausible to assume independence
between the coordinates.
Colour of x is
i. First approach: Just take the colour of µ (no mutation of colour).
ii. Second approach: Simulate mutation to get colour of x (see simulation of
mutational process below).
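Steps (a) to (d) with the first colouring approach can be sketched as follows. This is an illustrative Python sketch, not the code used for the simulations in this thesis: the function name `simulate_site_pair`, the `max_tries` safeguard and the reading of 0.5 as a variance (standard deviation √0.5) are assumptions.

```python
import math
import random

# Amino acid frequencies from Table 3.1.
AMINO_FREQ = {
    "N": 0.040, "S": 0.070, "D": 0.047, "E": 0.050, "A": 0.087,
    "T": 0.058, "I": 0.037, "M": 0.015, "Q": 0.038, "V": 0.065,
    "H": 0.034, "R": 0.040, "K": 0.081, "P": 0.051, "G": 0.089,
    "Y": 0.030, "F": 0.040, "L": 0.085, "C": 0.033, "W": 0.010,
}


def simulate_site_pair(n, side=35.0, inhibition=5.0, variance=0.5,
                       rng=None, max_tries=100000):
    """Return (mu, mu_colours, x, x_colours, chosen_indices)."""
    rng = rng or random.Random()
    m = (11 * n + 9) // 10  # integer form of ceil(1.1 n), avoids float rounding
    mu, tries = [], 0
    while len(mu) < m:  # (a) hardcore configuration by rejection sampling
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all points")
        cand = tuple(rng.uniform(0.0, side) for _ in range(3))
        if all(math.dist(cand, p) >= inhibition for p in mu):
            mu.append(cand)
    letters = list(AMINO_FREQ)  # (b) colours drawn with Table 3.1 frequencies
    colours = rng.choices(letters, weights=[AMINO_FREQ[a] for a in letters], k=m)
    chosen = rng.sample(range(m), n)  # (c) n points without replacement
    sd = math.sqrt(variance)  # (d) isotropic Gaussian perturbation
    x = [tuple(rng.gauss(c, sd) for c in mu[i]) for i in chosen]
    x_colours = [colours[i] for i in chosen]  # first approach: no mutation
    return mu, colours, x, x_colours, chosen


mu, mu_colours, x, x_colours, chosen = simulate_site_pair(8, rng=random.Random(1))
```

The returned `chosen` indices record the true correspondence between {x} and {µ}, which is what makes the evaluation by correct correspondence possible.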
A set of m points in {µ} and n points in {x} constitute a pair of functional
sites. These pairs are used for evaluating the performance of matching methods as
outlined in section 3.2.1. We also use these configurations for studying the minimum
RMSD distribution in section 4.1.1 of Chapter 4.
Table 3.1: Frequencies of amino acids in the Accepted Point Mutation (PAM) Data.
N Asn 0.040 H His 0.034
S Ser 0.070 R Arg 0.040
D Asp 0.047 K Lys 0.081
E Glu 0.050 P Pro 0.051
A Ala 0.087 G Gly 0.089
T Thr 0.058 Y Tyr 0.030
I Ile 0.037 F Phe 0.040
M Met 0.015 L Leu 0.085
Q Gln 0.038 C Cys 0.033
V Val 0.065 W Trp 0.010
Evolution of amino acid classes.
We consider the model of Dayhoff et al. (1978) for evolutionary change in proteins.
Their model for amino acid interchanges is applicable to functional
site amino acids as well, since it assumes that amino acid mutation is sequence
independent. However, the actual frequencies of substitutions are different in functional
sites. We assume that amino acid mutation is also independent of spatial positions.
Accepted point mutations
An accepted point mutation is a replacement of one amino acid by another which
has been accepted by natural selection. To be viable, the new amino acid usually
must function in a way similar to the old one. Chemical and physical similarities are
found between the amino acids that are observed to interchange frequently. In the
evolutionary change model, the likelihood of amino acid c replacing k is the same as
that of k replacing c. As a result, no change in amino acid frequencies over evolutionary
distance will be detected.
The probability that each amino acid will change in a given small evolutionary
interval is called the “relative mutability” of the amino acid. Thus relative mutability
of each amino acid is proportional to the ratio of changes to occurrences. Table 3.2
gives these relative mutabilities computed by Dayhoff et al. (1978).
Table 3.2: Relative mutabilities of amino acids (a).
N Asn 134 H His 66
S Ser 120 R Arg 65
D Asp 106 K Lys 56
E Glu 102 P Pro 56
A Ala 100 G Gly 49
T Thr 97 Y Tyr 41
I Ile 96 F Phe 41
M Met 94 L Leu 40
Q Gln 93 C Cys 20
V Val 74 W Trp 18
(a) The value for Ala has been arbitrarily set at 100.
Substitution matrix
Information about individual kinds of mutations and about the relative mutability
of amino acids is combined into one time-dependent “mutation probability matrix”.
An element of this matrix, mij , gives the probability that the amino acid in row i
will be replaced by the amino acid in column j after a given evolutionary interval.
Evolutionary distance between proteins is measured in PAM (Percent Accepted
Mutation). 1 PAM corresponds to an evolutionary distance of one amino acid change
in every 100 amino acids. In addition to calculating mutabilities
for amino acids, Dayhoff et al. (1978) also compiled data on accepted point mutations. PAM
substitution matrices are computed from these pieces of information. Shown in Table 3.3 is a
1 PAM matrix.
Simulation of mutational process
The mutation probability matrix provides the information with which to simulate
any degree of evolutionary change in an unlimited number of proteins. Further, we
can start with one protein and simulate its separate evolution in duplicated genes
or in divergent organisms. By considering large numbers of such related sequences,
a measure is readily obtained of the expected deviations due to random fluctuations
in the evolutionary process.
Let us simulate the effect of 1 PAM of evolutionary change on a particular
amino acid set. To determine the fate of the first amino acid, say alanine, we
obtain a uniformly distributed random number between 0 and 1. The first row of
the mutation probability matrix (Table 3.3) gives the relative probability of each
possible event that may befall alanine (neglecting deletion for simplicity). If the
random number falls between 0 and 0.9867, Ala is unchanged. If the number is
between 0.9867 and 0.9868, it is replaced with Arg, if it is between 0.9868 and 0.9872,
it is replaced with Asn, and so forth. Similarly, a random number is produced for
each amino acid in the set, and action is taken as dictated by the corresponding row
of the matrix. The result is a simulated mutant set. Any number of these can be
generated; their average distance from the original is 1 PAM.
The effects on the set of a longer period of evolution may be simulated by suc-
cessive applications of the matrix to the set resulting from the last application.
Alternatively, the matrix may be multiplied by itself repeatedly and applied once to
the sequence. The two procedures produce mutant sequences of the same average
PAM distance from the initial set. Simulations in this thesis, e.g. in section 3.1.1,
use PAM120, i.e. the 120th power of the 1 PAM matrix.
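The draw described above for alanine is an inverse-CDF lookup on one row of the matrix. The Python sketch below illustrates it with the Ala row of Table 3.3 (entries multiplied by 10,000, columns in the order A R N D C Q E G H I L K M F P S T W Y V); the function name `mutate` is an assumption for the example.

```python
import bisect
import itertools

COLUMNS = "ARNDCQEGHILKMFPSTWYV"
# Ala row of Table 3.3, entries multiplied by 10,000.
ALA_ROW = [9867, 1, 4, 6, 1, 3, 10, 21, 1, 2, 3, 2, 1, 1, 13, 28, 22, 0, 1, 13]


def mutate(row_counts, u):
    """Return the replacement amino acid selected by u in [0, 1)."""
    cumulative = list(itertools.accumulate(row_counts))
    # Index of the first cumulative count strictly greater than the scaled u.
    idx = bisect.bisect_right(cumulative, u * cumulative[-1])
    return COLUMNS[idx]


unchanged = mutate(ALA_ROW, 0.5)      # falls in [0, 0.9867): Ala stays Ala
to_arg = mutate(ALA_ROW, 0.98675)     # falls in [0.9867, 0.9868): Arg
to_asn = mutate(ALA_ROW, 0.9870)      # falls in [0.9868, 0.9872): Asn
```

In a simulation `u` would come from a uniform random number generator, and the row would be taken from the matrix for the chosen PAM distance.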
For simulations in which a predetermined number of changes are required, a
two-step process involving two random numbers for each mutation can be used.
Table 3.3: Substitution (mutation probability) matrix for the evolutionary distance
of 1 PAM. An element of this matrix, mij, gives the probability that the amino acid
in row i will be replaced by the amino acid in column j after a given evolutionary
interval, in this case 1 accepted point mutation per 100 amino acids. Thus, there is
a 0.56% probability that Asp (D) will be replaced by Glu (E). To simplify the
appearance, the elements are shown multiplied by 10,000. Taken from Dayhoff et al.
(1978).
A R N D C Q E G H I L K M F P S T W Y V
A 9867 1 4 6 1 3 10 21 1 2 3 2 1 1 13 28 22 0 1 13
R 2 9913 1 0 1 9 0 1 8 2 1 37 1 1 5 11 2 2 0 2
N 9 1 9822 42 0 4 7 12 18 3 3 25 0 1 2 34 13 0 3 1
D 10 0 36 9859 0 5 56 11 3 1 0 6 0 0 1 7 4 0 0 1
C 3 1 0 0 9973 0 0 1 1 2 0 0 0 0 1 11 1 0 3 3
Q 8 10 4 6 0 9876 35 3 20 1 6 12 2 0 8 4 3 0 0 2
E 17 0 6 53 0 27 9865 7 1 2 1 7 0 0 3 6 2 0 1 2
G 21 0 6 6 0 1 4 9935 0 0 1 2 0 1 2 16 2 0 0 3
H 2 10 21 4 1 23 2 1 9912 0 4 2 0 2 5 2 1 0 4 3
I 6 3 3 1 1 1 3 0 0 9872 22 4 5 8 1 2 11 0 1 57
L 4 1 1 0 0 3 1 1 1 9 9947 1 8 6 2 1 2 0 1 11
K 2 19 13 3 0 6 4 2 1 2 2 9926 4 0 2 7 8 0 0 1
M 6 4 0 0 0 4 1 1 0 12 45 20 9874 4 1 4 6 0 0 17
F 2 1 1 0 0 0 0 1 2 7 13 0 1 9946 1 3 1 1 21 1
P 22 4 2 1 1 6 3 3 3 0 3 3 0 0 9926 17 5 0 0 3
S 35 6 20 5 5 2 4 21 1 1 1 8 1 2 12 9840 32 1 1 2
T 32 1 9 3 1 2 2 3 1 7 3 11 2 1 4 38 9871 0 1 10
W 0 8 1 0 0 0 0 0 1 0 4 0 0 3 0 5 0 9976 2 0
Y 2 0 4 0 3 0 1 0 4 1 2 1 0 28 0 2 2 1 9945 2
V 18 1 1 1 2 1 2 5 1 33 15 1 4 0 2 2 9 0 1 9901
Starting with a given sequence, the first amino acid that will mutate is selected:
the probability that any one will be selected is proportional to its mutability (Table
3.2). Then the amino acid that replaces it is chosen. The probability for each
replacement is proportional to elements in the appropriate row of the substitution
matrix. Starting with the resultant set, a second mutation can be simulated, and
so on, until a predetermined number of changes have been made. In this process,
superimposed and back mutations may occur.
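The two-step process can be sketched as below. The mutabilities are those of Table 3.2; the function names and the `matrix` argument (a mapping from an amino acid to its row of Table 3.3) are hypothetical, and the demonstration is restricted to an all-alanine sequence so that only the Ala row is needed.

```python
import random

# Relative mutabilities from Table 3.2 (Ala = 100).
MUTABILITY = {
    "N": 134, "S": 120, "D": 106, "E": 102, "A": 100, "T": 97, "I": 96,
    "M": 94, "Q": 93, "V": 74, "H": 66, "R": 65, "K": 56, "P": 56,
    "G": 49, "Y": 41, "F": 41, "L": 40, "C": 20, "W": 18,
}
COLUMNS = "ARNDCQEGHILKMFPSTWYV"
# Ala row of Table 3.3, entries multiplied by 10,000.
ALA_ROW = [9867, 1, 4, 6, 1, 3, 10, 21, 1, 2, 3, 2, 1, 1, 13, 28, 22, 0, 1, 13]


def pick_position(sequence, rng):
    """Step 1: choose the position to mutate, proportional to mutability."""
    weights = [MUTABILITY[a] for a in sequence]
    return rng.choices(range(len(sequence)), weights=weights, k=1)[0]


def pick_replacement(old, row, rng):
    """Step 2: choose the replacement, proportional to off-diagonal row elements."""
    letters = [c for c in COLUMNS if c != old]
    weights = [row[COLUMNS.index(c)] for c in letters]
    return rng.choices(letters, weights=weights, k=1)[0]


def mutate_once(sequence, matrix, rng):
    """Apply exactly one change to the sequence."""
    pos = pick_position(sequence, rng)
    new = pick_replacement(sequence[pos], matrix[sequence[pos]], rng)
    return sequence[:pos] + new + sequence[pos + 1:]


mutant = mutate_once("AAAA", {"A": ALA_ROW}, random.Random(0))
```

Calling `mutate_once` repeatedly on its own output, with a `matrix` holding all twenty rows, gives a predetermined number of changes, and superimposed or back mutations can then arise naturally.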
Although these substitution matrices might not apply very well to functional
sites, the simulated data are adequate for evaluating the methodology. For real
protein matching applications, one can consider other substitution matrices, such as
the one developed for spatially conserved locations (Naor et al., 1996).
3.1.2 Whole Structure Simulations
To generate random structures, Aszodi and Taylor generate a chain of points repre-
senting Cα atoms in the main chain which is folded into a 3-dimensional structure
by distance geometry methods.
Chain properties
Simulate a chain of Cα atoms with the following properties:
Chain geometry:
Figure 3.1 shows the geometry of a virtual chain.
• Virtual bond length: Inhibition distance of 3.8 Å between successive points in
the chain, since adjacent Cα atoms are separated roughly by such a distance
in proteins (Aszodi and Taylor, 1994; Jeong et al., 2006).
• Non-coincidence of atom centres due to atom volume and van der Waals forces:
Inhibition distance of $d_{bump} = 2r_{vdW}$ between two non-successive atoms. $d_{bump}$
was set to 5.5 Å in the simulations. This distance was chosen so as to
correspond to the van der Waals radius for Cα atoms observed in protein molecules and
to give the correct average residue density.
• Virtual bond angles, β: The virtual angle at each Cα atom formed by virtual
bonds from the right and left Cα atom neighbours is $\beta = 2\arcsin(d_2/(2l))$, where $d_2$
is the distance between the right and left neighbours and l is the virtual bond length.
In proteins β ∈ [π/2, π]. Aszodi and Taylor fix $\beta = 2\arcsin(d_2/(2l))$ where $d_2$ is an
average observed distance between each Cα atom and its second neighbour. They argue
for averaging $d_2$ to avoid geometric bias towards secondary structure formation.
Simulations used $d_2$ = 6.0 Å (observed $d_2$ = 6.0 ± 0.4 Å from 84 protein structures).
• Virtual bond torsion angle, θ. This angle was allowed to randomly take any
value in the interval [−π, π].
These bonds and angles are “virtual” because they do not exist in proteins i.e. Cα
atoms are not directly connected in protein backbones.
Figure 3.1: Virtual distances and angles in a protein backbone.
Chain biochemistry:
Each Cα atom was randomly assigned a binary hydrophobicity property i.e. hy-
drophobic or not.
Folding
Instead of minimising an energy function (see Weiner et al., 1984; Pereira De Araujo,
1999; Chhajer and Crippen, 2002; Jaramillo et al., 2002), the idea here is to fold the
chain in such a way as to achieve a target distance matrix for inter-residue (between
amino acid) distances. The target distance matrix specifies the preferred distances
between Cα atoms and accounts for hydrophobicity. Hydrophobic amino acids tend
to cluster together. This phenomenon is called the hydrophobic effect and is one
of the major driving forces for protein folding. The hydrophobic effect is the
tendency to shield the hydrophobic amino acids away from the surface. A preferred
distance matrix is one with shorter inter-residue distances among hydrophobic
amino acids.
The target distance matrix used is given in Table 3.4:
Table 3.4: Target (desired) distances between Cα atoms in simulated short chains
of proteins.
Pair Type                   d_des (Å)   Strictness/Preference
Hydrophobic/hydrophobic     6.0         0.5 ... 1.0
Hydrophobic/hydrophilic     8.0         0.1
Hydrophilic/hydrophilic     10.0        0.1
Distance constraint
Chain geometry requires that the distance between Cα atoms should in general be
above $d_{bump} = 2r_{vdW} = 5.5$ Å. The other constraint is that the distance between the
ith and jth Cα atoms is maximal if the chain connecting them is in its extended
conformation, i.e. all in-between torsion angles θ equal π. The maximal distance,
say $d_{max,s}$, depends on $s = |i - j|$ only. Figure 3.2 illustrates the calculation of
distance constraints. The algorithm is initialised with three points $(0, -d_2/2, 0)$,
$(0, d_2/2, 0)$ and $(\sqrt{l^2 - d_2^2/4}, 0, 0)$, where l = 3.8 Å and $d_2$ = 6.0 Å.
Recursively, $d_{max,s}$ is
\[ d_{max,1} = l, \qquad d_{max,2} = d_2. \]
If s is even:
\[ d_{max,s} = s\, d_{max,2}/2, \]
else if s is odd:
\[
\begin{aligned}
d_{max,s} &= \left( \left( d_{max,s-1} + l \cos\tfrac{\pi - \beta}{2} \right)^2 + l^2 \sin^2\tfrac{\pi - \beta}{2} \right)^{1/2} \\
          &= \left( d_{max,s-1}^2 + 2 d_{max,s-1}\, l \cos\tfrac{\pi - \beta}{2} + l^2 \cos^2\tfrac{\pi - \beta}{2} + l^2 \sin^2\tfrac{\pi - \beta}{2} \right)^{1/2} \\
          &= \left( d_{max,s-1}^2 + 2 d_{max,s-1}\, l \cos\tfrac{\pi - \beta}{2} + l^2 \right)^{1/2}.
\end{aligned}
\tag{3.1}
\]
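The recursion can be checked numerically. The Python sketch below uses l = 3.8 Å and d2 = 6.0 Å as in the text; the function names `virtual_bond_angle` and `d_max` are assumptions for the example. With these values β is roughly 104 degrees.

```python
import math


def virtual_bond_angle(d2=6.0, l=3.8):
    """Virtual bond angle beta = 2 arcsin(d2 / (2 l)), in radians."""
    return 2.0 * math.asin(d2 / (2.0 * l))


def d_max(s, l=3.8, d2=6.0):
    """Maximal distance between Calpha atoms separated by s virtual bonds."""
    if s < 1:
        raise ValueError("s must be a positive integer")
    if s == 1:
        return l
    if s % 2 == 0:
        return s * d2 / 2.0  # even s: multiples of d_max,2 = d2
    half = (math.pi - virtual_bond_angle(d2, l)) / 2.0
    prev = d_max(s - 1, l, d2)
    # Last line of the odd-s case of the recursion.
    return math.sqrt(prev ** 2 + 2.0 * prev * l * math.cos(half) + l ** 2)


beta = virtual_bond_angle()
limits = [d_max(s) for s in range(1, 6)]
```

Because cos((π − β)/2) = d2/(2l), the odd case for s = 3 reduces to the square root of 36 + 36 + 14.44, which gives a quick hand check of the code.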
Figure 3.2: Distance constraints in a protein virtual backbone.
The hydrophobic effect was modelled by moving all hydrophobic amino acids
towards the centre e.g. by 20%, and all hydrophilic amino acids were moved outward
by a smaller amount e.g. 5%.
Algorithm
The algorithm has two phases. The first phase is in “Distance Space” whereby
inter-residue distances are updated:
\[ d_{ij}^{(new)} = (1 - s_{ij})\, d_{ij}^{(old)} + s_{ij}\, d_{ij}^{(des)} \tag{3.2} \]
where $s_{ij}$ is the level of strictness for the target distance $d_{ij}^{(des)}$ between residues i and
j from Table 3.4. Here $d_{ij}^{(des)}$ only depends on whether the ith and jth Cα atoms
belong to hydrophilic or hydrophobic amino acids. In this phase, distance constraints are
checked and any violations corrected. In the case of violation, $d_{ij}^{(new)}$ is set to either
$d_{bump}$ or $d_{max,s}$ (whichever is closer).
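A single distance-space update for one pair of residues might look as follows. This is a sketch only: the function name `update_distance` is hypothetical, and it assumes a non-successive pair for which the lower bound does not exceed the upper bound, so that setting a violating distance to the closer bound is the same as clamping.

```python
def update_distance(d_old, d_des, strictness, d_bump, d_max_s):
    """One update of an inter-residue distance followed by constraint correction."""
    d_new = (1.0 - strictness) * d_old + strictness * d_des
    # A violated constraint is corrected to the closer bound; with
    # d_bump <= d_max_s this is a clamp to the interval [d_bump, d_max_s].
    if d_new < d_bump:
        return d_bump
    if d_new > d_max_s:
        return d_max_s
    return d_new


# Hydrophobic/hydrophobic pair pulled towards the 6.0 target with strictness 0.5.
pulled = update_distance(12.0, 6.0, 0.5, d_bump=5.5, d_max_s=12.0)
# Updated distance 5.1 violates the bump constraint and is reset to 5.5.
bumped = update_distance(5.0, 6.0, 0.1, d_bump=5.5, d_max_s=6.0)
# Updated distance 19.0 exceeds the maximal distance and is reset to 12.0.
capped = update_distance(20.0, 10.0, 0.1, d_bump=5.5, d_max_s=12.0)
```

Applying this to every pair i < j gives one pass of the first phase before the distances are embedded in Euclidean space.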
The second phase is in “Euclidean Space”. In this step, the distance matrix is
used to specify the 3-dimensional coordinates of the structure. Again after projecting
into the 3-dimensional Euclidean space, distance constraints are checked and any
anomalies are corrected.
These steps are iterated until convergence. The criterion for convergence is based
on either distance or constraint scores:
Distance score
This is the square root of a strictness-weighted mean of squared relative differences
between targeted and actual distances:
\[ Q_{dist} = \sqrt{ \frac{\sum_{i<j} s_{ij} \left( (d_{ij} - d_{ij}^{(des)}) / d_{ij}^{(des)} \right)^2}{\sum_{i<j} s_{ij}} }. \tag{3.3} \]
Constraint score
This is the square root of a sum of squared relative violations of the distance
constraints:
\[ Q_{cons} = \sqrt{ \sum_{i<j} (E_{bump,ij} + E_{max,ij}) } \tag{3.4} \]
where
\[ E_{bump,ij} = \begin{cases} \left( \dfrac{d_{bump} - d_{ij}}{d_{bump}} \right)^2, & \text{if } d_{ij} < d_{bump} \\ 0, & \text{otherwise} \end{cases} \]
\[ E_{max,ij} = \begin{cases} \left( \dfrac{d_{max,s} - d_{ij}}{d_{max,s}} \right)^2, & \text{if } d_{ij} > d_{max,s},\ s = |i - j| \\ 0, & \text{otherwise.} \end{cases} \]
The convergence criterion is met if either the absolute value or the relative change
of the score is below a preset minimum.
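Both scores are simple to evaluate once the current distances are available. In the Python sketch below, distances, targets and strictness values are stored in dictionaries keyed by the pair (i, j) with i < j, and the maximal distance is passed as a function of s = |i − j|; these data structures, the function names and the toy numbers are assumptions made for the illustration.

```python
import math


def distance_score(d, d_des, s):
    """Strictness-weighted root mean squared relative deviation from targets."""
    num = sum(s[k] * ((d[k] - d_des[k]) / d_des[k]) ** 2 for k in d)
    den = sum(s[k] for k in d)
    return math.sqrt(num / den)


def constraint_score(d, d_bump, d_max):
    """Root of summed squared relative violations of bump and maximal distances."""
    total = 0.0
    for (i, j), dij in d.items():
        if dij < d_bump:
            total += ((d_bump - dij) / d_bump) ** 2
        limit = d_max(abs(i - j))
        if dij > limit:
            total += ((limit - dij) / limit) ** 2
    return math.sqrt(total)


# Toy example with two non-successive pairs (hypothetical values).
d = {(0, 2): 6.0, (0, 3): 12.0}
d_des = {(0, 2): 6.0, (0, 3): 8.0}
s = {(0, 2): 0.5, (0, 3): 0.1}
q_dist = distance_score(d, d_des, s)
q_cons = constraint_score(d, d_bump=5.5, d_max=lambda sep: 3.0 * sep)
```

Here only the pair (0, 3) contributes: it misses its target by half the target distance and exceeds the toy maximal distance of 9 by a third.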
Comments
• The method is capable of reproducing important protein “non-random” fea-
tures like globularity and compactness.
• There is no mechanism to avoid forming knots in the structure.
Alternative chain simulations
Both “distance space” and “Euclidean space” phases of the method by Aszodi and
Taylor (1994) are computer intensive. Here we propose a new algorithm based
on Aszodi and Taylor (1994) method but much simpler and mathematically more
flexible. We consider a chain of Cα atoms only (see Eidhammer et al., 2004, p. 254).
Figure 12.1 therein, gives the geometry. The chain is (Eidhammer et al., 2004, p.
173)
\[ C^{\alpha}_{i-1} - C_{i-1} = N_i - C^{\alpha}_i - C_i = N_{i+1} - C^{\alpha}_{i+1} - C_{i+1} = N_{i+2}. \]
Note that $C^{\alpha}_i - C_i = N_{i+1} - C^{\alpha}_{i+1} - C_{i+1}$ lies in a plane, where “−” and “=” are
single and double bonds respectively.
• Relative to the plane of $C_{i-1} = N_i - C^{\alpha}_i$, $C_i$ has only freedom to rotate “around”
the bond $N_i - C^{\alpha}_i$. This angle is $\phi_i$.
• Relative to the plane of $N_i - C^{\alpha}_i - C_i$, $N_{i+1}$ has only freedom to rotate “around”
the bond $C^{\alpha}_i - C_i$. This angle is $\psi_i$.
Simulating Cα atoms only
Three consecutive Cα atoms can be regarded as lying in a plane. Following Aszodi
and Taylor (1994), we start with a triangle where we take the base line going from
$(0, -d_1/2, 0)$ to $(0, d_1/2, 0)$. The vertex has coordinates $(\sqrt{l^2 - d_1^2/4}, 0, 0)$ where l is
pre-specified to be 3.8 Å, and $d_1$ lies around 6 Å with a standard deviation of 0.4 Å.
Consider Figure 3.3; A, B, C denote consecutive Cα atoms lying in a plane. Now
we can take a normal distribution for $d_i$ with mean 6 Å and standard deviation 0.4 Å.
To generate a fourth atom, D, consider a point $P_1$ in the same plane as A, B, C. The
next step consists of rotating the edge $CP_1$, where the base line is AC, by θ. Thus the
next triangle (containing the fourth atom D) gets rotated by θ. This angle θ between
CD and the xy-plane is related to the dihedral angles φ and ψ in real proteins. We
take θ to have a von Mises distribution with mean zero around edge AC rather than
uniform as in Aszodi and Taylor (1994). Unlike the uniform distribution, the von
Mises distribution can be used to control the turning properties of the chain by
specifying mean direction and concentration parameters.
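[Figure 3.3 diagram: atoms A, B, C, D and auxiliary point P1 on axes x, y, z, with bond length l, distances d1 and d2, and angles α, π/2 − α, δ, β, η and θ.]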
Figure 3.3: Orientation of Cα atoms in simulating protein short chains.
Let $D_x$ denote the x-coordinate of D. Then it is found that:
\[
\begin{aligned}
D_x &= l \cos\theta \sin\eta + C_x \\
D_y &= l \cos\theta \cos\eta + C_y \\
D_z &= l \sin\theta + C_z
\end{aligned}
\tag{3.5}
\]
where
\[ \eta = \pi - \delta - \alpha = 2\cos^{-1}\left( \frac{d_2}{2l} \right) + \tan^{-1}\left( \frac{d_1}{2\sqrt{l^2 - d_1^2/4}} \right) - \frac{\pi}{2}. \]
With B, C, D as new vertices in place of A, B, C respectively, the process
iterates until all N atoms are generated, subject to the minimum distance of 5.5 Å
between any two non-neighbouring atoms to model van der Waals forces.
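A single application of equation (3.5) in the frame of the initial triangle can be sketched as follows. The sketch does not re-express the frame for later atoms and does not enforce the 5.5 Å constraint; the function names, the concentration parameter 2.0 and the choice d1 = d2 = 6.0 are assumptions for the example.

```python
import math
import random


def eta_angle(d1, d2, l=3.8):
    """Angle eta from the expression following equation (3.5)."""
    return (2.0 * math.acos(d2 / (2.0 * l))
            + math.atan(d1 / (2.0 * math.sqrt(l * l - d1 * d1 / 4.0)))
            - math.pi / 2.0)


def next_atom(c, theta, eta, l=3.8):
    """Coordinates of D given atom C, rotation theta and angle eta."""
    cx, cy, cz = c
    return (l * math.cos(theta) * math.sin(eta) + cx,
            l * math.cos(theta) * math.cos(eta) + cy,
            l * math.sin(theta) + cz)


rng = random.Random(0)
d1, d2 = 6.0, 6.0
C = (0.0, d1 / 2.0, 0.0)              # end of the base line of the first triangle
theta = rng.vonmisesvariate(0.0, 2.0)  # von Mises draw, mean zero, kappa = 2 (assumed)
D = next_atom(C, theta, eta_angle(d1, d2))
```

Whatever values θ and η take, the new atom lies exactly one virtual bond length l from C, which is a convenient sanity check on an implementation.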
Simulating the hydrophobic effect
The hydrophobic effect is the other major force driving protein folding. We take into
account hydrophobicity by favouring the bond angle which takes the hydrophobic
amino acid towards the centre of the configuration. Thus, at each Cα atom, simulate
bond angles as follows:
• Step 1: get three angles randomly, say from a von Mises distribution.
• Step 2: for a hydrophobic amino acid, choose the angle which takes the Cα atom
furthest towards the centre of mass.
• Step 3: for a hydrophilic amino acid, choose the angle which takes the Cα atom
furthest away from the centre.
Using three angles in Step 1 was observed to efficiently give reasonable structures.
In principle more angles could be sampled. However, sampling more angles increases the
chance of the chain crashing into itself, as there are more chances of turning the chain
closer towards the centre of mass.
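Steps 2 and 3 amount to a choice among candidate positions. In the sketch below the candidate positions are taken as given (they would come from the three sampled angles), and the function names `centroid` and `choose_candidate` as well as the toy coordinates are hypothetical.

```python
import math


def centroid(points):
    """Centre of mass of the atoms placed so far (equal masses assumed)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def choose_candidate(candidates, centre, hydrophobic):
    """Pick the candidate nearest the centre if hydrophobic, else the farthest."""
    def distance_to_centre(p):
        return math.dist(p, centre)
    if hydrophobic:
        return min(candidates, key=distance_to_centre)
    return max(candidates, key=distance_to_centre)


chain = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
centre = centroid(chain)  # (1, 0, 0)
candidates = [(1.0, 1.0, 0.0), (5.0, 0.0, 0.0), (3.0, 3.0, 0.0)]
inward = choose_candidate(candidates, centre, hydrophobic=True)
outward = choose_candidate(candidates, centre, hydrophobic=False)
```

Sampling more than three candidates would strengthen the pull towards the centre, at the cost noted above of a higher chance that the chain runs into itself.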
Results
Figures 3.4 and 3.5 are plots of typical chain realisations without and with modelling
hydrophobic effects respectively. Chains in Figures 3.5 a, b, c and d have 45, 55, 65
and 75% of their amino acids as hydrophobic. We observe that more hydrophobic
content gives more compact and globular structures.
Comments
• The algorithm has no in-built capability to avoid the chain making knots or
getting entangled. The simple implementation was just to restart assembling
the chain afresh if several (e.g. 100) attempts to generate a point fail due to
distance constraints, as this is indicative of entanglement.
• As the folding direction is random, realisations leading to non-compact and
non-globular chains are possible as well, especially when hydrophobic effects
are not taken into account.
[Plots for Figure 3.4: three-dimensional chain realisations with axes x, y and z.]
Figure 3.4: Typical chain realisations in short protein chain simulations without
hydrophobic effects.
[Plots for Figure 3.5, panels a) to d): three-dimensional chain realisations with axes x, y and z.]
Figure 3.5: Chain realisations in short protein chain simulations with hydrophobic
effects.
3.1.3 Appropriateness of Simulated Data
In this chapter we considered how to simulate structures similar to proteins and
functional sites. Any attempt to simulate a random structure has to consider intrinsic
characteristics of proteins. However, it is difficult to completely separate random and
deterministic properties of protein structures except for well known characteristics
like the inhibition distance between atoms and compactness in globular proteins.
Figures 3.6 and 3.7 are plots for typical hardcore and short chain simulations for
functional sites and part of a protein structure in sections 3.1.1 and 3.1.2 respectively.
Figure 3.6a shows that the density field for hardcore configurations is not as
tubular as for functional sites in SITESDB. The density field in Figure 3.7 for the short
chain simulation looks very similar to the density field for the whole structure, e.g.
Figure 2.13. In all these simulations, the minimum inter-event distance of 3.8 Å is
well reflected. Also the structures are compact.
Arguably, hardcore configurations may not entirely mimic functional sites with
respect to tubularity. Also, because short chains have points completely connected,
they might not entirely reflect functional sites either. Probably functional sites are
mid-way between these two cases. However, we use hardcore configurations (e.g.
in sections 5.2.3, 5.4.4 and 6.1.8) and short chains (in section 6.1.8) to evaluate
matching algorithms. Although the motivation was matching functional sites, these
algorithms can be used for matching any type of configuration in other
applications, e.g. matching steroids in chemoinformatics (Dryden et al., 2006).
3.2 Evaluation
Functional site pairs were simulated as in section 3.1 and used to evaluate the
performance of matching methods as outlined below (section 3.2.1). Quality of
matching real functional sites is evaluated in section 4.1.4 of Chapter 4 using scores
defined in section 3.2.2 in addition to goodness-of-fit p-values proposed in section
4.1 of Chapter 4.
Figure 3.6: The K-function and inter-point distance distribution for a simulated
hardcore configuration. The K-function is normalised: K(t) − πt^2.
Figure 3.7: The K-function and inter-point distance distribution for a simulated
short chain configuration. The K-function is normalised: K(t) − πt^2.
3.2.1 Correct Correspondence
With simulated datasets, matching methods can be evaluated by the proportion of
correct point correspondences, because we know which point in {x} corresponds to
which point in {µ}. To evaluate the methods:
(a) For each combination of n and m (n = 4 to 64), thirty pairs of functional sites
{µi} and {xj} are generated.
(b) Randomly permute the order of the points in {x}. Thus, we no longer “know”
which µi and xj points correspond.
(c) Match points of {µ} to points of {x} and identify the correctly matched points.
Points in {µ} which do not correspond to any point in {x} count as having
correct correspondence if they are left unmatched.
(d) Calculate the average correct correspondence proportion for each n from the
replicate datasets.
(e) Plot the correct correspondence proportion against point set size (n) for each
method.
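Steps (b) to (d) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `correct_correspondence_proportion`, the list-of-indices encoding of a matching (with `None` for an unmatched point) and the toy data are hypothetical, and the "matcher" is a stand-in that simply copies the truth and spoils one entry.

```python
import random

def correct_correspondence_proportion(true_match, found_match):
    """Fraction of mu points whose matching outcome agrees with the truth.

    Both arguments are lists of length m: entry i is the index of the x point
    matched to mu_i, or None if mu_i is unmatched.
    """
    agree = sum(t == f for t, f in zip(true_match, found_match))
    return agree / len(true_match)

# Hypothetical toy data: five mu points, the last one has no partner in {x}.
truth_before_shuffle = [0, 1, 2, 3, None]

# Step (b): permute the order of the x points and track where each one went.
rng = random.Random(0)
order = list(range(4))
rng.shuffle(order)                          # order[new_position] = old_index
new_position = {old: new for new, old in enumerate(order)}
truth = [None if t is None else new_position[t] for t in truth_before_shuffle]

# Step (c): a pretend matcher that wrongly leaves mu_0 unmatched.
found = list(truth)
found[0] = None

# Step (d), for a single replicate: four of the five mu points are handled correctly.
proportion = correct_correspondence_proportion(truth, found)
print(proportion)
```

Note that the unmatched fifth point counts as correct because it is unmatched in both the truth and the result, as required by step (c).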
3.2.2 Scores
Gold (2003) introduces scores for assessing the quality of a match. Three matching
scores are considered:
(a) Option 0 (free matching): Corresponding amino acids found by distance criteria
only score one. The reported raw score is the total number of matched pairs.
(b) Option 1 (identity matching): Corresponding amino acids only add to the
score if they have the same amino acid identity.
(c) Option 2 (similarity matching): Corresponding amino acids score one if they
have the same amino acid identity and score half if they are in the same group but
not identical. In this thesis we use groups as defined in Table 5.4.
Thus the Option 2 score, S, is
\[ S = \sum_{i=1}^{q} s(\mu_i, x_i) \]
where amino acids µi and xi are geometrically equivalent amino acids in the matched
parts of configurations {µ} and {x}, and
\[ s(\mu_i, x_i) = \begin{cases} 1 & \text{if } k_i = c_i, \\ 0.5 & \text{else if } G_i = D_i, \\ 0 & \text{otherwise,} \end{cases} \]
where q is the number of matched pairs; ki and Gi denote the amino acid identity and
group of µi; similarly ci and Di denote the amino acid identity and group of xi.
Option 0 and 1 matching scores are easily expressed in the same form by appropriately
re-defining s(µi, xi).
Dividing the matching score by RMSD gives a final score. Thus, the final score
is a function of both geometrical and matching type measures. The matching score
is qualitative while RMSD is a quantitative measure. There is no rigorous statistical
interpretation of these scores (Gold, 2003). In Chapter 4 we consider goodness-of-fit
statistics for quantifying the quality of matches. We compare the quality indications
given by the scores with p-values from the distribution of the size-and-shape distance.
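The three scoring options and the final score can be written as one small function. The sketch below is illustrative only: the names `pair_score` and `final_score` are hypothetical, and the grouping dictionary is invented for the example (it is not the grouping of Table 5.4).

```python
def pair_score(id_mu, id_x, group_of, option=2):
    """Score one geometrically matched pair of amino acids."""
    if option == 0:                     # free matching: every pair counts
        return 1.0
    if id_mu == id_x:                   # identical amino acids
        return 1.0
    if option == 2 and group_of[id_mu] == group_of[id_x]:
        return 0.5                      # same group, different identity
    return 0.0

def final_score(pairs, rmsd, group_of, option=2):
    """Matching score S divided by RMSD."""
    s = sum(pair_score(a, b, group_of, option) for a, b in pairs)
    return s / rmsd

# Hypothetical grouping, for illustration only (not Table 5.4).
group_of = {"D": "charged", "E": "charged", "L": "hydrophobic", "S": "polar"}
pairs = [("D", "D"), ("D", "E"), ("L", "S")]

s0 = sum(pair_score(a, b, group_of, 0) for a, b in pairs)   # Option 0
s1 = sum(pair_score(a, b, group_of, 1) for a, b in pairs)   # Option 1
s2 = sum(pair_score(a, b, group_of, 2) for a, b in pairs)   # Option 2
final = final_score(pairs, rmsd=0.5, group_of=group_of)
print(s0, s1, s2, final)
```

For these three pairs the Option 0, 1 and 2 scores are 3, 1 and 1.5, so a smaller RMSD or more similar residues both raise the final score.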
Chapter 4
Match Statistics
In this chapter we focus on match statistics and their distributions under “random”
and “non-random” configuration hypotheses. By non-random we mean the scenario
where we know or suppose that the configurations are related and we are interested in
“goodness-of-fit”. The random hypothesis is for assessing the probability of finding
a matching configuration by mere chance.
4.1 Goodness-of-fit Statistics for Rigid Body Superpositions
We consider the optimal RMSD under the isotropic Gaussian landmark model for
assessing goodness-of-fit when matching two configurations. The quality of a rigid
body superposition of two configurations is often assessed by the RMSD after optimal
alignment. In general, the smaller the RMSD, the better the superposition or
matching.
4.1.1 Minimum RMSD Distribution
We simulated configurations {µ} and {x} as in section 3.1.1 of Chapter 3. As
in section 3.2.1, the order of the xi is randomly permuted so that we do not “know”
the corresponding µi and xj points.
Figure 4.1a is a plot of RMSD after solving for optimal correspondence and
alignment for {xp} and {µp}. The graph theoretic method was used to solve for
correspondence and alignment. We observe that RMSD variability decreases with
increasing n. This is because chance good matchings, and probably spuriously bad
matchings as well, are more likely with small n. This could be one of the reasons
for the well-known decrease in RMSD for small n when matching proteins, as only the
best superpositions are of interest. Figure 4.1b is a plot of the smallest 10 RMSD
values for each n = 4, 8, . . . , 64. Figure 4.1c is a plot of the minimum RMSD for each
n = 4, 8, . . . , 64. These are typical plots when considering RMSD for best matches
in proteins.
Figure 4.1: RMSD against number of corresponding points with “loess” smoothing
curves. a) Optimal RMSD after graph matching against number of points. b)
Minimum 10 RMSD values for each number of corresponding points in (a). c) The
minimum (best) RMSD for each value of n in (a) or (b).
Since RMSD depends on the number of corresponding points in the configurations,
one cannot directly compare two RMSD values from superimposing configurations
with different numbers of corresponding points. To overcome this problem
in Bioinformatics applications, Carugo and Pongor (2001) find how RMSD depends
on the number of corresponding points, q. These authors propose to adjust RMSD
values to RMSD100 values, i.e. interpolated RMSD values for q = 100. RMSD100
values are comparable, but this adjustment ignores the fact that the variability of
RMSD also changes with q. A classical way to take this variability into account is to
find the distribution of RMSD. One can then directly compare the standardised RMSD
or goodness-of-fit p-values.
4.1.2 Distribution of Size-and-shape Distance
The squared size-and-shape distance is q × RMSD^2 after optimal alignment of two
rigid body configurations. Let X and µ be coordinate matrices whose columns
represent corresponding points in the two configurations. The squared
size-and-shape distance is
\[ d_S^2(X, \mu) = S_X^2 + S_\mu^2 - 2 S_X S_\mu \cos\rho(X, \mu) \tag{4.1} \]
where $S_X^2 = \sum_{j=1}^{q} \|X_j - \bar{X}\|^2$ is the squared centroid size of X and ρ is the
Procrustes distance (see Definition 4.1.4 below and Dryden and Mardia, 1998).
We first define the Helmert sub-matrix, which is used to centre configurations at
the origin. Dividing the centred configuration by its norm then scales it to unit
centroid size (Definition 4.1.2).
Definition 4.1.1. The jth row of the Helmert sub-matrix H is given by
\[ (h_j, \ldots, h_j, -j h_j, 0, \ldots, 0), \qquad h_j = -\{j(j+1)\}^{-1/2}, \]
and so the jth row consists of hj repeated j times, followed by −jhj and then q−j−1
zeros, j = 1, . . . , q − 1.
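For q = 3 the full Helmert matrix is explicitly
\[ H_f = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} \end{pmatrix} \]
and the Helmert sub-matrix is
\[ H = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} \end{pmatrix}. \]
For q = 4 the full Helmert matrix is
\[ H_f = \begin{pmatrix} 1/2 & 1/2 & 1/2 & 1/2 \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} & 0 \\ -1/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} & 3/\sqrt{12} \end{pmatrix} \]
and the Helmert sub-matrix is
\[ H = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} & 0 \\ -1/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} & 3/\sqrt{12} \end{pmatrix}. \]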
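Definition 4.1.1 translates directly into code. The following sketch (the function name `helmert_submatrix` is hypothetical) builds H for any q and checks the two properties that matter here: each row sums to zero, which removes location, and the rows are orthonormal.

```python
import math

def helmert_submatrix(q):
    """Return the (q-1) x q Helmert sub-matrix as a list of rows."""
    rows = []
    for j in range(1, q):
        h = -1.0 / math.sqrt(j * (j + 1))
        # h_j repeated j times, then -j*h_j, then q-j-1 zeros.
        rows.append([h] * j + [-j * h] + [0.0] * (q - j - 1))
    return rows

H = helmert_submatrix(4)

# Rows sum to zero (location is removed) and are orthonormal.
row_sums = [sum(row) for row in H]
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in H] for r1 in H]
for row in H:
    print(["%.4f" % v for v in row])
```

For q = 4 the printed rows agree with the sub-matrix displayed above.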
Definition 4.1.2. The pre-shape of a configuration X is all the geometrical
information that remains when location and scale effects are filtered out from the object.
That is, the pre-shape of X is given by
\[ Z = \frac{X H^T}{\|X H^T\|} \]
where H is the Helmert sub-matrix.
The pre-shape of an object is invariant under translation and scaling of the
original configuration.
Definition 4.1.3. The pre-shape space is the space of all possible pre-shapes.
Formally, the pre-shape space $S_d^q$ is the orbit space of the non-coincident q point set
configurations in ℜ^d under the action of translation and isotropic scaling.
The pre-shape space $S_d^q \equiv S^{d(q-1)-1}$ is a hypersphere of unit radius in d(q − 1)
real dimensions, since the centroid size of Z is ‖Z‖ = 1.
Definition 4.1.4. The Procrustes distance ρ(X,µ) is the closest great circle
distance between pre-shapes of X and µ on the pre-shape sphere. The minimisation
is carried out over rotations.
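Definitions 4.1.2 and 4.1.4 can be illustrated numerically. The sketch below assumes NumPy is available and uses the standard result that, after optimising over rotations, cos ρ equals the sum of the singular values of the product of the two pre-shapes, with the smallest singular value negated when a reflection would otherwise be needed. The function names and the test configuration are hypothetical.

```python
import numpy as np

def helmert_submatrix(q):
    """(q-1) x q Helmert sub-matrix of Definition 4.1.1."""
    H = np.zeros((q - 1, q))
    for j in range(1, q):
        h = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = h
        H[j - 1, j] = -j * h
    return H

def preshape(X):
    """Pre-shape of a d x q configuration whose columns are points."""
    XH = X @ helmert_submatrix(X.shape[1]).T
    return XH / np.linalg.norm(XH)

def procrustes_distance(X, mu):
    """Great-circle distance between pre-shapes, minimised over rotations."""
    M = preshape(X) @ preshape(mu).T
    s = np.linalg.svd(M, compute_uv=False)
    if np.linalg.det(M) < 0:            # exclude reflections
        s[-1] = -s[-1]
    return float(np.arccos(np.clip(s.sum(), -1.0, 1.0)))

rng = np.random.default_rng(1)
mu = rng.normal(size=(3, 6))

# A rotated, scaled and shifted copy has the same shape, so rho should be ~0.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
X = 2.5 * R @ mu + np.array([[1.0], [-2.0], [0.5]])
rho_same = procrustes_distance(X, mu)
rho_other = procrustes_distance(rng.normal(size=(3, 6)), mu)
print(rho_same, rho_other)
```

The first distance is zero up to rounding, which reflects the invariance of the pre-shape to translation and scaling together with the minimisation over rotations.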
From the distribution of size-and-shape we will derive the distribution of RMSD,
which is a function of size-and-shape. RMSD is commonly used in Bioinformatics
applications while size-and-shape is mainly used in Morphometry. We use the
distribution of RMSD in a Bioinformatics application to rank best matches from a
database search in section 4.1.4.
Consider the distribution of RMSD, $r = d_S(X, \mu)/\sqrt{q}$, under the isotropic Gaussian
model for corresponding points, i.e. $x_j \sim N(\mu_i, \sigma^2 I_d)$ where point µi corresponds
to point xj and d = 3 is the dimension. Thus RMSD is a function of two random
variables, $S_X$ and ρ. Under our model, we assume $S_\mu$ is fixed while $S_X^2$ is distributed
as non-central $\chi^2_\nu(\lambda)$ with ν = dq − d and $\lambda = S_\mu^2/\sigma^2$. After optimal superposition
of configurations with q points, the full Procrustes distance satisfies
\[ d_F^2 = \sin^2\rho(X, \mu) \sim \tau_0^2 \, \chi^2_{dq - d(d-1)/2 - d - 1} \]
with $\tau_0^2 = \sigma^2/S_\mu^2$. We consider exact and approximate distributions for r in the
following sections. The approximation applies when $S_X \approx S_\mu$ and the variability of
$S_X$ is small.
Exact Distribution
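We first consider the distribution for $d_S^2(X, \mu)$. With
\[ \sin^2\rho(X, \mu) \sim \tau_0^2 \, \chi^2_{dq - d(d-1)/2 - d - 1} \]
the density for x = cos ρ is
\[ f(x) = \frac{2x\beta^\alpha}{\Gamma(\alpha)} (1 - x^2)^{\alpha - 1} e^{-\beta(1 - x^2)} \tag{4.2} \]
where $\beta = 1/(2\tau_0^2)$ and $\alpha = \{dq - d(d-1)/2 - d - 1\}/2$, and the density for $y = S_X^2$ is
\[ f(y) = \frac{1}{2\sigma^2 \sqrt{2\pi\sqrt{\lambda y/\sigma^2}}} \, e^{-(\lambda + y/\sigma^2)/2} \left(\frac{y}{\lambda\sigma^2}\right)^{-1/4} \left\{ e^{\sqrt{\lambda y/\sigma^2}} + e^{-\sqrt{\lambda y/\sigma^2}} \right\}. \tag{4.3} \]
Assuming independence between x = cos ρ and $y = S_X^2$, the joint distribution for x and y is
\[ f(x, y) = \frac{x\beta^\alpha}{\Gamma(\alpha)} (1 - x^2)^{\alpha - 1} e^{-\beta(1 - x^2)} \times \frac{1}{\sigma^2 \sqrt{2\pi\sqrt{\lambda y/\sigma^2}}} \, e^{-(\lambda + y/\sigma^2)/2} \left(\frac{y}{\lambda\sigma^2}\right)^{-1/4} \left\{ e^{\sqrt{\lambda y/\sigma^2}} + e^{-\sqrt{\lambda y/\sigma^2}} \right\}. \tag{4.4} \]
Let $v = S_\mu^2 + y - 2 S_\mu x \sqrt{y}$ and u = y. The inverse functions are y = u and
$x = (S_\mu^2 + u - v)/(2\sqrt{u} S_\mu)$, and the Jacobian of the transformation is
$|J| = 1/(2\sqrt{u} S_\mu)$. Hence the joint distribution for v and u is
\[ f(u, v) = \frac{(S_\mu^2 + u - v)\beta^\alpha}{4 u S_\mu^2 \Gamma(\alpha)} \left\{ 1 - \left(\frac{S_\mu^2 + u - v}{2\sqrt{u} S_\mu}\right)^2 \right\}^{\alpha - 1} \exp\left\{ -\beta\left( 1 - \left(\frac{S_\mu^2 + u - v}{2\sqrt{u} S_\mu}\right)^2 \right) \right\} \]
\[ \qquad \times \frac{1}{\sigma^2 \sqrt{2\pi\sqrt{\lambda u/\sigma^2}}} \, e^{-(\lambda + u/\sigma^2)/2} \left(\frac{u}{\lambda\sigma^2}\right)^{-1/4} \left\{ e^{\sqrt{\lambda u/\sigma^2}} + e^{-\sqrt{\lambda u/\sigma^2}} \right\}. \tag{4.5} \]
It is not easy to integrate out u in order to get the distribution for v. Thus we
consider an approximation for the size-and-shape distance.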
Approximation
We consider the distribution of the approximate size-and-shape distance when the
variability of $S_X$ is so small, or $S_X \approx S_\mu$, that we can treat $S_X$ as a constant as well.
For example, in Bioinformatics applications the interesting cases are those where
matching is good, hence the configurations are of the same size, i.e. $S_X \approx S_\mu$. Thus
\[ d_S^2(X, \mu) \approx 2 S_\mu^2 (1 - \cos\rho(X, \mu)). \tag{4.6} \]
With $\sin^2\rho(X, \mu) \sim \tau_0^2 \chi^2_{dq - d(d-1)/2 - d - 1}$, the approximate1 density for r is
\[ f(r) = \frac{q r \beta^\alpha}{S_\mu^2 \Gamma(\alpha)} \left(2 - \frac{q r^2}{S_\mu^2}\right) \left(\frac{q r^2}{S_\mu^2} - \left(\frac{q r^2}{2 S_\mu^2}\right)^2\right)^{\alpha - 1} \exp\left\{ -\beta\left(\frac{q r^2}{S_\mu^2} - \left(\frac{q r^2}{2 S_\mu^2}\right)^2\right) \right\} \tag{4.7} \]
where $\beta = 1/(2\tau_0^2)$ and $\alpha = \{dq - d(d-1)/2 - d\}/2$. We adjust the degrees of freedom
because we do not allow scaling, i.e. we multiply with $S_\mu^2$. We only lose d(d − 1)/2 + d
degrees of freedom for rotation and translation as
\[ d_S^2(X, \mu) = \inf_{A \in SO(d),\, b} \|\mu - AX - b\|^2 \]
where SO(d) denotes the set of all d × d rotation matrices (orthogonal matrices with
determinant equal to +1).
1 This is the density for $S_\mu \sqrt{2(1 - \cos\rho(X, \mu))/q}$, an approximate size-and-shape distance in
closely fitting configurations.
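As a rough numerical check, the approximate density can be coded and integrated. The sketch below uses only the standard library and rests on stated assumptions: the density is obtained from (4.2) by the change of variable cos ρ = 1 − qr²/(2S_µ²), so that sin²ρ is treated as a gamma variable with shape α = {dq − d(d−1)/2 − d}/2 and rate β = S_µ²/(2σ²); the function names, the trapezoidal rule and the step count are choices made here, not part of the thesis. It is not claimed to reproduce the p-values of Table 4.1.

```python
import math

def rmsd_density(r, q, s_mu, sigma, d=3):
    """Approximate density of RMSD at r, from the change of variable above."""
    alpha = (d * q - d * (d - 1) / 2 - d) / 2
    beta = s_mu ** 2 / (2 * sigma ** 2)          # beta = 1 / (2 tau_0^2)
    t = q * r ** 2 / s_mu ** 2
    if r <= 0 or t >= 2:                         # outside 0 < cos(rho) < 1
        return 0.0
    w = t - (t / 2) ** 2                         # sin^2(rho) as a function of r
    log_f = (math.log(q * r / s_mu ** 2) + math.log(2 - t)
             + alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1) * math.log(w) - beta * w)
    return math.exp(log_f)

def upper_tail_p_value(r_obs, q, s_mu, sigma, d=3, steps=20000):
    """P(RMSD >= r_obs) under the approximate density, by the trapezoidal rule."""
    r_max = s_mu * math.sqrt(2.0 / q)            # value of r at cos(rho) = 0
    h = (r_max - r_obs) / steps
    ys = [rmsd_density(r_obs + k * h, q, s_mu, sigma, d) for k in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# Illustrative inputs: q = 21 points, centroid size 41.46, sigma = 0.3.
total_mass = upper_tail_p_value(0.0, q=21, s_mu=41.46, sigma=0.3)
p_small = upper_tail_p_value(0.3, q=21, s_mu=41.46, sigma=0.3)
p_large = upper_tail_p_value(0.6, q=21, s_mu=41.46, sigma=0.3)
print(total_mass, p_small, p_large)
```

The total mass should be very close to one, and a larger observed RMSD gives a smaller goodness-of-fit p-value, which is the ordering used when ranking matches.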
4.1.3 Simulations for RMSD Distribution
We simulate {µ} and {x} as in section 3.1.1. However, here n = m = 20 and we
simulated 10,000 pairs. The order of the xi is randomly permuted as in section 3.2.1
so that we do not “know” the corresponding µi and xj points.
Figure 4.2a gives a histogram of RMSD after optimal superposition using the graph
theoretic method. Superimposed on this histogram is the probability density function
in equation 4.7. We observe that this approximate distribution is a good fit.
Figure 4.2b is a plot of the empirical distribution function and the cumulative
distribution function corresponding to equation 4.7. We also observe a good fit here.
Therefore a goodness-of-fit p-value from our approximate distribution can be used.
4.1.4 Application
We did a database search with a functional site of 5-aminolaevulinate dehydratase
(1b4e 0) using the graph method of Gold (2003). The standard deviation, σ, is
estimated to be around 0.3 for matching functional sites known to be related (functional
sites from 17-β hydroxysteroid dehydrogenase and carbonyl reductase proteins
shown in Figure 1.3) at a threshold of 1.5 Å. Thus we set σ = 0.3 for a matching
distance tolerance of 1.5 Å (cf. section 5.3 of Chapter 5).
Table 4.1 gives the results for the best 50 matches sorted by goodness-of-fit p-values.
Also given are the scores proposed by Gold (2003) and described in section 3.2.2. The
scores given in Table 4.1 are found by dividing values of the score option 2 (see
section 3.2.2) by the RMSD.
We observed that 1eb3 0 (No 13) has a higher p-value than 1i8j 2 (No 14) although
the latter has a lower RMSD. There is an agreement between the p-value and
the score ranking. The better match has 21 corresponding amino acids compared to
8 for the other match. This justifies a better goodness-of-fit even though its RMSD
is higher than that of the other. This scenario is also observed for 1gjp 0 and 1l6s 2
(No 16 and 17), and for 1h7r 0 and 1l6y 3 (No 18 and 19).
Figure 4.2: Approximate RMSD distribution. a) Histogram of RMSD after optimal
superposition using graph theoretic method. b) Empirical and approximate
(equation 4.7) distribution functions for RMSD.
4.1.5 Summary
A bigger challenge is to work out analytically the exact distribution of the number of
matches and RMSD when matching random configurations. Unlike our attempt
to find a good approximating distribution for a Procrustes metric when matching
random configurations, Stark et al. (2003b) empirically modelled the distribution
of RMSD with the extreme value distribution. We follow Stark et al. (2003b) to
calculate p-values for matching random (unrelated) configurations in our application
in section 7.4 of Chapter 7.
The distribution of the number of matches can also be modelled by the extreme
value distribution, e.g. Chen and Crippen (2005).
Table 4.1: Best fitting functional sites in the database when matched against the
5-aminolaevulinate dehydratase functional site (1b4e 0).
No.  SITE    q   Sµ     SX     RMSD      SCORE  P-value
 1   1b4e 0  21  41.46  41.46  0.000000  NA     1.0000000
 2   1h7n 0  21  41.46  42.19  0.325806  0.513  0.9999999
 3   1i8j 4  15  32.56  32.84  0.264983  0.502  0.9999993
 4   1l6s 6  15  32.56  32.87  0.275493  0.492  0.9999982
 5   1l6y 6  18  37.31  37.90  0.325130  0.479  0.9999966
 6   1l6s 7  15  32.56  32.87  0.285211  0.465  0.9999946
 7   1i8j 5  15  32.56  32.79  0.280753  0.467  0.9999943
 8   1h7p 0  21  41.46  42.53  0.394005  0.414  0.9999942
 9   1ohl 0  21  41.46  42.33  0.376413  0.437  0.9999863
10   1l6y 0  20  40.27  41.07  0.371252  0.540  0.9999744
11   1i8j 0   8  16.40  16.60  0.204785  0.325  0.9999703
12   1h7o 0  21  41.46  42.57  0.412440  0.407  0.9999651
13   1eb3 0  21  41.46  42.38  0.391614  0.422  0.9999554
14   1i8j 2   8  15.34  15.65  0.239518  0.300  0.9998613
15   1l6s 4   8  15.34  15.64  0.242736  0.302  0.9998003
16   1gjp 0  20  40.27  41.48  0.447832  0.352  0.9995251
17   1l6s 2   8  15.29  15.56  0.250936  0.290  0.9995213
18   1h7r 0  20  39.97  41.16  0.473749  0.318  0.9939235
19   1l6y 3   7  12.98  13.14  0.315031  0.192  0.9617287
20   1b4k 0  20  40.27  41.00  0.457764  0.266  0.9577919
21   1e51 0  20  40.07  41.56  0.573448  0.279  0.8148693
22   1gzg 0  21  41.46  42.20  0.504140  0.234  0.7643607
23   1b4k 1  17  34.74  35.62  0.582555  0.203  0.2456983
24   1hrs 2   3   5.72   5.46  0.766603  0.020  0.0006607
25   1m7h 6   5   9.52   8.68  0.869824  0.019  0.0002631
Chapter 5
EM Algorithm Alignment
The commonly used graph theoretic approach (reviewed in section 1.2.1) and other
related approaches e.g. geometric hashing (Wallace et al., 1997) require adjustment
of a matching distance threshold a priori according to the noise in atomic positions.
This is difficult to pre-determine when matching sites related by varying evolutionary
distances and crystallographic precision.
To avoid the problem of specifying a matching distance threshold, in this chapter
we consider using an EM algorithm in the mixture model formulation of the problem
of finding an alignment and point correspondences between two configurations.
Assume we are given two configurations {µi : i = 1, . . . , m} and {xj : j = 1, . . . , n}
in ℜ^d. Suppose there are q ∈ {2, . . . , n} corresponding points in these configurations
under rigid body transformation1. However we do not know
(a) which are the corresponding points;
(b) the number of corresponding points, q;
(c) the transformation parameters.
We reviewed some approaches for solving this problem in section 1.2.1. We review
a statistical approach by Kent et al. (2004) using a mixture model in section 5.1. In
section 5.2, we consider using concomitant information to point coordinates in the
Kent et al. (2004) mixture model framework.
1 We are interested in matching at least two points.
5.1 Mixture Model
Given configurations {µi : i = 1, . . . , m} and {xj : j = 1, . . . , n} in ℜ^d with
correspondence and alignment unknown, Taylor et al. (2003) formulate a mixture model
to solve for both correspondence and alignment simultaneously. Correspondence is
considered to be missing data and the EM algorithm is used. Expected values of the
mixture indicator variables are calculated in the E-step; alignment parameters that
maximise the expected log likelihood are estimated in the M-step using Procrustes
analysis. This is known as “soft” matching because we use expected values of the
correspondence indicator variables.
5.1.1 Soft Matching of Forms
Let {µi} have at least as many points as {xj}, i.e. n ≤ m, so that we can assume that
{xj} has arisen from {µi} through some transformation, with possibly some points in
{µi} not appearing in {xj}. This model is plausible for the motivating problem in
protein functional sites and certainly in some applications in chemoinformatics as well
(Dryden et al., 2006). The restriction that n ≤ m is without loss of generality in
many applications because in practice there is no knowledge of which of the two
configurations to be matched gave rise to the other (parentage), i.e. {xj} and {µi}
are exchangeable. Furthermore, the parentage is of no practical use as far as matching
configurations is concerned. Indeed, for the Bayesian approach in Chapter 6,
configuration sizes do not matter even for formulating the methodological matching
framework and algorithm.
Let the map π(j) = i denote correspondence between points xj and µi. If
π(j) = i then assume
\[ x_j = A^T \mu_i + b + \varepsilon_j, \qquad \varepsilon_j \sim IN(0, \sigma^2 I_d) \]
where σ^2 is unknown and A is an orthogonal matrix. That is, for fixed j, we take
for i = 0, 1, . . . , m,
\[ \phi(x_j \mid \pi(j) = i) = \begin{cases} (2\pi\sigma^2)^{-d/2} \exp\left\{ -\tfrac{1}{2} \|x_j - A^T\mu_i - b\|^2 / \sigma^2 \right\} & \text{if } i \neq 0, \\ 1/\|W\| & \text{if } i = 0, \end{cases} \tag{5.1} \]
where ‖W‖ is the volume of W.
The convention π(j) = 0 is used to classify a point xj which does not correspond to
any point µi. These points are referred to as coffin bin points. Coffin bin points are
assumed to be uniformly distributed in a region W ⊂ ℜ^d, i.e.
\[ x_j \mid (\pi(j) = 0) \sim \mathrm{Uniform}(W). \]
The marginal distribution of xj is given by the mixture model
\[ x_j \sim \sum_{i=1}^{m} P(\pi(j) = i) \, N(A^T\mu_i + b, \sigma^2 I_d) + P(\pi(j) = 0) \, \mathrm{Uniform}(W) \tag{5.2} \]
where P(π(j) = i), i = 0, . . . , m, are marginal membership probabilities and
\[ \sum_{i=0}^{m} P(\pi(j) = i) = 1. \]
Alternatively we can assume a normal distribution for coffin bin points, i.e.
\[ x_j \mid (\pi(j) = 0) \sim N(\mu_0, \sigma_0^2 I_d) \]
where µ0 can be taken to be the centre of mass for {µ} and $\sigma_0^2$ is large.
5.1.2 Model Likelihood
Let X = (x1, . . . , xn)^T and let L be a set of labels. Given L, the likelihood is
\[ Q(X \mid L) = \prod_{i=0}^{m} \prod_{j=1}^{n} p_i^{I[\pi(j)=i]} \, \phi(x_j \mid \pi(j) = i)^{I[\pi(j)=i]} \]
where I is an indicator function such that
\[ I[\pi(j) = i] = \begin{cases} 1 & \text{if } \pi(j) = i, \\ 0 & \text{otherwise,} \end{cases} \]
and pi = P(π(j) = i) is the mixing probability for any x to have label i.
Hence
\[ \log Q(X \mid L) = \sum_{i=0}^{m} \sum_{j=1}^{n} \left\{ I[\pi(j) = i] \log p_i + I[\pi(j) = i] \log \phi(x_j \mid \pi(j) = i) \right\}. \tag{5.3} \]
With the labels unknown, let
\[ p_i = P(\pi(j) = i), \quad i = 0, 1, \ldots, m; \qquad \sum_{i=0}^{m} p_i = 1 \]
be the prior probability that label π(j) is i. The posterior probability is
\[ p_{ji} = P(\pi(j) = i \mid x_j) = \frac{P(x_j \mid \pi(j) = i)}{P(x_j)} \, p_i \]
and (pji) is an n × (m + 1) matrix. Note that
\[ P(x_j) = \sum_{i=1}^{m} p_i \, \phi(x_j \mid \pi(j) = i) + p_0 \, \phi(x_j \mid \pi(j) = 0) \]
and P(xj | π(j) = i) ≡ φ(xj | π(j) = i).
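The posterior probabilities can be computed directly from the mixture densities. The sketch below is a minimal illustration with the uniform coffin bin; the function name `e_step`, the argument names and the toy coordinates are hypothetical, and the µ points are assumed to have been transformed to A^T µi + b beforehand.

```python
import math

def e_step(x, mu_transformed, p, sigma, volume):
    """Posterior membership probabilities p_ji for every point x_j.

    x              : list of n points (tuples of length d)
    mu_transformed : list of m points, already mapped to A^T mu_i + b
    p              : mixing probabilities [p_0, p_1, ..., p_m]
    sigma          : standard deviation of the isotropic Gaussian noise
    volume         : volume of the coffin bin region W
    Returns an n x (m + 1) list; column 0 is the coffin bin.
    """
    d = len(x[0])
    norm_const = (2 * math.pi * sigma ** 2) ** (-d / 2)
    post = []
    for xj in x:
        weights = [p[0] / volume]                       # uniform coffin bin density
        for i, mi in enumerate(mu_transformed, start=1):
            sq = sum((a - b) ** 2 for a, b in zip(xj, mi))
            weights.append(p[i] * norm_const * math.exp(-0.5 * sq / sigma ** 2))
        total = sum(weights)                            # this is P(x_j)
        post.append([w / total for w in weights])
    return post

# Hypothetical toy data: the first x lies next to mu_1, the second is far from everything.
mu_t = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
x = [(0.1, 0.0, 0.0), (50.0, 50.0, 50.0)]
post = e_step(x, mu_t, p=[1 / 3, 1 / 3, 1 / 3], sigma=0.5, volume=1000.0)
print(post)
```

In this example the first point is assigned to µ1 with probability close to one, while the distant point falls almost entirely into the coffin bin.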
5.1.3 The EM Algorithm
In summary form, the algorithm involves:
• E-step: calculating assignment probabilities (expectation of correspondence
indicator variables).
• M-step: finding transformation and nuisance (variance, σ) parameters which
maximise the expected log likelihood given current assignment probabilities.
Procrustes fit is used to find transformation parameters.
• Repetition of E and M steps until convergence of residual sum of squares.
Algorithm mechanics
Let pi be given, with starting values $p_i^{(0)} = 1/(m + 1)$, say. Then the E-step is
\[ p_{ji}^{(r+1)} = \frac{P(x_j \mid \pi(j) = i)}{P(x_j)} \, p_i^{(r)}. \]
Substituting pji for I[π(j) = i], the log likelihood is
\[ \sum_{i=0}^{m} \sum_{j=1}^{n} \left\{ p_{ji} \log p_i + p_{ji} \log \phi(x_j \mid \pi(j) = i) \right\}. \tag{5.4} \]
Thus in the M-step, we minimise
\[ f(A, b) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \|x_j - A^T\mu_i - b\|^2 \tag{5.5} \]
using a Procrustes fit for rigid body motion, where A is an orthogonal matrix. If $V \Gamma U^T$
is a singular value decomposition of
\[ B = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} (\mu_i - \bar{\mu})(x_j - \bar{x})^T, \qquad \bar{\mu} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \mu_i}{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}}, \]
with $\bar{x}$ the corresponding weighted mean of the xj, then $A = V U^T$.
Thus for the (r + 1)th iteration we have
\[ B^{(r+1)} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} (\mu_i - \bar{\mu})(x_j - \bar{x})^T, \qquad A^{(r+1)} = (V U^T)^{(r+1)}. \]
By minimising (5.5) w.r.t. b, we have
\[ b^{(r+1)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \left( x_j - (A^{(r+1)})^T \mu_i \right)}{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}}. \]
Finally update the mixing proportions:
\[ p_i^{(r+1)} = \frac{\sum_j p_{ji}^{(r)}}{\sum_{j,i} p_{ji}^{(r)}} = \frac{\sum_j p_{ji}^{(r)}}{\sum_j 1} = \frac{\sum_j p_{ji}^{(r)}}{n}. \]
The E and M steps are repeated until convergence of the residual sum of squares
\[ \sum_{i=1}^{m} (x_i^{(r+1)} - x_i^{(r)})^T (x_i^{(r+1)} - x_i^{(r)}) \]
where $x_i^{(r)} = (A^{(r)})^T \mu_i + b^{(r)}$. To ensure convergence of the correspondence matrix
(pji) as well, use $x_i^{(r)} = \sum_{j=1}^{n} p_{ji} (A^{(r)})^T \mu_i + b^{(r)}$. Another criterion of convergence could
be the log-likelihood (5.4).
At the rth iteration, the correspondence probability weighted maximum likelihood
estimate of σ^2 is
\[ (\sigma^2)^{(r)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \|x_j - (A^T)^{(r)} \mu_i - b^{(r)}\|^2}{d \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}} \]
where d = 3 is the dimension. The unweighted estimator is
\[ \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \|x_j - (A^T)^{(r)} \mu_i - b^{(r)}\|^2}{d \times n \times m}. \]
For normally distributed coffin bin points, the maximum likelihood estimate of $\sigma_0^2$
is
\[ (\sigma_0^2)^{(r)} = \frac{\sum_{j=1}^{n} p_{j0}^{(r)} \|x_j - (A^T)^{(r)} \mu_0 - b^{(r)}\|^2}{d \sum_{j=1}^{n} p_{j0}^{(r)}}. \]
The unweighted estimate is
\[ \frac{\sum_{j=1}^{n} \|x_j - (A^T)^{(r)} \mu_0 - b^{(r)}\|^2}{d \times n}. \]
We take µ0 to be $(A^{(r)})^T \mu_c + b^{(r)}$ where $A^{(r)}$ and $b^{(r)}$ are the rth estimates for matrix A
and vector b respectively, and µc is the centre of mass for {µ}. Simulation studies show that
using the normal distribution for coffin bin points in this way gives similar results
to using the uniform distribution.
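The E and M steps above fit together as in the following sketch. It assumes NumPy, stores points as rows, uses the uniform coffin bin with a hypothetical volume, runs a fixed number of iterations instead of a convergence test, and includes no guard against A being a reflection. The function name `em_align`, the starting values and the toy data are all assumptions made for illustration.

```python
import numpy as np

def em_align(mu, x, sigma=2.0, volume=1e4, n_iter=30):
    """Soft matching of x (n x d) to mu (m x d) under x_j = A^T mu_i + b + noise."""
    m, d = mu.shape
    n = x.shape[0]
    A, b = np.eye(d), np.zeros(d)
    p = np.full(m + 1, 1.0 / (m + 1))            # p[0] is the coffin bin
    for _ in range(n_iter):
        # E-step: posterior probabilities p_ji, column 0 = coffin bin.
        fitted = mu @ A + b                       # row i is A^T mu_i + b
        sq = ((x[:, None, :] - fitted[None, :, :]) ** 2).sum(axis=2)
        dens = (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-0.5 * sq / sigma ** 2)
        w = np.hstack([np.full((n, 1), p[0] / volume), dens * p[1:]])
        P = w / w.sum(axis=1, keepdims=True)
        # M-step: weighted Procrustes fit using the non-coffin columns.
        W = P[:, 1:]                              # W[j, i] = p_ji
        tot = W.sum()
        mu_bar = W.sum(axis=0) @ mu / tot
        x_bar = W.sum(axis=1) @ x / tot
        B = (mu - mu_bar).T @ W.T @ (x - x_bar)   # sum_ij p_ji (mu_i - mu_bar)(x_j - x_bar)^T
        V, _, Ut = np.linalg.svd(B)
        A = V @ Ut                                # A = V U^T
        b = x_bar - A.T @ mu_bar
        resid = ((x[:, None, :] - (mu @ A + b)[None, :, :]) ** 2).sum(axis=2)
        sigma = np.sqrt((W * resid).sum() / (d * tot))
        p = P.sum(axis=0) / n                     # update mixing proportions
    return A, b, sigma, P

# Hypothetical toy data: x is a rotated, shifted, noisy copy of the first five mu points.
mu = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0],
               [0, 0, 10], [10, 10, 0], [10, 0, 10]], dtype=float)
t = 0.1
R = np.array([[np.cos(t), -np.sin(t), 0], [np.sin(t), np.cos(t), 0], [0, 0, 1]])
rng = np.random.default_rng(0)
x = mu[:5] @ R.T + np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.05, size=(5, 3))
A, b, sigma_hat, P = em_align(mu, x)
matches = P[:, 1:].argmax(axis=1)                 # most probable mu index for each x_j
print(matches, sigma_hat)
```

Because the data are generated as x = Rµ + shift, the fitted A^T should be close to R. Like any EM scheme, the sketch can converge to a local optimum when the starting alignment is poor, so the small rotation here is a deliberately easy case.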
5.1.4 Hardening of Soft Matches
After the algorithm converges, we need to turn (pji) into a “permutation matrix”
(p′ji) with p′ji ∈ {0, 1}. This assigns corresponding points and puts non-corresponding
points in the coffin bin. This is a typical linear assignment (LA) task.
There are a number of ways and considerations to accomplish this.
Greedy algorithm
We get “hardened” matching probabilities p′ji ∈ {0, 1} from pji using a greedy
algorithm. Here p′ji is set to 1 if pji is the biggest value in its respective row and
column; otherwise it is set to 0. Column i and row j are removed once p′ji is set to 1.
However, one has to consider how to treat the coffin bin. We suggest excluding
the coffin bin column in the greedy algorithm and afterwards allocating all remaining
points to the coffin bin. Thus for i = 1, . . . , m (excluding the coffin bin column, i = 0)
and j = 1, . . . , n, we get p′ji according to the following rule:
\[ p'_{ji} = \begin{cases} 1 & \text{if } p_{ji} = \max_{i'} p_{ji'} = \max_{j'} p_{j'i}, \\ 0 & \text{otherwise.} \end{cases} \]
By leaving out the coffin bin in the greedy algorithm and only allocating the
left-overs to the coffin bin, we prioritise matching points over coffin bin allocation. The
only problem with the greedy algorithm is that there is no guarantee of a globally
optimal assignment.
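A minimal sketch of the greedy hardening step follows. It repeatedly takes the largest remaining entry, which is necessarily the maximum of its row and column among the entries still available, and removes that row and column. The function name `greedy_harden` and the optional cut-off `delta` (a hypothetical parameter that stops the search once the remaining probabilities are too small) are assumptions for illustration.

```python
def greedy_harden(P, delta=0.0):
    """P[j][i]: soft match of x_j to the (i+1)th mu point; coffin column excluded.

    Returns match[j] = index i assigned to x_j, or None when x_j is left
    for the coffin bin.
    """
    n, m = len(P), len(P[0])
    free_rows, free_cols = set(range(n)), set(range(m))
    match = [None] * n
    while free_rows and free_cols:
        # The largest remaining entry is maximal in both its row and its column.
        j, i = max(((j, i) for j in free_rows for i in free_cols),
                   key=lambda ji: P[ji[0]][ji[1]])
        if P[j][i] < delta:
            break                       # everything left goes to the coffin bin
        match[j] = i
        free_rows.remove(j)
        free_cols.remove(i)
    return match

# Hypothetical soft matches for two x points and three mu points.
P = [[0.70, 0.20, 0.05],
     [0.60, 0.30, 0.05]]
plain = greedy_harden(P)
strict = greedy_harden(P, delta=0.5)
print(plain, strict)
```

In this example both x points prefer the first µ point; the greedy rule gives it to the stronger claim and the second x point takes its next best option, unless the cut-off sends it to the coffin bin instead.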
Dynamic programming (DP) and linear programming (LP)
Mathematically, the linear assignment task is the problem of maximising
\[ Z = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ji} x_{ji}, \]
subject to
\[ \sum_{i=1}^{n} x_{ji} = 1 \ (j = 1, \ldots, n), \qquad \sum_{j=1}^{n} x_{ji} = 1 \ (i = 1, \ldots, n), \qquad x_{ji} = 1 \text{ or } 0 \ \ \forall i, j, \]
where C = (cji) is a given cost matrix and X = (xji) is the solution matrix. The
constraints enforce unique matching.
The linear assignment problem is a special type of linear programming problem.
Dynamic and linear programming guarantee a globally optimal assignment solution.
These methods find a set of pairs {(j, i)} with unique j, i = 1, . . . , n which maximises
the objective function Z subject to the constraints. Among several efficient linear
assignment algorithms are variants of the Hungarian method (Kuhn, 1955; Hung and
Rom, 1980; Karp, 1980; Jonker and Volgenant, 1987; Wright, 1990; Murty Katta,
1968). Another class of linear assignment algorithms includes the general simplex
algorithm and simplex-based algorithms2.
To accommodate the coffin bin, we define a cost matrix
\[ C = (c_{ji}), \qquad j = -(2m - n - 1), \ldots, 0, 1, \ldots, n, \qquad i = -(m - 1), \ldots, 0, 1, \ldots, m \]
with
\[ c_{ji} = \begin{cases} p_{j'i'} & \text{for } i = i',\ j = j' \text{ and } i' = 0, \ldots, m;\ j' = 1, \ldots, n, \\ p_{j'0} & \text{for } i < 0,\ j = j' \text{ and } j' = 1, \ldots, n, \\ 0 & \text{for } j \le 0. \end{cases} \]
Thus, to allow the possibility of any xj being assigned to the coffin bin, m − 1 columns
are added to the matrix (cji). This matrix is made square by adding extra rows
(dummy xj's) with zeros. Then linear programming is used to solve for the assignments
which maximise the objective function Z. Linear programming is more efficient
than dynamic programming, e.g. Karp (1980) gives a linear programming algorithm
with expected execution time of the order O(mn log n).
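The padded cost matrix can be illustrated on a tiny example. In the sketch below the matrix is built as described (m columns for the µ points, m coffin bin columns, and dummy rows of zeros to make it square), and the maximiser of Z is found by exhaustive search over permutations. The exhaustive search is only a stand-in for a Hungarian or linear programming solver and is feasible for very small m; the function names and the toy probabilities are hypothetical.

```python
from itertools import permutations

def padded_cost_matrix(P):
    """P[j][0] is the coffin probability of x_j, P[j][i] the match to mu_i."""
    n, m = len(P), len(P[0]) - 1
    real = [list(row[1:]) + [row[0]] * m for row in P]   # m mu columns, m coffin columns
    dummy = [[0.0] * (2 * m) for _ in range(2 * m - n)]   # dummy x rows of zeros
    return real + dummy

def harden_by_assignment(P):
    """Exhaustive maximiser of Z; a stand-in for a Hungarian or LP solver."""
    n, m = len(P), len(P[0]) - 1
    C = padded_cost_matrix(P)
    size = 2 * m
    best = max(permutations(range(size)),
               key=lambda cols: sum(C[j][cols[j]] for j in range(size)))
    # Columns 0..m-1 are mu_1..mu_m; columns m..2m-1 all mean "coffin bin".
    return [best[j] if best[j] < m else None for j in range(n)]

# Hypothetical soft matches: two x points, three mu points, column 0 = coffin bin.
P = [[0.10, 0.60, 0.20, 0.10],
     [0.70, 0.25, 0.03, 0.02]]
C = padded_cost_matrix(P)
assignment = harden_by_assignment(P)
print(assignment)
```

Here the first x point is matched to the first µ point and the second is sent to the coffin bin, which a greedy pass that excludes the coffin column could not do.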
Threshold level
A threshold value, say δ, is chosen for pji, and p′ji is set to 0 or 1 according to
\[ p'_{ji} = \begin{cases} 1 & \text{if } p_{ji} \ge \delta, \\ 0 & \text{if } p_{ji} < \delta. \end{cases} \]
This approach does not guarantee that we get a “permutation” matrix. In
theory it is possible to get more than one x matching a particular µ or vice
versa. This is especially true when δ is small and (pji) is not a doubly stochastic
matrix. This problem can be overcome by assigning such matches to the coffin bin
or arbitrarily breaking ties. A better approach would be to subject the resultant
matrix to a linear assignment algorithm to identify an optimal solution. However, after
thresholding the matrix has less information than the original one, so it is better
just to pass the original matrix to a linear assignment algorithm. If only
strong matches are desired, thresholding can be combined with the
condition of being the maximum in both a row and a column in the greedy algorithm.
2 An example of a linear programming simplex-based algorithm implementation is
LPSOLVE, http://www.cs.sunysb.edu/~algorith/implement/lpsolve/implement.shtml
Sinkhorn method
We can use the Sinkhorn method of iteratively normalising rows and columns to get
a doubly stochastic matrix from (pji). A drawback is that this method requires
n = m, as it applies to a square matrix. One must also consider what to do with
the coffin bin. Pedersen (2002) uses a simple heuristic in an “extended Sinkhorn”
method whereby each entry in the coffin bin is adjusted so that the row and column
totals sum to one. The coffin bin is used in normalisation as well. This approach also
suffers from the possibility of ties, i.e. more than one x matching a particular µ or
vice versa. Rangarajan and Gold (1996) use a “winner takes all” approach when the
matrix is not square. In the “winner takes all” approach, i is assigned to j if pji is the
maximum entry in the jth row. This approach is a partial greedy algorithm (the greedy
algorithm assigns if and only if the entry is a row and column maximum).
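The basic Sinkhorn iteration for a square matrix with positive entries is short. The sketch below shows only the plain alternating normalisation, not the “extended Sinkhorn” treatment of the coffin bin; the function name, the fixed iteration count and the example matrix are assumptions.

```python
def sinkhorn(P, n_iter=200):
    """Alternately normalise rows and columns of a positive square matrix."""
    M = [list(row) for row in P]
    size = len(M)
    for _ in range(n_iter):
        for row in M:                                   # make rows sum to one
            s = sum(row)
            for i in range(size):
                row[i] /= s
        for i in range(size):                           # make columns sum to one
            s = sum(M[j][i] for j in range(size))
            for j in range(size):
                M[j][i] /= s
    return M

# Hypothetical 2 x 2 matrix of soft matches.
D = sinkhorn([[0.9, 0.3],
              [0.2, 0.8]])
row_totals = [sum(row) for row in D]
col_totals = [sum(D[j][i] for j in range(2)) for i in range(2)]
print(D)
```

After enough iterations both the row and the column totals are one to within rounding, but the entries are still fractional, so a separate hardening rule is needed to obtain a 0/1 matrix.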
Another attempt to solve the problem of hardening matches is the binarisation
algorithm of Pedersen (2002), which basically applies thresholding and the “winner
takes all” approach, and allocates ties to a coffin bin, on columns and rows separately.
Finally, a greedy algorithm is used on the thresholded (pji), i.e. matches are assigned
only if supported by both column-wise and row-wise operations.
Below we compare an improved Hungarian method (Jonker and Volgenant,
1987) with the greedy and binarisation algorithms. Figure 5.1 is a plot of correct
correspondence proportions for these methods. As expected, the linear assignment
method outperforms both the greedy algorithm and the binarisation algorithm of
Pedersen (2002). We do not consider the Sinkhorn method because of its unsuitability
for non-square matrices. We leave out thresholding because there is no guarantee
that it resolves ambiguous matches, hence it might require post-processing by the
other methods considered. We do not give results for the “winner takes all” approach
of Rangarajan and Gold (1996) as it is a “partial greedy” algorithm already considered.
[Figure 5.1 plot: correct correspondence (vertical axis, 0.80 to 0.95) against point-set size (horizontal axis, 10 to 60); legend: greedy, LA, P bin.]
Figure 5.1: Correct correspondence proportions for the greedy algorithm, linear
assignment (LA) and the binarisation algorithm of Pedersen (2002) (P bin).
5.2 Concomitant Information in the Mixture Model
Suppose that concomitant information, say the colour of points, is available. We
consider ways of using this extra information to solve the problem of correspondence
and alignment.
5.2.1 Concomitant Information Model
We observe c = (c1, . . . , cn)T in addition to point coordinates {xj} where cj is colour
of xj . We also have coordinates {µi} and their colours k = (k1, . . . , km)T . cj and ki
can take discrete values, say 1, 2, . . . , a for j = 1, . . . , n and i = 1, . . . , m. We would
have a = 20 and a = 4 respectively for amino acid types and groups in Table 5.4.
Denote the frequency of colour cj by fcj . Further, denote the transition proba-
bility of mutating from colour ki to cj by mkicj . For amino acid types, substitution
matrix in Figure 3.3 can be used for these transition probabilities. We assume that
the coordinate generating process X is independent of the colour generating process
C. As before, let L still be a set of labels. We consider a conditional colour substi-
tution model for the points given the labels, L: π(j) = i. Like {µi}, we assume k is
fixed. The model for C conditional on L is
ψ(Cj|π(j) = i) = mkicj
where mkicj is the probability of colour ki mutating to colour cj . ki, cj = 1, 2, . . . , a.
The marginal probability mass function is
P (Cj = cj) = fcj .
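The likelihood is
$$Q(X, C \mid L) = \prod_{i=0}^{m} \prod_{j=1}^{n} \left\{ p_i \, \phi(x_j \mid \pi(j) = i) \, \psi(C_j \mid \pi(j) = i) \right\}^{I[\pi(j) = i]}$$
where $I$ is an indicator function as before.
Hence the log likelihood is
$$\log Q(X, C \mid L) = \sum_{i=0}^{m} \sum_{j=1}^{n} I[\pi(j) = i] \left\{ \log p_i + \log \phi(x_j \mid \pi(j) = i) + \log \psi(C_j \mid \pi(j) = i) \right\}. \tag{5.6}$$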
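The posterior probability is
$$\begin{aligned}
p_{ji} = P(\pi(j) = i \mid x_j, C_j) &= \frac{P(x_j, C_j \mid \pi(j) = i)}{P(x_j, C_j)} \, P(\pi(j) = i) \\
&= \frac{P(x_j \mid \pi(j) = i) \, \psi(C_j \mid \pi(j) = i)}{P(x_j) \, P(C_j)} \, p_i \\
&= \frac{P(x_j \mid \pi(j) = i) \, m_{k_i c_j}}{P(x_j) \, f_{c_j}} \, p_i \\
&= \frac{P(x_j \mid \pi(j) = i)}{P(x_j)} \, \omega_{ji} \, p_i, \qquad \text{where } \omega_{ji} = \frac{m_{k_i c_j}}{f_{c_j}}.
\end{aligned} \tag{5.7}$$
Note that $\psi(C_j \mid \pi(j) = 0) = f_{c_j}$, i.e. the marginal probability, as the colour for the coffin bin can take all possible values. Hence $\omega_{j0} = f_{c_j} / f_{c_j} = 1$.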
Thus in the EM algorithm, we only modify the E-step. We can view $\omega_{ji}$ as the weight we give for preferring a match of $x_j$ to $\mu_i$ based on colour information. The higher the likelihood of mutating from $k_i$ to $c_j$, the larger the weight. Also, given the same transition likelihood from $k_i$ to $c_j$ or $c_{j'}$, the higher the natural abundance of $c_j$ compared to $c_{j'}$, the less weight we give to $x_j$ matching $\mu_i$ relative to $x_{j'}$ matching $\mu_i$.
Thus we have devised one simple way of incorporating colour information in the EM algorithm through weights. Next, we study practical approaches to weighting a match of $x_j$ to $\mu_i$ based on colour information.
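The colour weighting of the E-step in (5.7) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0-based colour coding and the two-colour transition matrix are assumptions made for the example and are not taken from the thesis. The rows are renormalised after weighting, in line with the use of normalised weighted posterior probabilities in the EM algorithm.

```python
def colour_weights(M, f, k, c):
    """omega[j][i] = M[k_i][c_j] / f[c_j], with colours coded 0..a-1."""
    return [[M[ki][cj] / f[cj] for ki in k] for cj in c]


def weighted_posterior_row(geom_row, omega_row):
    """Apply colour weights to one row of unnormalised posteriors and renormalise.

    geom_row[0] is the coffin-bin term p_0 * phi(x_j | pi(j) = 0); it keeps
    weight 1 because omega_{j0} = 1. geom_row[i] for i >= 1 is
    p_i * phi(x_j | pi(j) = i).
    """
    weighted = [geom_row[0]] + [g * w for g, w in zip(geom_row[1:], omega_row)]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical two-colour example (values invented for illustration only).
M = [[0.9, 0.1], [0.2, 0.8]]   # M[k][c]: probability of colour k mutating to c
f = [0.5, 0.5]                 # marginal colour frequencies
omega = colour_weights(M, f, k=[0, 1], c=[0])
post = weighted_posterior_row([0.2, 0.4, 0.4], omega[0])
```

In this toy case the two template points are geometrically equally likely matches, and the colour weight alone shifts the posterior towards the point whose colour agrees with that of the observed point.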
5.2.2 Colour Weighting
As the motivating application for these methods is the matching of functional sites in bioinformatics, we consider practical ways of weighting the posterior probabilities $p_{ji}$ in the EM algorithm when matching functional sites. We take amino acid classes or types as colours. Let the classes of $\mu_i$ and $x_j$ be $k_i$ and $c_j$ respectively. We investigate three weighting schemes.
(a) Amino acid substitution matrix weights.
(b) Ad hoc weights.
(c) Simple prior conditional probabilities.
Substitution matrix weights
One might consider weighting the posterior probabilities $p_{ji}$ in the EM algorithm with substitution model transition probabilities, i.e. if the type of $\mu_i$ is $k_i$ and that of $x_j$ is $c_j$, then weight $p_{ji}$ with $m_{k_i c_j}$, the substitution matrix entry $(k_i, c_j)$ in Table 3.3. The normalised weighted $p_{ji}$ values are then used in the EM algorithm.
Ad hoc weights
The substitution model $m_{k_i c_j}$ in functional sites is not well characterised and differs from the one in the mutation data from which commonly used substitution matrices are derived. Hence we develop ad hoc weights $(w_{ji})$ to be used in matching functional sites data. These weights are data-driven.
Our method is to weight the posterior probabilities $p_{ji}$ as follows:
$$w_{ji} = \begin{cases} \dfrac{\alpha}{\alpha s_j + d_j} & \text{if } c_j = k_i, \\[1ex] \dfrac{1}{\alpha s_j + d_j} & \text{if } c_j \neq k_i, \end{cases}$$
where
$s_j$ = number of points in $\{\mu\}$ with the same type as $x_j$;
$d_j$ = number of points in $\{\mu\}$ with a type different from that of $x_j$;
$s_j + d_j = m$;
$\alpha$ controls how much more weight is given to matching amino acids of the same type compared to those of different types.
NOTE: $\sum_{i=1}^{m} w_{ji} = \dfrac{\alpha s_j}{\alpha s_j + d_j} + \dfrac{d_j}{\alpha s_j + d_j} = 1$.
We then use the normalised weighted $p_{ji}$ values in the EM algorithm. To illustrate this, assume we have 8 and 6 coloured points from the $\{\mu_i\}$ and $\{x_j\}$ point sets respectively. Denote the colours of $\mu_i$ and $x_j$ by $k_i$ and $c_j$ respectively. We consider two examples.
Example 5.2.1. Let us suppose the observed colours are as in Table 5.1.
Table 5.1: Example 5.2.1 observed colours.
i/j   1  2  3  4  5  6  7  8
k     1  2  1  1  1  3  4  3
c     1  3  1  1  4  3
For these data, with α = 2, the weight matrix is
$$W = (w_{ji}) = \begin{pmatrix}
0.17 & 0.08 & 0.17 & 0.17 & 0.17 & 0.08 & 0.08 & 0.08 \\
0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.20 & 0.10 & 0.20 \\
0.17 & 0.08 & 0.17 & 0.17 & 0.17 & 0.08 & 0.08 & 0.08 \\
0.17 & 0.08 & 0.17 & 0.17 & 0.17 & 0.08 & 0.08 & 0.08 \\
0.11 & 0.11 & 0.11 & 0.11 & 0.11 & 0.11 & 0.22 & 0.11 \\
0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.20 & 0.10 & 0.20
\end{pmatrix}$$
Figure 5.2 is a graphical depiction of this weight function.
Figure 5.2: Illustrative example of data-driven weights for matching.
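As a check on the ad hoc weight formula, the following short Python sketch recomputes the weight matrix of Example 5.2.1 from the observed colours. The function name `ad_hoc_weights` is a hypothetical label chosen for this illustration.

```python
def ad_hoc_weights(k, c, alpha):
    """W[j][i] = alpha/(alpha*s_j + d_j) if c_j == k_i, else 1/(alpha*s_j + d_j)."""
    m = len(k)
    W = []
    for cj in c:
        s = sum(1 for ki in k if ki == cj)   # points in {mu} sharing the colour of x_j
        d = m - s                            # points in {mu} with a different colour
        denom = alpha * s + d
        W.append([(alpha if ki == cj else 1.0) / denom for ki in k])
    return W


k = [1, 2, 1, 1, 1, 3, 4, 3]   # colours of mu_1, ..., mu_8 (Example 5.2.1)
c = [1, 3, 1, 1, 4, 3]         # colours of x_1, ..., x_6
W = ad_hoc_weights(k, c, alpha=2)
first_row = [round(w, 2) for w in W[0]]
# first_row is [0.17, 0.08, 0.17, 0.17, 0.17, 0.08, 0.08, 0.08]
```

Each row of `W` sums to 1, as stated in the note following the definition of the weights.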
Example 5.2.2. Suppose that all colours are also observed in $\{\mu\}$. Let us say $c_3 = k_3 = 2$ for the data in Example 5.2.1. Hence we have:
Table 5.2: Example 5.2.2 observed colours.
i/j   1  2  3  4  5  6  7  8
k     1  2  2  1  1  3  4  3
c     1  3  2  1  4  3
For these data and α = 5, the weight matrix is
$$W = (w_{ji}) = \begin{pmatrix}
0.25 & 0.05 & 0.05 & 0.25 & 0.25 & 0.05 & 0.05 & 0.05 \\
0.06 & 0.06 & 0.06 & 0.06 & 0.06 & 0.31 & 0.06 & 0.31 \\
0.06 & 0.31 & 0.31 & 0.06 & 0.06 & 0.06 & 0.06 & 0.06 \\
0.25 & 0.05 & 0.05 & 0.25 & 0.25 & 0.05 & 0.05 & 0.05 \\
0.08 & 0.08 & 0.08 & 0.08 & 0.08 & 0.08 & 0.42 & 0.08 \\
0.06 & 0.06 & 0.06 & 0.06 & 0.06 & 0.31 & 0.06 & 0.31
\end{pmatrix}$$
Simple prior conditional probabilities
Here we consider formulating simple prior conditional probabilities as weights. Let $a$ be the number of colours. Define the prior probability $P(C_j = k_i \mid \pi(j) = i) = \beta/a$. Hence $P(C_j \neq k_i \mid \pi(j) = i) = 1 - \beta/a$. Placing a uniform prior conditional probability on the colours other than $k_i$, the conditional mass function is
$$P(C_j = c_j \mid \pi(j) = i) = \begin{cases} \dfrac{\beta}{a} & \text{if } c_j = k_i, \\[1ex] \dfrac{a - \beta}{a^2 - a} & \text{otherwise,} \end{cases}$$
with $\beta = 1, \ldots, a - 1$, where $a$ is the number of colours and $k_i$ is the colour of $\mu_i$.
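A small sketch of this conditional mass function, using exact fractions so that the normalisation can be checked directly. The function name and the example values ($a = 4$, $\beta = 3$, $k_i = 2$) are chosen for illustration only.

```python
from fractions import Fraction


def prior_colour_pmf(k_i, a, beta):
    """Return {colour: P(C_j = colour | pi(j) = i)} for colours 1..a."""
    match = Fraction(beta, a)                   # probability of the colour of mu_i
    mismatch = Fraction(a - beta, a * a - a)    # shared equally by the other a - 1 colours
    return {colour: (match if colour == k_i else mismatch)
            for colour in range(1, a + 1)}


pmf = prior_colour_pmf(k_i=2, a=4, beta=3)
```

With these values the matching colour receives probability 3/4 and each of the three other colours receives 1/12, so the probabilities sum to one.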
Comments
Weighting posterior probabilities can be seen as a weighted likelihood approach. We
are merely maximising the weighted likelihood:
$$L_w = \prod_{\pi(j) = i} w_{ji} \, p_{ji}. \tag{5.8}$$
As this is just a typical statistical model, the solution with the highest likelihood is not necessarily the most biochemically plausible. Quantities other than the likelihood might illuminate biochemical plausibility better. For example, one might consider a score combining a count of colour matches (or other functions of weights, matches/mismatches, etc.) and RMSD, where RMSD is the contribution from geometrical matching. A weighted likelihood can also be viewed as a matching score: the weights are the contributions from colour matching, while the $p_{ji}$ values are the geometrical matching contributions. In contrast, the method of Gold (2003) requires a “strict” geometrical match first and then computes a score that is a function of colour and geometrical (RMSD) matching measures.
5.2.3 Evaluation
Simulated data are used to assess the performance of the proposed methods for using concomitant information. The correct correspondence proportion is used to evaluate the methods on simulated data, as outlined in section 3.2.1. We evaluated correct correspondence proportions for varying values of $m$ and $n$, as the size of the configurations affects the efficiency of matching algorithms.
These methods are also applied to real data in section 5.2.4. We compare the scores (see section 3.2.2) obtained by these methods with those of the graph method of Gold (2003).
Different weighting schemes
If exact substitution weights from substitution matrices were easy to come by, they would be intuitively appealing, as they would parametrically model the underlying data generating mechanism. Figure 5.3 gives results for various weighting schemes in a simulation study. In this simulation, 4 colours ($a = 4$) were used. Simple prior conditional probabilities and ad hoc weights gave comparable performance, while exact substitution weights gave the best performance, as expected. For ad hoc weights, $\alpha = 4$ was used, while $\beta = a - 1 = 3$ was used for the simple prior probabilities approach.
[Figure 5.3 plot: correct correspondence proportion (vertical axis) against point-set size (horizontal axis); legend: No wgts, Ad hoc wgts, Subs. Probs, Bayesian.]
Figure 5.3: Correct correspondence proportions for various weighting schemes. Bayesian: simple prior conditional probabilities.
Ad hoc weights
We evaluate how increasingly penalising matches between points with discordant colours affects the performance of the algorithm. Figure 5.4 is a plot of correct correspondence proportions for various discordant colour match penalties. Performance is measured as the proportion of correspondences correctly identified. For the results presented here, soft matches are hardened using the greedy algorithm on the $(p_{ji})$ matrix (excluding the coffin bin column). Although linear assignment gives the best performance in Figure 5.1, the difference from the greedy algorithm is marginal and linear assignment is computationally intensive. Using the greedy algorithm for all values of $\alpha$ does not affect the comparison between different levels of $\alpha$.
[Figure 5.4 plot: correct correspondence proportion (vertical axis) against point-set size (horizontal axis); legend: α = 1, 2, 3, 4, 5.]
Figure 5.4: Correct correspondence proportions for various α levels.
From simulations we observe that:
(a) Results when colour mutation is allowed and when it is not allowed are very close to each other.
(b) Geometric information (with this hardcore model) is so rich
i. that with ground truth initial parameters for dispersion (i.e. $\sigma^2 = 0.5$), even without taking colour information into account, EM Procrustes gives correct correspondence proportions greater than 0.96, that is, wrong correspondences for at most 1 or 2 points only.
ii. however, with a small perturbation of the initial parameters, say using $\sigma^2 = 4$ as the initial dispersion parameter estimate, the EM algorithm converges to local maxima in a few more cases (the correct correspondence proportion is around 0.93). This is a serious drawback, because the combinatorial nature of the problem, even in the motivating applications, will surely lead to a likelihood function with many spikes. However, using ad hoc colour weights guides the algorithm to the global maximum in a few more cases (the correct correspondence proportion is back to > 0.96). Figure 5.4 shows the improvement in matching when different values of $\alpha$ are used. It seems that $\alpha = a$, the number of colours, gives quite substantial improvements. There are only marginal improvements if the value of $\alpha$ is increased further.
(c) What is interesting here is that using colour information in this way increases the volume of the region of initial parameter estimates from which the EM algorithm converges to the global maximum parameter vector. As an example, Figure 5.5 shows parts of the square region of starting values over which the algorithm converges to the global maximum parameter vector, with and without weights. Parameters $\theta_1$ and $\theta_2$ are 2 of the 3 rotation angles. For this evaluation (see section 3.2), $m = 48$ and only a single dataset (the worst case scenario in our simulations) is used. In this dataset, 10% of the points in $\{\mu\}$ had no corresponding points in $\{x\}$, and $\sigma = 2$ for the noise in the $\{x\}$ coordinates (see section 3.1.1).
5.2.4 Application on Matching Functional Sites
As stated in section 1.1.7, it is highly desirable to have a high number of same amino acid (residue) matches: the higher the number of similar matches, the better the match. We use concomitant information in matching real functional sites. We compare the results obtained with and without concomitant information in the EM algorithm and with the graph method of Gold (2003).
We did pair-wise matching between functional sites from three different protein families. These functional sites are listed in Table 5.3.
5.2.5 Using Amino Acid Group Information
In this application, amino acid group was used as concomitant information in the EM algorithm. Four groups were used for matching and scoring purposes. At present there is no substitution matrix specifically for functional sites in the literature, and existing substitution matrices may not represent substitution in functional sites very well. Functional sites tend to be more conserved than the rest of the protein (Sanchez and Sali, 1998). We used ad hoc weights (section 5.2.2) with α = 2 when matching with concomitant information. Amino acids were grouped into hydrophobic, charged, polar and glycine (see Table 5.4). The difference in the centres of mass of the configurations and the identity matrix were taken as the starting values for the translation vector and rotation matrix respectively.
Figure 5.5: Convergence regions of starting values for the EM algorithm. The algorithm converges to a global optimum for pink values. We get some local optimum convergence for lighter values, otherwise no convergence at all.
Table 5.3: Selected functional site examples for comparing results when using or not using concomitant information in the EM algorithm and the graph method.
Family/Fold                          Protein                                  Functional site   # of residues
Tim barrel superfold:                5-aminolevulinic acid                    1b4e 0            21
                                     5-aminolaevulinate dehydratase           1aw5 5            6
Tyrosine dependent oxidoreductase:   17-β hydroxysteroid dehydrogenase        1a27 0            63
                                     NADP-dependent mannitol dehydrogenase    1h5q 0            88
                                     Trihydroxynaphtalene reductase           1g0n 0            43
                                     Carbonyl reductase                       1cyd 1            40
SER-HIS-ASP catalytic triad:         Subtilisin carlsberg                     1bfk 0            38
                                     Aspartate aminotransferase               1ajr 0            28
                                     Glutaminase asparaginase                 3pga 0            63
Table 5.4: Groups of amino acids (Branden and Tooze, 1999, p. 6).
Symbols: A C D F G H I K L M N P Q R S T V W Y
Group 1 (hydrophobic) A F I L M P V
Group 2 (charged) D E K R
Group 3 (polar) C H N Q S T W Y
Group 4 (glycine) G
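For use as colours in the matching, the grouping of Table 5.4 can be encoded as a simple lookup. The sketch below is illustrative only; the names `AMINO_ACID_GROUPS`, `GROUP_OF` and `colours` are hypothetical and one-letter amino acid codes are assumed as input.

```python
# Groups of amino acids as in Table 5.4, keyed by group number.
AMINO_ACID_GROUPS = {
    1: "AFILMPV",    # hydrophobic
    2: "DEKR",       # charged
    3: "CHNQSTWY",   # polar
    4: "G",          # glycine
}

# Reverse lookup: one-letter amino acid code -> group number (the "colour").
GROUP_OF = {aa: group for group, letters in AMINO_ACID_GROUPS.items() for aa in letters}


def colours(residues):
    """Map a sequence of one-letter residue codes to their group colours."""
    return [GROUP_OF[r] for r in residues]


example = colours("GAD")   # glycine, hydrophobic, charged
```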
Table 5.5 summarises the results of using the EM algorithm with and without amino acid group information to match a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. Table 5.6 summarises the results obtained by the graph method. We use three scoring options as defined in section 3.2.2. The final score (Score*) is obtained by dividing the option 2 raw score by the RMSD. As a rule of thumb, the bigger the score, the better the solution. All scores from the EM algorithm using colour are bigger than those obtained without colour information. The EM algorithm using colour also finds better matches for 1bfk 0, 1cyd 1 and 3pga 0 than the graph method. The EM algorithm did not converge for 1h5q 0, which has a very large RMSD. In Table 5.7 we give a solution for 1h5q 0 after proper convergence, using the distance constraining techniques of section 5.3 to improve the EM algorithm.
Table 5.5: Comparison of matching results with and without colour when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. A relative weight of α = 2 was used for similar amino acids when using colour information.
          No colour                               Colour
          Raw score option                        Raw score option
Site      0    1    2      RMSD   Score*          0    1    2      RMSD   Score*
1ajr 0    7    0    1.0    5.13   0.19            13   2    5.5    4.06   1.35
1b4e 0    12   1    2.0    2.79   0.72            13   1    3.0    2.67   1.12
1bfk 0    12   1    2.5    5.24   0.48            18   4    6.5    3.36   1.93
1cyd 1    32   11   15.0   1.82   8.24            31   12   16.0   1.81   8.85
1g0n 0    19   2    4.0    3.33   1.20            22   5    9.5    4.05   2.35
1h5q 0    20   1    7.0    9.35   0.75            22   4    21.0   8.99   2.34
3pga 0    13   1    3.0    5.89   0.51            24   4    11.5   4.82   2.39
Score* = option 2 raw score divided by the RMSD.
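The final score is a one-line computation. The sketch below recomputes Score* for three of the colour-matching rows of Table 5.5 as a sanity check. The function name is hypothetical, and since the tabulated RMSD values are themselves rounded, agreement with the table can only be expected up to rounding.

```python
def score_star(option2_raw, rmsd):
    """Score* = option 2 raw score divided by the RMSD."""
    return option2_raw / rmsd


# (option 2 raw score, RMSD) for three colour-matching rows of Table 5.5.
rows = {"1ajr 0": (5.5, 4.06), "1bfk 0": (6.5, 3.36), "3pga 0": (11.5, 4.82)}
scores = {site: round(score_star(raw, rmsd), 2) for site, (raw, rmsd) in rows.items()}
# scores is {"1ajr 0": 1.35, "1bfk 0": 1.93, "3pga 0": 2.39}
```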
Figure 5.7 illustrates that increasing the weight for same group residues (amino acids) increases the number of same group matches. Figure 5.6 shows the superposition of 17-β hydroxysteroid dehydrogenase on carbonyl reductase when the EM algorithm is used. There are 27 matches common to the colour and no colour methods. There are 3 pairs matched exclusively when using colour information, and 2 pairs matched exclusively when not using colour. Two of the 3 pairs matched exclusively when using colour are for identical amino acids; the other pair is for amino acids of the same group. In contrast, amino acids from different groups are matched in the two pairs exclusive to matching without colour information.
Table 5.6: Results when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites using the method of Gold (2003).
          Raw score option
Site      0    1    2      RMSD   Score*
1ajr 0    12   0    3.0    1.85   1.62
1b4e 0    10   2    5.0    4.19   1.19
1bfk 0    12   1    3.5    2.44   1.43
1cyd 1    27   14   18.5   3.31   5.59
1g0n 0    31   13   21.0   3.07   6.84
1h5q 0    33   16   22.0   2.72   8.09
3pga 0    15   1    4.5    3.20   1.41
Score* = option 2 raw score divided by the RMSD.
Figure 5.6: Superposition of carbonyl reductase and 17 − β hydroxysteroid dehy-
drogenase sites when matching with EM algorithm. Amino acid classes information
not used in (a) but used in (b).
Advantages of using amino acid grouping
Use of amino acid group as concomitant information increases
(a) the number of same group/residue matches;
(b) the volume of the region of initial parameter estimates from which the EM algorithm converges to the global maximum parameter vector.
Challenges of using amino acid group information
Using amino acid group information in this way to increase the number of same residue matches might come at the expense of the overall number of geometrical matches. Increasing the number of same residue matches sometimes also leads to an increase in RMSD, as seen in Figure 5.7 for matching 17-β hydroxysteroid dehydrogenase and 5-aminolevulinic acid.
5.2.6 Summarising Comments
• Use of amino acid type information through weights improves the quality of the match.
• From experimentation, if the total number of colours is $a$, then setting $\alpha = a$ for ad hoc weights and $\beta = a - 1$ for simple prior conditional probabilities gives optimal results. Heavier weights for similar residues come at the expense of geometrical matching (RMSD), and the gain in class matching is marginal (for each pair of sites there is a maximum number of possible class matches). A typical scenario is shown in Figure 5.7.
• It is seen from simulation studies that with the use of concomitant information
we are able to find a set of good starting values for the EM algorithm and the
algorithm converges faster. Figure 5.5 shows good starting values with and
without colour information use.
• To overcome the problem of starting values, a simple approach would be to try
several random starting values. However we consider a more comprehensive
approach using Markov chain Monte Carlo (MCMC) technique in a Bayesian
framework in Chapter 6.2.
5.3 Distance Constraints
We observe that the EM algorithm in sections 5.1 and 5.2 tends to match more points, and hence gives a larger RMSD, than the graph method. In the graph method, matching all inter-point distances enforces strict geometrical matching constraints. Here we consider further techniques for forcing matched points to be closer in the EM algorithm.
[Figure 5.7 plots, each against weight α (1 to 7): a) score (option 1, option 2); b) matches (same residue matches, same group matches, total geometrical matches); c) RMSD.]
Score: option 1 = (No. identity matches) / RMSD; option 2 = option 1 + (No. similar matches) / (2 × RMSD).
Figure 5.7: Match scores and RMSD against α (relative weight for similar amino acids). Matched sites are 17-β hydroxysteroid dehydrogenase and 5-aminolevulinic acid. a) Option 1 and 2 scores. b) Total number of pairs matched, pairs with the same amino acid and pairs with the same group. c) RMSD.
In addition to using a posterior probability weighted variance estimated at each iteration of the algorithm for the mixture model, we incorporate three techniques to ensure smaller distances between matched points:
(a) Variance cooling. If the variance increases from that of the previous estimate at iteration $t$ of the $N$ iterations allowed, then use
$$\sigma^2 = A_0 \left( \frac{A_N}{A_0} \right)^{t/N}$$
where $A_0$ and $A_N$ are the desired variance values at $t = 0$ and $t = N$ respectively.
From an application in section 5.3.1 we observe easy convergence and better RMSD values for $A_0 = 100$ and $A_{200} = \sigma_g^2 = 0.3^2$ in most cases. We choose 0.3 to correspond to the threshold value of 1.5 Å for matching distances in the graph method (see section 4.1.4 in Chapter 4). Furthermore, under the Gaussian model, the width of an 85% C.I. for matching distances in the graph method is $2 \times 1.04 \sqrt{3 \times 2\sigma_g^2}$. Equating this to the threshold value of 1.5 Å gives $\sigma_g = 0.297$. Kent et al. (2004) independently found that using $\sigma = 0.3$ for the EM algorithm gives similar results to the graph method when matching the 17-β hydroxysteroid dehydrogenase and carbonyl reductase functional sites. Conversely, using the graph solution for the same pair of functional sites, we estimate $\sigma$ to be around 0.3 Å.
(b) Fixing the variance for the coffin bin, $\sigma_0^2$. This value is calculated from the volume $W$ of $\{\mu\}$. Consider a sphere with volume $W$, i.e. $W = \frac{4}{3}\pi R^3$. Let $2\sigma_0 = R$; then $\sigma_0^2 = \frac{1}{4} \left( \frac{3W}{4\pi} \right)^{2/3}$.
(c) In linear programming, rule out correspondences with probability less than $c_L = \phi(r, \sigma_c^2 I_3)$, where $\phi$ is a normal density; $r$ and $\sigma_c^2$ are application specific values to be specified by the user. We use a probability threshold value of 0.038, for $r = 1$ and $\sigma_c^2 = 1.019$, which seems to give reasonably good results. Probability thresholding is similar to the Bayesian approach considered in section 6.1.5.
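The three techniques can be sketched as small helper functions. This is a minimal illustration under stated assumptions: the function names are hypothetical, and $\phi(r, \sigma_c^2 I_3)$ is read here as the density of a trivariate isotropic normal distribution evaluated at distance $r$ from its mean, a reading which reproduces the quoted threshold of 0.038 for $r = 1$ and $\sigma_c^2 = 1.019$.

```python
import math


def cooled_variance(t, N, A0=100.0, AN=0.3 ** 2):
    """Variance cooling schedule sigma^2 = A0 * (AN / A0) ** (t / N)."""
    return A0 * (AN / A0) ** (t / N)


def coffin_bin_variance(W):
    """sigma_0^2 from a sphere of volume W with radius R = 2 * sigma_0."""
    R = (3.0 * W / (4.0 * math.pi)) ** (1.0 / 3.0)
    return (R / 2.0) ** 2


def probability_threshold(r=1.0, var_c=1.019, d=3):
    """Isotropic d-variate normal density at distance r from the mean (assumed reading of c_L)."""
    return (2.0 * math.pi * var_c) ** (-d / 2.0) * math.exp(-0.5 * r * r / var_c)


start = cooled_variance(0, 200)        # A_0 at t = 0
end = cooled_variance(200, 200)        # A_N at t = N
c_L = probability_threshold()          # about 0.038
```

The schedule interpolates geometrically between $A_0$ and $A_N$, so the variance falls quickly at first and approaches the target value smoothly towards the last iterations.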
5.3.1 Results
Here we consider both query and templates from the tyrosine-dependent oxidoreductase family. We compare a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) to representative sites from each of the 33 domains in this family. The sites in the first column of Table 5.7 were chosen as representatives of their respective domains.
As in section 5.2.5, we used ad hoc weights with α = 2 when matching with concomitant information (colour). Amino acids were again grouped into hydrophobic, charged, polar and glycine. The difference in the centres of mass of the configurations and the identity matrix were taken as the starting values for the translation vector and rotation matrix respectively.
Reported in Table 5.7 are RMSD values for the graph method and for the EM algorithm with and without colour information. Also reported are the differences in rotation (column A) used to match the sites by the graph and EM algorithm methods. If $A_G$ and $A_{EM}$ are the rotation matrices from the graph method and the EM algorithm respectively, the reported angle $\theta_A$ is such that the trace of the orthogonal matrix taking $A_G$ to $A_{EM}$ is approximately equal to $1 + 2\cos\theta_A$ (Green and Mardia, 2006). Thus
$$\theta_A = \cos^{-1}\left( \frac{\operatorname{tr}(A_G A_{EM}^T) - 1}{2} \right).$$
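The rotation difference can be computed directly from the two 3 × 3 rotation matrices. The sketch below is illustrative; the function name and the argument names `A_graph` and `A_em` are labels introduced here, and the cosine is clipped to [-1, 1] only to guard against rounding error.

```python
import math


def rotation_difference(A_graph, A_em):
    """Angle (radians) of the rotation taking one 3 x 3 rotation matrix to the other."""
    # tr(A B^T) equals the sum of elementwise products of A and B.
    trace = sum(A_graph[i][j] * A_em[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


# Example: a rotation of 0.3 radians about the z-axis compared with the identity.
t = 0.3
Rz = [[math.cos(t), -math.sin(t), 0.0],
      [math.sin(t), math.cos(t), 0.0],
      [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
angle = rotation_difference(Rz, I3)
```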
Results show that these distance constraining techniques considerably lower both the number of matching points and the RMSD. Solutions for 1h5q 0 from the EM algorithm are now comparable to the graph theoretic solution, unlike in Table 5.5 where the EM algorithm possibly did not converge. In general, the higher the number of matching points ($q$) and the lower the RMSD, the better the solution. RMSD and $q$ can be combined into a single score, e.g. as in section 5.2.4 (Tables 5.5 and 5.6), to rank the matches. Alternatively, p-values, e.g. as in Chapter 4, section 4.1.4 (Table 4.1), can be used. Here, however, we just informally note a number of cases with clearly better solutions by the EM algorithm compared to solutions by the graph method (cases italicised in Table 5.7). Obvious cases are solutions with many more matching points and RMSD of similar magnitude, or solutions with much lower RMSD and a comparable number of matching points.
Table 5.7: Matching statistics for 17-β hydroxysteroid dehydrogenase functional site (1a27 0) against representative functional sites using the EM algorithm with and without colour information and the graph method. Italicised cases have qualitatively better solutions by the EM algorithm compared to graph solutions.
          Colour                  No Colour               Graph
Site      A      RMSD   q         A      RMSD   q         RMSD   N
1udc 0    1.700  3.658  31        1.112  4.749  31        3.605  17
1bxk 0    1.024  2.692  26        0.452  3.631  29        4.203  19
1n2s 0    1.717  0.915  3         1.162  1.672  4         2.547  4
1e6u 0    0.601  3.627  45        3.099  5.648  34        1.159  23
1eq2 0    0.138  5.093  31        0.548  3.583  28        1.648  19
1i24 0    0.439  2.031  13        0.447  1.947  13        3.763  15
1k6x 0    1.661  2.985  18        1.319  6.381  12        2.100  10
1cyd 0    3.037  3.262  22        0.985  2.978  33        1.575  25
1oaa 0    1.131  3.638  35        0.534  3.194  32        1.686  23
1fdv 0    0.009  0.418  51        0.009  0.418  51        0.423  51
1fmc 0    2.218  3.688  35        1.331  5.120  24        1.000  25
1hdc 0    0.648  4.881  24        0.614  1.358  10        2.556  11
1fk8 0    0.578  2.779  25        0.645  2.722  23        0.956  24
1nff 0    1.576  2.806  12        2.766  4.864  10        2.356  28
1nxq 0    1.182  1.061  3         1.668  0.243  4         1.274  4
1bdb 0    0.312  2.561  23        0.313  2.561  23        0.874  40
1b16 0    1.412  1.404  12        2.463  5.388  25        0.823  38
1gco 0    0.515  3.052  26        0.541  3.181  28        0.773  25
1geg 0    0.600  1.034  37        0.428  2.770  33        0.959  30
1iy8 0    1.422  3.796  27        1.423  3.758  27        0.853  26
1h5q 0    2.206  2.347  19        2.206  2.593  21        2.723  33
1gz6 0    1.368  2.751  45        1.327  4.200  59        1.203  27
1edo 0    1.585  2.717  16        1.543  3.343  18        0.845  25
1eno 0    1.068  3.195  18        1.067  3.195  18        1.078  21
1ae1 0    2.190  3.257  19        0.864  4.457  17        0.890  24
1g0o 0    0.520  2.830  39        0.511  2.805  39        0.813  29
1ja9 0    1.011  3.986  31        0.752  3.517  21        2.173  28
1hdo 0    1.803  3.322  38        1.140  3.707  37        2.165  20
1e6w 0    0.173  2.567  57        0.223  4.341  57        2.632  39
1n5d 0    1.905  1.069  3         1.905  1.069  3         1.134  3
5.4 Multiple Transformations
For simplicity we consider a situation where the configuration $\{x\}$ is related to $\{\mu\}$ through two different transformations. The extension to many transformations is straightforward.
In the two-transformation case, we assume the set of points $\{x\}$ is divided into two distinct sets:
$S_1 = \{x_j\}$: collection of points with the first transformation.
$S_2 = \{x'_j\}$: collection of points with the second transformation.
On the other hand, the set $\{\mu\}$ is divided into three distinct sets:
$G_1 = \{\mu_i\}$: collection of points corresponding to points in $S_1$.
$G_2 = \{\mu'_i\}$: collection of points corresponding to points in $S_2$.
$G_3 = \{\mu''_i\}$: collection of points with no corresponding points in $S_1 \cup S_2$.
However, the set membership of the points is not known.
5.4.1 Soft Matching Model
As in the one-transformation case, let $\{\mu_i\}$, $i = 1, 2, \ldots, m$, and $\{x_j\}$, $j = 1, 2, \ldots, n$, $m \ge n$, be two sets of sites in $\Re^d$ of a region $W$. Let $\pi(j) = i$ for $x_j$, where $i \in \{0, 1, \ldots, m\}$, be a map of correspondence between $x_j$ and $\mu_i$. Now we define a map $H(j) = s$ if $x_j$ arises from $\mu_i$, $i = 1, \ldots, m$, through transformation $s = 1, 2$. For compact notation, we introduce a joint map $\gamma_s(j) = i$ iff $\pi(j) = i$ and $H(j) = s$. If $\gamma_s(j) = i$, assume
$$x_j = A_s^T \mu_i + b_s + \varepsilon_i, \qquad \varepsilon_i \sim IN(0, \sigma^2),$$
where $\sigma^2$ is unknown and $A_s$ is an orthogonal matrix. That is, for fixed $j$, we take for $i = 1, \ldots, m$,
$$\phi(x_j \mid \pi(j) = i, H(j) = s) = \phi(x_j \mid \gamma_s(j) = i) = \begin{cases} (2\pi\sigma^2)^{-d/2} \exp\left\{ -\tfrac{1}{2} \| x_j - A_s^T \mu_i - b_s \|^2 / \sigma^2 \right\} & \text{if } i \neq 0, \\[1ex] \dfrac{1}{\|W\|} & \text{if } i = 0. \end{cases} \tag{5.9}$$
As per convention, $\pi(j) = 0$ or $\gamma_s(j) = 0$ is used to classify a point $x_j$ which does not correspond with any of the points $\mu_i$. In this case, we suppose that $x_j$ is uniformly distributed on $W$, i.e.
$$x_j \mid (\gamma_s(j) = 0) \sim \text{Uniform}(W).$$
Alternatively, we can assume a normal distribution for coffin bin points (see section 5.1.1). In our experience, the results are not sensitive to whether a uniform or a Gaussian distribution is used for the coffin bin.
The marginal distribution of $x_j$ is given by the mixture model
$$x_j \sim \sum_{s=1}^{2} \sum_{i=1}^{m} P(\gamma_s(j) = i) \, N(A_s^T \mu_i + b_s, \sigma^2 I_d) + P(\gamma_s(j) = 0) \, \text{Uniform}(W) \tag{5.10}$$
where $P(\gamma_s(j) = i)$, $i = 0, \ldots, m$, are marginal membership probabilities and
$$\sum_{s=1}^{2} \sum_{i=1}^{m} P(\gamma_s(j) = i) + P(\gamma_s(j) = 0) = 1.$$
5.4.2 Model Likelihood
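Let $X = (x_1, \ldots, x_n)^T$, and let $L$ and $S$ be the sets of labels for the maps $\pi(j) = i$ and $H(j) = s$. Given $L$ and $S$, the likelihood is
$$Q(X \mid L, S) = \prod_{s=1}^{2} \prod_{i=0}^{m} \prod_{j=1}^{n} p_i^{I[\gamma_s(j) = i]} \, \phi(x_j \mid \gamma_s(j) = i)^{I[\gamma_s(j) = i]}$$
where $I$ is an indicator function such that
$$I[\gamma_s(j) = i] = \begin{cases} 1 & \text{if } \gamma_s(j) = i, \\ 0 & \text{otherwise,} \end{cases}$$
and $p_i$ is the mixing probability for any $x$ to have label $i$, i.e. to be mapped to $\mu_i$ under transformation $s$.
Hence
$$\log Q(X \mid L, S) = \sum_{s=1}^{2} \sum_{i=0}^{m} \sum_{j=1}^{n} \left\{ I[\gamma_s(j) = i] \log p_i + I[\gamma_s(j) = i] \log \phi(x_j \mid \gamma_s(j) = i) \right\}. \tag{5.11}$$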
With the labels unknown, let
$$p_i = P(\gamma_s(j) = i), \quad i = 1, 2, \ldots, m, \qquad \sum_{s=1}^{2} \sum_{i=1}^{m} p_i + p_0 = 1,$$
be the prior probability that the label $\pi(j)$ is $i$ and the label $H(j)$ is $s$. The posterior probability is
$$p_{ji} = P(\gamma_s(j) = i \mid x_j) = \frac{P(x_j \mid \gamma_s(j) = i)}{P(x_j)} \, p_i$$
and $(p_{ji})$ is an $n \times (2m + 1)$ matrix. Note that
$$P(x_j) = \sum_{s=1}^{2} \sum_{i=1}^{m} p_i \, \phi(x_j \mid \gamma_s(j) = i) + p_0 \, \phi(x_j \mid \gamma_s(j) = 0)$$
and $P(x_j \mid \gamma_s(j) = i) \equiv \phi(x_j \mid \gamma_s(j) = i)$.
It is straightforward to extend this model formulation to more than two groups of transformation. For the model to be identifiable, the number of transformations should obviously be much smaller than $n$.
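The posterior membership probabilities above can be computed as follows. This is a minimal pure-Python sketch, not the implementation used for the reported results: the function names, the list-of-lists representation and the column layout of the $n \times (2m + 1)$ matrix (coffin bin first, then the $m$ columns of each transformation) are assumptions made for the illustration.

```python
import math


def transform(mu_i, A, b):
    """Return A^T mu_i + b for a d x d matrix A given as a list of rows."""
    d = len(b)
    return [sum(A[r][c] * mu_i[r] for r in range(d)) + b[c] for c in range(d)]


def e_step(x, mu, A, b, sigma2, p, volume_W):
    """Return the n x (2m + 1) matrix of posterior probabilities p_ji.

    Column 0 is the coffin bin; columns 1..m belong to transformation 1 and
    columns m+1..2m to transformation 2 (a layout chosen for this sketch).
    p holds the 2m + 1 mixing probabilities in the same order.
    """
    d = len(x[0])
    norm = (2.0 * math.pi * sigma2) ** (-d / 2.0)
    fitted = [transform(mu_i, A[s], b[s]) for s in range(2) for mu_i in mu]
    post = []
    for xj in x:
        row = [p[0] / volume_W]                       # p_0 times the uniform density on W
        for col, fit in enumerate(fitted, start=1):
            sq = sum((u - v) ** 2 for u, v in zip(xj, fit))
            row.append(p[col] * norm * math.exp(-0.5 * sq / sigma2))
        total = sum(row)                              # P(x_j)
        post.append([v / total for v in row])
    return post
```

With uniform mixing probabilities $1/(2m + 1)$ and a small $\sigma^2$, each row of the result concentrates on the column of the template point and transformation that generated the corresponding $x_j$.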
5.4.3 The EM Algorithm
A simple extension of the EM algorithm with a coffin bin to two separate transformations is considered.
Let $p_i$ be given, with starting values $p_i^{(0)} = 1/(2m + 1)$, say. Then the E-step is:
$$p_{ji}^{(r+1)} = \frac{P(x_j \mid \gamma_s(j) = i)}{P(x_j)} \, p_i^{(r)}.$$
Substituting $p_{ji}$ for $I[\gamma_s(j) = i]$, the log likelihood is
$$\sum_{s=1}^{2} \sum_{i=0}^{m} \sum_{j=1}^{n} \left\{ p_{ji} \log p_i + p_{ji} \log \phi(x_j \mid \gamma_s(j) = i) \right\}. \tag{5.12}$$
Thus in the M-step, we minimise
$$f(A_1, A_2, b_1, b_2) = \sum_{s=1}^{2} \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \, \| x_j - A_s^T \mu_i - b_s \|^2 \tag{5.13}$$
using a Procrustes fit for rigid body motion, where $A_s$ is an orthogonal matrix. If $V_s \Gamma U_s^T$ is a singular value decomposition of
$$B_s = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} (\mu_i - \bar{\mu}_s)(x_j - \bar{x}_s)^T, \qquad \text{where } \bar{\mu}_s = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \, \mu_i}{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}}$$
and $\bar{x}_s$ is the corresponding weighted mean of the $x_j$, then $A_s = V_s U_s^T$.
Thus for the $(r + 1)$th iteration we have
$$B_s^{(r+1)} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} (\mu_i - \bar{\mu}_s)(x_j - \bar{x}_s)^T, \qquad A_s^{(r+1)} = (V_s U_s^T)^{(r+1)}.$$
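By minimising (5.13) w.r.t. $b_s$, we have
$$b_s^{(r+1)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \left( x_j - (A_s^{(r+1)})^T \mu_i \right)}{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}}.$$
Finally update the mixing proportions:
$$p_i^{(r+1)} = \frac{\sum_{j} p_{ji}^{(r)}}{\sum_{j,i} p_{ji}^{(r)}}.$$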
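The Procrustes update for a single transformation $s$ can be sketched with NumPy as below. The function name `m_step_transform` and the array layout (points as rows, `P_s[j, i]` holding $p_{ji}$ for transformation $s$) are assumptions of this sketch. The determinant check, which forces a proper rotation rather than a reflection, is an addition of the sketch and is not part of the derivation above, which only requires $A_s$ to be orthogonal.

```python
import numpy as np


def m_step_transform(x, mu, P_s):
    """Weighted Procrustes update for one transformation s.

    x: (n, d) observed points; mu: (m, d) template points;
    P_s: (n, m) block of posterior probabilities p_ji for transformation s.
    Returns (A_s, b_s) such that x_j is approximated by A_s^T mu_i + b_s.
    """
    total = P_s.sum()
    mu_bar = P_s.sum(axis=0) @ mu / total        # weighted mean of the mu_i
    x_bar = P_s.sum(axis=1) @ x / total          # weighted mean of the x_j
    B = (mu - mu_bar).T @ P_s.T @ (x - x_bar)    # sum_ij p_ji (mu_i - mu_bar)(x_j - x_bar)^T
    V, _, Ut = np.linalg.svd(B)                  # B = V Gamma U^T
    if np.linalg.det(V @ Ut) < 0:                # assumption: keep a proper rotation
        V[:, -1] *= -1.0
    A = V @ Ut
    b = x_bar - A.T @ mu_bar                     # minimiser of (5.13) w.r.t. b_s
    return A, b
```

For noise-free data with known correspondences (an identity weight matrix), the update recovers the generating rotation and translation exactly, which gives a convenient sanity check before embedding it in the full EM iteration.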
The E and M steps are repeated until convergence of the residual sum of squares
$$\sum_{s=1}^{2} \sum_{i=1}^{n} \left( x_i^{(r+1)} - x_i^{(r)} \right)^T \left( x_i^{(r+1)} - x_i^{(r)} \right)$$
where $x_i^{(r)} = (A_s^T)^{(r)} \mu_i + b_s^{(r+1)}$. To ensure convergence of the correspondence matrix $(p_{ji})$ as well, use $x_i^{(r)} = \sum_{j=1}^{n} p_{ji} (A_s^T)^{(r)} \mu_i + b_s^{(r+1)}$. Another criterion of convergence could be the log-likelihood (5.12).
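At the $r$th iteration, the correspondence probability weighted maximum likelihood estimate of $\sigma^2$ is
$$(\sigma^2)^{(r)} = \frac{\sum_{s=1}^{2} \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \, \| x_j - (A_s^T)^{(r)} \mu_i - b_s^{(r)} \|^2}{d \times \sum_{s=1}^{2} \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}}$$
where $d = 3$ is the dimension.
The unweighted estimate is
$$\frac{\sum_{s=1}^{2} \sum_{i=1}^{m} \sum_{j=1}^{n} \| x_j - (A_s^T)^{(r)} \mu_i - b_s^{(r)} \|^2}{2 \times d \times n \times m}.$$
We assumed that the two transformations have the same nuisance parameter $\sigma$. It is straightforward to extend the theory to the case of different parameters. The maximum likelihood estimate of the variance, $(\sigma_s^2)^{(r)}$, becomes
$$\frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \, \| x_j - (A_s^T)^{(r)} \mu_i - b_s^{(r)} \|^2}{d \times \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}}$$
and the unweighted estimator is
$$\frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \| x_j - (A_s^T)^{(r)} \mu_i - b_s^{(r)} \|^2}{d \times n \times m}.$$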
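The weighted variance estimate with a common $\sigma^2$ can be sketched as follows. The function name and the nested-list layout `P[s][j][i]` for the posterior probabilities are assumptions of this illustration; the same function gives the transformation-specific estimate when it is called with a single transformation.

```python
def weighted_variance(x, mu, A, b, P):
    """Posterior-weighted estimate of sigma^2.

    P[s][j][i] is p_ji for transformation s; A[s] (list of rows) and b[s]
    are the current estimates of the rotation and translation.
    """
    d = len(x[0])
    num = den = 0.0
    for s in range(len(A)):
        for i, mu_i in enumerate(mu):
            # Fitted point A_s^T mu_i + b_s.
            fit = [sum(A[s][r][c] * mu_i[r] for r in range(d)) + b[s][c]
                   for c in range(d)]
            for j, xj in enumerate(x):
                sq = sum((u - v) ** 2 for u, v in zip(xj, fit))
                num += P[s][j][i] * sq
                den += P[s][j][i]
    return num / (d * den)
```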
5.4.4 Simulations
We ran some simulations to evaluate the algorithm. We evaluated performance for $m = 24$, $n = 20$, with sets $S_1$ and $S_2$ having 10 points each. With an equal number of points in each set, we expect equal membership preference for either transformation.
Table 5.8 gives correct correspondence proportions and rotation errors for $A_1$ and $A_2$, i.e. measures of distance between the true and estimated rotation matrices. Reported are results for several runs using different starting values for $A_1$, $A_2$, $b_1$ and $b_2$ (the EM algorithm is very sensitive to starting values, see Figure 5.5). For each run we had 30 dataset replicates. Correct correspondence proportions for the algorithm are around 0.7. Here a point has a correct correspondence if it is assigned to its true corresponding point $\mu_i$ under the true transformation $s$, or is rightly not assigned to any point $\mu_i$. As expected, the performance is not as good as in the simpler case of one transformation only. Rotation errors for the first transformation are around 0.05 radians, while those for the second transformation range from about 0.1 to 0.8 radians. The rotation is estimated more accurately for the first transformation than for the second. This is surprising, considering that we had an equal number of points in each set. However, since we estimated $A_1$ first, the higher accuracy could be due to the algorithm drifting quickly towards the first transformation, as we started quite near the true parameter setting (otherwise the designation of transformations as first or second is arbitrary). Clearly, extensive simulations are required to assess the performance of the algorithm conclusively, especially the transformation errors.
Table 5.8: Proportions of correct correspondence and rotation errors when using the EM algorithm for matching forms with two transformations. A point has a correct correspondence if it is assigned to its true corresponding point µi under the true transformation s, or is rightly not assigned to any point µi.
       Correspondence                          Rotation error
Run    All points   Set 1      Set 2           A1         A2
1      0.681        0.692      0.669           0.055      0.767
       (0.0057)     (0.0071)   (0.0075)        (0.0070)   (0.0340)
2      0.695        0.706      0.684           0.051      0.136
       (0.0058)     (0.0070)   (0.0077)        (0.0040)   (0.0072)
3      0.681        0.692      0.670           0.054      0.702
       (0.0057)     (0.0072)   (0.0074)        (0.0053)   (0.0336)
Standard errors are given in parentheses.
Chapter 6
Bayesian Alignment
In this chapter we consider a Markov chain Monte Carlo (MCMC) technique within the
Bayesian paradigm to overcome the sensitivity of the EM algorithm of Chapter 5
to starting values. We consider finding the alignment and point correspondences
between two configurations using a full joint distribution for the correspondence matrix
and transformation parameters. Using MCMC with detailed-balance updates and
drawing from the posterior of all parameters should stand a better chance of escaping
from local maxima than simply trying several starting values
for the EM algorithm.
6.1 Bayesian Hierarchical Model
Green and Mardia (2006) build a hierarchical model to solve alignment and matching
of configurations, according to the Bayesian paradigm. This method gives a complete
distribution of probable matches and hence an opportunity to explore several other
solutions near the “optimal” solution.
Chapter 6. Bayesian Alignment 104
6.1.1 Point Process Model, with Geometrical Transformation and Random Thinning
Suppose there are two point configurations in d-dimensional space $\mathbb{R}^d$: $\{x_j, j = 1, 2, \ldots, m\}$
and $\{y_k, k = 1, 2, \ldots, n\}$. The points are labelled for identification, but
arbitrarily.
Both point sets are regarded as noisy observations on subsets of a set of true
locations $\{\mu_i\}$, where the mappings from j and k to i are unknown. There may be a
geometrical transformation between the x-space and the y-space, which may also be
unknown. The objective is to make model-based inference about these mappings,
and in particular to make probability statements about matching: which pairs (j, k)
correspond to the same true location?
The geometrical transformation between the x-space and the y-space is denoted
A; thus y in y-space corresponds to x = Ay in x-space. The notation does not
imply that the transformation A is necessarily linear. It may be a rotation or more
general linear transformation, a translation, both of these, or some non-rigid motion.
Regard the true locations $\{\mu_i\}$ as being in x-space.
The mappings between the indexing of $\{\mu_i\}$ and that of the data $\{x_j\}$ and $\{y_k\}$ are
captured by indexing arrays $\{\xi_j\}$ and $\{\eta_k\}$; specifically assume that
$$x_j = \mu_{\xi_j} + \varepsilon_{1j} \tag{6.1}$$
for $j = 1, 2, \ldots, m$, where $\{\varepsilon_{1j}\}$ have probability density $f_1$, and
$$A y_k = \mu_{\eta_k} + \varepsilon_{2k} \tag{6.2}$$
for $k = 1, 2, \ldots, n$, where $\{\varepsilon_{2k}\}$ have density $f_2$. All $\{\varepsilon_{1j}\}$ and $\{\varepsilon_{2k}\}$ are independent
of each other, and independent of $\{\mu_i\}$.
6.1.2 Formulation of Poisson Process Prior
Suppose that the set of true locations $\{\mu_i\}$ forms a homogeneous Poisson process
with rate λ over a region $V \subset \mathbb{R}^d$ of volume v, and that there are N points realised in
this region. Some of these give rise to both x and y points, some to points of one kind
and not the other, and some are not observed at all. Suppose these four possibilities
occur independently for each realised point, with probabilities parameterised as
$(1 - p_x - p_y - \rho p_x p_y,\ p_x,\ p_y,\ \rho p_x p_y)$ for observing neither, x alone, y alone,
or both x and y, respectively. The parameter ρ is a certain measure of the tendency
a priori for points to be matched: the random thinnings leading to the observed x
and y configurations can be dependent, but remain independent from point to point.
Given N, m and n, there are L matched pairs of points in the sample if and
only if the numbers of these four kinds of occurrence among the N points are
$(N - m - n + L,\ m - L,\ n - L,\ L)$. Under the assumptions above these four counts
will be independent Poisson distributed variables, with means
$(\lambda v(1 - p_x - p_y - \rho p_x p_y),\ \lambda v p_x,\ \lambda v p_y,\ \lambda v \rho p_x p_y)$.
The prior marginal¹ probability distribution of L conditional on m and n is therefore proportional to
$$\frac{e^{-\lambda v p_x}(\lambda v p_x)^{m-L}}{(m-L)!} \times \frac{e^{-\lambda v p_y}(\lambda v p_y)^{n-L}}{(n-L)!} \times \frac{e^{-\lambda v \rho p_x p_y}(\lambda v \rho p_x p_y)^{L}}{L!}$$
so that
$$P(L) \propto \frac{(\rho/\lambda v)^L}{(m-L)!\,(n-L)!\,L!}$$
for $L = 0, 1, \ldots, \min\{m, n\}$. Here and later, use the generic P(·) notation for
distributions and conditional distributions in the hierarchical model.
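The prior on the number of matches is easy to tabulate numerically. The following is a minimal sketch, assuming only the proportionality stated above; the function name `prior_num_matches` and the illustrative values of m, n and ρ/(λv) are hypothetical and not taken from the thesis.

```python
from math import factorial


def prior_num_matches(m, n, rho_over_lambda_v):
    """Normalised prior P(L) for L = 0, ..., min(m, n).

    Uses P(L) proportional to (rho/(lambda*v))**L / ((m-L)! (n-L)! L!).
    The argument rho_over_lambda_v is the single ratio rho/(lambda*v).
    """
    weights = [
        rho_over_lambda_v ** L
        / (factorial(m - L) * factorial(n - L) * factorial(L))
        for L in range(min(m, n) + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only (assumed, not from the thesis).
probs = prior_num_matches(m=5, n=4, rho_over_lambda_v=2.0)
```

Larger values of ρ/(λv) shift the prior mass towards more matches, which is the sense in which ρ measures the prior tendency for points to be matched.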
The matching of the configurations is represented by the matching matrix M,
where $M_{jk}$ indicates whether $x_j$ and $y_k$ are derived from the same $\mu_i$ point, or not,
that is,
$$M_{jk} = \begin{cases} 1 & \text{if } \xi_j = \eta_k \\ 0 & \text{otherwise.} \end{cases} \tag{6.3}$$
Note that M is the adjacency matrix for the bipartite graph representing the matching,
and that $\sum_{j,k} M_{jk} = L$. Assume for the moment that conditional on L, M is
a priori uniform: there are $L!\binom{m}{L}\binom{n}{L}$ different M matrices consistent with a given
value of L, and these are taken as equally likely. Thus
$$P(M) = P(L)\,P(M \mid L) \propto \frac{(\rho/\lambda v)^L}{(m-L)!\,(n-L)!\,L!}\left\{L!\binom{m}{L}\binom{n}{L}\right\}^{-1} \propto (\rho/\lambda v)^L,$$
(where here and later "∝" means proportional to, as functions of the variable(s) to
the left of the conditioning |, in this case, M). Thus
$$P(M) = \frac{(\rho/\lambda v)^L}{\sum_{\ell=0}^{\min\{m,n\}} \ell!\binom{m}{\ell}\binom{n}{\ell}(\rho/\lambda v)^{\ell}}. \tag{6.4}$$
Because of the choice of parameterisation for the probabilities of observing hidden
points, P(M) does not involve $p_x$ and $p_y$.
1 Integrated over N: with $a = \lambda v(1 - p_x - p_y - \rho p_x p_y)$,
$$\sum_{N=n+m-L}^{\infty} \frac{e^{-a}\,a^{N-m-n+L}}{(N-m-n+L)!} = 1.$$
Figure 6.1: Directed acyclic graph representing the model, showing all data and
parameters treated as variable (nodes µ, ξ, η, M, X, Y, σ, A and τ).
6.1.3 Data Likelihood
Given M, the likelihood of the observed configurations of points is specified as
follows. Assume that the transformation is affine, so that y is mapped to $Ay + \tau$,
where A is now a $d \times d$ matrix and τ a translation vector. From (6.1)
and (6.2), the densities of $x_j$ and $y_k$, conditional on A, τ, $\{\mu_i\}$, $\{\xi_j\}$ and $\{\eta_k\}$, are
$f_1(x_j - \mu_{\xi_j})$ and $|A| f_2(Ay_k + \tau - \mu_{\eta_k})$, respectively, |A| denoting the absolute value
of the determinant of A.
The locations $\{\mu_i\}$ of the $m - L$ points that generate an x observation but not
a y observation are independently uniformly distributed over the region V, so that
the likelihood contribution of these $m - L$ observations, namely $\{j : \sum_k M_{jk} = 0\}$,
is
$$\prod_{j:\,M_{jk}=0\,\forall k} v^{-1}\int_V f_1(x_j - \mu)\,d\mu.$$
Similarly, the contributions from the unmatched y observations, and from the matched
pairs are
$$\prod_{k:\,M_{jk}=0\,\forall j} v^{-1}\int_V |A|\,f_2(Ay_k + \tau - \mu)\,d\mu \quad\text{and}\quad \prod_{j,k:\,M_{jk}=1} v^{-1}\int_V f_1(x_j - \mu)\,|A|\,f_2(Ay_k + \tau - \mu)\,d\mu$$
respectively. These integrals all exhibit "edge effects" from the boundary of the
region V, which can be neglected if V is large relative to the supports of $f_1$ and $f_2$.
In this case these three expressions approximate to
$$v^{-(m-L)}, \quad (|A|/v)^{n-L}, \quad\text{and}\quad (|A|/v)^{L}\prod_{j,k:\,M_{jk}=1}\int_{\mathbb{R}^d} f_1(x_j - \mu)\,f_2(Ay_k + \tau - \mu)\,d\mu$$
respectively. The last expression can be written
$$(|A|/v)^{L}\prod_{j,k:\,M_{jk}=1} g(x_j - Ay_k - \tau)$$
where $g(z) = \int f_1(z + u)\,f_2(u)\,du$ (the density of $\varepsilon_{1j} - \varepsilon_{2k}$).
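Combining these terms together, the complete likelihood is
$$P(x, y \mid M, A, \tau) = v^{-(m+n-L)}\,|A|^n \prod_{j,k:\,M_{jk}=1} g(x_j - Ay_k - \tau). \tag{6.5}$$
Multiplying (6.4) and (6.5), the factor $v^{L}$ cancels against the $v^{-L}$ in $(\rho/\lambda v)^L$, so that
$$P(M, x, y \mid A, \tau) \propto |A|^n \prod_{j,k:\,M_{jk}=1} \{(\rho/\lambda)\,g(x_j - Ay_k - \tau)\}.$$
Note that the constant of proportionality involves m, n, λ, ρ, and v, but not A, τ,
any parameters in $f_1$ or $f_2$, or M of course.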
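By further making assumptions of spherical normality for $f_1$ and $f_2$:
$$x_j \sim N_d(\mu_{\xi_j}, \sigma_x^2 I) \quad\text{and}\quad Ay_k + \tau \sim N_d(\mu_{\eta_k}, \sigma_y^2 I),$$
with $\sigma_x = \sigma_y = \sigma$, say, then
$$g(z) = \frac{1}{(\sigma\sqrt{2})^d}\,\phi(z/\sigma\sqrt{2})$$
where φ is the standard normal density in $\mathbb{R}^d$, and the final joint model is
$$P(M, A, \tau, \sigma, x, y) \propto |A|^n P(A)P(\tau)P(\sigma) \prod_{j,k:\,M_{jk}=1}\left(\frac{\rho\,\phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})}{\lambda(\sigma\sqrt{2})^d}\right). \tag{6.6}$$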
Note that not only px and py but also v does not appear in this expression, principally
from the choice of parameterisation, and that only the ratio ρ/λ is identifiable.
The directed acyclic graph representing this joint probability model, including the
variables (µ, ξ and η) that have been integrated out, is displayed in Figure 6.1.
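The part of (6.6) that depends on the matching can be evaluated on the log scale. The following is a minimal sketch, assuming A is a rotation (so |A| = 1), that the y points have already been transformed to Ay + τ, and omitting the priors P(A), P(τ) and P(σ); the function names are hypothetical. It uses the fact that, for the standard normal density in d dimensions, $\phi(z/\sigma\sqrt{2})/(\sigma\sqrt{2})^d = (4\pi\sigma^2)^{-d/2}\exp(-\|z\|^2/4\sigma^2)$.

```python
import math


def log_pair_factor(x_j, y_k_transformed, sigma, rho_over_lambda):
    """Log of (rho/lambda) * g(x_j - A y_k - tau) for one matched pair.

    y_k_transformed is assumed to already hold A y_k + tau.
    """
    d = len(x_j)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_j, y_k_transformed))
    return (
        math.log(rho_over_lambda)
        - 0.5 * d * math.log(4.0 * math.pi * sigma ** 2)
        - sq_dist / (4.0 * sigma ** 2)
    )


def log_joint_matches(pairs, x, y_transformed, sigma, rho_over_lambda):
    """Sum of the log pair factors over the matched pairs (j, k)."""
    return sum(
        log_pair_factor(x[j], y_transformed[k], sigma, rho_over_lambda)
        for j, k in pairs
    )
```

Only the ratio ρ/λ enters, which reflects the remark above that ρ and λ are not separately identifiable.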
6.1.4 Prior Distributions and Computations
We assumed the existence of true but unobservable locations {µi} from a Poisson
process just to conveniently formulate the mathematical framework and simplify the
algebra. The assumption of Poisson points would not exactly represent the model for
functional sites (see Chapter 2). However in section 6.1.8 we do sensitivity analysis
for the Poisson assumption and find that violations of the assumption do not impede
the effectiveness of the algorithm.
Green and Mardia (2006) treat ρ and λ as fixed, and consider inference for the
remaining unknowns M, τ, σ² and sometimes A, given the data $\{x_j\}$ and $\{y_k\}$.
Markov chain Monte Carlo methods are used for the computation.
Suppose that prior information about τ , σ2 and A will be at best weak and use
generic prior formulations that facilitate the posterior analysis. Prior assumptions
are therefore discussed in parallel with MCMC implementation. Note that the for-
mulation has some affinity with mixture models, the matching matrix M playing a
similar role to the allocation variables often used in computing with mixtures; see,
for example, Richardson and Green (1997). As in that paper, this full Bayesian anal-
ysis aims at simultaneous joint inference about both the discrete and continuously
varying unknowns, in contrast to frequentist approaches.
This model has another similarity with a mixture formulation, in that as M
varies, the number of hidden points needed to generate all the observed data also
varies, and thus there seems to be a "variable-dimension" aspect to the model. However,
the approach of integrating out hidden point locations eliminates the
variable-dimension parameter, so that reversible jump MCMC is not needed.
Priors and MCMC updating for a rotation matrix
From equation (6.6), the full conditional distribution for A given the data and values for
all other parameters is
$$P(A \mid M, \tau, \sigma, x, y) \propto |A|^n P(A) \prod_{j,k:\,M_{jk}=1} \phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2}).$$
Viewing this as a density for A, there is still freedom to choose the dominating
measure for P(A) arbitrarily. Then the full conditional density will be with respect
to the same measure.
In matching functional sites, we would only consider rigid body transformations
rather than general (linear) transformations. Thus, considering only rotations
(orthogonal matrices A with positive determinant) and expanding the expression
above:
$$\begin{aligned}
P(A \mid M, \tau, \sigma, x, y) &\propto P(A)\exp\Big\{\sum_{j,k:\,M_{jk}=1} -0.5\,\big(\|x_j - Ay_k - \tau\|/\sigma\sqrt{2}\big)^2\Big\} \\
&\propto P(A)\exp\Big\{\frac{1}{2\sigma^2}\sum_{j,k:\,M_{jk}=1} (x_j - \tau)^T A y_k\Big\} \\
&\propto P(A)\exp\Big[\mathrm{tr}\Big\{\frac{1}{2\sigma^2}\sum_{j,k:\,M_{jk}=1} y_k (x_j - \tau)^T A\Big\}\Big].
\end{aligned}$$
There is (conditional) conjugacy if P(A) has the form $P(A) \propto \exp(\mathrm{tr}(F_0^T A))$
for some matrix $F_0$: the posterior then has the same form, with $F_0$ replaced by
$$F = F_0 + \frac{1}{2\sigma^2}\sum_{j,k:\,M_{jk}=1} (x_j - \tau)\,y_k^T. \tag{6.7}$$
This is known as the matrix Fisher distribution (Downs, 1972; Mardia and Jupp,
2000, p. 289). Here, for symmetry, we use a uniform prior with $F_0 = 0$.
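The parameter matrix in (6.7) is a sum of outer products over the matched pairs. The following is a minimal sketch using plain nested lists; the function name `fisher_parameter` and the argument layout (a list of index pairs (j, k)) are assumptions made for illustration.

```python
def fisher_parameter(pairs, x, y, tau, sigma, F0=None):
    """Return F = F0 + (1 / (2 sigma^2)) * sum over matched (j, k) of (x_j - tau) y_k^T.

    pairs is a list of (j, k) index pairs with M_jk = 1; F0 defaults to the
    zero matrix, which corresponds to the uniform prior on rotations.
    """
    d = len(tau)
    F = [[0.0] * d for _ in range(d)] if F0 is None else [row[:] for row in F0]
    scale = 1.0 / (2.0 * sigma ** 2)
    for j, k in pairs:
        for r in range(d):
            for c in range(d):
                F[r][c] += scale * (x[j][r] - tau[r]) * y[k][c]
    return F
```

With no matched pairs and $F_0 = 0$ the result is the zero matrix, so the full conditional for A is then uniform over rotations.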
Sampling the matrix Fisher distribution
We will review how to sample from the matrix Fisher distribution in the 3-dimensional
case.
In the 3-dimensional case, A can be represented as a product of 3 elementary rotations
$$A = A_{12}(\theta_{12})\,A_{13}(\theta_{13})\,A_{23}(\theta_{23}) \tag{6.8}$$
as in Raffenetti and Ruedenberg (1970), and Khatri and Mardia (1977). For $i < j$,
$A_{ij}(\theta_{ij})$ is the matrix with $m_{ii} = m_{jj} = \cos\theta_{ij}$, $-m_{ij} = m_{ji} = \sin\theta_{ij}$, $m_{rr} = 1$ for
$r \neq i, j$ and other entries 0. Each of the generalised Euler angles $\theta_{ij}$ is sampled
in turn, conditioning on the other two angles and the other variables $(M, \tau, \sigma, x, y)$
entering the expression for F.
The joint full conditional density of the Euler angles is
$$\propto \exp[\mathrm{tr}\{F^T A\}]\cos\theta_{13}$$
for $\theta_{12}, \theta_{23} \in (-\pi, \pi)$ and $\theta_{13} \in (-\pi/2, \pi/2)$. The cosine term arises since the natural
dominating measure, corresponding to the uniform distribution of rotations, has volume
element $\cos\theta_{13}\,d\theta_{12}\,d\theta_{13}\,d\theta_{23}$ in these coordinates.
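By substituting the representation (6.8) and simplifying, the trace can be written
variously as
$$\begin{aligned}
\mathrm{tr}\{F^T A\} &= a_{12}\cos\theta_{12} + b_{12}\sin\theta_{12} + c_{12} \\
&= a_{13}\cos\theta_{13} + b_{13}\sin\theta_{13} + c_{13} \\
&= a_{23}\cos\theta_{23} + b_{23}\sin\theta_{23} + c_{23}
\end{aligned}$$
where
$$\begin{aligned}
a_{12} &= (F_{22} - \sin\theta_{13}F_{13})\cos\theta_{23} + (-F_{23} - \sin\theta_{13}F_{12})\sin\theta_{23} + \cos\theta_{13}F_{11} \\
b_{12} &= (-\sin\theta_{13}F_{23} - F_{12})\cos\theta_{23} + (F_{13} - \sin\theta_{13}F_{22})\sin\theta_{23} + \cos\theta_{13}F_{21} \\
a_{13} &= \sin\theta_{12}F_{21} + \cos\theta_{12}F_{11} + \sin\theta_{23}F_{32} + \cos\theta_{23}F_{33} \\
b_{13} &= (-\sin\theta_{23}F_{12} - \cos\theta_{23}F_{13})\cos\theta_{12} + (-\sin\theta_{23}F_{22} - \cos\theta_{23}F_{23})\sin\theta_{12} + F_{31} \\
a_{23} &= (F_{22} - \sin\theta_{13}F_{13})\cos\theta_{12} + (-\sin\theta_{13}F_{23} - F_{12})\sin\theta_{12} + \cos\theta_{13}F_{33} \\
b_{23} &= (-F_{23} - \sin\theta_{13}F_{12})\cos\theta_{12} + (F_{13} - \sin\theta_{13}F_{22})\sin\theta_{12} + \cos\theta_{13}F_{32}
\end{aligned}$$
and the $c_{ij}$ can be ignored, combined into the normalising constants. Thus the full
conditionals for $\theta_{12}$ and $\theta_{23}$ are von Mises distributions. These can be updated by
Gibbs sampling or by an efficient rejection method, the Best/Fisher algorithm (see Mardia
and Jupp, 2000, p. 43).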
However the distribution of θ13 is proportional to
exp[a13 cos θ13 + b13 sin θ13] cos θ13.
Mardia and Gadsden (1977) studied this distribution without discussing how to
simulate a sample from it. Green and Mardia (2006) use a random walk Metropolis
algorithm, with a perturbation uniformly distributed on [−0.1, 0.1], to sample from
this distribution.
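The angle updates can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Green and Mardia (2006): the function names are hypothetical, the von Mises draw uses Python's standard `random.vonmisesvariate` in place of an explicit Best/Fisher rejection step, and only the coefficients for θ12 and θ13 are coded (θ23 is analogous). The von Mises form follows from $a\cos\theta + b\sin\theta = \kappa\cos(\theta - \mu)$ with $\kappa = \sqrt{a^2 + b^2}$ and $\mu = \mathrm{atan2}(b, a)$.

```python
import math
import random


def rotation(t12, t13, t23):
    """A = A12(t12) A13(t13) A23(t23), written out entry by entry."""
    c12, s12 = math.cos(t12), math.sin(t12)
    c13, s13 = math.cos(t13), math.sin(t13)
    c23, s23 = math.cos(t23), math.sin(t23)
    return [
        [c12 * c13, -c12 * s13 * s23 - s12 * c23, -c12 * s13 * c23 + s12 * s23],
        [s12 * c13, -s12 * s13 * s23 + c12 * c23, -s12 * s13 * c23 - c12 * s23],
        [s13, c13 * s23, c13 * c23],
    ]


def trace_FtA(F, A):
    """tr(F^T A) = sum of F_ij * A_ij."""
    return sum(F[i][j] * A[i][j] for i in range(3) for j in range(3))


def coeffs_12(F, t13, t23):
    """Coefficients (a12, b12) of cos(t12) and sin(t12) in tr(F^T A)."""
    s13, c13 = math.sin(t13), math.cos(t13)
    s23, c23 = math.sin(t23), math.cos(t23)
    a = (F[1][1] - s13 * F[0][2]) * c23 + (-F[1][2] - s13 * F[0][1]) * s23 + c13 * F[0][0]
    b = (-s13 * F[1][2] - F[0][1]) * c23 + (F[0][2] - s13 * F[1][1]) * s23 + c13 * F[1][0]
    return a, b


def coeffs_13(F, t12, t23):
    """Coefficients (a13, b13) of cos(t13) and sin(t13) in tr(F^T A)."""
    s12, c12 = math.sin(t12), math.cos(t12)
    s23, c23 = math.sin(t23), math.cos(t23)
    a = s12 * F[1][0] + c12 * F[0][0] + s23 * F[2][1] + c23 * F[2][2]
    b = (-s23 * F[0][1] - c23 * F[0][2]) * c12 + (-s23 * F[1][1] - c23 * F[1][2]) * s12 + F[2][0]
    return a, b


def draw_theta12(F, t13, t23, rng):
    """Gibbs draw of theta12 from its von Mises full conditional."""
    a, b = coeffs_12(F, t13, t23)
    kappa = math.hypot(a, b)
    mu = math.atan2(b, a) % (2.0 * math.pi)
    theta = rng.vonmisesvariate(mu, kappa)
    return (theta + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)


def metropolis_theta13(F, t12, t13, t23, rng, step=0.1):
    """One random walk Metropolis step for theta13 on (-pi/2, pi/2)."""
    a, b = coeffs_13(F, t12, t23)

    def log_target(t):
        if not -math.pi / 2.0 < t < math.pi / 2.0:
            return -math.inf
        return a * math.cos(t) + b * math.sin(t) + math.log(math.cos(t))

    proposal = t13 + rng.uniform(-step, step)
    log_ratio = log_target(proposal) - log_target(t13)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return t13
```

The helper `trace_FtA` together with `rotation` gives a direct numerical check that the remainder of the trace, after subtracting the cosine and sine terms, does not depend on the angle being updated.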
Priors and updating for other parameters
Here τ and $\sigma^{-2}$ are taken to have respectively prior Gaussian and Gamma distributions.
These priors are computationally convenient and most importantly also
plausible for τ and σ in matching functional sites. Thus
$$\tau \sim N_d(\mu_\tau, \sigma_\tau^2 I)$$
and
$$\sigma^{-2} \sim \Gamma(\alpha, \beta).$$
Under the assumptions of (6.6), there is conjugacy for τ and σ, and the explicit full
conditionals are
$$\tau \mid M, A, \sigma, x, y \sim N_d\left(\frac{\mu_\tau/\sigma_\tau^2 + \sum_{j,k:\,M_{jk}=1}(x_j - Ay_k)/2\sigma^2}{1/\sigma_\tau^2 + L/2\sigma^2},\ \frac{1}{1/\sigma_\tau^2 + L/2\sigma^2}\,I\right),$$
$$\sigma^{-2} \mid M, A, \tau, x, y \sim \Gamma\Big(\alpha + (d/2)L,\ \beta + (1/4)\sum_{j,k:\,M_{jk}=1}\|x_j - Ay_k - \tau\|^2\Big)$$
and the Gibbs sampler is used to update these parameters.
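The two conjugate updates can be sketched directly from the full conditionals above. This is a minimal sketch with hypothetical function names; it assumes Γ(α, β) is parameterised by shape α and rate β (as the additive update of β suggests), whereas Python's `random.gammavariate` takes a shape and a scale, hence the reciprocal in the code.

```python
import random


def gibbs_tau(residuals, sigma2, mu_tau, sigma2_tau, rng):
    """Draw tau from its Gaussian full conditional.

    residuals holds x_j - A y_k for each matched pair (j, k).
    """
    L = len(residuals)
    precision = 1.0 / sigma2_tau + L / (2.0 * sigma2)
    sd = precision ** -0.5
    draw = []
    for r in range(len(mu_tau)):
        numerator = mu_tau[r] / sigma2_tau + sum(res[r] for res in residuals) / (2.0 * sigma2)
        draw.append(rng.gauss(numerator / precision, sd))
    return draw


def gibbs_precision(residuals, tau, alpha, beta, rng):
    """Draw sigma^(-2) from its Gamma full conditional (shape, rate)."""
    d = len(tau)
    L = len(residuals)
    sum_sq = sum((res[r] - tau[r]) ** 2 for res in residuals for r in range(d))
    shape = alpha + 0.5 * d * L
    rate = beta + 0.25 * sum_sq
    # random.gammavariate expects a scale parameter, i.e. 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

Within a sweep, `gibbs_tau` would be called with the current σ² and `gibbs_precision` with the current τ, each using the residuals implied by the current M and A.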
Updating M
The matching matrix M is updated in detailed balance using Metropolis-Hastings
moves that only propose changes to a few entries: the number of matches
$L = \sum_{j,k} M_{jk}$ can only increase or decrease by 1 at a time, or stay the same. The
possible changes are
(a) adding a match: changing one entry $M_{jk}$ from 0 to 1.
(b) deleting a match: changing one entry $M_{jk}$ from 1 to 0.
(c) switching a match: simultaneously changing one entry from 0 to 1, and another
in the same row or column from 1 to 0.
These changes respect the constraint that there should be unique matches between
js and ks ($0 \le \sum_j M_{jk} \le 1$ and $0 \le \sum_k M_{jk} \le 1$).
The proposal proceeds as follows: first a uniform random choice is made from all
m+n data points x1, x2, . . . , xm, y1, y2, . . . , yn. Suppose without loss of generality, by
the symmetry of the set-up, that an x is chosen, say xj . There are two possibilities:
either xj is currently matched (∃k such that Mjk = 1) or not (there is no such k).
If xj is matched to yk, with probability p⋆ propose deleting the match, and with
probability 1 − p⋆ propose switching it from yk to yk′, where k′ is drawn uniformly
at random from the currently unmatched y points. On the other hand, if xj is not
currently matched, propose adding a match between xj and a yk, where again k is
drawn uniformly at random from the currently unmatched y points.
The acceptance probabilities for these three possibilities are easily derived from
the expression (6.6) for the joint distribution, since in each case the proposed
new matching matrix M′ is only slightly perturbed from M, so that the ratio
$P(M', \tau, \sigma \mid x, y)/P(M, \tau, \sigma \mid x, y)$ has only a few factors. Taking into account also
the proposal probabilities, whose ratio is $(1/n_u) \div p^\star$, where
$n_u = \#\{k \in 1, 2, \ldots, n : M_{jk} = 0\ \forall j\}$ is the number of unmatched y points in M, the acceptance probability
for adding a match (j, k) is
$$\min\left\{1,\ \frac{\rho\,\phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})\,p^\star n_u}{\lambda(\sigma\sqrt{2})^d}\right\}. \tag{6.9}$$
Similarly, the acceptance probability for switching the match of $x_j$ from $y_k$ to $y_{k'}$ is
$$\min\left\{1,\ \frac{\phi(\{x_j - Ay_{k'} - \tau\}/\sigma\sqrt{2})}{\phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})}\right\} \tag{6.10}$$
and for deleting the match (j, k) is
$$\min\left\{1,\ \frac{\lambda(\sigma\sqrt{2})^d}{\rho\,\phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})\,p^\star n'_u}\right\} \tag{6.11}$$
where $n'_u = \#\{k \in 1, 2, \ldots, n : M'_{jk} = 0\ \forall j\} = n_u + 1$. Since the changes effected are
so modest, we typically make several moves updating M per sweep, along with just one
update for each of the other parameters.
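One move of this update can be sketched as follows, using the acceptance probabilities (6.9) to (6.11). This is a minimal sketch under stated assumptions: it covers only the case where an x point is chosen (the y case is symmetric), the y points are assumed to be already transformed to Ay + τ, and the matching is stored in two hypothetical index lists (`match_x[j]` is the matched k or `None`, `match_y[k]` the matched j or `None`) rather than as a matrix.

```python
import math
import random


def pair_density(x_j, y_k_t, sigma):
    """g(z) = phi(z / (sigma*sqrt(2))) / (sigma*sqrt(2))**d with z = x_j - y_k_t."""
    d = len(x_j)
    sq = sum((a - b) ** 2 for a, b in zip(x_j, y_k_t))
    return math.exp(-sq / (4.0 * sigma ** 2)) / (4.0 * math.pi * sigma ** 2) ** (d / 2.0)


def accept(ratio, rng):
    """Metropolis-Hastings acceptance with probability min(1, ratio)."""
    return rng.random() < min(1.0, ratio)


def update_match(j, x, y_t, match_x, match_y, sigma, rho_over_lambda, p_star, rng):
    """Propose and possibly accept an add, delete or switch move for x_j (in place)."""
    unmatched = [k for k, owner in enumerate(match_y) if owner is None]
    k_old = match_x[j]
    if k_old is None:
        # (a) add a match with a uniformly chosen unmatched y point, eq. (6.9)
        if not unmatched:
            return
        k = rng.choice(unmatched)
        ratio = rho_over_lambda * pair_density(x[j], y_t[k], sigma) * p_star * len(unmatched)
        if accept(ratio, rng):
            match_x[j], match_y[k] = k, j
        return
    g_old = pair_density(x[j], y_t[k_old], sigma)
    if rng.random() < p_star:
        # (b) delete the match, eq. (6.11); n_u' = n_u + 1 after deletion
        denom = rho_over_lambda * g_old * p_star * (len(unmatched) + 1)
        ratio = math.inf if denom == 0.0 else 1.0 / denom
        if accept(ratio, rng):
            match_x[j], match_y[k_old] = None, None
    elif unmatched:
        # (c) switch the match to a uniformly chosen unmatched y point, eq. (6.10)
        k_new = rng.choice(unmatched)
        g_new = pair_density(x[j], y_t[k_new], sigma)
        ratio = math.inf if g_old == 0.0 else g_new / g_old
        if accept(ratio, rng):
            match_y[k_old] = None
            match_x[j], match_y[k_new] = k_new, j
```

Storing the matching as two index lists keeps the one-to-one constraint easy to maintain, since each move touches at most two entries of each list.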
6.1.5 Inference
Point estimates for M , A and τ are important in Bioinformatics applications. We
need to specify loss functions giving the cost incurred in declaring point estimates.
We consider estimators which minimise expected loss functions with respect to con-
ditional posterior distributions.
Match Matrix
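Suppose that the loss when $M_{jk} = a$ and $\hat{M}_{jk} = b$, for a, b = 0, 1, is $\ell_{ab}$; for example,
$\ell_{01}$ is the loss associated with declaring a match between $x_j$ and $y_k$ when there is
really none, that is, a "false positive". Then
$$\begin{aligned}
E[L(M, \hat{M}) \mid x, y] &= \sum_{j,k}\hat{M}_{jk}\ell_{11}p_{jk} + \sum_{j,k}\hat{M}_{jk}\ell_{01}(1 - p_{jk}) + \sum_{j,k}(1 - \hat{M}_{jk})\ell_{10}p_{jk} + \sum_{j,k}(1 - \hat{M}_{jk})\ell_{00}(1 - p_{jk}) \\
&= \sum_{j,k}\hat{M}_{jk}(\ell_{11}p_{jk} - \ell_{01}p_{jk} - \ell_{10}p_{jk} + \ell_{00}p_{jk} + \ell_{01} - \ell_{00}) + \sum_{j,k}(\ell_{00} + \ell_{10}p_{jk} - \ell_{00}p_{jk}) \\
&= \sum_{j,k}\hat{M}_{jk}\big((\ell_{11} - \ell_{01} - \ell_{10} + \ell_{00})p_{jk} + \ell_{01} - \ell_{00}\big) + \sum_{j,k}(\ell_{00} + \ell_{10}p_{jk} - \ell_{00}p_{jk}).
\end{aligned}$$
The last sum does not depend on $\hat{M}_{jk}$, hence we are interested in minimising the first part:
$$-(\ell_{01} + \ell_{10} - \ell_{11} - \ell_{00})\sum_{j,k:\,\hat{M}_{jk}=1}(p_{jk} - K)$$
where
$$K = (\ell_{01} - \ell_{00})/(\ell_{01} + \ell_{10} - \ell_{11} - \ell_{00})$$
and $p_{jk} = P(M_{jk} = 1 \mid x, y)$ is the posterior probability that (j, k) is a match, which
is estimated by the empirical frequency of this match from an MCMC run.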
Thus $\hat{M}$ is a solution to a "linear assignment" problem with cost matrix $(p_{jk} - K)$.
This is exactly what is suggested in section 5.3, i.e. to use linear assignment with
thresholding to harden match probabilities.
First Ordered-Set and Linear Assignment
In practice, taking the first non-duplicate matches with high probability or using linear
programming to find optimal matches gives similar results.
For linear programming, LPSOLVE (an implementation of a linear programming
simplex-based algorithm²) can be used. In this approach the matrix $(p_{jk})$ is made
square by adding extra rows (dummy x points) of zeros. Denote this square
matrix $C = (c_{j'k})$, $j', k = 1, \ldots, n$.
Then linear programming is used to solve for assignments which maximise
$\sum_{j'k}(c_{j'k} - K)$ subject to unique values of $j', k = 1, \ldots, n$ in the solution set $\{(j', k)\}$.
$K \in (0, 1)$ is an arbitrarily chosen matching probability threshold.
Thus linear programming finds n pairs of one-to-one assignments. Afterwards any
pair $(j', k)$ in the solution set with $c_{j'k} - K < 0$ is removed. This is just thresholding
on $p_{j'k}$, as linear assignment tries to match all n pairs without regard to individual
$c_{j'k} - K$ values. This also removes assignments involving dummy $x_{j'}$
points ($j' > m$).
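The padding, assignment and thresholding steps can be sketched on a small example. This is a minimal sketch, not the LPSOLVE-based procedure: it swaps in an exhaustive search over permutations, which is only feasible for very small n, and the function name `assign_matches` is hypothetical. It assumes m ≤ n, as implied by padding with dummy x rows.

```python
from itertools import permutations


def assign_matches(p, K):
    """Return matched (j, k) pairs from an m x n matrix p of match probabilities.

    Pads p with zero rows to n x n, finds the one-to-one assignment maximising
    the sum of c[j'][k] - K by brute force, then drops pairs with c - K < 0
    and all pairs involving dummy rows.
    """
    m, n = len(p), len(p[0])
    if m > n:
        raise ValueError("this sketch assumes m <= n")
    c = [row[:] for row in p] + [[0.0] * n for _ in range(n - m)]
    # Subtracting the constant K from every entry does not change the maximiser.
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(c[j][perm[j]] for j in range(n)),
    )
    return [(j, best[j]) for j in range(m) if c[j][best[j]] - K >= 0.0]


# Illustrative probabilities only (assumed, not from the thesis).
matches = assign_matches([[0.9, 0.8], [0.85, 0.1]], K=0.5)
```

In this illustration the optimal assignment pairs x1 with y2 and x2 with y1, whereas taking the single largest probability first would pair x1 with y1 and leave x2 below the threshold, so the two approaches need not always agree.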
Rotation Matrix and Translation Vector
For a quadratic error loss function, the mean of the posterior distribution is used as
a point estimate. Green and Mardia (2006) compute element-wise averages of the
realisations from the posterior distribution to get A and τ. The latter is used as a
point estimate for the translation. A point estimate of A is taken to be a positive
definite square root of $A^T A$ which is a proper rotation matrix³.
2 http://cran.r-project.org/src/contrib/Descriptions/lpSolve.html.
6.1.6 Using Concomitant Information
Concomitant information (e.g. colour) of points can also be used in Bayesian hier-
archical modelling. Green and Mardia (2006) give details on incorporating colour
distributions when the log probability can be expressed linearly in entries of M i.e.
the colour distribution is independent of the point process. In this case, the contri-
bution to the likelihood from colour information is multiplicative. In implementing
this modified likelihood, MCMC acceptance ratios in section 6.1.4 are modified ac-
cordingly.
6.1.7 Results for Graph Theoretic and MCMC
In this section we compare the performance of full Bayesian alignment using the
hierarchical model (Green and Mardia, 2006) with that of the graph method.
For graph matching we use an algorithm of Applegate and Johnson (1993) in the
implementation of Gold (2003). The Bayesian solution is found using the MCMC algorithm
of Green and Mardia (2006). The MCMC method adapts to different levels of
noise in the positions of functional site atoms, so that it is able to find good matches
in distantly related proteins. Thus MCMC can be used to explore the relationship
between functional sites in a database and the query.
Parameters
The graph theoretic method requires a threshold value for matching distances (see
section 1.2.1). In the application, a threshold value of 1.5 Å was used. On the other
hand, MCMC requires initial estimates for λ/ρ, $\mu_\tau$, $\sigma_\tau^2$, ν, κ, β and α. We took
λ/ρ = 0.0005, $\mu_\tau^T = (0, 0, 0)$, β = 1.5, α = 1 and ν = κ = 0. We have observed that
analyses are less sensitive to the choice of ν and κ.
3 $(A^{T}A)^{1/2}$ is the polar part of A (see Mardia and Jupp, 2000, pp. 286, 290).
Data
We consider a functional site for 17-beta-hydroxysteroid dehydrogenase as a query.
This protein belongs to tyrosine-dependent oxidoreductases family of the Rossmann
fold (NAD(P)-binding domain). We match the query to one template from each
family of the following folds:
I. Rossmann fold: NAD(P)-binding domain.
II. FAD/NAD(P)-binding domain.
III. TIM beta/alpha-barrel.
All these folds are from α/β class.
We considered one randomly chosen functional site for each domain
in these folds except for the TIM beta/alpha-barrel fold, where
we considered one randomly chosen functional site for each domain in 2
of the 28 superfamilies. Thus one representative of each domain from the
following families was considered.
I. Fold: NAD(P)-binding Rossmann-fold domains.
a) 1.1 Alcohol dehydrogenase-like, C-terminal domain family.
b) 1.2 Tyrosine-dependent oxidoreductases family.
c) 1.3 Glyceraldehyde-3-phosphate dehydrogenase-like, N-terminal domain.
d) 1.4 Formate/glycerate dehydrogenases, NAD-domain.
e) 1.5 Siroheme synthase N-terminal domain-like.
f) 1.6 LDH N-terminal domain-like.
g) 1.7 6-phosphogluconate dehydrogenase-like, N-terminal domain.
h) 1.8 Aminoacid dehydrogenase-like, C-terminal domain.
i) 1.9 Potassium channel NAD-binding domain.
j) 1.10 AT-rich DNA-binding protein p25, C-terminal domain.
k) 1.11 CoA-binding domain.
II. Fold: FAD/NAD(P)-binding domain.
a) 2.1 C-terminal domain of adrenodoxin reductase-like.
b) 2.2 FAD-linked reductases, N-terminal domain.
c) 2.3 GDI-like N domain.
d) 2.4 Succinate dehydrogenase/fumarate reductase flavoprotein N-terminal
domain.
e) 2.5 FAD/NAD-linked reductases, N-terminal and central domains.
III. Fold: TIM beta/alpha-barrel (2 out of 28 superfamilies).
Superfamily: NAD(P)-linked oxidoreductase.
a) 3.1 Aldo-keto reductases (NADP).
Superfamily: FAD-linked oxidoreductase.
b) 3.2 Methylenetetrahydrofolate reductase.
c) 3.3 Proline dehydrogenase domain of bifunctional PutA protein.
Results
MCMC identifies some functional sites distantly related to the query which
the graph theoretic method might otherwise have missed.
After finding corresponding Cα atoms, the rotation matrix and translation vector
are re-estimated using Procrustes. We used $q^\star_{MCMC} = \min(q_{MCMC}, q_g)$ pairs to
calculate RMSD in MCMC, where $q_{MCMC}$ is the number of non-duplicate pairs with
highest matching probabilities and $q_g$ is the number of matching pairs found by the
graph theoretic method. Reported in Table 6.1 are RMSD for graph and MCMC
solutions in cases where MCMC manages to find better matches which the graph
theoretic method missed. For these 67 configurations out of 136 cases, MCMC finds
geometrically better solutions than the graph theoretic method.
Tables 6.4 and 6.5 give RMSD for graph theoretic and MCMC solutions in cases
where the graph method finds a better solution. For these 69 cases, we evaluate RMSD for
the MCMC solution using both $q^\star_{MCMC} = \min(q_{MCMC}, q_g)$ and $q_{MCMC}$ pairs. In a few
cases, even with $q_{MCMC} > q_g$, the MCMC solution with $q_{MCMC}$ pairs gave a lower RMSD than
the graph solution with $q_g$ pairs. Figure 6.2 shows corresponding amino acids found
by the MCMC method when matching the 17-β-hydroxysteroid dehydrogenase functional site
(1a27 0) against functional sites of aldose reductase (1ads 0), 3-α-hydroxysteroid
dehydrogenase (1afs 0), aspartate β-semialdehyde dehydrogenase (1brm 0), CHO
reductase (1c9w 0), UDP-glucose dehydrogenase (1dlj 0), glucose 6-phosphate
dehydrogenase (1dpg 0), dihydrodipicolinate reductase (1drw 0) and ketose reductase
(1e3j 0). RMSD for these are shown in Table 6.1 in rows 1, 3, 7, 9, 11, 12, 13 and
14. MCMC finds solutions which are better than, but very different from, the graph solutions in
these cases. It might be worth exploring the biological significance of these solutions.
Figure 6.2: Corresponding amino acids found by MCMC method matching 17-β-hydroxysteroid
dehydrogenase functional site (1a27 0) against functional sites of a)
aldose reductase (1ads 0), b) 3-α-hydroxysteroid dehydrogenase (1afs 0), c) aspartate
β-semialdehyde dehydrogenase (1brm 0), d) CHO reductase (1c9w 0), e) UDP-glucose
dehydrogenase (1dlj 0), f) glucose 6-phosphate dehydrogenase (1dpg 0), g)
dihydrodipicolinate reductase (1drw 0) and h) ketose reductase (1e3j 0).

Table 6.1: Matching statistics for 17-β-hydroxysteroid dehydrogenase functional
site (1a27 0) against functional sites from family representatives using graph and
MCMC methods (cases with MCMC doing better). Continued as Table 6.2.

                    Graph         MCMC
No.  Site     n     RMSD     qg   RMSD     qMCMC   Common Pairs  SCOP¹
1    1ads 0   32    4.379    12   1.707    12      0             3.1
2    1ae1 0   33    0.796    24   0.773    24      21            1.2   ⋆
3    1afs 0   36    3.452    11   1.809    11      0             3.1
4    1b37 0   6     2.402    6    2.596    4       1             2.2
5    1b5t 0   5     2.475    5    0.895    3       0             3.2
6    1bdb 0   54    0.784    40   0.686    40      36            1.2   ⋆
7    1brm 0   39    1.919    12   1.229    12      0             1.3
8    1bxk 0   33    1.008    19   0.952    19      15            1.2   ⋆
9    1c9w 0   31    5.014    11   1.532    11      0             3.1
10   1d5t 0   7     1.719    6    1.095    3       1             2.3
11   1dlj 0   10    2.160    8    0.895    8       0             1.7
12   1dpg 0   8     2.195    7    1.036    6       0             1.3
13   1drw 0   59    4.567    15   2.512    30      0             1.3   †
14   1e3j 0   17    3.707    11   1.227    11      3             1.1   †
15   1ebf 0   10    3.664    7    3.659    5       1             1.3
16   1edo 0   35    0.670    25   0.669    25      24            1.2   ⋆
17   1eno 0   30    0.819    21   0.785    21      20            1.2   ⋆
18   1f8f 0   35    1.635    12   1.092    12      0             1.1
19   1f8r 0   4     0.899    4    0.152    2       0             2.2
20   1ff9 0   9     1.298    7    0.024    2       1             1.3   †
21   1fmc 0   47    0.762    25   0.680    25      23            1.2
22   1foh 0   53    2.091    13   1.881    13      0             2.2   †
23   1frb 0   45    7.719    13   2.528    32      0             3.1   †
24   1gdh 0   6     1.503    6    0.891    4       0             1.4
25   1geg 0   38    0.710    30   0.654    30      28            1.2   ⋆
26   1gpj 0   104   7.940    16   2.335    16      0             1.8   †
27   1gu7 0   82    14.684   15   2.900    15      0             1.1
28   1gve 0   120   2.078    16   1.847    37      0             3.1   †

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.

Table 6.2: Matching statistics for 17-β-hydroxysteroid dehydrogenase site (1a27 0)
against sites from family representatives using graph and MCMC methods (cases
with MCMC doing better). Continuation of Table 6.1 and continued as Table 6.3.

                    Graph         MCMC
No.  Site     n     RMSD     qg   RMSD     qMCMC   Common Pairs  SCOP¹
29   1gz4 0   70    5.449    16   1.951    10      0             1.8   †
30   1h6d 0   120   6.521    18   2.441    16      0             1.3   †
31   1h7w 0   120   8.422    16   2.686    14      0             2.1
32   1hye 0   30    0.929    13   0.807    13      11            1.6
33   1hyu 0   44    2.849    13   2.152    24      0             2.5   †
34   1i36 0   28    2.107    11   1.925    22      0             1.7   †
35   1j3v 0   40    2.935    12   1.646    12      0             1.7
36   1j4a 0   8     5.403    7    1.970    6       0             1.4
37   1j5p 0   34    1.170    14   1.031    14      9             1.3
38   1jax 0   8     1.241    6    1.067    6       1             1.7
39   1jnr 0   50    6.481    14   1.937    13      0             2.4   †
40   1jqb 0   6     2.427    6    0.635    3       0             1.1
41   1ju2 0   11    3.983    8    3.789    6       1             2.2
42   1k6j 0   5     1.239    5    0.182    3       0             1.2   †⋆
43   1k87 0   40    3.920    13   2.109    12      0             3.3
44   1kdg 0   7     2.118    7    0.813    5       1             2.2
45   1kol 0   42    4.918    13   2.302    13      0             1.1   †
46   1kss 0   38    5.110    12   1.739    12      0             2.4
47   1l0v 0   56    6.014    14   2.358    40      0             2.4   †
48   1l0v 0   56    6.014    14   2.211    14      0             2.4
49   1lc0 0   10    2.944    8    2.401    3       0             1.3   †
50   1lqa 0   35    5.046    12   1.824    12      0             3.1
51   1lss 0   25    8.583    13   1.538    9       0             1.9   †
52   1lvl 0   66    6.057    14   2.791    29      0             2.5   †
53   1nek 0   56    6.489    15   2.100    26      1             2.4   †
54   1npd 0   9     7.195    7    2.972    7       0             1.8
55   1lnq 0   6     2.641    5    0.743    2       0             1.9
56   1nrh 0   20    4.824    11   1.938    9       4             1.6

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.

Table 6.3: Matching statistics for 17-β-hydroxysteroid dehydrogenase site (1a27 0)
against sites from family representatives using graph and MCMC methods (cases
with MCMC doing better). Continuation of Table 6.2.

                    Graph         MCMC
No.  Site     n     RMSD     qg   RMSD     qMCMC   Common Pairs  SCOP¹
57   1nvm 0   14    4.205    9    1.480    10      0             1.3   †
58   1pj5 0   49    2.772    13   2.353    26      0             2.2   †
59   1sez 0   92    5.827    14   2.430    31      0             2.2   †
60   1trb 0   49    6.780    14   1.890    22      0             2.5   †
61   1udc 0   58    1.007    17   0.905    17      13            1.2   ⋆
62   1uuf 0   18    3.504    10   1.461    13      1             1.1   †
63   1vj0 0   9     2.854    7    1.099    7       0             1.1
64   1vj1 0   23    1.865    10   1.594    10      0             1.1
65   2dap 0   50    3.690    13   2.291    27      0             1.3   †
66   2nac 0   7     3.982    7    0.608    3       0             1.4
67   2scu 0   49    6.094    13   2.047    26      1             1.11

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.

6.1.8 Sensitivity of Poisson Prior Assumption
A Poisson point process might not be ideal for the motivating applications in Bioinformatics.
In this section we consider how the MCMC algorithm with a Poisson prior
fares when matching "hardcore" configurations or "short chains" simulated as in
Aszodi and Taylor (1994). We compare the performance of the algorithm with that of the graph
theoretic method (Gold, 2003) and the EM algorithm (Kent et al., 2004).
Data Simulations
We simulate data in three ways:
Hardcore Data: Dataset 1
We simulate a database of paired configurations, {µ} and {x}, with hardcore points.
These are pairs of configurations as in section 3.1.1 except for the noise level in the
positions for {x}. Here $x_j \sim N(\mu_i, \sigma^2 I_3)$ with σ² = 2 and then with σ² = 4. This
was just to explore the effect of increasing the noise level for the {x} coordinates.
Furthermore, simulated configurations have larger volumes here. We simulate {µ}
in a 100³ Å³ cube. Thus
(a) Set {µ} consists of hardcore points in a cube with volume V = 100 × 100 × 100 Å³.
Inhibition distance, d = 5 Å.
(b) Set {x} consists of $x_j \sim N(\mu_i, \sigma^2 I_3)$, with σ² = 2 and σ² = 4.
(c) The order of points in {x} is permuted so that we do not "know" the map π(j) = i.
No rotation or translation is used.
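Steps (a) to (c) can be sketched as below. This is a minimal sketch under stated assumptions: the hardcore points are generated by simple sequential rejection (a candidate is discarded if it lies within the inhibition distance of an accepted point), which may differ from the scheme of section 3.1.1; the number of points per configuration and the function names are hypothetical.

```python
import math
import random


def simulate_hardcore(n_points, side, inhibition, rng, max_tries=100000):
    """(a) Points in a cube of the given side, no two closer than `inhibition`."""
    points = []
    tries = 0
    while len(points) < n_points:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all points; cube too crowded")
        candidate = tuple(rng.uniform(0.0, side) for _ in range(3))
        if all(math.dist(candidate, p) >= inhibition for p in points):
            points.append(candidate)
    return points


def noisy_permuted_copy(mu, sigma2, rng):
    """(b) Add N(0, sigma2 * I_3) noise to each point, then (c) permute the order.

    Returns (x, perm) where x[j] was generated from mu[perm[j]].
    """
    sd = math.sqrt(sigma2)
    noisy = [tuple(c + rng.gauss(0.0, sd) for c in p) for p in mu]
    perm = list(range(len(mu)))
    rng.shuffle(perm)
    return [noisy[i] for i in perm], perm


# Illustrative run: 20 points is an assumed configuration size.
rng = random.Random(42)
mu = simulate_hardcore(20, side=100.0, inhibition=5.0, rng=rng)
x, perm = noisy_permuted_copy(mu, sigma2=2.0, rng=rng)
```

Keeping `perm` alongside `x` records the true map π(j) = i, so that the correspondences recovered by each matching method can later be scored against it.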
Table 6.4: Matching statistics for 17-β-hydroxysteroid dehydrogenase functional
site (1a27 0) against functional sites from family representatives using graph and
MCMC methods (cases with graph doing better). Continued as Table 6.5.

                    Graph         MCMC
No.  Site     n     RMSD     qg   RMSD     q‡MCMC  Common Pairs  SCOP¹
1    1a4i 0   26    0.986    13   2.466    12      0             1.8
2    1c1d 0   43    1.207    12   2.087    12      0             1.8   †
2    1c1d 0   43    1.207    12   2.264    22      0             1.8
3    1cjc 0   42    1.025    13   1.228    13      1             2.1   †
3    1cjc 0   42    1.025    13   1.691    22      1             2.1
4    1coy 0   61    1.052    14   1.885    14      0             2.2   †
4    1coy 0   61    1.052    14   2.400    35      0             2.2
5    1cyd 0   40    0.716    25   1.111    36      25            1.2   ⋆
6    1d7y 0   42    1.135    13   1.905    13      0             2.5   †
6    1d7y 0   42    1.135    13   1.971    22      0             2.5
7    1dhr 0   31    1.021    14   1.171    14      9             1.2   †⋆
7    1dhr 0   31    1.021    14   1.570    25      13            1.2
8    1dxy 0   73    1.082    15   2.332    15      0             1.4   †
8    1dxy 0   73    1.082    15   2.494    29      0             1.4
9    1e6u 0   120   0.973    23   2.129    23      1             1.2   †⋆
9    1e6u 0   120   0.973    23   2.207    27      0             1.2
10   1e6w 0   111   0.735    39   2.361    38      0             1.2   †⋆
11   1el5 0   50    1.358    14   2.356    14      0             2.2   †
12   1eq2 0   36    0.791    19   2.133    18      0             1.2   †⋆
12   1eq2 0   36    0.791    19   2.326    19      0             1.2
13   1exb 0   36    0.921    12   1.906    12      0             3.1   †
13   1exb 0   36    0.921    12   2.008    15      0             3.1
14   1fcd 0   54    1.069    12   1.670    12      0             2.5   †
14   1fcd 0   54    1.069    12   2.489    28      0             2.5
15   1fdu 0   60    0.377    58   0.414    59      58            1.2   ⋆
16   1fec 0   50    2.224    14   2.619    11      0             2.5   †
17   1fk8 0   30    0.945    24   1.857    24      1             1.2   ⋆
18   1fmc 0   47    0.762    25   1.117    37      25            1.2   ⋆

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.
Table 6.5: Matching statistics for 17-β-hydroxysteroid dehydrogenase site (1a27 0)
against sites from family representatives using graph and MCMC methods (cases
with graph doing better). Continuation of Table 6.4 and continued as Table 6.6.

                    Graph         MCMC
No.  Site     n     RMSD     qg   RMSD     q‡MCMC  Common Pairs  SCOP¹
19   1g0o 0   42    0.599    29   0.599    29      29            1.2   †⋆
19   1g0o 0   42    0.599    29   2.901    36      29            1.2
20   1gco 0   33    0.729    25   2.069    25      0             1.2   †⋆
20   1gco 0   33    0.729    25   2.140    26      0             1.2
21   1gos 0   120   1.140    17   2.072    17      0             2.2   †
21   1gos 0   120   1.140    17   2.518    24      0             2.2
22   1gr0 0   44    2.018    13   2.328    27      3             1.3
23   1gz6 0   120   0.942    27   2.245    25      0             1.2   †⋆
24   1h5q 0   88    0.955    33   3.124    29      0             1.2   †⋆
25   1h6v 0   120   1.278    17   1.846    14      0             2.5
25   1h6v 0   120   1.278    17   1.937    17      0             2.5   †
26   1hdc 0   24    0.867    11   2.050    20      0             1.2   ⋆
27   1hdo 0   109   0.852    20   1.219    4       0             1.2   ⋆
27   1hdo 0   109   0.852    20   1.953    20      0             1.2   †
28   1heu 0   118   1.257    16   2.340    37      0             1.1
29   1hyh 0   37    0.802    13   1.880    13      0             1.6   †
29   1hyh 0   37    0.802    13   2.282    19      0             1.6
30   1iy8 0   41    0.654    26   0.665    26      24            1.2   †⋆
30   1iy8 0   41    0.654    26   3.132    37      26            1.2
31   1ja9 0   43    0.786    28   1.353    39      28            1.2   ⋆
32   1kyq 0   5     1.469    5    1.904    2       0             1.5   †
33   1li4 0   49    1.452    14   2.111    25      0             1.4
34   1lj8 0   23    0.786    11   2.695    18      0             1.7
35   1lqt 0   82    1.360    15   2.476    25      0             2.1
36   1lsu 0   27    1.154    12   1.264    12      0             1.9   †
36   1lsu 0   27    1.154    12   1.845    19      0             1.9
37   1m66 0   6     1.672    6    2.306    6       3             1.7   †
38   1m6i 0   43    1.261    13   1.728    13      0             2.5   †

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.
Table 6.6: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)
against sites from family representatives using graph and MCMC methods (cases
with graph doing better). Continuation of Table 6.5 and continued as Table 6.7.
                  Graph      MCMC           Common
No.  Site      n   RMSD  qg   RMSD  q‡MCMC   pairs  SCOP¹
 38  1m6i 0   43  1.261  13  2.141      25       4    2.5
 39  1mg5 0   39  0.734  28  0.995      34      28    1.2  ⋆
 40  1mld 0   14  0.919   9  0.965       9       0    1.6  †
 41  1mo9 0   44  0.968  14  2.067      14       0    2.5  †
 41  1mo9 0   44  0.968  14  2.201      21       0    2.5
 42  1mv8 0   35  1.282  13  2.175      22       0    1.7
 43  1mx3 0   37  1.145  12  1.295      12       0    1.4  †
 43  1mx3 0   37  1.145  12  2.117      24       0    1.4
 44  1nff 0   35  0.726  28  0.712      28      26    1.2  †⋆
 45  1ng4 0   52  1.355  14  2.501      20       0    2.2
 46  1nhp 0   37  1.195  12  1.853      12       0    2.5  †
 47  1npy 0    5  1.806   5  0.006       2       0    1.8  †
 48  1nyt 0   40  1.127  14  2.659      19       0    1.8
 49  1o94 0  120  1.131  16  2.463      16       0    2.1  †
 50  1oaa 0   39  0.777  23  1.926      23       1    1.2  †⋆
 50  1oaa 0   39  0.777  23  2.028      30       0    1.2
 51  1obb 0  120  0.750  17  2.812      16       0    1.6  †
 52  1og6 0   88  1.025  16  3.157      17       0    3.1
 53  1ono 0    8  1.412   6  3.047       2       0    1.3
 53  1ono 0    8  1.412   6  5.315       5       2    1.3  †
 54  1orr 0   35  0.895  17  1.931      16       0    1.2  †⋆
 54  1orr 0   35  0.895  17  2.209      18       0    1.2
 55  1pbe 0   51  1.353  14  1.829      14       0    2.2  †
 55  1pbe 0   51  1.353  14  2.000      18       0    2.2
 56  1pjq 0   17  1.057   9  1.005       2       0    1.5
 56  1pjq 0   17  1.057   9  2.090       8       0    1.5  †
 57  1ps9 0   69  1.070  15  2.057      15       0    2.1  †
q‡MCMC is either qMCMC or q⋆MCMC
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
1 See family names in section 6.1.7.
Table 6.7: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)
against sites from family representatives using graph and MCMC methods (cases
with graph doing better). Continuation of Table 6.6.
                  Graph      MCMC           Common
No.  Site      n   RMSD  qg   RMSD  q‡MCMC   pairs  SCOP¹
 58  1psd 0   76  0.971  16  2.194      16       1    1.4  †
 58  1psd 0   76  0.971  16  2.286      29       1    1.4
 59  1px0 0   18  0.826  10  1.284      10       0    1.2  †⋆
 59  1px0 0   18  0.826  10  1.805      12       1    1.2
 60  1q1r 0   41  1.073  14  1.495      14       4    2.5  †
 60  1q1r 0   41  1.073  14  2.067      25       8    2.5
 61  1qmg 0   40  1.677  13  1.505      13       5    1.7  †
 61  1qmg 0   40  1.677  13  2.572      18       0    1.7
 62  1qor 0  101  0.974  15  2.179      15       0    1.1  †
 62  1qor 0  101  0.974  15  2.280      22       0    1.1
 63  1qp8 0   20  1.108  10  1.965      10       0    1.4  †
 64  1qrr 0   58  1.029  15  2.168      15       0    1.2  †⋆
 64  1qrr 0   58  1.029  15  2.710      28       1    1.2
 65  1r72 0   30  0.822  14  1.164      14      11    1.1  †
 65  1r72 0   30  0.822  14  1.298      18      13    1.1
 66  1vjt 0   40  1.371  12  1.910      12       4    1.6  †
 66  1vjt 0   40  1.371  12  2.187      20       6    1.6
 67  2pgd 0    9  2.120   8  2.898       6       0    1.7
 68  3grs 0   45  1.101  13  1.598      13       0    2.5  †
 68  3grs 0   45  1.101  13  2.305      33       0    2.5
 69  9ldt 0   39  2.255  13  5.365      11       0    1.6  †
q‡MCMC is either qMCMC or q⋆MCMC
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
1 See family names in section 6.1.7.
(d) Set {µ} has 10% more points than {x}. Extra points in {µ} have no corre-
sponding points in {x}.
Poisson model data - dataset 2
We simulate a database of paired configurations according to the Poisson model of
Green and Mardia (2006). We generate a pair of configurations as follows:
(a) Draw the number of points N for the set of true locations {µi}, i ∈ {1, 2, . . . , N}, with N ∼ Poi(λ).
(b) Uniformly sample N points in a region V ⊂ ℜ³ of volume v.
(c) Thus {µi} forms a homogeneous Poisson process with rate λ.
(d) Configurations {xj}, j = 1, . . . , n and {yl}, l = 1, . . . , m arise from {µi} such
that:
• With probabilities px, py, ρpxpy and 1 − px − py − ρpxpy, µi gives rise to xj
alone, yl alone, both and neither respectively. ρ is a certain measure
of the tendency a priori for points to be matched. We set px = 0.05,
py = 0.05 and ρpxpy = 0.90, so that ρ = ρpxpy/(pxpy) = 0.90/0.0025 = 360.
• Thus ∀j : xj ∼ N(µi, σ²I3) and ∀l : yl ∼ N(µi, σ²I3) for some i.
• We choose, say, 12 realisations for N.
• For each N we sample 30 configurations of true locations {µ}, from which
we get pairs of {x} and {y}.
• Set {x} consists of xj ∼ N(µi, σ²I3) and {y} consists of yl ∼ N(µi, σ²I3),
with σ² = 2. No rotation or translation is used.
• Permute the order in {x} so that we do not “know” the correspondence
between {x} and {y}.
• Therefore our database consists of 360 (30 × 12) pairs of configurations.
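The generation steps above can be sketched in code. This is only an illustrative sketch: the function names, the cubic region of side `side` standing in for V, and the use of Knuth's method for the Poisson draw are assumptions of this example, not details given in the text.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; adequate for moderate lam)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_pair(lam, side, sigma2, px=0.05, py=0.05, pxy=0.90, seed=None):
    """Return (x, y, truth); truth lists index pairs (j, l) sharing a true location."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    xs, ys, truth = [], [], []
    for _ in range(sample_poisson(lam, rng)):
        # True location, uniform in the (assumed) cube [0, side]^3.
        mu = [rng.uniform(0.0, side) for _ in range(3)]
        u = rng.random()
        # [0, px): x alone; [px, px+py): y alone; next pxy: both; rest: neither.
        in_x = u < px or px + py <= u < px + py + pxy
        in_y = px <= u < px + py + pxy
        if in_x and in_y:
            truth.append((len(xs), len(ys)))
        if in_x:
            xs.append([m + rng.gauss(0.0, sd) for m in mu])
        if in_y:
            ys.append([m + rng.gauss(0.0, sd) for m in mu])
    # Permute x so that the correspondence is hidden from the matcher.
    order = list(range(len(xs)))
    rng.shuffle(order)
    new_index = {old: new for new, old in enumerate(order)}
    xs = [xs[old] for old in order]
    truth = [(new_index[j], l) for j, l in truth]
    return xs, ys, truth
```

Calling `simulate_pair(lam=20, side=10.0, sigma2=2.0, seed=1)` 30 times for each of 12 values of N would give a database of the same shape as the one described; here N is drawn inside the function, so fixing N across the 30 replicates would need a small change.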
Short chains data - dataset 3
We simulate a database of paired configurations according to the model of Aszodi
and Taylor (1994). We generate a pair of configurations as follows:
(a) Generate a chain {µ} with at most 50 points (see section 3.1.2 in Chapter 3).
(b) Generate a twin set {x} whereby xj ∼ N(µi, σ²I3), with σ² = 0.5.
(c) Set {µ} has 10% of the points not corresponding with any xj.
(d) Order of points in {x} is permuted so that we do not “know” the map π(j) = i.
No transformation is used.
Comparing MCMC, graph matching and the EM algorithm
We match paired configurations with the graph, MCMC and EM algorithms and then
evaluate the correct correspondence proportion for each method. Figure 6.3 shows the
correct correspondence proportions for the hardcore and Poisson model datasets. For
matching short chains data, the MCMC and EM algorithms had correct correspondence
proportions of 0.898 and 0.997 respectively. The graph method could not match
this type of data; it became too computationally intensive.
It is noted that the graph theoretic method does poorly with Poisson model data.
On the other hand, MCMC does quite well in matching hardcore configurations
even though the prior is Poisson. MCMC is quite adaptable to large noise, while the
graph theoretic method becomes very computationally intensive and finds fewer
true matches when there is large noise in the coordinates of corresponding points.
Figure 6.4 shows the proportions of true matches found by the EM algorithm, MCMC
and graph theoretic methods in a dataset with large variance for the coordinates of
corresponding points. The EM algorithm has the best performance. The performance
of the MCMC and graph methods is similar for small configurations (4 to 20 points).
However, the performance of the graph method degrades faster with more than 20
points.
Figure 6.3: True correspondence proportions for MCMC, graph and EM algorithm
methods. a) Hardcore data. b) Poisson model data.
Figure 6.4: True correspondence proportions for MCMC and graph for hardcore
data with large variance, σ² = 4. Note that the graph method could not match more
than 32 points with large variance. (Axes: point-set size, 10 to 40, against correct
correspondence, 0.60 to 0.85; legend: Graph, MCMC, EM.)
Parameters setting
When σ² = 2:
For MCMC and the EM algorithm, the true values for translation (0), rotation (I3) and
σ² = 2 were given as starting values. A threshold value of 1.5 Å was used for the
graph theoretic method. We took non-duplicate matches with highest probabilities
in MCMC, while linear assignment for hardening soft matches was used for the EM
algorithm results. Further, in linear assignment, we required matching probabilities
to be at least 0.1.
Other settings for MCMC are:
Model hyperparameters
λ/ρ = 0.001; µτ = 0; στ = 5; α = 1; β = 2; γ = 0; δ = 0.
κ = 0, ν = 0 (parameters for the prior on A).
Sampler control and parameters
p⋆ = 0.5.
# of updates for matching matrix M per sweep = 10.
# of sweeps = 10000.
Burn-in period = 2000 sweeps.
Initial values
τ = 0; θ = 0; σ = √(β/α).
For large variance, σ² = 4:
β = 4 for MCMC and threshold = 7 for the graph theoretic method.
6.2 Using Two Atoms for each Amino Acid
In this section we extend the Bayesian alignment method of Green and Mardia
(2006) presented in section 6.1 to matching coupled points in a configuration. This
is motivated by the requirement in Bioinformatics to prefer matching amino acids
with similar orientation when matched configurations are superposed.
We take into account relative orientation of side chains by using Cα and Cβ
atoms in matching amino acids. Positions of these atoms from the same amino
acid are dependent. Let y1k and x1j denote coordinates for Cα atoms in the query
and functional site. We denote Cβ coordinates for the query and functional site by
y2k and x2j respectively. Thus x1j and x2j are dependent. Similarly, y1k and y2k
are dependent. We take into account the position of y2k by using the conditional
distribution given the position of y1k. Given x1j , y1k, it is plausible to assume that
\[ f(x_{1j}, x_{2j}, y_{1k}, y_{2k}) = f(x_{1j}, y_{1k})\, f(x_{2j}, y_{2k} \mid x_{1j}, y_{1k}), \]
\[ x_{2j} \mid x_{1j} \sim N(x_{1j}, \sigma_o^2 I_3), \]
\[ A y_{2k} \mid y_{1k} \sim N(A y_{1k}, \sigma_o^2 I_3) \]
or the displacement
\[ x_{2j} - A y_{2k} \mid (x_{1j}, y_{1k}) \sim N(x_{1j} - A y_{1k}, 2\sigma_o^2 I_3). \]
We assume for “symmetry” that f(x2j − Ay2k | x1j, y1k) depends only on the displacement
as in the likelihood in equation 6.6. Thus φ(.) in Green and Mardia (2006) is
replaced by φ(.) × φ({x2j − x1j − A(y2k − y1k)}/(σo√2)) for the new full likelihood.
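Now the final joint model is
\[
P(M, A, \tau, \sigma, x_1, y_1, x_2, y_2) \propto |A|^{n} P(A) P(\tau) P(\sigma) \times
\prod_{j,k:\, M_{jk}=1}
\left(
\frac{\rho\, \phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2}) \times \phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
     {\lambda (\sigma\sqrt{2})^{d}}
\right).
\tag{6.12}
\]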
Some probability mass for the distribution of x2j | x1j and Ay2k | y1k is unaccounted
for because there is an inhibition distance between x2j and x1j and also between y2k and
y1k. Thus x2j − x1j is not isotropic. This is not expected to affect the performance
of the algorithm because the relative contributions from each x2j − x1j are unaffected. In
other words, the unaccounted probability mass can be attributed to the proportionality
constant.
6.2.1 Prior Distributions and Computations
The additional term in the new full likelihood does not involve τ; hence the posterior
and updating of τ are unchanged.
Rotation Matrix
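The full conditional distribution of A is
\[
P(A \mid M, \tau, \sigma, x_1, y_1, x_2, y_2) \propto |A|^{2n} P(A) \times
\prod_{j,k:\, M_{jk}=1}
\phi\!\left( \frac{x_{1j} - A y_{1k} - \tau}{\sigma\sqrt{2}} \right)
\phi\!\left( \frac{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})}{\sigma_o\sqrt{2}} \right).
\tag{6.13}
\]
Thus
\[
\begin{aligned}
P(A \mid M, \tau, \sigma, x_1, y_1, x_2, y_2)
&\propto P(A) \times \exp\!\left( \mathrm{tr}\Big\{ \frac{1}{2\sigma^2} \sum y_{1k} (x_{1j} - \tau)^T A \Big\}
   + \frac{1}{2\sigma_o^2} \sum (x_{2j} - x_{1j})^T A (y_{2k} - y_{1k}) \right) \\
&\propto P(A) \times \exp\!\left( \mathrm{tr}\Big\{ \frac{1}{2\sigma^2} \sum y_{1k} (x_{1j} - \tau)^T A \Big\}
   + \mathrm{tr}\Big\{ \frac{1}{2\sigma_o^2} \sum (y_{2k} - y_{1k})(x_{2j} - x_{1j})^T A \Big\} \right) \\
&\propto P(A) \times \exp\!\left( \mathrm{tr}\Big\{ \Big( \frac{1}{2\sigma^2} \sum y_{1k} (x_{1j} - \tau)^T
   + \frac{1}{2\sigma_o^2} \sum (y_{2k} - y_{1k})(x_{2j} - x_{1j})^T \Big) A \Big\} \right)
\end{aligned}
\]
where the summation is over j, k : Mjk = 1.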
Similar to equation 6.7, with $P(A) \propto \exp(\mathrm{tr}(F_0^T A))$ for some matrix $F_0$, the full
conditional distribution of A (given data and values for all other parameters) has
the same form with $F_0$ replaced by
\[
F = F_0 + \frac{1}{2\sigma^2} \sum_{j,k:\, M_{jk}=1} (x_{1j} - \tau)\, y_{1k}^T
        + \frac{1}{2\sigma_o^2} \sum_{j,k:\, M_{jk}=1} (x_{2j} - x_{1j})(y_{2k} - y_{1k})^T .
\tag{6.14}
\]
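As an illustration of equation 6.14, the following sketch accumulates the matrix F from a list of matched index pairs. The function and argument names are hypothetical, coordinates are plain 3-element lists, and `matches` is assumed to hold the pairs (j, k) with Mjk = 1; it is not the thesis implementation.

```python
def outer(u, v):
    """Outer product u v^T of two 3-vectors, as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(F, G, c):
    """Return F + c * G for 3x3 matrices stored as lists of rows."""
    return [[F[r][s] + c * G[r][s] for s in range(3)] for r in range(3)]

def full_conditional_F(F0, matches, x1, x2, y1, y2, tau, sigma2, sigma2_o):
    """Accumulate F of equation 6.14 over the matched pairs (j, k)."""
    F = [row[:] for row in F0]
    for j, k in matches:
        # C-alpha term: (x1j - tau) y1k^T / (2 sigma^2)
        d1 = [x1[j][r] - tau[r] for r in range(3)]
        F = add_scaled(F, outer(d1, y1[k]), 1.0 / (2.0 * sigma2))
        # C-alpha to C-beta displacement term:
        # (x2j - x1j)(y2k - y1k)^T / (2 sigma_o^2)
        dx = [x2[j][r] - x1[j][r] for r in range(3)]
        dy = [y2[k][r] - y1[k][r] for r in range(3)]
        F = add_scaled(F, outer(dx, dy), 1.0 / (2.0 * sigma2_o))
    return F
```

Setting `sigma2_o` very large makes the second term vanish, which recovers the single-atom update of F used when only Cα atoms are matched.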
6.2.2 Updating M
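Similar to expression 6.9, the acceptance probability for adding a match (j, k) is
\[
\min\left\{ 1,\;
\frac{\rho\, \phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})\, p^{*} n_u}{\lambda (\sigma\sqrt{2})^{d}}
\times
\frac{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}{(\sigma_o\sqrt{2})^{d}}
\right\}.
\]
Similarly, the acceptance probability for switching the match of xj from yk to yk′ is
\[
\min\left\{ 1,\;
\frac{\phi(\{x_{1j} - A y_{1k'} - \tau\}/\sigma\sqrt{2})}{\phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})}
\times
\frac{\phi(\{x_{2j} - x_{1j} - A(y_{2k'} - y_{1k'})\}/\sigma_o\sqrt{2})}{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
\right\}
\]
and for deleting the match (j, k) is
\[
\min\left\{ 1,\;
\frac{\lambda (\sigma\sqrt{2})^{d}}{\rho\, \phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})\, p^{*} n_u}
\times
\frac{(\sigma_o\sqrt{2})^{d}}{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
\right\}.
\]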
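The three acceptance probabilities can be computed as in the sketch below. It assumes that φ is the standard normal density in d = 3 dimensions and that n_u is the count appearing as "nu" in the expressions above; all function names are hypothetical and the guards against zero densities are an addition of this example.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(z):
    """Standard normal density in len(z) dimensions (assumed form of phi)."""
    d = len(z)
    return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-0.5 * sum(c * c for c in z))

def rotate(A, v):
    """Matrix-vector product A v."""
    return [sum(A[r][s] * v[s] for s in range(len(v))) for r in range(len(A))]

def pair_densities(x1j, x2j, y1k, y2k, A, tau, sigma, sigma_o):
    """Return (phi for the C-alpha residual, phi for the displacement residual)."""
    Ay1 = rotate(A, y1k)
    Ady = rotate(A, [b - a for a, b in zip(y1k, y2k)])
    z_alpha = [(x - ay - t) / (sigma * SQRT2) for x, ay, t in zip(x1j, Ay1, tau)]
    z_beta = [((b - a) - ad) / (sigma_o * SQRT2) for a, b, ad in zip(x1j, x2j, Ady)]
    return phi(z_alpha), phi(z_beta)

def add_ratio(dens, rho, lam, p_star, n_u, sigma, sigma_o, d=3):
    """Ratio inside min{1, .} for adding a match; its inverse is used for deletion."""
    phi_a, phi_b = dens
    return (rho * phi_a * p_star * n_u / (lam * (sigma * SQRT2) ** d)
            * phi_b / (sigma_o * SQRT2) ** d)

def accept_add(dens, rho, lam, p_star, n_u, sigma, sigma_o, d=3):
    return min(1.0, add_ratio(dens, rho, lam, p_star, n_u, sigma, sigma_o, d))

def accept_delete(dens, rho, lam, p_star, n_u, sigma, sigma_o, d=3):
    ratio = add_ratio(dens, rho, lam, p_star, n_u, sigma, sigma_o, d)
    return 1.0 if ratio == 0.0 else min(1.0, 1.0 / ratio)

def accept_switch(dens_new, dens_old):
    """Acceptance probability for switching x_j from y_k (old) to y_k' (new)."""
    old = dens_old[0] * dens_old[1]
    if old == 0.0:
        return 1.0
    return min(1.0, dens_new[0] * dens_new[1] / old)
```

A smaller `sigma_o` makes the displacement density fall off faster, so pairs whose Cα to Cβ directions disagree after rotation are proposed for matching with lower acceptance probability.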
6.2.3 Results
Figure 6.5 shows residues matched between the 17-β hydroxysteroid dehydrogenase
and carbonyl reductase functional sites when both Cα and Cβ atoms are used.
The graph theoretic method (with a simple adaptation) can also use both Cα and
Cβ atoms. Section 7.3.2 in Chapter 7 gives the adaptation required. Using both
Cα and Cβ atoms and δ = 1.5, the graph theoretic method finds 19 corresponding
pairs with RMSD=0.72. The MCMC match gives RMSD=0.57 for the 19 pairs with highest
matching probabilities.
a) MCMC
b) Graph
Figure 6.5: Corresponding amino acids in matching functional sites of 17 − β hy-
droxysteroid dehydrogenase (1a27 0) and carbonyl reductase (1cyd 1) using Cα and
Cβ atoms in MCMC and graph theoretic methods.
Figure 6.6 shows residues matched using only Cα atoms. Using only Cα atoms
and with a threshold δ = 0.98 for matching distances, the graph method also matches 19 pairs
with RMSD=0.634. The threshold was chosen to give the same number of matches
as when using Cα and Cβ atoms. On the other hand, MCMC gives RMSD=0.594
for the 19 highest probability matching pairs when only Cα atoms are used.
a) MCMC
b) Graph
Figure 6.6: Corresponding amino acids in matching functional sites of 17-β hydroxysteroid
dehydrogenase (1a27 0) and carbonyl reductase (1cyd 1) using Cα atoms
only in MCMC and graph theoretic methods.
In this example, the solutions found by the graph and MCMC methods using singles (one
atom for each amino acid) or couples (two atoms for each amino acid) are
very similar, despite a lower RMSD value for the MCMC method with two atoms.
However, we had to search for the right threshold for graph matching using Cα atoms
only; otherwise the RMSD value is higher⁴ for the threshold value of 1.5 Å. Table 6.8
gives the number of common pairs between these solutions.
Table 6.8: Number of same pairs in MCMC and graph solutions when using Cα
atoms only (single) and when using both Cα and Cβ atoms (couple).
                   Graph             MCMC
                single  couple   single  couple
Graph:  single     -      13       12      14
        couple     -       -       13      15
MCMC:   single     -       -        -      16
Note: each solution has 19 matched pairs.
⁴ RMSD=0.778; # of corresponding amino acids=27.
6.2.4 Comments
• It is observed that with coupled points MCMC mostly converges just after 10⁴
sweeps, while it needs around 10⁶ sweeps for uncoupled points.
• Using coupled points involves more matching constraints, so that solutions
tend to have smaller RMSD and fewer corresponding amino acids.
• The parameter σo² controls the flexibility in orientation for matching coupled
points. A smaller σo² requires more similar orientation for matching couples.
Chapter 7
Bayesian Refinement of Graph
Solutions
In this chapter we consider augmenting a graph-theoretic method with an MCMC
refinement step in matching protein functional sites. Thus we consider a method based
on initial graph matching followed by refinement using a Markov chain Monte Carlo
(MCMC) procedure.
7.1 Introduction
An MCMC refinement step can provide significant improvements over graph matching
techniques. With the Bayesian approach we are able to refine graph solutions to find
more biologically interesting and statistically significant matches between functional
sites.
In our application in section 7.4, we show that the MCMC refinement step is able
to significantly improve graph based matches. We apply the method to matching
FAD/NAD(P)(H) binding sites within single Rossmann fold families, between dif-
ferent Rossmann superfamilies and within different folds. Within families sites are
often well conserved, but there are examples where significant shape based matches
do not retain similar amino acid chemistry, indicating that even within families the
same ligand may be bound using substantially different physico-chemistry. We also
show that the procedure finds significant matches between binding sites for the same
cofactor in different superfamilies and different folds. The results show the method
can be used to detect structural similarity between functional sites from proteins
with different folds.
7.2 Motivation
The graph theoretic approach requires the matching distance threshold to be adjusted
a priori according to the noise in atomic positions, which is difficult to pre-determine when
matching templates from a database with relatives of varying distance and varying
crystallographic precision. Furthermore, the graph method is unable to identify alternative
but sometimes important solutions in the neighbourhood of the distance based
solution because of strict distance thresholds. On the other hand, the graph theoretic
approach is very fast and robust, and can quickly give corresponding points from
which we can get rough estimates for rotation and translation. Using MCMC in
the Bayesian hierarchical modelling (starting from the rotation and translation estimates
given by the graph method) relaxes the strict distance thresholds used in graph matching.
That is, MCMC automatically adapts to the level of noise in functional site atomic
positions. Furthermore, using MCMC to sample from the full joint distribution in
equation (6.12) provides an extremely flexible basis for reporting aspects of the full
joint posterior that are of interest, including alternative matching matrices.
7.3 Method
We consider the Bayesian hierarchical approach to improve the graph based solution.
A graph theoretic matching algorithm is used to get an initial estimate of rotation
and translation followed by refinement using Bayesian hierarchical modelling.
7.3.1 Representation and Matching
In this chapter we consider matching at the level of amino acid residues. However
we use both Cα and Cβ atoms of each residue (except glycine where we only use Cα
atoms; there is no Cβ atom in glycine). Note that since there are several examples of
similarities in protein functional sites from evolutionarily unrelated proteins, which
do not preserve the amino to carboxy terminal order of the matching residues,
methods in this thesis take no account of the sequential ordering of residues.
At the least restricted level, any residue is allowed to match any other, thus
producing matches considering only the form or shape of the sites, in terms of
the spatial arrangement of their constituent residues, irrespective of amino acid
identities and physico-chemical properties. A more restricted scheme is also con-
sidered where residues are only allowed to match within the same physico-chemical
class: hydrophobic (A,F,I,L,M,P,V), polar (C,H,N,Q,S,T,W,Y), charged (D,E,K,R),
or glycine (G). These groups (Branden and Tooze, 1999, p. 6), tabulated in Table
5.4 and also used in Chapter 5, are chosen to illustrate the value of the MCMC
procedure. However the procedure would be equally applicable to other possible
physico-chemical groupings.
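The restricted scheme amounts to a lookup from one-letter amino acid codes to the four classes listed above. A minimal sketch follows; the names `GROUPS`, `GROUP_OF` and `may_match` are hypothetical and chosen only for this illustration.

```python
# Physico-chemical classes as listed in the text (one-letter amino acid codes).
GROUPS = {
    "hydrophobic": "AFILMPV",
    "polar": "CHNQSTWY",
    "charged": "DEKR",
    "glycine": "G",
}

# Reverse lookup: amino acid code -> class name.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}

def may_match(aa_x, aa_y, restricted=True):
    """Return True if residues aa_x and aa_y are allowed to be matched.

    With restricted=False any residue may match any other (shape only);
    with restricted=True both residues must fall in the same class.
    """
    if not restricted:
        return True
    return GROUP_OF[aa_x] == GROUP_OF[aa_y]
```

Swapping in another grouping only requires changing the `GROUPS` table, which reflects the remark that the procedure applies equally to other physico-chemical groupings.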
We consider matching two functional site configurations {xj , j = 1, 2, . . . , m} and
{yk, k = 1, 2, . . . , n} in 3-dimensional space. The jth and kth amino acids in {x} and
{y} are represented by xj and yk respectively. We do not know the correspondence
between j and k. Possibly some js do not correspond with any k and similarly some
ks do not correspond with any j. Let x1j and y1k denote coordinates for Cα atoms
for the jth and kth amino acids in the functional sites. Similarly, we denote Cβ
coordinates for the kth and jth amino acids by y2k and x2j . Thus x1j and x2j are
dependent. Similarly, y1k and y2k are dependent.
7.3.2 Graph Theoretic Step
The graph matching method described in section 1.2.1 is used. However, here we
use two atoms, i.e. Cα and Cβ, for matching and superposition. In addition to the
requisite for connecting vertices representing amino acids in the product graph in
Definition 1.2.1, the inter-point distances between Cβ atoms have to be within 1.5 Å
as well. That is, all corresponding Cα to Cα and Cβ to Cβ distances in matched
configurations are within 1.5 Å of each other. Thus we define a vertex product graph
as follows.
Definition 7.3.1. Let V1 and V2 be the sets of vertices for G1 and G2 respectively.
The vertex product graph Hv = G1 ◦v G2 includes the vertex set VH = V1 × V2, in
which the vertex pairs (xj, yk) with xj ∈ V1 and yk ∈ V2 have the same attribute.
An edge between two vertices vh = (xj, yk), vh′ = (xj′, yk′) ∈ VH exists for j ≠ j′
and k ≠ k′ such that
• the absolute difference between the distances |x1j − x1j′| and |y1k − y1k′| is less
than 1.5 Å;
• also the absolute difference between the distances |x2j − x2j′| and |y2k − y2k′| is less than 1.5 Å.
As before, in the least restrictive case all vertices (amino acids) are assumed to have
the same attribute and hence matching can occur between any amino acid and is
only dependent on inter-residue distances. Alternatively vertices can be labelled
with residue physico-chemical properties to restrict matching to amino acids in the
same group (Section 7.3.1).
We search for the maximum similarity between two graphs G1 and G2 represent-
ing {x} and {y} respectively. Thus we search for the maximal common subgraph or
a clique within the vertex product graph for G1 and G2 (Hv = G1 ◦v G2). Example
applications in this chapter use the clique detection algorithm of Carraghan and
Pardalos (1990) for graph matching.
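Definition 7.3.1 and the clique search can be sketched as follows. This is an illustrative sketch only: the clique routine is a simple branch-and-bound search written for this example, not a reimplementation of the Carraghan and Pardalos (1990) algorithm, and it assumes every residue has both a Cα and a Cβ coordinate (glycine would need special handling). All names are hypothetical.

```python
import math
from itertools import combinations

def product_graph(x1, x2, y1, y2, labels_x=None, labels_y=None, tol=1.5):
    """Build the vertex product graph as an adjacency dict keyed by (j, k)."""
    # Vertices: all pairs (j, k), or only same-attribute pairs if labels are given.
    vertices = [(j, k) for j in range(len(x1)) for k in range(len(y1))
                if labels_x is None or labels_x[j] == labels_y[k]]
    adj = {v: set() for v in vertices}
    for (j, k), (j2, k2) in combinations(vertices, 2):
        if j == j2 or k == k2:
            continue
        # Both the C-alpha and the C-beta inter-point distances must agree.
        ca_ok = abs(math.dist(x1[j], x1[j2]) - math.dist(y1[k], y1[k2])) < tol
        cb_ok = abs(math.dist(x2[j], x2[j2]) - math.dist(y2[k], y2[k2])) < tol
        if ca_ok and cb_ok:
            adj[(j, k)].add((j2, k2))
            adj[(j2, k2)].add((j, k))
    return adj

def max_clique(adj):
    """Return one maximum clique by branch and bound (exponential worst case)."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        for v in sorted(candidates):
            # Prune: even taking every remaining candidate cannot beat best.
            if len(clique) + len(candidates) <= len(best):
                return
            candidates = candidates - {v}
            expand(clique + [v], candidates & adj[v])

    expand([], set(adj))
    return best
```

The vertices of the returned clique are the matched pairs (j, k), from which a rotation and translation can then be estimated as a starting point for the refinement step.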
7.3.3 MCMC Refinement Step
We use MCMC sampling in the Bayesian hierarchical modelling described in section 6.2,
starting from the rotation and translation obtained from the graph solution. We start
from several random initial values for the noise parameter and monitor convergence
(the log posterior likelihood) and the quality of the solution in terms of RMSD, statistical
significance and the number of corresponding amino acids.
7.3.4 Accounting for Physico-chemistry Properties
The method accounts for the 3-dimensional form of the site as well as the physico-
chemistry of constituent amino acids. Restrictive matching is used to account for
physico-chemistry in graph matching. Thus amino acid groups are used as vertex
attributes. Two vertices vh = (xj, yk), vh′ = (xj′, yk′) ∈ VH in
the product graph (Definition 7.3.1) are connected by an edge only if the pairs (xj, yk) and
(xj′, yk′) each represent amino acids in the same physico-chemistry group. A detailed
description on how to flexibly account for physico-chemistry in Bayesian hierarchical
modelling is given in Green and Mardia (2006) and a brief discussion is in section
6.1.6. However in order to compare graph theoretic and MCMC refinement results
(in the application in section 7.4) we have not used a “full prior” for ρ/λ as in
Green and Mardia (2006) when accounting for physico-chemistry. The matching
indicator Mjk in equation 6.3 is constrained to be zero in the probability model and
all algorithmic steps if jth and kth amino acids are in different physico-chemistry
groups. As in the graph theoretic approach when accounting for physico-chemistry,
this matches amino acids in the same group only.
7.3.5 Assessing Quality of Matches
Matches were assessed in terms of a number of parameters. The first are the number of
matched residues and the root mean square deviation between matched positions
(RMSD), which are very commonly used in the field. It is intuitively clear
that matches of lower RMSD over larger numbers of matching residues are more
statistically significant.
In Chapter 4, we considered p-values for RMSD and the score (Gold, 2003) for
ranking matches or assessing goodness-of-fit under the assumption that matched
configurations are related. Here we consider assessing the significance of the evidence
that our match is not by mere chance and that there is some relationship between the
configurations. A measure of this significance has recently been suggested by Stark
et al. (2003b), and was modified in this work to correct for the number of amino acids
in the functional sites being matched. Matchings were pair-wise and we calculated
E-values and P-values with a correction for the number of amino acids in the functional
sites. Thus we used the P-value formula
\[ P = 1 - e^{-E} \tag{7.1} \]
where
\[ E = C(m,n)\, P\, a\, \Phi\, b^{q}\, R_M^{2.93q - 5.88}\, [y R_M^{2}]^{S}\, [z R_M^{3}]^{T}, \qquad q \ge 3 \]
and E is the expected number of matches with this RMSD or better, P (on the right-hand
side of the expression for E) is the number of binding sites that were matched, Φ is the
product of percentage abundances of all matched amino acids, R_M is RMSD, q is the
number of matched amino acids, S is the number of amino acids with two atoms matched
and T is the number of amino acids with more than two atoms matched. In our
applications T = 0. Incidentally, the first exponent of R_M, 2.93q − 5.88 ≃ 3q − 6, is
what is expected from the Mardia-Dryden distribution of size-and-shape (see Dryden and
Mardia, 1998). We used the empirically derived constants a = 3.704 × 10⁶,
b = 1.790 × 10⁻³, y = 0.196 and z = 0.094 as in Stark et al. (2003b).
$C(m,n) = 3!\binom{n}{3}\binom{m}{3}$ is a correction factor for the numbers of amino
acids, m and n, in the functional sites. The expected number of matches with this
RMSD or better by chance is multiplied by C(m,n) because matching 3 points
exhausts all degrees of freedom in optimal matching of rigid bodies (Kuhl et al.,
1984). Equation 7.1 is derived from the extreme value distribution (see section 1.2.2
of Chapter 1). Because RMSD is positive and the distribution shows a heavy tail
attenuated at zero, a Fréchet type distribution is used.
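The E-value and P-value calculation can be sketched directly from the formula above. The function and argument names are hypothetical; `phi_abund` stands for Φ and must be supplied by the caller, since the abundance values themselves are not given in this section.

```python
import math

# Empirically derived constants quoted in the text from Stark et al. (2003b).
A_CONST = 3.704e6
B_CONST = 1.790e-3
Y_CONST = 0.196
Z_CONST = 0.094

def correction(m, n):
    """C(m, n) = 3! * (n choose 3) * (m choose 3)."""
    return math.factorial(3) * math.comb(n, 3) * math.comb(m, 3)

def e_value(m, n, n_sites, phi_abund, q, rmsd, s=0, t=0):
    """Expected number of chance matches with this RMSD or better.

    m, n      : numbers of amino acids in the two functional sites
    n_sites   : number of binding sites matched (P in the expression for E)
    phi_abund : product of abundances of the matched amino acids (Phi)
    q         : number of matched amino acids (must be at least 3)
    s, t      : numbers of amino acids with two / more than two atoms matched
    """
    if q < 3:
        raise ValueError("the formula is stated for q >= 3")
    return (correction(m, n) * n_sites * A_CONST * phi_abund * B_CONST ** q
            * rmsd ** (2.93 * q - 5.88)
            * (Y_CONST * rmsd ** 2) ** s
            * (Z_CONST * rmsd ** 3) ** t)

def p_value(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is tiny."""
    return -math.expm1(-e)
```

For a fixed number of matched residues the E-value grows with RMSD, so a lower RMSD over the same q gives a smaller P-value, in line with the intuition stated earlier in this section.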
It is important to note that the MCMC procedure is not directly aiming to
optimise any of these measures, and it is equally important to appreciate that the
connection between statistical and biological significance is not straightforward.
Accordingly, the example applications in the section below were carefully chosen to be
well understood cases where matches can be interpreted relatively easily in biochemical
terms.
7.4 Applications
The method uses matching schemes that are relatively unrestricted in terms of amino
acid identity (either with no restriction or matching in broadly defined physico-
chemical groups). As currently formulated it is therefore better suited to the study of
larger ligand binding sites, than smaller sites associated, for example, with enzymatic
catalysis. The former are more likely to be defined by shape and physico-chemical
properties, while the latter depend critically on precise amino acid residue identities.
For our example applications we have therefore chosen sites for the binding of some
very common biochemical ligands related to FAD (flavin adenine dinucleotide) and
NAD(P) i.e. nicotinamide adenine dinucleotide (phosphate). These ligands are
bound as cofactors by a large variety of enzyme domains many of which come from
the Rossmann family of protein folds. Importantly, there are many proteins of known
structure that bind these related cofactors ranging from close evolutionary relatives,
through very distant relatives to proteins of different fold and likely independent
evolutionary origin. For structural and evolutionary relationships SCOP (Andreeva
et al., 2004) was used.
We consider two binding sites, the NAD binding site from an alcohol dehydro-
genase structure (1hdx 1 in SITESDB), and a larger NADP binding site from a
17− β hydroxysteroid dehydrogenase (1a27 0 from SITESDB) which includes both
the cofactor and substrate binding regions. For these binding sites we performed
the following matching studies
1. The alcohol dehydrogenase functional site against NAD(P)(H)
binding sites from proteins in the same SCOP family as alcohol dehydrogenase
(alcohol dehydrogenase-like, N-terminal domain; SCOP: c.2.1.1).
2. The functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against NAD(P)(H)
binding sites from proteins in the same SCOP family as 17-β hydroxysteroid
dehydrogenase (tyrosine-dependent oxidoreductases; SCOP: c.2.1.2).
3. The alcohol dehydrogenase functional site in (1) against NAD(P)(H) binding
sites from proteins in the same SCOP superfamily as alcohol dehydrogenase but
different families (SCOP: c.2.1.x; for x ≠ 1).
4. The alcohol dehydrogenase functional site against
FAD/NAD(P)(H) binding sites from proteins in FAD/NAD(P)-binding domain
(SCOP: c.3.1.x).
The first of these test cases is the most straightforward, involving matching the
NAD binding site against similar sites in closely related proteins. The second is sim-
ilar, but more challenging, because the larger 17− β hydroxysteroid dehydrogenase
site (1a27 0) also incorporates the substrate binding region. The associated family
(c.2.1.2) is functionally broad and members catalyse reactions on a variety of diverse
substrate molecules. Matching methods therefore need to identify matches in the
related cofactor binding region and ignore local site dissimilarities owing to sub-
strate variation. The third test case considers similarities in sites with more distant
evolutionary relationships (where sequence similarity between the protein domains
concerned is very low, but the structural similarity of the Rossmann fold remains).
The fourth test case assesses the ability of the method to locate site similarities between
different folds that bind the same or related ligands.
7.4.1 Case 1: Alcohol Dehydrogenase and Family
Figure 7.1a shows the results of using graph matching only where matching was
performed with and without amino acid property group information. First note
that in the less restricted matching scenario, without amino acid group informa-
tion, matches generally involve more residues or lower RMSD values, as would be
expected. Thus, in the figure, the lines connecting the restricted matches (green
circles) with the unrestricted matches (blue crosses) for each site family member
often have a gradient that is negative or close to zero. In the case of matching with-
out property information, most sites in the family show a match with 1hdx 1 with
a low RMSD (< 1.5 Å) and a significant number of corresponding residues (> 8).
However, this is not the case when amino acid property information is taken account
of, and a minority of the matches show relatively high RMSD values, over generally
lower numbers of matching residues. Thus it appears that lower quality matches can
result from the use of amino acid property information, perhaps because these close
relatives have conserved the shape of the binding site but not the physico-chemical
characteristics. This may happen in binding site regions whose properties are not
crucial to ligand binding.
Figures 7.1b and 7.2 show the effect of the MCMC refinement on the graph
only matches of Figure 7.1a. The same basic conclusions can be drawn from Figure
7.1b as from Figure 7.1a. However, from Figures 7.1b and 7.2 it is clear that in the
cases of 3 site matches with amino acid property information, the MCMC refinement
procedure produces significant improvements in RMSD values (RMSD is improved
from > 1.5 Å to less than 1 Å while also marginally increasing the number of matching
residues). Thus the refinement procedure is able to improve some matches, even in
the cases of closely related sites examined here.
The overall effect of the refinement procedure within this family can be consid-
ered in terms of the statistical significance of the matches obtained. This informa-
tion is summarised in Table 7.1. Without taking physico-chemical properties into
account, 142 of the 145 sites produced significant matches (p-value < 0.05).
The MCMC refinement step significantly improves solutions in matching sites of
quinone oxidoreductase (1qor) and hypothetical protein YhdH (1o8c). Multiple
sequence alignment of 1hdx with family members shows that they share a common
dinucleotide binding motif GL-GGVG. For 1qor 0, we match 2 glycines of the motif
before the MCMC refinement step and 3 glycines after it. For 1o8c 1, we match
3 glycines before the MCMC refinement step and all 4 glycines after it.
Figures 7.3 and 7.4 show corresponding amino acids before and after the
[Figure 7.1: panels a) and b), each plotting RMSD (Å) against number of matched amino acids.]
Figure 7.1: Alcohol dehydrogenase NAD-binding site (1hdx 1) matching against
NAD(P)(H) binding sites of SCOP alcohol dehydrogenase-like family proteins
(Case 1). a) Graph matching prior to MCMC refinement step showing results
with/without amino acid property information. Each site is represented by a green
circle (with) and blue cross (without) connected by a straight line to highlight the
difference. b) MCMC refinement step of (a).
[Figure 7.2: plot of RMSD (Å) against number of matched amino acids.]
Figure 7.2: Effect of MCMC refinement on graph matches of the NAD-binding
functional site of alcohol dehydrogenase (1hdx 1) against NAD(P)(H) binding sites
of SCOP alcohol dehydrogenase-like family proteins (Case 1) where corresponding
amino acids are restricted to others in the same group. Each site is represented by
a green circle (graph only) and blue cross (after MCMC refinement) connected by
a straight line to highlight the difference.
refinement in 1qor 0 and 1o8c 1, respectively. These are probably cases with
several alternative solutions, which the probabilistic approach is able to explore.
When taking physico-chemical properties into account, we find 132 out of 145
sites significant after the refinement step, compared with only 125 before it.
Matches with 1qlh 1, 1hdy 1, 3hud 1, 1n9q 1 and sites from 1pl6 are significant
only after the MCMC refinement step. Figure 7.2 is a plot of RMSD against the
number of corresponding amino acids before and after the MCMC refinement step
when accounting for physico-chemical properties in matching. This plot shows that
the MCMC refinement step achieves better RMSD and more corresponding
amino acids in a number of cases.
[Figure 7.3: two panels, labelled "Graph" and "MCMC Refinement Step".]
Figure 7.3: Corresponding amino acids between the NAD-binding site of alcohol
dehydrogenase (1hdx 1) and NADP-binding site of quinone oxidoreductase (1qor 0)
before and after MCMC refinement step (Case 1). Amino acids with bold borders
are part of the dinucleotide binding motif GL-GGVG.
[Figure 7.4: two panels, labelled "Graph" and "MCMC Refinement Step".]
Figure 7.4: Corresponding amino acids between the NAD-binding site of alcohol de-
hydrogenase (1hdx 1) and NADP-binding site of hypothetical protein YhdH (1o8c 1)
before and after MCMC refinement step (Case 1). Amino acids with bold borders
are part of the dinucleotide binding motif GL-GGVG.
7.4.2 Case 2: 17-β Hydroxysteroid Dehydrogenase and Family
We took a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) and
matched it against NAD(P)(H) binding sites belonging to members of the same
SCOP family (c.2.1.2), with and without taking into account the amino acid
chemistry. The query, 1a27 0, binds NADP and oestradiol molecules.
When not accounting for physico-chemical properties, the MCMC step refines 70
matches so that they become statistically significant: 248 sites are significant
before the MCMC refinement step and 318 after it. Some of these are 1udb 0, 2udp 1,
1uda 0, 1lrl 2, 1lrj 0, 1kvt 0, 1kvs 0, 1i3k 0, 1i3l 6, 1i3l 7, 1i3l 7, 1i3n 0, 1i3m 0,
1hzj 1, 1bxk 0. Figure 7.5 is a plot of RMSD against the number of corresponding
amino acids before and after the MCMC refinement step. Improvement after the
refinement is evident in many cases.
The comparison between matching with and without physico-chemical properties
gives the same pattern as for alcohol dehydrogenase (Case 1). Figures 7.6a and 7.6b
show RMSD plotted against the number of corresponding amino acids when matching
with and without physico-chemical properties, before and after the MCMC refinement
step. Both before and after the MCMC refinement step, accounting for
physico-chemical properties restricts the matching at the expense of RMSD and the
number of corresponding amino acids.
7.4.3 Case 3: Alcohol Dehydrogenase and Superfamily
We matched the same query as in Case 1 (1hdx 1) against other NAD(P)(H) binding
sites belonging to members of the same SCOP superfamily (c.2.1.x) but of a different
family from the query. No amino acid information is used in matching in this case.
Figure 7.5: Effect of MCMC refinement on graph matches of the NADP-binding
site of 17-β hydroxysteroid dehydrogenase (1a27 0) against NAD(P)(H) binding
sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2) where
corresponding amino acids are not restricted to others in the same group. Each site
is represented by a green circle (graph only) and blue cross (after MCMC refinement)
connected by a straight line to highlight the difference.
The MCMC refinement step achieves significant matches in, among other sites, 1nq5 1,
1nqo 1, 1nqo 3, 3dbv 0, 3dbv 2, 3dbv 3, 4dbv 0, 4dbv 1, 4dbv 2, 4dbv 3 and 1efl 12.
In all these cases, at least GGXG of the dinucleotide binding motif GXXGGXG is
matched. Graph theoretic solutions before the MCMC refinement step are not
significant. Figure 7.7 is a superposition of corresponding amino acids between
functional sites of glyceraldehyde-3-phosphate dehydrogenase (3dbv 3) and alcohol
dehydrogenase (1hdx 1) after the MCMC refinement step.
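The motif notation GXXGGXG, with X standing for any residue, maps directly onto a regular expression. The sketch below shows one way of locating such a motif in a one-letter amino acid sequence; the sequence fragment is invented for illustration and is not taken from any of the structures discussed here.

```python
import re

# GXXGGXG as a regular expression: each 'X' (any residue) becomes '.'.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=G..GG.G)")


def find_motif(sequence):
    """Return the 0-based start position of every occurrence of the motif."""
    return [m.start() for m in MOTIF.finditer(sequence)]


# Hypothetical sequence fragment, for illustration only.
fragment = "MKAVGLAGGVGSLAV"
print(find_motif(fragment))
```

A sequence-level hit of this kind says nothing about geometry; in the matching described above the glycines are identified through their spatial correspondence, not through a pattern search.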
7.4.4 Case 4: Alcohol Dehydrogenase and FAD/NAD(P)-
binding Domain
We took an NAD-binding functional site of alcohol dehydrogenase (1hdx 1) and
matched it against FAD/NAD(P)(H) binding sites belonging to members of the SCOP
FAD/NAD(P)-binding domain (c.3.1.x) without taking into account the amino acid
chemistry. A distance threshold of 1.0 Å rather than 1.5 Å was found to give
better matches for the graph theoretic solution and was used in this case. A total of 338
Figure 7.6: RMSD against number of corresponding amino acids for matching
17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H) binding
sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2). a)
Graph matching prior to MCMC refinement showing results with/without amino
acid property information. Each site is represented by a green circle (with) and blue
cross (without) connected by a straight line to highlight the difference. b) MCMC
refinement of (a).
Figure 7.7: Superposition of matching amino acids (Case 3) between alcohol dehy-
drogenase (1hdx 1; blue) and glyceraldehyde-3-phosphate dehydrogenase (3dbv 3;
red) binding sites after MCMC refinement (RMSD = 0.672; number of correspond-
ing amino acids = 12; p-value = 3.68e-05). The matched dinucleotide binding motif
is shown in ball-and-stick representation. Ligands are coloured in CPK colours.
pair-wise comparisons were made and 64 were significant before MCMC refinement
step. Sites from dihydropyrimidine dehydrogenase (1gth 13) and fumarate reductase
(1qla 5; 1qla 7; 1qlb 2; 1qlb 6) become statistically significant only after MCMC re-
finement step (p-values for 1gth 13, 1qla 5, 1qla 7, 1qlb 2 and 1qlb 6 before MCMC
refinement step: 0.3742, 0.6621, 0.6766, 0.6199 and 0.6199; after MCMC refinement
step: 0.0258, 0.0141, 0.0141, 0.0001 and 0.0001).
7.4.5 Assessing MCMC Refinement
Table 7.1 summarises the improvements achieved by the MCMC refinement step
in the applications considered, when matching with and without physico-chemical
properties. In all cases considered, there are sites which give statistically significant
matches only after the MCMC refinement step.
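The gains implied by Table 7.1 can be tabulated directly. The short sketch below recomputes, for each case, the net change in the number of significant matches and that change as a fraction of the sites compared. The counts are copied from Table 7.1; the dictionary layout and the function name are illustrative choices only.

```python
# Per case: (total sites, (sig. graph, sig. MCMC) without amino acid property,
#            (sig. graph, sig. MCMC) with amino acid property), from Table 7.1.
TABLE_7_1 = {
    "alcohol dehydrogenase and family": (145, (142, 142), (125, 132)),
    "17-β hydroxysteroid dehydrogenase and family": (326, (248, 318), (159, 236)),
    "alcohol dehydrogenase and superfamily": (897, (200, 324), (33, 222)),
    "alcohol dehydrogenase and FAD/NAD(P)-binding domain": (338, (64, 69), (5, 12)),
}


def gains(table):
    """Net gain in significant matches, absolute and relative to the total."""
    result = {}
    for case, (total, without, with_prop) in table.items():
        result[case] = {
            "without": (without[1] - without[0], (without[1] - without[0]) / total),
            "with": (with_prop[1] - with_prop[0], (with_prop[1] - with_prop[0]) / total),
        }
    return result


for case, gain in gains(TABLE_7_1).items():
    print(f"{case}: without {gain['without'][0]:+d} ({gain['without'][1]:.1%}), "
          f"with {gain['with'][0]:+d} ({gain['with'][1]:.1%})")
```

These are net changes in the counts, so they equal the number of newly significant sites only if no match loses significance during refinement.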
When not using physico-chemical properties, the greatest improvement (relative to
the number of sites considered) is registered in matching the query from 17-β
hydroxysteroid dehydrogenase against sites from members of the same SCOP family.
There are also many improved cases in matching the query from alcohol
dehydrogenase against sites from different families and from a different fold
(FAD/NAD(P)-binding domain).
Tables 7.2 and 7.3 compare RMSD and number of matched amino acids found
before and after refinement when not using physico-chemical properties. RMSD does
not change much after refinement. There are marginal increases in mean RMSD for
matching functional sites of alcohol dehydrogenase and 17-β hydroxysteroid
dehydrogenase against sites of members of the same SCOP family. However, there are
also marginal decreases in mean RMSD for matching alcohol dehydrogenase against
sites from different families and from a different fold. The mean number of matched
amino acids increases after MCMC refinement, except in matching functional sites of
alcohol dehydrogenase against members of the same superfamily but different
families, where there is a marginal decrease.
There are even more improvements after the MCMC refinement step when using
physico-chemical properties. However, there are fewer significant matches both before
and after refinement when matching with physico-chemical properties than when
matching without them. Tables 7.4 and 7.5 compare RMSD and number of matched
amino acids found before and after refinement when using physico-chemical
properties. There are marginal decreases in mean RMSD after MCMC refinement in
all cases. The mean number of matched amino acids increases after MCMC
refinement as well.
7.5 Comments
The examples given above make a clear case that MCMC refinement can improve
ligand binding site matches generated by graph matching, in terms of both the sta-
tistical and biological significance of the match. We attribute this success to the lack
of dependence on a strict matching tolerance, which is enforced in graph matching.
Statistical modelling in refinement of matches appears to have been successful in
automatically adapting to shape variations in ligand binding sites, which might be
due to different noise levels in atomic positions or protein phylogeny differences,
Table 7.1: Assessment of statistical significance of functional site matching before
and after MCMC refinement step with/without amino acid property information.

                                                             Without amino acid property    With amino acid property
Case                                                   Total   Sig. Graph   Sig. MCMC        Sig. Graph   Sig. MCMC
alcohol dehydrogenase and family                         145          142         142               125         132
17-β hydroxysteroid dehydrogenase and family             326          248         318               159         236
alcohol dehydrogenase and superfamily                    897          200         324                33         222
alcohol dehydrogenase and FAD/NAD(P)-binding domain      338           64          69                 5          12

Sig. Graph: significant before refinement.
Sig. MCMC: significant after MCMC refinement step.
Table 7.2: RMSD (Å) before and after MCMC refinement step without amino acid
property.

                                                          Graph                 MCMC
Case                                                      Mean   Std. Dev.      Mean   Std. Dev.
alcohol dehydrogenase and family                         0.590      0.2350     0.619      0.2824
17-β hydroxysteroid dehydrogenase and family             0.874      0.2208     0.958      0.1987
alcohol dehydrogenase and superfamily                    2.093      1.6820     1.934      1.7367
alcohol dehydrogenase and FAD/NAD(P)-binding domain      1.723      1.3155     1.715      1.3188
Table 7.3: The number of matched amino acids before and after MCMC refinement
step without amino acid property.

                                                          Graph                 MCMC
Case                                                      Mean   Std. Dev.      Mean   Std. Dev.
alcohol dehydrogenase and family                          33.7       12.26      34.6       13.11
17-β hydroxysteroid dehydrogenase and family              17.0        6.83      21.8        7.04
alcohol dehydrogenase and superfamily                     13.3        2.14      12.4        2.24
alcohol dehydrogenase and FAD/NAD(P)-binding domain       10.5        1.03      10.6        1.22
Table 7.4: RMSD (Å) before and after MCMC refinement step with amino acid
property.

                                                          Graph                 MCMC
Case                                                      Mean   Std. Dev.      Mean   Std. Dev.
alcohol dehydrogenase and family                         0.805      0.8892     0.737      0.7944
17-β hydroxysteroid dehydrogenase and family             1.047      0.6706     0.997      0.6226
alcohol dehydrogenase and superfamily                    2.751      1.9127     2.459      2.0150
alcohol dehydrogenase and FAD/NAD(P)-binding domain      3.424      2.3234     3.337      2.3814
Table 7.5: The number of matched amino acids before and after MCMC refinement
step with amino acid property.

                                                          Graph                 MCMC
Case                                                      Mean   Std. Dev.      Mean   Std. Dev.
alcohol dehydrogenase and family                          28.1       13.51      32.8       14.74
17-β hydroxysteroid dehydrogenase and family              12.4        6.55      17.0        7.75
alcohol dehydrogenase and superfamily                      8.3        1.21       8.7        1.69
alcohol dehydrogenase and FAD/NAD(P)-binding domain        8.7        0.81       9.0        1.06
among other factors. Refined matches usually retain a similar RMSD, and achieve
greater significance through expansion of the number of matching residues from the
core graph match. We have noted however that in some cases significant reductions
in the match RMSD are also achieved by refinement.
Dependence on a strict matching tolerance is not limited to graph matching,
but is also a feature of other matching methods commonly used in the field (e.g.
geometric hashing: Wallace et al., 1997). It is important to note that the MCMC re-
finement procedure can be applied to a starting match generated by any method; and
that the graph procedure chosen here was simply intended as an example. Equally,
the MCMC procedure can be applied to matching with no previously generated
starting match, for example by starting from randomly generated matches. That is, the
MCMC method provides a stand-alone algorithm for matching. Furthermore, the
method provides the full joint posterior distribution, so that we have, for example,
the posterior distribution for the matching matrix and for the parameters of the
transformation simultaneously. However, we find that obtaining good matches by
this method is very expensive in terms of computational time. While methods such
as graph matching can be applied to database searching, where a site is matched
against all members of a large database of sites, this would be impractical for match-
ing by MCMC alone. We suggest therefore that the MCMC procedure would be
most advantageous when applied to the best hits from a database search using a
faster method, and that in many cases it would increase the number of significant
hits.
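The suggested two-stage use (a fast search over the whole database, followed by refinement of the best hits only) can be sketched as follows. Both `fast_score` and `refine` are hypothetical placeholders supplied by the caller, standing in for a graph match and an MCMC refinement respectively; neither is an interface of the programs used in this thesis.

```python
def search_then_refine(query, database, fast_score, refine, top_k=3):
    """Rank every site with a cheap score, then refine only the top_k best hits.

    fast_score(query, site) -> float, lower is better (stand-in for a graph match).
    refine(query, site)     -> float, stand-in for a costly MCMC refinement.
    """
    ranked = sorted(database, key=lambda site: fast_score(query, site))
    return [(site, refine(query, site)) for site in ranked[:top_k]]


# Toy usage: "sites" are plain numbers and the scores are absolute differences.
calls = []


def toy_refine(query, site):
    calls.append(site)  # record how often the costly step is run
    return abs(query - site) / 2


hits = search_then_refine(10, [1, 9, 12, 30, 11], lambda q, s: abs(q - s), toy_refine)
print(hits)        # refined scores for the three closest sites only
print(len(calls))  # the costly step ran top_k times, not once per site
```

The choice of `top_k` trades computing time against the risk of discarding a site whose match would only have become significant after refinement.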
We have made only a very basic study of the effect of including amino-acid residue
physico-chemical property information in matching, contrasting matches obtained
without restriction (any residue may match any other) with slightly more restrictive
matching (residues only allowed to match within relatively broadly defined groups).
It is interesting that even with very broadly defined groups, fewer statistically sig-
nificant matches are generally obtained than when matching is without restriction.
This could suggest that the physico-chemical properties of sites binding the same
or similar ligands can change significantly in evolution. It is however most likely to
reflect increased flexibility to change in peripheral residues that are less important
for binding, and needs further investigation. The main point of this work is that
MCMC refinement can improve matches under either matching regime. Indeed in
a few cases of matching with physico-chemical groups, we showed that some graph
matches without statistical significance were converted to significant matches by the
MCMC procedure, revealing that using graph matching alone could lead to some
erroneous conclusions in this respect.
Chapter 8
Conclusions and Further Work
In this Chapter we summarise important points from Chapters 3 to 7. We also
highlight potential areas for further work on the topics discussed in this thesis.
8.1 Conclusions
A few conclusions can already be drawn from work reported in this thesis.
8.1.1 Functional Sites
Exploratory analysis shows that functional sites in SITESDB tend to consist of short
contiguous segments (motifs) from the protein chain. Although in theory, side chains
from different parts of the chain can come together spatially to form an active site
or binding site, the automatic extraction of these sites in SITESDB leads to the
inclusion of all adjacent side chains (within 5 Å of residues annotated with SITE
records in the PDB or of bound ligands). These adjacent side chains are therefore
present in the dataset but might not be part of the core binding or functional site.
However, it has to be noted that currently most well-known functional sites or
active sites are motifs.
8.1.2 Simulating Random Protein Structures
In section 3.1.2 we successfully simulated random protein Cα traces. A simpler
and more flexible approach similar to Aszodi and Taylor (1994) was used. With a
simple modelling of hydrophobic effects, the method produces compact and globular
structures.
8.1.3 Matching Algorithms
All the algorithms considered here (Graph theoretic, EM algorithm and MCMC) do
better when configuration points are further apart. As expected the performance
decreases with more cluttering of points and increasing positional noise.
The graph method
The graph theoretic method is robust and fast, but not very flexible in accounting
for concomitant information and different noise levels in functional sites. The
method can also break down or become very computer intensive when matching
configurations with many inter-point distances of the same magnitude, since the
product graph becomes very large. Consequently, the graph method might not be the
ideal approach in some applications, such as matching whole protein chains.
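The growth of the product graph can be illustrated with a small sketch. It uses a common formulation, assumed here rather than taken from the implementation used in this thesis: nodes are pairs (i, j) of points from the two configurations, and two nodes are joined when the corresponding inter-point distances agree within a tolerance. The coordinates and the tolerance are invented for the example.

```python
import math
from itertools import combinations


def product_graph(a, b, tol):
    """Build the correspondence (product) graph of two point configurations."""
    nodes = [(i, j) for i in range(len(a)) for j in range(len(b))]
    edges = []
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a point may not be matched twice
        if abs(math.dist(a[i], a[k]) - math.dist(b[j], b[l])) <= tol:
            edges.append(((i, j), (k, l)))
    return nodes, edges


# All six inter-point distances are equal (regular tetrahedron).
tetra = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
# All six inter-point distances are different (points on a line).
line = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0)]

for name, config in (("equal distances", tetra), ("distinct distances", line)):
    nodes, edges = product_graph(config, config, tol=0.1)
    print(name, len(nodes), len(edges))
```

With the same number of nodes, the configuration whose distances all coincide yields a far denser graph, which is the situation in which clique detection becomes expensive.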
The EM algorithm
Concomitant information can flexibly be used in the EM algorithm. With good
starting values the EM algorithm does impressively well in finding corresponding
points. However, the EM algorithm is sensitive to starting values. It is recommended
to run the algorithm from several starting values for the rotation, translation and
noise parameters, and then to monitor convergence. Simple match-constraining
techniques, e.g. variance cooling, help the algorithm to find better solutions.
The Bayesian hierarchical model (MCMC)
The algorithm is largely insensitive to starting values. Unlike the EM algorithm,
MCMC can easily escape local optima.
Although an assumption of a hidden homogeneous Poisson process was made
to formulate the model, the algorithm is not sensitive to this assumption. The
algorithm can match hardcore configurations, simulated short and virtual protein
chains and most importantly real functional sites.
The meta algorithm
There seems to be no silver bullet solution to matching functional sites. MCMC
does better than the EM algorithm when the starting values are far from the true
parameter values. MCMC can escape local maxima. However, MCMC is
very computer intensive and can sometimes drift away from the optimal solution.
There is a need to monitor convergence in both the EM and MCMC algorithms. On
the other hand, the graph theoretic method is robust and fast, but not very flexible
in accounting for concomitant information and different noise levels in functional
sites. Thus Bayesian modelling of the graph solution, i.e. using the MCMC method
starting from the transformation parameter estimates given by the graph method,
was suggested. This meta algorithm is observed to be a good strategy: the MCMC
refinement step was able to improve graph-based matches so that they are more
biologically significant.
8.1.4 Concomitant Information
Concomitant information (amino acid type) guides the EM algorithm to converge
(faster) to the true solution. However, in most cases geometric information is so rich
that the contribution from amino acid type information is marginal in both the
EM and MCMC algorithms.
8.1.5 Hardening Soft Matches
Both the EM and MCMC algorithms give probabilities of matching points in a pair-wise
alignment. Using linear programming to optimally assign unique matches gives the best
results. However, for small problems, simply taking the first non-duplicate set of
matches with high probabilities, or using a greedy algorithm, gives practically
similar solutions.
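The difference between an optimal and a greedy hardening can be seen on a toy matrix of soft-match probabilities. In the sketch below, exhaustive search over permutations stands in for the linear programming step (it is feasible only for very small problems), the objective is assumed to be the sum of the selected probabilities, and the matrix is deliberately contrived so that the two strategies disagree, which is not claimed to be typical.

```python
from itertools import permutations


def harden_optimal(prob):
    """One-to-one assignment maximising the total probability (exhaustive)."""
    n = len(prob)
    best = max(permutations(range(n)),
               key=lambda perm: sum(prob[i][perm[i]] for i in range(n)))
    return list(enumerate(best))


def harden_greedy(prob):
    """Repeatedly accept the largest remaining probability with a free row and column."""
    n = len(prob)
    cells = sorted(((prob[i][j], i, j) for i in range(n) for j in range(n)),
                   reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(matches)


# Hypothetical soft-match probabilities: prob[i][j] for point i against point j.
prob = [
    [0.90, 0.80, 0.00],
    [0.85, 0.10, 0.05],
    [0.00, 0.10, 0.90],
]
print(harden_optimal(prob))  # pairs chosen by the exhaustive search
print(harden_greedy(prob))   # pairs chosen greedily
```

Here the greedy rule commits to the pair (0, 0) first and is then left with a poor choice for point 1, whereas the exhaustive search gives up a little on point 0 to gain much more on point 1.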
8.1.6 Assessing Significance of Matches
We have considered assessing whether matches are significantly non-random,
under the null hypothesis of random matches. We have also considered the
goodness-of-fit for matching related configurations.
Random versus non-random matches
Significance of matching two configurations under the null hypothesis of random
matches depends on the RMSD, the total number of amino acids in each configuration
and the number of amino acids matched. An extreme value distribution with
empirically derived constants for matching two random configurations can be used
for evaluating p-values that take these three quantities into account.
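For orientation, the generic form of an upper-tail p-value under a Gumbel (type I extreme value) distribution is sketched below. Which match statistic plays the role of `z`, and the values of `loc` and `scale`, are assumptions made for the example: the empirically derived constants of this thesis are not reproduced here.

```python
import math


def gumbel_upper_tail(z, loc, scale):
    """P(Z >= z) for a Gumbel distribution for maxima: 1 - exp(-exp(-(z - loc) / scale))."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    t = (z - loc) / scale
    if t < -30:
        return 1.0  # exp(-exp(30)) is zero to machine precision
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-math.exp(-t))


# Hypothetical location and scale, for illustration only.
for z in (0.0, 2.0, 5.0):
    print(z, gumbel_upper_tail(z, loc=0.0, scale=1.0))
```

The p-value decreases monotonically as the standardised score increases, so a fixed threshold such as 0.05 corresponds to a fixed cut-off on the score once the constants are known.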
Goodness-of-fit
We considered goodness-of-fit for matches known to be related (not matching by
chance) in section 4.1 of Chapter 4. P-values for assessing goodness-of-fit or ranking
matches from the RMSD distribution under the isotropic Gaussian error model
mostly agree with the decision using the score suggested by Gold (2003).
8.1.7 Application: Matching NAD Binding Functional Sites
In Chapter 7, section 7.4, when using the meta algorithm (graph theoretic matching
followed by MCMC refinement) to match NAD(P)(H) binding sites, we find examples
where significant shape-based matches do not retain similar amino acid chemistry.
Matches were within
single Rossmann fold families, between different families in the same superfamily,
and in different folds. This indicates that even within families the same ligand may
be bound using substantially different physico-chemistry. We also showed that the
procedure finds significant matches between binding sites for the same cofactor in
different families and different folds.
In our basic study of the effect of including amino-acid residue physico-chemical
property information in matching, we contrasted matches obtained without restric-
tion (any amino acid could match any other) with slightly more restrictive matching
(amino acids only allowed to match within relatively broadly defined groups). It is
interesting that even with very broadly defined groups, fewer statistically signifi-
cant matches were generally obtained than when matching is without restriction. It
is also interesting to note that MCMC refinement improved matches under either
matching regime. Indeed in a few cases of matching with physico-chemical groups,
we showed that some graph matches without statistical significance were converted
to significant matches by the MCMC procedure, revealing that using graph matching
alone could lead to some erroneous conclusions in this respect.
8.2 Further Work
This work has also raised some issues which are interesting and need further work.
8.2.1 Simulating Random Protein Structures
Aszodi and Taylor (1994) and our alternative method in section 3.1.2 use fixed target
distances between Cα atoms, only taking into account the hydrophobic property
of amino acids. Further work could explore the idea of varying target distances
according to the target chain length. There would be a need to explore how distances
between different types of amino acids in 3-dimensional structures vary with the
spacing in the sequence as well as with the length of the chain.
In our method, the minimum of three random angles from the von Mises distri-
bution was used at each hydrophobic Cα atom in order to fold the chain towards
the centre of mass and create a hydrophobic interior core. Further work could
incorporate a variable number of random angles, depending on the number of Cα
atoms already in the chain. This approach would control the level of structure
compactness. Furthermore, it would reduce the chance of the chain crashing into
itself.
8.2.2 Matching Statistics
More research is required on the distribution of the RMSD (or size-and-shape
distance) and of the number of matches when matching random configurations. The
following are interesting and very much open questions:
(a) What is the exact distribution for the number of matches q when matching two
random configurations {x} and {µ} with say n and m points?
(b) And what is the exact distribution of RMSD for matching q pairs of points for
two random configurations with m and n points?
Empirical approaches (Stark et al., 2003b; Chen and Crippen, 2005) have been
quite successful in answering these questions. However, the true analytical
distributions have not been worked out. With analytical distributions, the adjustment
for database size would be straightforward, and empirical (model fitting)
approximation by the limiting distribution (EVD) in section 1.2.2 for the minimum
RMSD or number of matches in a database search would not be required.
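One standard form of database-size adjustment, shown here only as an illustration, converts a single-comparison p-value into the probability that at least one of N comparisons is as extreme. The sketch assumes that the N comparisons are independent, which need not hold for related sites in a real database; the function name is hypothetical.

```python
import math


def database_adjusted_p(p, n_comparisons):
    """P(at least one of n independent comparisons has p-value <= p) = 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    if p == 1.0:
        return 1.0
    if p == 0.0:
        return 0.0
    # log1p and expm1 keep precision when p is very small.
    return -math.expm1(n_comparisons * math.log1p(-p))


# A single-comparison p-value of 0.0001 against a database of 338 sites.
print(database_adjusted_p(0.0001, 338))
```

For small p the adjusted value is close to N times p, so a match must be far more extreme to remain significant as the database grows.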
8.2.3 Matching Algorithms
In this thesis we have only considered matching pair-wise configurations. An impor-
tant extension to this approach is matching multiple configurations simultaneously.
The EM algorithm
There are still a number of questions of interest to be investigated with regard
to the EM algorithm alignment. For example, further work needs to be done to
explore the multiple transformations approach. We observe that there
is asymmetry in the performance of the algorithm when matching configurations
with two transformations: the algorithm gave fewer errors with respect to the first
transformation than with respect to the second. Further work needs to be
done in order to understand this observation.
Sensitivity Analysis for Multiple Transformations
Relevant questions in multiple transformation approach include:
(a) What are the effects of mis-specifying the number of transformations e.g. as-
suming the presence of two transformations when actually there is only one?
Simulations similar to those in section 5.4.4 but more extensive are required.
(b) How can the number of transformations be inferred?
(c) When does the problem become over-parameterised? (number of transforma-
tions versus the number of points in the configurations).
Using Multiple Atoms for each Amino Acid
We will consider using more than one atom from each amino acid for matching
functional sites in the future. Some of the issues to be considered are:
(a) Which atoms to choose?
(b) How to account for dependence between atoms?
The Bayesian hierarchical model
In the future we will consider alternative formulations to relax the assumption of
conditional normal distribution for the second atom given the first atom when using
two atoms in an amino acid for matching functional sites.
Sequence ordering
All matching algorithms (MCMC, graph theoretic and EM algorithms) can be
extended to take into account the sequence ordering information especially when
matching whole protein structures. In addition to an enhanced capability to solve
alignment and correspondence for configurations with many points, this would speed
up running times of the algorithms. Sequence information would constrain the
matching further and dramatically reduce the solution space.
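To make the reduction in solution space concrete, the following sketch counts the one-to-one matchings of q pairs between configurations of n and m points, with and without the requirement that matched pairs preserve sequence order. It assumes a strictly sequential alignment (no reversals or permutations of the chain); the function name and the numbers are illustrative.

```python
import math


def count_matchings(n, m, q, order_preserving=False):
    """Number of ways to match q of n points one-to-one with q of m points."""
    subsets = math.comb(n, q) * math.comb(m, q)
    # Without an ordering constraint, each choice of subsets admits q! pairings;
    # with it, only the monotone pairing is allowed.
    return subsets if order_preserving else subsets * math.factorial(q)


n, m, q = 30, 40, 10
unrestricted = count_matchings(n, m, q)
ordered = count_matchings(n, m, q, order_preserving=True)
print(unrestricted, ordered, unrestricted // ordered)
```

The ratio between the two counts is q!, so the saving grows very quickly with the number of matched points.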
8.2.4 Application: Matching NAD Binding Functional Sites
There is a suggestion that the physico-chemical properties of sites binding the same
or similar ligands can change significantly in evolution. This was observed when
matching NAD(P)(H) binding sites within single Rossmann fold families, between
different families in the same superfamily, and in different folds in Chapter 7, section
7.4. It is however most likely to reflect increased flexibility to change in peripheral
residues that are less important for binding, and this needs further investigation.
Bibliography
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J.P., Chothia, C. and Murzin,
A.G. (2004). SCOP database in 2004: refinements integrate structure and se-
quence family data. Nucl. Acid Res. 32 (1), D226–D229.
Applegate, D. and Johnson, D. An implementation of the Carraghan and Pardalos
algorithm. ftp://dimacs.rutgers.edu/pub/challenge/graph/solvers/ .
Artymiuk, P.J., Poirrette, A.R., Grindley, H.M., Rice, D.W. and Willett, P. (1994).
A graph-theoretic approach to the identification of three-dimensional patterns of
amino acid side-chains in protein structures. J. Mol. Biol. 243, 327–44.
Aszodi, A. and Taylor, W.R. (1994). Folding polypeptide α-carbon backbones by
distance geometry methods. Biopolymers 34, 489–505.
Bartlett, M.S. (1964). The spectral analysis of two-dimensional point processes.
Biometrika 51, 299–311.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H.,
Shindyalov, I.N. and Bourne, P.E. (2000). The protein data bank. Nucleic Acids
Research 28, 235–242.
Binkowski, T.A., Adamian, L. and Liang, J. (2003). Inferring functional rela-
tionships of proteins from local sequence and spatial surface patterns. J. Mol.
Biol. 332, 505–26.
Blow, D.M., Birktoft, J.J. and Hartley, B.S. (1969). Role of a buried acid group in
the mechanism of action of chymotrypsin. Nature 221, 337–40.
Branden, C. and Tooze, J. (1999). Introduction to Protein Structure (2nd ed.). New
York: Garland Publishing, Inc.
Brenner, S.E. and Levitt, M. (2000). Expectations from structural genomics. Protein
Science 9, 197–200.
Bron, C. and Kerbosch, J. (1973). Algorithm 457: Finding all cliques of an undi-
rected graph. Communications of the ACM 16, 575–577.
Burry, K.V. (1975). Statistical Methods in Applied Science. New Jersey: John Wiley
and Sons, Ltd.
Carraghan, R. and Pardalos, P.M. (1990). Exact algorithm for the maximum clique
problem. Operations Research Letters 9, 375–382.
Carugo, O. and Pongor, S. (2001). A normalized root-mean-square distance for
comparing protein three-dimensional structures. Protein Science 10, 1470–1473.
Chen, Y. and Crippen, G.M. (2005). A novel approach to structural alignment using
realistic structural and environmental information. preprint .
Chhajer, M. and Crippen, G.M. (2002). A protein folding potential that places the
native states of a large number of proteins near a local minimum. BMC Structural
Biology 2.
Chothia, C. and Lesk, A.M. (1986). The relation between the divergence of sequence
and structure in proteins. EMBO J. 5, 823–826.
Cressie, N.A.C. (1993). Statistics for spatial data (Rev. ed.). Chichester ; New York:
John Wiley and Sons.
Dafas, P., Bolser, D.M., Gomoluch, J., Park, J. and Schroeder, M. (2004). Using convex
hulls to extract interaction interfaces from known structures. Bioinformatics 20,
1486–1490.
Dayhoff, M., Schwartz, R. and Orcutt, B. (1978). A model of evolutionary change in
proteins. In M. Dayhoff (Ed.), Atlas of Protein Sequence and Structure, Volume 5,
pp. 345–352. Washington, D.C.: Natl. Biomed. Res. Found.
Deb, K. (2001). Multi-objective Optimization Using Evolutionary Algorithms
(1st ed.). Chichester ; New York: John Wiley and Sons.
Diggle, P.J. (1983). Statistical Analysis of Spatial Point Patterns. London: Academic
Press.
Downs, T.D. (1972). Orientation statistics. Biometrika 59, 665–676.
Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. Chichester: John
Wiley.
Dryden, I.L., Hirst, J.D. and Melville, J.L. (2006). Statistical analysis of unla-
belled point sets: comparing molecules in chemoinformatics. Under revision for
Biometrics .
Eidhammer, I., Jonassen, I. and Taylor, W.R. (2004). PROTEIN BIOINFORMAT-
ICS: An Algorithmic Approach to Sequence and Structure Analysis. New Jersey:
John Wiley and Sons, Ltd.
Ewens, W.J. and Grant, G.R. (2001). Statistical Methods in Bioinformatics : an
introduction. New York: Springer.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (Eds.) (1996). Markov chain
Monte Carlo in practice. London: Chapman and Hall.
Gold, N.D. (2003). Computational approaches to similarity searching in a functional
site database for protein function prediction. Ph.D. thesis, Leeds University, School
of Biochemistry and Microbiology.
Gold, N.D., Pickering, S.J. and Westhead, D.R. (2003). Predicting protein function
from structure using SITESDB: evaluation of a method based on functional site
similarity. Preprint.
Gold, N.D. and Jackson, R.M. (2006). Fold independent structural comparisons
of protein-ligand binding sites for exploring functional relationships. J. Mol.
Biol. 355 (5), 1112–1124.
Gong, S. and Park, C. and Choi, H. and Ko, J. and Jang, I. and Lee, J. and Bolser,
D.M. and Oh, D. and Kim, D. and Bhak, J. (2005). A protein domain interaction
interface database: InterPare. BMC Bioinformatics 6.
Gong, S.S. and Yoon, G.S. and Jang, I.S. and Bolser, D.M. and Dafas, P. and
Schroeder, M. and Choi, H.S. and Cho, Y.B. and Han, K.S. and Lee, S.H.
and Choi, H.H. and Lappe, M. and Holm, L. and Kim, S.S. and Oh, D.H. and
Bhak, J.H. (2005). PSIbase: a database of Protein Structural Interactome map
(PSIMAP). Bioinformatics 21, 2541–2543.
Green, P.J. (2001). A primer on Markov chain Monte Carlo. In O. Barndorff-
Nielsen, D. Cox, and C. Klüppelberg (Eds.), Complex Stochastic Systems, pp.
1–62. London: Chapman and Hall.
Green, P.J. and Mardia, K.V. (2006). Bayesian alignment using hierarchical models,
with applications in protein bioinformatics. Biometrika in press.
Gumbel, E.J. (1958). Statistics of Extremes. New York: Columbia University Press.
Holm, L., Ouzounis, C., Sander, C., Tuparev, G. and Vriend, G. (1992). A database
of protein structure families with common folding motifs. Protein Sci. 1, 1691–8.
Holm, L. and Sander, C. (1993). Protein structure comparison by alignment of
distance matrices. J. Mol. Biol. 233, 123–38.
Hubbard, T.J., Murzin, A.G., Brenner, S.E. and Chothia, C. (1997). SCOP: a
structural classification of proteins database. Nucleic Acids Res 25, 236–9.
Hung, M.S. and Rom, W.O. (1980). Solving the assignment problem by relaxation.
Operations Research 28, 969–982.
Jaramillo, A., Wernisch, L., Hery, S. and Wodak, S.J. (2002). Folding free
energy function selects native-like protein sequences in the core but not on the
surface. Proc. Natl. Acad. Sci. 99(21), 13554–9.
Jeong, J.I., Jang, Y. and Kim, M.K. (2006). A connection rule for α-carbon coarse-
grained elastic network models using chemical bond information. Journal of
Molecular Graphics and Modelling 24, 296–306.
Jonker, R. and Volgenant, A. (1987). A shortest augmenting path algorithm for
dense and sparse linear assignment problems. Computing 38, 325–340.
Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two
sets of vectors. Acta Cryst. A 34, 827–828.
Karp, R.M. (1980). An algorithm to solve the m × n assignment problem in expected
time O(mn log n). Networks 10, 143–152.
Kent, J.T., Mardia, K.V. and Taylor, C.C. (2004). Matching unlabelled configu-
rations of unequal size with applications to bioinformatics. In R.G. Aykroyd,
S. Barber, and K.V. Mardia (Eds.), Bioinformatics, Images, and Wavelets, pp.
33–36. Leeds University Press.
Khatri, C.G. and Mardia, K.V. (1977). The von Mises-Fisher matrix distribution in
orientation statistics. Journal of the Royal Statistical Society. Series B (Method-
ological) 39 (1), 95–106.
Kinoshita, K., Furui, J. and Nakamura, H. (2002). Identification of protein functions
from a molecular surface database, eF-site. J. Struct. Funct. Genomics 2, 9–22.
Kinoshita, K., Sadanami, K., Kidera, A. and Go, N. (1999). Structural motif of
phosphate-binding site common to various protein superfamilies: all-against-all
structural comparison of protein-mononucleotide complexes. Protein Eng. 12,
11–4.
Kleywegt, G.J. (1999). Recognition of spatial motifs in protein structures. J. Mol.
Biol. 285, 1887–97.
Kuhl, F.S., Crippen, G.M. and Friesen, D.K. (1984). A combinatorial algorithm for
calculating ligand binding. Journal of Computational Chemistry 5 (1), 24–34.
Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Naval
Research Logistics Quarterly 2, 83–97.
Lesk, A.M. (2000). Introduction to protein architecture: the structural biology of
proteins. Oxford: Oxford University Press.
Lesk, A.M. (2002). Introduction to Bioinformatics. Oxford: Oxford University
Press.
van Lieshout, M.N.M. (2000). Markov point processes and their application. London:
Imperial College Press.
Luo, B. and Hancock, E.R. (2001). Structural Graph Matching Using the EM
Algorithm and Singular Value Decomposition. IEEE Trans. Pattern Analysis and
Machine Intelligence 23 (10), 1120–1136.
Mardia, K.V. (1972). Statistics of Directional Data. London and New York: Aca-
demic Press.
Mardia, K.V. and Gadsden, R.J. (1977). A small circle of best fit for spherical data
and areas of vulcanism. Applied Statistics 26 (3), 238–245.
Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics. Chichester: John Wiley
and Sons Ltd.
Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2003). Protein matching us-
ing amino acids information. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon
(Eds.), Stochastic Geometry, Biological Structure and Images, pp. 147. Leeds Uni-
versity Press.
Mardia, K.V., Taylor, C.C. and Westhead, D.R. (2003). Structural Bioinformatics
Revisited. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon (Eds.), Stochastic
Geometry, Biological Structure and Images, pp. 11–18. Leeds University Press.
Mardia, K.V. and Nyirongo, V. (2004). Procrustes statistics for unlabelled points
and applications. In R.G. Aykroyd, S. Barber, and K.V. Mardia (Eds.), Bioin-
formatics, Images, and Wavelets, pp. 137. Leeds University Press.
Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2005). EM algorithm, Bayesian
and distance approaches to matching active site. Mathematical and Statistical
Aspects of Molecular Biology 15th Annual meeting, Abstracts pp. 13–14.
Murty, K.G. (1968). An algorithm for ranking all assignments in order of
increasing cost. Operations Research 16, 682–687.
Naor, D., Fischer, D., Jernigan, R.L., Wolfson, H.J. and Nussinov, R. (1996). Amino
acid pair interchanges at spatially conserved locations. J. Mol. Biol. 256 (5),
924–938.
Orengo, C.A., Jones, D.T. and Thornton, J.M. (1994). Protein superfamilies and
domain superfolds. Nature 372, 631–4.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. and Thornton,
J.M. (1997). CATH – a hierarchic classification of protein domain structures.
Structure 5, 1093–108.
Park, J. and Lappe, M. and Teichmann, S. (2001). Mapping protein family interac-
tions: intramolecular and intermolecular protein family interaction repertoires in
the PDB and yeast. J. Mol. Biol. 307 (3), 929–938.
Pedersen, L. (2002). Analysis of two-dimensional electrophoresis gel images. Ph. D.
thesis, Technical University of Denmark.
Pereira De Araujo, A.F. (1999). Folding protein models with a simple hydrophobic
energy function: The fundamental importance of monomer inside/outside segre-
gation. Proc. Natl. Acad. Sci. 96(22), 12482–7.
Raffenetti, R.C. and Ruedenberg, K. (1970). Parameterization of an orthogonal
matrix in terms of generalized Eulerian angles. International Journal of Quantum
Chemistry IIIS, 625–634.
Rangarajan, A. and Gold, S. (1996). A graduated assignment algorithm for graph
matching. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (4), 377–
388.
Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an
unknown number of components (with discussion). Journal of the Royal Statistical
Society. Series B (Methodological) 59 (4), 731–792.
Ripley, B.D. (1976). The second-order analysis of stationary point processes. Journal
of Applied Probability 13, 255–266.
Ripley, B.D. (1977). Modelling spatial patterns (with discussion). Journal of the
Royal Statistical Society. Series B (Methodological) 39 (2), 172–212.
Sanchez, R. and Sali, A. (1998). Large-scale protein structure modeling of the
Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. 95(23), 13597–602.
Sayle, R.A. and Milner-White, E.J. (1995). Rasmol: biomolecular graphics for all.
Trends in Biochemical Sciences 20 (9), 374.
Schmitt, S., Kuhn, D. and Klebe, G. (2002). A new method to detect related function
among proteins independent of sequence and fold homology. J. Mol. Biol. 323,
387–406.
Shindyalov, I. and Bourne, P.E. (1998). Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–47.
Shulman-Peleg, A., Nussinov, R. and Wolfson, H.J. (2004). Recognition of functional
sites in protein structures. J. Mol. Biol. 339, 607–33.
Stark, A., Sunyaev, S. and Russell, R.B. (2003b). A model for statistical significance
of local similarities in structure. J. Mol. Biol. 326 (5), 1307–1316.
Taylor, C.C., Mardia, K.V. and Kent, J.T. (2003). Matching unlabelled configura-
tions using the EM algorithm. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon
(Eds.), Proceedings in Stochastic Geometry, Biological Structure and Images, pp.
19–21. Leeds University Press.
Torrance, J.W., Bartlett, G.J., Porter, C.T. and Thornton, J.M. (2005). Using a
Library of Structural Templates to Recognise Catalytic Sites and Explore their
Evolution in Homologous Families. J. Mol. Biol. 347 (3), 565–81.
Wallace, A.C., Borkakoti, N. and Thornton, J.M. (1997). TESS: a geometric hashing
algorithm for deriving 3d coordinate templates for searching structural databases.
Protein Sci. 6, 2308–23.
Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Alagona, G., Profeta Jr., S.
and Weiner, P. (1984). A new force field for the molecular mechanical simulation
of nucleic acids and proteins. J. Am. Chem. Soc. 106, 765–84.
Wright, C.S., Alden, R.A. and Kraut, J. (1969). Structure of subtilisin BPN′ at 2.5
angstrom resolution. Nature 221, 235–42.
Wright, M.B. (1990). Speeding up the Hungarian algorithm. Computers and Oper-
ations Research 17 (1), 95–96.
Wu, T.D., Schmidler, S.C., Hastie, T. and Brutlag, D.L. (1998). Modeling and
superposition of multiple protein structures using affine transformations: Analysis
of the globins. In Pacific Symposium on Biocomputing ’98, Maui, Hawaii, pp. 509–
520. World Scientific.
Appendix A
Computational Cost
The time and storage requirements of algorithms used in Bioinformatics applications
are very important because many comparisons, or huge amounts of data, are usually
processed. In the future we would like to compare run times for the graph method,
the EM algorithm and MCMC. This would require the methods to be implemented in
the same programming language. At present, run times are not directly comparable
because the graph method is implemented in C, the EM algorithm in R and MCMC in
Fortran. Nevertheless, we present estimated processor times in the next section,
not for comparison purposes but to give an indication of the time it would take to
search a typical database.
A.1 Processor Times
The first 100 sites in SITESDB were matched against a large functional site of
17-β hydroxysteroid dehydrogenase (1a27 0) with 63 amino acids. Sun Microsystems©
UltraSPARC II (360 MHz) and UltraSPARC-IIe (650 MHz) processors were used.
Table A.1 gives the estimated times needed to do 55,000 and 100,000 pair-wise
comparisons by the EM algorithm and graph methods.
For the Bayesian method, Green and Mardia (2006) report a run time of 2 seconds
on an 800 MHz PC to match the same functional site of 17-β hydroxysteroid
dehydrogenase (1a27 0) against another functional site with 40 points.
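As a rough illustration of how a timed sample of matches can be scaled up to a whole database, the following Python sketch extrapolates linearly from a sample to a given number of pair-wise comparisons. It assumes that the cost per comparison is constant, which may not hold when site sizes vary. The function name estimate_hours and the sample timing of 792 seconds for 100 comparisons are hypothetical: the timing is an illustrative value chosen to be consistent with the EM algorithm entry for 55,000 comparisons at 360 MHz in Table A.1, not a measurement reported here.

```python
def estimate_hours(sample_seconds, sample_size, database_size):
    """Linearly extrapolate a timed sample of comparisons to a full search (hours).

    Assumes every pair-wise comparison costs the same on average.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    seconds_per_comparison = sample_seconds / sample_size
    return seconds_per_comparison * database_size / 3600.0


# Hypothetical sample timing (an assumption, not a reported measurement):
# 792 seconds of processor time for the first 100 comparisons.
sample_seconds = 792.0
sample_size = 100

for database_size in (55_000, 100_000):
    hours = estimate_hours(sample_seconds, sample_size, database_size)
    print(f"{database_size:>7,d} comparisons: about {hours:.0f} hours")
```

The same function could be applied to timings from any of the three methods, provided the sample is representative of the sites in the database.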
Table A.1: Estimated time (hrs) for EM algorithm and graph methods to do pair-wise
comparisons between a functional site of 17-β hydroxysteroid dehydrogenase
(1a27 0) and functional sites in SITESDB on 360 and 650 MHz processors.

                    EM algorithm           Graph
Database size    360 MHz   650 MHz    360 MHz   650 MHz
55,000             121        67         23        10
100,000            221       121         41        19
A.2 Comments
• A naive analysis of the times in Table A.1 shows that the graph method is about
six times faster than the EM algorithm. Coincidentally, C implementations are,
in general, roughly six times faster than R implementations. Thus
implementing the EM algorithm in C should improve its run times.
• This analysis involved a relatively large query site consisting of 63 points.
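The speed ratio quoted in the first comment can be checked directly from Table A.1. The short Python sketch below divides each EM algorithm time by the corresponding graph method time and averages the four ratios. The dictionary layout and the function name speedup_ratios are illustrative choices; only the hour values are taken from the table.

```python
# Estimated hours from Table A.1, keyed by (database size, processor MHz).
em_hours = {
    (55_000, 360): 121, (55_000, 650): 67,
    (100_000, 360): 221, (100_000, 650): 121,
}
graph_hours = {
    (55_000, 360): 23, (55_000, 650): 10,
    (100_000, 360): 41, (100_000, 650): 19,
}


def speedup_ratios(slow, fast):
    """Return slow/fast time ratios for every configuration present in `slow`."""
    return {key: slow[key] / fast[key] for key in slow}


ratios = speedup_ratios(em_hours, graph_hours)
mean_ratio = sum(ratios.values()) / len(ratios)

for (size, mhz), ratio in sorted(ratios.items()):
    print(f"{size:>7,d} sites at {mhz} MHz: graph is {ratio:.1f} times faster")
print(f"mean ratio: {mean_ratio:.1f}")
```

The individual ratios vary with processor and database size, so the single figure of about six is best read as an average over the four configurations.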