Statistical approaches to protein matching in Bioinformatics

Vysaul B. Nyirongo

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

The University of Leeds

Department of Statistics

January 2006

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others. This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Dedication

To my father W.C. Nyirongo, the kindest

and

my mother Nee Dorothy Nyirenda, the finest.

Acknowledgements

I am deeply thankful to my supervisor, Prof. K.V. Mardia, for his guidance, discussions, helpful comments and inspiring interest in this research. I am also deeply indebted to Prof. P.J. Green for kindly providing the source code for Bayesian alignment using hierarchical models.

I wish to thank Dr. C. Xu for his many helpful comments on spatial point processes and for kindly allowing me to use his program for analysing spatial point processes. I am also grateful to Dr. D.R. Westhead and Dr. N.D. Gold for their many helpful discussions and comments, and for access to the functional sites database (SITESDB).

Last but not least, I would like to express my gratitude for financial support from Universities UK, the University of Leeds and the Department of Statistics at the University of Leeds. My research studies were financed by Universities UK through an ORS scholarship and by the University of Leeds through a Tetley and Lupton scholarship. During this research, I was financially supported by the Department of Statistics, University of Leeds.

Abstract

Structural genomics projects aim to provide structural data or accurate models for uncharacterised proteins (Brenner and Levitt, 2000). The motivation for these initiatives is the knowledge that similarity between protein structures can provide evidence of common evolutionary ancestry (and hence possible functional similarity) even where sequence similarity is undetectable, because structure is conserved for longer in evolution than sequence (Chothia and Lesk, 1986). Recent advances in high-throughput protocols for structure determination of structural genomics target proteins have produced an explosion in the volume of structural data prior to knowledge of protein biochemical function. With these advances has come the need to rapidly predict functions for proteins based on structure.

We present statistical matching of functional sites. In particular, we use the EM algorithm in a mixture model formulation to solve for correspondence and alignment in matching two configurations of functional sites. We extend the EM algorithm of Kent et al. (2004) to incorporate concomitant information in matching functional sites. We also extend Green and Mardia (2006) to matching configurations of coupled points using hierarchical models for Bayesian alignment.

We also present goodness-of-fit statistics for matching two functional sites under the Gaussian error model. We consider the Procrustes statistic for matching of forms. The Procrustes statistic is related to RMSD except for a divisor. P-values are used to indicate goodness-of-fit. Related but harder is the problem of finding the distribution of the minimum Procrustes statistic when the points are unlabelled. We first discuss this problem and its inherent difficulty. For illustrative purposes, we use Gaussian configurations on a line.

Key words: active site, binding site, Bayesian, Bioinformatics, correspondence and alignment, EM algorithm, functional site, hierarchical models, Markov chain Monte Carlo, mixture model, Procrustes, root mean square deviation.

Contents

Abstract

Abbreviations and Acronyms

About this Thesis

Overview and Organisation

Research Goals and Aims

Contributions

Conference Papers

1 Introduction and Literature Review

1.1 Introduction

1.1.1 Mathematical Abstraction of the Problem

1.1.2 Motivation

1.1.3 Proteins

1.1.4 Functional Sites

1.1.5 The SITESDB Database

1.1.6 Structure comparisons

1.1.7 Objectives in matching protein structures

1.2 Literature Review

1.2.1 Matching and Superposition Algorithms

1.2.2 Extreme Values in Bioinformatics

2 Exploratory Analysis of Protein Geometry

2.1 Background

2.1.1 Inter-event distances

2.1.2 Point to nearest event distances

2.1.3 The K-function

2.2 Functional Sites

2.3 Protein Structures

3 Simulation Design and Evaluation of Algorithms

3.1 Simulations

3.1.1 Functional Sites Simulations

3.1.2 Whole Structure Simulations

3.1.3 Appropriateness of Simulated Data

3.2 Evaluation

3.2.1 Correct Correspondence

3.2.2 Scores

4 Match Statistics

4.1 Goodness-of-fit Statistics for Rigid Body Superpositions

4.1.1 Minimum RMSD Distribution

4.1.2 Distribution of Size-and-shape Distance

4.1.3 Simulations for RMSD Distribution

4.1.4 Application

4.1.5 Summary

5 EM Algorithm Alignment

5.1 Mixture Model

5.1.1 Soft Matching of Forms

5.1.2 Model Likelihood

5.1.3 The EM Algorithm

5.1.4 Hardening of Soft Matches

5.2 Concomitant Information in the Mixture Model

5.2.1 Concomitant Information Model

5.2.2 Colour Weighting

5.2.3 Evaluation

5.2.4 Application on Matching Functional Sites

5.2.5 Using Amino Acid Group Information

5.2.6 Summarising Comments

5.3 Distance Constraints

5.3.1 Results

5.4 Multiple Transformations

5.4.1 Soft Matching Model

5.4.2 Model Likelihood

5.4.3 The EM Algorithm

5.4.4 Simulations

6 Bayesian Alignment

6.1 Bayesian Hierarchical Model

6.1.1 Point Process Model, with Geometrical Transformation and Random Thinning

6.1.2 Formulation of Poisson Process Prior

6.1.3 Data Likelihood

6.1.4 Prior Distributions and Computations

6.1.5 Inference

6.1.6 Using Concomitant Information

6.1.7 Results for Graph Theoretic and MCMC

6.1.8 Sensitivity of Poisson Prior Assumption

6.2 Using Two Atoms for each Amino Acid

6.2.1 Prior Distributions and Computations

6.2.2 Updating M

6.2.3 Results

6.2.4 Comments

7 Bayesian Refinement of Graph Solutions

7.1 Introduction

7.2 Motivation

7.3 Method

7.3.1 Representation and Matching

7.3.2 Graph Theoretic Step

7.3.3 MCMC Refinement Step

7.3.4 Accounting for Physico-chemistry Properties

7.3.5 Assessing Quality of Matches

7.4 Applications

7.4.1 Case 1: Alcohol Dehydrogenase and Family

7.4.2 Case 2: 17-β Hydroxysteroid Dehydrogenase and Family

7.4.3 Case 3: Alcohol Dehydrogenase and Superfamily

7.4.4 Case 4: Alcohol Dehydrogenase and FAD/NAD(P)-binding Domain

7.4.5 Assessing MCMC Refinement

7.5 Comments

8 Conclusions and Further Work

8.1 Conclusions

8.1.1 Functional Sites

8.1.2 Simulating Random Protein Structures

8.1.3 Matching Algorithms

8.1.4 Concomitant Information

8.1.5 Hardening Soft Matches

8.1.6 Assessing Significance of Matches

8.1.7 Application: Matching NAD Binding Functional Sites

8.2 Further Work

8.2.1 Simulating Random Protein Structures

8.2.2 Matching Statistics

8.2.3 Matching Algorithms

8.2.4 Application: Matching NAD Binding Functional Sites

Bibliography

A Computational Cost

A.1 Processor Times

A.2 Comments

List of Figures

1.1 Peptide bond.

1.2 Functional site in 5-aminolaevulinate dehydratase protein structure.

1.3 RasMol ball representation of Cα for functional sites of 17-β hydroxysteroid dehydrogenase and carbonyl reductase.

2.1 The K-function and inter-point distance distribution for a functional site of 17-β hydroxysteroid dehydrogenase.

2.9 The K-function and inter-point distance distribution for Cα atoms in 17-β hydroxysteroid dehydrogenase structure.

2.13 The K-function and inter-point distance distribution for all atoms in 17-β hydroxysteroid dehydrogenase.

3.1 Virtual distances and angles in a protein backbone.

3.2 Distance constraints in a protein virtual backbone.

3.3 Orientation of Cα atoms in simulating protein short chains.

3.4 Typical chain realisations in short protein chain simulations without hydrophobic effects.

3.5 Chain realisations in short protein chain simulations with hydrophobic effects.

3.6 The K-function and inter-point distance distribution for a simulated hardcore configuration.

3.7 The K-function and inter-point distance distribution for a simulated short chain configuration.

4.1 RMSD against number of corresponding points.

4.2 RMSD histogram, approximate and empirical distribution functions.

5.1 Correct correspondence proportions for different hardening methods.

5.2 Illustrative example of data-driven weights for matching.

5.3 Correct correspondence proportions for various weighting schemes. Bayesian: simple prior conditional probabilities.

5.4 Correct correspondence proportions for various α levels.

5.5 Convergence regions of starting values for EM algorithm.

5.6 Superposition of carbonyl reductase and 17-β hydroxysteroid dehydrogenase sites when matching with EM algorithm.

5.7 Match scores and RMSD when using weights.

6.1 Acyclic graph for Bayesian hierarchical model.

6.2 Corresponding amino acids found by MCMC method where the graph theoretic method gives worse solutions.

6.3 True correspondence proportions for MCMC, graph and EM algorithm methods for hardcore and Poisson model data.

6.4 True correspondence proportions for MCMC and graph for hardcore data with large variance.

6.5 Corresponding amino acids in matching functional sites of 17-β hydroxysteroid dehydrogenase and carbonyl reductase using Cα and Cβ atoms in MCMC and graph theoretic methods.

6.6 Corresponding amino acids in matching functional sites of 17-β hydroxysteroid dehydrogenase and carbonyl reductase using Cα atoms only in MCMC and graph theoretic methods.

7.1 RMSD against number of corresponding amino acids for matching alcohol dehydrogenase NAD-binding site against NAD(P)(H) binding sites of SCOP alcohol dehydrogenase-like family proteins with/without amino acid property information.

7.2 Effect of MCMC refinement on graph matches of the NAD-binding functional site of alcohol dehydrogenase against NAD(P)(H) binding sites of SCOP alcohol dehydrogenase-like family proteins.

7.3 Corresponding amino acids between the NAD-binding site of alcohol dehydrogenase and NADP-binding site of quinone oxidoreductase before and after MCMC refinement.

7.4 Corresponding amino acids between the NAD-binding site of alcohol dehydrogenase and NADP-binding site of hypothetical protein YhdH before and after MCMC refinement step.

7.5 Effect of MCMC refinement on graph matches of 17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H) binding sites of SCOP tyrosine dependent oxidoreductase family proteins.

7.6 RMSD against number of corresponding amino acids for matching 17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H) binding sites of SCOP tyrosine dependent oxidoreductase family proteins with/without amino acid property information.

7.7 Superposition of matching amino acids between alcohol dehydrogenase and glyceraldehyde-3-phosphate dehydrogenase binding sites after MCMC refinement.

List of Tables

3.1 Frequencies of amino acids.

3.2 Relative mutabilities of amino acids.

3.3 Amino acid substitution matrix.

3.4 Target (desired) distances between Cα atoms in simulated short chains of proteins.

4.1 Best fitting functional sites in the database when matched against 5-aminolaevulinate dehydratase functional site.

5.3 Example functional sites for comparing results when using or not using concomitant information in the EM algorithm.

5.4 Groups of amino acids.

5.5 Comparison of with and without colour matching results.

5.6 Matching results using Gold (2003) method.

5.7 Matching statistics for 17-β hydroxysteroid dehydrogenase and representative functional sites with/without colour information in EM algorithm and graph methods.

5.8 Proportions of correct correspondence and rotation errors when matching forms with two transformations.

6.1 Matching statistics for 17-β hydroxysteroid dehydrogenase and families representative functional sites using graph and MCMC methods (cases with MCMC doing better).

6.4 Matching statistics for 17-β hydroxysteroid dehydrogenase and families representative functional sites using graph and MCMC methods (cases with graph doing better).

6.8 Number of same pairs in MCMC and graph solutions when using Cα atoms only and when using both Cα and Cβ atoms.

7.1 Assessment of statistical significance of functional site matching before and after MCMC refinement step with/without amino acid property information.

7.2 RMSD (Å) before and after MCMC refinement step without amino acid property.

7.3 The number of matched amino acids before and after MCMC refinement step without amino acid property.

7.4 RMSD (Å) before and after MCMC refinement step with amino acid property.

7.5 The number of matched amino acids before and after MCMC refinement step with amino acid property.

A.1 Database search times.

Abbreviations and Acronyms

CATH: Class, Architecture, Topology and Homologous superfamily

CE: Combinatorial Extension

DALI: Distance (matrix) alignment

DP: Dynamic programming

eF-site: electrostatic surface of Functional site

FAD: Flavin Adenine Dinucleotide

FSSP: Families of Structurally Similar Proteins

LA: Linear assignment

LP: Linear programming

MCMC: Markov chain Monte Carlo

NAD(P)(H): Nicotinamide Adenine Dinucleotide (Phosphate)

PDB: Protein Data Bank

PINTS: Patterns In Non-homologous Tertiary Structures

pvSoar: pocket and void surfaces of amino acid residues

RMSD: Root mean square deviation

SCOP: Structural Classification of Proteins

SitesBase: Database of ligand binding site similarities

SITESDB: Sites Database

About this Thesis

This thesis is on statistical approaches to matching proteins. We consider matching functional sites of proteins using atom coordinates, with the types of amino acids as concomitant information.

We also consider matching configurations of coupled points. By coupled points we mean two spatially dependent points, i.e. points which are, say, physically connected, e.g. two atoms from the same amino acid. This arises in the Bioinformatics application of aligning functional sites, when amino acids are matched using two atoms from each amino acid to take into account the relative orientation of the amino acid.

We are also interested in statistics which measure the quality of the matching.

Overview and Organisation

The thesis is divided into eight chapters. We introduce the problem and review the current literature in Chapter 1. In particular, we briefly describe the graph theoretic formulation of the structural similarity problem (Gold et al., 2003) in section 1.2.1.

Exploratory analysis of functional sites is presented in Chapter 2. Simulation of proteins and functional sites is considered in Chapter 3. Work on matching statistics is presented in Chapter 4. We consider the EM algorithm for the mixture model formulation of the problem (Taylor et al., 2003) in Chapter 5, section 5.1. In section 5.2 we investigate the added value of using concomitant information in the mixture model. An application of the graph theoretic method and the EM algorithm to representatives of functional sites from the tyrosine dependent oxidoreductase family is discussed in section 5.2.4. We give some of the results for the graph theoretic and MCMC (Green and Mardia, 2006) methods in section 6.1.7. An extension of the Bayesian alignment method of Green and Mardia (2006) to matching coupled points is presented in section 6.2. In Chapter 7 we undertake a Bayesian refinement of graph solutions for matching protein functional sites.

Chapter 8 gives conclusions on the work in this thesis and outlines possible future work.

Research Goals and Aims

The two main aims of this research are to investigate:

(a) Optimising correspondence and alignment between two configurations using concomitant information.

(b) Statistics and their distributions for matching configurations.

Correspondence and alignment

We aim to investigate and develop methods for solving correspondence and alignment between configurations with concomitant information. This aim is multi-objective: we want a maximal number of corresponding points that are concordant with respect to the concomitant information, and also a close geometrical alignment of the configurations. We investigate effective ways of optimising correspondence and alignment with regard to these objectives.

Optimisation of multi-objective problems

When faced with a multi-objective problem, the traditional approach is to construct a composite objective function that incorporates all the individual objectives. This is called preference-based multi-objective optimisation. The other approach is multi-objective optimisation per se.

Preference-based multi-objective optimisation

The composite objective is usually a weighted sum of all the objectives. This way of handling multi-objective optimisation is much simpler. Its disadvantage is the subjectivity in choosing the weights (Deb, 2001). This subjectivity is particularly acute in our problem of protein matching, where there is no clear indication as to how to weight the two objectives. Protein 3-dimensional structure comparisons are mainly based on the root mean square deviation (RMSD). Sometimes a score derived from amino acid type matches alone is used (Gold, 2003). There is no universally accepted score function that combines amino acid type matches and RMSD.
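
To illustrate the subjectivity in the choice of weights, the following minimal Python sketch scores two hypothetical candidate matches with a weighted sum. The class `Match`, the function names, the weights and the numerical values are all illustrative assumptions and are not taken from this thesis or from Gold (2003).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A candidate match summarised by two objectives (hypothetical values)."""
    name: str
    rmsd: float        # geometric misfit, lower is better
    n_same_type: int   # matched pairs sharing an amino acid type, higher is better


def composite_score(match, w_rmsd, w_type):
    """Weighted sum to be maximised; RMSD enters with a negative sign."""
    return w_type * match.n_same_type - w_rmsd * match.rmsd


def best_match(matches, w_rmsd, w_type):
    """Return the candidate with the largest composite score."""
    return max(matches, key=lambda m: composite_score(m, w_rmsd, w_type))


# Two made-up candidates: one geometrically tight, one with more type agreements.
candidates = [
    Match("tight", rmsd=0.5, n_same_type=3),
    Match("concordant", rmsd=1.5, n_same_type=6),
]

# The selected match changes when only the weights change.
geometry_first = best_match(candidates, w_rmsd=5.0, w_type=1.0).name
type_first = best_match(candidates, w_rmsd=1.0, w_type=1.0).name
print(geometry_first, type_first)
```

With these made-up numbers, a heavy penalty on RMSD selects the geometrically tight match, while equal weights select the match with more amino acid type agreements, so the preferred solution depends entirely on the weights.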

Multi-objective optimisation

Recently, with the advent of evolutionary algorithms (EAs), there has been a rise in interest in multi-objective optimisation per se. This approach does not lead to one solution but to a set of optimal solutions called Pareto-optimal solutions. Users choose one of the obtained solutions using higher-level information. Refer to Deb (2001) for a detailed discussion of multi-objective optimisation. This approach would be very relevant to protein 3-dimensional structure matching.
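
As an illustration of Pareto optimality in this setting, the sketch below filters a set of hypothetical candidate matches, each summarised by its RMSD (to be minimised) and its number of amino acid type agreements (to be maximised). The function names and the data are assumptions made for illustration only. No evolutionary algorithm is involved, only the dominance check that defines the Pareto-optimal set.

```python
def dominates(a, b):
    """True if a dominates b.

    Each solution is a pair (rmsd, n_same_type): rmsd is minimised and
    n_same_type is maximised. a dominates b if it is no worse in both
    objectives and strictly better in at least one.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


# Hypothetical candidate matches as (rmsd, n_same_type) pairs.
solutions = [(0.5, 3), (1.5, 6), (1.6, 5), (0.9, 3), (1.0, 5)]
front = pareto_front(solutions)
print(front)
```

Here (1.6, 5) is dominated by (1.5, 6) and (0.9, 3) is dominated by (0.5, 3), so three candidates remain. Choosing among them still requires the higher-level information mentioned above.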

Other objective functions

Energy function derivatives are also popular choices for optimisation in protein 3-dimensional structure matching.

Statistics and distributions

The second aim of this research is to investigate and develop statistics for matching 3-dimensional configurations in general and protein structures in particular. We investigate these statistics and their distributions under “random” and “non-random” configuration hypotheses.

Contributions

In this thesis, there are a few contributions towards matching geometrical configurations in general and functional sites in particular.

(a) Exploration of the functional sites using spatial statistics tools. In Chapter 2, we explore the spatial characteristics of functional sites using tools from spatial statistics. Here we learn from point pattern analysis that functional sites tend to be elongated tubular structures rather than isolated points in space. The author's contributions include doing the computations and statistical analyses.

(b) In Chapter 3, section 3.1.2, we propose an alternative to the method of Aszodi and Taylor (1994) for simulating short virtual peptide chains. The method of Aszodi and Taylor (1994) iterates between “distance space” and “Euclidean space” (coordinates). We experiment with a similar but simplified method based on coordinates only. The author's contributions are

• Derivation of the formula for coordinates given the conformational angle.

• Strategy for modelling the hydrophobic effect.

(c) Goodness-of-fit statistics in section 4.1, Chapter 4, are used for comparing the quality of matches with different numbers of matched points. Using p-values we arrive at similar conclusions to those obtained using the score proposed by Gold (2003). The author's contribution was to derive the approximate distribution for RMSD, starting from the size-and-shape distribution (Dryden and Mardia, 1998) under the isotropic Gaussian error model for matching configurations.

(d) In section 5.2 of Chapter 5 we formulate a mixture model with concomitant information for matching functional sites. This is an extension of the method in Kent et al. (2004). In section 5.3, we constrain matches in order to get better solutions. Another contribution on matching is the framework for allowing multiple transformations (section 5.4) in alignment. The author's contributions are

• Formulating a model and likelihood with concomitant information, where the concomitant information is assumed to be independent of the geometrical information.

• Introducing techniques to constrain matches in order to get better matching solutions.

• Formulating the model and likelihood for multiple transformations.

(e) In section 6.2 we extend Green and Mardia (2006) to matching configurations of coupled points. The author's contribution in this work is how to take into account the dependence between atoms from the same amino acid.

(f) We present a new method in Chapter 7 for matching protein functional sites, based on initial graph matching followed by refinement using a Markov chain Monte Carlo (MCMC) procedure in a Bayesian hierarchical modelling framework. The author's contributions in this work include

• Formulation to account for side chains.

• The meta algorithm (adding a refinement step to the graph theoretic method).

• Extending the software implementation of MCMC to account for two atoms.

• Modifying the graph theoretic software to account for physico-chemistry properties.

• Computations and statistical analysis of the results.

Conference Papers

• Mardia, K.V., Green, P.J., Nyirongo, V.B., Gold, N.D. and Westhead, D.R. (2006). Bayesian refinement of protein functional site matching. Submitted.

• Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2005). EM algorithm, Bayesian and distance approaches to matching active sites. Mathematical and Statistical Annual Meeting in Bioinformatics, Abstracts pp. 13-14. Rothamsted.

• Mardia, K.V. and Nyirongo, V. (2004). Procrustes statistics for unlabelled points and applications. In R.G. Aykroyd, S. Barber, and K.V. Mardia (Eds.), Bioinformatics, Images, and Wavelets, p. 137. Department of Statistics, University of Leeds.

• Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2003). Protein Matching Using Amino Acids Information. In R.G. Aykroyd, K.V. Mardia and M.J. Langdon (Eds.), Stochastic Geometry, Biological Structure and Images, p. 147. Department of Statistics, University of Leeds.

Chapter 1

Introduction and Literature Review

Matching and aligning 3-dimensional protein structures is an active area of research in Bioinformatics. This involves developing algorithms for matching, as well as statistics, and their distributions, for quantifying the quality of matching and alignment. In this chapter we give some background to the research problem being addressed. In section 1.2 we also highlight current literature on the topic.

1.1 Introduction

The main aim of this research is to use statistical approaches for matching configurations of points in 3-dimensional space. The research problem is formulated mathematically in section 1.1.1.

1.1.1 Mathematical Abstraction of the Problem

We have two point configurations, $\{\mu_i\}$ and $\{x_j\}$ in $\mathbb{R}^d$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Without loss of generality we can assume $n \le m$ (see section 5.1.1).

In addition to coordinates, the points have attributes, e.g. colour. The colours of the points are the concomitant information. We require to match these configurations in some defined optimal way. Matching in this case means finding the corresponding points and the rigid body motion required to bring the configurations into registration or superimposition.

An optimal matching is one with (in no particular order of importance):

(a) a maximised number of corresponding points $q \le n \le m$;

(b) a minimised average distance between the corresponding points under a rigid body transformation of the coordinates; and

(c) a maximised number of corresponding points with the same or a similar colour (attribute).

It is not certain how much importance to attach to each requirement in these vaguely defined optimality criteria. Informally, a match is regarded as best if as many corresponding points as possible are as close as possible geometrically, and as many of them as possible are similarly coloured.

1.1.2 Motivation

This work is particularly motivated by an application in Structural Bioinformatics,

where pair-wise or multiple matching of 3-dimensional structures of proteins is of

interest. Sometimes matching just the functional parts of proteins (functional sites) is of importance. We give a brief introduction to proteins and functional sites in the following sections.

1.1.3 Proteins

Proteins are essential for the functioning of living organisms (Branden and Tooze,

1999; Lesk, 2000). Proteins perform a wide variety of functions in an organism. For

convenience, proteins can be divided into several major classes including but not

limited to structural proteins, transport proteins, messenger proteins and enzymes.

The most familiar of the structural proteins are probably keratins, which form

the protective covering of all land vertebrates: skin, fur, hair, wool, claws, nails,


hooves, horns, scales, beaks and feathers. Equally widespread are actin and myosin

proteins of muscle tissue. Another group of structural proteins are the silks and

insect fibres. In addition, there are collagens of tendons and hides, which form

connective ligaments within the body and give extra support to the skin where

needed.

Transport proteins include serum albumin, haemoglobin and myoglobin. Serum

albumin transports water-insoluble lipids in the bloodstream. Haemoglobin carries

oxygen from the lungs to the tissue. Myoglobin performs a similar function in muscle

tissue, taking oxygen from the haemoglobin in the blood and storing it or carrying

it around until needed by the muscle cells.

Messenger proteins are one of the means by which cells in one part of the body

communicate with cells in another part of the body. They are generally quite small for proteins. Many are hormones, although not all hormones are proteins. Two examples are oxytocin, which occurs in females and stimulates uterine contractions during childbirth, and vasopressin, whose major function is as an anti-diuretic.

Each function or use demands its own protein structure, and interactions depend on the 3-dimensional configuration, that is, the set of 3-dimensional coordinates of all atoms. Protein structure is, however, described at four different levels:

• primary:

the sequence of amino acids.

• secondary:

repeated patterns of local three-dimensional structure in the amino acids (α-

helix, β-sheet/β-strands).

• tertiary:

the full three-dimensional structure of a peptide chain, described as atomic

coordinates or conformational angles (φ and ψ).

• quaternary:

one or more peptide chains which together form the fully functional protein.


The main challenge is how to infer the structure components as well as the

function from the primary level of the amino acids. These are the problems of

protein structure and function prediction. There are various approaches to study

proteins.

(a) Biophysical approach:

simulate the action of the physical laws that operate when the polypeptide chain folds into the 3-dimensional structure; search over all possible combinations and, among them, for those with the lowest energy.

(b) Sequence based approach:

use the information from the sequence of amino acids to match directly.

(c) Homology approach:

proteins with homologous structure have a similar 3-dimensional structure and

function but serious exceptions exist.

(d) Combination:

combine (a), (b) and (c) with physico-chemical properties and evolutionary relationships.

Following Mardia et al. (2003), here we will simply define

$$\text{Protein} = \{C_1, C_2, \ldots, C_k\}$$

as an unordered set of $k$ peptide chains $C_i$, where $C_i = \{s_{i1}, \ldots, s_{iN_i}\}$ is an ordered sequence of amino acid residues $s_{ij} \in \{P_1, \ldots, P_{20}\}$, $j = 1, \ldots, N_i$, and $P_l$ is the $l$th amino acid type, $l = 1, \ldots, 20$. Note that $N_i$ typically lies between 200 and 2000.

An amino acid residue is a set of atoms (and covalent bonds). This atom set can be partitioned into backbone atoms $B$ (the same for every residue type) and side-chain atoms $R_l$ (differing between residue types), so that $P_l = \{B, R_l\}$. Figure 1.1 shows two amino acids ($s_i$ and $s_{i+1}$) joined by a “peptide bond”.

The peptide chain may be known only at the sequence level, where the identities $s_{ij}$ of the amino acid residues are known but there is no information


Figure 1.1: Peptide bond joining two amino acids. [Diagram labels: residues Si and Si+1 with side chains Ri and Ri+1, backbone atoms N, Cα and C′, angles φ and ψ, the peptide bond, and the released H2O.]

about three-dimensional structure. This is commonly the type of information that

emerges from genome sequencing projects. In a minority of cases, three dimensional

structure information may be available in the form of x, y and z Cartesian coordi-

nates for all the protein atoms. Information about the association of peptide chains

into complete proteins (quaternary structure) may be available in some cases.

The amino acids can be labelled by the side chain (Ri) which takes one of 20

types. For example with Ri = H we have glycine, with Ri+1 = CH3 we have alanine.

These are sometimes also referred to as peptide units. Each peptide unit can only rotate around the N-Cα and Cα-C′ bonds; the corresponding rotation angles φ and ψ are also of interest.

Amino acids have different physico-chemical properties and can be grouped according to shared properties, e.g. hydrophobic or hydrophilic (see Table 5.4 for one possible grouping). Hydrophobic amino acids are those with side-chains

that do not like to reside in an aqueous (i.e. water) environment. For this reason,

these amino acids are generally buried within the hydrophobic core of a protein. On the other hand, non-hydrophobic (hydrophilic) amino acids tend to interact with

the aqueous environment and are predominantly found on the exterior surfaces of

proteins or in the reactive centres. This property is more important for transport

proteins. These proteins are often globular structures and are generally tightly


packed (compact) with hydrophilic (polar) side chains on the outside to enhance

their solubility in water. They typically have hydrophobic (non-polar) side chains

folded to the inside to keep water from getting in and unfolding them. In section

3.1.2 we take into account hydrophobic/hydrophilic properties of the side chains in

order to simulate globular, compact structures. We also take into account physico-chemical properties in matching functional parts of proteins (see section 1.1.4) in

Chapters 5, 6 and 7.

The Swiss-Prot data bank contained sequence data for more than 212,425 proteins as of 21st March, 2006. Protein 3-dimensional structures derived from X-ray diffraction and neutron-diffraction studies of crystallised proteins are housed at the Protein Data Bank (PDB). As of 28th March, 2006 the PDB held about 35,813 structures, which can be accessed at http://www.rcsb.org.

1.1.4 Functional Sites

Although proteins are large molecules, in many cases only a small part of the structure, a functional site (e.g. Figure 1.2), is functional; the rest exists only to create and fix the spatial relationship among the amino acids of the functional site. The term functional site refers to both active sites and binding sites. An active site is a part of a protein where chemical reactions occur, while a binding site is a region which binds specific ligands (smaller molecules). For example, Figure 1.2 shows a functional site in the 5-aminolaevulinate dehydratase protein structure.

1.1.5 The SITESDB Database

In this thesis all functional sites were taken from a database of known sites, SITESDB (Gold, 2003). SITESDB had 91,441 entries (functional sites) as of 28th March, 2006. The median and mean numbers of amino acids per site were 10 and 16 respectively. The lower and upper quartiles were 10 and 19, and the range was from 1 to 120.

SITESDB entries were automatically formed from the PDB (Berman et al., 2000) by locating the local protein environment (amino acids within 5 Å) around bound


Figure 1.2: Functional site in 5-aminolaevulinate dehydratase protein structure.

ligands (identified by PDB HETATM records) and author annotated active sites

(identified by PDB SITE records). A protein may contain multiple functional sites, so unique identifiers for SITESDB entries were generated from the four-character PDB identifier with an extra integer to distinguish sites from the same protein. For example, the identifiers 1hdx_0 and 1hdx_1 denote separate sites from the protein with PDB identifier 1hdx.

The automatic extraction of sites results in multiple and incomplete representa-

tions of functional sites containing more than one bound ligand, or sites that are

both annotated with SITE records and contain bound ligands. In these cases a

better biochemical description of the site was obtained by merging component sites

without duplication of their amino acid contents. Sites were merged if ligand atoms

occurred within 5 Å of atoms in a second ligand (cf. the 5-5 rule in Park et al., 2001;

Dafas et al., 2004; Gong et al., 2005). In the absence of bound ligands, sites were

merged if they were found to contain common amino acid residues.


Availability

SITESDB is accessible at http://www.bioinformatics.leeds.ac.uk (hosted by

the Institute of Molecular and Cellular Biology, University of Leeds). The database

currently contains more than 90,000 functional sites.

1.1.6 Structure comparisons

The 3-dimensional structure of a protein is very important in understanding how the protein functions, as proteins with similar 3-dimensional structures are likely to have related functions. Comparing 3-dimensional protein structures is therefore very important: a newly determined protein structure that is similar to a protein with a known function is likely to have a similar function, which facilitates function prediction for the new structure. Another useful application is protein homology detection. Structure comparison can complement sequence similarity, which is commonly used for homology modelling. Homology refers to proteins having descended from a common ancestor. Three-dimensional comparisons are particularly valuable because structures are more conserved than amino acid sequences in homologous proteins.

Although overall protein structure comparisons are very useful in some applications (see the literature review in section 1.2), they sometimes have difficulty with proteins that share similar structures and are clearly related in evolution, yet have different functions. The reverse case, proteins with similar functions but different structures, also presents difficulties: for example, overall fold comparison misses the functional similarity of subtilisin and chymotrypsin (Blow et al., 1969; Wright et al., 1969). Thus, to complement fold comparisons, we consider comparing the functional sites of proteins.

1.1.7 Objectives in matching protein structures

To appreciate the difficulty involved in matching protein structures or parts thereof, consider the configurations of functional site Cα atoms in Figure 1.3, from the 17-β-hydroxysteroid dehydrogenase and carbonyl reductase proteins. These functional sites are related, but which and how many atoms correspond is unknown. Moreover, it is not always known a priori whether the functional sites are related at all. Our aim is to match the atoms of these configurations. Functional site matching has two objectives:

(a) To match the proteins geometrically so as to minimise the root mean square deviation (RMSD),

$$r(x, y) = \left[ \frac{1}{q} \sum_{i=1}^{q} \|x_i - y_i\|^2 \right]^{1/2},$$

where $x_1, \ldots, x_q$ are $q$ points of configuration $\{x\}$ and $y_1, \ldots, y_q$ are the corresponding $q$ points of configuration $\{y\}$. The matched proteins should come as close as possible (minimal RMSD) when the configurations are superimposed on each other.

(b) To maximise the number of matches between similar residues.

These objectives often conflict. Hence the question is how to optimise this multi-objective matching problem.
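As a small illustration of objective (a), the RMSD for a fixed correspondence can be computed directly from its definition. The following is a minimal Python sketch; the function name `rmsd` and the toy coordinates are illustrative assumptions and are not taken from the thesis software.

```python
import math

def rmsd(x, y):
    """Root mean square deviation between two lists of q matched points."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    total = 0.0
    for xi, yi in zip(x, y):
        # Squared Euclidean distance between the i-th matched pair
        total += sum((a - b) ** 2 for a, b in zip(xi, yi))
    return math.sqrt(total / len(x))

# Toy example: every matched pair is exactly one unit apart along z
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
y = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(x, y))  # 1.0
```

Note that this only evaluates the criterion for a given correspondence and a given superposition; finding the correspondence and the rigid body motion is the actual matching problem.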

Figure 1.3: RasMol (Sayle and Milner-White, 1995) ball representation of Cα atoms for the functional sites of (a) 17-β-hydroxysteroid dehydrogenase (1a27_0, 63 atoms) and (b) carbonyl reductase (1cyd_0, 40 atoms).


1.2 Literature Review

Whole domain structural comparison methods such as CE (Shindyalov and Bourne,

1998) and DALI (Holm and Sander, 1993) and databases such as FSSP (Holm et al.,

1992), CATH (Orengo et al., 1997) and SCOP (Hubbard et al., 1997) provide valu-

able insight into the functions of newly determined proteins. However, discovery of

proteins adopting similar folds but exhibiting a variety of functions i.e. superfolds

(Orengo et al., 1994) and proteins showing similar functions without common an-

cestry (Blow et al., 1969; Wright et al., 1969) poses problems for comparisons at the

fold level. Note that SCOP hierarchical classification consists of class, (super)fold

and (super)family.

Protein function is usually carried out by relatively small parts of protein surfaces

at ligand binding or catalytic sites and hence new structural comparison methods

focus on the precise structural nature of these functional sites (Artymiuk et al., 1994;

Binkowski et al., 2003; Kinoshita et al., 2002; Kleywegt, 1999; Shulman-Peleg et al.,

2004; Stark et al., 2003b; Wallace et al., 1997). These methods are based on the

idea that geometrically similar sites are likely to have similar functions since their

amino acids are conserved in precise orientations in order to perform their chemistry

or their similar shapes and physico-chemical properties may be selective for similar

small molecules such as substrates, inhibitors or cofactors. Hence, finding structural

similarity to functional sites of known and characterised proteins may facilitate

function prediction for newly determined protein structures even in the absence of

overall fold or sequence similarity.

Functional site comparison methods essentially fall into one of two categories.

The first category provides known templates of specific motifs of conserved amino

acids or atoms often involved in enzyme catalysis (Artymiuk et al., 1994; Kley-

wegt, 1999; Wallace et al., 1997). These are knowledge-based methods which aim

to discover new proteins with the same catalytic function. The second category

consists of similarity searching algorithms (Binkowski et al., 2003; Schmitt et al.,

2002; Shulman-Peleg et al., 2004; Stark et al., 2003b; Kinoshita et al., 1999) where


prior knowledge of motifs is not required and site similarity is assessed by how

closely the sites align and/or the proportion of overlap. Partial similarity between

sites can be detected and hence much larger sites such as ligand binding sites can

be compared. Methods addressing this problem generally represent functional sites

or functional site surfaces as mathematical graphs for graph-theoretic or geomet-

ric hashing comparisons where graph vertex positions are placed using a variety of

methods. CavBase (Schmitt et al., 2002), SiteEngine (Shulman-Peleg et al., 2004)

and PINTS (Stark et al., 2003b) for example use positions of pseudo-centres whereas

eF-site (Kinoshita et al., 1999) uses electrostatic potentials and surface curvature.

pvSoar (Binkowski et al., 2003) and SitesBase (Gold and Jackson, 2006) use alpha-

shapes and an all-atom model respectively. Recently, Green and Mardia (2006)

proposed a Bayesian hierarchical modelling approach using Cα atoms.

1.2.1 Matching and Superposition Algorithms

Finding the correspondence is intrinsically a combinatorial problem. Without geometric constraints there are

$$\sum_{q=1}^{\min(n,m)} q! \binom{n}{q} \binom{m}{q}$$

ways of choosing corresponding pairs from two configurations with $m$ and $n$ points. However, with geometric constraints,

the solution space is tremendously reduced. Matching methods exploit geometric

constraints to solve for correspondence.
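To get a feel for how quickly the unconstrained count grows, the sum can be evaluated directly. The following is a minimal Python sketch; the function name `n_correspondences` is an illustrative assumption.

```python
from math import comb, factorial

def n_correspondences(n, m):
    """Number of ways to choose q >= 1 corresponding pairs, summed over q."""
    return sum(factorial(q) * comb(n, q) * comb(m, q)
               for q in range(1, min(n, m) + 1))

print(n_correspondences(2, 2))  # 6
print(n_correspondences(3, 3))  # 33
```

For configurations of the size of typical functional sites (tens of points) the result is astronomically large, which is why exhaustive enumeration without geometric constraints is not an option.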

To show how geometric constraints make the correspondence problem feasible,

Kuhl et al. (1984) presented a naive brute force approach for matching a molecule

to a functional site.

A naive brute force method (Kuhl et al., 1984)

With the requirement that matching pairs are geometrically as close as possible, all

degrees of freedom are expended when three pairs of matches are made. Thus after

making three matches, one simply checks the coincidence of the other points. Suppose the two configurations are $\{x_j\}$ and $\{\mu_i\}$, $j = 1, \ldots, n$ and $i = 1, \ldots, m$. Kuhl et al. (1984), in their “DOCK” algorithm, proceed as follows:


(a) For each unique set of three pairings $(\{i_1, j_1\}, \{i_2, j_2\}, \{i_3, j_3\})$ of points from the two configurations:

i. Superpose the first pair by a translation $b$.

ii. Find the rotation $A_1$ that brings the second pair into optimal superposition.

iii. Find the rotation $A_2$ that superposes the third pair.

iv. This gives $x_{j_l} = A_1 A_2 \mu_{i_l} + b$, $l = 1, 2, 3$. Matching pairs are the coinciding points (closest and within a defined distance of each other) of $\{x\}$ and $\{A_1 A_2 \mu + b\}$.

v. Calculate the number of matched pairs and the RMSD.

(b) The solution is the combination which gives the largest number of matches. If several solutions have the same number of matching pairs, the one with the smallest Procrustes distance may be taken.

The algorithm of Kuhl et al. (1984) goes through $mn(m-1)(n-1)(m-2)(n-2)$ combinations, i.e. $mn$ ways to choose $b$, $(m-1)(n-1)$ ways to choose $A_1$ and $(m-2)(n-2)$ ways to choose $A_2$. Some of these combinations are unnecessary, because the order of the three pairings is not important: just $mn(m-1)(n-1)(m-2)(n-2)/3!$ combinations are needed.
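As a check on the reduced count, note that it coincides with the $q = 3$ term of the sum given at the start of this section: an unordered set of three pairings is obtained by choosing three points from each configuration and pairing them in one of $3!$ ways. The numerical illustration uses the site sizes quoted for Figure 1.3 and is only meant to indicate the order of magnitude.

```latex
3!\binom{m}{3}\binom{n}{3}
  = 6 \cdot \frac{m(m-1)(m-2)}{6} \cdot \frac{n(n-1)(n-2)}{6}
  = \frac{mn(m-1)(n-1)(m-2)(n-2)}{3!},
\qquad
\text{e.g. } m = 63,\ n = 40:\quad
\frac{238{,}266 \times 59{,}280}{6} = 2{,}354{,}068{,}080 \approx 2.4 \times 10^{9}.
```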

A few more efficient approaches to matching and superimposing in Bioinformatics applications exist in the literature. These efficient matching methods mainly fall into two categories:

(a) Algorithms iterating between solving for alignment and correspondence. Alignment and correspondence support each other, making the problem solvable in a reasonable time. These algorithms include the EM algorithm considered by Kent et al. (2004), which is presented in section 5.1. Wu et al. (1998) also use an iterative algorithm.

Also in this category is the approach of Green and Mardia (2006), who take a Bayesian approach whereby they formulate a joint model for alignment and correspondence. The conditional models for alignment and correspondence are updated in turn. This framework is presented in section 6.1.

(b) Combinatorial algorithms which utilise inter-point distance constraints. These

distance-based methods use graph theoretic algorithms to solve for correspon-

dence. Kuhl et al. (1984) proposed to use a graph algorithm of Bron and Ker-

bosch (1973) for matching a molecule binding to a functional site. Gold (2003)

implemented a parallelised database search tool on a Beowulf system, using ei-

ther Bron and Kerbosch (1973) or Carraghan and Pardalos (1990) graph clique

detecting algorithms to match functional sites.

Below we briefly describe the graph-theoretic approach taken by Gold (2003) and, for the iterative category, the approach of Wu et al. (1998).

Graph method (Gold et al., 2003)

The principles of graph theory have been applied to matching biomolecular configurations for some time (e.g. Kuhl et al., 1984; Artymiuk et al., 1994). Consider points as representing amino acid positions. These points could have attributes (concomitant information) representing amino acid groups or types. We require to match two configurations of points $\{x_j\}$ and $\{y_k\}$ for $j = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$.

• Each configuration is represented by a mathematical graph.

• Vertices are placed at point positions.

• Each vertex is connected by an edge to every other vertex in the same graph.

• Each edge is labelled with the inter-point distance.

A search for the maximum similarity between two graphs $G_1$ and $G_2$, representing configurations $\{x\}$ and $\{y\}$ respectively, corresponds to finding the maximal common subgraph, i.e. a clique within the vertex product graph of $G_1$ and $G_2$ ($H_v = G_1 \circ_v G_2$). The vertex product graph is defined as follows:


Definition 1.2.1. Let $V_1$ and $V_2$ be the sets of vertices of $G_1$ and $G_2$ respectively. The vertex product graph $H_v = G_1 \circ_v G_2$ has vertex set $V_H \subseteq V_1 \times V_2$, consisting of the vertex pairs $(x_j, y_k)$ with $x_j \in V_1$ and $y_k \in V_2$ that have the same attribute. An edge between two vertices $v_h = (x_j, y_k)$ and $v_{h'} = (x_{j'}, y_{k'})$ in $V_H$ exists for $j \neq j'$ and $k \neq k'$ if the absolute difference between the distances $|x_j - x_{j'}|$ and $|y_k - y_{k'}|$ is less than some threshold, say $\delta = 1.5$ Å.
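Definition 1.2.1 and the subsequent clique search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `product_graph` and `largest_clique` are ours, the clique search is a basic Bron-Kerbosch recursion without pivoting (not the implementation of Gold (2003)), and the toy coordinates and attribute labels are invented.

```python
import math
from itertools import combinations

def product_graph(x, x_lab, y, y_lab, delta=1.5):
    """Adjacency sets of the vertex product graph of two labelled configurations."""
    # Vertices: index pairs (j, k) whose points carry the same attribute
    vertices = [(j, k) for j in range(len(x)) for k in range(len(y))
                if x_lab[j] == y_lab[k]]
    adj = {v: set() for v in vertices}
    for (j, k), (j2, k2) in combinations(vertices, 2):
        if j != j2 and k != k2:
            # Edge if the two inter-point distances agree to within delta
            if abs(math.dist(x[j], x[j2]) - math.dist(y[k], y[k2])) < delta:
                adj[(j, k)].add((j2, k2))
                adj[(j2, k2)].add((j, k))
    return adj

def largest_clique(adj):
    """Largest clique found by enumerating all maximal cliques (Bron-Kerbosch)."""
    best = []

    def expand(r, p, excluded):
        nonlocal best
        if not p and not excluded:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], excluded & adj[v])
            p = p - {v}
            excluded = excluded | {v}

    expand([], set(adj), set())
    return sorted(best)

# Toy data: y contains a translated copy of x plus one unrelated point
x = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
x_lab = ["H", "D", "S"]
y = [(10, 0, 0), (13, 0, 0), (10, 4, 0), (20, 20, 20)]
y_lab = ["H", "D", "S", "H"]
print(largest_clique(product_graph(x, x_lab, y, y_lab)))
# [(0, 0), (1, 1), (2, 2)]
```

Because only inter-point distances enter the edge rule, a mirror image of `x` would produce exactly the same clique, so a distance-based match still has to be checked for superimposability.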

Graph matches based on inter-point distances are not necessarily superimposable (e.g. mirror image sites). Subsequently, a Procrustes algorithm (Kabsch, 1978) is used to check that the matched configurations are geometrically superimposable. The Procrustes algorithm minimises the size-and-shape squared (least squares) distance between two structures, say $X_1$ and $X_2$. The size-and-shape squared distance is

$$d_S^2(X_1, X_2) = \inf_{A \in SO(d),\, b} \|X_2 - A X_1 - b\|^2.$$

Here $d = 3$, $SO(d)$ denotes the set of all $d \times d$ rotation matrices (orthogonal matrices with determinant equal to $+1$) and $b$ is the translation vector.

Basically the algorithm is a three step process:

(a) Construct a vertex product graph.

(b) Find a maximal clique within the product graph.

(c) Check the 3-dimensional superimposition using Kabsch (1978).
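Step (c) can be illustrated as follows. The sketch below is not the Kabsch (1978) procedure itself: to stay within the Python standard library it finds the optimal proper rotation with the closely related unit-quaternion formulation of Horn (1987), using power iteration for the leading eigenvector. Both approaches solve the same least-squares problem over rotations with determinant +1. The function names and the toy coordinates are illustrative assumptions.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[a] for p in points) / n for a in range(3)]

def optimal_rotation(p, q):
    """Proper rotation A minimising sum ||q_i - A p_i||^2 for centred p and q."""
    # Cross-covariance s[a][b] = sum_i p_i[a] * q_i[b]
    s = [[sum(u[a] * v[b] for u, v in zip(p, q)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Symmetric 4x4 matrix whose leading eigenvector is the optimal unit quaternion
    n = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift the spectrum to be non-negative so power iteration finds the largest eigenvalue
    shift = max(sum(abs(v) for v in row) for row in n) or 1.0
    quat = [1.0, 0.5, 0.25, 0.125]  # arbitrary starting vector
    for _ in range(2000):
        quat = [sum(n[r][c] * quat[c] for c in range(4)) + shift * quat[r]
                for r in range(4)]
        norm = math.sqrt(sum(v * v for v in quat))
        quat = [v / norm for v in quat]
    w, x, y, z = quat
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def superpose(x1, x2):
    """Return (A, b, rmsd) for the rigid body motion x2 ~ A x1 + b."""
    c1, c2 = centroid(x1), centroid(x2)
    p = [[pt[a] - c1[a] for a in range(3)] for pt in x1]
    q = [[pt[a] - c2[a] for a in range(3)] for pt in x2]
    A = optimal_rotation(p, q)
    b = [c2[r] - sum(A[r][c] * c1[c] for c in range(3)) for r in range(3)]
    sq = 0.0
    for u, v in zip(x1, x2):
        moved = [sum(A[r][c] * u[c] for c in range(3)) + b[r] for r in range(3)]
        sq += sum((moved[r] - v[r]) ** 2 for r in range(3))
    return A, b, math.sqrt(sq / len(x1))

x1 = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
x2 = [(5, -1, 2), (5, 0, 2), (3, -1, 2), (5, -1, 5)]  # x1 rotated 90 degrees about z, then shifted
mirror = [(-a, b, c) for a, b, c in x1]               # mirror image of x1
print(round(superpose(x1, x2)[2], 6))   # 0.0
print(superpose(x1, mirror)[2] > 0.1)   # True
```

The mirror image has exactly the same inter-point distances as `x1`, yet no rotation superimposes it, which is the situation the superposition check is meant to detect.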

In the least restrictive case all vertices (points) are assumed to have the same attribute, and hence matching can occur between any two points and depends only on inter-point distances. Alternatively, points can be labelled with colours (concomitant information) to restrict matching to points with the same colour, i.e. colour is treated as an attribute. Although concomitant information can be incorporated as attributes in this way, the approach is very rigid. When matching functional sites, Gold (2003) takes the amino acid type into account by introducing a score (presented in section 3.2.2). The algorithm of Bron and Kerbosch (1973) finds all common subgraphs in addition to the clique, so concomitant information can be used to score all the common subgraphs and give preference to the solution with the highest score: Gold (2003) scores all complete subgraphs found by the Bron and Kerbosch (1973) algorithm and takes the one with the maximal score. The algorithm of Carraghan and Pardalos (1990), however, finds just the clique, so it is not possible to use concomitant information with it. Gold (2003) uses the algorithm of Carraghan and Pardalos (1990) because it is faster; optionally, the algorithm of Bron and Kerbosch (1973) can be used in order to account for concomitant information.

Iterative algorithm (Wu et al., 1998)

This method is for analysing multiple protein structures; it performs both superposition and averaging. The algorithm iterates between solving for correspondence and superposition. Correspondence is solved by dynamic programming and superposition by least squares regression.

Dynamic programming

Dynamic programming is used to align two sequences; specifically, it finds the correspondence between two structures that minimises the overall distance between them. Let $i$ and $j$ be the sequence indices of atoms in structures $\{\mu_i\}$ and $\{x_j\}$ respectively, for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. Let $d(j, i)$ be some distance metric between atoms $x_j$ and $\mu_i$. Then we can find two collinear sequences of atoms $1 \le j(1) < j(2) < \cdots < j(q) \le n$ and $1 \le i(1) < i(2) < \cdots < i(q) \le m$ that minimise the function

$$\sum_{r=1}^{q} d(j(r), i(r)) + g(0, j(1)) + \sum_{r=1}^{q-1} h(j(r), j(r+1)) + g(j(q), n+1) + g(0, i(1)) + \sum_{r=1}^{q-1} h(i(r), i(r+1)) + g(i(q), m+1), \tag{1.1}$$

where $g(r, s)$ is the gap penalty for skipping from $r$ to $s$ at the end of either sequence, and $h(r, s)$ is the gap penalty for skipping from $r$ to $s$ in the middle of either sequence.

The algorithm makes two attempts to find correspondence within each iteration.

Firstly it uses curvature at each Cα as a distance metric for matching. Secondly it

matches using coordinates of Cα as distance metric.

Dynamic programming is used to find a correspondence between two structures that minimises the overall distance between the structures. The premise behind the algorithm is that an optimal correspondence can be constructed by adding two aligned elements to a previously obtained optimal alignment. This insight means that it is not necessary to search all possible alignments in order to obtain the optimal one, which for two sequences of length $n$ would take time proportional to $n4^n$. Rather, dynamic programming sequentially adds elements to optimal alignments that have already been constructed, which reduces the time cost to just $O(n^2)$.
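The recursion can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Wu et al. (1998): it assumes a single constant penalty `gap` per skipped atom, i.e. the special case $g(r, s) = h(r, s) = \text{gap} \times (s - r - 1)$ of the penalties in (1.1), and the function name `align` and the toy data are ours.

```python
def align(d, gap):
    """Order-preserving correspondence minimising matched distances plus gap penalties.

    d[j][i] is the distance between atom j of {x} (n atoms) and atom i of {mu} (m atoms).
    Returns (total cost, list of matched index pairs (j, i)).
    """
    n, m = len(d), len(d[0])
    # cost[j][i]: best cost of aligning the first j atoms of x with the first i atoms of mu
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        cost[j][0] = j * gap
    for i in range(1, m + 1):
        cost[0][i] = i * gap
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            cost[j][i] = min(cost[j - 1][i - 1] + d[j - 1][i - 1],  # match j with i
                             cost[j - 1][i] + gap,                  # skip atom j of x
                             cost[j][i - 1] + gap)                  # skip atom i of mu
    # Trace back from the corner to recover the matched pairs
    pairs = []
    j, i = n, m
    while j > 0 and i > 0:
        if cost[j][i] == cost[j - 1][i - 1] + d[j - 1][i - 1]:
            pairs.append((j - 1, i - 1))
            j -= 1
            i -= 1
        elif cost[j][i] == cost[j - 1][i] + gap:
            j -= 1
        else:
            i -= 1
    pairs.reverse()
    return cost[n][m], pairs

# Toy example with scalar features; the third element of a has no counterpart in b
a = [1.0, 2.0, 9.0, 3.0]
b = [1.0, 2.0, 3.0]
d = [[abs(u - v) for v in b] for u in a]
print(align(d, gap=1.0))  # (1.0, [(0, 0), (1, 1), (3, 2)])
```

The table has $(n+1)(m+1)$ entries and each is filled in constant time, which is where the quadratic cost comes from.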

Mechanics of the iterative algorithm (Wu et al., 1998)

In general the algorithm allows superposition of multiple proteins. Let $X_j$ be a coordinate matrix of corresponding atoms in the $j$th protein structure. Each column of $X_j$ represents an atom in the protein structure. In the least-squares formulation, they find an affine model $X$ and transformation matrices $A_j$ (for $j = 1, \ldots, J$) that minimise the objective function

$$\sum_{j=1}^{J} \|A_j X_j - X\|^2. \tag{1.2}$$

The algorithm consists of three steps:

(a) Compute a curvature function $\kappa$ for each protein structure $S_j$. Find corresponding landmarks $X_j^{(1)}$ by matching curvatures to a reference structure, and obtain the affine model $X^{(1)}$ and transformation matrices $A_j^{(1)}$ for $j = 1, \ldots, J$.

(b) Find corresponding landmarks $X_j^{(2)}$ by matching coordinates to a reference structure, and obtain the affine model $X^{(2)}$ and transformation matrices $A_j^{(2)}$.

(c) Find corresponding landmarks $X_j$ by matching coordinates iteratively to the evolving affine model, and obtain the affine model $X$ and transformation matrices $A_j$.

The iterative algorithm of Wu et al. (1998) assumes and uses sequence order information in addition to spatial information in the form of point coordinates, while the graph method of Gold (2003) uses spatial information in the form of inter-point distances. A common problem with these approaches is that they do not take concomitant information on amino acids into account in a way flexible enough to model the amino acid substitutions that take place in proteins.

1.2.2 Extreme Values in Bioinformatics

Minimum RMSD in protein matching

Extreme values of RMSD arise at two levels in Bioinformatics protein matching applications. The first level is the pair-wise matching of two configurations, say $\{\mu\}$ and $\{x\}$, with $m$ and $n$ points respectively. In this case an RMSD that is optimal in some sense is sought. The optimal RMSD could be defined as:

(a) the minimum RMSD among the $q!\binom{m}{q}\binom{n}{q}$ values, where $q \in \{2, \ldots, \min(n, m)\}$ is the number of matched points;

(b) such that, after alignment, distances between matching points are within a specified tolerance limit;

(c) and $q$ is maximal.

In Chapters 5 and 6 we look for the corresponding points and alignment that give the minimum RMSD.

The second level arises when searching a database for a match. Here the interest is in matches with small RMSD. For example, best-fitting matches are analysed in Chapter 4, where we develop a method to rank the best matches. In Chapter 7 we follow Stark et al. (2003b) in using the Extreme Value Distribution (EVD) to quantify the probability of matching by chance. In this set-up the null hypothesis is that the matching configurations are random and the match is due to mere chance (a random match); that is, the matching configurations are not related in any way.

Extreme value distribution as null distribution

In Bioinformatics applications, e.g. sequence matching or structure matching, the sample space under the random matching hypothesis is practically infinite and difficult to specify or calculate. Consider two random 3-dimensional structures. These can be of any sizes, say $m, n = 1, 2, \ldots$, and give matches of size $q = 2, 3, \ldots$. Each structure with $m > 2$ or $n > 2$ points can take an infinite number of configurations. Following Stark et al. (2003b), a practical way to specify the random distribution is to collect a large enough database of non-redundant and non-homologous configurations. The database distribution is then used as the null distribution under the random hypothesis. The database has to be non-homologous and non-redundant to control the Type I error rate correctly. Ideally the background database should be as large as possible.

Because the interest is in the extremes from an infinitely large database, limiting Extreme Value Distributions (EVD) are used to model the background database distribution. For example, let the distribution of RMSD, r, be F(r) and denote its limiting distribution by G(r). Because the limiting EVD depends only weakly on the data-generating distribution function F(r), the null distribution can be modelled reliably even when F(r) is difficult to calculate, let alone specify. All that is required is to know how F(r) depends on m, n and q.

Limiting extreme value distributions

Extremal types distributions are limiting distributions used to model extreme deviations from the mean of probability distributions for stochastic processes.

Two approaches exist today:

(a) The tail fitting approach, currently the most common, which is based on the second theorem in extreme value theory (Theorem II; Pickands, 1975; Balkema and de Haan, 1974).

(b) The basic theory approach as described by Burry (1975). In general this conforms to the first theorem in extreme value theory (Theorem I; Fisher and Tippett, 1928; Gnedenko, 1943).

The difference between the two theorems is due to the nature of the data generation. For Theorem I the data are generated over the full range, while for Theorem II data are only generated when they surpass a certain threshold (Peaks Over Threshold, or POT, models).

There are three classes of limiting distributions for extreme values:

Gumbel

$$G(r) = \exp\{-\exp(-r)\}, \quad -\infty < r < \infty;$$

Frechet

$$G(r) = \begin{cases} 0, & r \le 0; \\ \exp(-r^{-\alpha}), & r > 0,\ \alpha > 0. \end{cases}$$

Negative Weibull

$$G(r) = \begin{cases} \exp[-(-r)^{\alpha}], & r < 0,\ \alpha > 0; \\ 1, & r \ge 0. \end{cases}$$

These classes are unified by re-parametrisation to give the Generalised Extreme Value distribution, GEV(µ, σ, ξ), with distribution function

$$G(r) = \exp\left\{-\left[1 + \xi\left(\frac{r - \mu}{\sigma}\right)\right]_+^{-1/\xi}\right\} \qquad (1.3)$$

where $x_+ = \max(x, 0)$ and σ > 0, so up to type the GEV distribution is

$$G(r) = \exp\left[-(1 + \xi r)_+^{-1/\xi}\right]. \qquad (1.4)$$

• Gumbel corresponds to ξ = 0 (taken as the limit ξ → 0), i.e. GEV(0, 1, 0) = Gumbel;

• Frechet corresponds to ξ > 0, i.e. GEV($1, \alpha^{-1}, \alpha^{-1}$) = Frechet(α);

• Negative Weibull corresponds to ξ < 0, i.e. GEV($-1, \alpha^{-1}, -\alpha^{-1}$) = Negative Weibull(α).
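As an illustration of how the GEV distribution function in (1.3) behaves, the following is a minimal Python sketch, not the implementation used in this thesis. The function name `gev_cdf`, the cut-off below which ξ is treated as zero, and the example values are assumptions made for the example.

```python
import math

def gev_cdf(r: float, mu: float = 0.0, sigma: float = 1.0, xi: float = 0.0) -> float:
    """Distribution function of GEV(mu, sigma, xi); xi = 0 is read as the Gumbel limit."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (r - mu) / sigma
    if abs(xi) < 1e-12:  # assumed cut-off for treating xi as exactly zero
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # outside the support: below the lower end-point (xi > 0)
        # or above the upper end-point (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

gumbel = gev_cdf(0.5)                # xi = 0
near_gumbel = gev_cdf(0.5, xi=1e-6)  # a small xi approaches the Gumbel value
```

Comparing `gumbel` with `near_gumbel` shows numerically that the Gumbel class is recovered as ξ → 0, while the sign of ξ decides whether the support is bounded below or above.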

Type identification

For particular, well-known F(r), the type of limiting distribution can be derived. For example, normal and log-normal give rise to Gumbel, while Student's t and uniform give Frechet and Negative Weibull respectively. In general, exponentially tailed distributions give the Gumbel type, algebraically tailed distributions give the Frechet type, and distributions with a finite end-point give the Negative Weibull type. The Frechet distribution is for positive random variables while the Negative Weibull is for negative random variables. This classification makes it easy to identify the right type of EVD to model the scores or measures. For example, the Frechet distribution is clearly the right type for RMSD (Stark et al., 2003b): RMSD values are positive and have a heavy-tailed distribution attenuated at zero.

Adjusting for database size

Because of the max-stability property of the GEV distribution, the modelled random distribution can be used for searches in a database of a different size by correcting the normalising constants. In general, if for an extreme $M_n$ of $r_i$, $i = 1, \dots, n$,

$$\frac{M_n - b_n}{a_n} \xrightarrow{D} \mathrm{EVD}(\mu, \sigma, \xi) \quad \text{for a random database of size } n \qquad (1.5)$$

then, using the domains of attraction principle, the normalising constants for searching a database of size n′ satisfy $1 - F(b'_n) = 1/n'$ and $a'_n = h(b'_n) = \{1 - F(b'_n)\}/f(b'_n)$, where f is the density of F. However, in Chapter 7 we use an ad hoc method (Stark et al., 2003b; Torrance et al., 2005) to adjust for the sample space since F(r) is unknown.

Chapter 2

Exploratory Analysis of Protein

Geometry

In this Chapter we are interested in learning some properties of proteins in general

and functional sites in particular. We consider spatial information in terms of only

point coordinates for functional sites and protein structure atoms. We explore spa-

tial arrangement of atoms in both functional sites and whole proteins using spatial

statistics tools.

2.1 Background

We are interested in point patterns, that is, the spatial positions of points. We would like to characterise, say, whether the points in a configuration are clustered, regular or random, and whether the intensity varies between regions.

2.1.1 Inter-event distances

Consider a configuration of points, $\{x_j\}$, j = 1, . . . , n. Inter-event or inter-point distances are $\|x_i - x_j\|$ for i, j = 1, . . . , n and i ≠ j. The inter-event distribution is

$$H(t) = P(\|X_i - X_j\| \le t).$$

Conditional on the number of events N(A) = n of a spatial point process N in the region of observation A, where N(A) = N∩A, the empirical inter-event distance distribution function (EDF) is written

$$\hat{H}(t) = \frac{1}{n(n-1)} \sum_{i \ne j} I\{\|x_i - x_j\| \le t\},$$

where $x_i$ are the events in the observed spatial point pattern and I{.} is an indicator function. If the theoretical inter-event distance distribution function, say $H_0(t)$, for a theoretical spatial point process is known, deviations of $\hat{H}(t)$ from $H_0(t)$ can be used to test the hypothesis that an observed point pattern is a realisation from the theoretical spatial point process.

In section 2.2 we visually (informally) compare $\hat{H}(t)$ for functional site Cα atoms to $H_0(t)$ for the complete spatial randomness (CSR) model, i.e. a uniform distribution of N points in the region A where N(A) ∼ Poisson(λ).
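The empirical inter-event distance distribution function can be computed directly from the coordinates. The following minimal Python sketch shows one way to do so; the function name `inter_event_edf` and the three example points are illustrative assumptions.

```python
from itertools import combinations
from math import dist

def inter_event_edf(points, t):
    """Proportion of ordered pairs (i != j) with ||x_i - x_j|| <= t."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two events")
    # each unordered pair appears twice among the n(n - 1) ordered pairs
    close = sum(1 for p, q in combinations(points, 2) if dist(p, q) <= t)
    return 2 * close / (n * (n - 1))

# hypothetical configuration with pairwise distances 3, 4 and 5
pts = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
h4 = inter_event_edf(pts, 4.0)
```

Evaluating the function on a grid of t values gives the curve that is compared, informally, with the corresponding curve under a reference model.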

2.1.2 Point to nearest event distances

Another statistical tool for characterising spatial point processes is the “point to nearest event distance”. Whereas inter-event distances involve all the events in the region, this type of analysis uses the distances $t_i$ from each of m sample points in A to the nearest of the n events. Thus the point to nearest event distance summarises local characteristics of the spatial point process.

The empirical distribution function (EDF), $\hat{F}(t) = m^{-1}\#(t_i \le t)$, measures the “empty spaces” in A, in the sense that $1 - \hat{F}(t)$ is an estimate of the volume (area) $|B_t|$, as a proportion of |A|, of the region $B_t$ consisting of all points in A at a distance of at least t from every one of the n events in A. Again $\hat{F}(t)$ can be compared to the theoretical distribution for a particular spatial point process of interest.
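A minimal Python sketch of the point to nearest event analysis is given below. The cubic region, the uniformly scattered events and sample points, the fixed seed and the function names are all assumptions made for illustration, not choices documented in this thesis.

```python
import random
from math import dist

def nearest_event_distances(sample_points, events):
    """Distance from each sample point to its nearest event."""
    return [min(dist(s, e) for e in events) for s in sample_points]

def empty_space_edf(distances, t):
    """Proportion of sample points whose nearest event lies within distance t."""
    return sum(d <= t for d in distances) / len(distances)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
side = 10.0             # hypothetical cubic observation region A = [0, side]^3
events = [tuple(rng.uniform(0, side) for _ in range(3)) for _ in range(30)]
samples = [tuple(rng.uniform(0, side) for _ in range(3)) for _ in range(200)]
t_i = nearest_event_distances(samples, events)
f_hat = [empty_space_edf(t_i, t) for t in (1.0, 2.0, 4.0)]
```

The values in `f_hat` increase with t, and one minus each value estimates the share of the region that is still empty at that distance.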

2.1.3 The K-function

The K-function was introduced by Bartlett (1964) and its potential and importance

for analysing point patterns was realised and developed by Ripley (1976, 1977).

For a stationary isotropic process, the K-function can be defined as

$$K(t) = \lambda^{-1} E(\text{number of points within distance } t \text{ of a randomly chosen point}), \quad t > 0,$$

where λ is the mean number of points per unit region.

The K-function provides a summary of spatial dependence over a wide range of scales of a pattern, including all event-event distances, not just the nearest neighbour distances. Since theoretical forms of the function are known for various possible spatial point process models, the K-function can be used to explore spatial dependence, to suggest specific models to represent the observed spatial point process, and to estimate the parameters of such models.

The estimator for K(t) is

$$\hat{K}(t) = n^{-2}|A| \sum_{i,j:\, i \ne j} \omega(x_i, u_{ij})^{-1} I_t(u_{ij}),$$

where $u_{ij}$ denotes the distance between the ith and the jth events in A, $\omega(x_i, u_{ij})$ is the proportion of the surface of the sphere with centre $x_i$ and radius $u_{ij}$ which lies within A, and $I_t(u_{ij})$ is an indicator function taking the value 1 if $u_{ij} \le t$ and 0 otherwise. The reciprocal weight compensates for neighbours that fall outside A and are therefore unobserved.

We consider the K-function and inter-event distance for Cα atoms.
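The structure of the K-function estimator can be sketched in a few lines of Python. This is a simplified illustration only: it takes every edge-correction weight ω as 1, which is an assumption that ignores boundary effects, and the function name `k_hat_uncorrected`, the region volume and the points are hypothetical.

```python
from itertools import combinations
from math import dist

def k_hat_uncorrected(points, volume, t):
    """n^{-2} |A| times the number of ordered pairs (i != j) with u_ij <= t.

    Every edge-correction weight is taken as 1 (no boundary correction).
    """
    n = len(points)
    close_pairs = sum(1 for p, q in combinations(points, 2) if dist(p, q) <= t)
    return volume * 2 * close_pairs / n ** 2

# hypothetical pattern in a region of volume 1000
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
k2 = k_hat_uncorrected(pts, volume=1000.0, t=2.0)
```

Without the weights the estimate is biased downwards near the boundary of A, so the sketch is suitable only for seeing how the pair counts enter the estimator.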

2.2 Functional Sites

We consider spatial distribution of Cα atoms of functional sites. We evaluate the

first and second order statistics for spatial processes:

(a) Three dimensional plot of an estimated density field.

(b) Point to nearest event distance frequency plot.

(c) The K-function. Plotted are normalised K-functions: $\hat{K}(t) - \pi t^2$.

(d) Inter-event distance cumulative function, $\hat{H}(t)$.

(e) Inter-event distance frequency plot.

Figures 2.1 - 2.8 depict these statistics for a random sample of 9 functional sites. In the estimated density fields we observe elongated tubes, which indicates dependence in the spatial positions of Cα atoms. This is confirmed when we compare the data with a homogeneous Poisson process. The points tend to fall in elongated strings, so the residues of functional sites probably tend to come from conserved motifs in a sequence.

The departure from a homogeneous Poisson process is again apparent when we compare with first and second order statistics for a homogeneous Poisson process. The empirical estimates (red graph) in the inter-event distance frequency plots clearly show an inhibition distance of about 4.0 Å (adjacent Cα atoms in a protein chain are in fact about 3.8 Å apart). There is also a peak in the inter-event distance histograms between 5 Å and 6 Å. This peak reflects the closest Cα atoms that are not adjacent in sequence order, that is, Cα atoms which have come close because two parts of the chain fold close to each other. This inhibition distance for atoms not forming a chemical bond is due to what is called the van der Waals radius¹. Aszodi and Taylor (1994) found an average value of 5.5 Å for this distance between Cα atoms in proteins. For simulations in Chapter 3, section 3.1.1, we conservatively use 5 Å as the inhibition distance to model the van der Waals radius between Cα atoms.

Inter-event distance cumulative function estimates, $\hat{H}(t)$, are outside the 95% C.I. envelope except for functional sites from carbonyl reductase (1cyd 1), 5-aminolaevulinate dehydratase (1b4e 0) and aspartate aminotransferase (1ajr 0). The same observation is made for the K-functions of these functional sites.

¹ The van der Waals radius of an atom is the radius of an imaginary hard sphere which can be used to model the atom for many purposes. Van der Waals radii are determined from measurements of atomic spacing between pairs of nonbonding atoms in molecules.

Figure 2.1: The K-function and inter-point distance distribution for Cα atoms in 17-β hydroxysteroid dehydrogenase functional site (1a27 0).

Figure 2.2: The K-function and inter-point distance distribution for Cα atoms in 5-aminolaevulinate dehydratase (from E. coli) functional site (1b4e 0).

Figure 2.3: The K-function and inter-point distance distribution for Cα atoms in subtilisin functional site (1bfk 0).

Figure 2.4: The K-function and inter-point distance distribution for Cα atoms in carbonyl reductase functional site (1cyd 1).

Figure 2.5: The K-function and inter-point distance distribution for Cα atoms in 1,3,8-trihydroxynaphthalene reductase functional site (1g0n 0).

Figure 2.6: The K-function and inter-point distance distribution for Cα atoms in mannitol dehydrogenase functional site (1h5q 0).

Figure 2.7: The K-function and inter-point distance distribution for Cα atoms in 5-aminolaevulinate dehydratase (from Baker’s yeast) functional site (1h7o 0).

Figure 2.8: The K-function and inter-point distance distribution for Cα atoms in glutaminase-asparaginase functional site (3pga 0).

2.3 Protein Structures

We also consider the spatial distribution of all Cα atoms in the structure for some of the functional sites considered in section 2.2 above. Plotted in Figures 2.9 - 2.12 are first and second order statistics for the spatial arrangement of Cα atoms in protein structures (parts a, c, d, e and f). These are a three dimensional plot of an estimated density field, a point to nearest event distance frequency plot, the K-function, the inter-event distance cumulative function and an inter-event distance frequency plot. Parts b are ribbon representations of secondary structures in RasMol (Sayle and Milner-White, 1995). Figure 2.13 is a plot of first and second order statistics for the spatial arrangement of all atoms in 17-β hydroxysteroid dehydrogenase (1a27).

Figure 2.9: The K-function and inter-point distance distribution for Cα atoms in 17-β hydroxysteroid dehydrogenase structure (1a27).

Figure 2.10: The K-function and inter-point distance distribution for Cα atoms in 5-aminolaevulinate dehydratase structure (1aw5).

Figure 2.11: The K-function and inter-point distance distribution for Cα atoms in 5-aminolaevulinate dehydratase structure from E. coli (1b4e).

Figure 2.12: The K-function and inter-point distance distribution for Cα atoms in carbonyl reductase structure (1cyd).

Figure 2.13: The K-function and inter-point distance distribution for all atoms in 17-β hydroxysteroid dehydrogenase (1a27).

Chapter 3

Simulation Design and Evaluation

of Algorithms

We will consider simulations to evaluate the performance of our approach and some

other known algorithms. Simulations are used to evaluate the correct correspondence

rate for matching methods in Chapters 5 and 6. We cover the simulation scheme

in section 3.1 while 3.2 covers topics on evaluation. Highlighted in section 3.1.2

are simulations of Aszodi and Taylor (1994), producing compact random structures

with a hydrophobic core.

3.1 Simulations

This section describes how we simulate functional sites and proteins to evaluate the performance of different algorithms.

3.1.1 Functional Sites Simulations

Functional site pairs with varying sizes were simulated. Each pair consisted of {µ} and {x}. The size of {x}, n, varied from 4 to 64 by steps of 4 (i.e. n = 32, 36, . . . , 64). The size of {µ} was taken to be m = ⌈1.1n⌉, with the additional 10% of the points in {µ} having no corresponding points in {x}. The choice of 10% should provide enough noise in the system to evaluate our matching algorithms. Luo and Hancock (2001) added up to about 10% of non-corresponding points to evaluate different matching algorithms.

Point set configurations.

For size, n = 4, 36, . . . , 64

(a) Simulate a hard-core configuration of m = ⌈1.1n⌉ points constituting {µ} in a $35^3$ cube. We uniformly sample points inside the cube and reject a point if it is within 5 units of any other point. The inhibition distance of 5 Å is to model the van der Waals radius in molecules. Aszodi and Taylor (1994) observed an average value of about 5.5 Å for the van der Waals radius between Cα atoms in protein molecules. We also observed an inhibition distance between 5 Å and 6 Å in the inter-event distance histograms for functional sites in section 2.2. The distance of 5 Å was chosen as this is a conservative threshold for interaction between two atoms, where the atoms are either Cα atoms or atoms in side chains (Park et al., 2001).

(b) We then randomly generate colours for these points according to the frequencies of amino acids in Table 3.1.

(c) Choose randomly (without replacement) n points from {µ}.

(d) From each of the chosen n points of {µ} simulate a point, $x \sim MN(\mu, 0.5 I_3)$. There is no preferred direction for the x, y and z coordinates, hence we assume an isotropic Gaussian. It is also biologically plausible to assume independence between the coordinates.

The colour of x is

i. First approach: just take the colour of µ (no mutation of colour).

ii. Second approach: simulate mutation to get the colour of x (see simulation of mutational process below).
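The geometric part of this scheme (steps (a), (c) and (d), without colours) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used for the thesis: the function names, the cap on rejection attempts, the fixed seed and the reading of 0.5 as the variance of each coordinate are all assumptions.

```python
import math
import random

def simulate_hardcore(m, side=35.0, d_inhibit=5.0, rng=random, max_tries=100_000):
    """Place m uniform points in a cube, rejecting any within d_inhibit of an accepted point."""
    pts = []
    tries = 0
    while len(pts) < m:
        tries += 1
        if tries > max_tries:  # assumed safeguard against an overcrowded cube
            raise RuntimeError("could not place all points")
        cand = tuple(rng.uniform(0.0, side) for _ in range(3))
        if all(math.dist(cand, p) >= d_inhibit for p in pts):
            pts.append(cand)
    return pts

def simulate_pair(n, var=0.5, rng=random):
    """Return ({mu}, {x}, idx), where x[k] is a noisy copy of mu[idx[k]]."""
    m = (11 * n + 9) // 10               # integer form of ceil(1.1 n)
    mu = simulate_hardcore(m, rng=rng)
    idx = rng.sample(range(m), n)        # n points chosen without replacement
    sd = math.sqrt(var)                  # 0.5 read as the per-coordinate variance
    x = [tuple(c + rng.gauss(0.0, sd) for c in mu[j]) for j in idx]
    return mu, x, idx

rng = random.Random(1)  # fixed seed so the sketch is reproducible
mu, x, idx = simulate_pair(32, rng=rng)
```

The list `idx` records the true correspondence, which is what a matching method would later be scored against.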

A set of m points in {µ} and n points in {x} constitute a pair of functional

sites. These pairs are used for evaluating the performance of matching methods as

outlined in section 3.2.1. We also use these configurations for studying the minimum

RMSD distribution in section 4.1.1 of Chapter 4.

Table 3.1: Frequencies of amino acids in the Accepted Point Mutation (PAM) Data.

Code  Name  Frequency    Code  Name  Frequency
N     Asn   0.040        H     His   0.034
S     Ser   0.070        R     Arg   0.040
D     Asp   0.047        K     Lys   0.081
E     Glu   0.050        P     Pro   0.051
A     Ala   0.087        G     Gly   0.089
T     Thr   0.058        Y     Tyr   0.030
I     Ile   0.037        F     Phe   0.040
M     Met   0.015        L     Leu   0.085
Q     Gln   0.038        C     Cys   0.033
V     Val   0.065        W     Trp   0.010

Evolution of amino acid classes.

We consider the model for evolutionary change in proteins of Dayhoff et al. (1978). Their model for amino acid interchanges is also applicable to functional site amino acids since it assumes that amino acid mutation is sequence independent. However, the actual frequencies of substitutions are different in functional sites. We assume that amino acid mutation is also independent of spatial position.

Accepted point mutations

An accepted point mutation is a replacement of one amino acid by another which is accepted by natural selection. To be viable, the new amino acid usually must function in a way similar to the old one. Chemical and physical similarities are found between the amino acids that are observed to interchange frequently. In the evolutionary change model, the likelihood of amino acid c replacing k is the same as that of k replacing c. As a result, no change in amino acid frequencies over evolutionary distance will be detected.

The probability that each amino acid will change in a given small evolutionary interval is called the “relative mutability” of the amino acid. The relative mutability of each amino acid is thus proportional to the ratio of changes to occurrences. Table 3.2 gives the relative mutabilities computed by Dayhoff et al. (1978).

Table 3.2: Relative mutabilities of amino acids (a).

Code  Name  Mutability    Code  Name  Mutability
N     Asn   134           H     His   66
S     Ser   120           R     Arg   65
D     Asp   106           K     Lys   56
E     Glu   102           P     Pro   56
A     Ala   100           G     Gly   49
T     Thr   97            Y     Tyr   41
I     Ile   96            F     Phe   41
M     Met   94            L     Leu   40
Q     Gln   93            C     Cys   20
V     Val   74            W     Trp   18

(a) The value for Ala has been arbitrarily set at 100.

Substitution matrix

Information about individual kinds of mutations and about the relative mutability of amino acids is combined into one time-dependent “mutation probability matrix”. An element of this matrix, $m_{ij}$, gives the probability that the amino acid in row i will be replaced by the amino acid in column j after a given evolutionary interval.

Evolutionary distance between proteins is measured in PAM (Percent Accepted Mutation). 1 PAM corresponds to an evolutionary distance of one amino acid change in every 100 amino acids. In addition to calculating mutabilities for amino acids, Dayhoff et al. (1978) also compiled data on Accepted Point Mutations. PAM substitution matrices are computed from these pieces of information. Table 3.3 shows a 1 PAM matrix.

Simulation of mutational process

The mutation probability matrix provides the information with which to simulate

any degree of evolutionary change in an unlimited number of proteins. Further, we

can start with one protein and simulate its separate evolution in duplicated genes

or in divergent organisms. By considering large numbers of such related sequences,

a measure is readily obtained of the expected deviations due to random fluctuations

in the evolutionary process.

Let us simulate the effect of 1 PAM of evolutionary change on a particular amino acid set. To determine the fate of the first amino acid, say alanine, we obtain a uniformly distributed random number between 0 and 1. The first row of the mutation probability matrix (Table 3.3) gives the relative probability of each possible event that may befall alanine (neglecting deletion for simplicity). If the random number falls between 0 and 0.9867, Ala is unchanged. If the number is between 0.9867 and 0.9868, it is replaced with Arg; if it is between 0.9868 and 0.9872, it is replaced with Asn, and so forth. Similarly, a random number is produced for each amino acid in the set, and action is taken as dictated by the corresponding row of the matrix. The result is a simulated mutant set. Any number of these can be generated; their average distance from the original is 1 PAM.
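The sampling step described above amounts to walking along the cumulative sums of a row of the matrix. The following minimal Python sketch illustrates this for the alanine row of Table 3.3; the names `AMINO`, `ALA_ROW` and `mutate`, and the fixed seed, are illustrative assumptions.

```python
import random

AMINO = "ARNDCQEGHILKMFPSTWYV"  # column order of Table 3.3
# row A of Table 3.3, i.e. probabilities multiplied by 10,000
ALA_ROW = [9867, 1, 4, 6, 1, 3, 10, 21, 1, 2, 3, 2, 1, 1, 13, 28, 22, 0, 1, 13]

def mutate(row, u):
    """Map a uniform number u in [0, 1) to an amino acid by walking the cumulative row."""
    target = u * sum(row)
    cumulative = 0
    for aa, weight in zip(AMINO, row):
        cumulative += weight
        if target < cumulative:
            return aa
    return AMINO[-1]  # guards against rounding when u is extremely close to 1

rng = random.Random(42)  # fixed seed so the sketch is reproducible
mutant = [mutate(ALA_ROW, rng.random()) for _ in range(10)]
```

Applying `mutate` with the appropriate row to every amino acid in a set gives one simulated mutant set at a distance of 1 PAM on average.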

The effects on the set of a longer period of evolution may be simulated by successive applications of the matrix to the set resulting from the last application. Alternatively, the matrix may be multiplied by itself repeatedly and applied once to the set. The two procedures produce mutant sets of the same average PAM distance from the initial set. Simulations in this thesis, e.g. in section 3.1.1, use PAM120, i.e. the 1 PAM matrix raised to the 120th power.

For simulations in which a predetermined number of changes are required, a

two-step process involving two random numbers for each mutation can be used.

Table 3.3: Substitution (mutation probability) matrix for the evolutionary distance of 1 PAM. An element of this matrix, $m_{ij}$, gives the probability that the amino acid in row i will be replaced by the amino acid in column j after a given evolutionary interval, in this case 1 accepted point mutation per 100 amino acids. Thus, there is a 0.56% probability that Asp (D) will be replaced by Glu (E). To simplify the appearance, the elements are shown multiplied by 10,000. Taken from Dayhoff et al. (1978).

A R N D C Q E G H I L K M F P S T W Y V

A 9867 1 4 6 1 3 10 21 1 2 3 2 1 1 13 28 22 0 1 13

R 2 9913 1 0 1 9 0 1 8 2 1 37 1 1 5 11 2 2 0 2

N 9 1 9822 42 0 4 7 12 18 3 3 25 0 1 2 34 13 0 3 1

D 10 0 36 9859 0 5 56 11 3 1 0 6 0 0 1 7 4 0 0 1

C 3 1 0 0 9973 0 0 1 1 2 0 0 0 0 1 11 1 0 3 3

Q 8 10 4 6 0 9876 35 3 20 1 6 12 2 0 8 4 3 0 0 2

E 17 0 6 53 0 27 9865 7 1 2 1 7 0 0 3 6 2 0 1 2

G 21 0 6 6 0 1 4 9935 0 0 1 2 0 1 2 16 2 0 0 3

H 2 10 21 4 1 23 2 1 9912 0 4 2 0 2 5 2 1 0 4 3

I 6 3 3 1 1 1 3 0 0 9872 22 4 5 8 1 2 11 0 1 57

L 4 1 1 0 0 3 1 1 1 9 9947 1 8 6 2 1 2 0 1 11

K 2 19 13 3 0 6 4 2 1 2 2 9926 4 0 2 7 8 0 0 1

M 6 4 0 0 0 4 1 1 0 12 45 20 9874 4 1 4 6 0 0 17

F 2 1 1 0 0 0 0 1 2 7 13 0 1 9946 1 3 1 1 21 1

P 22 4 2 1 1 6 3 3 3 0 3 3 0 0 9926 17 5 0 0 3

S 35 6 20 5 5 2 4 21 1 1 1 8 1 2 12 9840 32 1 1 2

T 32 1 9 3 1 2 2 3 1 7 3 11 2 1 4 38 9871 0 1 10

W 0 8 1 0 0 0 0 0 1 0 4 0 0 3 0 5 0 9976 2 0

Y 2 0 4 0 3 0 1 0 4 1 2 1 0 28 0 2 2 1 9945 2

V 18 1 1 1 2 1 2 5 1 33 15 1 4 0 2 2 9 0 1 9901

Starting with a given sequence, the first amino acid that will mutate is selected:

the probability that any one will be selected is proportional to its mutability (Table

3.2). Then the amino acid that replaces it is chosen. The probability for each

replacement is proportional to elements in the appropriate row of the substitution

matrix. Starting with the resultant set, a second mutation can be simulated, and

so on, until a predetermined number of changes have been made. In this process,

superimposed and back mutations may occur.

Although these substitution matrices might not apply very well to functional sites, the simulated data are adequate for evaluating the methodology. For real protein matching applications, one can consider other substitution matrices, such as the one developed for spatially conserved locations (Naor et al., 1996).

3.1.2 Whole Structure Simulations

To generate random structures, Aszodi and Taylor generate a chain of points repre-

senting Cα atoms in the main chain which is folded into a 3-dimensional structure

by distance geometry methods.

Chain properties

Simulate a chain of Cα atoms with the following properties:

Chain geometry:

Figure 3.1 shows the geometry of a virtual chain.

• Virtual bond length: an inhibition distance of 3.8 Å between successive points in the chain, since adjacent Cα atoms are separated by roughly this distance in proteins (Aszodi and Taylor, 1994; Jeong et al., 2006).

• Non-coincidence of atom centres due to atom volume and van der Waals forces: an inhibition distance of $d_{bump} = 2r_{vdW}$ between two non-successive atoms. $d_{bump}$ was set to 5.5 Å in the simulations. This distance was chosen so as to correspond to the van der Waals radius for Cα atoms observed in protein molecules and to give the correct average residue density.

• Virtual bond angles, β: the virtual angle at each Cα atom formed by the virtual bonds from its right and left Cα neighbours is $\beta = 2\arcsin\left(\frac{d_2}{2l}\right)$, where $d_2$ is the distance between the right and left neighbours and l is the virtual bond length. In proteins β ∈ [π/2, π]. Aszodi and Taylor fix $\beta = 2\arcsin\left(\frac{d_2}{2l}\right)$ with $d_2$ an average observed distance between each Cα atom and its second neighbour. They argue for averaging $d_2$ to avoid geometric bias towards secondary structure formation. The simulations used $d_2$ = 6.0 Å (observed $d_2$ = 6.0 ± 0.4 Å from 84 protein structures).

• Virtual bond torsion angle, θ: this angle was allowed to randomly take any value in the interval [−π, π].

These bonds and angles are “virtual” because they do not exist in proteins, i.e. Cα atoms are not directly connected in protein backbones.

Figure 3.1: Virtual distances and angles in a protein backbone.

Chain biochemistry:

Each Cα atom was randomly assigned a binary hydrophobicity property i.e. hy-

drophobic or not.

Folding

Instead of minimising an energy function (see Weiner et al., 1984; Pereira De Araujo, 1999; Chhajer and Crippen, 2002; Jaramillo et al., 2002), the idea here is to fold the chain in such a way as to achieve a target distance matrix for inter-residue (between amino acid) distances. The target distance matrix specifies the preferred distances between Cα atoms and accounts for hydrophobicity. Hydrophobic amino acids tend to cluster together. This phenomenon is called the hydrophobic effect and is one of the major driving forces for protein folding. The hydrophobic effect is the tendency to shield the hydrophobic amino acids away from the surface. A preferred distance matrix is one with shorter inter-residue distances among hydrophobic amino acids.

The target distance matrix used is given in Table 3.4:

Table 3.4: Target (desired) distances between Cα atoms in simulated short chains of proteins.

Pair Type                  d_des (Å)   Strictness/Preference
Hydrophobic/hydrophobic    6.0         0.5 . . . 1.0
Hydrophobic/hydrophilic    8.0         0.1
Hydrophilic/hydrophilic    10.0        0.1

Distance constraint

Chain geometry requires that the distance between non-successive Cα atoms should in general be above $d_{bump} = 2r_{vdW}$ = 5.5 Å. The other constraint is that the distance between the ith and jth Cα atoms is maximal if the chain connecting them is in its extended conformation, i.e. all in-between torsion angles θ equal π. The maximal distance, say $d_{max,s}$, depends on s = |i − j| only. Figure 3.2 illustrates the calculation of distance constraints. The algorithm is initialised with three points $(0, -d_2/2, 0)$, $(0, d_2/2, 0)$ and $(\sqrt{l^2 - d_2^2/4}, 0, 0)$, where l = 3.8 Å and $d_2$ = 6.0 Å.

Recursively $d_{max,s}$ is

$$d_{max,1} = l, \qquad d_{max,2} = d_2.$$

If s is even:

$$d_{max,s} = s\, d_{max,2}/2,$$

else if s is odd:

$$\begin{aligned}
d_{max,s} &= \left[\left(d_{max,s-1} + l\cos\left(\tfrac{\pi-\beta}{2}\right)\right)^2 + l^2\sin^2\left(\tfrac{\pi-\beta}{2}\right)\right]^{1/2} \\
&= \left[d^2_{max,s-1} + 2 d_{max,s-1}\, l\cos\left(\tfrac{\pi-\beta}{2}\right) + l^2\cos^2\left(\tfrac{\pi-\beta}{2}\right) + l^2\sin^2\left(\tfrac{\pi-\beta}{2}\right)\right]^{1/2} \\
&= \left[d^2_{max,s-1} + 2 d_{max,s-1}\, l\cos\left(\tfrac{\pi-\beta}{2}\right) + l^2\right]^{1/2}.
\end{aligned} \qquad (3.1)$$
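The virtual bond angle and the recursion for the maximal distances are easy to evaluate numerically. The following minimal Python sketch does so for l = 3.8 Å and d2 = 6.0 Å; the constant and function names are illustrative assumptions.

```python
import math

L_BOND = 3.8  # virtual bond length l, in angstroms
D2 = 6.0      # average distance to the second neighbour, in angstroms
BETA = 2.0 * math.asin(D2 / (2.0 * L_BOND))  # virtual bond angle, about 1.82 rad

def d_max(s: int) -> float:
    """Maximal distance between C-alpha atoms s positions apart (extended chain)."""
    if s < 1:
        raise ValueError("s must be a positive integer")
    if s % 2 == 0:
        return s * D2 / 2.0
    prev = (s - 1) * D2 / 2.0  # d_max for the even separation s - 1 (0 when s = 1)
    c = math.cos((math.pi - BETA) / 2.0)
    return math.sqrt(prev ** 2 + 2.0 * prev * L_BOND * c + L_BOND ** 2)

table = {s: round(d_max(s), 3) for s in range(1, 7)}
```

Since cos((π − β)/2) = sin(β/2) = d2/(2l), the cross term for s = 3 equals d2 squared, so the odd case can be checked by hand against the square root of 2 d2² + l².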

Figure 3.2: Distance constraints in a protein virtual backbone.

The hydrophobic effect was modelled by moving all hydrophobic amino acids

towards the centre e.g. by 20%, and all hydrophilic amino acids were moved outward

by a smaller amount e.g. 5%.

Algorithm

The algorithm has two phases. The first phase is in “Distance Space”, whereby inter-residue distances are updated:

$$d^{(new)}_{ij} = (1 - s_{ij})\, d^{(old)}_{ij} + s_{ij}\, d^{(des)}_{ij} \qquad (3.2)$$

where $s_{ij}$ is the level of strictness for the target distance $d^{(des)}_{ij}$ between residues i and j from Table 3.4. Here $d^{(des)}_{ij}$ only depends on whether the ith and jth Cα atoms belong to hydrophilic or hydrophobic amino acids. In this phase, distance constraints are checked and any violations corrected. In the case of a violation, $d^{(new)}_{ij}$ is set to either $d_{bump}$ or $d_{max,s}$ (whichever is closer).
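A single distance-space update for one pair of non-successive residues can be sketched in Python as below. The function name, its argument names and the numerical values are hypothetical, and the sketch assumes that the bump distance does not exceed the maximal distance for the pair.

```python
def update_distance(d_old, d_des, strictness, d_bump, d_max_s):
    """One distance-space step for a single non-successive pair (i, j)."""
    d_new = (1.0 - strictness) * d_old + strictness * d_des
    # clamping moves a violating value to the nearer of the two bounds
    return min(max(d_new, d_bump), d_max_s)

# hydrophobic/hydrophobic pair with separation s = 4 (maximal distance 12.0)
d1 = update_distance(d_old=11.0, d_des=6.0, strictness=0.5, d_bump=5.5, d_max_s=12.0)
# a value pulled below the bump distance is reset to d_bump
d2 = update_distance(d_old=4.0, d_des=6.0, strictness=0.5, d_bump=5.5, d_max_s=12.0)
```

A strictness close to 1 pulls the distance almost all the way to its target in one step, while a strictness of 0.1 moves it only slightly.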

The second phase is in “Euclidean Space”. In this step, the distance matrix is used to specify the 3-dimensional coordinates of the structure. Again, after projecting into the 3-dimensional Euclidean space, distance constraints are checked and any anomalies are corrected.

These steps are iterated until convergence. The criterion for convergence is based on either distance or constraint scores:

Distance score

This is a sum of squared relative differences between targeted and actual distances,

weighted by strictness values:

$$Q_{\mathrm{dist}} = \sqrt{\frac{\sum_{i < j} s_{ij}\left((d_{ij} - d^{(\mathrm{des})}_{ij})/d^{(\mathrm{des})}_{ij}\right)^2}{\sum_{i < j} s_{ij}}}. \tag{3.3}$$

Constraint score

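This is the square root of the sum of squared relative violations of the minimum (bump) and maximum distance constraints:

$$Q_{\mathrm{cons}} = \sqrt{\sum_{i < j}\left(E_{\mathrm{bump},ij} + E_{\max,ij}\right)} \tag{3.4}$$

where

$$E_{\mathrm{bump},ij} = \begin{cases} \left(\dfrac{d_{\mathrm{bump}} - d_{ij}}{d_{\mathrm{bump}}}\right)^2, & \text{if } d_{ij} < d_{\mathrm{bump}} \\ 0, & \text{otherwise} \end{cases}$$

$$E_{\max,ij} = \begin{cases} \left(\dfrac{d_{\max,s} - d_{ij}}{d_{\max,s}}\right)^2, & \text{if } d_{ij} > d_{\max,s},\ s = |i - j| \\ 0, & \text{otherwise.} \end{cases}$$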

The convergence criterion is met if either the absolute value or the relative change of the score is below a preset minimum.
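The two scores in equations (3.3) and (3.4) can be computed as in the sketch below. The function names and the convention of passing flat arrays with one entry per pair i < j are assumptions made for illustration; `d_max` holds the value of d_max,s appropriate to each pair.

```python
import numpy as np

def distance_score(d, d_des, s):
    """Equation (3.3): strictness-weighted root mean squared relative deviation."""
    rel = (d - d_des) / d_des
    return np.sqrt(np.sum(s * rel ** 2) / np.sum(s))

def constraint_score(d, d_bump, d_max):
    """Equation (3.4): root of the summed squared relative constraint violations."""
    e_bump = np.where(d < d_bump, ((d_bump - d) / d_bump) ** 2, 0.0)
    e_max = np.where(d > d_max, ((d_max - d) / d_max) ** 2, 0.0)
    return np.sqrt(np.sum(e_bump + e_max))

# Illustrative distances: the first violates the bump bound, the last the maximum.
d = np.array([3.0, 5.0, 12.0])
q_cons = constraint_score(d, d_bump=4.0, d_max=np.array([10.0, 10.0, 10.0]))
q_dist = distance_score(d, d_des=d.copy(), s=np.ones(3))
```

Both scores are zero when every distance equals its target and lies inside the admissible interval, which is why either can serve as a stopping rule.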

Comments

• The method is capable of reproducing important protein “non-random” fea-

tures like globularity and compactness.

• There is no mechanism to avoid forming knots in the structure.

Alternative chain simulations

Both the “distance space” and “Euclidean space” phases of the method by Aszodi and Taylor (1994) are computationally intensive. Here we propose a new algorithm based on the method of Aszodi and Taylor (1994) but much simpler and mathematically more flexible. We consider a chain of Cα atoms only (see Eidhammer et al., 2004, p. 254); Figure 12.1 therein gives the geometry. The chain is (Eidhammer et al., 2004, p. 173)

$$C^{\alpha}_{i-1} - C_{i-1} = N_i - C^{\alpha}_i - C_i = N_{i+1} - C^{\alpha}_{i+1} - C_{i+1} = N_{i+2}.$$

Note that $C^{\alpha}_i - C_i = N_{i+1} - C^{\alpha}_{i+1} - C_{i+1}$ lies in a plane, where “−” and “=” are single and double bonds respectively.

• Relative to the plane of Ci−1 = Ni−Cαi , Ci has only freedom to rotate “around”

the bond Ni − Cαi . This angle is φi.

• Relative to the plane of Ni−Cαi −Ci, Ni+1 has only freedom to rotate “around”

the bond Cαi − Ci. This angle is ψi.

Simulating Cα atoms only

Three consecutive Cα atoms can be regarded as lying in a plane. Following Aszodi and Taylor (1994), we start with a triangle whose base line goes from $(0, -d_1/2, 0)$ to $(0, d_1/2, 0)$. The vertex has coordinates $(\sqrt{l^2 - d_1^2/4}, 0, 0)$, where $l$ is pre-specified to be 3.8 Å and $d_1$ lies around 6 Å with a standard deviation of 0.4 Å.

Consider Figure 3.3; A, B, C denote consecutive Cα atoms lying in a plane. We take $d_i$ to be normally distributed with mean 6 Å and standard deviation 0.4 Å. To generate a fourth atom, D, consider a point $P_1$ in the same plane as A, B, C. The next step consists of rotating the edge $CP_1$ by θ, where the base line is AC. Thus the next triangle (containing the fourth atom D) gets rotated by θ. This angle θ, between CD and the xy-plane, is related to the dihedral angles φ and ψ in real proteins. We take θ to have a von Mises distribution with mean zero around edge AC rather than a uniform distribution as in Aszodi and Taylor (1994). Unlike the uniform distribution, the von

Mises distribution can be used to control the turning properties of the chain by

specifying mean direction and concentration parameters.

Figure 3.3: Orientation of Cα atoms in simulating protein short chains. [Diagram labels: atoms A, B, C, D; auxiliary point P1; axes x, y, z; angles α, β, δ, η, θ; distances d1, d2; bond length l.]

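Let $D_x$ denote the x-coordinate for D. Then it is found that:

$$\begin{aligned}
D_x &= l\cos\theta\sin\eta + C_x \\
D_y &= l\cos\theta\cos\eta + C_y \\
D_z &= l\sin\theta + C_z
\end{aligned} \tag{3.5}$$

where

$$\eta = \pi - \delta - \alpha = 2\cos^{-1}\left(\frac{d_2}{2l}\right) + \tan^{-1}\left(\frac{d_1}{2\sqrt{l^2 - d_1^2/4}}\right) - \frac{\pi}{2}.$$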

With B, C, D as new vertices in place of A, B, C respectively, the process iterates until all N atoms are generated, subject to a minimum distance of 5.5 Å between any two non-neighbouring atoms to model van der Waals forces.
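A single placement step following equation (3.5) might look like the sketch below. It covers only the formula for D given C, θ and η; re-orienting the frame for later atoms and the 5.5 Å rejection check are not shown. The function names, the concentration `kappa=4.0` and the clipping of the sampled distances are assumptions made here for illustration.

```python
import numpy as np

def eta_angle(d1, d2, l=3.8):
    """Angle eta from the expression following equation (3.5)."""
    return (2.0 * np.arccos(d2 / (2.0 * l))
            + np.arctan(d1 / (2.0 * np.sqrt(l ** 2 - d1 ** 2 / 4.0)))
            - np.pi / 2.0)

def place_next_atom(C, theta, eta, l=3.8):
    """Equation (3.5): coordinates of D at bond length l from C."""
    C = np.asarray(C, dtype=float)
    step = np.array([np.cos(theta) * np.sin(eta),
                     np.cos(theta) * np.cos(eta),
                     np.sin(theta)])
    return C + l * step

rng = np.random.default_rng(0)
# Distances are normal with mean 6 and sd 0.4; clipped here (an assumption)
# so that the square root and arccos stay defined.
d1, d2 = np.clip(rng.normal(6.0, 0.4, size=2), 0.1, 7.5)
theta = rng.vonmises(mu=0.0, kappa=4.0)  # kappa is an illustrative choice
C = np.array([0.0, d1 / 2.0, 0.0])
D = place_next_atom(C, theta, eta_angle(d1, d2))
```

Whatever angles are drawn, D lies exactly one virtual bond length from C, because the direction vector in equation (3.5) has unit length.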

Simulating the hydrophobic effect

The hydrophobic effect is the other major force driving protein folding. We take hydrophobicity into account by favouring the bond angle which takes the hydrophobic amino acid towards the centre of the configuration. Thus, at each Cα atom, simulate bond angles as follows:

• Step 1: get three angles randomly, say from von Mises distribution.

• Step 2: choose the angle which takes the Cα atom for a hydrophobic amino

acid furthest towards the centre of mass.

• Step 3: for hydrophilic amino acid, choose the angle which takes the Cα atom

furthest away from the centre.

Using three angles in Step 1 was observed to efficiently give reasonable structures. In principle more angles could be sampled. However, more angles increase the chances of the chain crashing into itself as there are more chances of turning the chain closer towards the centre of mass.
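Steps 2 and 3 amount to a selection among candidate positions, sketched below. The function name `choose_candidate` is hypothetical, and the candidate positions are assumed to have been generated already, one per sampled angle.

```python
import numpy as np

def choose_candidate(candidates, centre, hydrophobic):
    """Pick one of the candidate positions for the next C-alpha atom.

    A hydrophobic residue takes the candidate closest to the centre of
    mass; a hydrophilic residue takes the one furthest from it.
    """
    candidates = np.asarray(candidates, dtype=float)
    dist = np.linalg.norm(candidates - np.asarray(centre, dtype=float), axis=1)
    idx = np.argmin(dist) if hydrophobic else np.argmax(dist)
    return candidates[idx]

# Three illustrative candidate positions and a centre of mass at the origin.
cands = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
inner = choose_candidate(cands, centre=[0.0, 0.0, 0.0], hydrophobic=True)
outer = choose_candidate(cands, centre=[0.0, 0.0, 0.0], hydrophobic=False)
```

In a full simulation the centre of mass would be recomputed from the atoms placed so far before each selection.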

Results

Figures 3.4 and 3.5 are plots of typical chain realisations without and with modelling

hydrophobic effects respectively. Chains in Figures 3.5 a, b, c and d have 45, 55, 65

and 75% of their amino acids as hydrophobic. We observe that more hydrophobic

content gives more compact and globular structures.

Comments

• The algorithm has no in-built capability to avoid the chain making knots or getting entangled. The simple implementation was just to restart assembling the chain afresh if several (e.g. 100) attempts to generate a point fail due to distance constraints, as this is indicative of entanglement.

• As the folding direction is random, realisations leading to non compact and

non globular chains are possible as well especially when hydrophobic effects

are not taken into account.

Figure 3.4: Typical chain realisations in short protein chain simulations without hydrophobic effects. [Three-dimensional plots with axes x, y, z.]

Figure 3.5: Chain realisations in short protein chain simulations with hydrophobic effects. [Three-dimensional plots, panels a) to d), with axes x, y, z.]

3.1.3 Appropriateness of Simulated Data

In this chapter we considered how to simulate structures similar to proteins and functional sites. Any attempt to simulate a random structure has to consider intrinsic characteristics of proteins. However it is difficult to completely separate random and deterministic properties of protein structures except for well known characteristics like the inhibition distance between atoms and compactness in globular proteins.

Figures 3.6 and 3.7 are plots for typical hardcore and short chain simulations for

functional sites and part of protein structure in sections 3.1.1 and 3.1.2 respectively.

Figure 3.6a shows that the density field for hardcore configurations is not as tubular as for functional sites in SITESDB. The density field in Figure 3.7 for short chain simulation looks very similar to the density field for the whole structure e.g. Figure 2.13. In all these simulations, the minimum inter-event distance of 3.8 Å is well reflected. Also the structures are compact.

Arguably, hardcore configurations may not entirely mimic functional sites with

respect to tubularity. Also because short chains have points completely connected,

they might not entirely reflect functional sites as well. Probably functional sites are

mid-way between these two cases. However we use hardcore configurations (e.g. in sections 5.2.3, 5.4.4 and 6.1.8) and short chains (in section 6.1.8) to evaluate

matching algorithms. Although the motivation was matching functional sites, these

algorithms can be used for matching any type of configuration in some other appli-

cations e.g. matching steroids in chemoinformatics (Dryden et al., 2006).

3.2 Evaluation

Functional site pairs were simulated as in section 3.1 and used to evaluate the

performance of matching methods as outlined below (section 3.2.1). Quality of

matching real functional sites is evaluated in section 4.1.4 of Chapter 4 using scores

defined in section 3.2.2 in addition to goodness-of-fit p-values proposed in section

4.1 of Chapter 4.

Figure 3.6: The K-function and inter-point distance distribution for a simulated hardcore configuration. The K-function is normalised: $K(t) - \pi t^2$.

Figure 3.7: The K-function and inter-point distance distribution for a simulated short chain configuration. The K-function is normalised: $K(t) - \pi t^2$.

3.2.1 Correct Correspondence

With simulated datasets, evaluating matching methods by the proportion of correct point correspondences is possible because we know which point in {x} corresponds to which point in {µ}. To evaluate the methods:

(a) For each set of n and m (n = 4 to 64), thirty pairs of functional sites {µi} and {xj} are generated.

(b) Randomly permute the order of xi to xj . Thus, we no longer “know” corre-

sponding µi and xj points.

(c) Match points of {µ} to points of {x}. Find out correctly matched points.

Obviously, points in {µ} which do not correspond to any point in {x} have

correct correspondence if not matched.

(d) Calculate average correct correspondence proportion for each n from replica

datasets.

(e) Plot correct correspondence proportion against points set size (n) for each

method.
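Steps (b) to (d) can be sketched as below. The encoding is an assumption made for illustration: a correspondence is an integer array with one entry per point of {µ}, holding the index of the matched point of {x}, or −1 when the point is left unmatched.

```python
import numpy as np

def correct_correspondence_proportion(true_match, found_match):
    """Proportion of points of {mu} whose assignment (or non-assignment) is correct."""
    true_match = np.asarray(true_match)
    found_match = np.asarray(found_match)
    return float(np.mean(true_match == found_match))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))       # x[i] corresponds to mu[i] before permuting
perm = rng.permutation(5)
x_perm = x[perm]                  # step (b): the order is now "unknown"
true_match = np.argsort(perm)     # position of the original x[i] inside x_perm

# A hypothetical matcher output for four mu points, one of them wrong.
prop = correct_correspondence_proportion([2, 0, -1, 1], [2, 1, -1, 1])
```

Averaging this proportion over the thirty replicate pairs for each n gives the quantity plotted in step (e).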

3.2.2 Scores

Gold (2003) introduces scores for assessing quality of a match. Three matching

scores are considered. These are

(a) Option 0 (free matching): Corresponding amino acids found by distance cri-

teria only score one. The reported raw score is the total number of matched

pairs.

(b) Option 1 (identity matching): Corresponding amino acids only add to the

score if they have the same amino acid identity.

(c) Option 2 (similarity matching): Corresponding amino acids score one if they

have the same amino acid identity and score half if in the same group but not

identical. In this thesis we use groups as defined in Table 5.4.

Thus the Option 2 score, S, is

$$S = \sum_{i=1}^{q} s(\mu_i, x_i)$$

where amino acids $\mu_i$ and $x_i$ are geometrically equivalent amino acids in matched parts of configurations {µ} and {x}; and

$$s(\mu_i, x_i) = \begin{cases} 1 & \text{if } k_i = c_i, \text{ else} \\ 0.5 & \text{if } G_i = D_i \\ 0 & \text{otherwise} \end{cases}$$

where q is the number of matched pairs; $k_i$ and $G_i$ denote amino acid identity and group of $\mu_i$; similarly $c_i$ and $D_i$ denote amino acid identity and group of $x_i$.

Option 0 and 1 matching scores are easily expressed in the same form by appropri-

ately re-defining s(µi, xi).

Dividing the matching score by RMSD gives a final score. Thus, the final score

is a function of both geometrical and matching types measures. The matching score

is qualitative while RMSD is a quantitative measure. There is no rigorous statistical

interpretation of these scores (Gold, 2003). In Chapter 4 we consider goodness-of-fit

statistics for quantifying quality of matches. We compare quality indications by

using the scores and p-values from the distribution of the size-and-shape distance.
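The Option 2 score and the final score can be sketched as follows. The grouping in `EXAMPLE_GROUPS` is purely illustrative and is not the grouping of Table 5.4; the function names are likewise assumptions.

```python
# Hypothetical amino acid groups for illustration only (not Table 5.4).
EXAMPLE_GROUPS = {
    "D": "acidic", "E": "acidic",
    "K": "basic", "R": "basic",
    "L": "aliphatic", "I": "aliphatic",
}

def similarity_score(pairs, group_of):
    """Option 2 score: 1 for identical residues, 0.5 for same group, else 0."""
    total = 0.0
    for a, b in pairs:
        if a == b:
            total += 1.0
        elif a in group_of and group_of[a] == group_of.get(b):
            total += 0.5
    return total

def final_score(matching_score, rmsd):
    """Final score: matching score divided by RMSD (RMSD must be positive)."""
    return matching_score / rmsd

pairs = [("D", "D"), ("D", "E"), ("K", "L")]
score = similarity_score(pairs, EXAMPLE_GROUPS)
final = final_score(score, rmsd=0.5)
```

Options 0 and 1 follow by changing the per-pair rule: every pair scores one for Option 0, and only identical residues score for Option 1.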

Chapter 4

Match Statistics

In this chapter we focus on match statistics and their distributions under “random”

and “non-random” configuration hypotheses. By non-random we mean the scenario

where we know or suppose that the configurations are related and we are interested in

“goodness-of-fit”. The random hypothesis is for assessing the probability of finding

a matching configuration by mere chance.

4.1 Goodness-of-fit Statistics for Rigid Body Su-

perpositions

We consider optimal RMSD under the isotropic Gaussian landmark Model for as-

sessing goodness-of-fit when matching two configurations. The quality of rigid body

superposition of two configurations is often assessed by RMSD after optimal align-

ment. In general, the smaller the RMSD is, the better is the superposition or

matching.

4.1.1 Minimum RMSD Distribution

We simulated configurations of {µ} and {x} as in section 3.1.1 of Chapter 3. As in section 3.2.1, the order of the xi is randomly permuted so that we do not “know” the corresponding µi and xj points.

Figure 4.1a is a plot of RMSD after solving for optimal correspondence and

alignment for {xp} and {µp}. The graph theoretic method was used to solve for

correspondence and alignment. We observe that RMSD variability decreases with

increasing n. This is because chance good matchings and probably spurious worst

matchings as well are more likely with small n. This could be one of the reasons

for a well known decrease in RMSD for small n in matching proteins as only best

superpositions are of interest. Figure 4.1b is a plot for the smallest 10 RMSD values

for each n = 4, 8, . . . , 64. Figure 4.1c is a plot for the minimum RMSD for each

n = 4, 8, . . . , 64. These are typical plots when considering RMSD for best matches

in proteins.

Figure 4.1: RMSD against number of corresponding points with “loess” smoothing curves. a) Optimal RMSD after graph matching against number of points. b) Minimum 10 RMSD values for each number of corresponding points in (a). c) Minimum (best) RMSD for each value of n in (a) or (b).

Since RMSD depends on the number of corresponding points in the configurations, one cannot directly compare two RMSD values from superimposing configurations with different numbers of corresponding points. To overcome this problem

in Bioinformatics applications, Carugo and Pongor (2001) find how RMSD depends

on the number of corresponding points, q. These authors propose to adjust RMSD

values to RMSD100 values i.e. interpolated RMSD values for q = 100. RMSD100 val-

ues are comparable but this adjustment ignores that variability for RMSD increases

with respect to q. A classical way to take this variability into account is to find the

distribution of RMSD. One can then directly compare the standardised RMSD or

goodness-of-fit p-values.

4.1.2 Distribution of Size-and-shape Distance

The squared size-and-shape distance is $q \times \mathrm{RMSD}^2$ after optimal alignment of two rigid body configurations. Let X and µ be coordinate matrices with columns in X and µ representing corresponding points in the two configurations. The size-and-shape squared distance is

$$d_S^2(X,\mu) = S_X^2 + S_\mu^2 - 2S_XS_\mu\cos\rho(X,\mu) \tag{4.1}$$

where $S_X^2 = \sum_{j=1}^{q}\|X_j - \bar{X}\|^2$ is the squared centroid size of X, and ρ is the Procrustes distance (see Definition 4.1.4 below and Dryden and Mardia, 1998).
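Equation (4.1) can be checked numerically against a direct rigid-body fit, as in the sketch below. It uses an SVD-based (Kabsch-type) rotation fit, which is a standard technique and not necessarily the computation used in the thesis; the function name is hypothetical, and points are stored as rows here rather than as columns.

```python
import numpy as np

def size_and_shape_check(X, mu):
    """Return (direct d_S^2, d_S^2 from equation (4.1), RMSD) for q x 3 arrays."""
    Xc = X - X.mean(axis=0)                 # remove translation
    Mc = mu - mu.mean(axis=0)
    S_X = np.linalg.norm(Xc)                # centroid sizes (Frobenius norms)
    S_mu = np.linalg.norm(Mc)
    U, s, Vt = np.linalg.svd(Xc.T @ Mc)
    sign = np.sign(np.linalg.det(U @ Vt))   # restrict to proper rotations
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt
    d2_direct = np.sum((Mc - Xc @ R) ** 2)  # minimised sum of squares
    cos_rho = (s[0] + s[1] + sign * s[2]) / (S_X * S_mu)
    d2_formula = S_X ** 2 + S_mu ** 2 - 2.0 * S_X * S_mu * cos_rho
    rmsd = np.sqrt(d2_direct / X.shape[0])
    return d2_direct, d2_formula, rmsd

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
c, s_ = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
mu_exact = X @ Rz + np.array([1.0, -2.0, 0.5])      # rigid copy of X
mu_noisy = mu_exact + 0.1 * rng.normal(size=X.shape)
exact = size_and_shape_check(X, mu_exact)
noisy = size_and_shape_check(X, mu_noisy)
```

For the rigid copy the RMSD is zero up to rounding, and for the noisy copy the directly minimised sum of squares agrees with the right-hand side of equation (4.1).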

We first define the Helmert sub-matrix used to centre configurations at the origin.

The Helmert sub-matrix also scales the configuration to have a unit centroid size.

Definition 4.1.1. The jth row of the Helmert sub-matrix H is given by

$$(h_j, \dots, h_j, -jh_j, 0, \dots, 0), \qquad h_j = -\{j(j+1)\}^{-1/2},$$

and so the jth row consists of $h_j$ repeated j times, followed by $-jh_j$ and then $q-j-1$ zeros, $j = 1, \dots, q-1$.

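For q = 3 the full Helmert matrix is explicitly

$$H^f = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} \end{pmatrix}$$

and the Helmert sub-matrix is

$$H = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} \end{pmatrix}.$$

For q = 4 the full Helmert matrix is

$$H^f = \begin{pmatrix} 1/2 & 1/2 & 1/2 & 1/2 \\ -1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} & 0 \\ -1/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} & 3/\sqrt{12} \end{pmatrix}$$

and the Helmert sub-matrix is

$$H = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} & 0 \\ -1/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} & 3/\sqrt{12} \end{pmatrix}.$$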
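Definition 4.1.1 translates directly into code. The sketch below builds the Helmert sub-matrix for any q; the function name `helmert_submatrix` is an assumption made here.

```python
import numpy as np

def helmert_submatrix(q):
    """(q-1) x q Helmert sub-matrix built row by row from Definition 4.1.1."""
    H = np.zeros((q - 1, q))
    for j in range(1, q):
        h = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = h          # h_j repeated j times
        H[j - 1, j] = -j * h      # followed by -j * h_j; remaining entries stay 0
    return H

H3 = helmert_submatrix(3)
H4 = helmert_submatrix(4)
```

The rows are orthonormal and each sums to zero, so multiplying a configuration by the transpose of H removes location without changing inter-point distances.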

Definition 4.1.2. The pre-shape of a configuration X is all the geometrical information that remains when location and scale effects are filtered out from the object. That is, the pre-shape of X is given by

$$Z = \frac{XH^T}{\|XH^T\|}$$

where H is the Helmert sub-matrix.

The pre-shape of an object is invariant under translation and scaling of the

original configuration.

Definition 4.1.3. The pre-shape space is the space of all possible pre-shapes. Formally, the pre-shape space $S^q_d$ is the orbit space of the non-coincident q point set configuration in $\Re^d$ under the action of translation and isotropic scaling.

The pre-shape space $S^q_d \equiv S^{d(q-1)-1}$ is a hypersphere of unit radius in $d(q-1)$ real dimensions, since the centroid size of Z is $\|Z\| = 1$.

Definition 4.1.4. The Procrustes distance ρ(X,µ) is the closest great circle

distance between pre-shapes of X and µ on the pre-shape sphere. The minimisation

is carried out over rotations.

From the distribution of size-and-shape we will derive the distribution of RMSD

which is a function of size-and-shape. RMSD is commonly used in Bioinformat-

ics applications while size-and-shape is mainly used in Morphometry. We use the

distribution of RMSD in a Bioinformatics application to rank best matches from a

database search in section 4.1.4.

Consider the distribution of RMSD, $r = d_S(X,\mu)/\sqrt{q}$, under the isotropic Gaussian model for corresponding points i.e. $x_j \sim N(\mu_i, \sigma^2I_d)$ where point $\mu_i$ corresponds to point $x_j$ and $d = 3$ is the dimension. Thus RMSD is a function of two random variables $S_X$ and ρ. Under our model, we assume $S_\mu$ is fixed while $S_X^2$ is distributed as non-central $\chi^2_\nu(\lambda)$ with $\nu = dq - d$ and $\lambda = S_\mu^2/\sigma^2$. After optimal superposition of configurations with q points, the full Procrustes distance,

$$d_F^2 = \sin^2\rho(X,\mu) \sim \tau_0^2\,\chi^2_{dq-d(d-1)/2-d-1}$$

with $\tau_0^2 = \sigma^2/S_\mu^2$. We consider exact and approximate distributions for r in the following sections. The approximation is when $S_X \approx S_\mu$ and the variability of $S_X$ is small.

Exact Distribution

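We first consider the distribution for $d_S^2(X,\mu)$. With

$$\sin^2\rho(X,\mu) \sim \tau_0^2\,\chi^2_{dq-d(d-1)/2-d-1}$$

the density for $x = \cos\rho$ is

$$f(x) = \frac{2x\beta^{\alpha}}{\Gamma(\alpha)}(1-x^2)^{\alpha-1}e^{-\beta(1-x^2)} \tag{4.2}$$

where $\beta = 1/(2\tau_0^2)$ and $\alpha = \{dq-d(d-1)/2-d-1\}/2$, and the density for $y = S_X^2$ is

$$f(y) = \frac{1}{2\sigma^2\sqrt{2\pi\sqrt{\lambda y/\sigma^2}}}\,e^{-(\lambda+y/\sigma^2)/2}\left(\frac{y}{\lambda\sigma^2}\right)^{-1/4}\left\{e^{\sqrt{\lambda y/\sigma^2}} + e^{-\sqrt{\lambda y/\sigma^2}}\right\}. \tag{4.3}$$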

With $\beta = 1/(2\tau_0^2)$ and $\alpha = \{dq-d(d-1)/2-d-1\}/2$ as before, and assuming independence between $x = \cos\rho$ and $y = S_X^2$, the joint distribution for x and y is

$$f(x,y) = \frac{x\beta^{\alpha}}{\Gamma(\alpha)}(1-x^2)^{\alpha-1}e^{-\beta(1-x^2)} \times \frac{1}{\sigma^2\sqrt{2\pi\sqrt{\lambda y/\sigma^2}}}\,e^{-(\lambda+y/\sigma^2)/2}\left(\frac{y}{\lambda\sigma^2}\right)^{-1/4}\left\{e^{\sqrt{\lambda y/\sigma^2}} + e^{-\sqrt{\lambda y/\sigma^2}}\right\}. \tag{4.4}$$

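Let $v = S_\mu^2 + y - 2S_\mu x\sqrt{y}$ and $u = y$. The inverse functions are $y = u$ and $x = (S_\mu^2 + u - v)/(2\sqrt{u}S_\mu)$, and the Jacobian of the transformation is $|J| = 1/(2\sqrt{u}S_\mu)$. Hence the joint distribution for v and u is

$$f(u,v) = \frac{(S_\mu^2+u-v)\beta^{\alpha}}{4uS_\mu^2\Gamma(\alpha)}\left\{1 - \left(\frac{S_\mu^2+u-v}{2\sqrt{u}S_\mu}\right)^2\right\}^{\alpha-1}\exp\left\{-\beta\left(1 - \left(\frac{S_\mu^2+u-v}{2\sqrt{u}S_\mu}\right)^2\right)\right\} \times \frac{1}{\sigma^2\sqrt{2\pi\sqrt{\lambda u/\sigma^2}}}\,e^{-(\lambda+u/\sigma^2)/2}\left(\frac{u}{\lambda\sigma^2}\right)^{-1/4}\left\{e^{\sqrt{\lambda u/\sigma^2}} + e^{-\sqrt{\lambda u/\sigma^2}}\right\}. \tag{4.5}$$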

It is not easy to integrate out u in order to get the distribution for v. Thus we

consider an approximation for size-and-shape distance.

Approximation

We consider the distribution for approximate size-and-shape distance when variabil-

ity of SX is so small or SX ≈ Sµ such that we can treat SX as a constant as well.

For example in Bioinformatics applications, interesting cases are where matching is

good hence configurations are of the same size i.e. SX ≈ Sµ. Thus

$$d_S^2(X,\mu) \approx 2S_\mu^2\,(1 - \cos\rho(X,\mu)). \tag{4.6}$$

With $\sin^2\rho(X,\mu) \sim \tau_0^2\,\chi^2_{dq-d(d-1)/2-d-1}$, the approximate¹ density for r is

$$f(r) = \frac{qr\beta^{\alpha}}{S_\mu^2\Gamma(\alpha)}\left(2 - \frac{qr^2}{S_\mu^2}\right)\left(\frac{qr^2}{S_\mu^2} - \left(\frac{qr^2}{2S_\mu^2}\right)^2\right)^{\alpha-1}\exp\left\{-\beta\left(\frac{qr^2}{S_\mu^2} - \left(\frac{qr^2}{2S_\mu^2}\right)^2\right)\right\} \tag{4.7}$$

where $\beta = 1/(2\tau_0^2)$ and $\alpha = \{dq-d(d-1)/2-d\}/2$. We adjust the degrees of freedom because we do not allow scaling i.e. we multiply with $S_\mu^2$. We only lose $d(d-1)/2 + d$ degrees of freedom for rotation and translation as

$$d_S^2(X,\mu) = \inf_{A \in SO(d)}\|\mu - AX - b\|^2$$

where SO(d) denotes the set of all d × d rotation matrices (orthogonal matrices with determinant equal to +1).

¹ This is the density for $S_\mu\sqrt{2(1-\cos\rho(X,\mu))/q}$, an approximate size-and-shape distance in closely fitting configurations.

4.1.3 Simulations for RMSD Distribution

We simulate {µ} and {x} as in section 3.1.1. However here n = m = 20 and we simulated 10,000 pairs. The order of the xi is randomly permuted as in section 3.2.1 so that we do not “know” the corresponding µi and xj points.

Figure 4.2a gives a histogram of RMSD after optimal superposition using the graph theoretic method. Superimposed on this histogram is the probability density function in equation 4.7. We observe that this approximate distribution is a good fit. Figure 4.2b is a plot of the empirical distribution function and the cumulative distribution function of equation 4.7. We also observe a good fit here. Therefore a goodness-of-fit p-value from our approximate distribution can be used.
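One way to explore the approximate distribution without evaluating its density is by Monte Carlo, as sketched below: draw sin²ρ from the scaled chi-squared law with the adjusted degrees of freedom dq − d(d−1)/2 − d, convert to r through the approximation in equation (4.6), and read off an upper-tail p-value empirically. The function names, the clipping of sin²ρ to [0, 1], and the use of an empirical rather than analytical tail probability are assumptions of this sketch, not the procedure used for Figure 4.2 or Table 4.1.

```python
import numpy as np

def simulate_approx_rmsd(q, S_mu, sigma, d=3, size=10000, seed=0):
    """Draw approximate RMSD values r = S_mu * sqrt(2 (1 - cos rho) / q)."""
    rng = np.random.default_rng(seed)
    df = d * q - d * (d - 1) // 2 - d            # adjusted degrees of freedom
    tau0_sq = sigma ** 2 / S_mu ** 2
    # sin^2(rho) ~ tau0^2 * chi^2_df, clipped so that cos(rho) is defined.
    sin2_rho = np.clip(tau0_sq * rng.chisquare(df, size=size), 0.0, 1.0)
    cos_rho = np.sqrt(1.0 - sin2_rho)
    return S_mu * np.sqrt(2.0 * (1.0 - cos_rho) / q)

def empirical_p_value(r_obs, draws):
    """Upper-tail proportion of simulated RMSD values at least as large as r_obs."""
    return float(np.mean(draws >= r_obs))

# Illustrative inputs: q and S_mu as for a 21-point site, sigma = 0.3.
draws = simulate_approx_rmsd(q=21, S_mu=41.46, sigma=0.3)
p_small = empirical_p_value(0.0, draws)
p_large = empirical_p_value(10.0, draws)
```

An observed RMSD well below the bulk of the simulated values yields a p-value near one, indicating a fit at least as close as the model expects for related configurations.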

4.1.4 Application

We did a database search with a functional site of 5-aminolaevulinate dehydratase

(1b4e 0) using the graph method of Gold (2003). The standard deviation σ is estimated to be around 0.3 for matching functional sites known to be related (functional sites from 17-β hydroxysteroid-dehydrogenase and carbonyl reductase proteins shown in Figure 1.3) at a threshold of 1.5 Å. Thus we set σ = 0.3 for a matching distance tolerance of 1.5 Å (cf. section 5.3 of Chapter 5).

Table 4.1 gives the results for best 50 matches sorted by goodness-of-fit p-values.

Also given are scores proposed by Gold (2003) and described in section 3.2.2. The

scores given in Table 4.1 are found by dividing values of the score option 2 (see

section 3.2.2) by the RMSD.

We observed that 1eb3 0 (No 13) has a higher p-value than 1i8j 2 (No 14) although the latter has a lower RMSD. There is an agreement between the p-value and the score ranking. The better match has 21 corresponding amino acids compared to 8 for the other match. This justifies a better goodness-of-fit even though its RMSD is higher than the other. This scenario is also observed for 1gjp 0 and 1l6s 2 (No 16 and 17); 1h7r 0 and 1l6y 3 (No 18 and 19).

Figure 4.2: Approximate RMSD distribution. a) Histogram of RMSD after optimal superposition using the graph theoretic method. b) Empirical and approximate (equation 4.7) distribution functions for RMSD.

4.1.5 Summary

A bigger challenge is to analytically work out the exact distribution for the number of matches and RMSD when matching random configurations. Unlike our attempt to find a good approximating distribution for a Procrustes metric when matching random configurations, Stark et al. (2003b) empirically modelled the distribution for RMSD with the extreme value distribution. We follow Stark et al. (2003b) to calculate p-values for matching random (unrelated) configurations in our application in section 7.4 of Chapter 7.

The distribution for the number of matches can also be modelled by the extreme value distribution e.g. Chen and Crippen (2005).

Table 4.1: Best fitting functional sites in the database when matched against the 5-aminolaevulinate dehydratase functional site (1b4e 0).

No.  SITE    q   Sµ     SX     RMSD      SCORE  P-value
1    1b4e 0  21  41.46  41.46  0.000000  NA     1.0000000
2    1h7n 0  21  41.46  42.19  0.325806  0.513  0.9999999
3    1i8j 4  15  32.56  32.84  0.264983  0.502  0.9999993
4    1l6s 6  15  32.56  32.87  0.275493  0.492  0.9999982
5    1l6y 6  18  37.31  37.90  0.325130  0.479  0.9999966
6    1l6s 7  15  32.56  32.87  0.285211  0.465  0.9999946
7    1i8j 5  15  32.56  32.79  0.280753  0.467  0.9999943
8    1h7p 0  21  41.46  42.53  0.394005  0.414  0.9999942
9    1ohl 0  21  41.46  42.33  0.376413  0.437  0.9999863
10   1l6y 0  20  40.27  41.07  0.371252  0.540  0.9999744
11   1i8j 0  8   16.40  16.60  0.204785  0.325  0.9999703
12   1h7o 0  21  41.46  42.57  0.412440  0.407  0.9999651
13   1eb3 0  21  41.46  42.38  0.391614  0.422  0.9999554
14   1i8j 2  8   15.34  15.65  0.239518  0.300  0.9998613
15   1l6s 4  8   15.34  15.64  0.242736  0.302  0.9998003
16   1gjp 0  20  40.27  41.48  0.447832  0.352  0.9995251
17   1l6s 2  8   15.29  15.56  0.250936  0.290  0.9995213
18   1h7r 0  20  39.97  41.16  0.473749  0.318  0.9939235
19   1l6y 3  7   12.98  13.14  0.315031  0.192  0.9617287
20   1b4k 0  20  40.27  41.00  0.457764  0.266  0.9577919
21   1e51 0  20  40.07  41.56  0.573448  0.279  0.8148693
22   1gzg 0  21  41.46  42.20  0.504140  0.234  0.7643607
23   1b4k 1  17  34.74  35.62  0.582555  0.203  0.2456983
24   1hrs 2  3   5.72   5.46   0.766603  0.020  0.0006607
25   1m7h 6  5   9.52   8.68   0.869824  0.019  0.0002631

Chapter 5

EM Algorithm Alignment

The commonly used graph theoretic approach (reviewed in section 1.2.1) and other

related approaches e.g. geometric hashing (Wallace et al., 1997) require adjustment

of a matching distance threshold a priori according to the noise in atomic positions.

This is difficult to pre-determine when matching sites related by varying evolutionary

distances and crystallographic precision.

To avoid the problem of specifying a matching distance threshold, in this chapter we consider using an EM algorithm in the mixture model formulation of the problem of finding an alignment and point correspondences between two configurations. Assume we are given two configurations $\{\mu_i : i = 1, \dots, m\}$ and $\{x_j : j = 1, \dots, n\}$ in $\Re^d$. Suppose there are $q \in \{2, \dots, n\}$ corresponding points in these configurations under rigid body transformation¹. However we do not know

(a) which are the corresponding points;

(b) the number of corresponding points, q;

(c) the transformation parameters.

We reviewed some approaches for solving this problem in section 1.2.1. We review a statistical approach by Kent et al. (2004) using a mixture model in section 5.1. In section 5.2, we consider using concomitant information to point coordinates in the mixture model framework of Kent et al. (2004).

¹ We are interested in matching at least two points.

5.1 Mixture Model

Given configurations $\{\mu_i : i = 1, \dots, m\}$ and $\{x_j : j = 1, \dots, n\}$ in $\Re^d$ with correspondence and alignment unknown, Taylor et al. (2003) formulate a mixture model to solve for both correspondence and alignment simultaneously. Correspondence is considered to be missing data and the EM algorithm is used. Expected values of the mixture indicator variables are calculated in the E-step; alignment parameters that maximise the expected log likelihood are estimated in the M-step using Procrustes analysis. This is known as “soft” matching because we use expected values of correspondence indicator variables.

5.1.1 Soft Matching of Forms

Let {µi} have at least as many points as {xj}, i.e. n ≤ m, so that we can assume that {xj} has arisen from {µi} through some transformation, possibly with some points in {µi} not appearing in {xj}. This model is plausible for the motivating problem in protein functional sites and certainly in some applications in chemoinformatics as well (Dryden et al., 2006). The restriction that n ≤ m is without loss of generality in many applications because in practice there is no knowledge of which of the two configurations to be matched gave rise to the other (parentage) i.e. {xj} and {µi} are exchangeable. Furthermore the parentage is of no practical use as far as matching configurations is concerned. Indeed, for the Bayesian approach in Chapter 6, configuration sizes do not matter even for formulating the methodological matching framework and algorithm.

Let the map π(j) = i denote correspondence between points $x_j$ and $\mu_i$. If π(j) = i then assume

$$x_j = A^T\mu_i + b + \varepsilon_i, \qquad \varepsilon_i \sim IN(0, \sigma^2)$$

where $\sigma^2$ is unknown and A is an orthogonal matrix. That is, for fixed j, we take for $i = 1, \dots, m$,

$$\phi(x_j \mid \pi(j) = i) = \begin{cases} (2\pi\sigma^2)^{-d/2}\exp\left\{-\tfrac{1}{2}\|x_j - A^T\mu_i - b\|^2/\sigma^2\right\} & \text{if } i \neq 0 \\ 1/\|W\| & \text{if } i = 0. \end{cases} \tag{5.1}$$

The convention π(j) = 0 is used to classify a point $x_j$ which does not correspond to any point $\mu_i$. These points are referred to as coffin bin points. Coffin bin points are assumed to be uniformly distributed in a region $W \in \Re^d$ i.e.

$$x_j \mid (\pi(j) = 0) \sim \mathrm{Uniform}(W).$$

The marginal distribution of $x_j$ is given by the mixture model

$$x_j \sim \sum_{i=1}^{m} P(\pi(j) = i)\,N(A^T\mu_i + b, \sigma^2I_d) + P(\pi(j) = 0)\,\mathrm{Uniform}(W) \tag{5.2}$$

where $P(\pi(j) = i)$, $i = 0, \dots, m$, are marginal membership probabilities and

$$\sum_{i=0}^{m} P(\pi(j) = i) = 1.$$

Alternatively we can assume a normal distribution for coffin bin points i.e.

$$x_j \mid (\pi(j) = 0) \sim N(\mu_0, \sigma_0^2I_d)$$

where $\mu_0$ can be taken to be the centre of mass for {µ} and $\sigma_0^2$ is large.

5.1.2 Model Likelihood

Let $X = (x_1, \dots, x_n)^T$ and let L be a set of labels. Given L, the likelihood is

$$Q(X \mid L) = \prod_{i=0}^{m}\prod_{j=1}^{n} p_i^{I[\pi(j)=i]}\,\phi(x_j \mid \pi(j) = i)^{I[\pi(j)=i]}$$

where I is an indicator function such that

$$I[\pi(j) = i] = \begin{cases} 1 & \text{if } \pi(j) = i \\ 0 & \text{otherwise} \end{cases}$$

and $p_i = P(\pi(j) = i)$ is the mixing probability for any x to be with label i.

Hence

$$\log Q(X \mid L) = \sum_{i=0}^{m}\sum_{j=1}^{n}\left\{I[\pi(j) = i]\log p_i + I[\pi(j) = i]\log\phi(x_j \mid \pi(j) = i)\right\}. \tag{5.3}$$

With the labels unknown, let

$$p_i = P(\pi(j) = i), \quad j = 1, \dots, n; \qquad \sum_{i=0}^{m} p_i = 1$$

be the prior probability of label π(j) being i. The posterior probability is

$$p_{ji} = P(\pi(j) = i \mid x_j) = \frac{P(x_j \mid \pi(j) = i)\,p_i}{P(x_j)}$$

and $(p_{ji})$ is an $n \times (m+1)$ matrix. Note that

$$P(x_j) = \sum_{i=1}^{m} p_i\,\phi(x_j \mid \pi(j) = i) + p_0\,\phi(x_j \mid \pi(j) = 0)$$

and $P(x_j \mid \pi(j) = i) \equiv \phi(x_j \mid \pi(j) = i)$.
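The posterior probabilities can be computed in a few lines, as sketched below under the uniform coffin bin model of equation (5.1). The function name `e_step`, the argument `volume_W` (standing for ‖W‖) and the storage of points as rows are assumptions made for illustration; column 0 of the result is the coffin bin.

```python
import numpy as np

def e_step(X, mu, A, b, sigma2, p, volume_W):
    """Posterior matrix (p_ji) of shape n x (m + 1); column 0 is the coffin bin.

    X is n x d, mu is m x d (points as rows), p has length m + 1.
    """
    n, d = X.shape
    m = mu.shape[0]
    pred = mu @ A + b                 # row i holds (A^T mu_i + b)^T
    sq = ((X[:, None, :] - pred[None, :, :]) ** 2).sum(axis=2)   # n x m
    phi = np.empty((n, m + 1))
    phi[:, 0] = 1.0 / volume_W        # uniform density on W
    phi[:, 1:] = (2.0 * np.pi * sigma2) ** (-d / 2.0) * np.exp(-0.5 * sq / sigma2)
    weighted = phi * p[None, :]       # p_i * phi(x_j | pi(j) = i)
    return weighted / weighted.sum(axis=1, keepdims=True)

mu = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
X = mu[:2].copy()                     # two points that match mu_1 and mu_2 exactly
P = e_step(X, mu, A=np.eye(3), b=np.zeros(3), sigma2=0.09,
           p=np.full(4, 0.25), volume_W=1000.0)
```

Each row of the result sums to one, and here the largest entry in row j sits in column j + 1, reflecting the exact correspondence built into the toy data.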

5.1.3 The EM Algorithm

In summary form, the algorithm involves:

• E-step: calculating assignment probabilities (expectation of correspondence

indicator variables).

• M-step: finding transformation and nuisance (variance, σ) parameters which

maximise the expected log likelihood given current assignment probabilities.

Procrustes fit is used to find transformation parameters.

• Repetition of E and M steps until convergence of residual sum of squares.

Algorithm mechanics

Let $p_i$ be given, with starting values $p_i^{(0)} = 1/(m+1)$, say. Then the E-step is:

$$p_{ji}^{(r+1)} = \frac{P(x_j \mid \pi(j) = i)\,p_i^{(r)}}{P(x_j)}.$$

Substituting $p_{ji}$ for $I[\pi(j) = i]$, the log likelihood is

$$\sum_{i=0}^{m}\sum_{j=1}^{n}\left\{p_{ji}\log p_i + p_{ji}\log\phi(x_j \mid \pi(j) = i)\right\}. \tag{5.4}$$

Thus in M-step, we minimise:

f(A, b) =m∑

i=1

n∑

j=1

pji‖xj − ATµi − b‖2 (5.5)

using Procrustes fit for rigid body motion. A is an orthogonal matrix. If V ΓUT

is a singular value decomposition of B =∑m

i=1

∑nj=1 pji(µi − µ)(xj − x)T where

µ =Pm

i=1

Pnj=1 pjiµiPm

i=1

Pnj=1 pji

; xTr and yTr are rth rows of X and Y then A = V UT .

Thus for the (r + 1)th iteration we have

B(r+1) =m∑

i=1

n∑

j=1

p(r)ji (µi − µ)(xj − x)T , A(r+1) = (V UT )(r+1).

By minimising (5.5) w.r.t. b, we have

b(r+1) =

m∑

i=1

n∑

j=1

p(r)ji (xj − (A(r+1))Tµi)

m∑

i=1

n∑

j=1

p(r)ji

.
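The update for $b$ follows from setting the gradient of (5.5) to zero. The short derivation below is a sketch under the assumption that $\bar{x}$ denotes the $p_{ji}$-weighted mean of the $x_j$, defined in the same way as $\bar{\mu}$; iteration superscripts are dropped for readability.

```latex
\[
\frac{\partial f}{\partial b}
  = -2 \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \left( x_j - A^{T}\mu_i - b \right) = 0
\quad \Longrightarrow \quad
b = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji} \left( x_j - A^{T}\mu_i \right)}
         {\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}}
  = \bar{x} - A^{T}\bar{\mu},
\qquad
\bar{x} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}\, x_j}{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}} .
\]
```

In words, the translation maps the rotated weighted centroid of the template points onto the weighted centroid of the observed points.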

Finally update the mixing proportions:

\[ p_i^{(r+1)} = \frac{\sum_{j} p_{ji}^{(r)}}{\sum_{j,i} p_{ji}^{(r)}} = \frac{\sum_{j} p_{ji}^{(r)}}{\sum_{j} 1} = \frac{\sum_{j} p_{ji}^{(r)}}{n}. \]
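As a small sketch of this update, the mixing proportions are simply the column means of the posterior matrix. The function name `update_mixing` is hypothetical, and the input is assumed to be an $n \times (m+1)$ list of rows that each sum to one.

```python
def update_mixing(posteriors):
    """Column means of the n x (m+1) posterior matrix give p_0, ..., p_m."""
    n = len(posteriors)
    n_labels = len(posteriors[0])
    return [sum(row[i] for row in posteriors) / n for i in range(n_labels)]
```

Because every row sums to one, the updated proportions again sum to one.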

E and M steps are repeated until convergence of the residual sum of squares

\[ \sum_{i=1}^{m} \left( x_i^{(r+1)} - x_i^{(r)} \right)^T \left( x_i^{(r+1)} - x_i^{(r)} \right) \]

where $x_i^{(r)} = A^{(r)} \mu_i + b^{(r+1)}$. To ensure convergence of the correspondence matrix $(p_{ji})$ as well, use $x_i^{(r)} = \sum_{j=1}^{n} p_{ji} A^{(r)} \mu_i + b^{(r+1)}$. Another convergence criterion could be the log-likelihood (5.4).


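At the $r$th iteration, the correspondence probability weighted maximum likelihood estimate of $\sigma^2$ is

\[ (\sigma^2)^{(r)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)} \, \| x_j - (A^T)^{(r)} \mu_i - b^{(r)} \|^2}{d \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ji}^{(r)}} \]

where $d = 3$ is the dimension. The unweighted estimator is

\[ \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \| x_j - (A^T)^{(r)} \mu_i - b^{(r)} \|^2}{d \times n \times m}. \]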
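The weighted estimator can be sketched as follows. This is an illustration only: the name `weighted_variance` is hypothetical, `centres` is assumed to hold the already transformed template points $(A^T)^{(r)}\mu_i + b^{(r)}$, and `weights[j][i]` is assumed to hold $p_{ji}$ for the matched labels $i = 1, \ldots, m$ (coffin bin column excluded).

```python
def weighted_variance(xs, centres, weights, d=3):
    """Posterior-weighted estimate of sigma^2 over matched labels i = 1..m."""
    numerator = 0.0
    weight_total = 0.0
    for x, row in zip(xs, weights):
        for centre, w in zip(centres, row):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, centre))
            numerator += w * sq_dist
            weight_total += w
    return numerator / (d * weight_total)
```

Setting every weight to one recovers the unweighted estimator, whose denominator is then $d \times n \times m$.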

For normally distributed coffin bin points, the maximum likelihood estimate of $\sigma_0^2$ is

\[ (\sigma_0^2)^{(r)} = \frac{\sum_{j=1}^{n} p_{j0}^{(r)} \, \| x_j - (A^T)^{(r)} \mu_0 - b^{(r)} \|^2}{d \sum_{j=1}^{n} p_{j0}^{(r)}}. \]

The unweighted estimate is

\[ \frac{\sum_{j=1}^{n} \| x_j - (A^T)^{(r)} \mu_0 - b^{(r)} \|^2}{d \times n}. \]

We take $\mu_0$ to be $A^{(r)T} \mu_c + b^{(r)}$, where $A^{(r)}$ and $b^{(r)}$ are the $r$th estimates of the matrix $A$ and the vector $b$ respectively, and $\mu_c$ is the centre of mass of $\{\mu\}$. Simulation studies show that using the normal distribution for coffin bin points in this way gives similar results to using the uniform distribution.

5.1.4 Hardening of Soft Matches

After the algorithm converges, we need to turn $(p_{ji})$ into a “permutation matrix” $(p'_{ji})$ with $p'_{ji} \in \{0, 1\}$. This assigns corresponding points to each other and puts non-corresponding points in the coffin bin. This is a typical linear assignment (LA) task, and there are several ways of accomplishing it, each with its own considerations.


Greedy algorithm

We get “hardened” matching probabilities $p'_{ji} \in \{0, 1\}$ from $p_{ji}$ using a greedy algorithm. Here $p'_{ji}$ is set to 1 if $p_{ji}$ is the largest value in both its row and its column, and to 0 otherwise. Column $i$ and row $j$ are removed once $p'_{ji}$ is set to 1.

However, one has to consider how to treat the coffin bin. We suggest excluding the coffin bin column from the greedy algorithm and afterwards allocating all remaining points to the coffin bin. Thus for $i = 1, \ldots, m$ (excluding the coffin bin column, $i = 0$) and $j = 1, \ldots, n$, we get $p'_{ji}$ according to the following rule:

\[ p'_{ji} = \begin{cases} 1 & \text{if } p_{ji} = \max_{i'} p_{ji'} = \max_{j'} p_{j'i} \\ 0 & \text{otherwise.} \end{cases} \]

By leaving the coffin bin out of the greedy algorithm and only then allocating the left-overs to the coffin bin, we prioritise matching points over coffin bin allocation. The only problem with the greedy algorithm is that there is no guarantee of a globally optimal assignment.
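One possible reading of this greedy rule is sketched below. It relies on the observation that the largest remaining entry is always a maximum of both its row and its column, so repeatedly taking the largest remaining entry and deleting its row and column implements the rule. The function name `greedy_harden` and the label convention (0 for the coffin bin, $1, \ldots, m$ for matches) are assumptions made for illustration.

```python
def greedy_harden(p):
    """Harden an n x (m+1) soft matrix p (column 0 = coffin bin).

    Returns a list of labels, one per row j: the matched column i in 1..m,
    or 0 if the point is left for the coffin bin.
    """
    n = len(p)
    m = len(p[0]) - 1
    labels = [0] * n                    # default: coffin bin
    free_rows = set(range(n))
    free_cols = set(range(1, m + 1))    # coffin bin column is excluded
    while free_rows and free_cols:
        # Largest remaining entry: a row and column maximum of what is left.
        j, i = max(
            ((j, i) for j in sorted(free_rows) for i in sorted(free_cols)),
            key=lambda ji: p[ji[0]][ji[1]],
        )
        labels[j] = i
        free_rows.remove(j)
        free_cols.remove(i)
    return labels
```

For instance, with rows `[0.1, 0.8, 0.1]`, `[0.1, 0.6, 0.3]` and `[0.9, 0.05, 0.05]`, the first two points are matched to columns 1 and 2 and the third is left for the coffin bin.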

Dynamic programming (DP) and linear programming (LP)

Mathematically, the linear assignment task is the problem of maximising

\[ Z = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ji} x_{ji}, \]

subject to

\[ \sum_{i=1}^{n} x_{ji} = 1 \quad (j = 1, \ldots, n), \qquad \sum_{j=1}^{n} x_{ji} = 1 \quad (i = 1, \ldots, n), \qquad x_{ji} = 1 \text{ or } 0 \quad \forall i, j, \]

where $C = (c_{ji})$ is a given cost matrix and $X = (x_{ji})$ is the solution matrix. The constraints enforce unique matching.

The linear assignment problem is a special type of linear programming problem.

Dynamic and linear programming guarantee a globally optimal assignment solution.


These methods find a set of pairs $\{(j, i)\}$ with unique $j, i = 1, \ldots, n$ which maximises the objective function $Z$ subject to the constraints. Efficient linear assignment algorithms include variants of the Hungarian method (Kuhn, 1955; Hung and Rom, 1980; Karp, 1980; Jonker and Volgenant, 1987; Wright, 1990; Murty Katta, 1968). Another class of linear assignment algorithms includes the general simplex algorithm and simplex-based algorithms².

To accommodate the coffin bin, we define a cost matrix

\[ C = (c_{ji}), \qquad j = -(2m - n - 1), \ldots, 0, 1, \ldots, n, \qquad i = -(m - 1), \ldots, 0, 1, \ldots, m \]

with

\[ c_{ji} = \begin{cases} p_{j'i'}, & \text{for } i = i',\ j = j' \text{ and } i' = 0, \ldots, m;\ j' = 1, \ldots, n \\ p_{j'0}, & \text{for } i < 0,\ j = j' \text{ and } j' = 1, \ldots, n \\ 0, & \text{for } j \le 0. \end{cases} \]

Thus, to allow any $x_j$ to be assigned to the coffin bin, $m - 1$ columns are added to the matrix $(c_{ji})$. This matrix is made square by adding extra rows (dummy $x_j$s) of zeros. Linear programming is then used to solve for the assignments which maximise the objective function $Z$. Linear programming is more efficient than dynamic programming; e.g. Karp (1980) gives a linear programming algorithm with expected execution time of order $O(mn \log n)$.
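The construction of the augmented cost matrix can be sketched as follows. This is an illustration under stated assumptions, not the solver used in the thesis: the function names are hypothetical, the column order (real columns first, then the $m$ coffin bin columns) is arbitrary, and the optimum is found by brute-force enumeration of permutations, which is only feasible for very small $m$. In practice a Hungarian-type or simplex-based solver would take the place of the enumeration.

```python
from itertools import permutations

def augmented_cost(p):
    """Build the 2m x 2m matrix from an n x (m+1) soft matrix (column 0 = coffin bin)."""
    n = len(p)
    m = len(p[0]) - 1
    size = 2 * m
    if n > size:
        raise ValueError("this construction needs n <= 2m")
    rows = [list(p[j][1:]) + [p[j][0]] * m for j in range(n)]  # real cols, then coffin copies
    rows += [[0.0] * size for _ in range(size - n)]            # dummy rows of zeros
    return rows

def assign_with_coffin_bin(p):
    """Return labels (0 = coffin bin, 1..m = match) maximising Z by brute force."""
    cost = augmented_cost(p)
    size = len(cost)
    n = len(p)
    m = len(p[0]) - 1
    best = max(
        permutations(range(size)),
        key=lambda perm: sum(cost[r][perm[r]] for r in range(size)),
    )
    return [best[j] + 1 if best[j] < m else 0 for j in range(n)]
```

Because each real row carries its coffin bin probability in every coffin column, a point is sent to the coffin bin exactly when that is part of the assignment with the largest total probability.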

Threshold level

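A threshold value, say $\delta$, is chosen for $p_{ji}$, and $p'_{ji}$ is set to 0 or 1 according to

\[ p'_{ji} = \begin{cases} 1 & \text{if } p_{ji} \ge \delta \\ 0 & \text{if } p_{ji} < \delta. \end{cases} \]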

This approach does not guarantee that we get a “permutation” matrix. In theory it is possible for more than one x to match a particular µ, or vice versa. This is especially true when δ is small and (pji) is not a doubly stochastic matrix. This problem can be overcome by assigning such matches to the coffin bin or by arbitrarily breaking ties. A better approach would be to pass the resultant matrix to a linear assignment algorithm to identify an optimal solution. However, after thresholding the matrix has less information than the original one, so it is better simply to pass the original matrix to a linear assignment algorithm. If only strong matches are desired, thresholding can be combined with the greedy algorithm's condition of being the maximum in both a row and a column.

² An example of a linear programming simplex-based algorithm implementation is LPSOLVE, http://www.cs.sunysb.edu/~algorith/implement/lpsolve/implement.shtml

Sinkhorn method

We can use the Sinkhorn method of iteratively normalising rows and columns to get a doubly stochastic matrix from (pji). A drawback is that the method applies only to a square matrix, so it requires n = m. One must also decide what to do with the coffin bin. Pedersen (2002) uses a simple heuristic in an “extended Sinkhorn” method whereby each entry in the coffin bin is adjusted so that the row and column totals sum to one. The coffin bin is used in normalisation as well. This approach also suffers from the possibility of ties, i.e. more than one x matching a particular µ or vice versa. Rangarajan and Gold (1996) use a “winner takes all” approach when the matrix is not square. In the “winner takes all” approach, i is assigned to j if pji is the maximum entry in the jth row. This approach is a partial greedy algorithm (the greedy algorithm assigns if and only if the entry is both a row and a column maximum).
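The basic Sinkhorn normalisation described above can be sketched as follows for a square matrix with strictly positive entries. This is the plain alternating row and column scaling only, not the “extended Sinkhorn” heuristic of Pedersen (2002); the function name and the fixed iteration count are assumptions made for illustration.

```python
def sinkhorn(matrix, iterations=100):
    """Alternately normalise rows and columns of a square positive matrix."""
    a = [list(row) for row in matrix]
    size = len(a)
    for _ in range(iterations):
        for r in range(size):                       # row normalisation
            row_sum = sum(a[r])
            a[r] = [v / row_sum for v in a[r]]
        for c in range(size):                       # column normalisation
            col_sum = sum(a[r][c] for r in range(size))
            for r in range(size):
                a[r][c] /= col_sum
    return a
```

After enough iterations both the row sums and the column sums are close to one, i.e. the result is approximately doubly stochastic.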

Another attempt to solve the problem of hardening matches is the binarisation algorithm of Pedersen (2002), which applies thresholding and a “winner takes all” approach, and allocates ties to a coffin bin, on columns and rows separately. Finally a greedy algorithm is used on the thresholded (pji), i.e. matches are assigned only if supported by both column-wise and row-wise operations.

Below we compare the improved Hungarian method of Jonker and Volgenant (1987) with the greedy and binarisation algorithms. Figure 5.1 is a plot of correct correspondence proportions for these methods. As expected, the linear assignment method outperforms both the greedy algorithm and the binarisation algorithm of Pedersen (2002). We do not consider the Sinkhorn method because it is unsuitable for non-square matrices. We leave out thresholding because it is not guaranteed to resolve ambiguous matches and hence might require post-processing by the other methods considered. We do not give results for the “winner takes all” approach of Rangarajan and Gold (1996) as it is a “partial greedy” algorithm, which is already considered.

[Figure 5.1: plot of correct correspondence (vertical axis) against point-set size (horizontal axis); legend: greedy, LA, P bin.]

Figure 5.1: Correct correspondence proportions for the greedy algorithm, linear assignment (LA) and the binarisation algorithm of Pedersen (2002) (P bin).

5.2 Concomitant Information in the Mixture Model

Consider that concomitant information, say colour of points is available. We consider

ways of using this extra information to solve the problem of correspondence and

alignment.


5.2.1 Concomitant Information Model

We observe $c = (c_1, \ldots, c_n)^T$ in addition to the point coordinates $\{x_j\}$, where $c_j$ is the colour of $x_j$. We also have coordinates $\{\mu_i\}$ and their colours $k = (k_1, \ldots, k_m)^T$. $c_j$ and $k_i$ can take discrete values, say $1, 2, \ldots, a$, for $j = 1, \ldots, n$ and $i = 1, \ldots, m$. We would have $a = 20$ and $a = 4$ respectively for the amino acid types and groups in Table 5.4.

Denote the frequency of colour $c_j$ by $f_{c_j}$. Further, denote the transition probability of mutating from colour $k_i$ to $c_j$ by $m_{k_i c_j}$. For amino acid types, the substitution matrix in Figure 3.3 can be used for these transition probabilities. We assume that the coordinate generating process $X$ is independent of the colour generating process $C$. As before, let $L$ be a set of labels. We consider a conditional colour substitution model for the points given the labels, $L$: $\pi(j) = i$. Like $\{\mu_i\}$, we assume $k$ is fixed. The model for $C$ conditional on $L$ is

\[ \psi(C_j \mid \pi(j) = i) = m_{k_i c_j} \]

where $m_{k_i c_j}$ is the probability of colour $k_i$ mutating to colour $c_j$, with $k_i, c_j = 1, 2, \ldots, a$. The marginal probability mass function is

\[ P(C_j = c_j) = f_{c_j}. \]

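The likelihood is

\[ Q(X, C \mid L) = \prod_{i=0}^{m} \prod_{j=1}^{n} \left\{ p_i \, \phi(x_j \mid \pi(j) = i) \, \psi(C_j \mid \pi(j) = i) \right\}^{I[\pi(j)=i]} \]

where $I$ is an indicator function as before.

Hence the log likelihood is

\[ \log Q(X, C \mid L) = \sum_{i=0}^{m} \sum_{j=1}^{n} I[\pi(j) = i] \left\{ \log p_i + \log \phi(x_j \mid \pi(j) = i) + \log \psi(C_j \mid \pi(j) = i) \right\}. \tag{5.6} \]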


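The posterior probability is

\[
\begin{aligned}
p_{ji} = P(\pi(j) = i \mid x_j, C_j)
 &= \frac{P(x_j, C_j \mid \pi(j) = i)}{P(x_j, C_j)} \, P(\pi(j) = i) \\
 &= \frac{P(x_j \mid \pi(j) = i) \, \psi(C_j \mid \pi(j) = i)}{P(x_j) \, P(C_j)} \, p_i \\
 &= \frac{P(x_j \mid \pi(j) = i) \, m_{k_i c_j}}{P(x_j) \, f_{c_j}} \, p_i \\
 &= \frac{P(x_j \mid \pi(j) = i)}{P(x_j)} \, \omega_{ji} \, p_i, \qquad \text{where } \omega_{ji} = \frac{m_{k_i c_j}}{f_{c_j}}.
\end{aligned}
\tag{5.7}
\]

Note that $\psi(C_j \mid \pi(j) = 0) = f_{c_j}$, i.e. the marginal probability, as the colour for the coffin bin can take all possible values. Hence $\omega_{j0} = f_{c_j} / f_{c_j} = 1$.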

Thus in the EM algorithm, we only modify the E-step. We can view ωji as the weight given to preferring a match of xj to µi based on colour information. The higher the probability of mutating from ki to cj, the larger the weight. Also, for the same transition probability from ki to cj or to cj′, the higher the natural abundance of cj relative to cj′, the smaller the weight given to xj matching µi compared with xj′ matching µi.

Thus we have devised one simple way of incorporating colour information in the EM algorithm through weights. Next, we study practical approaches to weighting a match of xj to µi based on colour information.
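A minimal sketch of the colour-weighted E-step for a single point is given below. The two-colour transition matrix and frequencies used in the usage lines are invented purely for illustration (they are not taken from any substitution matrix), colours are indexed from 0 for convenience, and the function names are hypothetical. The weighted terms are renormalised so that the posteriors for each point sum to one.

```python
def colour_weight(transition, freq, k_i, c_j):
    """omega_ji = m_{k_i c_j} / f_{c_j} for template colour k_i and observed colour c_j."""
    return transition[k_i][c_j] / freq[c_j]

def weighted_posteriors(geometric_terms, weights):
    """Combine p_i * phi(x_j | pi(j) = i) with colour weights and renormalise.

    Both lists are indexed by label i = 0..m; weights[0] should be 1 (coffin bin).
    """
    unnormalised = [g * w for g, w in zip(geometric_terms, weights)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical two-colour example (values invented for illustration).
transition = [[0.9, 0.1], [0.2, 0.8]]   # m_{k c}: rows k, columns c
freq = [2.0 / 3.0, 1.0 / 3.0]           # f_c: stationary frequencies of `transition`
omega_same = colour_weight(transition, freq, 0, 0)
omega_diff = colour_weight(transition, freq, 0, 1)
```

With these invented values a same-colour match is up-weighted and a different-colour match is down-weighted, while the coffin bin keeps weight one.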

5.2.2 Colour Weighting

As the motivating application for these methods is the matching of functional sites

in Bioinformatics, we consider practical ways of weighting the posterior probabilities

pji in the EM algorithm when matching functional sites. We consider amino acid

classes or types as colours. Let class of µi and xj be ki and cj respectively. We

investigate three weighting schemes.

(a) Amino acid substitution matrix weights.

(b) Ad hoc weights.

(c) Simple prior conditional probabilities.


Substitution matrix weights

One might consider weighting the posterior probabilities pji in the EM algorithm with substitution model transition probabilities, i.e. if the type of µi is ki and that of xj is cj, then pji is weighted with $m_{k_i, c_j}$, the substitution matrix entry (ki, cj) in Table 3.3. The normalised weighted pji values are then used in the EM algorithm.

Ad hoc weights

The substitution model $m_{c_j, k_i}$ in functional sites is not well characterised and differs from the one in the mutation data from which commonly used substitution matrices are derived. Hence we develop ad hoc weights, $(w_{ji})$, to be used in matching functional sites data. These weights are data-driven.

Our method is to weight the posterior probabilities $p_{ji}$ as follows:

\[ w_{j,i} = \begin{cases} \dfrac{\alpha}{\alpha \times s_j + d_j} & \text{if } c_j = k_i \\[2ex] \dfrac{1}{\alpha \times s_j + d_j} & \text{if } c_j \ne k_i \end{cases} \]

where

$s_j$ = number of points in $\{\mu\}$ with the same type as $x_j$;

$d_j$ = number of points in $\{\mu\}$ with a type different from that of $x_j$;

$s_j + d_j = m$;

$\alpha$ controls how much more weight is given to matching amino acids of the same type than to those of different types.

Note that $\sum_{i=1}^{m} w_{j,i} = \dfrac{\alpha \times s_j}{\alpha \times s_j + d_j} + \dfrac{d_j}{\alpha \times s_j + d_j} = 1$.

Then we use normalised weighted $p_{ji}$ values in the EM algorithm. To illustrate this, assume we have 8 and 6 coloured points from the $\{\mu_i\}$ and $\{x_j\}$ point sets respectively. Denote the colours of $\mu_i$ and $x_j$ by $k_i$ and $c_j$ respectively. We consider two examples.

Example 5.2.1. Suppose the observed colours are as in Table 5.1.


Table 5.1: Example 5.2.1 observed colour.

i/j 1 2 3 4 5 6 7 8

k 1 2 1 1 1 3 4 3

c 1 3 1 1 4 3

For these data, with α = 2, the weight matrix is

W = (wji) =

0.17 0.08 0.17 0.17 0.17 0.08 0.08 0.08

0.10 0.10 0.10 0.10 0.10 0.20 0.10 0.20

0.17 0.08 0.17 0.17 0.17 0.08 0.08 0.08

0.17 0.08 0.17 0.17 0.17 0.08 0.08 0.08

0.11 0.11 0.11 0.11 0.11 0.11 0.22 0.11

0.10 0.10 0.10 0.10 0.10 0.20 0.10 0.20

Figure 5.2 is a graphical depiction of this weight function.

Figure 5.2: Illustrative example of data-driven weights for matching.


Example 5.2.2. Suppose that all colours are also observed in {µ}. Let us say c3 = k3 = 2 for the data in Example 5.2.1. Hence we have:

Table 5.2: Example 5.2.2 observed colours.

i/j 1 2 3 4 5 6 7 8

k 1 2 2 1 1 3 4 3

c 1 3 2 1 4 3

For these data and α = 5, the weight matrix is

W = (wji) =

0.25 0.05 0.05 0.25 0.25 0.05 0.05 0.05

0.06 0.06 0.06 0.06 0.06 0.31 0.06 0.31

0.06 0.31 0.31 0.06 0.06 0.06 0.06 0.06

0.25 0.05 0.05 0.25 0.25 0.05 0.05 0.05

0.08 0.08 0.08 0.08 0.08 0.08 0.42 0.08

0.06 0.06 0.06 0.06 0.06 0.31 0.06 0.31
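The two weight matrices above can be reproduced with a short sketch of the ad hoc weighting rule. The function name `ad_hoc_weights` is hypothetical; the colour vectors are those of Examples 5.2.1 and 5.2.2.

```python
def ad_hoc_weights(template_colours, observed_colours, alpha):
    """Return the n x m matrix (w_ji) of data-driven colour weights."""
    m = len(template_colours)
    weights = []
    for c in observed_colours:
        s = sum(1 for k in template_colours if k == c)  # s_j: same type as x_j
        d = m - s                                       # d_j: different type
        denom = alpha * s + d
        weights.append([(alpha if k == c else 1.0) / denom for k in template_colours])
    return weights

# Example 5.2.1 (alpha = 2) and Example 5.2.2 (alpha = 5).
w1 = ad_hoc_weights([1, 2, 1, 1, 1, 3, 4, 3], [1, 3, 1, 1, 4, 3], alpha=2)
w2 = ad_hoc_weights([1, 2, 2, 1, 1, 3, 4, 3], [1, 3, 2, 1, 4, 3], alpha=5)
```

Each row of the result sums to one, in agreement with the note following the definition of the weights, and the entries agree with the printed matrices to two decimal places.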

Simple prior conditional probabilities

Here we consider formulating simple prior conditional probabilities as weights. Let $a$ be the number of colours. Define the prior probability $P(C_j = k_i \mid \pi(j) = i) = \beta / a$. Hence $P(C_j \ne k_i \mid \pi(j) = i) = 1 - \beta / a$. With uniform prior conditional probabilities on the colours other than $k_i$, the conditional mass function is:

\[ P(C_j = c_j \mid \pi(j) = i) = \begin{cases} \dfrac{\beta}{a} & \text{if } c_j = k_i \\[2ex] \dfrac{a - \beta}{a^2 - a} & \text{otherwise} \end{cases} \]

with $\beta = 1, \ldots, a - 1$, where $a$ is the number of colours and $k_i$ is the colour of $\mu_i$.
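As a quick check of this mass function, the sketch below evaluates it and confirms that the probabilities over all $a$ colours sum to one. The function name `colour_prior` is hypothetical.

```python
def colour_prior(c_j, k_i, a, beta):
    """P(C_j = c_j | pi(j) = i) under the simple prior with a colours."""
    if c_j == k_i:
        return beta / a
    return (a - beta) / (a * a - a)

# With a = 4 colours and beta = 3, the matching colour gets 3/4 and
# each of the three other colours gets 1/12.
probabilities = [colour_prior(c, 1, a=4, beta=3) for c in (1, 2, 3, 4)]
```

The mismatch probability is just $(1 - \beta/a)$ spread uniformly over the remaining $a - 1$ colours.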

Comments

Weighting posterior probabilities can be seen as a weighted likelihood approach. We

are merely maximising the weighted likelihood:


\[ L_w = \prod_{\pi(j) = i} w_{ji} \, p_{ji}. \tag{5.8} \]

As this is just a typical statistical model, the solution with the highest likelihood might not necessarily be the most biochemically plausible. Other quantities apart from the likelihood might illuminate biochemical plausibility better. For example, one might consider a score combining a count of colour matches (or other functions of weights, matches/mismatches, etc.) and RMSD, where RMSD is the contribution from geometrical matching. A weighted likelihood can also be viewed as a matching score: the weights are the contributions from colour matching while the pji values are the geometrical matching contributions. On the other hand, the method of Gold (2003) requires “strict” geometrical matching first and then computes a score which is a function of colour and geometrical (RMSD) matching measures.

5.2.3 Evaluation

Simulated data are used to assess the performance of the proposed methods for using concomitant information. Correct correspondence proportion is used for evaluating the methods on simulated data, as outlined in section 3.2.1. We evaluated correct correspondence proportions for varying values of m and n, as the size of the configurations affects the efficiency of matching algorithms.

These methods are also applied to real data in section 5.2.4. We compare the scores (see section 3.2.2) obtained by these methods and by the graph method of Gold (2003).

Different weighting schemes

If exact substitution weights from substitution matrices were easy to come by, they would be intuitively appealing, as they would parametrically model the underlying data generating mechanism. Figure 5.3 gives results for various weighting schemes in a simulation study. In this simulation, 4 colours (a = 4) were used. Simple prior conditional probabilities and ad hoc weights gave comparable performance, while exact substitution weights gave the best performance, as expected. For ad hoc weights α = 4 was used, while β = a − 1 = 3 was used for the simple prior probabilities approach.

[Figure 5.3: plot of correct correspondence (vertical axis) against point-set size (horizontal axis); legend: No wgts, Ad hoc wgts, Subs. Probs, Bayesian.]

Figure 5.3: Correct correspondence proportions for various weighting schemes. Bayesian: simple prior conditional probabilities.

Ad hoc weights

We evaluate how increasingly penalising matches between points with discordant colours affects the performance of the algorithm. Figure 5.4 is a plot of correct correspondence proportions for various discordant colour match penalties. Performance is measured as the proportion of correctly identified correspondences. For the results presented here, soft matches are hardened using the greedy algorithm on the (pji) matrix (excluding the coffin bin column). Although linear assignment gives the best performance in Figure 5.1, the difference from the greedy algorithm is marginal and linear assignment is computationally intensive. Using the greedy algorithm for all values of α does not affect the study of the effects of different levels of α.

From simulations we observe that:


[Figure 5.4: plot of correct correspondence (vertical axis) against point-set size (horizontal axis); legend: α = 1, 2, 3, 4, 5.]

Figure 5.4: Correct correspondence proportions for various α levels.

(a) Results when colour mutation is allowed and when it is not allowed are very close to each other.

(b) Geometric information (with this hardcore model) is so rich

i. that with ground truth initial parameters for dispersion (i.e. σ2 = 0.5), even without taking colour information into account, EM Procrustes gives correct correspondence proportions greater than 0.96, that is, at most 1 or 2 wrongly matched points.

ii. however, with a small perturbation of the initial parameters, say using σ2 = 4 as the initial dispersion parameter estimate, the EM algorithm converges to local maxima in a few more cases (the correct correspondence proportion is around 0.93). This is a serious drawback because the combinatorial nature of the problem, even in the motivating applications, will surely lead to a likelihood function with many spikes. However, using ad hoc colour weights guides the algorithm to global maxima in a few more cases (correct correspondence is back to > 0.96). Figure 5.4 shows the improvements in matching when different values of α are used. It seems that α = a, the number of colours, gives quite substantial improvements. There are only marginal improvements if the value of α is increased further.

(c) What is interesting here is that using colour information in this way increases the volume of the region of initial parameter estimates for which the EM algorithm converges to a global maximum parameter vector. As an example, Figure 5.5 shows parts of the square region of starting values over which the algorithm converges to a global maximum parameter vector with and without weights. Parameters θ1 and θ2 are 2 of the 3 rotation angles. For this evaluation (see section 3.2), m = 48 and only a single dataset (the worst case scenario in our simulations) is used. In the dataset, 10% of the points in {µ} had no corresponding points in {x}, and σ = 2 was used for the noise in the {x} coordinates (see section 3.1.1).

5.2.4 Application on Matching Functional Sites

As stated in section 1.1.7, it is highly desirable to have a high number of same amino acid (residue) matches: the higher the number of similar matches, the better the match. We use concomitant information in matching real functional sites. We compare the results of the EM algorithm with and without concomitant information and of the graph method of Gold (2003).

We did pair-wise matching between functional sites from three different protein families. These functional sites are listed in Table 5.3.

5.2.5 Using Amino Acid Group Information

In this application, amino acid group was used as concomitant information in the EM algorithm. Four groups were used for matching and scoring purposes. At present there is no substitution matrix specifically for functional sites in the literature, and existing substitution matrices may not represent substitution in functional sites very well. Functional sites tend to be conserved more than the rest of the protein (Sanchez and Sali, 1998). We used ad hoc weights (section 5.2.2) with α = 2 when matching with concomitant information. Amino acids were grouped into hydrophobic, charged, polar and glycine (see Table 5.4). The difference in centres of mass of the configurations and the identity matrix were taken as the starting values for the translation vector and rotation matrix respectively.

Figure 5.5: Convergence regions of starting values for the EM algorithm. The algorithm converges to a global optimum for pink values. We get some local optimum convergence for lighter values, otherwise no convergence at all.

Table 5.3: Selected functional site examples for comparing results when using or not using concomitant information in the EM algorithm and the graph method.

Family/Fold                          Protein                                  Functional site   # of residues
Tim barrel superfold                 5-aminolevulinic acid                    1b4e 0            21
                                     5-aminolaevulinate dehydratase           1aw5 5            6
Tyrosine dependent oxidoreductase    17-β hydroxysteroid dehydrogenase        1a27 0            63
                                     NADP-dependent mannitol dehydrogenase    1h5q 0            88
                                     Trihydroxynaphtalene reductase           1g0n 0            43
                                     Carbonyl reductase                       1cyd 1            40
SER-HIS-ASP catalytic triad          Subtilisin carlsberg                     1bfk 0            38
                                     Aspartate aminotransferase               1ajr 0            28
                                     Glutaminase asparaginase                 3pga 0            63

Table 5.4: Groups of amino acids (Branden and Tooze, 1999, p. 6).

Symbols:                 A C D E F G H I K L M N P Q R S T V W Y
Group 1 (hydrophobic)    A F I L M P V
Group 2 (charged)        D E K R
Group 3 (polar)          C H N Q S T W Y
Group 4 (glycine)        G

Table 5.5 summarises the results of the EM algorithm with and without amino acid group information when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. Table 5.6 summarises the results obtained by the graph method. We use three scoring options as defined in section 3.2.2. The final score (Score*) is obtained by dividing the option 2 raw score by the RMSD. As a rule of thumb, the bigger the score the better the solution. All scores from the EM algorithm using colour are bigger than those obtained without colour information. The EM algorithm using colour also finds better matches for 1bfk 0, 1cyd 1 and 3pga 0 than the graph method. The EM algorithm did not converge for 1h5q 0, which has a very large RMSD. In Table 5.7 we give a solution for 1h5q 0 after proper convergence, using the distance constraining techniques of section 5.3 to improve the EM algorithm.

Table 5.5: Comparison of matching results with and without colour when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites. A relative weight of α = 2 was used for similar amino acids when using colour information.

          No colour                              Colour
          Raw score option                       Raw score option
site      0    1    2      RMSD   Score*         0    1    2      RMSD   Score*
1ajr 0    7    0    1.0    5.13   0.19           13   2    5.5    4.06   1.35
1b4e 0    12   1    2.0    2.79   0.72           13   1    3.0    2.67   1.12
1bfk 0    12   1    2.5    5.24   0.48           18   4    6.5    3.36   1.93
1cyd 1    32   11   15.0   1.82   8.24           31   12   16.0   1.81   8.85
1g0n 0    19   2    4.0    3.33   1.20           22   5    9.5    4.05   2.35
1h5q 0    20   1    7.0    9.35   0.75           22   4    21.0   8.99   2.34
3pga 0    13   1    3.0    5.89   0.51           24   4    11.5   4.82   2.39

Score* = option 2 raw score divided by the RMSD.
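The final score in the table is a simple ratio, sketched below with a few entries of Table 5.5 as input. The function name `final_score` and the dictionary layout are hypothetical; small differences in the last printed digit can arise because the tabulated RMSD values are themselves rounded.

```python
def final_score(option2_raw, rmsd):
    """Score* = option 2 raw score divided by the RMSD."""
    return option2_raw / rmsd

# (option 2 raw score, RMSD) pairs copied from Table 5.5.
no_colour = {"1ajr 0": (1.0, 5.13), "1cyd 1": (15.0, 1.82), "3pga 0": (3.0, 5.89)}
colour = {"1ajr 0": (5.5, 4.06), "1cyd 1": (16.0, 1.81), "3pga 0": (11.5, 4.82)}

improved = {
    site: final_score(*colour[site]) > final_score(*no_colour[site])
    for site in no_colour
}
```

For these three sites the score with colour information exceeds the score without it, in line with the comparison made in the text.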

Figure 5.7 illustrates that increasing the weight for same group residues (amino acids) increases the number of same group matches. Figure 5.6 shows the superimposition of 17-β hydroxysteroid dehydrogenase on carbonyl reductase when the EM algorithm is used. There are 27 matches common to the colour and no colour methods. There are 3 pairs matched exclusively when using colour information, and 2 pairs matched exclusively when not using colour information. Two of the 3 pairs matched exclusively when using colour are for identical amino acids; the other pair is for same group amino acids. However, amino acids from different groups are matched in the two pairs that are exclusive to matching without colour information.

Table 5.6: Results when matching a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against other functional sites using the method of Gold (2003).

          Raw score option
site      0    1    2      RMSD   Score*
1ajr 0    12   0    3.0    1.85   1.62
1b4e 0    10   2    5.0    4.19   1.19
1bfk 0    12   1    3.5    2.44   1.43
1cyd 1    27   14   18.5   3.31   5.59
1g0n 0    31   13   21.0   3.07   6.84
1h5q 0    33   16   22.0   2.72   8.09
3pga 0    15   1    4.5    3.20   1.41

Score* = option 2 raw score divided by the RMSD.


Figure 5.6: Superposition of carbonyl reductase and 17-β hydroxysteroid dehydrogenase sites when matching with the EM algorithm. Amino acid class information is not used in (a) but is used in (b).

Advantages of using amino acid grouping

Use of amino acid group as concomitant information increases

(a) The number of same group/residue matches.

(b) The volume of the region of initial parameter estimates for which the EM algorithm converges to a global maximum parameter vector.

Challenges of using amino acid group information

Using amino acid group information in this way to increase the number of same residue matches might come at the expense of the overall number of geometrical matches. Increasing the number of same residue matches sometimes also leads to an increase in RMSD, as seen in Figure 5.7 for matching 17-β hydroxysteroid dehydrogenase and 5-aminolevulinic acid.

5.2.6 Summarising Comments

• Use of amino acid type information through weights improves the quality of the match.

• From experimentation, if the total number of colours is a, then setting α = a for ad hoc weights and β = a − 1 for simple prior conditional probabilities gives the best results. Heavier weights for similar residues come at the expense of geometrical matching (RMSD), and the gain in class matching is marginal (for each pair of sites there is a maximum number of possible class matches). A typical scenario is shown in Figure 5.7.

• Simulation studies show that with the use of concomitant information we are able to find a set of good starting values for the EM algorithm, and the algorithm converges faster. Figure 5.5 shows good starting values with and without the use of colour information.

• To overcome the problem of starting values, a simple approach would be to try several random starting values. However, we consider a more comprehensive approach using the Markov chain Monte Carlo (MCMC) technique in a Bayesian framework in Chapter 6.2.

5.3 Distance Constraints

The EM algorithm of sections 5.1 and 5.2 tends to match more points, and hence to give a larger RMSD, than the graph method. In the graph method, matching all inter-point distances enforces strict geometrical matching constraints. Here we consider further techniques to force matched points to be closer in the EM algorithm.

In addition to using a posterior probability weighted variance estimated at each iteration of the algorithm for the mixture model, we incorporate three techniques to ensure smaller distances between matched points:

(a) Variance cooling. If the variance increases from that of the previous estimate at

iteration t of N total number of iterations allowed then use:

Page 116: Statistical approaches to protein matching in Bioinformatics

Chapter 5. EM algorithm Alignment 93

1 2 3 4 5 6 7

0.4

0.6

0.8

1.0

1.2

weight: α

scor

e

option 1option 2

a)

1 2 3 4 5 6 7

24

68

1012

weight: α

mat

ches

same residue matchessame group matchestotal geometrical matches

b)

1 2 3 4 5 6 7

34

56

weight: α

RM

SD

c)

Score:

option 1 = No. identity matches

RMSD

option 2 = option 1 +No. similar matches

2 x RMSD

Figure 5.7: Match scores and RMSD against α (relative weight for similar amino

acids). Matched sites are 17−β hydroxysteroid dehydrogenase and 5-aminolevulinic

acid. a) Option 1 and 2 scores. b) Total number of pairs matched, pairs with the

same amino acid and pairs with the same group. c) RMSD.

Page 117: Statistical approaches to protein matching in Bioinformatics

Chapter 5. EM algorithm Alignment 94

σ2 = A0

(ANA0

)t/N

where A0 and AN are desired variance values at t = 0 and t = N respectively.

From an application in section 5.3.1 we observe easy convergence and better

RMSD values for A0 = 100 and A200 = σ2g = 0.32 in most cases. We choose

0.3 to correspond to the threshold value of 1.5A for matching distances in the

graph method (see section 4.1.4 in Chapter 4). Furthermore, under the Gaussian

model, the width of a 85% C.I. for matching distances in graph method is

2 × 1.04√

3 × 2σ2g . Equating this to threshold value of 1.5A gives σg = 0.297.

Kent et al. (2004) independently found out that using σ = 0.3 for EM algorithm

gives similar results to graph method when matching 17 − β hydroxysteroid

dehydrogenase and carbonyl reductase functional sites. And indeed, conversely,

using the graph solution when matching 17 − β hydroxysteroid dehydrogenase

and carbonyl reductase functional sites we estimate σ to be around 0.3A.

(b) Fixing the variance for the coffin bin, $\sigma_0^2$. This value is calculated from the volume of {µ}, $W$. Consider a sphere with volume $W$, i.e. $W = \frac{4}{3}\pi R^3$. Let $2\sigma_0 = R$; then
\[ \sigma_0^2 = \frac{1}{4}\left( \frac{3W}{4\pi} \right)^{2/3}. \]

(c) In linear programming, rule out correspondences with probability less than $c_L = \phi(r, \sigma_c^2 I_3)$, where $\phi(\cdot, \sigma_c^2 I_3)$ is the density of a zero-mean normal distribution in three dimensions with covariance $\sigma_c^2 I_3$, evaluated at a point at distance $r$ from the mean; $r$ and $\sigma_c^2$ are application-specific values to be specified by the user. We use a probability threshold value of 0.038, obtained for $r = 1$ and $\sigma_c^2 = 1.019$, which seems to give reasonably good results. Probability thresholding is similar to the Bayesian approach considered in section 6.1.5.
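The three tuning rules (a) to (c) above can be sketched numerically as follows. This is only an illustration, not the thesis implementation: the function names are hypothetical, and the threshold in (c) is computed under the assumption that φ(r, σ_c² I_3) denotes the trivariate normal density with covariance σ_c² I_3 evaluated at distance r from its mean, which reproduces the quoted value of about 0.038.

```python
import math


def annealed_variance(t, n_iter, a0, a_n):
    """Rule (a): variance at iteration t, moving geometrically from a0 (t = 0) to a_n (t = n_iter)."""
    return a0 * (a_n / a0) ** (t / n_iter)


def coffin_bin_variance(volume):
    """Rule (b): sigma_0^2 = (1/4) * (3W / (4*pi))**(2/3), i.e. 2*sigma_0 = R for a sphere of volume W."""
    return 0.25 * (3.0 * volume / (4.0 * math.pi)) ** (2.0 / 3.0)


def probability_threshold(r, sigma2_c, d=3):
    """Rule (c): density of N(0, sigma2_c * I_d) at a point at distance r from the mean (assumed reading)."""
    return (2.0 * math.pi * sigma2_c) ** (-d / 2.0) * math.exp(-0.5 * r * r / sigma2_c)


# Variance schedule with A_0 = 100 and A_200 = 0.3**2, as in (a)
schedule = [annealed_variance(t, 200, 100.0, 0.3 ** 2) for t in (0, 100, 200)]
# Threshold for r = 1 and sigma_c^2 = 1.019, as in (c)
c_l = probability_threshold(1.0, 1.019)
```

The geometric schedule keeps the variance large in early iterations, so that many correspondences remain plausible, and tightens it towards the graph-method tolerance at the end.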

5.3.1 Results

Here we consider both query and templates from the tyrosine-dependent oxidoreductases family. We compare a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) to representative sites from each of the 33 domains in this family. Sites


in the first column of Table 5.7 were chosen as representatives for their respective domains.

As in section 5.2.5, we used ad hoc weights with α = 2 when matching with

concomitant information (colour). Amino acids were also grouped into hydrophobic,

charged, polar and glycine. The difference in centres of mass for the configurations

and the identity matrix were taken to be the starting values for the translation

vector and rotation matrix respectively.

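Reported in Table 5.7 are RMSD values for the graph method and for the EM algorithm with and without the use of colour information. Also reported are the differences in rotation (column A of the table, written $\Delta_A$ below) between the rotations used to match the sites by the graph and EM algorithm methods. If $\hat{A}$ and $\tilde{A}$ are the rotation matrices from the graph method and the EM algorithm respectively, $\Delta_A$ is such that the trace of the orthogonal matrix taking $\hat{A}$ to $\tilde{A}$ is approximately equal to $1 + 2\cos\Delta_A$ (Green and Mardia, 2006). Thus
\[ \Delta_A = \cos^{-1}\left( \frac{\mathrm{tr}(\hat{A}\tilde{A}^T) - 1}{2} \right). \]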
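As a small illustration of the rotation difference defined through the trace of the product of one rotation matrix with the transpose of the other, the sketch below computes the angle with NumPy. The helper names are hypothetical, and the clipping step is an added safeguard against rounding error rather than part of the definition.

```python
import numpy as np


def rotation_difference(a_first, a_second):
    """Angle (radians) of the rotation taking one rotation matrix to the other."""
    cos_angle = (np.trace(a_first @ a_second.T) - 1.0) / 2.0
    # Clip to [-1, 1] so that rounding error cannot push arccos out of its domain.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def rotation_about_z(theta):
    """Rotation by theta about the third coordinate axis (used only for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Two rotations about the same axis differing by 0.3 radians
angle = rotation_difference(rotation_about_z(0.2), rotation_about_z(0.5))
```

The value lies in [0, π], so entries near 3 in the rotation-difference columns of Table 5.7 correspond to almost opposite orientations.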

Results show that these distance constraining techniques considerably lower the number of matching points and RMSD. Solutions for 1h5q 0 when using the EM algorithm are now comparable to the graph theoretic solution, unlike in Table 5.6 where the EM algorithm possibly did not converge. In general, the higher the number of matching points (q) and the lower the RMSD, the better the solution. RMSD and q can be combined into a single score, as in section 5.2.4 (Tables 5.5 and 5.6), to rank the matches. Alternatively p-values, as in Chapter 4, section 4.1.4 (Table 4.1), can be used. Here, however, we just informally note a number of cases with clearly better solutions by the EM algorithm compared to solutions by the graph method (cases italicised in Table 5.7). Obvious cases are solutions with many more matching points and RMSD of similar magnitude, or solutions with much lower RMSD but a comparable number of matching points.

5.4 Multiple Transformations

For simplicity we consider a situation where the configuration {x} is related to {µ} through two different transformations. The extension to many transformations is


Table 5.7: Matching statistics for 17-β hydroxysteroid dehydrogenase functional site (1a27 0) against representative functional sites using the EM algorithm method with and without colour information use and the Graph method. Italicised cases have qualitatively better solutions by the EM algorithm compared to graph solutions. Column A is the rotation difference (in radians) between the EM and graph solutions; q and N are the numbers of matched points.

            Colour                  No Colour               Graph
Site        A      RMSD   q         A      RMSD   q         RMSD   N
1udc 0      1.700  3.658  31        1.112  4.749  31        3.605  17
1bxk 0      1.024  2.692  26        0.452  3.631  29        4.203  19
1n2s 0      1.717  0.915   3        1.162  1.672   4        2.547   4
1e6u 0      0.601  3.627  45        3.099  5.648  34        1.159  23
1eq2 0      0.138  5.093  31        0.548  3.583  28        1.648  19
1i24 0      0.439  2.031  13        0.447  1.947  13        3.763  15
1k6x 0      1.661  2.985  18        1.319  6.381  12        2.100  10
1cyd 0      3.037  3.262  22        0.985  2.978  33        1.575  25
1oaa 0      1.131  3.638  35        0.534  3.194  32        1.686  23
1fdv 0      0.009  0.418  51        0.009  0.418  51        0.423  51
1fmc 0      2.218  3.688  35        1.331  5.120  24        1.000  25
1hdc 0      0.648  4.881  24        0.614  1.358  10        2.556  11
1fk8 0      0.578  2.779  25        0.645  2.722  23        0.956  24
1nff 0      1.576  2.806  12        2.766  4.864  10        2.356  28
1nxq 0      1.182  1.061   3        1.668  0.243   4        1.274   4
1bdb 0      0.312  2.561  23        0.313  2.561  23        0.874  40
1b16 0      1.412  1.404  12        2.463  5.388  25        0.823  38
1gco 0      0.515  3.052  26        0.541  3.181  28        0.773  25
1geg 0      0.600  1.034  37        0.428  2.770  33        0.959  30
1iy8 0      1.422  3.796  27        1.423  3.758  27        0.853  26
1h5q 0      2.206  2.347  19        2.206  2.593  21        2.723  33
1gz6 0      1.368  2.751  45        1.327  4.200  59        1.203  27
1edo 0      1.585  2.717  16        1.543  3.343  18        0.845  25
1eno 0      1.068  3.195  18        1.067  3.195  18        1.078  21
1ae1 0      2.190  3.257  19        0.864  4.457  17        0.890  24
1g0o 0      0.520  2.830  39        0.511  2.805  39        0.813  29
1ja9 0      1.011  3.986  31        0.752  3.517  21        2.173  28
1hdo 0      1.803  3.322  38        1.140  3.707  37        2.165  20
1e6w 0      0.173  2.567  57        0.223  4.341  57        2.632  39
1n5d 0      1.905  1.069   3        1.905  1.069   3        1.134   3


straightforward.

In the two-transformation case, we assume the set of points {x} is divided into two distinct sets:

S1 = {xj}: collection of points with the first transformation.

S2 = {x′j}: collection of points with the second transformation.

On the other hand, the set {µ} is divided into three distinct sets:

G1 = {µi}: collection of points corresponding to points in S1.

G2 = {µ′i}: collection of points corresponding to points in S2.

G3 = {µ′′i}: collection of points with no corresponding points in S1 ∪ S2.

However, the set membership of the points is not known.

5.4.1 Soft Matching Model

As in the one-transformation case, let $\{\mu_i\}$, $i = 1, 2, \ldots, m$, and $\{x_j\}$, $j = 1, 2, \ldots, n$, $m \ge n$, be two sets of sites in a region $W$ of $\mathbb{R}^d$. Let $\pi(j) = i$, where $i \in \{0, 1, \ldots, m\}$, be a map of correspondence between $x_j$ and $\mu_i$. Now we define a map $H(j) = s$ if $x_j$ arises from $\mu_i$, $i = 1, \ldots, m$, through transformation $s = 1, 2$. For compact notation, we introduce a joint map $\gamma_s(j) = i$ iff $\pi(j) = i$ and $H(j) = s$. If $\gamma_s(j) = i$, assume
\[ x_j = A_s^T \mu_i + b_s + \varepsilon_i, \]
where the $\varepsilon_i$ are independent $N(0, \sigma^2 I_d)$, $\sigma^2$ is unknown and $A_s$ is an orthogonal matrix. That is, for fixed $j$, we take for $i = 0, 1, \ldots, m$,
\[ \phi(x_j \mid \pi(j) = i, H(j) = s) = \phi(x_j \mid \gamma_s(j) = i) =
\begin{cases}
(2\pi\sigma^2)^{-d/2} \exp\left\{ -\tfrac{1}{2}\|x_j - A_s^T\mu_i - b_s\|^2/\sigma^2 \right\} & \text{if } i \neq 0, \\
1/\|W\| & \text{if } i = 0.
\end{cases} \tag{5.9} \]
As per convention, $\pi(j) = 0$ or $\gamma_s(j) = 0$ is used to classify a point $x_j$ which does not correspond with any of the points $\mu_i$. In this case, we suppose that $x_j$ is uniformly distributed on $W$, i.e.
\[ x_j \mid (\gamma_s(j) = 0) \sim \mathrm{Uniform}(W). \]


Alternatively we can assume a normal distribution for coffin bin points (see section 5.1.1). In our experience the results are not sensitive to whether a uniform or a Gaussian distribution is used for the coffin bin.

The marginal distribution of $x_j$ is given by the mixture model
\[ x_j \sim \sum_{s=1}^{2}\sum_{i=1}^{m} P(\gamma_s(j) = i)\, N(A_s^T\mu_i + b_s, \sigma^2 I_d) + P(\gamma_s(j) = 0)\, \mathrm{Uniform}(W) \tag{5.10} \]
where $P(\gamma_s(j) = i)$, $i = 0, \ldots, m$, are marginal membership probabilities and
\[ \sum_{s=1}^{2}\sum_{i=1}^{m} P(\gamma_s(j) = i) + P(\gamma_s(j) = 0) = 1. \]

5.4.2 Model Likelihood

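Let $X = (x_1, \ldots, x_n)^T$, and let $L$ and $S$ be the sets of labels for the maps $\pi(j) = i$ and $H(j) = s$. Given $L$ and $S$, the likelihood is
\[ Q(X \mid L, S) = \prod_{s=1}^{2}\prod_{i=0}^{m}\prod_{j=1}^{n} p_i^{I[\gamma_s(j) = i]}\, \phi(x_j \mid \gamma_s(j) = i)^{I[\gamma_s(j) = i]} \]
where $I$ is an indicator function such that
\[ I[\gamma_s(j) = i] =
\begin{cases}
1 & \text{if } \gamma_s(j) = i \\
0 & \text{otherwise}
\end{cases} \]
and $p_i$ is the mixing probability for any $x$ to have label $i$, i.e. to be mapped to $\mu_i$ under transformation $s$.

Hence
\[ \log Q(X \mid L, S) = \sum_{s=1}^{2}\sum_{i=0}^{m}\sum_{j=1}^{n} \left\{ I[\gamma_s(j) = i] \log p_i + I[\gamma_s(j) = i] \log \phi(x_j \mid \gamma_s(j) = i) \right\}. \tag{5.11} \]
With the labels unknown, let
\[ p_i = P(\gamma_s(j) = i), \quad i = 1, 2, \ldots, m, \qquad \sum_{s=1}^{2}\sum_{i=1}^{m} p_i + p_0 = 1 \]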


be the prior probability that the label $\pi(j)$ is $i$ and the label $H(j)$ is $s$. The posterior probability is
\[ p_{ji} = P(\gamma_s(j) = i \mid x_j) = \frac{P(x_j \mid \gamma_s(j) = i)}{P(x_j)}\, p_i \]
and $(p_{ji})$ is an $n \times (2m + 1)$ matrix. Note that
\[ P(x_j) = \sum_{s=1}^{2}\sum_{i=1}^{m} p_i\, \phi(x_j \mid \gamma_s(j) = i) + p_0\, \phi(x_j \mid \gamma_s(j) = 0) \]
and $P(x_j \mid \gamma_s(j) = i) \equiv \phi(x_j \mid \gamma_s(j) = i)$.

It is straightforward to extend this model formulation to more than two groups of transformation. For the model to be identifiable, the number of transformations should obviously be much smaller than $n$.
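The posterior membership probabilities above can be computed in a few lines. The following is a minimal sketch under stated assumptions, not the thesis code: the function name is hypothetical, and the column layout of the n × (2m + 1) matrix (coffin bin first, then the m points under transformation 1, then the m points under transformation 2) is a choice made here for illustration.

```python
import numpy as np


def posterior_memberships(x, mu, rotations, translations, sigma2, p, volume):
    """Return the n x (2m + 1) matrix of posterior probabilities p_ji.

    x: (n, d) points, mu: (m, d) points, rotations/translations: lists (A_s, b_s),
    p: prior mixing probabilities of length 2m + 1 (assumed layout: coffin bin first),
    volume: volume of the region W used for the uniform coffin-bin density.
    """
    n, d = x.shape
    m = mu.shape[0]
    columns = [np.full(n, p[0] / volume)]              # coffin bin: p_0 * 1/|W|
    for s, (a_s, b_s) in enumerate(zip(rotations, translations)):
        fitted = mu @ a_s + b_s                         # row i is A_s^T mu_i + b_s
        sq_dist = ((x[:, None, :] - fitted[None, :, :]) ** 2).sum(axis=2)
        density = (2.0 * np.pi * sigma2) ** (-d / 2.0) * np.exp(-0.5 * sq_dist / sigma2)
        columns.append(density * p[1 + s * m: 1 + (s + 1) * m])
    numerators = np.column_stack(columns)
    return numerators / numerators.sum(axis=1, keepdims=True)


# Tiny example: two x points generated exactly by transformation 1 from mu_1 and mu_2
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
x = mu[:2].copy()
quarter_turn = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
post = posterior_memberships(
    x, mu, [np.eye(3), quarter_turn], [np.zeros(3), np.array([10.0, 0.0, 0.0])],
    sigma2=0.01, p=np.full(7, 1.0 / 7.0), volume=1000.0)
```

Each row of the result sums to one, so a row can be read as a soft assignment of x_j over the coffin bin and every (transformation, point) pair.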

5.4.3 The EM Algorithm

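A simple extension of an EM algorithm with a coffin bin to two separate transformations is considered.

Let $p_i$ be given, with starting values $p_i^{(0)} = 1/(2m + 1)$, say. Then the E-step is:
\[ p_{ji}^{(r+1)} = \frac{P(x_j \mid \gamma_s(j) = i)}{P(x_j)}\, p_i^{(r)}. \]
Substituting $p_{ji}$ for $I[\gamma_s(j) = i]$, the log likelihood is
\[ \sum_{s=1}^{2}\sum_{i=0}^{m}\sum_{j=1}^{n} \left\{ p_{ji} \log p_i + p_{ji} \log \phi(x_j \mid \gamma_s(j) = i) \right\}. \tag{5.12} \]
Thus in the M-step, we minimise
\[ f(A_1, A_2, b_1, b_2) = \sum_{s=1}^{2}\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji} \|x_j - A_s^T\mu_i - b_s\|^2 \tag{5.13} \]
using a Procrustes fit for rigid body motion, where $A_s$ is an orthogonal matrix. If $V_s \Gamma U_s^T$ is a singular value decomposition of
\[ B_s = \sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji} (\mu_i - \bar{\mu}_s)(x_j - \bar{x}_s)^T, \qquad \text{where} \quad \bar{\mu}_s = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}\mu_i}{\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}}, \]
with $\bar{x}_s$ the corresponding weighted mean of the $x_j$ (here $x_r^T$ and $y_r^T$ are the $r$th rows of $X$ and $Y$), then $A_s = V_s U_s^T$.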


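Thus for the $(r + 1)$th iteration we have
\[ B_s^{(r+1)} = \sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)} (\mu_i - \bar{\mu}_s)(x_j - \bar{x}_s)^T, \qquad A_s^{(r+1)} = (V_s U_s^T)^{(r+1)}. \]
By minimising (5.13) w.r.t. $b_s$, we have
\[ b_s^{(r+1)} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)} \left( x_j - (A_s^{(r+1)})^T \mu_i \right)}{\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)}}. \]
Finally update the mixing proportions:
\[ p_i^{(r+1)} = \frac{\sum_{j} p_{ji}^{(r)}}{\sum_{j,i} p_{ji}^{(r)}}. \]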
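The M-step for one transformation is a weighted Procrustes fit, which can be sketched as below for a single block of weights p_ji. This is an illustrative sketch only: the function name is hypothetical, the weights are assumed to be stored as an n × m array, and the sign correction that forces a proper rotation (determinant +1) is an added assumption that the text does not spell out.

```python
import numpy as np


def weighted_procrustes(x, mu, p):
    """Minimise sum_ji p[j, i] * ||x_j - A^T mu_i - b||^2 over rotations A and shifts b.

    x: (n, d) points, mu: (m, d) points, p: (n, m) weights for one transformation.
    """
    total = p.sum()
    mu_bar = p.sum(axis=0) @ mu / total          # weighted mean of the mu_i
    x_bar = p.sum(axis=1) @ x / total            # weighted mean of the x_j
    # B = sum_ji p_ji (mu_i - mu_bar)(x_j - x_bar)^T
    b_mat = (mu - mu_bar).T @ p.T @ (x - x_bar)
    v, _, ut = np.linalg.svd(b_mat)              # B = V Gamma U^T
    if np.linalg.det(v @ ut) < 0:                # assumed: restrict to proper rotations
        v[:, -1] *= -1.0
    a = v @ ut                                   # A = V U^T
    b = x_bar - a.T @ mu_bar
    return a, b


# Example with a known rotation and shift and a hard (identity) correspondence
rng = np.random.default_rng(0)
mu = rng.normal(size=(6, 3))
c, s = np.cos(0.4), np.sin(0.4)
true_a = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
true_b = np.array([1.0, -2.0, 0.5])
x = mu @ true_a + true_b                         # row i is A^T mu_i + b
a_hat, b_hat = weighted_procrustes(x, mu, np.eye(6))
```

With soft weights from the E-step in place of the identity matrix, the same function would give the update of (A_s, b_s) for each transformation in turn.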

E and M steps are repeated until convergence of the residual sum of squares
\[ \sum_{s=1}^{2}\sum_{i=1}^{n} \left( x_i^{(r+1)} - x_i^{(r)} \right)^T \left( x_i^{(r+1)} - x_i^{(r)} \right) \]
where $x_i^{(r)} = (A_s^{(r)})^T \mu_i + b_s^{(r+1)}$. To ensure convergence of the correspondence matrix $(p_{ji})$ as well, use $x_i^{(r)} = \sum_{j=1}^{n} p_{ji} (A_s^{(r)})^T \mu_i + b_s^{(r+1)}$. Another criterion of convergence could be the log-likelihood (5.12).

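At the $r$th iteration, the correspondence probability weighted maximum likelihood estimate of $\sigma^2$ is
\[ (\sigma^2)^{(r)} = \frac{\sum_{s=1}^{2}\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)} \|x_j - (A_s^T)^{(r)}\mu_i - b_s^{(r)}\|^2}{d \times \sum_{s=1}^{2}\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)}} \]
where $d = 3$ is the dimension.

The unweighted estimate is
\[ \frac{\sum_{s=1}^{2}\sum_{i=1}^{m}\sum_{j=1}^{n} \|x_j - (A_s^T)^{(r)}\mu_i - b_s^{(r)}\|^2}{2 \times d \times n \times m}. \]
We assumed these two transformations have the same nuisance parameter σ. It is straightforward to extend the theory to the case of different parameters. The maximum likelihood estimate of the variance, $(\sigma_s^2)^{(r)}$, becomes
\[ \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)} \|x_j - (A_s^T)^{(r)}\mu_i - b_s^{(r)}\|^2}{d \times \sum_{i=1}^{m}\sum_{j=1}^{n} p_{ji}^{(r)}} \]
and the unweighted estimator is
\[ \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \|x_j - (A_s^T)^{(r)}\mu_i - b_s^{(r)}\|^2}{d \times n \times m}. \]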


5.4.4 Simulations

We ran a small simulation study to evaluate the algorithm, with m = 24, n = 20 and sets S1, S2 having 10 points each. With an equal number of points in each set, we expect equal membership preference for either transformation.

Table 5.8 gives correct correspondence proportions and rotation errors for A1 and A2, i.e. measures of distance between true and estimated rotation matrices. Reported are results for several runs using different starting values for A1, A2, b1 and b2 (the EM algorithm is very sensitive to starting values, see Figure 5.5). For each run we had 30 dataset replicates. Correct correspondence proportions for the algorithm are around 0.7. Here a point has a correct correspondence if it is assigned to its true corresponding point µi under the true transformation s, or is rightly not assigned to any point µi. As expected, the performance is not as good as in the simpler case of one transformation only. Rotation errors for the first transformation are around 0.05 radians, while those for the second transformation are in the range of 0.1 to 0.7 radians. There is higher accuracy in estimating the rotation for the first transformation than for the second. This is surprising considering that we had an equal number of points in each set. However, since we estimated A1 first, the higher accuracy could be due to the algorithm drifting quickly towards the first transformation, as we started quite near the true parameter setting (otherwise the designation of transformations as first or second is arbitrary). Obviously, extensive simulations are required to assess conclusively the performance of the algorithm, especially transformation errors.


Table 5.8: Proportions of correct correspondence and rotation errors when using the EM algorithm for matching forms with two transformations. A point has a correct correspondence if assigned to a true corresponding point µi and under the true transformation s, or the point is rightly not assigned to any other point µi.

        Correspondence                          Rotation error
Run     All Points   Set 1       Set 2         A1          A2
1       0.681        0.692       0.669         0.055       0.767
        (0.0057)     (0.0071)    (0.0075)      (0.0070)    (0.0340)
2       0.695        0.706       0.684         0.051       0.136
        (0.0058)     (0.0070)    (0.0077)      (0.0040)    (0.0072)
3       0.681        0.692       0.670         0.054       0.702
        (0.0057)     (0.0072)    (0.0074)      (0.0053)    (0.0336)

Given in parentheses are the std. errors.


Chapter 6

Bayesian Alignment

In this chapter we consider a Markov chain Monte Carlo (MCMC) technique in a Bayesian paradigm to overcome the sensitivity of the EM algorithm in Chapter 5 to starting values. We consider finding the alignment and point correspondences between two configurations using a full joint distribution for the correspondence matrix and transformation parameters. Using MCMC with detailed balance updates and drawing from the posterior of all parameters should stand a better chance of escaping from local maxima of the model than simply trying several starting values for the EM algorithm.

6.1 Bayesian Hierarchical Model

Green and Mardia (2006) build a hierarchical model to solve alignment and matching

of configurations, according to the Bayesian paradigm. This method gives a complete

distribution of probable matches and hence an opportunity to explore several other

solutions near the “optimal” solution.


6.1.1 Point Process Model, with Geometrical Transformation and Random Thinning

Suppose there are two point configurations in d-dimensional space $\mathbb{R}^d$: $\{x_j, j = 1, 2, \ldots, m\}$ and $\{y_k, k = 1, 2, \ldots, n\}$. The points are labelled for identification, but arbitrarily.

Both point sets are regarded as noisy observations on subsets of a set of true locations $\{\mu_i\}$, where the mappings from $j$ and $k$ to $i$ are unknown. There may be a geometrical transformation between the x-space and the y-space, which may also be unknown. The objective is to make model-based inference about these mappings, and in particular to make probability statements about matching: which pairs $(j, k)$ correspond to the same true location?

The geometrical transformation between the x-space and the y-space is denoted $\mathcal{A}$; thus $y$ in y-space corresponds to $x = \mathcal{A}y$ in x-space. The notation does not imply that the transformation $\mathcal{A}$ is necessarily linear. It may be a rotation or more general linear transformation, a translation, both of these, or some non-rigid motion. Regard the true locations $\{\mu_i\}$ as being in x-space.

The mappings between the indexing of $\{\mu_i\}$ and that of the data $\{x_j\}$ and $\{y_k\}$ are captured by indexing arrays $\{\xi_j\}$ and $\{\eta_k\}$; specifically assume that
\[ x_j = \mu_{\xi_j} + \varepsilon_{1j} \tag{6.1} \]
for $j = 1, 2, \ldots, m$, where $\{\varepsilon_{1j}\}$ have probability density $f_1$, and
\[ \mathcal{A}y_k = \mu_{\eta_k} + \varepsilon_{2k} \tag{6.2} \]
for $k = 1, 2, \ldots, n$, where $\{\varepsilon_{2k}\}$ have density $f_2$. All $\{\varepsilon_{1j}\}$ and $\{\varepsilon_{2k}\}$ are independent of each other, and independent of $\{\mu_i\}$.

6.1.2 Formulation of Poisson Process Prior

Suppose that the set of true locations $\{\mu_i\}$ forms a homogeneous Poisson process with rate $\lambda$ over a region $V \subset \mathbb{R}^d$ of volume $v$, and that there are $N$ points realised in this region. Some of these give rise to both x and y points, some to points of one kind and not the other, and some are not observed at all. Suppose these four possibilities occur independently for each realised point, with probabilities parameterised so that with probabilities $(1 - p_x - p_y - \rho p_x p_y,\ p_x,\ p_y,\ \rho p_x p_y)$ we observe neither, x alone, y alone, or both x and y, respectively. The parameter $\rho$ is a certain measure of the tendency a priori for points to be matched: the random thinnings leading to the observed x and y configurations can be dependent, but remain independent from point to point.

Given $N$, $m$ and $n$, there are $L$ matched pairs of points in the sample if and only if the numbers of these four kinds of occurrence among the $N$ points are $(N - m - n + L,\ m - L,\ n - L,\ L)$. Under the assumptions above these four counts will be independent Poisson distributed variables, with means $(\lambda v(1 - p_x - p_y - \rho p_x p_y),\ \lambda v p_x,\ \lambda v p_y,\ \lambda v \rho p_x p_y)$. The prior marginal$^1$ probability distribution of $L$ conditional on $m$ and $n$ is therefore proportional to
\[ \frac{e^{-\lambda v p_x}(\lambda v p_x)^{m-L}}{(m - L)!} \times \frac{e^{-\lambda v p_y}(\lambda v p_y)^{n-L}}{(n - L)!} \times \frac{e^{-\lambda v \rho p_x p_y}(\lambda v \rho p_x p_y)^{L}}{L!} \]
so that
\[ P(L) \propto \frac{(\rho/\lambda v)^L}{(m - L)!\,(n - L)!\,L!} \]
for $L = 0, 1, \ldots, \min\{m, n\}$. Here and later, we use the generic $P(\cdot)$ notation for distributions and conditional distributions in the hierarchical model.

The matching of the configurations is represented by the matching matrix $M$, where $M_{jk}$ indicates whether $x_j$ and $y_k$ are derived from the same $\mu_i$ point, or not, that is,
\[ M_{jk} =
\begin{cases}
1 & \text{if } \xi_j = \eta_k \\
0 & \text{otherwise.}
\end{cases} \tag{6.3} \]
Note that $M$ is the adjacency matrix for the bipartite graph representing the matching, and that $\sum_{j,k} M_{jk} = L$. Assume for the moment that conditional on $L$, $M$ is a priori uniform: there are $L!\binom{m}{L}\binom{n}{L}$ different $M$ matrices consistent with a given

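$^1$ Integrated over $N$:
\[ \sum_{N = n + m - L}^{\infty} \frac{e^{-\lambda v(1 - p_x - p_y - \rho p_x p_y)} \{\lambda v(1 - p_x - p_y - \rho p_x p_y)\}^{N - m - n + L}}{(N - m - n + L)!} = 1. \]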


value of $L$, and these are taken as equally likely. Thus
\[ P(M) = P(L)P(M \mid L) \propto \frac{(\rho/\lambda v)^L}{(m - L)!\,(n - L)!\,L!} \left\{ L! \binom{m}{L}\binom{n}{L} \right\}^{-1} \propto (\rho/\lambda v)^L \]
(where here and later "∝" means proportional to, as functions of the variable(s) to the left of the conditioning |, in this case, $M$). Thus
\[ P(M) = \frac{(\rho/\lambda v)^L}{\sum_{\ell=0}^{\min\{m,n\}} \ell! \binom{m}{\ell}\binom{n}{\ell} (\rho/\lambda v)^{\ell}}. \tag{6.4} \]
Because of the choice of parameterisation for the probabilities of observing hidden points, $P(M)$ does not involve $p_x$ and $p_y$.
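To make the prior on the number of matches concrete, the sketch below evaluates P(L) for given m, n and κ = ρ/(λv). It is an illustration with hypothetical function names. It uses the identity ℓ! C(m, ℓ) C(n, ℓ) = m! n! / {(m − ℓ)! (n − ℓ)! ℓ!}, so the weights are proportional to those in the expression for P(L) and their sum is the normalising constant appearing in (6.4).

```python
from math import comb, factorial


def prior_number_of_matches(m, n, kappa):
    """P(L = l) for l = 0, ..., min(m, n), with kappa = rho / (lambda * v)."""
    weights = [factorial(l) * comb(m, l) * comb(n, l) * kappa ** l
               for l in range(min(m, n) + 1)]
    total = sum(weights)                  # normalising constant of (6.4)
    return [w / total for w in weights]


def direct_weights(m, n, kappa):
    """Unnormalised P(L) written exactly as kappa^L / {(m - L)! (n - L)! L!}."""
    return [kappa ** l / (factorial(m - l) * factorial(n - l) * factorial(l))
            for l in range(min(m, n) + 1)]


probs = prior_number_of_matches(3, 2, 0.7)
```

Larger values of κ shift prior mass towards configurations with more matched pairs, which is one way to see why only the ratio ρ/λ matters once v has been absorbed.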

[Figure 6.1: directed acyclic graph with nodes µ, ξ, η, M, X, Y, σ, A and τ.]

Figure 6.1: Directed acyclic graph representing the model, showing all data and parameters treated as variable.

6.1.3 Data Likelihood

Given $M$, the likelihood of the observed configurations of points is specified as follows. Assume that the geometrical transformation (written $\mathcal{A}$ here to distinguish it from its linear part $A$) is affine: $\mathcal{A}y = Ay + \tau$. From (6.1) and (6.2), the densities of $x_j$ and $y_k$, conditional on $A$, $\tau$, $\{\mu_i\}$, $\{\xi_j\}$ and $\{\eta_k\}$, are $f_1(x_j - \mu_{\xi_j})$ and $|A| f_2(Ay_k + \tau - \mu_{\eta_k})$, respectively, $|A|$ denoting the absolute value of the determinant of $A$.

The locations $\{\mu_i\}$ of the $m - L$ points that generate an x observation but not a y observation are independently uniformly distributed over the region $V$, so that the likelihood contribution of these $m - L$ observations, namely $\{j : \sum_k M_{jk} = 0\}$, is
\[ \prod_{j : M_{jk} = 0\ \forall k} v^{-1} \int_V f_1(x_j - \mu)\, d\mu. \]
Similarly, the contributions from the unmatched y observations, and from the matched pairs, are
\[ \prod_{k : M_{jk} = 0\ \forall j} v^{-1} \int_V |A| f_2(Ay_k + \tau - \mu)\, d\mu \quad \text{and} \quad \prod_{j,k : M_{jk} = 1} v^{-1} \int_V f_1(x_j - \mu)\, |A| f_2(Ay_k + \tau - \mu)\, d\mu \]
respectively. These integrals all exhibit "edge effects" from the boundary of the region $V$, which can be neglected if $V$ is large relative to the supports of $f_1$ and $f_2$. In this case these three expressions approximate to
\[ v^{-(m-L)}, \quad (|A|/v)^{n-L}, \quad \text{and} \quad (|A|/v)^{L} \prod_{j,k : M_{jk} = 1} \int_{\mathbb{R}^d} f_1(x_j - \mu) f_2(Ay_k + \tau - \mu)\, d\mu \]
respectively. The last expression can be written
\[ (|A|/v)^{L} \prod_{j,k : M_{jk} = 1} g(x_j - Ay_k - \tau) \]
where $g(z) = \int f_1(z + u) f_2(u)\, du$ (the density of $\varepsilon_{1j} - \varepsilon_{2k}$).

Combining these terms together, the complete likelihood is
\[ P(x, y \mid M, A, \tau) = v^{-(m+n-L)} |A|^{n} \prod_{j,k : M_{jk} = 1} g(x_j - Ay_k - \tau). \tag{6.5} \]
Multiplying (6.4) and (6.5), the factors of $v^{L}$ cancel and
\[ P(M, x, y \mid A, \tau) \propto |A|^{n} \prod_{j,k : M_{jk} = 1} \{(\rho/\lambda)\, g(x_j - Ay_k - \tau)\}. \]
Note that the constant of proportionality involves $m$, $n$, $\lambda$, $\rho$, and $v$, but not $A$, $\tau$, any parameters in $f_1$ or $f_2$, or $M$ of course.

By further making assumptions of spherical normality for $f_1$ and $f_2$:
\[ x_j \sim N_d(\mu_{\xi_j}, \sigma_x^2 I) \quad \text{and} \quad Ay_k + \tau \sim N_d(\mu_{\eta_k}, \sigma_y^2 I), \]
with $\sigma_x = \sigma_y = \sigma$, say, then
\[ g(z) = \frac{1}{(\sigma\sqrt{2})^{d}}\, \phi(z/\sigma\sqrt{2}) \]
where $\phi$ is the standard normal density in $\mathbb{R}^d$, and the final joint model is
\[ P(M, A, \tau, \sigma, x, y) \propto |A|^{n} P(A) P(\tau) P(\sigma) \prod_{j,k : M_{jk} = 1} \left( \frac{\rho\, \phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})}{\lambda (\sigma\sqrt{2})^{d}} \right). \tag{6.6} \]
Note that not only $p_x$ and $p_y$ but also $v$ does not appear in this expression, principally from the choice of parameterisation, and that only the ratio $\rho/\lambda$ is identifiable. The directed acyclic graph representing this joint probability model, including the variables ($\mu$, $\xi$ and $\eta$) that have been integrated out, is displayed in Figure 6.1.

6.1.4 Prior Distributions and Computations

We assumed the existence of true but unobservable locations {µi} from a Poisson

process just to conveniently formulate the mathematical framework and simplify the

algebra. The assumption of Poisson points would not exactly represent the model for

functional sites (see Chapter 2). However in section 6.1.8 we do sensitivity analysis

for the Poisson assumption and find that violations of the assumption do not impede

the effectiveness of the algorithm.

Green and Mardia (2006) treat ρ and λ as fixed, and consider inference for the remaining unknowns M, τ, σ² and sometimes A, given the data {xj} and {yk}. Markov chain Monte Carlo methods are used for the computation.

Suppose that prior information about τ , σ2 and A will be at best weak and use

generic prior formulations that facilitate the posterior analysis. Prior assumptions

are therefore discussed in parallel with MCMC implementation. Note that the for-

mulation has some affinity with mixture models, the matching matrix M playing a

similar role to the allocation variables often used in computing with mixtures; see,

for example, Richardson and Green (1997). As in that paper, this full Bayesian anal-

ysis aims at simultaneous joint inference about both the discrete and continuously

varying unknowns, in contrast to frequentist approaches.

This model has another similarity with a mixture formulation, in that as M

varies, the number of hidden points needed to generate all the observed data also

varies, and thus there seems to be a “variable-dimension” aspect to the model. However, the approach of integrating out hidden point locations eliminates the variable-dimension parameter, so that reversible jump MCMC is not needed.

Priors and MCMC updating for a rotation matrix

From equation (6.6), the full conditional distribution for $A$ given the data and values for all other parameters is
\[ P(A \mid M, \tau, \sigma, x, y) \propto |A|^{n} P(A) \prod_{j,k : M_{jk} = 1} \phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2}). \]
Viewing this as a density for $A$, there is still freedom to choose the dominating measure for $P(A)$ arbitrarily. Then the full conditional density will be with respect to the same measure.

In matching functional sites, we consider only rigid body transformations rather than general (linear) transformations. Thus, considering only rotations (orthogonal matrices $A$ with positive determinant) and expanding the expression above:
\[ \begin{aligned}
P(A \mid M, \tau, \sigma, x, y) &\propto P(A) \exp\Big\{ \sum_{j,k : M_{jk} = 1} -0.5\, (\|x_j - Ay_k - \tau\|/\sigma\sqrt{2})^{2} \Big\} \\
&\propto P(A) \exp\Big\{ (1/2\sigma^2) \sum_{j,k : M_{jk} = 1} (x_j - \tau)^T A y_k \Big\} \\
&\propto P(A) \exp\Big[ \mathrm{tr}\Big\{ (1/2\sigma^2) \sum_{j,k : M_{jk} = 1} y_k (x_j - \tau)^T A \Big\} \Big].
\end{aligned} \]
There is (conditional) conjugacy if $P(A)$ has the form $P(A) \propto \exp(\mathrm{tr}(F_0^T A))$ for some matrix $F_0$: the posterior then has the same form with $F_0$ replaced by
\[ F = F_0 + (1/2\sigma^2) \sum_{j,k : M_{jk} = 1} (x_j - \tau) y_k^T. \tag{6.7} \]
This is known as the matrix Fisher distribution (Downs, 1972; Mardia and Jupp, 2000, p. 289). Here for symmetry we use a uniform prior with $F_0 = 0$.
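The conjugate update (6.7) amounts to accumulating outer products over the currently matched pairs. The sketch below shows this with NumPy; the function name and the representation of the matching as a list of (j, k) index pairs are assumptions made for illustration.

```python
import numpy as np


def fisher_parameter(x, y, tau, sigma2, matches, f0=None):
    """F = F0 + (1 / (2 sigma^2)) * sum over matched (j, k) of (x_j - tau) y_k^T."""
    d = x.shape[1]
    f = np.zeros((d, d)) if f0 is None else np.array(f0, dtype=float)
    for j, k in matches:
        f += np.outer(x[j] - tau, y[k]) / (2.0 * sigma2)
    return f


# One matched pair, uniform prior (F0 = 0)
x = np.array([[1.0, 0.5, -0.2]])
y = np.array([[0.3, 1.0, 0.7]])
tau = np.array([0.1, 0.0, 0.0])
f_mat = fisher_parameter(x, y, tau, sigma2=0.5, matches=[(0, 0)])

# For any rotation A, tr(F^T A) equals (1 / (2 sigma^2)) * (x - tau)^T A y
a = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
lhs = np.trace(f_mat.T @ a)
rhs = (x[0] - tau) @ a @ y[0] / (2.0 * 0.5)
```

Because F depends on M, τ and σ, it would be recomputed whenever one of those quantities changes during a sweep.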


Sampling the matrix Fisher distribution

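We will review how to sample from the matrix Fisher distribution in the 3-dimensional case.

For the 3-dimensional case, $A$ can be represented as a product of 3 elementary rotations
\[ A = A_{12}(\theta_{12}) A_{13}(\theta_{13}) A_{23}(\theta_{23}) \tag{6.8} \]
as in Raffenetti and Ruedenberg (1970), and Khatri and Mardia (1977). For $i < j$, $A_{ij}(\theta_{ij})$ is the matrix with $m_{ii} = m_{jj} = \cos\theta_{ij}$, $-m_{ij} = m_{ji} = \sin\theta_{ij}$, $m_{rr} = 1$ for $r \neq i, j$ and other entries 0. Each of the generalised Euler angles $\theta_{ij}$ is sampled in turn, conditioning on the other two angles and the other variables $(M, \tau, \sigma, x, y)$ entering the expression for $F$.

The joint full conditional density of the Euler angles is
\[ \propto \exp[\mathrm{tr}\{F^T A\}] \cos\theta_{13} \]
for $\theta_{12}, \theta_{23} \in (-\pi, \pi)$ and $\theta_{13} \in (-\pi/2, \pi/2)$. The cosine term arises since the natural dominating measure, corresponding to a uniform distribution of rotations, has volume element $\cos\theta_{13}\, d\theta_{12}\, d\theta_{13}\, d\theta_{23}$ in these coordinates.

By substituting the representation (6.8) and simplifying, the trace can be written variously as
\[ \begin{aligned}
\mathrm{tr}\{F^T A\} &= a_{12}\cos\theta_{12} + b_{12}\sin\theta_{12} + c_{12} \\
&= a_{13}\cos\theta_{13} + b_{13}\sin\theta_{13} + c_{13} \\
&= a_{23}\cos\theta_{23} + b_{23}\sin\theta_{23} + c_{23}
\end{aligned} \]
where
\[ \begin{aligned}
a_{12} &= (F_{22} - \sin\theta_{13} F_{13})\cos\theta_{23} + (-F_{23} - \sin\theta_{13} F_{12})\sin\theta_{23} + \cos\theta_{13} F_{11} \\
b_{12} &= (-\sin\theta_{13} F_{23} - F_{12})\cos\theta_{23} + (F_{13} - \sin\theta_{13} F_{22})\sin\theta_{23} + \cos\theta_{13} F_{21} \\
a_{13} &= \sin\theta_{12} F_{21} + \cos\theta_{12} F_{11} + \sin\theta_{23} F_{32} + \cos\theta_{23} F_{33} \\
b_{13} &= (-\sin\theta_{23} F_{12} - \cos\theta_{23} F_{13})\cos\theta_{12} + (-\sin\theta_{23} F_{22} - \cos\theta_{23} F_{23})\sin\theta_{12} + F_{31} \\
a_{23} &= (F_{22} - \sin\theta_{13} F_{13})\cos\theta_{12} + (-\sin\theta_{13} F_{23} - F_{12})\sin\theta_{12} + \cos\theta_{13} F_{33} \\
b_{23} &= (-F_{23} - \sin\theta_{13} F_{12})\cos\theta_{12} + (F_{13} - \sin\theta_{13} F_{22})\sin\theta_{12} + \cos\theta_{13} F_{32}
\end{aligned} \]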


and the cij can be ignored, combined into the normalising constants. Thus the full

conditionals for θ12 and θ23 are von Mises distributions. These can be updated by

Gibbs sampling or an efficient rejection method, Best/Fisher algorithm (see Mardia

and Jupp, 2000, p. 43).
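A Gibbs draw for θ12 can be sketched as follows, using the fact that exp(a cos θ + b sin θ) is a von Mises density with mean direction atan2(b, a) and concentration √(a² + b²). This is an illustrative sketch with hypothetical function names; it relies on NumPy's built-in von Mises sampler rather than the Best/Fisher algorithm mentioned above, and it uses zero-based indices, so F[0, 1] stands for F12.

```python
import numpy as np


def elementary_rotation(i, j, theta, d=3):
    """A_ij(theta): m_ii = m_jj = cos, -m_ij = m_ji = sin, identity elsewhere (zero-based i < j)."""
    r = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    r[i, i] = r[j, j] = c
    r[i, j] = -s
    r[j, i] = s
    return r


def rotation_from_angles(t12, t13, t23):
    """A = A_12(t12) A_13(t13) A_23(t23), as in (6.8)."""
    return (elementary_rotation(0, 1, t12) @ elementary_rotation(0, 2, t13)
            @ elementary_rotation(1, 2, t23))


def coefficients_theta12(f, t13, t23):
    """Coefficients a12, b12 of cos(theta12) and sin(theta12) in tr(F^T A)."""
    s13, c13 = np.sin(t13), np.cos(t13)
    s23, c23 = np.sin(t23), np.cos(t23)
    a12 = (f[1, 1] - s13 * f[0, 2]) * c23 + (-f[1, 2] - s13 * f[0, 1]) * s23 + c13 * f[0, 0]
    b12 = (-s13 * f[1, 2] - f[0, 1]) * c23 + (f[0, 2] - s13 * f[1, 1]) * s23 + c13 * f[1, 0]
    return a12, b12


def sample_theta12(f, t13, t23, rng):
    """Draw theta12 from its von Mises full conditional."""
    a12, b12 = coefficients_theta12(f, t13, t23)
    return rng.vonmises(np.arctan2(b12, a12), np.hypot(a12, b12))


rng = np.random.default_rng(1)
f = rng.normal(size=(3, 3))
draw = sample_theta12(f, 0.3, -1.1, rng)
```

The update for θ23 has the same form with (a23, b23) in place of (a12, b12).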

However the distribution of θ13 is proportional to

exp[a13 cos θ13 + b13 sin θ13] cos θ13.

Mardia and Gadsden (1977) studied this distribution without discussing how to

simulate a sample from it. Green and Mardia (2006) use a random walk Metropolis

algorithm, with a perturbation uniformly distributed on [−0.1, 0.1], to sample from

this distribution.
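The random walk Metropolis update for θ13 can be sketched in pure Python as below. The function names are hypothetical, and the target is taken to be proportional to exp(a13 cos θ + b13 sin θ) cos θ on (−π/2, π/2), with proposals outside that interval rejected.

```python
import math
import random


def log_target_theta13(theta, a13, b13):
    """Log of exp(a13 cos(theta) + b13 sin(theta)) * cos(theta) on (-pi/2, pi/2)."""
    if not -math.pi / 2 < theta < math.pi / 2:
        return -math.inf
    return a13 * math.cos(theta) + b13 * math.sin(theta) + math.log(math.cos(theta))


def metropolis_step_theta13(theta, a13, b13, rng, half_width=0.1):
    """One random walk Metropolis step with a uniform perturbation on [-half_width, half_width]."""
    proposal = theta + rng.uniform(-half_width, half_width)
    log_ratio = log_target_theta13(proposal, a13, b13) - log_target_theta13(theta, a13, b13)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return theta


rng = random.Random(0)
theta = 0.0
chain = []
for _ in range(2000):
    theta = metropolis_step_theta13(theta, a13=2.0, b13=-1.0, rng=rng)
    chain.append(theta)
```

Since the proposal is symmetric, the acceptance ratio involves only the target density, which is why no proposal correction appears in the step.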

Priors and updating for other parameters

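Here $\tau$ and $\sigma^{-2}$ are taken to have respectively prior Gaussian and Gamma distributions. These priors are computationally convenient and, most importantly, also plausible for $\tau$ and $\sigma$ in matching functional sites. Thus
\[ \tau \sim N_d(\mu_\tau, \sigma_\tau^2 I) \]
and
\[ \sigma^{-2} \sim \Gamma(\alpha, \beta). \]
Under the assumptions of (6.6), there is conjugacy for $\tau$ and $\sigma$, and the explicit full conditionals are
\[ \tau \mid M, A, \sigma, x, y \sim N_d\left( \frac{\mu_\tau/\sigma_\tau^2 + \sum_{j,k : M_{jk} = 1} (x_j - Ay_k)/2\sigma^2}{1/\sigma_\tau^2 + L/2\sigma^2},\ \frac{1}{1/\sigma_\tau^2 + L/2\sigma^2}\, I \right), \]
\[ \sigma^{-2} \mid M, A, \tau, x, y \sim \Gamma\Big( \alpha + (d/2)L,\ \beta + (1/4) \sum_{j,k : M_{jk} = 1} \|x_j - Ay_k - \tau\|^2 \Big) \]
and the Gibbs sampler is used to update these parameters.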
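The two Gibbs updates can be sketched with NumPy as follows. The function names are hypothetical, the matching is again represented as a list of (j, k) pairs, and Γ(α, β) is assumed to be parameterised by shape α and rate β (so the NumPy scale argument is 1/rate); the text does not state the parameterisation explicitly.

```python
import numpy as np


def tau_full_conditional(x, y, a, sigma2, matches, mu_tau, sigma2_tau):
    """Mean vector and common variance of the Gaussian full conditional of tau."""
    precision = 1.0 / sigma2_tau + len(matches) / (2.0 * sigma2)
    residual_sum = np.zeros(x.shape[1])
    for j, k in matches:
        residual_sum += x[j] - a @ y[k]
    mean = (mu_tau / sigma2_tau + residual_sum / (2.0 * sigma2)) / precision
    return mean, 1.0 / precision


def sample_tau(rng, mean, variance):
    """Draw tau from N_d(mean, variance * I)."""
    return rng.normal(mean, np.sqrt(variance))


def sample_inverse_variance(rng, x, y, a, tau, matches, alpha, beta):
    """Draw sigma^{-2} from Gamma(alpha + (d/2) L, rate = beta + (1/4) * sum of squared residuals)."""
    d = x.shape[1]
    sum_sq = sum(float(np.sum((x[j] - a @ y[k] - tau) ** 2)) for j, k in matches)
    shape = alpha + 0.5 * d * len(matches)
    rate = beta + 0.25 * sum_sq
    return rng.gamma(shape, 1.0 / rate)   # assumed shape/rate convention


# One matched pair with residual x - A y = (2, 0, 0)
x = np.array([[2.0, 0.0, 0.0]])
y = np.array([[0.0, 0.0, 0.0]])
mean, variance = tau_full_conditional(
    x, y, np.eye(3), sigma2=0.5, matches=[(0, 0)], mu_tau=np.zeros(3), sigma2_tau=1.0)
rng = np.random.default_rng(2)
tau_draw = sample_tau(rng, mean, variance)
inv_var_draw = sample_inverse_variance(rng, x, y, np.eye(3), tau_draw, [(0, 0)], alpha=1.0, beta=2.0)
```

In the example the prior and the single matched pair contribute equal precision, so the conditional mean of τ lies halfway between the prior mean and the observed residual.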


Updating M

The matching matrix $M$ is updated in detailed balance using Metropolis-Hastings moves that only propose changes to a few entries: the number of matches $L = \sum_{j,k} M_{jk}$ can only increase or decrease by 1 at a time, or stay the same. The possible changes are

(a) adding a match: changing one entry $M_{jk}$ from 0 to 1.

(b) deleting a match: changing one entry $M_{jk}$ from 1 to 0.

(c) switching a match: simultaneously changing one entry from 0 to 1, and another in the same row or column from 1 to 0.

These changes respect the constraint that there should be unique matches between $j$s and $k$s ($0 \le \sum_j M_{jk} \le 1$ and $0 \le \sum_k M_{jk} \le 1$).

The proposal proceeds as follows: first a uniform random choice is made from all

m+n data points x1, x2, . . . , xm, y1, y2, . . . , yn. Suppose without loss of generality, by

the symmetry of the set-up, that an x is chosen, say xj . There are two possibilities:

either xj is currently matched (∃k such that Mjk = 1) or not (there is no such k).

If xj is matched to yk, with probability p⋆ propose deleting the match, and with

probability 1 − p⋆ propose switching it from yk to yk′, where k′ is drawn uniformly

at random from the currently unmatched y points. On the other hand, if xj is not

currently matched, propose adding a match between xj and a yk, where again k is

drawn uniformly at random from the currently unmatched y points.

The acceptance probabilities for these three possibilities are easily derived from the expression (6.6) for the joint distribution, since in each case the proposed new matching matrix $M'$ is only slightly perturbed from $M$, so that the ratio $P(M', \tau, \sigma \mid x, y)/P(M, \tau, \sigma \mid x, y)$ has only a few factors. Taking into account also the proposal probabilities, whose ratio is $(1/n_u) \div p^\star$, where $n_u = \#\{k \in 1, 2, \ldots, n : M_{jk} = 0\ \forall j\}$ is the number of unmatched y points in $M$, the acceptance probability for adding a match $(j, k)$ is
\[ \min\left\{ 1,\ \frac{\rho\, \phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})\, p^\star n_u}{\lambda (\sigma\sqrt{2})^{d}} \right\}. \tag{6.9} \]


Similarly, the acceptance probability for switching the match of $x_j$ from $y_k$ to $y_{k'}$ is
\[ \min\left\{ 1,\ \frac{\phi(\{x_j - Ay_{k'} - \tau\}/\sigma\sqrt{2})}{\phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})} \right\} \tag{6.10} \]
and for deleting the match $(j, k)$ is
\[ \min\left\{ 1,\ \frac{\lambda (\sigma\sqrt{2})^{d}}{\rho\, \phi(\{x_j - Ay_k - \tau\}/\sigma\sqrt{2})\, p^\star n'_u} \right\} \tag{6.11} \]
where $n'_u = \#\{k \in 1, 2, \ldots, n : M'_{jk} = 0\ \forall j\} = n_u + 1$. Since the changes effected are so modest, we typically make several moves updating $M$ per sweep, along with just one update for each of the other parameters.
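The three acceptance probabilities (6.9) to (6.11) can be sketched in pure Python as follows. The function names are hypothetical, and for brevity the rotated point A y_k is passed in directly (as `a_yk`) rather than computed from A and y_k.

```python
import math


def squared_scaled_residual(xj, a_yk, tau, sigma):
    """|| (x_j - A y_k - tau) / (sigma * sqrt(2)) ||^2."""
    scale = sigma * math.sqrt(2.0)
    return sum(((xc - yc - tc) / scale) ** 2 for xc, yc, tc in zip(xj, a_yk, tau))


def match_factor(xj, a_yk, tau, sigma, rho_over_lambda):
    """rho * phi({x_j - A y_k - tau} / sigma sqrt(2)) / (lambda * (sigma sqrt(2))^d)."""
    d = len(xj)
    phi = (2.0 * math.pi) ** (-d / 2.0) * math.exp(-0.5 * squared_scaled_residual(xj, a_yk, tau, sigma))
    return rho_over_lambda * phi / (sigma * math.sqrt(2.0)) ** d


def accept_add(factor, p_star, n_unmatched):
    """(6.9): n_unmatched counts unmatched y points before the addition."""
    return min(1.0, factor * p_star * n_unmatched)


def accept_delete(factor, p_star, n_unmatched_after):
    """(6.11): n_unmatched_after counts unmatched y points after the deletion."""
    return min(1.0, 1.0 / (factor * p_star * n_unmatched_after))


def accept_switch(xj, a_yk_old, a_yk_new, tau, sigma):
    """(6.10): ratio of the two normal densities, computed on the log scale."""
    log_ratio = -0.5 * (squared_scaled_residual(xj, a_yk_new, tau, sigma)
                        - squared_scaled_residual(xj, a_yk_old, tau, sigma))
    return min(1.0, math.exp(min(0.0, log_ratio)))


tau = [0.0, 0.0, 0.0]
factor = match_factor([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], tau, sigma=1.0 / math.sqrt(2.0), rho_over_lambda=1.0)
p_add = accept_add(factor, p_star=0.5, n_unmatched=4)
p_delete = accept_delete(factor, p_star=0.5, n_unmatched_after=4)
```

An addition and the deletion that reverses it have reciprocal ratios, so at most one of the two acceptance probabilities is below 1, as detailed balance requires.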

6.1.5 Inference

Point estimates for M , A and τ are important in Bioinformatics applications. We

need to specify loss functions giving the cost incurred in declaring point estimates.

We consider estimators which minimise expected loss functions with respect to con-

ditional posterior distributions.

Match Matrix

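Suppose that the loss when $M_{jk} = a$ and $\hat{M}_{jk} = b$, for $a, b = 0, 1$, is $\ell_{ab}$; for example, $\ell_{01}$ is the loss associated with declaring a match between $x_j$ and $y_k$ when there is really none, that is, a "false positive". Then
\[ \begin{aligned}
E[L(M, \hat{M}) \mid x, y] &= \sum_{j,k} \hat{M}_{jk}\ell_{11}p_{jk} + \sum_{j,k} \hat{M}_{jk}\ell_{01}(1 - p_{jk}) + \sum_{j,k} (1 - \hat{M}_{jk})\ell_{10}p_{jk} + \sum_{j,k} (1 - \hat{M}_{jk})\ell_{00}(1 - p_{jk}) \\
&= \sum_{j,k} \hat{M}_{jk}(\ell_{11}p_{jk} - \ell_{01}p_{jk} - \ell_{10}p_{jk} + \ell_{00}p_{jk} + \ell_{01} - \ell_{00}) + \sum_{j,k} (\ell_{00} + \ell_{10}p_{jk} - \ell_{00}p_{jk}) \\
&= \sum_{j,k} \hat{M}_{jk}\left( (\ell_{11} - \ell_{01} - \ell_{10} + \ell_{00})p_{jk} + \ell_{01} - \ell_{00} \right) + \sum_{j,k} (\ell_{00} + \ell_{10}p_{jk} - \ell_{00}p_{jk}).
\end{aligned} \]
The last sum does not depend on $\hat{M}_{jk}$, so we are interested in minimising the first part:
\[ -(\ell_{01} + \ell_{10} - \ell_{11} - \ell_{00}) \sum_{j,k : \hat{M}_{jk} = 1} (p_{jk} - K) \]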


where
\[ K = (\ell_{01} - \ell_{00})/(\ell_{01} + \ell_{10} - \ell_{11} - \ell_{00}) \]
and $p_{jk} = P(M_{jk} = 1 \mid x, y)$ is the posterior probability that $(j, k)$ is a match, which is estimated by the empirical frequency of this match from an MCMC run.

Thus $\hat{M}$ is a solution to a "linear assignment" problem with cost matrix $(p_{jk} - K)$. This is exactly what is suggested in section 5.3, i.e. to use linear assignment with thresholding to harden match probabilities.

First Ordered-Set and Linear Assignment

In practise, taking first non-duplicate matches with high probability or using linear

programming to find optimal matches give similar results.

For linear programming, LPSOLVE (an implementation of a linear programming

simplex-based algorithm2) can be used. In this approach the matrix (pjk) is made

square by adding extra rows (dummy x points) with zeros. Denote this square

matrix C = (cj′k), j′, k = 1, . . . , n.

Then linear programming is used to solve for assignments which maximise
∑_{(j′,k)} (cj′k − K), subject to unique values of j′, k = 1, . . . , n in the solution set {(j′, k)}.
K ∈ (0, 1) is an arbitrarily chosen matching probability threshold.

Thus linear programming finds n pairs of one-to-one assignment. Afterward any

pair (j′, k) in the solution set with cj′k−K < 0 is removed. This is just thresholding

on pj′k as linear assignment tries to match all n pairs without regard to individual

cj′k − K values. Obviously, this also removes assignments involving dummy xj′

points (j′ > m).
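The padding, assignment and thresholding steps can be sketched as below. To keep the example self-contained, the assignment is found by exhaustive search over permutations instead of the simplex-based LPSOLVE used in the thesis, so it is only practical for very small matrices; the function name and example values are hypothetical.

```python
from itertools import permutations


def harden_by_assignment(p, K):
    """Linear assignment on (p_jk - K) followed by thresholding.

    p is an m x n matrix of match probabilities with m <= n (assumed).
    Dummy rows of zeros make the matrix square, the best one-to-one
    assignment is found by brute force, and pairs with c - K < 0 are
    removed afterwards.
    """
    m, n = len(p), len(p[0])
    c = [list(row) for row in p] + [[0.0] * n for _ in range(n - m)]
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):      # perm[j] = column given to row j
        score = sum(c[j][perm[j]] - K for j in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    # Dummy rows have c = 0, so for K in (0, 1) they fail the threshold too.
    return [(j, k) for j, k in enumerate(best_perm) if c[j][k] - K >= 0]


p = [[0.90, 0.05, 0.00],
     [0.10, 0.70, 0.15]]
print(harden_by_assignment(p, K=0.5))
```

Because every full assignment contains exactly n pairs, subtracting K shifts all scores by the same amount nK; the threshold therefore only takes effect in the final filtering step, as noted in the text. For realistic sizes a dedicated solver (for example `scipy.optimize.linear_sum_assignment`, if SciPy is available) would replace the brute-force loop.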

Rotation Matrix and Translation Vector

For quadratic error loss function, the mean of the posterior distribution is used as

a point estimate. Green and Mardia (2006) compute element-wise averages of the

realisations from the posterior distribution to get A and τ . The latter is used as a

2http://cran.r-project.org/src/contrib/Descriptions/lpSolve.html.


point estimate for the translation. A point estimate of A is taken to be the polar
part of the averaged matrix, obtained by post-multiplying it by the inverse of the positive
definite square root of ATA, which gives a proper rotation matrix3.

6.1.6 Using Concomitant Information

Concomitant information (e.g. colour) of points can also be used in Bayesian hier-

archical modelling. Green and Mardia (2006) give details on incorporating colour

distributions when the log probability can be expressed linearly in entries of M i.e.

the colour distribution is independent of the point process. In this case, the contri-

bution to the likelihood from colour information is multiplicative. In implementing

this modified likelihood, MCMC acceptance ratios in section 6.1.4 are modified ac-

cordingly.

6.1.7 Results for Graph Theoretic and MCMC

In this section we compare the performance of a full Bayesian alignment using hier-

archical model (Green and Mardia, 2006) with that of the graph method.

For graph matching we use an algorithm of Applegate and Johnson (1993) in the

implementation of Gold (2003). The Bayesian solution is found using the MCMC algorithm
of Green and Mardia (2006). The MCMC method adapts to different levels of
noise in the positions of functional site atoms, so it is able to find good matches
in distantly related proteins. Thus MCMC can be used to explore the relationship
between functional sites in a database and the query.

Parameters

The graph theoretic method requires a threshold value for matching distances (see

section 1.2.1). In the application, a threshold value of 1.5 Å was used. On the other
hand, MCMC requires initial estimates for λ/ρ, µτ , σ²τ , ν, κ, β and α. We took
λ/ρ = 0.0005, µτᵀ = (0, 0, 0), β = 1.5, α = 1 and ν = κ = 0. We have observed that

analyses are less sensitive to choice of ν and κ.

3 A(ATA)^(−1/2) is the polar part of A (see Mardia and Jupp, 2000, pp. 286, 290).


Data

We consider a functional site for 17-beta-hydroxysteroid dehydrogenase as a query.

This protein belongs to tyrosine-dependent oxidoreductases family of the Rossmann

fold (NAD(P)-binding domain). We match the query to one template from each

family of the following folds:

I. Rossmann fold: NAD(P)-binding domain.

II. FAD/NAD(P)-binding domain.

III. TIM beta/alpha-barrel.

All these folds are from α/β class.

We considered one randomly chosen functional site for every domain
in these folds except for the TIM beta/alpha-barrel fold. In the TIM beta/alpha-barrel fold
we considered one randomly chosen functional site for every domain in 2
of the 28 superfamilies. Thus one representative of every domain from the
following families was considered.

I. Fold: NAD(P)-binding Rossmann-fold domains.

a) 1.1 Alcohol dehydrogenase-like, C-terminal domain family.

b) 1.2 Tyrosine-dependent oxidoreductases family.

c) 1.3 Glyceraldehyde-3-phosphate dehydrogenase-like, N-terminal domain.

d) 1.4 Formate/glycerate dehydrogenases, NAD-domain.

e) 1.5 Siroheme synthase N-terminal domain-like.

f) 1.6 LDH N-terminal domain-like.

g) 1.7 6-phosphogluconate dehydrogenase-like, N-terminal domain.

h) 1.8 Aminoacid dehydrogenase-like, C-terminal domain.

i) 1.9 Potassium channel NAD-binding domain.

j) 1.10 AT-rich DNA-binding protein p25, C-terminal domain.


k) 1.11 CoA-binding domain.

II. Fold: FAD/NAD(P)-binding domain.

a) 2.1 C-terminal domain of adrenodoxin reductase-like.

b) 2.2 FAD-linked reductases, N-terminal domain.

c) 2.3 GDI-like N domain.

d) 2.4 Succinate dehydrogenase/fumarate reductase flavoprotein N-terminal

domain.

e) 2.5 FAD/NAD-linked reductases, N-terminal and central domains.

III. Fold: TIM beta/alpha-barrel (2 out of 28 superfamilies).

Superfamily: NAD(P)-linked oxidoreductase.

a) 3.1 Aldo-keto reductases (NADP).

Superfamily: FAD-linked oxidoreductase.

b) 3.2 Methylenetetrahydrofolate reductase.

c) 3.3 Proline dehydrogenase domain of bifunctional PutA protein.

Results

MCMC identifies some functional sites distantly related to the query which otherwise

the graph theoretic method might have missed.

After finding corresponding Cα atoms, rotation matrix and translation vector

are re-estimated using Procrustes. We used q⋆MCMC = min(qMCMC , qg) pairs to

calculate RMSD in MCMC where qMCMC is number of non-duplicate pairs with

highest matching probabilities and qg is the number of matching pairs found by the

graph theoretic method. Reported in Table 6.1 are RMSD for graph and MCMC

solutions in cases where MCMC manages to find better matches which the graph


theoretic method missed. For these 67 configurations out of 136 cases, MCMC finds

geometrically better solutions than the graph theoretic method.
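The choice of q⋆MCMC = min(qMCMC , qg) pairs for the RMSD comparison can be sketched as follows. The sketch assumes the two configurations have already been superposed (the Procrustes re-estimation step is not shown), and the function name, the `(probability, j, k)` tuple layout and the coordinates are hypothetical.

```python
import math


def rmsd_top_pairs(pairs, x, y, q_graph):
    """RMSD over the q* = min(q_mcmc, q_graph) most probable MCMC pairs.

    pairs   : non-duplicate MCMC matches as (probability, j, k) tuples
    x, y    : coordinate lists, already superposed (assumed)
    q_graph : number of pairs found by the graph method (must be >= 1)
    """
    ranked = sorted(pairs, key=lambda t: -t[0])      # most probable first
    q_star = min(len(ranked), q_graph)
    squared = sum(math.dist(x[j], y[k]) ** 2 for _, j, k in ranked[:q_star])
    return math.sqrt(squared / q_star), q_star


# Hypothetical superposed coordinates and match probabilities.
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
y = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
pairs = [(0.9, 0, 0), (0.8, 1, 1), (0.3, 2, 2)]
rmsd, q_star = rmsd_top_pairs(pairs, x, y, q_graph=2)
print(rmsd, q_star)
```

Using the same number of pairs for both methods keeps the RMSD values comparable, since RMSD tends to grow as more pairs are included.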

Tables 6.4 and 6.5 give RMSD for graph theoretic and MCMC solutions in cases

where the graph finds a better solution. For these 69 cases, we evaluate RMSD for

MCMC solution using both q⋆MCMC = min(qMCMC , qg) and qMCMC pairs. In a few

cases even with qMCMC > qg, RMSD with qMCMC pairs gave a lower RMSD than

graph solution with qg pairs. Figure 6.2 shows corresponding amino acids found

by MCMC method matching 17 − β hydroxysteroid dehydrogenase functional site

(1a27 0) against functional sites of aldose reductase (1ads 0), 3 − α hydroxysteroid

dehydrogenase (1afs 0), aspartate β-semialdehyde dehydrogenase (1brm 0), CHO

reductase (1c9w 0), UDP-glucose dehydrogenase (1dlj 0), glucose 6-phosphate de-

hydrogenase (1dpg 0), dihydrodipicolinate reductase (1drw 0) and ketose reductase

(1e3j 0). RMSD values for these are shown in Table 6.1 in rows 1, 3, 7, 9, 11, 12, 13 and
14. MCMC finds solutions which are better than, but very different from, the graph solutions in
these cases. It might be worth exploring the biological significance of these solutions.

6.1.8 Sensitivity of Poisson Prior Assumption

A Poisson point process might not be ideal for the motivating applications in Bioinformatics.
In this section we consider how the MCMC algorithm with a Poisson prior
fares when matching “hardcore” configurations or “short chains” simulated as in
Aszodi and Taylor (1994). We compare the performance of the algorithm to that of the graph
theoretic method (Gold, 2003) and the EM algorithm (Kent et al., 2004).

Data Simulations

We simulate data in three ways:

Hardcore Data: Dataset 1

We simulate a database of paired configurations, {µ} and {x} with hardcore points.

These are pairs of configurations as in section 3.1.1 except for the noise level in the


Figure 6.2: Corresponding amino acids found by MCMC method matching 17 − β

hydroxysteroid dehydrogenase functional site (1a27 0) against functional sites of a)

aldose reductase (1ads 0), b) 3−α hydroxysteroid dehydrogenase (1afs 0), c) aspar-

tate β-semialdehyde dehydrogenase (1brm 0), d) CHO reductase (1c9w 0), e) UDP-

glucose dehydrogenase (1dlj 0), f) glucose 6-phosphate dehydrogenase (1dpg 0), g)

dihydrodipicolinate reductase (1drw 0) and h) ketose reductase (1e3j 0).


Table 6.1: Matching statistics for 17 − β hydroxysteroid dehydrogenase functional

site (1a27 0) against functional sites from family representatives using graph and

MCMC methods (cases with MCMC doing better). Continued as Table 6.2.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     qMCMC   Common pairs  SCOP¹
1    1ads 0   32    4.379    12    1.707    12      0             3.1
2    1ae1 0   33    0.796    24    0.773    24      21            1.2   ⋆
3    1afs 0   36    3.452    11    1.809    11      0             3.1
4    1b37 0   6     2.402    6     2.596    4       1             2.2
5    1b5t 0   5     2.475    5     0.895    3       0             3.2
6    1bdb 0   54    0.784    40    0.686    40      36            1.2   ⋆
7    1brm 0   39    1.919    12    1.229    12      0             1.3
8    1bxk 0   33    1.008    19    0.952    19      15            1.2   ⋆
9    1c9w 0   31    5.014    11    1.532    11      0             3.1
10   1d5t 0   7     1.719    6     1.095    3       1             2.3
11   1dlj 0   10    2.160    8     0.895    8       0             1.7
12   1dpg 0   8     2.195    7     1.036    6       0             1.3
13   1drw 0   59    4.567    15    2.512    30      0             1.3   †
14   1e3j 0   17    3.707    11    1.227    11      3             1.1   †
15   1ebf 0   10    3.664    7     3.659    5       1             1.3
16   1edo 0   35    0.670    25    0.669    25      24            1.2   ⋆
17   1eno 0   30    0.819    21    0.785    21      20            1.2   ⋆
18   1f8f 0   35    1.635    12    1.092    12      0             1.1
19   1f8r 0   4     0.899    4     0.152    2       0             2.2
20   1ff9 0   9     1.298    7     0.024    2       1             1.3   †
21   1fmc 0   47    0.762    25    0.680    25      23            1.2
22   1foh 0   53    2.091    13    1.881    13      0             2.2   †
23   1frb 0   45    7.719    13    2.528    32      0             3.1   †
24   1gdh 0   6     1.503    6     0.891    4       0             1.4
25   1geg 0   38    0.710    30    0.654    30      28            1.2   ⋆
26   1gpj 0   104   7.940    16    2.335    16      0             1.8   †
27   1gu7 0   82    14.684   15    2.900    15      0             1.1
28   1gve 0   120   2.078    16    1.847    37      0             3.1   †

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


Table 6.2: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)

against sites from family representatives using graph and MCMC methods (cases

with MCMC doing better). Continuation of Table 6.1 and continued as Table 6.3.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     qMCMC   Common pairs  SCOP¹
29   1gz4 0   70    5.449    16    1.951    10      0             1.8   †
30   1h6d 0   120   6.521    18    2.441    16      0             1.3   †
31   1h7w 0   120   8.422    16    2.686    14      0             2.1
32   1hye 0   30    0.929    13    0.807    13      11            1.6
33   1hyu 0   44    2.849    13    2.152    24      0             2.5   †
34   1i36 0   28    2.107    11    1.925    22      0             1.7   †
35   1j3v 0   40    2.935    12    1.646    12      0             1.7
36   1j4a 0   8     5.403    7     1.970    6       0             1.4
37   1j5p 0   34    1.170    14    1.031    14      9             1.3
38   1jax 0   8     1.241    6     1.067    6       1             1.7
39   1jnr 0   50    6.481    14    1.937    13      0             2.4   †
40   1jqb 0   6     2.427    6     0.635    3       0             1.1
41   1ju2 0   11    3.983    8     3.789    6       1             2.2
42   1k6j 0   5     1.239    5     0.182    3       0             1.2   †⋆
43   1k87 0   40    3.920    13    2.109    12      0             3.3
44   1kdg 0   7     2.118    7     0.813    5       1             2.2
45   1kol 0   42    4.918    13    2.302    13      0             1.1   †
46   1kss 0   38    5.110    12    1.739    12      0             2.4
47   1l0v 0   56    6.014    14    2.358    40      0             2.4   †
48   1l0v 0   56    6.014    14    2.211    14      0             2.4
49   1lc0 0   10    2.944    8     2.401    3       0             1.3   †
50   1lqa 0   35    5.046    12    1.824    12      0             3.1
51   1lss 0   25    8.583    13    1.538    9       0             1.9   †
52   1lvl 0   66    6.057    14    2.791    29      0             2.5   †
53   1nek 0   56    6.489    15    2.100    26      1             2.4   †
54   1npd 0   9     7.195    7     2.972    7       0             1.8
55   1lnq 0   6     2.641    5     0.743    2       0             1.9
56   1nrh 0   20    4.824    11    1.938    9       4             1.6

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


Table 6.3: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)

against sites from family representatives using graph and MCMC methods (cases

with MCMC doing better). Continuation of Table 6.2.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     qMCMC   Common pairs  SCOP¹
57   1nvm 0   14    4.205    9     1.480    10      0             1.3   †
58   1pj5 0   49    2.772    13    2.353    26      0             2.2   †
59   1sez 0   92    5.827    14    2.430    31      0             2.2   †
60   1trb 0   49    6.780    14    1.890    22      0             2.5   †
61   1udc 0   58    1.007    17    0.905    17      13            1.2   ⋆
62   1uuf 0   18    3.504    10    1.461    13      1             1.1   †
63   1vj0 0   9     2.854    7     1.099    7       0             1.1
64   1vj1 0   23    1.865    10    1.594    10      0             1.1
65   2dap 0   50    3.690    13    2.291    27      0             1.3   †
66   2nac 0   7     3.982    7     0.608    3       0             1.4
67   2scu 0   49    6.094    13    2.047    26      1             1.11

† Lower RMSD for MCMC even with qMCMC > qg pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.

positions for {x}. Here xj ∼ N(µi, σ²I3) with σ² = 2 and then with σ² = 4. This
was just to explore the effect of increasing the noise level for the {x} coordinates.
Furthermore, simulated configurations have larger volumes here. We simulate {µ} in a 100³ Å³ cube. Thus

(a) Set {µ} consists of hardcore points in a cube with volume V = 100 × 100 × 100 Å³.
Inhibition distance, d = 5 Å.

(b) Set {x} consists of xj ∼ N(µi, σ²I3), with σ² = 2 and σ² = 4.

(c) Order of points in {x} is permuted so that we do not “know” the map π(j) = i.
No rotation and translation is used.


Table 6.4: Matching statistics for 17 − β hydroxysteroid dehydrogenase functional

site (1a27 0) against functional sites from family representatives using graph and

MCMC methods (cases with graph doing better). Continued as Table 6.5.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     q‡MCMC  Common pairs  SCOP¹
1    1a4i 0   26    0.986    13    2.466    12      0             1.8
2    1c1d 0   43    1.207    12    2.087    12      0             1.8   †
2    1c1d 0   43    1.207    12    2.264    22      0             1.8
3    1cjc 0   42    1.025    13    1.228    13      1             2.1   †
3    1cjc 0   42    1.025    13    1.691    22      1             2.1
4    1coy 0   61    1.052    14    1.885    14      0             2.2   †
4    1coy 0   61    1.052    14    2.400    35      0             2.2
5    1cyd 0   40    0.716    25    1.111    36      25            1.2   ⋆
6    1d7y 0   42    1.135    13    1.905    13      0             2.5   †
6    1d7y 0   42    1.135    13    1.971    22      0             2.5
7    1dhr 0   31    1.021    14    1.171    14      9             1.2   †⋆
7    1dhr 0   31    1.021    14    1.570    25      13            1.2
8    1dxy 0   73    1.082    15    2.332    15      0             1.4   †
8    1dxy 0   73    1.082    15    2.494    29      0             1.4
9    1e6u 0   120   0.973    23    2.129    23      1             1.2   †⋆
9    1e6u 0   120   0.973    23    2.207    27      0             1.2
10   1e6w 0   111   0.735    39    2.361    38      0             1.2   †⋆
11   1el5 0   50    1.358    14    2.356    14      0             2.2   †
12   1eq2 0   36    0.791    19    2.133    18      0             1.2   †⋆
12   1eq2 0   36    0.791    19    2.326    19      0             1.2
13   1exb 0   36    0.921    12    1.906    12      0             3.1   †
13   1exb 0   36    0.921    12    2.008    15      0             3.1
14   1fcd 0   54    1.069    12    1.670    12      0             2.5   †
14   1fcd 0   54    1.069    12    2.489    28      0             2.5
15   1fdu 0   60    0.377    58    0.414    59      58            1.2   ⋆
16   1fec 0   50    2.224    14    2.619    11      0             2.5   †
17   1fk8 0   30    0.945    24    1.857    24      1             1.2   ⋆
18   1fmc 0   47    0.762    25    1.117    37      25            1.2   ⋆

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


Table 6.5: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)

against sites from family representatives using graph and MCMC methods (cases

with graph doing better). Continuation of Table 6.4 and continued as Table 6.6.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     q‡MCMC  Common pairs  SCOP¹
19   1g0o 0   42    0.599    29    0.599    29      29            1.2   †⋆
19   1g0o 0   42    0.599    29    2.901    36      29            1.2
20   1gco 0   33    0.729    25    2.069    25      0             1.2   †⋆
20   1gco 0   33    0.729    25    2.140    26      0             1.2
21   1gos 0   120   1.140    17    2.072    17      0             2.2   †
21   1gos 0   120   1.140    17    2.518    24      0             2.2
22   1gr0 0   44    2.018    13    2.328    27      3             1.3
23   1gz6 0   120   0.942    27    2.245    25      0             1.2   †⋆
24   1h5q 0   88    0.955    33    3.124    29      0             1.2   †⋆
25   1h6v 0   120   1.278    17    1.846    14      0             2.5
25   1h6v 0   120   1.278    17    1.937    17      0             2.5   †
26   1hdc 0   24    0.867    11    2.050    20      0             1.2   ⋆
27   1hdo 0   109   0.852    20    1.219    4       0             1.2   ⋆
27   1hdo 0   109   0.852    20    1.953    20      0             1.2   †
28   1heu 0   118   1.257    16    2.340    37      0             1.1
29   1hyh 0   37    0.802    13    1.880    13      0             1.6   †
29   1hyh 0   37    0.802    13    2.282    19      0             1.6
30   1iy8 0   41    0.654    26    0.665    26      24            1.2   †⋆
30   1iy8 0   41    0.654    26    3.132    37      26            1.2
31   1ja9 0   43    0.786    28    1.353    39      28            1.2   ⋆
32   1kyq 0   5     1.469    5     1.904    2       0             1.5   †
33   1li4 0   49    1.452    14    2.111    25      0             1.4
34   1lj8 0   23    0.786    11    2.695    18      0             1.7
35   1lqt 0   82    1.360    15    2.476    25      0             2.1
36   1lsu 0   27    1.154    12    1.264    12      0             1.9   †
36   1lsu 0   27    1.154    12    1.845    19      0             1.9
37   1m66 0   6     1.672    6     2.306    6       3             1.7   †
38   1m6i 0   43    1.261    13    1.728    13      0             2.5   †

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


Table 6.6: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)

against sites from family representatives using graph and MCMC methods (cases

with graph doing better). Continuation of Table 6.5 and continued as Table 6.7.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     q‡MCMC  Common pairs  SCOP¹
38   1m6i 0   43    1.261    13    2.141    25      4             2.5
39   1mg5 0   39    0.734    28    0.995    34      28            1.2   ⋆
40   1mld 0   14    0.919    9     0.965    9       0             1.6   †
41   1mo9 0   44    0.968    14    2.067    14      0             2.5   †
41   1mo9 0   44    0.968    14    2.201    21      0             2.5
42   1mv8 0   35    1.282    13    2.175    22      0             1.7
43   1mx3 0   37    1.145    12    1.295    12      0             1.4   †
43   1mx3 0   37    1.145    12    2.117    24      0             1.4
44   1nff 0   35    0.726    28    0.712    28      26            1.2   †⋆
45   1ng4 0   52    1.355    14    2.501    20      0             2.2
46   1nhp 0   37    1.195    12    1.853    12      0             2.5   †
47   1npy 0   5     1.806    5     0.006    2       0             1.8   †
48   1nyt 0   40    1.127    14    2.659    19      0             1.8
49   1o94 0   120   1.131    16    2.463    16      0             2.1   †
50   1oaa 0   39    0.777    23    1.926    23      1             1.2   †⋆
50   1oaa 0   39    0.777    23    2.028    30      0             1.2
51   1obb 0   120   0.750    17    2.812    16      0             1.6   †
52   1og6 0   88    1.025    16    3.157    17      0             3.1
53   1ono 0   8     1.412    6     3.047    2       0             1.3
53   1ono 0   8     1.412    6     5.315    5       2             1.3   †
54   1orr 0   35    0.895    17    1.931    16      0             1.2   †⋆
54   1orr 0   35    0.895    17    2.209    18      0             1.2
55   1pbe 0   51    1.353    14    1.829    14      0             2.2   †
55   1pbe 0   51    1.353    14    2.000    18      0             2.2
56   1pjq 0   17    1.057    9     1.005    2       0             1.5
56   1pjq 0   17    1.057    9     2.090    8       0             1.5   †
57   1ps9 0   69    1.070    15    2.057    15      0             2.1   †

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


Table 6.7: Matching statistics for 17−β hydroxysteroid dehydrogenase site (1a27 0)

against sites from family representatives using graph and MCMC methods (cases

with graph doing better). Continuation of Table 6.6.

                    Graph          MCMC
No.  Site     n     RMSD     qg    RMSD     q‡MCMC  Common pairs  SCOP¹
58   1psd 0   76    0.971    16    2.194    16      1             1.4   †
58   1psd 0   76    0.971    16    2.286    29      1             1.4
59   1px0 0   18    0.826    10    1.284    10      0             1.2   †⋆
59   1px0 0   18    0.826    10    1.805    12      1             1.2
60   1q1r 0   41    1.073    14    1.495    14      4             2.5   †
60   1q1r 0   41    1.073    14    2.067    25      8             2.5
61   1qmg 0   40    1.677    13    1.505    13      5             1.7   †
61   1qmg 0   40    1.677    13    2.572    18      0             1.7
62   1qor 0   101   0.974    15    2.179    15      0             1.1   †
62   1qor 0   101   0.974    15    2.280    22      0             1.1
63   1qp8 0   20    1.108    10    1.965    10      0             1.4   †
64   1qrr 0   58    1.029    15    2.168    15      0             1.2   †⋆
64   1qrr 0   58    1.029    15    2.710    28      1             1.2
65   1r72 0   30    0.822    14    1.164    14      11            1.1   †
65   1r72 0   30    0.822    14    1.298    18      13            1.1
66   1vjt 0   40    1.371    12    1.910    12      4             1.6   †
66   1vjt 0   40    1.371    12    2.187    20      6             1.6
67   2pgd 0   9     2.120    8     2.898    6       0             1.7
68   3grs 0   45    1.101    13    1.598    13      0             2.5   †
68   3grs 0   45    1.101    13    2.305    33      0             2.5
69   9ldt 0   39    2.255    13    5.365    11      0             1.6   †

q‡MCMC is either qMCMC or q⋆MCMC.
† MCMC RMSD with q⋆MCMC pairs.
⋆ Same family as the query.
¹ See family names in section 6.1.7.


(d) Set {µ} has 10% more points than {x}. Extra points in {µ} have no corre-

sponding points in {x}.
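Steps (a) to (d) for the hardcore dataset can be sketched as below. The hardcore points are generated here by simple sequential rejection sampling, which is an assumption on our part (the procedure of section 3.1.1 is not reproduced); the function name, the default arguments and the seed are hypothetical.

```python
import math
import random


def simulate_hardcore_pair(n_mu, side=100.0, d=5.0, sigma2=2.0, extra=0.10, seed=0):
    """Simulate one pair ({mu}, {x}) of hardcore configurations.

    (a) mu: points in a cube of the given side, no two closer than d
    (b) x : noisy copies x_j ~ N(mu_i, sigma2 * I3)
    (c) the order of x is random, so the map pi(j) = i is hidden
    (d) mu has about 10% more points than x
    """
    rng = random.Random(seed)
    mu = []
    while len(mu) < n_mu:                       # rejection sampling
        cand = tuple(rng.uniform(0.0, side) for _ in range(3))
        if all(math.dist(cand, q) >= d for q in mu):
            mu.append(cand)
    n_x = round(n_mu / (1.0 + extra))
    sd = math.sqrt(sigma2)
    parents = rng.sample(range(n_mu), n_x)      # random subset in random order
    x = [tuple(rng.gauss(c, sd) for c in mu[i]) for i in parents]
    return mu, x, parents                       # parents[j] = true i for x_j


mu, x, parents = simulate_hardcore_pair(22, sigma2=4.0)
print(len(mu), len(x))
```

The returned `parents` list plays the role of the true map π and would only be used afterwards to score the correspondences found by each method.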

Poisson model data - dataset 2

We simulate a database of paired configurations according to the Poisson model of
Green and Mardia (2006). We generate a pair of configurations as follows:

(a) Get the number of points, N, for the set of true locations {µi}, i ∈ {1, 2, . . . , N}, with N ∼ Poi(λ).

(b) Uniformly sample N points in a region V ⊂ ℜ³ of volume v.

(c) Thus {µi} forms a homogeneous Poisson process with rate λ.

(d) Configurations {xj}, j = 1, . . . , n and {yl}, l = 1, . . . , m arise from {µi} such
that:

• With probabilities px, py, ρpxpy and 1 − px − py − ρpxpy, µi gives rise to xj
alone, yl alone, both and neither respectively. ρ is a certain measure
of the tendency a priori for points to be matched. We set py = 0.05,
px = 0.05 and ρpxpy = 0.90, so that ρ = ρpxpy/(pxpy) = 360.

• Thus ∀j : xj ∼ N(µi, σ²I3) and ∀l : yl ∼ N(µi, σ²I3) for some i.

• We choose, say, 12 realisations for N .

• For each N we sample 30 configurations of true locations {µ}, from which
we get pairs of {x} and {y}.

• Set {x} consists of xj ∼ N(µi, σ²I3) and {y} consists of yk ∼ N(µi, σ²I3), with
σ² = 2. No rotation and translation are used.

• Permute the order in {x} so that we do not “know” the correspondence
between {x} and {y}.

• Therefore our database consists of 360 (30 × 12) pairs of configurations.
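One pair from this dataset can be generated along the following lines. This is a sketch under assumptions: `lam` is interpreted as the expected number of true locations, the region is taken to be a cube of side 100, and both values (as well as the function names and the seed) are hypothetical since they are not stated in the text.

```python
import math
import random


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for modest lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_poisson_pair(lam=30.0, side=100.0, sigma2=2.0,
                          p_x=0.05, p_y=0.05, p_both=0.90, seed=0):
    """Simulate ({mu}, {x}, {y}); x and y entries are (parent index, point)."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)

    def noisy(pt):
        return tuple(rng.gauss(c, sd) for c in pt)

    n = poisson(lam, rng)                                        # step (a)
    mu = [tuple(rng.uniform(0.0, side) for _ in range(3))        # step (b)
          for _ in range(n)]
    x, y = [], []
    for i, pt in enumerate(mu):                                  # step (d)
        u = rng.random()
        if u < p_x:
            x.append((i, noisy(pt)))                             # x alone
        elif u < p_x + p_y:
            y.append((i, noisy(pt)))                             # y alone
        elif u < p_x + p_y + p_both:
            x.append((i, noisy(pt)))                             # both
            y.append((i, noisy(pt)))
        # otherwise mu_i gives rise to neither point
    rng.shuffle(x)                                # hide the correspondence
    return mu, x, y


rho = 0.90 / (0.05 * 0.05)          # rho = rho*px*py / (px*py), about 360
mu, x, y = simulate_poisson_pair()
print(rho, len(mu), len(x), len(y))
```

With the probabilities used here the "neither" case has probability zero, so every true location produces at least one observed point.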


Short chains data - dataset 3

We simulate a database of paired configurations according to Aszodi and Taylor

(1994) model. We generate a pair of configurations as follows:

(a) Generate a chain {µ} with at most 50 points (see section 3.1.2 in Chapter 3).

(b) Generate a twin set {x} whereby xj ∼ N(µi, σ²I3), with σ² = 0.5.

(c) Set {µ} has 10% of the points not corresponding with any xj .

(d) Order of points in {x} is permuted so that we do not “know” the map π(j) = i.

No transformation is used.

Comparing MCMC, graph matching and the EM algorithm

We match paired configurations with the graph, MCMC and EM algorithms, then evaluate
the proportion of correct correspondences for each method. Figure 6.3 shows the
correct correspondence proportions for the hardcore and Poisson model datasets. For
matching short chains data, the MCMC and EM algorithms had correct correspondence
proportions of 0.898 and 0.997 respectively. The graph method could not match
this type of data; it became too computationally intensive.

It is noted that graph theoretic method does poorly with Poisson model data.

On the other hand, MCMC does quite well in matching hardcore configurations

even though the prior is Poisson. MCMC is quite adaptable to large noise while

graph theoretic method becomes very computationally intensive and finds fewer

true matches when there is large noise in the coordinates for corresponding points.

Figure 6.4 shows proportions of true matches found by the EM algorithm, MCMC

and graph theoretic methods in a dataset with large variance for corresponding

points’ coordinates. The EM algorithm has the best performance. The performance
of the MCMC and graph methods is similar for small configurations (4 to 20 points).
However, the performance of the graph method degrades faster with more than 20
points.


Figure 6.3: True correspondence proportions for MCMC, graph and EM algorithm

methods. a) Hardcore data. b) Poisson model data.

Parameters setting

When σ2 = 2:

For MCMC and EM algorithm, true values for translation (0), rotation (I3) and

σ² = 2 were given as starting values. A threshold value of 1.5 Å was used for the


[Figure 6.4 plot: correct correspondence (vertical axis, 0.60 to 0.85) against point-set size (horizontal axis, 10 to 40) for the Graph, MCMC and EM methods.]

Figure 6.4: True correspondence proportions for MCMC and graph for hardcore

data with large variance, σ2 = 4. Note that graph method could not match more

than 32 points with large variance.

graph theoretic method. We took non-duplicate matches with highest probabilities

in MCMC while linear assignment for hardening soft matches was used for EM

algorithm results. Further, in linear assignment, we required matching probabilities

to be at least 0.1.

Other settings for MCMC are:

Model hyperparameters

λ/ρ = 0.001; µτ = 0; στ = 5; α = 1; β = 2; γ = 0; δ = 0.

κ = 0, ν = 0 (parameters for the prior on A).

Sampler control and parameters

p⋆ = 0.5.

# of updates for matching matrix M per sweep = 10.


# of sweeps = 10000.

burn-in period = 2000 sweeps.

Initial values

τ = 0; θ = 0; σ = √(β/α).

For large variance, σ2 = 4:

β = 4 for MCMC and threshold=7 for graph theoretic method.

6.2 Using Two Atoms for each Amino Acid

In this section we extend the Bayesian alignment method of Green and Mardia

(2006) presented in section 6.1 to matching coupled points in a configuration. This

is motivated by the requirement in Bioinformatics to prefer matching amino acids

with similar orientation when matched configurations are superposed.

We take into account relative orientation of side chains by using Cα and Cβ

atoms in matching amino acids. Positions of these atoms from the same amino

acid are dependent. Let y1k and x1j denote coordinates for Cα atoms in the query

and functional site. We denote Cβ coordinates for the query and functional site by

y2k and x2j respectively. Thus x1j and x2j are dependent. Similarly, y1k and y2k

are dependent. We take into account the position of y2k by using the conditional

distribution given the position of y1k. Given x1j , y1k, it is plausible to assume that

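\[
f(x_{1j}, x_{2j}, y_{1k}, y_{2k}) = f(x_{1j}, y_{1k})\, f(x_{2j}, y_{2k} \mid x_{1j}, y_{1k}),
\]
\[
x_{2j} \mid x_{1j} \sim N(x_{1j}, \sigma_o^2 I_3), \qquad
A y_{2k} \mid y_{1k} \sim N(A y_{1k}, \sigma_o^2 I_3)
\]
or the displacement
\[
x_{2j} - A y_{2k} \mid (x_{1j}, y_{1k}) \sim N(x_{1j} - A y_{1k}, 2\sigma_o^2 I_3).
\]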
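The factor 2σo² in the displacement distribution can be checked in one step, assuming (as an explicit assumption of this sketch) that the two conditional distributions are independent given (x1j , y1k):

```latex
\begin{align*}
x_{2j} - x_{1j} \mid x_{1j} &\sim N(0, \sigma_o^2 I_3), \\
A y_{2k} - A y_{1k} \mid y_{1k} &\sim N(0, \sigma_o^2 I_3), \\
(x_{2j} - x_{1j}) - A(y_{2k} - y_{1k}) \mid (x_{1j}, y_{1k}) &\sim N(0, 2\sigma_o^2 I_3).
\end{align*}
```

Adding x1j − Ay1k to the last line recovers the stated distribution of x2j − Ay2k, and dividing the centred difference by σo√2 gives the standardised argument used in the orientation term of the likelihood.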

We assume for “symmetry” that f(x2j − Ay2k | x1j , y1k) depends only on the displacement,
as in the likelihood in equation 6.6. Thus φ(.) in Green and Mardia (2006) is
replaced by φ(.) × φ({x2j − x1j − A(y2k − y1k)}/σo√2) for the new full likelihood.

Now the final joint model is


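\[
P(M, A, \tau, \sigma, x_1, y_1, x_2, y_2) \propto |A|^{n} P(A) P(\tau) P(\sigma) \times
\prod_{j,k:\,M_{jk}=1}
\left(
\frac{\rho\,\phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2}) \times
      \phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
     {\lambda(\sigma\sqrt{2})^{d}}
\right).
\tag{6.12}
\]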

Some probability mass for the distribution of x2j |x1j and Ay2k|y1k is unaccounted

for because there is inhibition distance between x2j and x1j and also between y2k and

y1k. Thus x2j − x1j is not isotropic. This is not expected to affect the performance
of the algorithm because the relative contributions from each x2j − x1j are unaffected. In
other words, the unaccounted probability mass can be absorbed into the proportionality
constant.

6.2.1 Prior Distributions and Computations

The additional term in the new full likelihood does not involve τ hence the posterior

and updating of τ is unchanged.

Rotation Matrix

The full conditional distribution of A is

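\[
P(A \mid M, \tau, \sigma, x_1, y_1, x_2, y_2) \propto |A|^{2n} P(A) \times
\prod_{j,k:\,M_{jk}=1}
\phi\!\left(\frac{x_{1j} - A y_{1k} - \tau}{\sigma\sqrt{2}}\right)
\phi\!\left(\frac{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})}{\sigma_o\sqrt{2}}\right).
\tag{6.13}
\]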


Thus

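\[
\begin{aligned}
P(A \mid M, \tau, \sigma, x_1, y_1, x_2, y_2)
 &\propto P(A) \times \exp\!\left(
   \operatorname{tr}\!\left\{ \frac{1}{2\sigma^2} \sum y_{1k}(x_{1j} - \tau)^T A \right\}
   + \frac{1}{2\sigma_o^2} \sum (x_{2j} - x_{1j})^T A (y_{2k} - y_{1k}) \right) \\
 &\propto P(A) \times \exp\!\left(
   \operatorname{tr}\!\left\{ \frac{1}{2\sigma^2} \sum y_{1k}(x_{1j} - \tau)^T A \right\}
   + \operatorname{tr}\!\left\{ \frac{1}{2\sigma_o^2} \sum (y_{2k} - y_{1k})(x_{2j} - x_{1j})^T A \right\} \right) \\
 &\propto P(A) \times \exp\!\left(
   \operatorname{tr}\!\left\{ \left( \frac{1}{2\sigma^2} \sum y_{1k}(x_{1j} - \tau)^T
   + \frac{1}{2\sigma_o^2} \sum (y_{2k} - y_{1k})(x_{2j} - x_{1j})^T \right) A \right\} \right)
\end{aligned}
\]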

where the summation is over j, k : Mjk = 1.

Similar to equation 6.7, with P(A) ∝ exp(tr(F0ᵀA)) for some matrix F0, the full
conditional distribution of A (given data and values for all other parameters) has
the same form with F0 replaced by

\[
F = F_0 + \frac{1}{2\sigma^2} \sum_{j,k:\,M_{jk}=1} (x_{1j} - \tau)\, y_{1k}^T
        + \frac{1}{2\sigma_o^2} \sum_{j,k:\,M_{jk}=1} (x_{2j} - x_{1j})(y_{2k} - y_{1k})^T.
\tag{6.14}
\]

6.2.2 Updating M

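Similar to expression 6.9, the acceptance probability for adding a match (j, k) is

\[
\min\left\{ 1,\;
\frac{\rho\,\phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})\, p^{*} n_u}{\lambda(\sigma\sqrt{2})^{d}}
\times
\frac{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}{(\sigma_o\sqrt{2})^{d}}
\right\}.
\]

Similarly, the acceptance probability for switching the match of xj from yk to yk′ is

\[
\min\left\{ 1,\;
\frac{\phi(\{x_{1j} - A y_{1k'} - \tau\}/\sigma\sqrt{2})}{\phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})}
\times
\frac{\phi(\{x_{2j} - x_{1j} - A(y_{2k'} - y_{1k'})\}/\sigma_o\sqrt{2})}{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
\right\}
\]

and for deleting the match (j, k) is

\[
\min\left\{ 1,\;
\frac{\lambda(\sigma\sqrt{2})^{d}}{\rho\,\phi(\{x_{1j} - A y_{1k} - \tau\}/\sigma\sqrt{2})\, p^{*} n_u}
\times
\frac{(\sigma_o\sqrt{2})^{d}}{\phi(\{x_{2j} - x_{1j} - A(y_{2k} - y_{1k})\}/\sigma_o\sqrt{2})}
\right\}.
\]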
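The three acceptance probabilities can be evaluated numerically as in the sketch below. It assumes that φ is the standard normal density in d dimensions and treats p∗ and nu as given inputs from the sampler; all function names and the numerical values in the example are hypothetical and are not taken from the thesis software.

```python
import math


def phi(z):
    """Standard normal density at the vector z (dimension d = len(z))."""
    return math.exp(-0.5 * sum(c * c for c in z)) / (2.0 * math.pi) ** (len(z) / 2.0)


def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * c for a, c in zip(row, v)) for row in A]


def pair_density(x1, x2, y1, y2, A, tau, sigma, sigma_o):
    """Position term times orientation term for one candidate pair (j, k),
    each divided by its scale factor (sigma*sqrt 2)^d or (sigma_o*sqrt 2)^d."""
    d = len(x1)
    Ay1 = matvec(A, y1)
    Ady = matvec(A, [b - a for a, b in zip(y1, y2)])            # A (y2 - y1)
    z_pos = [(x - ay - t) / (sigma * math.sqrt(2)) for x, ay, t in zip(x1, Ay1, tau)]
    z_ori = [(b - a - c) / (sigma_o * math.sqrt(2)) for a, b, c in zip(x1, x2, Ady)]
    return (phi(z_pos) / (sigma * math.sqrt(2)) ** d
            * phi(z_ori) / (sigma_o * math.sqrt(2)) ** d)


def accept_add(pair, rho, lam, p_star, n_u):
    return min(1.0, rho * pair * p_star * n_u / lam)


def accept_delete(pair, rho, lam, p_star, n_u):
    return min(1.0, lam / (rho * pair * p_star * n_u))


def accept_switch(pair_old, pair_new):
    # The scale factors cancel, leaving the ratio of the phi terms.
    return min(1.0, pair_new / pair_old)


# Hypothetical example: a perfectly aligned pair with identity rotation.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pd = pair_density((0, 0, 0), (1, 0, 0), (0, 0, 0), (1, 0, 0),
                  I3, (0, 0, 0), sigma=math.sqrt(2), sigma_o=1.0)
print(accept_add(pd, rho=1000.0, lam=1.0, p_star=0.5, n_u=5))
```

Grouping both φ terms into a single `pair_density` makes it plain that the add and delete ratios are reciprocals of each other, and that a smaller σo penalises differences in side chain orientation more strongly.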


6.2.3 Results

Figure 6.5 shows residues matched between 17 − β hydroxysteroid dehydrogenase

and carbonyl reductase functional sites when both Cα and Cβ atoms are used.

Graph theoretic method (with a simple adaptation) can also use both Cα and

Cβ atoms. Section 7.3.2 in Chapter 7 gives the adaptation required. Using both

Cα and Cβ atoms and δ = 1.5, the graph theoretic method finds 19 corresponding

pairs with RMSD=0.72. MCMC match gives RMSD=0.57 for 19 pairs with highest

matching probabilities.

a) MCMC

b) Graph

Figure 6.5: Corresponding amino acids in matching functional sites of 17 − β hy-

droxysteroid dehydrogenase (1a27 0) and carbonyl reductase (1cyd 1) using Cα and

Cβ atoms in MCMC and graph theoretic methods.

Figure 6.6 shows residues matched using only Cα atoms. Using only Cα atoms

and with a threshold δ = 0.98 for matching distances, graph also matches 19 pairs

with RMSD=0.634. The threshold was chosen to give the same number of matches

as when using Cα and Cβ atoms. On the other hand MCMC gives RMSD=0.594

for 19 highest probability matching pairs when only Cα atoms are used.

In this example, solutions found by graph and MCMC methods using singles i.e.

one atom for each amino acid or couples i.e. two atoms for each amino acid are


a) MCMC

b) Graph

Figure 6.6: Corresponding amino acids in matching functional sites of 17−β hydrox-

ysteroid dehydrogenase (1a27 0) and carbonyl reductase (1cyd 1) using Cα atoms

only in MCMC and graph theoretic methods.

very similar despite a lower RMSD value for the MCMC method with two atoms.

However, we had to search for the right threshold for graph matching using Cα atoms
only; otherwise the RMSD value is higher4 for the threshold value of 1.5 Å. Table 6.8
gives the number of common pairs between these solutions.

Table 6.8: Number of same pairs in MCMC and graph solutions when using Cα

atoms only (single) and when using both Cα and Cβ atoms (couple).

                    Graph               MCMC
                    single   couple     single   couple
Graph:   single     -        13         12       14
         couple     -        -          13       15
MCMC:    single     -        -          -        16

Note: each solution has 19 matched pairs.

4RMSD=0.778; # of corresponding amino acids=27.


6.2.4 Comments

• It is observed that with coupled points MCMC mostly converges just after 10⁴
sweeps, while it needs around 10⁶ sweeps for uncoupled points.

• Using coupled points involves more matching constraints, so that solutions
tend to have smaller RMSD and fewer corresponding amino acids.

• The parameter σ2o controls the flexibility in orientation for matching coupled

points. Smaller σ2o requires more similar orientation for matching couples.


Chapter 7

Bayesian Refinement of Graph

Solutions

In this chapter we consider augmenting a graph-theoretic method with an MCMC
refinement step in matching protein functional sites. Thus we consider a method based
on initial graph matching followed by refinement using a Markov chain Monte Carlo
(MCMC) procedure.

7.1 Introduction

MCMC refinement step can provide significant improvements over graph matching

techniques. With the Bayesian approach we are able to refine graph solutions to find

more biologically interesting and statistically significant matches between functional

sites.

In our application in section 7.4, we show that the MCMC refinement step is able

to significantly improve graph based matches. We apply the method to matching

FAD/NAD(P)(H) binding sites within single Rossmann fold families, between dif-

ferent Rossmann superfamilies and within different folds. Within families sites are

often well conserved, but there are examples where significant shape based matches

do not retain similar amino acid chemistry, indicating that even within families the

same ligand may be bound using substantially different physico-chemistry. We also


show that the procedure finds significant matches between binding sites for the same

cofactor in different superfamilies and different folds. The results show the method

can be used to detect structural similarity between functional sites from proteins

with different folds.

7.2 Motivation

The graph theoretic approach requires the matching distance threshold to be adjusted a priori according to the noise in atomic positions, which is difficult to pre-determine when matching templates from a database containing relatives at varying evolutionary distances and structures of varying crystallographic precision. Furthermore, the graph method is unable to identify alternative

but sometimes important solutions in the neighbourhood of the distance based so-

lution because of strict distance thresholds. On the other hand, the graph theo-

retic approach is very fast, robust and can quickly give corresponding points from

which we can get rough estimates for rotation and translation. Using MCMC in

the Bayesian hierarchical modelling (starting from the rotation and translation esti-

mates by graph method) relaxes strict distance thresholds used in graph matching.

That is, MCMC automatically adapts to the level of noise in functional site atomic

positions. Furthermore, using MCMC to sample from the full joint distribution in

equation (6.12) provides an extremely flexible basis for reporting aspects of the full

joint posterior that are of interest, including alternative matching matrices.

7.3 Method

We consider the Bayesian hierarchical approach to improve the graph based solution.

A graph theoretic matching algorithm is used to get an initial estimate of rotation

and translation followed by refinement using Bayesian hierarchical modelling.


7.3.1 Representation and Matching

In this chapter we consider matching at the level of amino acid residues. However, we use both the Cα and Cβ atoms of each residue (except for glycine, which has no Cβ atom, so only its Cα atom is used). Note that since there are several examples of

similarities in protein functional sites from evolutionarily unrelated proteins, which

do not preserve the amino to carboxy terminal order of the matching residues,

methods in this thesis take no account of the sequential ordering of residues.

At the least restricted level, any residue is allowed to match any other, thus

producing matches considering only the form or shape of the sites, in terms of

the spatial arrangement of their constituent residues, irrespective of amino acid

identities and physico-chemical properties. A more restricted scheme is also con-

sidered where residues are only allowed to match within the same physico-chemical

class: hydrophobic (A,F,I,L,M,P,V), polar (C,H,N,Q,S,T,W,Y), charged (D,E,K,R),

or glycine (G). These groups (Branden and Tooze, 1999, p. 6), tabulated in Table

5.4 and also used in Chapter 5, are chosen to illustrate the value of the MCMC

procedure. However the procedure would be equally applicable to other possible

physico-chemical groupings.
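The two matching schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class memberships are copied from the text, while the names `CLASSES`, `RESIDUE_CLASS` and `may_match` are hypothetical and do not come from the thesis software.

```python
# Physico-chemical classes as listed in the text (one-letter amino acid codes).
CLASSES = {
    "hydrophobic": "AFILMPV",
    "polar": "CHNQSTWY",
    "charged": "DEKR",
    "glycine": "G",
}

# Inverted lookup: residue code -> class name (illustrative helper, not from the thesis).
RESIDUE_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def may_match(aa_x: str, aa_y: str, restricted: bool = True) -> bool:
    """Return True if residues aa_x and aa_y are allowed to match.

    restricted=False is the least restricted scheme (any residue may match any other);
    restricted=True only allows matches within the same physico-chemical class.
    """
    if not restricted:
        return True
    return RESIDUE_CLASS[aa_x] == RESIDUE_CLASS[aa_y]
```

For instance, `may_match("A", "V")` is allowed under both schemes, whereas `may_match("A", "D")` is allowed only when `restricted=False`. Any other grouping could be substituted by editing `CLASSES`.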

We consider matching two functional site configurations $\{x_j, j = 1, 2, \ldots, m\}$ and $\{y_k, k = 1, 2, \ldots, n\}$ in 3-dimensional space. The $j$th and $k$th amino acids in $\{x\}$ and $\{y\}$ are represented by $x_j$ and $y_k$ respectively. We do not know the correspondence between $j$ and $k$. Possibly some $j$ do not correspond with any $k$, and similarly some $k$ do not correspond with any $j$. Let $x_{1j}$ and $y_{1k}$ denote the coordinates of the Cα atoms for the $j$th and $k$th amino acids in the functional sites. Similarly, we denote the Cβ coordinates for the $j$th and $k$th amino acids by $x_{2j}$ and $y_{2k}$. Thus $x_{1j}$ and $x_{2j}$ are dependent. Similarly, $y_{1k}$ and $y_{2k}$ are dependent.

7.3.2 Graph Theoretic Step

The graph matching method described in section 1.2.1 is used. However, here we use two atoms, i.e. Cα and Cβ, for matching and superposition. In addition to the requisite for connecting vertices representing amino acids in the product graph in Definition 1.2.1, the inter-point distances between Cβ atoms have to agree to within 1.5 Å as well. That is, all corresponding Cα to Cα and Cβ to Cβ distances in matched configurations are within 1.5 Å of each other. Thus we define a vertex product graph as follows.

Definition 7.3.1. Let $V_1$ and $V_2$ be the sets of vertices for $G_1$ and $G_2$ respectively. The vertex product graph $H_v = G_1 \circ_v G_2$ includes the vertex set $V_H = V_1 \times V_2$, in which the vertex pairs $(x_j, y_k)$ with $x_j \in V_1$ and $y_k \in V_2$ have the same attribute. An edge between two vertices $v_h = (x_j, y_k)$, $v_{h'} = (x_{j'}, y_{k'}) \in V_H$ exists for $j \neq j'$ and $k \neq k'$ such that

• the absolute difference between the distances $|x_{1j} - x_{1j'}|$ and $|y_{1k} - y_{1k'}|$ is less than 1.5 Å;

• also the absolute difference between the distances $|x_{2j} - x_{2j'}|$ and $|y_{2k} - y_{2k'}|$ is less than 1.5 Å.
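As an illustration of Definition 7.3.1, the following Python sketch builds the vertex product graph from Cα and Cβ coordinates. It is a minimal sketch rather than the code used in the thesis: the function name `product_graph`, its arguments and the dictionary-of-sets adjacency format are assumptions, and it supposes that every residue has both atoms (the special handling of glycine is omitted).

```python
import numpy as np


def product_graph(x1, x2, y1, y2, labels_x=None, labels_y=None, tol=1.5):
    """Build a vertex product graph in the sense of Definition 7.3.1.

    x1, x2: (m, 3) arrays of C-alpha and C-beta coordinates for site {x}.
    y1, y2: (n, 3) arrays of C-alpha and C-beta coordinates for site {y}.
    labels_x, labels_y: optional vertex attributes; if either is None,
        all vertices are treated as having the same attribute.
    tol: distance tolerance in Angstroms (1.5 in the text).
    Returns (vertices, adjacency), where vertices is a list of (j, k) pairs
    and adjacency maps each vertex to the set of its neighbours.
    """
    x1, x2, y1, y2 = (np.asarray(a, dtype=float) for a in (x1, x2, y1, y2))
    m, n = len(x1), len(y1)
    unlabelled = labels_x is None or labels_y is None
    vertices = [
        (j, k)
        for j in range(m)
        for k in range(n)
        if unlabelled or labels_x[j] == labels_y[k]
    ]

    def pairwise(a):
        # Matrix of Euclidean distances between all rows of a.
        return np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)

    dx1, dx2, dy1, dy2 = pairwise(x1), pairwise(x2), pairwise(y1), pairwise(y2)
    adjacency = {v: set() for v in vertices}
    for a, (j, k) in enumerate(vertices):
        for (j2, k2) in vertices[a + 1:]:
            if j == j2 or k == k2:
                continue  # an edge requires j != j' and k != k'
            ca_ok = abs(dx1[j, j2] - dy1[k, k2]) < tol  # C-alpha condition
            cb_ok = abs(dx2[j, j2] - dy2[k, k2]) < tol  # C-beta condition
            if ca_ok and cb_ok:
                adjacency[(j, k)].add((j2, k2))
                adjacency[(j2, k2)].add((j, k))
    return vertices, adjacency
```

Passing physico-chemical class labels as `labels_x` and `labels_y` gives the restricted scheme; leaving them as `None` gives the least restrictive case in which matching depends only on inter-residue distances.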

As before, in the least restrictive case all vertices (amino acids) are assumed to have

the same attribute and hence matching can occur between any amino acid and is

only dependent on inter-residue distances. Alternatively vertices can be labelled

with residue physico-chemical properties to restrict matching to amino acids in the

same group (Section 7.3.1).

We search for the maximum similarity between two graphs G1 and G2 represent-

ing {x} and {y} respectively. Thus we search for the maximal common subgraph or

a clique within the vertex product graph for G1 and G2 (Hv = G1 ◦v G2). Example

applications in this chapter use the clique detection algorithm of Carraghan and

Pardalos (1990) for graph matching.
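To make the clique search concrete, the following Python sketch finds one maximum clique by a simple branch-and-bound search. It is written in the spirit of exact clique detection with size-based pruning, but it is not a reproduction of the Carraghan and Pardalos (1990) implementation; the function name `max_clique` and the dictionary-of-sets graph format are assumptions made for illustration.

```python
def max_clique(adjacency):
    """Return one maximum clique of an undirected graph.

    adjacency: dict mapping each vertex to the set of its neighbours.
    The search is exact but exponential in the worst case.
    """
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        # Prune: stop when even using every remaining candidate
        # could not produce a clique larger than the best one found so far.
        while candidates and len(clique) + len(candidates) > len(best):
            v = candidates.pop()
            # Only candidates adjacent to v can extend a clique containing v.
            expand(clique + [v], [u for u in candidates if u in adjacency[v]])

    # Sorted by decreasing degree; pop() takes from the end,
    # so low-degree vertices are tried first.
    order = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    expand([], order)
    return best
```

Applied to a product graph whose vertices are pairs (j, k), the returned clique is a list of such pairs, that is, a set of mutually compatible residue correspondences.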

7.3.3 MCMC Refinement Step

We use MCMC sampling in the Bayesian hierarchical modelling described in section 6.2, starting from the rotation and translation obtained from the graph solution. We start from several random initial values for the noise parameter and monitor convergence (the log posterior likelihood) and the quality of the solution in terms of RMSD, statistical significance and the number of corresponding amino acids.
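The text does not spell out how the starting rotation and translation are computed from the graph solution. One standard possibility, shown here purely as an assumption, is the least-squares rigid fit obtained from a singular value decomposition (often called the Kabsch algorithm). The function name `rigid_fit` and the direction of the transformation (first argument mapped onto the second) are illustrative choices.

```python
import numpy as np


def rigid_fit(x, y):
    """Least-squares rotation and translation mapping x onto y.

    x, y: (q, 3) arrays of matched coordinates; row i of x corresponds to row i of y.
    Returns (rot, trans) such that x @ rot.T + trans approximates y.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    h = (x - cx).T @ (y - cy)                # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against a reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    trans = cy - rot @ cx
    return rot, trans
```

When both Cα and Cβ atoms are used, the matched Cα and Cβ coordinates of the corresponding residues could simply be stacked row-wise into `x` and `y` before calling the function.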

7.3.4 Accounting for Physico-chemistry Properties

The method accounts for the 3-dimensional form of the site as well as the physico-

chemistry of constituent amino acids. Restrictive matching is used to account for

physico-chemistry in graph matching. Thus amino acid groups are used as vertex attributes. Two vertices $v_h = (x_j, y_k)$, $v_{h'} = (x_{j'}, y_{k'}) \in V_H$ in the product graph $H_v$ (Definition 7.3.1) are connected only if each of the pairs $(x_j, y_k)$ and $(x_{j'}, y_{k'})$ consists of amino acids in the same physico-chemistry group. A detailed

description on how to flexibly account for physico-chemistry in Bayesian hierarchical

modelling is given in Green and Mardia (2006) and a brief discussion is in section

6.1.6. However in order to compare graph theoretic and MCMC refinement results

(in the application in section 7.4) we have not used a “full prior” for ρ/λ as in

Green and Mardia (2006) when accounting for physico-chemistry. The matching indicator $M_{jk}$ in equation 6.3 is constrained to be zero in the probability model and in all algorithmic steps if the $j$th and $k$th amino acids are in different physico-chemistry

groups. As in the graph theoretic approach when accounting for physico-chemistry,

this matches amino acids in the same group only.

7.3.5 Assessing Quality of Matches

Matches were assessed in terms of a number of parameters. The first are the number of matched residues and the root mean square deviation (RMSD) between matched positions, which are very commonly used in the field. It is intuitively clear that matches of lower RMSD over larger numbers of matching residues are more statistically significant.
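For reference, RMSD between matched positions can be computed as in the short sketch below. It assumes that the two sets of coordinates have already been superposed and that rows are listed in matching order; whether the thesis computes RMSD over Cα atoms only or over both Cα and Cβ atoms is not stated here, so the choice of input rows is left to the caller. The function name `rmsd` is illustrative.

```python
import numpy as np


def rmsd(x, y):
    """Root mean square deviation between matched, already superposed positions.

    x, y: (q, 3) arrays; row i of x is matched to row i of y.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    squared = np.sum((x - y) ** 2, axis=1)   # squared distance for each matched pair
    return float(np.sqrt(np.mean(squared)))
```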

In Chapter 4, we considered p-values for RMSD and the score (Gold, 2003) for

ranking matches or assessing goodness-of-fit under the assumption that matched


configurations are related. Here we consider assessing significance of the evidence

that our match is not by mere chance and there is some relationship between the

configurations. A measure of this significance has recently been suggested by Stark et al. (2003b), and was modified in this work to correct for the number of amino acids in the functional sites being matched. Matchings were pair-wise and we calculated E-values and P-values with a correction for the number of amino acids in the functional sites. Thus we used the P-value formula

$$P = 1 - e^{-E} \qquad (7.1)$$

where

$$E = C(m,n)\, P\, a\, \Phi_q^{\,b}\, R_M^{2.93q - 5.88}\, [y R_M^2]^S\, [z R_M^3]^T, \qquad q \ge 3,$$

and $E$ is the expected number of matches with this RMSD or better, $P$ (in the expression for $E$) is the number of binding sites that were matched, $\Phi_q$ is the product of percentage abundances of all $q$ matched amino acids, $R_M$ is RMSD, $q$ is the number of matched amino acids, $S$ is the number of amino acids with two atoms matched, and $T$ is the number of amino acids with more than two atoms matched. In our applications $T = 0$. Incidentally, the first exponent of $R_M$ satisfies $2.93q - 5.88 \simeq 3q - 6$, which is expected from the Mardia-Dryden distribution of size-and-shape (see Dryden and Mardia, 1998). We used the empirically derived constants $a = 3.704 \times 10^{6}$, $b = 1.790 \times 10^{-3}$, $y = 0.196$ and $z = 0.094$ as in Stark et al. (2003b). $C(m,n) = 3!\binom{n}{3}\binom{m}{3}$ is a correction factor for the numbers of amino acids, $m$ and $n$, in the functional sites. The expected number of matches with this RMSD or better by chance is multiplied by $C(m,n)$ because matching 3 points exhausts all degrees of freedom in optimal matching of rigid bodies (Kuhl et al., 1984). Equation 7.1 is derived from the extreme value distribution (see section 1.2.2 of Chapter 1). Because RMSD is positive and the distribution shows a heavy tail attenuated at zero, a Fréchet-type distribution is used.
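The significance calculation of equation 7.1 can be written out as a short Python sketch. The constants are those quoted in the text; the function names (`correction`, `e_value`, `p_value`) and argument names are hypothetical, and the abundance product Φ is taken as an input because the table of amino acid abundances is not given in this section.

```python
import math

# Empirical constants quoted in the text from Stark et al. (2003b).
A, B, Y, Z = 3.704e6, 1.790e-3, 0.196, 0.094


def correction(m: int, n: int) -> int:
    """Correction factor C(m, n) = 3! * C(n, 3) * C(m, 3)."""
    return math.factorial(3) * math.comb(n, 3) * math.comb(m, 3)


def e_value(rmsd, q, phi, n_sites, m, n, s=0, t=0):
    """Expected number of matches with this RMSD or better.

    rmsd: RMSD of the match (R_M); q: number of matched amino acids (q >= 3);
    phi: product of abundances of the matched amino acids (supplied by the caller);
    n_sites: number of binding sites matched (P in the expression for E);
    m, n: numbers of amino acids in the two functional sites;
    s, t: numbers of amino acids with two, and with more than two, atoms matched.
    """
    if q < 3:
        raise ValueError("the formula is stated for q >= 3")
    return (correction(m, n) * n_sites * A * phi ** B
            * rmsd ** (2.93 * q - 5.88)
            * (Y * rmsd ** 2) ** s
            * (Z * rmsd ** 3) ** t)


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e)
```

For a pair-wise comparison one would call `e_value` with `n_sites=1` and pass the result to `p_value`; a match would then be called significant at the 5% level when the returned value is below 0.05, as in the applications that follow.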

It is important to note that the MCMC procedure is not directly aiming to

optimise any of these measures, and it is equally important to appreciate that the

connection between statistical and biological significance is not straightforward. Ac-

cordingly the example applications in the section below were carefully chosen to be


well understood cases where matches can be interpreted relatively easily in biochem-

ical terms.

7.4 Applications

The method uses matching schemes that are relatively unrestricted in terms of amino

acid identity (either with no restriction or matching in broadly defined physico-

chemical groups). As currently formulated it is therefore better suited to the study of

larger ligand binding sites, than smaller sites associated, for example, with enzymatic

catalysis. The former are more likely to be defined by shape and physico-chemical

properties, while the latter depend critically on precise amino acid residue identities.

For our example applications we have therefore chosen sites for the binding of some

very common biochemical ligands related to FAD (flavin adenine dinucleotide) and

NAD(P) i.e. nicotinamide adenine dinucleotide (phosphate). These ligands are

bound as cofactors by a large variety of enzyme domains many of which come from

the Rossmann family of protein folds. Importantly, there are many proteins of known

structure that bind these related cofactors ranging from close evolutionary relatives,

through very distant relatives to proteins of different fold and likely independent

evolutionary origin. For structural and evolutionary relationships SCOP (Andreeva

et al., 2004) was used.

We consider two binding sites: the NAD binding site from an alcohol dehydrogenase structure (1hdx 1 in SITESDB), and a larger NADP binding site from a 17-β hydroxysteroid dehydrogenase (1a27 0 from SITESDB) which includes both the cofactor and substrate binding regions. For these binding sites we performed the following matching studies:

1. A functional site of alcohol dehydrogenase against NAD(P)(H) binding sites from proteins in the same SCOP family as alcohol dehydrogenase (alcohol dehydrogenase-like, N-terminal domain; SCOP: c.2.1.1).

2. A functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against NAD(P)(H) binding sites from proteins in the same SCOP family as 17-β hydroxysteroid dehydrogenase (tyrosine-dependent oxidoreductases; SCOP: c.2.1.2).

3. The alcohol dehydrogenase functional site in (1) against NAD(P)(H) binding sites from proteins in the same SCOP superfamily as alcohol dehydrogenase but in different families (SCOP: c.2.1.x, for x ≠ 1).

4. The alcohol dehydrogenase functional site against FAD/NAD(P)(H) binding sites from proteins in the FAD/NAD(P)-binding domain (SCOP: c.3.1.x).

The first of these test cases is the most straightforward, involving matching the

NAD binding site against similar sites in closely related proteins. The second is similar, but more challenging, because the larger 17-β hydroxysteroid dehydrogenase site (1a27 0) also incorporates the substrate binding region. The associated family

(c.2.1.2) is functionally broad and members catalyse reactions on a variety of diverse

substrate molecules. Matching methods therefore need to identify matches in the

related cofactor binding region and ignore local site dissimilarities owing to sub-

strate variation. The third test case considers similarities in sites with more distant

evolutionary relationships (where sequence similarity between the protein domains

concerned is very low, but the structural similarity of the Rossmann fold remains).

The fourth test case assesses the ability of the method to locate site similarities between

different folds that bind the same or related ligands.

7.4.1 Case 1: Alcohol Dehydrogenase and Family

Figure 7.1a shows the results of using graph matching only where matching was

performed with and without amino acid property group information. First note

that in the less restricted matching scenario, without amino acid group informa-

tion, matches generally involve more residues or lower RMSD values, as would be

expected. Thus, in the figure, the lines connecting the restricted matches (green

circles) with the unrestricted matches (blue crosses) for each site family member


often have a gradient that is negative or close to zero. In the case of matching with-

out property information, most sites in the family show a match with 1hdx 1 with

a low RMSD (< 1.5 Å) and a significant number of corresponding residues (> 8). However, this is not the case when amino acid property information is taken into account, and a minority of the matches show relatively high RMSD values over generally lower numbers of matching residues. Thus it appears that lower quality matches can

result from the use of amino acid property information, perhaps because these close

relatives have conserved the shape of the binding site but not the physico-chemical

characteristics. This may happen in binding site regions whose properties are not

crucial to ligand binding.

Figures 7.1b and 7.2 show the effect of the MCMC refinement on the graph-only matches of Figure 7.1a. The same basic conclusions can be drawn from Figure 7.1b as from Figure 7.1a. However, from Figures 7.1b and 7.2 it is clear that in the cases of 3 site matches with amino acid property information, the MCMC refinement procedure produces significant improvements in RMSD values (RMSD is improved from > 1.5 Å to less than 1 Å, while the number of matching residues also increases marginally). Thus the refinement procedure is able to improve some matches, even in the cases of closely related sites examined here.

The overall effect of the refinement procedure within this family can be consid-

ered in terms of the statistical significance of the matches obtained. This informa-

tion is summarised in Table 7.1. Without taking physico-chemical properties into

account, 142 of the 145 sites produced significant matches (p-value < 0.05).

The MCMC refinement step significantly improves solutions in matching sites of quinone oxidoreductase (1qor) and hypothetical protein YhdH (1o8c). Multiple sequence alignment of 1hdx with family members shows that they share a common dinucleotide binding motif GL-GGVG. For 1qor 0, we match 2 glycines in the dinucleotide binding motif GL-GGVG before the MCMC refinement step and 3 glycines after it. For 1o8c 1, we match 3 glycines before the MCMC refinement step and all 4 glycines after it. Figures 7.3 and 7.4 show corresponding amino acids before and after the refinement in 1qor 0 and 1o8c 1 respectively. These are some of the cases probably with several alternative solutions which the probabilistic approach is able to explore.

Figure 7.1: Alcohol dehydrogenase NAD-binding site (1hdx 1) matching against NAD(P)(H) binding sites of SCOP alcohol dehydrogenase-like family proteins (Case 1), plotted as RMSD (Å) against number of matched amino acids. a) Graph matching prior to MCMC refinement step showing results with/without amino acid property information. Each site is represented by a green circle (with) and blue cross (without) connected by a straight line to highlight the difference. b) MCMC refinement step of (a).

Figure 7.2: Effect of MCMC refinement on graph matches of the NAD-binding functional site of alcohol dehydrogenase (1hdx 1) against NAD(P)(H) binding sites of SCOP alcohol dehydrogenase-like family proteins (Case 1) where corresponding amino acids are restricted to others in the same group, plotted as RMSD (Å) against number of matched amino acids. Each site is represented by a green circle (graph only) and blue cross (after MCMC refinement) connected by a straight line to highlight the difference.

When taking physico-chemical properties into account, we find 132 out of 145

sites significant after the refinement step. There are only 125 significant matches be-

fore MCMC refinement step. Matches with 1qlh 1, 1hdy 1, 3hud 1, 1n9q 1 and sites

from 1pl6 are only significant after MCMC refinement step. Figure 7.2 is a plot of

RMSD against number of corresponding amino acids before and after MCMC refine-

ment step when accounting for physico-chemical properties in matching. This plot

shows that MCMC refinement step achieves better RMSD and more corresponding

amino acids in a number of cases.


[Figure 7.3 panels: Graph (before refinement); MCMC Refinement Step (after refinement).]

Figure 7.3: Corresponding amino acids between the NAD-binding site of alcohol

dehydrogenase (1hdx 1) and NADP-binding site of quinone oxidoreductase (1qor 0)

before and after MCMC refinement step (Case 1). Amino acids with bold borders

are part of the dinucleotide binding motif GL-GGVG.

[Figure 7.4 panels: Graph (before refinement); MCMC Refinement Step (after refinement).]

Figure 7.4: Corresponding amino acids between the NAD-binding site of alcohol de-

hydrogenase (1hdx 1) and NADP-binding site of hypothetical protein YhdH (1o8c 1)

before and after MCMC refinement step (Case 1). Amino acids with bold borders

are part of the dinucleotide binding motif GL-GGVG.


7.4.2 Case 2: 17-β Hydroxysteroid Dehydrogenase and Family

We took a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) and matched it against NAD(P)(H) binding sites belonging to members of the same SCOP family (c.2.1.2) with and without taking into account the amino acid chemistry. The query, 1a27 0, binds NADP and oestradiol molecules.

When not accounting for physico-chemistry properties, the MCMC step refines 70 matches to become statistically significant. Before and after the MCMC refinement step, 248 and 318 sites respectively are significant. Some of these are 1udb 0, 2udp 1,

1uda 0, 1lrl 2, 1lrj 0, 1kvt 0, 1kvs 0, 1i3k 0, 1i3l 6, 1i3l 7, 1i3l 7, 1i3n 0, 1i3m 0,

1hzj 1, 1bxk 0. Figure 7.5 is a plot of RMSD against number of corresponding amino

acids before and after MCMC refinement step. Improvement after the refinement is

evident in many cases.

The comparison between matching with and without physico-chemistry proper-

ties gives the same pattern as in alcohol dehydrogenase. Figures 7.6a and 7.6b show

RMSD plotted against the number of corresponding amino acids when matching

with and without physico-chemistry before and after MCMC refinement step. Both

before and after MCMC refinement step, accounting for physico-chemistry restricts

the matching at the expense of RMSD and the number of corresponding amino

acids.

7.4.3 Case 3: Alcohol Dehydrogenase and Superfamily

We matched the same query as in Case 1 (1hdx 1) against other NAD(P)(H) binding sites belonging to members of the same SCOP superfamily (c.2.1.x) but of a different family from the query. No amino acid information is used in matching in this case. The MCMC refinement step achieves significant matches in, among other sites, 1nq5 1, 1nqo 1, 1nqo 3, 3dbv 0, 3dbv 2, 3dbv 3, 4dbv 0, 4dbv 1, 4dbv 2, 4dbv 3 and 1efl 12. In all these cases, at least GGXG of the dinucleotide binding motif GXXGGXG is matched. Graph theoretic solutions before the MCMC refinement step are not significant. Figure 7.7 is a superposition of corresponding amino acids between functional sites of glyceraldehyde-3-phosphate dehydrogenase (3dbv 3) and alcohol dehydrogenase (1hdx 1) after the MCMC refinement step.

Figure 7.5: Effect of MCMC refinement on graph matches of the NADP-binding site of 17-β hydroxysteroid dehydrogenase (1a27 0) against NAD(P)(H) binding sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2) where corresponding amino acids are not restricted to others in the same group. Each site is represented by a green circle (graph only) and blue cross (after MCMC refinement) connected by a straight line to highlight the difference.

7.4.4 Case 4: Alcohol Dehydrogenase and FAD/NAD(P)-binding Domain

We took an NAD-binding functional site of alcohol dehydrogenase (1hdx 1) and matched it against FAD/NAD(P)(H) binding sites belonging to members of SCOP FAD/NAD(P)-binding domain (c.3.1.x) without taking into account the amino acid chemistry. A distance threshold value of 1.0 Å rather than 1.5 Å was found to give better matches for the graph theoretic solution and was used in this case. A total of 338 pair-wise comparisons were made and 64 were significant before the MCMC refinement step. Sites from dihydropyrimidine dehydrogenase (1gth 13) and fumarate reductase (1qla 5; 1qla 7; 1qlb 2; 1qlb 6) become statistically significant only after the MCMC refinement step (p-values for 1gth 13, 1qla 5, 1qla 7, 1qlb 2 and 1qlb 6 before the MCMC refinement step: 0.3742, 0.6621, 0.6766, 0.6199 and 0.6199; after the MCMC refinement step: 0.0258, 0.0141, 0.0141, 0.0001 and 0.0001).

Figure 7.6: RMSD against number of corresponding amino acids for matching 17-β hydroxysteroid dehydrogenase NADP-binding site against NAD(P)(H) binding sites of SCOP tyrosine dependent oxidoreductase family proteins (Case 2). a) Graph matching prior to MCMC refinement showing results with/without amino acid property information. Each site is represented by a green circle (with) and blue cross (without) connected by a straight line to highlight the difference. b) MCMC refinement of (a).

Figure 7.7: Superposition of matching amino acids (Case 3) between alcohol dehydrogenase (1hdx 1; blue) and glyceraldehyde-3-phosphate dehydrogenase (3dbv 3; red) binding sites after MCMC refinement (RMSD = 0.672; number of corresponding amino acids = 12; p-value = 3.68e-05). The matched dinucleotide binding motif is shown in ball-and-stick representation. Ligands are coloured in CPK colours.

7.4.5 Assessing MCMC Refinement

Table 7.1 gives a summary on improvements achieved after MCMC refinement step

in the applications considered when matching with and without physico-chemistry

properties. In all considered cases, there are sites which give statistically significant

matches only after MCMC refinement step.
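The net changes summarised in Table 7.1 can be tabulated with a few lines of Python. The counts below are copied from Table 7.1; the dictionary layout and variable names are illustrative. Note that a net change of zero does not by itself show whether individual sites gained or lost significance.

```python
# Counts from Table 7.1:
# (total, sig. graph, sig. MCMC) without amino acid property, then (sig. graph, sig. MCMC) with it.
TABLE_7_1 = {
    "alcohol dehydrogenase and family": (145, 142, 142, 125, 132),
    "17-beta hydroxysteroid dehydrogenase and family": (326, 248, 318, 159, 236),
    "alcohol dehydrogenase and superfamily": (897, 200, 324, 33, 222),
    "alcohol dehydrogenase and FAD/NAD(P)-binding domain": (338, 64, 69, 5, 12),
}

# Net change in the number of significant matches after refinement: (without, with property).
net_gain = {
    case: (m0 - g0, m1 - g1)
    for case, (total, g0, m0, g1, m1) in TABLE_7_1.items()
}

for case, (total, g0, m0, g1, m1) in TABLE_7_1.items():
    without, with_prop = net_gain[case]
    print(f"{case}: without property {without:+d} ({m0 / total:.1%} significant after MCMC), "
          f"with property {with_prop:+d} ({m1 / total:.1%} significant after MCMC)")
```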

When not using physico-chemistry properties, much improvement (relative to the number of sites considered) is registered in matching the query from 17-β hydroxysteroid dehydrogenase against sites from the same SCOP family members. There are also many improved cases in matching the query from alcohol dehydrogenase against sites from different families and fold (FAD/NAD(P)-binding domain). Tables 7.2 and 7.3 compare RMSD and number of matched amino acids found before and after refinement when not using physico-chemistry properties. RMSD does not change much after refinement. There are marginal increases in mean RMSD for matching functional sites of alcohol dehydrogenase and 17-β hydroxysteroid dehydrogenase against sites of same SCOP family members. However, there are also marginal decreases in mean RMSD for matching alcohol dehydrogenase against sites from different families and fold. The mean number of matched amino acids increases after MCMC refinement, except in matching functional sites of alcohol dehydrogenase and members of the same superfamily but different families, where there is a marginal decrease.

There are even more improvements after the MCMC refinement step when using physico-chemistry properties. However, there are fewer significant matches both before and after refinement when matching with physico-chemistry properties than when matching without them. Tables 7.4 and 7.5 compare RMSD and number of matched amino acids found before and after refinement when using physico-chemistry properties. There are marginal decreases in mean RMSD after MCMC refinement in all cases. The mean number of matched amino acids increases after MCMC refinement as well.

7.5 Comments

The examples given above make a clear case that MCMC refinement can improve

ligand binding site matches generated by graph matching, in terms of both the sta-

tistical and biological significance of the match. We attribute this success to the lack

of dependence on a strict matching tolerance, which is enforced in graph matching.

Statistical modelling in refinement of matches appears to have been successful in

automatically adapting to shape variations in ligand binding sites, which might be

due to different noise levels in atomic positions or protein phylogeny differences,


Table 7.1: Assessment of statistical significance of functional site matching before and after MCMC refinement step with and without amino acid property information.

                                                              Without amino acid property    With amino acid property
Case                                                  Total   Sig. Graph    Sig. MCMC        Sig. Graph    Sig. MCMC
alcohol dehydrogenase and family                        145      142           142              125           132
17-β hydroxysteroid dehydrogenase and family            326      248           318              159           236
alcohol dehydrogenase and superfamily                   897      200           324               33           222
alcohol dehydrogenase and FAD/NAD(P)-binding domain     338       64            69                5            12

Sig. Graph: significant before refinement.
Sig. MCMC: significant after MCMC refinement step.

Table 7.2: RMSD (Å) before and after MCMC refinement step without amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.     Mean   Std. Dev.
alcohol dehydrogenase and family                        0.590   0.2350       0.619   0.2824
17-β hydroxysteroid dehydrogenase and family            0.874   0.2208       0.958   0.1987
alcohol dehydrogenase and superfamily                   2.093   1.6820       1.934   1.7367
alcohol dehydrogenase and FAD/NAD(P)-binding domain     1.723   1.3155       1.715   1.3188

Table 7.3: The number of matched amino acids before and after MCMC refinement step without amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.     Mean   Std. Dev.
alcohol dehydrogenase and family                        33.7    12.26        34.6    13.11
17-β hydroxysteroid dehydrogenase and family            17.0     6.83        21.8     7.04
alcohol dehydrogenase and superfamily                   13.3     2.14        12.4     2.24
alcohol dehydrogenase and FAD/NAD(P)-binding domain     10.5     1.03        10.6     1.22


Table 7.4: RMSD (Å) before and after MCMC refinement step with amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.     Mean   Std. Dev.
alcohol dehydrogenase and family                        0.805   0.8892       0.737   0.7944
17-β hydroxysteroid dehydrogenase and family            1.047   0.6706       0.997   0.6226
alcohol dehydrogenase and superfamily                   2.751   1.9127       2.459   2.0150
alcohol dehydrogenase and FAD/NAD(P)-binding domain     3.424   2.3234       3.337   2.3814

Table 7.5: The number of matched amino acids before and after MCMC refinement step with amino acid property.

                                                            Graph                MCMC
Case                                                    Mean   Std. Dev.     Mean   Std. Dev.
alcohol dehydrogenase and family                        28.1    13.51        32.8    14.74
17-β hydroxysteroid dehydrogenase and family            12.4     6.55        17.0     7.75
alcohol dehydrogenase and superfamily                    8.3     1.21         8.7     1.69
alcohol dehydrogenase and FAD/NAD(P)-binding domain      8.7     0.81         9.0     1.06

among other factors. Refined matches usually retain a similar RMSD, and achieve

greater significance through expansion of the number of matching residues from the

core graph match. We have noted however that in some cases significant reductions

in the match RMSD are also achieved by refinement.

Dependence on a strict matching tolerance is not limited to graph matching,

but is also a feature of other matching methods commonly used in the field (e.g.

geometric hashing: Wallace et al., 1997). It is important to note that the MCMC re-

finement procedure can be applied to a starting match generated by any method; and

that the graph procedure chosen here was simply intended as an example. Equally, the MCMC procedure can be applied to matching with no previously generated starting match, for example by starting from randomly generated matches. That is, the

MCMC method provides a stand-alone algorithm for matching. Furthermore, the

method provides the full joint posterior distribution so that we have for example,

the posterior distribution for the matching matrix as well as the parameters of the

transformation simultaneously. However, we find that obtaining good matches by


this method is very expensive in terms of computational time. While methods such

as graph matching can be applied to database searching, where a site is matched

against all members of a large database of sites, this would be impractical for match-

ing by MCMC alone. We suggest therefore that the MCMC procedure would be

most advantageous when applied to the best hits from a database search using a

faster method, and that in many cases it would increase the number of significant

hits.

We have made only a very basic study of the effect of including amino-acid residue

physico-chemical property information in matching, contrasting matches obtained

without restriction (any residue may match any other) with slightly more restrictive

matching (residues only allowed to match within relatively broadly defined groups).

It is interesting that even with very broadly defined groups, fewer statistically sig-

nificant matches are generally obtained than when matching is without restriction.

This could suggest that the physico-chemical properties of sites binding the same

or similar ligands can change significantly in evolution. It is however most likely to

reflect increased flexibility to change in peripheral residues that are less important

for binding, and needs further investigation. The main point of this work is that

MCMC refinement can improve matches under either matching regime. Indeed in

a few cases of matching with physico-chemical groups, we showed that some graph

matches without statistical significance were converted to significant matches by the

MCMC procedure, revealing that using graph matching alone could lead to some

erroneous conclusions in this respect.

Chapter 8

Conclusions and Further Work

In this Chapter we summarise important points from Chapters 3 to 7. We also

highlight potential areas for further work on the topics discussed in this thesis.

8.1 Conclusions

A few conclusions can already be drawn from work reported in this thesis.

8.1.1 Functional Sites

Exploratory analysis shows that functional sites in SITESDB tend to consist of short contiguous segments (motifs) from the protein chain. Although in theory side chains from different parts of the chain can come together spatially to form an active site or binding site, the automatic extraction of these sites in SITESDB leads to the inclusion of all adjacent side chains (within 5 Å of residues annotated with a SITE record in the PDB or of bound ligands). These adjacent side chains are therefore present in the dataset but might not be part of the core binding or functional site. However, it should be noted that most currently well-known functional sites or active sites are motifs.

8.1.2 Simulating Random Protein Structures

In section 3.1.2 we successfully simulated random protein Cα traces. A simpler and more flexible approach, similar to that of Aszodi and Taylor (1994), was used. With a simple model of hydrophobic effects, the method produces compact and globular structures.

8.1.3 Matching Algorithms

All the algorithms considered here (graph theoretic, EM algorithm and MCMC) do better when configuration points are further apart. As expected, performance decreases as the points become more cluttered and as positional noise increases.

The graph method

The graph theoretic method is robust and fast, but not flexible enough to account for concomitant information and different noise levels in functional sites. The method can also break down or become very computationally intensive when matching configurations with many inter-point distances of the same magnitude, since the product graph becomes very large. Consequently the graph method might not be the ideal approach in some applications, such as matching whole protein chains.
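
To illustrate why the product graph grows when many inter-point distances are similar, the following is a minimal sketch of the graph-theoretic idea. It is not the implementation used in this thesis: the function names, the tolerance `tol` and the toy coordinates are assumptions made purely for illustration. Each node is a candidate pair (i, j); two nodes are joined when the corresponding distances within each configuration agree to within `tol`; a maximum clique, found here with the basic recursion of Bron and Kerbosch (1973), gives the largest mutually consistent set of matches.

```python
import math
from itertools import combinations

def product_graph(x, mu, tol):
    """Adjacency sets of the product (correspondence) graph.

    Node (i, j) pairs point i of x with point j of mu. Nodes (i, j) and
    (k, l) are adjacent when i != k, j != l and the distances |x_i - x_k|
    and |mu_j - mu_l| differ by less than tol (an assumed tolerance).
    """
    nodes = [(i, j) for i in range(len(x)) for j in range(len(mu))]
    adj = {v: set() for v in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l:
            if abs(math.dist(x[i], x[k]) - math.dist(mu[j], mu[l])) < tol:
                adj[(i, j)].add((k, l))
                adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Largest clique via the basic Bron-Kerbosch recursion (no pivoting)."""
    best = []

    def expand(r, p, excluded):
        nonlocal best
        if not p and not excluded:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], excluded & adj[v])
            p = p - {v}
            excluded = excluded | {v}

    expand([], set(adj), set())
    return sorted(best)

# Toy data (hypothetical): mu holds the first three points of x, rotated by
# 90 degrees about the z-axis and shifted by (5, 5, 5), plus one unrelated point.
x = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
mu = [(5, 5, 5), (5, 6, 5), (3, 5, 5), (9, 9, 9)]
match = max_clique(product_graph(x, mu, tol=0.1))
print(match)
```

The product graph has one node per candidate pair, so its size is the product of the two configuration sizes, and the number of edges rises quickly when many distances fall within the tolerance of one another.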

The EM algorithm

Concomitant information can be used flexibly in the EM algorithm. With good starting values the EM algorithm does impressively well in finding corresponding points. However, the EM algorithm is sensitive to starting values. It is recommended to run the algorithm from several starting values for the rotation, translation and noise parameters, and then to monitor convergence. Simple match-constraining techniques, e.g. variance cooling, help the algorithm to find better solutions.

The Bayesian hierarchical model (MCMC)

The algorithm is largely insensitive to starting values. Unlike the EM algorithm,

MCMC can easily escape local optima.
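
The reason a sampler can leave a local mode lies in the acceptance rule: a proposal with lower posterior density is still accepted with positive probability. The following is a generic random-walk Metropolis sketch on a toy one-dimensional bimodal target. It is not the sampler of Green and Mardia (2006) used in this thesis; the target `log_post`, the step size and all function names are assumptions for illustration only.

```python
import math
import random

def accept_probability(log_post_current, log_post_proposed):
    """Metropolis acceptance probability min(1, ratio) for a symmetric proposal."""
    return math.exp(min(0.0, log_post_proposed - log_post_current))

def log_post(theta):
    """Toy bimodal log-density (assumed): minor mode near -2, major mode near 3."""
    a = math.log(0.2) - 0.5 * (theta + 2.0) ** 2
    b = math.log(0.8) - 0.5 * (theta - 3.0) ** 2
    top = max(a, b)
    # log-sum-exp keeps the evaluation stable far from both modes
    return top + math.log(math.exp(a - top) + math.exp(b - top))

def metropolis(log_density, start, step, n_iter, rng):
    """Random-walk Metropolis chain on a one-dimensional parameter."""
    state, chain = start, [start]
    for _ in range(n_iter):
        proposal = state + rng.gauss(0.0, step)
        if rng.random() < accept_probability(log_density(state), log_density(proposal)):
            state = proposal  # downhill moves are accepted some of the time
        chain.append(state)
    return chain

# Start the chain in the minor mode and record how often it sits to the
# right of 0.5, i.e. on the side of the major mode.
chain = metropolis(log_post, start=-2.0, step=1.5, n_iter=5000, rng=random.Random(1))
fraction_near_major_mode = sum(t > 0.5 for t in chain) / len(chain)
print(fraction_near_major_mode)
```

A deterministic hill-climbing update, by contrast, would never accept a move to lower density and so could remain at the mode in which it started.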

Although an assumption of a hidden homogeneous Poisson process was made to formulate the model, the algorithm is not sensitive to this assumption. The algorithm can match hard-core configurations, simulated short and virtual protein chains and, most importantly, real functional sites.

The meta algorithm

There seems to be no silver bullet for matching functional sites. MCMC does better than the EM algorithm when the EM starting values are far from the true parameter values, since MCMC can escape local optima. However, MCMC is very computationally intensive and can sometimes drift away from the optimal solution. Convergence needs to be monitored in both the EM and MCMC algorithms. On the other hand, the graph theoretic method is robust and fast, but not flexible enough to account for concomitant information and different noise levels in functional sites. We therefore suggested Bayesian modelling of the graph solution, i.e. running the MCMC method starting from the transformation parameter estimates given by the graph method. This meta algorithm is observed to be a good strategy: the MCMC refinement step was able to improve graph-based matches so that they became more biologically significant.

8.1.4 Concomitant Information

Concomitant information (amino acid type) guides the EM algorithm to converge (faster) to the true solution. However, in most cases the geometric information is so rich that the contribution from amino acid type information is marginal in both the EM and MCMC algorithms.

8.1.5 Hardening Soft Matches

Both the EM and MCMC algorithms give probabilities of matching points in a pair-wise alignment. Using linear programming to optimally assign unique matches gives the best results. However, for small problems, simply taking the first non-duplicate set of matches with high probabilities, or using a greedy algorithm, gives practically similar solutions.
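
The two hardening strategies can be contrasted on a small example. The sketch below is illustrative only: the function names, the probability `threshold` and the toy matrix are assumptions, and an exhaustive search over one-to-one assignments stands in for the linear programming solver, which is feasible only for very small matrices. The matrix is deliberately contrived so that the two strategies disagree; as noted above, in small practical problems they gave practically similar solutions.

```python
from itertools import permutations

def greedy_hard_matches(prob, threshold=0.5):
    """Take pairs in decreasing probability, skipping points already used."""
    pairs = sorted(((p, i, j) for i, row in enumerate(prob)
                    for j, p in enumerate(row) if p >= threshold), reverse=True)
    used_i, used_j, matches = set(), set(), []
    for p, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return sorted(matches)

def optimal_hard_matches(prob, threshold=0.5):
    """One-to-one assignment maximising the summed probability of kept pairs.

    Exhaustive search is an assumed stand-in for an assignment solver; it
    requires rows <= columns and is practical only for small matrices.
    """
    n_rows, n_cols = len(prob), len(prob[0])
    best_score, best = -1.0, []
    for cols in permutations(range(n_cols), n_rows):
        kept = [(i, j) for i, j in enumerate(cols) if prob[i][j] >= threshold]
        score = sum(prob[i][j] for i, j in kept)
        if score > best_score:
            best_score, best = score, kept
    return best

# Hypothetical soft-match probabilities: rows index points of one
# configuration, columns index points of the other.
prob = [[0.90, 0.80],
        [0.85, 0.10]]
greedy = greedy_hard_matches(prob)
optimal = optimal_hard_matches(prob)
print(greedy, optimal)
```

Here the greedy rule commits to the single most probable pair and then cannot match the second row, whereas the optimal assignment gives up that pair in order to retain two matches.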

8.1.6 Assessing Significance of Matches

We have considered assessing whether matches are significantly non-random, under the null hypothesis of random matches. We have also considered the goodness-of-fit for matching related configurations.

Random versus non-random matches

The significance of matching two configurations under the null hypothesis of random matches depends on the RMSD, the total number of amino acids in each configuration and the number of amino acids matched. An extreme value distribution with empirically derived constants for matching two random configurations can be used for evaluating p-values, so that the p-value calculation takes all three of these quantities into account.
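
As a rough illustration of how such a p-value might be evaluated, the sketch below uses the Gumbel (type I extreme value) distribution for minima, since a small RMSD is the extreme of interest. The exact parameterisation used in this thesis is given in section 1.2.2 and is not reproduced here; the Gumbel form, the function name and the constants `location` and `scale` are assumptions. In practice the constants would be derived empirically and would depend on the configuration sizes and the number of matched amino acids.

```python
import math

def gumbel_min_pvalue(rmsd, location, scale):
    """P(R <= rmsd) when the random-match RMSD R follows a Gumbel
    distribution for minima with the given (assumed) location and scale."""
    z = (rmsd - location) / scale
    # 1 - exp(-exp(z)), written with expm1 for accuracy at small p-values
    return -math.expm1(-math.exp(z))

# Hypothetical constants, for illustration only.
p_small_rmsd = gumbel_min_pvalue(0.8, location=2.0, scale=0.5)
p_large_rmsd = gumbel_min_pvalue(1.8, location=2.0, scale=0.5)
print(p_small_rmsd, p_large_rmsd)
```

A smaller RMSD gives a smaller p-value for fixed constants, which is the behaviour required of a significance measure for matches.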

Goodness-of-fit

We considered goodness-of-fit for matches known to be related (not matching by chance) in section 4.1 of Chapter 4. P-values for assessing goodness-of-fit or ranking matches from the RMSD distribution under the isotropic Gaussian error model mostly agree with decisions based on the score suggested by Gold (2003).

8.1.7 Application: Matching NAD Binding Functional Sites

In Chapter 7, section 7.4, when using the meta algorithm of graph theoretic matching followed by MCMC to match NAD(P)(H) binding sites, we found examples where significant shape-based matches do not retain similar amino acid chemistry. Matches were within single Rossmann fold families, between different families in the same superfamily, and in different folds. This indicates that even within families the same ligand may be bound using substantially different physico-chemistry. We also showed that the procedure finds significant matches between binding sites for the same cofactor in different families and different folds.

In our basic study of the effect of including amino-acid residue physico-chemical property information in matching, we contrasted matches obtained without restriction (any amino acid could match any other) with slightly more restrictive matching (amino acids only allowed to match within relatively broadly defined groups). It is interesting that, even with very broadly defined groups, fewer statistically significant matches were generally obtained than when matching was without restriction. It is also interesting to note that MCMC refinement improved matches under either matching regime. Indeed, in a few cases of matching with physico-chemical groups, we showed that some graph matches without statistical significance were converted to significant matches by the MCMC procedure, revealing that using graph matching alone could lead to some erroneous conclusions in this respect.

8.2 Further Work

This work has also raised some issues which are interesting and need further work.

8.2.1 Simulating Random Protein Structures

Aszodi and Taylor (1994) and our alternative method in section 3.1.2 use fixed target

distances between Cα atoms, only taking into account the hydrophobic property

of amino acids. Further work could explore the idea of varying target distances

according to the target chain length. This would require exploring how distances

between different types of amino acids in 3-dimensional structures vary with the

spacing in the sequence as well as the length of the chain.

In our method, the minimum of three random angles from the von Mises distribution was used at each hydrophobic Cα atom in order to fold the chain towards the centre of mass and create a hydrophobic interior core. Further work could use a variable number of random angles, depending on the number of Cα atoms already in the chain. This approach would control the level of structure compactness. Furthermore, it would reduce the chance of the chain crashing into itself.

8.2.2 Matching Statistics

More research is required on the distribution of the RMSD, or size-and-shape distance, and of the number of matches when matching random configurations. The following are interesting and very much open questions:

(a) What is the exact distribution of the number of matches q when matching two random configurations {x} and {µ} with, say, n and m points?

(b) What is the exact distribution of the RMSD for matching q pairs of points between two random configurations with n and m points?

Empirical approaches (Stark et al., 2003b; Chen and Crippen, 2005) have been quite successful in answering these questions. However, the true analytical distributions have not been worked out. With analytical distributions, the adjustment for database size would be straightforward, and the empirical (model-fitting) approximation by the limiting distribution (EVD) in section 1.2.2 for the minimum RMSD or number of matches in a database search would not be required.
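
To indicate what such an adjustment could look like, the sketch below converts a single-comparison p-value into the probability of seeing at least one equally extreme match among N comparisons. It assumes that the N comparisons are independent and share the same null distribution, which is a simplification and not a result of this thesis; the function name is hypothetical.

```python
import math

def database_adjusted_pvalue(p_single, n_comparisons):
    """P(at least one of n independent comparisons is as extreme),
    i.e. 1 - (1 - p)^n, under the assumed independence."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if p_single == 1.0:
        return 1.0
    # log1p and expm1 keep precision when p_single is very small
    return -math.expm1(n_comparisons * math.log1p(-p_single))

# Example: a single-comparison p-value of 1e-6 against 55,000 sites.
p_db = database_adjusted_pvalue(1e-6, 55_000)
print(p_db)
```

For small p-values the adjusted value is close to, and never larger than, the simple product of the p-value and the database size.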

8.2.3 Matching Algorithms

In this thesis we have only considered pair-wise matching of configurations. An important extension of this approach is matching multiple configurations simultaneously.

The EM algorithm

There are still a number of questions of interest to be investigated with regard to EM algorithm alignment. For example, further work needs to be done on exploring the multiple transformations approach. We observe an asymmetry in the performance of the algorithm when matching configurations with two transformations: the algorithm gave fewer errors with respect to the first transformation than with respect to the second. Further work is needed in order to understand this observation.

Sensitivity Analysis for Multiple Transformations

Relevant questions in the multiple transformation approach include:

(a) What are the effects of mis-specifying the number of transformations, e.g. assuming the presence of two transformations when actually there is only one? Simulations similar to those in section 5.4.4, but more extensive, are required.

(b) How can the number of transformations be inferred?

(c) When does the problem become over-parameterised (number of transformations versus the number of points in the configurations)?

Using Multiple Atoms for each Amino Acid

We will consider using more than one atom from each amino acid for matching

functional sites in the future. Some of the issues to be considered are:

(a) Which atoms to choose?

(b) How to account for dependence between atoms?

The Bayesian hierarchical model

In the future we will consider alternative formulations to relax the assumption of

conditional normal distribution for the second atom given the first atom when using

two atoms in an amino acid for matching functional sites.

Sequence ordering

All matching algorithms (MCMC, graph theoretic and EM algorithms) can be

extended to take into account the sequence ordering information especially when

matching whole protein structures. In addition to an enhanced capability to solve

alignment and correspondence for configurations with many points, this would speed

up running times of the algorithms. Sequence information would constrain the

matching further and dramatically reduce the solution space.
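
The size of this reduction can be made concrete by counting. If q pairs are matched between configurations of n and m points, there are C(n, q) · C(m, q) · q! possible one-to-one matchings when order is ignored, but only C(n, q) · C(m, q) when the matched pairs must appear in the same sequence order in both chains, a reduction by a factor of q!. The sketch below evaluates these counts. The function name, the strict order-preserving form of the constraint and the choice q = 10 are illustrative assumptions; n = 63 and m = 40 are the site sizes mentioned in Appendix A.

```python
import math

def count_matchings(n, m, q, ordered=False):
    """Number of ways to match q pairs between n and m points.

    Unordered: choose q points on each side and any bijection between them.
    Ordered: the bijection must preserve sequence order, so it is unique
    once the two subsets are chosen (an assumed form of the constraint).
    """
    ways = math.comb(n, q) * math.comb(m, q)
    return ways if ordered else ways * math.factorial(q)

unordered = count_matchings(63, 40, 10)
ordered = count_matchings(63, 40, 10, ordered=True)
reduction_factor = unordered // ordered
print(reduction_factor)
```

Because the factor q! grows faster than exponentially in q, the benefit of an ordering constraint is largest for configurations with many matched points, such as whole protein chains.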

8.2.4 Application: Matching NAD Binding Functional Sites

There is a suggestion that the physico-chemical properties of sites binding the same

or similar ligands can change significantly in evolution. This was observed when

matching NAD(P)(H) binding sites within single Rossmann fold families, between

different families in the same superfamily, and in different folds in Chapter 7, section

7.4. It is however most likely to reflect increased flexibility to change in peripheral

residues that are less important for binding, and this needs further investigation.

Bibliography

Andreeva, A., Howorth, D., Brenner, S.E, Hubbard, T.J.P., Chothia, C. and Murzin,

A.G. (2004). SCOP database in 2004: refinements integrate structure and se-

quence family data. Nucl. Acid Res. 32 (1), D226–D229.

Applegate, D. and Johnson, D. An implementation of the Carraghan and Pardalos

algorithm. ftp://dimacs.rutgers.edu/pub/challenge/graph/solvers/ .

Artymiuk, P.J., Poirrette, A.R., Grindley, H.M., Rice, D.W. and Willett, P. (1994).

A graph-theoretic approach to the identification of three-dimensional patterns of

amino acid side-chains in protein structures. J. Mol. Biol. 243, 327–44.

Aszodi, A. and Taylor, W.R. (1994). Folding polypeptide α-carbon backbones by distance geometry methods. Biopolymers 34, 489–505.

Bartlett, M.S. (1964). The spectral analysis of two-dimensional point processes.

Biometrika 51, 299–311.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000). The protein data bank. Nucleic Acids Research 28, 235–242.

Binkowski, T.A., Adamian, L. and Liang, J. (2003). Inferring functional rela-

tionships of proteins from local sequence and spatial surface patterns. J. Mol.

Biol. 332, 505–26.

Blow, D.M., Birktoft, J.J. and Hartley, B.S. (1969). Role of a buried acid group in

the mechanism of action of chymotrypsin. Nature 221, 337–40.

Branden, C. and Tooze, J. (1999). Introduction to Protein Structure (2nd ed.). New

York: Garland Publishing, Inc.

Brenner, S.E. and Levitt, M. (2000). Expectations from structural genomics. Protein

Science 9, 197–200.

Bron, C. and Kerbosch, J. (1973). Algorithm 457: Finding all cliques of an undi-

rected graph. Communications of the ACM 16, 575–577.

Burry, K.V. (1975). Statistical Methods in Applied Science. New Jersey: John Wiley

and Sons, Ltd.

Carraghan, R. and Pardalos, P.M. (1990). Exact algorithm for the maximum clique problem. Operations Research Letters 9, 375–382.

Carugo, O. and Pongor, S. (2001). A normalized root-mean-square distance for

comparing protein three-dimensional structures. Protein Science 10, 1470–1473.

Chen, Y. and Crippen, G.M. (2005). A novel approach to structural alignment using

realistic structural and environmental information. preprint .

Chhajer, M. and Crippen, G.M. (2002). A protein folding potential that places the

native states of a large number of proteins near a local minimum. BMC Structural

Biology 2.

Chothia, C. and Lesk, A.M. (1986). The relation between the divergence of sequence

and structure in proteins. EMBO J. 5, 823–826.

Cressie, N.A.C. (1993). Statistics for spatial data (Rev. ed.). Chichester ; New York:

John Wiley and Sons.

Dafas, P., Bolser, D.M., Gomoluch, J., Park, J. and Schroeder, M. (2004). Using convex hulls to extract interaction interfaces from known structures. Bioinformatics 20, 1486–1490.

Dayhoff, M., Schwartz, R. and Orcutt, B. (1978). A model of evolutionary change in

proteins. In M. Dayhoff (Ed.), Atlas of Protein Sequence and Structure, Volume 5,

pp. 345–352. Washington, D.C.: Natl. Biomed. Res. Found.

Deb, K. (2001). Multi-objective Optimization Using Evolutionary Algorithms

(1st ed.). Chichester ; New York: John Wiley and Sons.

Diggle, P.J. (1983). Statistical Analysis of Spatial Point Patterns. London: Academic

Press.

Downs, T.D. (1972). Orientation statistics. Biometrika 59, 665–676.

Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. Chichester: John

Wiley.

Dryden, I.L., Hirst, J.D. and Melville, J.L. (2006). Statistical analysis of unla-

belled point sets: comparing molecules in chemoinformatics. Under revision for

Biometrics .

Eidhammer, I., Jonassen, I. and Taylor, W.R. (2004). PROTEIN BIOINFORMAT-

ICS: An Algorithmic Approach to Sequence and Structure Analysis. New Jersey:

John Wiley and Sons, Ltd.

Ewens, W.J. and Grant, G.R. (2001). Statistical Methods in Bioinformatics : an

introduction. New York: Springer.

Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (Eds.) (1996). Markov chain

Monte Carlo in practice. London: Chapman and Hall.

Gold, N.D. (2003). Computational approaches to similarity searching in a functional

site database for protein function prediction. Ph.D thesis, Leeds University, School

of Biochemistry and Microbiology.

Gold, N.D., Pickering, S.J. and Westhead, D.R. (2003). Predicting protein function

from structure using sitesdb: evaluation of a method based on functional site-

similarity. Preprint .

Gold, N.D. and Jackson, R.M. (2006). Fold independent structural comparisons

of protein-ligand binding sites for exploring functional relationships. J. Mol.

Biol. 355 (5), 1112–1124.

Gong, S., Park, C., Choi, H., Ko, J., Jang, I., Lee, J., Bolser, D.M., Oh, D., Kim, D. and Bhak, J. (2005). A protein domain interaction interface database: InterPare. BMC Bioinformatics 6.

Gong, S.S., Yoon, G.S., Jang, I.S., Bolser, D.M., Dafas, P., Schroeder, M., Choi, H.S., Cho, Y.B., Han, K.S., Lee, S.H., Choi, H.H., Lappe, M., Holm, L., Kim, S.S., Oh, D.H. and Bhak, J.H. (2005). PSIbase: a database of Protein Structural Interactome map (PSIMAP). Bioinformatics 21, 2541–2543.

Green, P.J. (2001). A primer on Markov chain Monte Carlo. In O. Barndorff-

Nielsen, D. Cox, and C. Kluppelberg (Eds.), Complex Stochastic Systems, pp.

1–62. London: Chapman and Hall.

Green, P.J. and Mardia, K.V. (2006). Bayesian alignment using hierarchical models,

with applications in protein bioinformatics. Biometrika in press.

Gumbel, E.J. (1958). Statistics of Extremes. New York: Columbia University Press.

Holm, L., Ouzounis, C., Sander, C., Tuparev, G. and Vriend, G. (1992). A database

of protein structure families with common folding motifs. Protein Sci. 1, 1691–8.

Holm, L. and Sander, C. (1993). Protein structure comparison by alignment of

distance matrices. J. Mol. Biol. 233, 123–38.

Hubbard, T.J., Murzin, A.G., Brenner, S.E. and Chothia, C. (1997). SCOP: a

structural classification of proteins database. Nucleic Acids Res 25, 236–9.

Hung, M.S. and Rom, W.O. (1980). Solving the assignment problem by relaxation.

Operations Research 28, 969–982.

Jaramillo, A., Wernisch, L., Hery, S. and Wodak, S.J. (2002). Folding free energy function selects native-like protein sequences in the core but not on the surface. Proc. Natl. Acad. Sci. 99 (21), 13554–9.

Jeong, J.I., Jang, Y. and Kim, M.K. (2006). A connection rule for α-carbon coarse-

grained elastic network models using chemical bond information. Journal of

Molecular Graphics and Modelling 24, 296–306.

Jonker, R. and Volgenant, A.A. (1987). Shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38, 325–340.

Kabsch, W. (1978). A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst. A 34, 827–828.

Karp, R.M. (1980). An algorithm to solve the m × n assignment problem in expected time O(mn log n). Networks 10, 143–152.

Kent, J.T., Mardia, K.V. and Taylor, C.C. (2004). Matching unlabelled configu-

rations of unequal size with applications to bioinformatics. In R.G. Aykroyd,

S. Barber, and K.V. Mardia (Eds.), Bioinformatics, Images, and Wavelets, pp.

33–36. Leeds University Press.

Khatri, C.G. and Mardia, K.V. (1977). The von Mises-Fisher matrix distribution in

orientation statistics. Journal of the Royal Statistical Society. Series B (Method-

ological) 39 (1), 95–106.

Kinoshita, K., Furui, J. and Nakamura, H. (2002). Identification of protein functions

from a molecular surface database, eF-site. J. Struct. Funct. Genomics 2, 9–22.

Kinoshita, K., Sadanami, K., Kidera, A. and Go, N. (1999). Structural motif of

phosphate-binding site common to various protein superfamilies: all-against-all

structural comparison of protein-mononucleotide complexes. Protein Eng. 12,

11–4.

Kleywegt, G.J. (1999). Recognition of spatial motifs in protein structures. J. Mol.

Biol. 285, 1887–97.

Kuhl, F.S., Crippen, G.M. and Friesen, D.K. (1984). A combinatorial algorithm for

calculating ligand binding. Journal of Computational Chemistry 5 (1), 24–34.

Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Naval

Research Logistics Quarterly 2, 83–97.

Lesk, A.M. (2000). Introduction to protein architecture: the structural biology of

proteins. Oxford: Oxford University Press.

Lesk, A.M. (2002). Introduction to Bioinformatics. Oxford: Oxford University

Press.

van Lieshout, M.N.M. (2000). Markov point processes and their application. London:

Imperial College Press.

Luo, B. and Hancock, E.R. (2001). Structural Graph Matching Using the EM

Algorithm and Singular Value Decomposition. IEEE Trans. Pattern Analysis and

Machine Intelligence 23 (10), 1120–1136.

Mardia, K.V. (1972). Statistics of Directional Data. London and New York: Aca-

demic Press.

Mardia, K.V. and Gadsden, R.J. (1977). A small circle of best fit for spherical data

and areas of vulcanism. Applied Statistics 26 (3), 238–245.

Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics. Chichester: John Wiley

and Sons Ltd.

Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2003). Protein matching us-

ing amino acids information. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon

(Eds.), Stochastic Geometry, Biological Structure and Images, pp. 147. Leeds Uni-

versity Press.

Mardia, K.V., Taylor, C.C. and Westhead, D.R. (2003). Structural Bioinformatics

Revisited. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon (Eds.), Stochastic

Geometry, Biological Structure and Images, pp. 11–18. Leeds University Press.

Mardia, K.V. and Nyirongo, V. (2004). Procrustes statistics for unlabelled points

and applications. In R.G. Aykroyd, S. Barber, and K.V. Mardia (Eds.), Bioin-

formatics, Images, and Wavelets, pp. 137. Leeds University Press.

Mardia, K.V., Nyirongo, V. and Westhead, D.R. (2005). EM algorithm, Bayesian

and distance approaches to matching active site. Mathematical and Statistical

Aspects of Molecular Biology 15th Annual meeting, Abstracts pp. 13–14.

Murty, K.G. (1968). An algorithm for ranking all assignments in order of increasing cost. Operations Research 16, 682–687.

Naor, D., Fischer, D., Jernigan, R.L., Wolfson, H.J. and Nussinov, R. (1996). Amino acid pair interchanges at spatially conserved locations. J. Mol. Biol. 256 (5), 924–938.

Orengo, C.A., Jones, D.T. and Thornton, J.M. (1994). Protein superfamilies and

domain superfolds. Nature 372, 631–4.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. and Thorn-

ton, J.M. (1997). Cath–a hierarchic classification of protein domain structures.

Structure 5, 1093–108.

Park, J., Lappe, M. and Teichmann, S. (2001). Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol. 307 (3), 929–938.

Pedersen, L. (2002). Analysis of two-dimensional electrophoresis gel images. Ph. D.

thesis, Technical University of Denmark.

Pereira De Araujo, A.F. (1999). Folding protein models with a simple hydrophobic

energy function: The fundamental importance of monomer inside/outside segre-

gation. Proc. Natl. Acad. Sci. 96(22), 12482–7.

Raffenetti, R.C. and Ruedenberg, K. (1970). Parameterization of an orthogonal

matrix in terms of generalized Eulerian angles. International Journal of Quantum

Chemistry IIIS, 625–634.

Rangarajan, A. and Gold, S. (1996). A graduated assignment algorithm for graph

matching. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (4), 377–

388.

Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an

unknown number of components (with discussion). Journal of the Royal Statistical

Society. Series B (Methodological) 59 (4), 731–792.

Ripley, B.D. (1976). The second-order analysis of stationary point processes. Journal

of Applied Probability 13, 255–266.

Ripley, B.D. (1977). Modelling spatial patterns (with discussion). Journal of the

Royal Statistical Society. Series B (Methodological) 39 (2), 172–212.

Sanchez, R. and Sali, A. (1998). Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. 95 (23), 13597–602.

Sayle, R.A. and Milner-White, E.J. (1995). Rasmol: biomolecular graphics for all.

Trends in Biochemical Sciences 20 (9), 374.

Schmitt, S., Kuhn, D. and Klebe, G. (2002). A new method to detect related function

among proteins independent of sequence and fold homology. J. Mol. Biol. 323,

387–406.

Shindyalov, I. and Bourne, P.E. (1998). Protein structure alignment by incremental

combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–47.

Shulman-Peleg, A., Nussinov, R. and Wolfson, H.J. (2004). Recognition of functional

sites in protein structures. J. Mol. Biol. 339, 607–33.

Stark, A., Sunyaev, S. and Russell, R.B. (2003b). A model for statistical significance

of local similarities in structure. J. Mol. Biol. 326 (5), 1307–1316.

Taylor, C.C., Mardia, K.V. and Kent, J.T. (2003). Matching unlabelled configura-

tions using the EM algorithm. In R.G. Aykroyd, K.V. Mardia, and M.J. Langdon

(Eds.), Proceedings in Stochastic Geometry, Biological Structure and Images, pp.

19–21. Leeds University Press.

Torrance, J.W., Bartlett, G.J., Porter, C.T. and Thornton, J.M (2005). Using a

Library of Structural Templates to Recognise Catalytic Sites and Explore their

Evolution in Homologous Families. J. Mol. Biol. 347 (3), 565–81.

Wallace, A.C., Borkakoti, N. and Thornton, J.M. (1997). TESS: a geometric hashing

algorithm for deriving 3d coordinate templates for searching structural databases.

Protein Sci. 6, 2308–23.

Weiner, S.J., Kollman, P.A., Case, D.A, Singh, U.C., Alagona, G., Profeta Jr., S.

and Weiner, P. (1984). A new force field for the molecular mechanical simulation

of nucleic acids and proteins. J. Am. Chem. Soc. 106, 765–84.

Wright, C.S., Alden, R.A. and Kraut, J. (1969). Structure of subtilisin BPN′ at 2.5 angstrom resolution. Nature 221, 235–42.

Wright, M.B. (1990). Speeding up the Hungarian algorithm. Computers and Oper-

ations Research 17 (1), 95–96.

Wu, T.D., Schmidler, S.C., Hastie, T. and Brutlag, D.L. (1998). Modeling and

superposition of multiple protein structures using affine transformations: Analysis

of the globins. In Pacific Symposium on Biocomputing ’98, Maui, Hawaii, pp. 509–

520. World Scientific.

Appendix A

Computational Cost

Time and storage requirements of algorithms used in Bioinformatics applications are very important because usually many comparisons are made or huge amounts of data are processed. In the future we would like to compare run times for the graph method, the EM algorithm and MCMC. This would require the methods to be implemented in the same programming language. Presently run times are not directly comparable, as the graph method is implemented in C, the EM algorithm in R and MCMC in Fortran. Nevertheless, we present estimated processor times in the next section, not for comparison purposes but to give an indication of the time it would take to search a typical database.

A.1 Processor Times

The first 100 sites in SITESDB were matched against a large functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) with 63 amino acids. Sun Microsystems UltraSPARC II (360 MHz) and UltraSPARC-IIe (650 MHz) processors were used. Table A.1 gives the estimated times taken to do 55,000 and 100,000 pair-wise comparisons by the EM algorithm and graph methods.

For the Bayesian method, Green and Mardia (2006) report a run time of 2 seconds on an 800 MHz PC to match the same functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) against another functional site with 40 points.

Table A.1: Estimated time (hrs) for the EM algorithm and graph methods to do pair-wise comparisons between a functional site of 17-β hydroxysteroid dehydrogenase (1a27 0) and functional sites in SITESDB on 360 and 650 MHz processors.

                 EM algorithm        Graph
Database size    360MHz   650MHz     360MHz   650MHz
55,000           121      67         23       10
100,000          221      121        41       19

A.2 Comments

• Naive analysis of the times in Table A.1 shows that the graph method is about six times faster than the EM algorithm. Coincidentally, C implementations are in general roughly six times faster than R implementations. Thus implementing the EM algorithm in C should improve run times.

• This analysis involved a relatively large query site consisting of 63 points.
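
The "about six times" figure can be checked directly from the entries of Table A.1. The short calculation below is only a sketch of that arithmetic; the variable names are illustrative, and the per-comparison time is simply the tabulated estimate divided by the database size.

```python
# Estimated hours from Table A.1, keyed by (database size, processor MHz).
em_hours = {(55_000, 360): 121, (55_000, 650): 67,
            (100_000, 360): 221, (100_000, 650): 121}
graph_hours = {(55_000, 360): 23, (55_000, 650): 10,
               (100_000, 360): 41, (100_000, 650): 19}

# Ratio of EM time to graph time for each table cell, and their mean.
ratios = {key: em_hours[key] / graph_hours[key] for key in em_hours}
mean_ratio = sum(ratios.values()) / len(ratios)

# Implied seconds per pair-wise comparison for the EM algorithm at 360 MHz.
em_seconds_per_comparison = em_hours[(55_000, 360)] * 3600 / 55_000

print(ratios, mean_ratio, em_seconds_per_comparison)
```

The individual ratios lie between roughly five and seven, so "about six" is a fair summary across both processors and both database sizes.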
