GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS
by
Yusu Wang
Department of Computer Science
Duke University
Date:
Approved:
Prof. Pankaj K. Agarwal, Supervisor
Prof. Herbert Edelsbrunner, Co-advisor
Prof. John Harer
Prof. Johannes Rudolph
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of
Duke University
2004
ABSTRACT
GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS
by
Yusu Wang
Department of Computer Science
Duke University
Date:
Approved:
Prof. Pankaj K. Agarwal, Supervisor
Prof. Herbert Edelsbrunner, Co-advisor
Prof. John Harer
Prof. Johannes Rudolph
An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of
Duke University
2004
Abstract
Biology provides some of the most important and complex scientific challenges of our
time. With the recent success of the Human Genome Project, one of the main challenges in
molecular biology is the determination and exploitation of the three-dimensional structures of proteins and their functions. The ability of proteins to perform their numerous functions
is made possible by the diversity of their three-dimensional structures, which are capable
of highly specific molecular recognition. Hence, to attack the key problems involved, such
as protein folding and docking, geometry and topology become important tools. Despite
their essential roles, geometric and topological methods are relatively uncommon in com-
putational biology, partly due to a number of modeling and algorithmic challenges. This
thesis describes efficient computational methods for characterizing and comparing molecu-
lar structures by combining both geometric and topological approaches. Although most of
the work described here focuses on biological applications, the techniques developed can
be applied to other fields, including computer graphics, vision, databases, and robotics.
Geometrically, the shape of a molecule can be modeled as (i) a set of weighted points, representing the centers of atoms and their van der Waals radii; (ii) a polygonal curve, corresponding to a protein backbone or a DNA strand; or (iii) a polygonal mesh corresponding to a molecular surface. Each such representation emphasizes different aspects
of molecular structures at various scales, the choice of which depends on the underlying
applications. Characterizing molecular shapes represented in various ways is an important
step toward better understanding or manipulating molecular structures. In the first part of
the thesis, we study three geometric descriptions: the writhing number of DNA strands, the
level-of-details representation of protein backbones via simplification, and the elevation of
molecular surfaces.
The writhing number of a curve measures how many times a curve coils around itself
in space. It describes the so-called supercoiling phenomenon of double stranded DNA,
which influences DNA replication, recombination, and transcription. It is also used to
characterize protein backbones. This thesis proposes the first subquadratic algorithm for
computing the writhing number of a polygonal curve. It also presents an algorithm that
is easy to implement and runs in near-linear time on inputs that are typical in practice,
including DNA strands; this is significantly faster than the quadratic time needed by the algorithms used in current DNA simulation software.
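For concreteness, the quadratic-time baseline that the thesis improves upon can be sketched directly from the Gauss integral: sum, over all non-adjacent segment pairs, the signed solid angle given by the closed-form expression of Klenin and Langowski. The sketch below is illustrative only (the function names are mine); it is the slow baseline, not the subquadratic algorithm of Chapter 2.

```python
import math

def _sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def _cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def _dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def _unit(v):
    n = math.sqrt(_dot(v, v))
    return (v[0]/n, v[1]/n, v[2]/n) if n > 1e-12 else v

def _pair_angle(p1, p2, p3, p4):
    # Signed solid-angle contribution of the segment pair (p1p2, p3p4),
    # following Klenin & Langowski's closed-form expression.
    r12, r34 = _sub(p2, p1), _sub(p4, p3)
    r13, r14 = _sub(p3, p1), _sub(p4, p1)
    r23, r24 = _sub(p3, p2), _sub(p4, p2)
    n1 = _unit(_cross(r13, r14)); n2 = _unit(_cross(r14, r24))
    n3 = _unit(_cross(r24, r23)); n4 = _unit(_cross(r23, r13))
    s = sum(math.asin(max(-1.0, min(1.0, _dot(a, b))))
            for a, b in ((n1, n2), (n2, n3), (n3, n4), (n4, n1)))
    d = _dot(_cross(r34, r12), r13)   # zero for coplanar pairs: no contribution
    return s * (0.0 if d == 0 else math.copysign(1.0, d))

def writhe(points):
    """Writhing number of a closed polygonal curve: brute-force quadratic
    evaluation of the Gauss integral over all non-adjacent segment pairs."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # segments adjacent through the closing edge
            total += _pair_angle(points[i], points[(i + 1) % n],
                                 points[j], points[(j + 1) % n])
    return total / (2.0 * math.pi)
```

A planar curve has writhe zero, and mirroring a curve negates its writhe, which makes for quick sanity checks of the implementation.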
The level-of-detail (LOD) representation of a protein backbone helps to extract its main features. We compute LOD representations via curve simplification under the so-called
Fréchet error measure. This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve the global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. In this thesis, we present a simple approximation algorithm to simplify curves under the Fréchet error measure; it is the first simplification algorithm with guaranteed quality that runs in near-linear time in dimensions higher than two.
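To make the flavor of such a procedure concrete, here is a hedged sketch of mine (not the thesis's near-linear algorithm): a greedy pass extends each shortcut while its error stays within a tolerance, with the segment-to-subchain Fréchet error approximated by the discrete Fréchet distance against a sampled version of the shortcut. This brute-force version runs in quadratic time or worse.

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two chains of points, via the
    classic O(|P||Q|) dynamic program."""
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = cost
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], cost)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], cost)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), cost)
    return ca[n - 1][m - 1]

def _shortcut_error(P, i, j):
    # Error of replacing the subchain P[i..j] by the segment (P[i], P[j]),
    # approximated by sampling the segment with as many points as the subchain.
    sub = P[i:j + 1]
    k = len(sub)
    a, b = P[i], P[j]
    seg = [tuple(a[d] + (b[d] - a[d]) * t / (k - 1) for d in range(len(a)))
           for t in range(k)]
    return discrete_frechet(sub, seg)

def simplify(P, eps):
    """Greedy simplification: extend each shortcut while its (approximate)
    Fréchet error stays within eps."""
    keep = [0]
    i = 0
    while i < len(P) - 1:
        j = i + 1
        while j + 1 < len(P) and _shortcut_error(P, i, j + 1) <= eps:
            j += 1
        keep.append(j)
        i = j
    return [P[k] for k in keep]
```

For a gently zigzagging chain, a tolerance larger than the zigzag amplitude collapses the curve to its two endpoints, while a zero tolerance keeps every vertex.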
We propose a continuous elevation function on the surface of a molecule to capture
its geometric features such as protrusions and cavities. To define the function, we follow
the example of elevation as defined on Earth, but we go beyond this simpler concept to
accommodate general 2-manifolds. Our function is invariant under rigid motions; it scales with the surface and provides, beyond the location, also the direction and size of shape features. We present an algorithm for computing the points with locally maximum elevation. These points correspond to the locally most significant features. This succinct representation of features can be applied to aligning shapes, and we present one such application in the
second part of the thesis.
The second part of the thesis focuses on molecular shape matching algorithms. The
importance of shape matching, both similarity matching and complementarity matching,
arises from the general belief that the structure of a protein decides its function. Efficient
algorithms to measure the similarity between shapes help identify new types of protein
architecture, discover evolutionary relations, and provide biologists with computational
tools to organize the fast growing set of known protein structures. By modeling a molecule
as the union of balls, we study the similarity between two such unions by (variants of) the
widely used Hausdorff distance, and propose algorithms to find (approximately) the best translation under the Hausdorff distance measure.
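As an illustration (not the thesis's algorithms; all function names are mine), the Hausdorff distance between two sets of atom centers can be computed brute-force, and a reasonable initial translation can be obtained by aligning bounding-box corners: under the optimal translation, each bounding-box coordinate of the two sets differs by at most the optimal Hausdorff distance, so this alignment is within a factor 1 + sqrt(d) of optimal in d dimensions.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets (brute force)."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def align_bbox_translation(A, B):
    """Translation aligning the lower corners of the bounding boxes of A and B.
    A simple provable constant-factor heuristic for matching under translation."""
    d = len(A[0])
    return tuple(min(b[k] for b in B) - min(a[k] for a in A) for k in range(d))

def translate(P, t):
    # Apply a translation vector t to every point of P.
    return [tuple(x + dt for x, dt in zip(p, t)) for p in P]
```

When B is an exact translate of A, the bounding-box heuristic recovers the translation and the resulting Hausdorff distance is zero.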
Complementarity matching is crucial to understanding or simulating protein docking, the process in which two or more protein molecules bind to form a compound structure.
From a geometric perspective, protein docking can be considered as the problem of search-
ing for configurations with maximum complementarity between two molecular surfaces.
Using the feature information generated by the elevation function, we describe an efficient
algorithm to find promising initial relative placements of the proteins. The outputs can later be refined independently to locate docking positions, using a heuristic that improves the fit locally based on geometric, and possibly also chemical and biological, information.
And indeed there will be time
To wonder, “Do I dare?” and, “Do I dare?”
Time to turn back and descend the stair,
With a bald spot in the middle of my hair
…
Do I dare
Disturb the universe?
— T. S. Eliot, The Love Song of J. Alfred Prufrock
Acknowledgements
I came to take of your wisdom:
And behold I have found that which is greater than wisdom.
— Kahlil Gibran, The prophet
It is not without regret that I am writing this acknowledgment — while being grateful
for all those who made my life in the past few years a joyful and fruitful one, I know sadly
that our lives will part soon. The path towards obtaining a PhD was a struggle for me in
many ways. I can’t imagine how it would have been without their support.
It has been a great opportunity to have worked under the supervision of Profs. Pankaj
K. Agarwal and Herbert Edelsbrunner. The experience helped shape my attitude and ap-
proaches towards research both in computational geometry and in general. Pankaj led
me into the world of computational geometry with his broad knowledge. Besides sup-
port and guidance, he gave me great freedom in doing research, and is always patient and
understanding. It is hard to overestimate how much I have benefited from the numerous
discussions with him. Herbert showed me the “friendly” side of computational topology,
with his deep insights accompanied by illustrative explanation. His philosophy and vision
in research have greatly influenced me. I am deeply indebted to both of them for their
guidance and inspiration throughout the course of this dissertation. I would also like to
thank Profs. John Harer and Johannes Rudolph not only for being on my committee of
this thesis, but also for various discussions and collaborations. Support for this work was
provided by NSF under the grant NSF-CCR-00-86013 (the BioGeometry project).
The Duke CS department is a wonderful place. In particular, I wish to thank Drs. Lars Arge and Ron Parr, who were always open and ready to help with my career concerns. Dr.
Sariel Har-Peled, now a post-postdoc (well, an assistant professor) at UIUC, has been a
tremendous mentor and friend for me, especially during a period when I was swinging
among various career choices, and at a time when I was learning to walk on the ropes of
research. I learned from him to have an approximate perspective towards problems both in
computer science and in life.
I would like to thank all the graduate students and postdocs in the theory group who
provided a vibrant research environment that I have enjoyed and benefited from so much,
especially Nabil Mustafa, Hai Yu, Peng Yin, and Vijay Natarajan. I had a lot of fun both in
research and in life with friends such as Vicky Choi, David Cohen-Steiner, Ho-lun Cheng,
Ashish Gehani, Sathish Govindarajan, Jingquan Jia, Tingting Jiang, Dmitriy Morozov,
Nabil Mustafa, Vijay Natarajan, Jeff Phillips, Nan Tian, Eric Zhang, Hai Yu, and Haifeng
Yu. Those inspiring discussions with Nabil, Sathish and David will always mark my mem-
ories of the PhD life. Special thanks to my best friend Peng Yin. His humor, energy,
understanding and advice accompany me to this day. I also want to thank his wife Xia
Wu for kindly feeding me uncountable times.
I wish to thank all the staff in the department for being so friendly and helpful, especially
Ms. Celeste Hodges and Ms. Diane Riggs.
Last, but not least, I would like to thank my family for their love, support and confi-
dence in me. My parents and grandparents encouraged me to pursue my own dreams from childhood, and have never tried to pressure me into any life that others may consider successful. My sister and brother-in-law have always been there for me, full of under-
standing and support. What I have achieved was possible only because they were all by
my side. This thesis is dedicated to them.
Contents

Abstract

Acknowledgements

List of Tables

List of Figures

1 Introduction
   1.1 Protein Structure and Geometric Models
   1.2 Related Research Areas
   1.3 Shape Analysis in Molecular Biology
       1.3.1 Describing Shapes
       1.3.2 Matching Shapes
   1.4 Main Contributions

2 Writhing Number
   2.1 Introduction
   2.2 Prior and New Work
   2.3 Writhing and Winding
       2.3.1 Closed knots
       2.3.2 Open knots
   2.4 Computing Directional Writhing
   2.5 Experiments
       2.5.1 Algorithms
       2.5.2 Comparison
   2.6 Notes and Discussion

3 Backbone Simplification
   3.1 Introduction
   3.2 Prior and New Work
   3.3 Fréchet Simplification
       3.3.1 Algorithm
       3.3.2 Comparisons
   3.4 Experiments
   3.5 Notes and Discussions

4 Elevation Function
   4.1 Introduction
   4.2 Defining Elevation
       4.2.1 Pairing
       4.2.2 Height and Elevation
   4.3 Pedal Surface
   4.4 Capturing Elevation Maxima
       4.4.1 Continuity
       4.4.2 Elevation Maxima
   4.5 Algorithm
   4.6 Experiments
   4.7 Notes and Discussion

5 Matching via Hausdorff Distance
   5.1 Introduction
   5.2 Collision-Free Hausdorff Distance between Sets of Balls
       5.2.1 Computing the collision-free Hausdorff distance in 2D and 3D
       5.2.2 Partial matching
   5.3 Hausdorff Distance between Unions of Balls
       5.3.1 The exact 2D algorithm
       5.3.2 Approximation algorithms
   5.4 RMS and Summed Hausdorff Distance between Points
       5.4.1 Simultaneous approximation of Voronoi diagrams
       5.4.2 Approximating the RMS distance
       5.4.3 Approximating the summed Hausdorff distance
       5.4.4 Maintaining the 1-median function
       5.4.5 Randomized algorithm
   5.5 Notes and Discussion

6 Coarse Docking via Features
   6.1 Introduction
   6.2 Algorithm
       6.2.1 Scoring function
       6.2.2 Computing features
       6.2.3 Coarse alignment algorithm
   6.3 Experiments
   6.4 Notes and discussion

Bibliography

Biography

List of Tables

2.1 Comparisons on protein data
4.1 Table of singularities
4.2 Number of maxima for 1brs
4.3 Number of maxima with different resolutions
4.4 Covering density
6.1 Index-k features
6.2 Complex 1brs
6.3 25 test cases
6.4 Two-step for 25 test cases
6.5 Unbound benchmark

List of Figures

1.1 Protein structure
1.2 Protein models
1.3 Protein folding
1.4 Protein docking
2.1 DNA supercoiling
2.2 Sign of crossings
2.3 Worst case of writhe
2.4 Critical directions
2.5 Winding number
2.6 Spherical triangle
2.7 Open knot
2.8 Oriented edges
2.9 Convergence rate
2.10 Running time
2.11 Protein backbones
3.1 Fréchet matching
3.2 Fréchet simplification
3.3 Comparison between Fréchet and Hausdorff simplifications
3.4 Relation between the Fréchet and Hausdorff error measures
3.5 Results of Fréchet simplification
3.6 Running time
3.7 Comparisons between DP and GreedyFrechetSimp algorithms
3.8 Simplification of a protein
4.1 Four types of maxima
4.2 Extended persistence
4.3 Elevation on a 1-manifold
4.4 Pedal curve
4.5 Codimension-2 singularities
4.6 Discontinuity in elevation
4.7 Stratification
4.8 Mercedes star property
4.9 Another neighborhood pattern for a triple point
4.10 Other neighborhood patterns
4.11 Parameterization of a Gaussian neighborhood
4.12 Height difference for a 2-legged maximum
4.13 Height difference for a 3-legged maximum
4.14 Decay of maxima
4.15 Top 100 maxima for 1brs
4.16 Elevation on 1brs
5.1 Valid and forbidden regions
5.2 Voronoi diagram for a union of balls
5.3 Exponential grid
6.1 Predicting docking configurations
6.2 Maximum types again
6.3 Coarse alignment
6.4 Align features
6.5 Align pairs
Chapter 1
Introduction
If a living cell is viewed as a biochemical factory, then its main workers are protein
molecules, acting as catalysts, transporting small molecules, forming cellular structures,
and carrying signals, among other roles. As Jacques Monod states in his book Chance and
Necessity, “. . . it is in proteins that lies the secret of life.” Their functional diversity
is made possible by the diversity of their three-dimensional structures. Understanding or
simulating molecular processes involved in the formation of protein structures and their
biological functions is a major challenge of molecular biology. For most of the key prob-
lems involved in this challenge, such as protein folding, docking, structure classification,
and structure prediction, geometry and topology naturally play important roles. However, geometric methods are currently neither fully utilized nor fully investigated in attacking these key problems, partly due to a number of representational and algorithmic challenges.
To close this gap, in this thesis, we study shape analysis problems arising in molecular
biology by combining both geometric and topological approaches. In particular, we focus
on algorithms for describing and matching protein structures. Note that, in general, shape
characterization and matching are central to various application areas other than structural
biology, including computer vision, pattern recognition, and robotics [17, 131, 165]. Most
of the techniques that we have developed are applicable to these other fields as well.
In the remainder of this chapter, we first give a brief biological background on protein
structures and introduce some related research areas. More details can be found in standard
textbooks [34, 68, 125]. We then describe shape analysis problems arising in molecular biology from a computational perspective. We state our main contributions at the end of this
chapter.
1.1 Protein Structure and Geometric Models
A protein is a polymer consisting of a long chain of small building blocks, called amino
acids or residues. All amino acids have a 3-atom backbone, N-Cα-C, to which a side chain (denoted by R1 and R2 in Figure 1.1 (a)) is attached. Besides the side chain, a hydrogen atom is bonded to the backbone nitrogen atom, and an oxygen is doubly bonded to the carboxy carbon. There are 20 standard amino-acid residues, distinguishable by their side chains. The amino end (N) of an amino acid connects to the carboxy end (C) of the
preceding amino acid, forming a peptide bond. Thus the chemical structure of a protein
molecule can be viewed as a linear sequence of amino acids interconnected by peptide
bonds.
Figure 1.1: (a) Protein structure, with the backbone structure in the dotted boxes. (b) The folded state of a protein; each atom is modeled as a ball.
Though a linear sequence, a protein molecule folds into a compact and typically unique
three-dimensional structure under certain physiological conditions (see Figure 1.1 (b)).
This is the result of various atomic interactions, such as van der Waals and electrostatic
forces. The resulting structure, referred to as the native structure or the folded state, is how
a protein molecule exists in nature, and is the conformation in which a protein is able to
perform its physiological functions. In fact, given a protein molecule, its three-dimensional structure decides its functionality to a large extent [68, 125]. (For example, disruption of the native structures of proteins is the primary cause of several neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease.) Therefore, knowledge of the
protein structures is essential for understanding the principles that govern their functions
in nature. This three-dimensional structure is the focus of our study. In general, protein
structures are examined at different levels, referred to as protein structure architecture [34].
We present a brief description below.
- primary structure: the amino acid composition of the protein, i.e., the linear sequence of amino acids;
- secondary structure: common patterns in the conformation of protein backbones observed in nature; there are four major types of secondary structure elements (SSEs): α-helices, β-sheets, β-turns, and coils;
- supersecondary structure: the higher-scale structures organized by secondary structure elements, e.g., how SSEs are connected;
- tertiary structure: the global folded three-dimensional structure of the protein;
- quaternary structure: the structure of a complex of two or more proteins bound together.
How to model proteins appropriately is a crucial first step before we can visualize or
manipulate them. Several models have been proposed, depending on the objectives of
the underlying applications and/or what information one would like to emphasize. In the
literature, the term modeling refers to a broad collection of methods to describe not only
the geometric structures, but also the energetic aspects of a molecule [84, 125]. (For example, by using quantum mechanics, one can describe in detail the energy of any arrangement of atoms and molecules in a particular system.) In this thesis, we focus on geometric models of the three-dimensional structure of proteins [64].
Geometric shapes typically refer to a finite set of points, a space curve, or a surface.
In the context of molecular biology, a set of points corresponds to the set of centers of
atoms, possibly weighted by the van der Waals radii of the atoms. Sometimes, points
are connected by “sticks” that represent covalent bonds between atoms (Figure 1.2 (a)).
Such representations specify not only the position of each atom in a molecule, but also the chemical information.
Figure 1.2: A protein molecule represented as (a) the set of atom centers connected by sticks representing covalent bonds; (b) a space curve (the main-chain representation); and (c) the (van der Waals) surface of the union of atoms, each represented by a ball.
Sometimes, the details presented in the above representation are not necessary, or even
undesirable. The main chain representation is often exploited in such situations, where a
protein molecule is modeled as a curve in ℝ³ following the trace of the backbone atoms of
the amino acids (see Figure 1.2 (b)). Such a representation emphasizes the linear nature
of protein molecules, and shows clearly how this linear sequence of amino acids folds in
space. It provides a much simplified representation of a protein, while still maintaining its
main structural features. Consequently, this representation is popular in many applications,
especially those of high computational complexity, such as protein folding and structure
classification.
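As a concrete illustration of the main-chain model (a sketch of mine, not part of the thesis), the Cα trace of a protein can be read directly from the fixed-column ATOM records of the standard PDB file format:

```python
def ca_trace(pdb_lines):
    """Extract the Cα backbone curve (a list of 3D points) from PDB-format
    lines, using the fixed-column layout of ATOM records: atom name in
    columns 13-16, coordinates in columns 31-54 (1-indexed)."""
    pts = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            pts.append((float(line[30:38]),
                        float(line[38:46]),
                        float(line[46:54])))
    return pts
```

The resulting point sequence is exactly the polygonal curve on which backbone simplification (Chapter 3) and the writhing number (Chapter 2) operate.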
A surface representation of proteins is useful when the object of study is the space occu-
pied by a protein molecule, or when the global structure of the molecule is more important
than its local geometry. There are many ways to represent the surface of a molecule. If
we model each atom by a ball in ℝ³ with its van der Waals radius, then the surface of the union of these balls is referred to as the van der Waals surface (Figure 1.2 (c)). The solvent
accessible surface, originally proposed by Lee and Richards [126], is the surface traced out
by the center of a probe sphere (which typically represents the water molecule) rolling on
top of the VDW surface. The surface traced out by the inward-facing surface of this probe
sphere is called the molecular surface. The skin surface developed by Cheng et al. [55] is
more complicated, but has many elegant (mathematical) properties.
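For readers who want a numerical handle on the solvent-accessible surface, it can be approximated by the classical Shrake-Rupley sampling scheme: inflate every atom by the probe radius, scatter sample points on each inflated sphere, and keep only the points not buried inside any other inflated sphere. This is a standard method, not an algorithm of this thesis; the code is a sketch of mine.

```python
import math

def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral lattice)."""
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden * k
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def sasa(atoms, probe=1.4, n_samples=200):
    """Approximate solvent-accessible surface area via Shrake-Rupley sampling.
    atoms: list of (x, y, z, vdw_radius); probe is the solvent radius."""
    dirs = sphere_points(n_samples)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        R = ri + probe                      # inflated radius for this atom
        exposed = 0
        for dx, dy, dz in dirs:
            px, py, pz = xi + R * dx, yi + R * dy, zi + R * dz
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i)
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * R * R * exposed / n_samples
    return total
```

An isolated atom recovers the full area of its inflated sphere, and an atom entirely contained in a larger one contributes nothing, which gives simple consistency checks.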
1.2 Related Research Areas
Despite the important role that protein structures play in understanding life, early research
in computational biology, or bioinformatics, focused on sequence analysis, rather than on
protein structures. One of the key reasons is that it is significantly easier to both acquire and
manage sequence data. (For example, the Human Genome Project has made massive amounts of protein sequence data available, while the output of experimentally determined protein structures, typically obtained by time-consuming and relatively expensive X-ray crystallography and NMR spectroscopy, is lagging far behind.) Nevertheless, with the tremendous success in sequence analysis,
the study of protein structures has become increasingly critical. For example, in the post-
genomic era, a major obstacle to the exploitation of the large volume of genome sequence
data is the functional characterization of the gene products (protein structures). Since
the three-dimensional structures of proteins are more conserved than the corresponding
sequences, many large-scale protein structure determination projects have been initiated
recently [169] to help analyze the space of functionally unannotated protein sequences.
These initiatives are widely referred to as structural genomics or structural proteomics. In
this subsection, we briefly mention several (not necessarily disjoint) research areas related
to protein structures. (It is impossible to survey these areas in a comprehensive manner in this thesis; each would require a whole book to do so!)
Protein folding and protein structure prediction. Predicting a protein’s structure
from its amino acid sequence is one of the most significant tasks tackled in computational
biology (Figure 1.3). Solving this problem will have enormous impacts on rational drug
design, cell modeling, and genetic engineering. It is therefore not surprising that it is
Figure 1.3: From left to right, snapshots of the B domain of staphylococcal protein A (in main-chain representation) at different stages of the folding process (http://parasol.tamu.edu/dsmft/research/folding).
considered the “holy grail” in the structural biology community; see, e.g., [123].
Two issues are involved here: (i) to understand the mechanism behind the protein folding process, i.e., how a protein folds in nature; and (ii) to predict the final folded conformation given a sequence of amino acids. These two aspects are obviously related, but not
necessarily equivalent: several successful approaches to predict protein structures do not
mimic the folding process, but rely on the knowledge of known protein structures.
There has been a long history of tackling the folding problem. Current approaches
can be classified into two categories, which we mention without going into any detail
here. For more information, refer to [26, 123, 156]. The first class, including comparative
modeling and threading, start by using a known template structure or known folds. The
second class, de novo or ab initio methods, predict structure from sequence directly, using
principles of atomic interactions and protein architecture. Despite the success of these
methods for some cases, especially in predicting structures of small protein molecules, the
protein folding problem remains largely unsolved. The main reasons include that the structure is defined by a large number of degrees of freedom (as highlighted by the Levinthal paradox: proteins fold into their specific three-dimensional conformations in milliseconds, a time span significantly shorter than would be expected if the molecule actually searched the entire conformation space for the lowest-energy state), and that the physical basis of protein
structural stability is not fully understood. The vigor of this field can be seen in the high participation in, and strong performance achieved at, the Critical Assessment of Structure Prediction (CASP) experiments (http://predictioncenter.llnl.gov).
Protein interactions. Two or more molecules interact with each other by forming an
intermolecular complex (either in a stable manner or temporarily), a process called docking or receptor-ligand recognition and binding. Such interactions are critical
to various biological processes, such as cell-cell recognition and enzyme catalysis and in-
hibition. The target macromolecules (receptors) are usually large, mostly proteins, while
the ligands can be either large, such as proteins, or small, such as drugs or cofactors (see
Figure 1.4). The sites where binding happens are called active sites or binding sites. They
Figure 1.4: Examples of (a) protein-small molecule docking: mainchain representation of HIV-1 protease bound to an inhibitor (in VDW representation); and (b) protein-protein docking: human growth hormone.
are usually places on the surfaces of proteins where chemical reactions or conformational
changes happen. Hence knowledge of the interaction between molecules is crucial in un-
derstanding, and even manipulating, their functions. As an example, many drug molecules
work by acting as inhibitors: they bind to the receptor proteins to block the active sites,
thus stopping undesired chemical reactions or molecular processes from happening. As a
result, efficient algorithms for docking drug molecules to target receptor proteins are one of the major ingredients in a rational drug design scheme [85, 102, 164].
The docking problem has attracted great attention from computer scientists as well
as biochemists, due to its strong geometric and algorithmic flavor [102, 81, 114]. Much
success has been achieved for docking a protein with a small molecule, or docking two
rigid proteins (i.e., each protein can only undergo rigid transformations). However, the
field remains rather open, especially in the case of protein-protein docking without the
“rigid” assumption [97]. In this case, as protein structures are complicated, modeling their
conformational changes introduces many degrees of freedom. Nevertheless, progress is
being made, and interested readers should refer to the results of the CAPRI experiments,
i.e., the Critical Assessment of PRedicted Interactions, for the newest advances in this
field (http://capri.ebi.ac.uk).
Protein structure comparison and classification. As a protein molecule with some
functional role evolves in the context of a living cell, its overall three-dimensional struc-
ture tends to remain unaltered, even when all sequence memory may have been lost [80].
This evolutionary resilience of protein three-dimensional structures is the fundamental rea-
son for comparing protein structures in molecular biology. Numerous comparison methods
have been proposed and developed in the past 20 years [122, 160]. The problem, however,
is difficult and remains unsolved. There is typically no clear definition for structural sim-
ilarity. Structural similarity is of interest at many levels: from the fine detail of backbone
and side-chain conformation at the residue level to the coarse similarity at the tertiary struc-
ture level. Besides, many situations require one to capture local similarity, which is hard
to describe.
Moreover, more and more protein structure data are becoming available: at the time of this writing, there are tens of thousands of protein structures in the Protein Data Bank [31],
and the number almost doubles every 18 months. It is therefore crucial to bring certain
order into protein structures by classifying them into families. Other than organizing the
large structure database, such classification can aid our understanding of the relationships
between structures and functions. For example, it has been shown that almost all enzymes have the so-called α/β folds [132], i.e., they have both α-helices and β-sheets in their structures. Furthermore, while each sequence typically generates a unique three-dimensional
structure, multiple sequences may produce similar folded structures, or folds. A natural
question arising is then: how many different folds are there in nature? Classification helps
to answer this question, and its solution is useful in annotating the sequence space by struc-
tures, thus functions, which is a central aspect in structural genomics that we mentioned
earlier [26]. Classification also enables us to experimentally determine far fewer protein structures: we can afford to determine only those that are likely to produce novel folds [160].
Currently, the three most popular classifications are SCOP [138], CATH [139, 141], and
FSSP [104], all of which are accessible via the world wide web. Similar to structure
comparison, a main difficulty of the classification problem arises from the fact that there is no consensus on how to organize the different categories. Classifying protein structures in a fully automatic way thus remains a daunting problem.
1.3 Shape Analysis in Molecular Biology
Above we have sketched some key research areas closely related to protein structures. Two
issues appear repeatedly there — how to describe and characterize structures and how to
develop efficient computational methods (algorithms). These two issues are obviously not
new to many research fields in computer science. In this section, we address these two
issues by identifying shape-analysis problems in molecular biology and describing them
from a computational perspective. Shape analysis problems have been studied extensively
in many fields including computer graphics, vision, geometric computing, robotics, and so
on. On the one hand, many of the techniques there can be adapted to attack problems in
molecular biology directly. On the other hand, protein structure analysis has many unique
properties, and new techniques are greatly needed.⁶
At a high level, we classify shape analysis problems into two broad categories, each
including many subtopics. Though our classification below is tailored towards molecular biological applications, one should note that the techniques developed are not necessarily
constrained to biological applications. Once again, we will only sample a few techniques
exploited in attacking these problems, as a full enumeration will go beyond the scope of this
thesis. For surveys on a subset of topics in shape analysis in general, refer to [17, 131, 165].
1.3.1 Describing Shapes
Modeling flexibility. We have introduced some basic geometric representations for
three-dimensional structures of proteins in Section 1.1. In some applications, more sophis-
ticated representations are required: protein molecules are in constant motion (vibration)
in solution, and they might undergo significant conformational changes at times (such as
in a protein-protein docking process). Therefore, it is important to incorporate flexibility
in modeling protein structures. The question then is of course how to model flexibility, which is complex because protein structures have many degrees of freedom.
On the one hand, special data structures are desirable to efficiently support changes in con-
formations: For example, in [6], a chain hierarchy has been proposed which can detect
collisions for deforming protein backbones efficiently. On the other hand, it is important to
characterize motions: what are the types of motions molecules undergo and where do they
happen. Techniques from robotics, motion planning, and graph theory have been exploited
successfully in several cases for either identifying possible motions [112] or for reducing
the degrees of freedom of the motions [85, 161].
Simplified representations. In some applications, simplified structures are needed to
⁶ We remark here that shape analysis problems are extremely hard for protein structures because the connection between protein structures and their functions is not yet well understood. In many situations, it is not clear what aspects of a structure give rise to a particular functionality.
help to manage complex problems. Hence many approaches use simplified protein struc-
tures, such as representing the backbone of a protein molecule as a set of fragments, each
corresponding to a secondary structure element [160, 155]. As another example, one model
proposed by Dill [49, 74] simplifies the protein backbone as beads chained together on a
unit lattice. The beads can either be hydrophobic or hydrophilic, with contacts between
hydrophobic beads being favored. Although fairly simplistic, this model yields results
surprisingly similar to those derived from experimental data when applied to the protein
folding problem [73, 162].
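Although this thesis does not use the lattice model, its simplicity makes it easy to state precisely. The following sketch computes the HP-model energy of a conformation under the common convention that each contact between non-consecutive hydrophobic beads contributes −1; the function name and the scoring convention are our own illustrative choices, not taken from the cited work.

```python
# A minimal sketch of the HP lattice model described above: the backbone
# is a self-avoiding walk on the unit grid, each residue is hydrophobic
# ('H') or polar ('P'), and every contact between non-consecutive H
# beads lowers the energy by one (an assumed, though common, convention).

def hp_energy(sequence, path):
    """sequence: string over {'H','P'}; path: list of (x, y) lattice points."""
    assert len(sequence) == len(path)
    assert len(set(path)) == len(path), "walk must be self-avoiding"
    index = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, p in enumerate(path):
        if sequence[i] != 'H':
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            j = index.get((p[0] + dx, p[1] + dy))
            # count each H-H contact once, skipping chain neighbors
            if j is not None and j > i + 1 and sequence[j] == 'H':
                energy -= 1
    return energy

# A 4-residue walk folded into a unit square: the two end beads touch.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```

On small instances, the model is typically applied to the folding problem by enumerating self-avoiding walks and minimizing this energy.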
Shape descriptors/signature. One aspect of shape characterization is to extract key
features or information of a given shape. For example, this is central to many approaches
that compare protein structures: The key information is typically stored in a shape de-
scriptor or signature, and similarity between two shapes can now be measured by some
distance between the corresponding descriptors. As another example, in order to understand protein-protein interactions, there has been much research on characterizing the interface where the two proteins interact with each other [27, 130]. Features used to describe such interfaces include the buried surface area, the tightness of the binding, hydrophobicity, and so on.
Extracting features and generating shape descriptors are widely used in graphics, vi-
sion, and robotics. Many techniques borrowed from there can be applied to applications in
molecular biology: statistical methods (such as histograms and harmonic maps) [118, 22],
geometry-based methods (such as turning angles) [59], and topology-based methods (such as the Connolly function) [66, 136] have all been exploited to generate shape descriptors
in structural biological applications.
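As a concrete instance of the statistical methods just cited, the sketch below builds a histogram of pairwise distances between sample points, in the spirit of shape-distribution descriptors, and compares two shapes by the L1 distance between histograms. The bin count, the distance cap, and the L1 comparison are illustrative choices of ours, not taken from the cited references.

```python
# A sketch of a histogram-style shape descriptor: the normalized
# distribution of pairwise distances between sample points. It is
# translation- and rotation-invariant by construction. The bin count
# and distance cap are arbitrary illustrative parameters.
import math, random

def distance_histogram(points, bins=8, dmax=4.0):
    hist = [0] * bins
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            hist[min(int(d / dmax * bins), bins - 1)] += 1
    total = n * (n - 1) // 2
    return [h / total for h in hist]

def descriptor_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(100)]
moved = [(x + 5.0, y, z) for x, y, z in cloud]   # same shape, translated
print(descriptor_distance(distance_histogram(cloud), distance_histogram(moved)))
```

The printed value is essentially zero, since pairwise distances are unchanged by translation; this invariance is exactly what makes such descriptors attractive for comparison without alignment.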
1.3.2 Matching Shapes
Similarity matching. Measuring similarity between protein structures is essential to
protein structure classification, and is needed when applying comparative modeling meth-
ods for predicting protein structures. There are in general two types of approaches for this
problem. The first type of methods are alignment-based. They consider matching as an
optimization problem by finding the best alignment (i.e., relative placement) of two input
structures under some scoring function (where the score evaluates how similar structures
are). Examples of this approach include DALI [103], STRUCTAL [129], and CE [149].
How to define the scoring function, or the distance between two structures, is an intriguing
problem in itself, and has received much attention [80, 158].
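For a fixed residue-to-residue correspondence, the optimization at the heart of such alignment-based scores can be illustrated by optimal rigid superposition. The sketch below uses the Kabsch SVD method and reports the RMSD after superposition; methods such as DALI, STRUCTAL, and CE also search over correspondences and use richer scoring functions, which is omitted here.

```python
# A sketch of scoring by optimal rigid superposition: given two
# equal-length lists of corresponding coordinates, find the rotation
# (Kabsch algorithm, via SVD) and translation minimizing the RMSD,
# and return that minimum RMSD as the similarity score.
import numpy as np

def kabsch_rmsd(P, Q):
    """P, Q: (n, 3) arrays of corresponding points."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = Pc @ R - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Identical shapes in different poses superimpose to (near) zero RMSD.
P = np.array([[0., 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]])
Q = P @ np.array([[0., -1, 0], [1, 0, 0], [0, 0, 1]]) + np.array([5., -2, 1])
print(kabsch_rmsd(P, Q) < 1e-8)  # True
```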
The second approach computes the similarity/distance between input structures di-
rectly, without producing any alignment to superimpose them. Many of the methods in this
category exploit shape descriptors [22, 145]. Another example is the contact-map over-
lay approach, which converts each protein structure into some special type of graph, and
similarity is measured as the size of the largest common subgraph between two such
graphs [94].
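The contact-map idea can be made concrete in a few lines. In the sketch below, a structure given by its C-alpha coordinates becomes a set of residue pairs within a distance cutoff (8 angstroms is a typical, but here assumed, threshold), and two equal-length chains are compared by counting shared contacts under the identity correspondence; the hard maximization over correspondences that the overlap method actually requires is not attempted.

```python
# A sketch of the contact-map representation: residues i and j (not
# consecutive along the chain) are "in contact" when their C-alpha
# atoms lie within a cutoff. The 8.0 cutoff is an assumed typical value.
import math

def contact_map(ca_coords, cutoff=8.0):
    n = len(ca_coords)
    return {(i, j)
            for i in range(n) for j in range(i + 2, n)   # skip chain neighbors
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff}

def shared_contacts(cm1, cm2):
    """Overlap under the identity correspondence (equal-length chains only)."""
    return len(cm1 & cm2)

coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (3.8, 3.8, 0)]
print(sorted(contact_map(coords)))  # [(0, 2), (0, 3), (1, 3)]
```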
In general, alignment-based methods involve searching in a large configuration space,
and thus have higher time complexity than the second type of approaches. They are also
less efficient when querying in a structure database. However, they are more reliable and
discriminative in measuring similarities. Refer to [150, 151] for a comparison between
current popular matching approaches in structural biology.
Complementarity matching. The main motivation to study complementarity match-
ing in molecular biology is to understand protein-ligand interactions. A simple geometric
formulation for the protein-protein docking problem is the following: given proteins A and B, find the transformation of B such that the two molecules best complement each other. In other words, this is partial surface matching under the constraint that the two surfaces do not intersect. The two main issues involved here are:⁷ (i) to evaluate the alignments generated, i.e., to find a good score function that produces few false positives [97, 153]; and (ii) to reduce the
complexity of the search procedure, e.g., by exploiting more efficient computation (such
as FFT or spherical harmonics) [53, 143], by better searching strategies (such as by ge-
netic algorithms) [39, 91], or by reducing the number of transformations inspected (such
as geometric hashing) [86].
Of course, in nature both molecules can change their conformation during the docking process. For large protein-protein interactions, it is complex to model flexibility in the matching procedure, and this is one of the main foci of current research [97, 153].
Classification and structure database. We mentioned protein structure classification
earlier for the purpose of organizing the rapidly-expanding collection of protein structures
available. There is also a need to organize structures into a database that can support efficient queries (i.e., given a query structure, return one or more structures from the database that contain it, in the case of a motif query, or that are similar to it). The pairwise structure
comparison that we discussed above (in similarity matching) is obviously a fundamen-
tal component in classification and query problems, and a straightforward way to classify
protein structures is by all-against-all comparisons. This is the method adopted by most
current classifications of protein structures, such as CATH and FSSP [139, 104]. It is,
however, rather inefficient, especially when combined with the alignment-based pairwise
matching procedures. Part of the reason this straightforward approach is used despite its inefficiency is that most past work has focused on how to classify protein structures in a reliable and automatic way (a problem that is still not satisfactorily solved) [122]. With better understanding of protein structures, and with the number
of known structures increasing rapidly, efficient clustering techniques become essential.
⁷ References are for molecular biological applications.
Several recently developed protein structure comparison techniques aim at developing a
similarity measure that satisfies the triangle inequality [60, 47, 145], so that many known clustering algorithms can be applied. In particular, in [60], Choi et al. exploit techniques from information retrieval in building their classification system.
1.4 Main Contributions
Our research touches both the shape description and the shape matching categories. We focus on developing efficient computational methods for describing or matching structures. Our approaches rely on both geometric and topological techniques.⁸ Software produced from this thesis work is available at the BioGeometry website (http://biogeometry.duke.edu/).
Part I. Shape description.
(1) Writhing number: The writhing number of a curve measures how many times a curve
coils around itself in space. It characterizes the so-called supercoiling phenomenon of dou-
ble stranded DNA, which influences DNA replication, recombination, and transcription. It
is also used to characterize protein backbones. We establish a relationship between the
writhing number of a space curve and the winding number, a topological concept. This
enables us to develop the first subquadratic algorithm for computing the writhing number
of a polygonal curve. We have also implemented a simpler algorithm that runs in near-
linear time on inputs that are typical in practice [5], including protein backbones and DNA
strands, in contrast to the quadratic-time algorithms used by current software.
(2) Simplification: The level-of-detail (LOD) representation of a protein backbone helps to single out its main features. One way to obtain LOD representations is via curve simplification. We study the simplification problem under the so-called Fréchet error measure.
⁸ We remark here that in the molecular biology literature, the word topology is typically used with a different meaning from ours: it mainly refers to the topology of the molecule itself, such as how the elements of a molecule are interconnected, while we exploit knowledge and tools from classical topology, such as Morse theory, in our approaches.
This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. We propose and implement a simple algorithm to simplify curves under the Fréchet error measure [7], which is the
first simplification algorithm that runs in near-linear time in dimensions higher than two
with guaranteed quality.
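Our algorithm concerns the continuous Fréchet measure, but the flavor of the "dog-leash" distance is easiest to convey through its discrete variant. The sketch below is the standard Eiter-Mannila dynamic program for the discrete Fréchet distance between two polygonal chains; it is illustrative only and is not the near-linear simplification algorithm of [7].

```python
# A sketch of the discrete Frechet distance between two chains P and Q,
# computed by the classic dynamic program: the shortest leash that works
# when both endpoints walk monotonically along their chains, vertex to vertex.
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

# Two parallel three-vertex chains one unit apart.
print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```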
(3) Elevation function: Given a molecular surface, to capture geometric features such
as protrusions and cavities, we design a continuous elevation function on the surface, and
compute the points with locally maximum elevation [4]. The intuition of the function
follows from elevation on Earth, by which we identify mountain peaks and valleys. But the concept is technically more involved to extend to general 2-manifolds. This function is scale-independent and provides, beyond the location, also the direction and size of shape features. Using the elevation function, we can describe the above geometric features in a reliable and succinct manner, which can aid in attacking the protein docking problem.
Part II. Shape matching.
(4) Matching via Hausdorff distance: By modeling a molecule as the union of a set of
balls (each representing an atom), we measure the similarity between two molecules by
variants of Hausdorff distance. In particular, we present algorithms that compute exactly
or approximately the minimum Hausdorff distances between two such unions under all
possible translations [8]. We also investigate the version in which we are constrained to
only translations under which the two sets remain collision-free (i.e., no ball from one set intersects the other set).
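For intuition, the sketch below evaluates the symmetric Hausdorff distance between two finite point sets, e.g. atom centers, at one fixed placement by brute force; the results of [8] concern unions of balls and, much harder, the minimization of this quantity over all translations.

```python
# A brute-force sketch of the Hausdorff distance between finite point
# sets A and B at a fixed placement: the largest distance from a point
# of either set to the nearest point of the other set.
import math

def directed_hausdorff(A, B):
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

print(hausdorff([(0, 0), (1, 0)], [(0, 0)]))  # 1.0
```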
(5) Docking via features: As mentioned earlier, from a geometric perspective, protein
docking can be considered as the problem of searching for configurations with the maxi-
mum complementarity between two molecular surfaces. Our goal is to efficiently compute
a small set of potentially good docking configurations based on the geometry of the two
structures. Given such a set, more sophisticated procedures can then be performed on each
of its members independently to locate the “real” docking configuration. To find such a
potential set, we would like to align cavities from one protein with protrusions from the
other, and these “meaningful” features are captured by the elevation function we have de-
signed. Our approach can compute important matching positions while inspecting many
fewer configurations than the exhaustive search or earlier geometric-hashing approaches.
Chapter 2
Writhing Number
2.1 Introduction
The writhing number is an attempt to capture the physical phenomenon that a cord tends to
form loops and coils when it is twisted. We model the cord by a knot, which we define to
be an oriented closed curve in three-dimensional space. We consider its two-dimensional
family of parallel projections. In each projection, we count���
or � � for each crossing,
depending on whether the overpass requires a counterclockwise or a clockwise rotation
(an an angle between 0 and � ) to align with the underpass. The writhing number is then
the signed number of crossings averaged over all parallel projections. It is a conformal
invariant of the knot and useful as a measure of its global geometry.
The writhing number attracted much attention after the relationship between the linking
number of a closed ribbon and the writhing number of its axis, expressed by the White
formula, was formally discovered independently by Calugareanu [40], Fuller [88], Pohl
[142], and White [168].
    Lk = Tw + Wr .    (2.1)
Here the linking number, Lk, is half the signed number of crossings between the two boundary curves of the ribbon, and the twisting number, Tw, is half the average signed number of local crossings between the two curves. The non-local crossings between the two curves correspond to crossings of the ribbon axis, which are counted by the writhing number, Wr. The linking number is a topological invariant, while the twisting number and
the writhing number are not. A small subset of the mathematical literature on the subject
can be found in [20, 79].
Besides the mathematical interest, the White Formula and the writhing number have
received attention both in physics and in biochemistry [70, 117, 128, 157]. For example,
they are relevant in understanding various geometric conformations we find for circular
DNA in solution, as illustrated in Figure 2.1 taken from [37]. By representing DNA as
Figure 2.1: Circular DNA takes on different supercoiling conformations in solution.
a ribbon, the writhing number of its axis measures the amount of supercoiling, which
characterizes some of the DNA’s chemical and biological properties [30].
As another example, the writhing number and some of its variants have also been ap-
plied to protein backbones, modeled as open curves, as shape descriptors to classify pro-
tein structures [128, 145]. The intuition for such approaches follows from the fact that the
writhing number of a space curve measures the relative position between any two points in
the curve and the relative orientation between the tangents at those points. (This view will
become clearer after we introduce Equation (2.3) in the next section.) When extended
to a polygonal curve, this means that the writhing number measures the relative position
and orientation between any two edges of the curve. Hence two protein backbones with
similar arrangements of secondary structure elements produce similar writhing numbers.
This chapter studies algorithms for computing the writhing number of a polygonal
knot. Section 2.2 introduces background work and states our results. Section 2.3 relates
the writhing number of a knot with the winding number of its Gauss map. Section 2.4
shows how to compute the writhing number in time less than quadratic in the number of
edges of the knot. Section 2.5 discusses a simpler sweep-line algorithm and presents initial
experimental results.
2.2 Prior and New Work
In this section, we formally define the writhing number of a knot and review prior al-
gorithms used to compute or approximate that number. We conclude by presenting our
results.
Definitions. A knot is a continuous injection K : S¹ → R³ or, equivalently, an oriented closed curve embedded in R³. We use the two-dimensional sphere of directions, S², to represent the family of parallel projections in R³. Given a knot K and a direction u ∈ S², the projection of K is an oriented, possibly self-intersecting, closed curve in a plane normal to u. We assume u to be generic, that is, each crossing of K in the direction u is simple and identifies two oriented intervals along K, of which the one closer to the viewer is the overpass and the other is the underpass. We count the crossing as +1 if we can align the two orientations by rotating the overpass in counterclockwise order by an angle between 0 and π. Similarly, we count the crossing as −1 if the necessary rotation is in clockwise order. Both cases are illustrated in Figure 2.2.

Figure 2.2: The two types of crossings when two oriented intervals intersect.

The Tait or directional writhing number of K in the direction u, denoted DWr(u), is the sum of crossings counted as +1 or −1 as explained. The writhing number is the averaged directional writhing number, taken over all directions u ∈ S²,

    Wr = (1/(4π)) ∫_{S²} DWr(u) du .    (2.2)

We note that a crossing in the projection along u also exists in the opposite direction, along −u, and that it has the same sign. Hence DWr(u) = DWr(−u), which implies that the writhing number can be obtained by averaging the directional writhing number over all points of the projective plane or, equivalently, over all antipodal point pairs (u, −u) of the sphere.
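The definitions above translate directly into a brute-force procedure for DWr(u): project the edges of the polygonal knot onto a plane normal to u and sum the signs of the crossings between non-adjacent edges. In the sketch below, the choice of orthonormal basis, the depth convention (a larger component along u means closer to the viewer), and the four-vertex example are our own illustrative assumptions; inspecting every edge pair is exactly the quadratic-time behavior that the algorithms of this chapter avoid.

```python
# A brute-force illustration of the directional writhing number DWr(u):
# project the edges of a closed polygon onto a plane normal to u and sum
# the signed crossings between non-adjacent edges. Conventions here
# (basis, depth, sample knot) are illustrative assumptions.
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dwr(vertices, u):
    """Signed crossing count of the closed polygon `vertices` (3D tuples)
    projected along the (assumed generic) direction u."""
    u = _unit(u)
    a = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _unit(_cross(a, u))
    e2 = _cross(u, e1)                        # (e1, e2, u) is right-handed
    pts = [tuple(sum(v[k] * e[k] for k in range(3)) for e in (e1, e2, u))
           for v in vertices]
    n, total = len(pts), 0
    for i in range(n):
        ax, ay, az = pts[i]
        bx, by, bz = pts[(i + 1) % n]
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:         # adjacent around the cycle
                continue
            cx, cy, cz = pts[j]
            dx, dy, dz = pts[(j + 1) % n]
            rx, ry = bx - ax, by - ay         # 2D edge vectors
            sx, sy = dx - cx, dy - cy
            denom = rx * sy - ry * sx
            if denom == 0:
                continue                      # parallel in projection
            t = ((cx - ax) * sy - (cy - ay) * sx) / denom
            s = ((cx - ax) * ry - (cy - ay) * rx) / denom
            if not (0 < t < 1 and 0 < s < 1):
                continue
            sign = 1 if denom > 0 else -1     # sign of cross2(edge_i, edge_j)
            over_i = az + t * (bz - az) > cz + s * (dz - cz)
            total += sign if over_i else -sign
    return total

# One crossing: the diagonal at height 1 passes over the one at height 0.
knot = [(0, 0, 1), (2, 2, 1), (2, 0, 0), (0, 2, 0)]
print(dwr(knot, (0, 0, 1)))  # 1
```

On the sample knot, one diagonal passes over the other, giving a single positive crossing, and, as noted above, DWr(u) = DWr(−u).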
Computing the writhing number. Several approaches to computing the writhing number
of a smooth knot exactly or approximately have been developed. Consider an arc-length parameterization K : [0, L] → R³, and use p(s) and t(s) to denote the position and the unit tangent vectors at parameter s ∈ [0, L]. The following double integral formula for the writhing number can be found in [142, 159]:

    Wr = (1/(4π)) ∫∫ ⟨ t(s) × t(s'), p(s) − p(s') ⟩ / ‖p(s) − p(s')‖³ ds ds' .    (2.3)
If the smooth knot is approximated by a polygonal knot, we can turn the right hand side of
(2.3) into a double sum and approximate the writhing number of the smooth knot [33, 128].
This can also be done in a way so that the double sum gives the exact writhing number of
the polygonal knot [28, 121, 166].
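The discretized double sum mentioned above can be sketched as follows: applying the midpoint rule to the Gauss integral over the edges of a polygonal knot approximates the writhing number of the smooth knot. The exact polygonal formulas of [28, 121, 166] require more care; this numerical version is only an illustration.

```python
# A sketch of the midpoint-rule discretization of the Gauss double
# integral (2.3): sum the integrand over pairs of distinct edges,
# using edge midpoints, unit tangents, and edge lengths.
import math

def writhe_double_sum(vertices):
    n = len(vertices)
    mids, tans, lens = [], [], []
    for i in range(n):
        p, q = vertices[i], vertices[(i + 1) % n]
        e = tuple(b - a for a, b in zip(p, q))
        L = math.sqrt(sum(c * c for c in e))
        mids.append(tuple((a + b) / 2 for a, b in zip(p, q)))
        tans.append(tuple(c / L for c in e))
        lens.append(L)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = tuple(a - b for a, b in zip(mids[i], mids[j]))
            d2 = sum(c * c for c in r)
            t1, t2 = tans[i], tans[j]
            cx = (t1[1] * t2[2] - t1[2] * t2[1],
                  t1[2] * t2[0] - t1[0] * t2[2],
                  t1[0] * t2[1] - t1[1] * t2[0])
            total += (sum(a * b for a, b in zip(cx, r))
                      / d2 ** 1.5 * lens[i] * lens[j])
    return total / (4 * math.pi)

# A planar curve has writhing number zero: every term vanishes.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(abs(writhe_double_sum(square)) < 1e-12)  # True
```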
Alternatively, we may base the computation of the writhing number on the directional
version of the White formula, Lk = DTw(u) + DWr(u) for all u ∈ S². Recall that both the linking number and the twisting number are defined over the two boundary curves of a closed ribbon. Similar to the definition of DWr(u), the directional twisting number, DTw(u), is defined as half the sum of crossings between the two curves, each counted as +1 or −1 as described in Figure 2.2. We get (2.1) by integrating over S² and noting that the linking number does not depend on the direction. This implies

    Wr = Lk − Tw = DWr(u) + DTw(u) − Tw .    (2.4)
To compute the directional and the (average directional) twisting numbers, we expand K to a ribbon, which amounts to constructing a second knot that runs alongside but is disjoint from K. Expressions for these numbers that depend on how we construct this second knot can be found in [121]. Le Bret [35] suggests to fix a direction u and define the second knot such that in the projection it always runs to the left of K. In this case we have DTw(u) = 0, and the writhing number is the directional writhing number for u minus the twisting number.
A third approach to computing the writhing number is based on a result by Cima-
soni [62], which states that the writhing number is the directional writhing number for a
fixed direction u₀, plus the average deviation of the other directional writhing numbers from DWr(u₀). By observing that DWr(x) is the same for all directions x in a cell C of the decomposition of S² formed by the Gauss maps Γ and −Γ (also referred to as the tangent indicatrix or tantrix in the literature [56, 154]), we get

    Wr = DWr(u₀) + (1/(4π)) Σ_C Area(C) · (DWr_C − DWr(u₀)) ,    (2.5)

where DWr_C is DWr(x) for any one point x in the interior of C, and Area(C) is the area of C.
If applied to a polygonal knot, all three algorithms take time that is at least proportional
to the square of the number of edges in the worst case.
Our results. We present two new results. The first result can be viewed as a variation of
(2.4) and a stronger version of (2.5). For a direction u ∈ S² not on Γ and not on −Γ, let w(u) be its winding number with respect to Γ and −Γ. As explained in Section 2.3, this means that Γ and −Γ wind w(u) times around u.

THEOREM A. For a knot K and a direction u, we have

    Wr = DWr(u) − w(u) + (1/(4π)) ∫_{S²} w(x) dx .
Observe the similarity of this formula with (2.4), which suggests that the winding number
can be interpreted as the directional twisting number for a ribbon one of whose two boundary curves is K. We will prove Theorem A in Section 2.3. We will also extend the relation
in Theorem A to open knots and give an algorithm that computes the average winding
number in time proportional to the number of edges. Our second result is an algorithm that
computes the directional writhing number for a polygonal knot in time sub-quadratic in the
number of edges.
THEOREM B. Given a polygonal knot K with n edges and a direction u ∈ S², DWr(u) can be computed in time O(n^{4/3+ε}), where ε is an arbitrarily small positive constant.
Figure 2.3: A knot whose directional writhing number is quadratic in the number of edges.
Theorems A and B imply that the writhing number for a polygonal knot can be computed
in time O(n^{4/3+ε}). As shown in Figure 2.3, the number of crossings in a projection can be as large as quadratic in n. The sub-quadratic running time is achieved because the
algorithm avoids checking each crossing explicitly. We also present a simpler sweep-line
algorithm that checks each crossing individually and therefore does not achieve the worst-
case running time of the algorithm in Theorem B. It is, however, fast when there are few
crossings.
2.3 Writhing and Winding
In this section, we develop our geometric understanding of the relationship between the
writhing number of a knot and the winding number of its Gauss map. We define the Gauss
map as the curve of critical directions, prove Theorem A, and give a fast algorithm for
computing the average winding number.
2.3.1 Closed knots
Critical directions. We specify a polygonal knot K by the cyclic sequence of its vertices, x₀, x₁, …, x_{n−1} in R³. We use indices modulo n and write tᵢ = (x_{i+1} − xᵢ)/‖x_{i+1} − xᵢ‖ for the unit vector along the edge xᵢ x_{i+1}. Note that tᵢ is also a direction in R³ and a point in S². Any two consecutive points tᵢ and t_{i+1} determine a unique arc, which, by definition, is the shorter piece of the great circle that connects them. The cyclic sequence t₀, t₁, …, t_{n−1} thus defines an oriented closed curve Γ in S². We also need the antipodal curve, −Γ, which is the central reflection of Γ through the origin.
Figure 2.4: In all three cases, the viewing direction slides from left to right over the orientedgreat circle of directions defined by the hollow vertex and the solid edge. The directional writhingnumber changes only in the third case, where we lose a positive crossing.
The directions u on Γ and −Γ are critical, in the sense that the directional writhing number changes when we pass through u along a generic path in S², and these are the only critical directions [62]. We sketch the proof of this claim for the polygonal case. It is clear that u ∈ S² is critical only if it is parallel to a line that passes through a vertex xᵢ and a point on an edge xⱼ x_{j+1} of the knot that is not adjacent to xᵢ. There are n(n − 2) such vertex-edge pairs, each defining a great circle in S². First, we note that only n of these great circles actually carry critical points, namely, the great circles that correspond to j = i + 1 and j = i − 2. The reason for this is shown in Figure 2.4, where we see that the writhing number does not change unless xᵢ is separated from xⱼ x_{j+1} by only one edge along the knot. Second, assuming j = i + 1, we observe that the subset of directions along which xᵢ projects onto x_{i+1} x_{i+2} is the arc from tᵢ to the direction v = (x_{i+2} − xᵢ)/‖x_{i+2} − xᵢ‖ in S², and symmetrically the arc from −tᵢ to −v. The subset of directions along which x_{i+2} projects onto xᵢ x_{i+1} are the arcs from t_{i+1} to v and from −t_{i+1} to −v. The points tᵢ, v, and t_{i+1} lie on a common great circle, and v lies on the arc tᵢ t_{i+1}. This implies that the concatenation of the arcs tᵢ v and v t_{i+1} is the arc tᵢ t_{i+1}, and that of the arcs (−tᵢ)(−v) and (−v)(−t_{i+1}) is the arc (−tᵢ)(−t_{i+1}). It follows that Γ and −Γ indeed comprise all critical directions.
Decomposition. The curves Γ and −Γ are both oriented, which is essential. We say a direction u ∈ S² lies to the left of an oriented arc a if it lies in the open hemisphere to the left of the oriented great circle that contains a. Equivalently, u sees that great circle oriented in counterclockwise order. If u passes from the left of an arc a of Γ to its right, then we either lose a positive crossing (as in the third row of Figure 2.4), or we pick up a negative crossing. Either way the directional writhing number decreases by one. This motion corresponds to −u passing from the right of the arc −a of −Γ to its left. Since the directional writhing numbers at u and −u are the same, we decrease the directional writhing number by one in the opposite view as well. In other words, if u moves from the left of an arc of −Γ to its right, then the effect on the directional writhing number is the opposite from what it is for an arc of Γ. These simple rules allow us to keep track of the directional writhing number while moving around in S². The curves Γ and −Γ decompose S² into cells within which the directional writhing number is invariant. We can thus rewrite (2.2) as

    Wr = (1/(4π)) Σ_C Area(C) · DWr_C ,

where the sum ranges over all cells C of the decomposition, and DWr_C is the directional writhing number of any one point in the interior of C. Equation (2.5) of Cimasoni can now be obtained by subtracting DWr(u₀) from DWr_C inside the sum and adding it outside the sum.
This reformulation provides an algorithm for computing the writhing number.

Step 1. Compute DWr(z) for an arbitrary but fixed direction z.
Step 2. Construct the decomposition of S² into cells, label each cell c with DWr(c) - DWr(z), and form the sum as in (2.5).

The running time for Step 2 is quadratic in the worst case, as the 2n arcs can create quadratically many
cells. We improve the running time to O(n) and, at the same time, simplify the algorithm.
First we prove Theorem A.
Winding numbers. We now introduce a function w over S² that may be different from DWr
but changes in the same way. In other words, w(y) - w(z) = DWr(y) - DWr(z)
for all y, z ∈ S². This function is the winding number of a point x ∈ S² with respect
to the two curves Γ and -Γ that do not contain x. Observe that the space obtained by
removing two points from the two-dimensional sphere is topologically an annulus. We
fix non-critical, antipodal directions z and -z and define w(x) equal to the number of
times Γ winds around the annulus obtained by removing x and -z plus the number of
times -Γ winds around the annulus obtained by removing x and z. This is illustrated in
Figure 2.5, where w(z) = w(-z) = 0 and w(x) = 1. Here we count the winding of Γ in
counterclockwise order as seen from x positive, and winding in clockwise order negative.
Symmetrically, we count the winding of -Γ in clockwise order as seen from x positive,
and winding in counterclockwise order negative. Imagine moving a point p along Γ and
connecting p to x with a circular arc. Specifically, we use the circle that passes through
p, x, and -z and the arc with endpoints p and x that avoids -z. Symmetrically, we move
a point p' along -Γ and connect x to p' with the appropriate arc of the circle passing through
x, p', and z.

Figure 2.5: The winding number counts the number of times Γ separates x from -z and -Γ separates x from z.

Locally at x we observe continuous movements of the two arcs. Clockwise
and counterclockwise movements cancel, and w(x) is the number of times the first arc
rotates in counterclockwise order around x plus the number of times the second arc rotates
in clockwise order around x. The winding number of x is always an integer but can be
negative.
Observe that w indeed changes in the same way as DWr does. Specifically, w drops by
1 if x crosses Γ from left to right, and it increases by 1 if x crosses -Γ from left to right.
Starting from the definition (2.2) of the writhing number, we thus get

    Wr(K) = (1/4π) ∫_{u ∈ S²} DWr(u) du
          = (1/4π) ∫_{u ∈ S²} (DWr(z) - w(z) + w(u)) du
          = DWr(z) - w(z) + (1/4π) ∫_{u ∈ S²} w(u) du,

which completes the proof of Theorem A.
Signed area modulo 2. Observe that the writhing number changes continuously under
deformations of the knot, as long as K does not pass through itself. When K performs a
small motion during which it passes through itself there is a ±2 jump in DWr(z), while the
average winding number changes only slightly. We use these observations to give a new
proof of Fuller's relation [13, 89],

    Wr(K) + 1 ≡ Area(Γ)/2π (mod 2),    (2.6)

where Area(Γ) is the signed area enclosed by the curve Γ in S². Note first that
the fractional parts of Wr(K) and of the average winding number (1/4π) ∫ w(u) du agree,
because both DWr(z) and w(z) in Theorem A are integers. We start with K
being a circle in R², in which case (2.6) holds because Wr(K) = 0 and Area(Γ) = 2π. Other
than continuous changes, we observe jumps of ±4π in Area(Γ) when K passes through itself.
Theorem A together with the fact that the fractional parts of Area(Γ)/2π and of the average winding number are the
same implies that (2.6) is maintained during the deformation. Fuller's relation follows
because every knot can be obtained from the circle by continuous deformation.
Computing the average winding number. Three generic points a, b, c ∈ S² define three
arcs, which bound the spherical triangle abc. Recall that the area of abc is the sum of its angles
minus π. We define the signed area of abc as Φ(abc) = area(abc) if c lies to the left of
the oriented arc ab, and as Φ(abc) = -area(abc) if it lies to the right. Let z ∈ S² be a
non-critical direction. As shown in Figure 2.6, every arc a_i of Γ forms a unique spherical
triangle together with z. Let Φ_i be its signed area. The corresponding arc -a_i of -Γ forms
the antipodal spherical triangle with signed area -Φ_i.

Figure 2.6: The two spherical triangles defined by an arc of Γ and its antipodal arc of -Γ.

The winding number of a direction x ∈ S² can be obtained by counting the number of spherical triangles that contain
it. To be more specific, we call a spherical triangle positive if its signed area is positive and
negative if its signed area is negative. Let P(x) and N(x) be the numbers of positive and
negative spherical triangles formed by arcs of Γ that contain x, and similarly let P'(x) and N'(x) be
the numbers of positive and negative antipodal triangles formed by arcs of -Γ that contain x. Then

    w(x) = w(z) + P(x) - N(x) - P'(x) + N'(x).

To see this note that the equation is correct for a point x near z and remains correct as x
moves around and crosses arcs of Γ and of -Γ. The average winding number is thus

    (1/4π) ∫_{x ∈ S²} w(x) dx = w(z) + (1/4π) Σ_i Φ_i + (1/4π) Σ_i Φ_i
                              = w(z) + (1/2π) Σ_i Φ_i.

Computing the sum in this equation is straightforward and takes only time O(n).
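The only nontrivial primitive in this O(n) computation is the signed area of a single spherical triangle. The Python sketch below (an illustration, not the thesis implementation) evaluates Φ via Girard's theorem; determining the sign from the orientation determinant det[a, b, c] is an assumption chosen to match the left/right convention above for generic triangles.

```python
import numpy as np

def _tangent(u, v):
    # unit tangent at u of the great-circle arc from u toward v
    w = v - (u @ v) * u
    return w / np.linalg.norm(w)

def signed_spherical_area(a, b, c):
    # Signed area Phi(abc) of the spherical triangle with unit vertices a, b, c:
    # Girard's theorem (sum of angles minus pi), signed by the orientation
    # determinant det[a, b, c] (an assumed stand-in for the left/right rule).
    a, b, c = (np.asarray(p, float) for p in (a, b, c))
    def angle(u, v, w):
        return np.arccos(np.clip(_tangent(u, v) @ _tangent(u, w), -1.0, 1.0))
    area = angle(a, b, c) + angle(b, c, a) + angle(c, a, b) - np.pi
    orient = np.linalg.det(np.stack([a, b, c]))
    return np.copysign(area, orient) if orient != 0 else 0.0
```

For the positively oriented octant triangle with vertices (1,0,0), (0,1,0), (0,0,1) all three angles are π/2, so the signed area is π/2; reversing the orientation flips the sign.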
2.3.2 Open knots

We define an open knot as a continuous injection K: [0, 1] → R³. Equivalently, it is an
oriented curve, embedded in R³, with endpoints. The directional writhing number of K is
well-defined, and the writhing number is the directional writhing number averaged over
all parallel projections, as before. Assume K is a polygon specified by the sequence of its
vertices, t₀, t₁, ..., t_{n-1}, and let K̄ be the knot obtained by adding the edge t_{n-1} t₀. The
critical directions of K differ in two ways from those of K̄:

(i) there are critical directions of K̄ that are not critical for K, namely the ones whose
definition includes a point of the added edge t_{n-1} t₀;

(ii) there are new critical directions, namely those defined by an endpoint (t₀ or t_{n-1})
and another point of the polygon but not on the two adjacent edges.

To see that the directions in (ii) are indeed critical for K, examine the first two rows of
Figure 2.4. The hollow vertex is now an endpoint of K, so we remove one of the two
dashed edges. Because of this change, the directional writhing number changes at the
moment the hollow vertex passes over the solid edge. Changing the critical curve Γ̄ of K̄
to the critical curve Γ of K can thus be achieved by removing the arcs of Case (i) and adding
the arcs of Case (ii). We illustrate this process in Figure 2.7. To describe the process, we
Figure 2.7: The critical curves of the knot K̄ are marked by hollow vertices, and the additions required for the critical curves of the open knot K are marked by solid black vertices.

define u_i, v_i, and w_i as the unit directions determined by an endpoint of K (t₀ or t_{n-1})
and the vertex t_i; the precise arcs they delimit are indicated in Figure 2.7. We get the critical curve Γ from Γ̄ by

1. removing the arcs of Γ̄ contributed by the added edge t_{n-1} t₀, together with the partial arcs adjacent to them,
2. adding the new paths through the directions u_i and v_i defined by the endpoints t₀ and t_{n-1}.

Note that Step 2 adds a piece of -Γ̄ to the new
critical curve Γ. Symmetrically, we get -Γ from -Γ̄. Everything we said earlier about the
winding number of the critical curve Γ̄ of K̄ applies equally well to the critical curve Γ of
K. Similarly, all algorithms described in the subsequent sections apply to knots as well as
to open knots.
2.4 Computing Directional Writhing
In this section, we present an algorithm that computes the directional writhing number of
a polygonal knot with n edges in time roughly proportional to n^{4/3}. The algorithm uses
complicated subroutines that may not lend themselves to an easy implementation.

Reduction to five dimensions. Assume without loss of generality that we view the knot
K from above, that is, in the direction of z = (0, 0, -1). Each edge e_i = t_i t_{i+1} of
K is oriented. Another edge e_j that crosses e_i in the projection either passes
above or below and it either passes from left to right or from right to left. The four cases
are illustrated in Figure 2.8 and classified as positive and negative crossings according to
Figure 2.2. Letting p_i and n_i be the numbers of edges that form positive and negative
crossings with e_i, the directional writhing number is

    DWr(z) = (1/2) Σ_i p_i - (1/2) Σ_i n_i.

Figure 2.8: The four ways an oriented edge can cross another.

To compute the sums of the p_i and n_i efficiently, we map edges in R³ to points and half-
spaces in R⁵. Specifically, let ℓ_i be the oriented line that contains the oriented edge e_i and
use Plücker coordinates as explained in [52] to map ℓ_i to a point π_i ∈ R⁵ or alternatively to
a half-space h_i in R⁵. The mapping has the property that ℓ_i and ℓ_j form a positive crossing
if and only if π_j lies in the interior of h_i. We use this correspondence to compute Σ_i p_i in
two stages: first we collect the ordered pairs of oriented lines that form positive crossings,
and second we count among them the pairs of edges that cross.
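Before turning to this sub-quadratic machinery, note that the quantity itself is easy to compute by brute force in O(n²) time: project the edges, find the crossings, and add up the signs. The Python sketch below is an illustration, not the thesis code; its crossing-sign convention (positive when the over-strand and under-strand directions form a counterclockwise frame) is an assumption, so only its magnitudes and invariances should be trusted.

```python
import numpy as np

def _crossing_params(a0, a1, b0, b1):
    # 2D parameters (s, t) with a0 + s(a1-a0) == b0 + t(b1-b0), both strictly
    # interior, or None if the projected segments do not cross
    d1, d2 = a1 - a0, b1 - b0
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:          # parallel projections
        return None
    r = b0 - a0
    s = (r[0] * d2[1] - r[1] * d2[0]) / denom
    t = (r[0] * d1[1] - r[1] * d1[0]) / denom
    return (s, t) if 0 < s < 1 and 0 < t < 1 else None

def directional_writhing(vertices, closed=False):
    # Brute-force DWr for the projection along the z-axis: one signed term per
    # crossing pair, matching DWr(z) = (1/2) sum_i (p_i - n_i).
    v = [np.asarray(p, float) for p in vertices]
    n = len(v)
    m = n if closed else n - 1
    edges = [(v[i], v[(i + 1) % n]) for i in range(m)]
    dwr = 0
    for i in range(m):
        for j in range(i + 1, m):
            a0, a1 = edges[i]
            b0, b1 = edges[j]
            hit = _crossing_params(a0[:2], a1[:2], b0[:2], b1[:2])
            if hit is None:
                continue
            s, t = hit
            za = a0[2] + s * (a1[2] - a0[2])
            zb = b0[2] + t * (b1[2] - b0[2])
            over, under = ((a1 - a0), (b1 - b0)) if za > zb else ((b1 - b0), (a1 - a0))
            # assumed sign convention: positive if (over, under) is a CCW frame
            dwr += int(np.sign(over[0] * under[1] - over[1] * under[0]))
    return dwr
```

A four-vertex open curve whose two non-adjacent edges cross once in the projection yields |DWr| = 1, and reversing the orientation of the curve leaves the value unchanged, as it must.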
Recursive algorithm. It is convenient to explain the algorithm in a slightly more general
setting, where E and F are sets of m and n oriented edges in R³. Let P(E, F) denote
the number of pairs (e, f) ∈ E × F that form positive crossings, and note that P(E, E) =
Σ_i p_i if E is the set of edges of the knot K and F = E. We map E to a set Π of
points and F to a set H of half-spaces in R⁵. Let r be a sufficiently large constant.
A (1/r)-cutting of Π and H is a collection of pairwise disjoint simplices covering R⁵ such that
each simplex intersects at most n/r hyperplanes bounding the half-spaces in H. We use
the algorithm in [10] to compute a (1/r)-cutting consisting of s simplices in time O(n r⁴),
where s is at most r⁵ times a constant independent of r. For each simplex Δ_j in the
cutting, we let Π_j ⊆ Π be the points contained in Δ_j, we let H_j ⊆ H be the half-spaces that
contain Δ_j in their interiors, and we let E_j ⊆ E and F_j ⊆ F be the sets of edges that correspond
to Π_j and to the half-spaces whose bounding hyperplanes intersect Δ_j. Letting
m_j = |Π_j| and n_j = |F_j|, we have Σ_j m_j = m and n_j ≤ n/r. By construction,
every pair in Π_j × H_j defines a pair of lines that form a positive crossing. For each
simplex Δ_j, we count the pairs among them whose edges actually cross in the projection, and let
c_j be the number of such pairs. Then

    P(E, F) = Σ_j c_j + Σ_j P(E_j, F_j).

Note that c_j is the number of crossings between projections of the line segments in E_j
and the segments corresponding to H_j. We can therefore use the algorithm in [51] to compute all numbers c_j, for
1 ≤ j ≤ s, in time Σ_j O(m_j^{2/3} n_j^{2/3} log n + (m_j + n_j) log n). We recurse to compute
the P(E_j, F_j) and stop the recursion when n_j ≤ 1. The running time of this algorithm is at
most

    T(m, n) ≤ Σ_j T(m_j, n/r) + Σ_j O(m_j^{2/3} n_j^{2/3} log n + (m_j + n_j) log n)
            = O(m^{1+δ} n^{2/3} + n^{1+δ}),

for any δ > 0, provided r = r(δ) is sufficiently large.
Improving the running time. We improve the running time of the algorithm by taking
advantage of the symmetry of the mapping to R⁵. Specifically, a point π_i lies in the interior
of a half-space h_j if and only if the point π_j lies in the interior of the half-space h_i. We
proceed as above, but switch the roles of points and half-spaces when m becomes less than
n. That is, if m < n then we map the edges in E to half-spaces and the edges in F to points.
By our above analysis, the running time is then less than O(m^{2/3} n^{1+δ} + m^{1+δ}).
The overall running time is thus less than

    T(m, n) ≤ Σ_j T(m_j, n/r) + c · ( m^{1+δ} n^{2/3}  if m ≥ n,
                                      m^{2/3} n^{1+δ}  if m < n )
            = O(m^{2/3} n^{2/3} (m^δ + n^δ)),

where c is a positive constant and δ is any real larger than 0. It follows that Σ_i p_i can
be computed in time O(n^{4/3+δ}), for any constant δ > 0. Similarly, Σ_i n_i and therefore
the directional writhing number, DWr(z), can be computed within the same time bound,
thereby proving Theorem B.

We remark that the technique described in this section can also be used to compute the
linking number between two polygonal knots with m and n edges in time O((m + n)^{4/3+δ}).
2.5 Experiments
In this section, we sketch a sweep-line algorithm that computes the writhing number of a
polygonal knot using Theorem A. We implemented the algorithm in C++ using the LEDA
software library and compared it with two versions of the algorithm based on the double
integral in (2.3). We did not implement any version of Le Bret’s algorithm mentioned in
Section 2.2 since it is based on a formula similar to Theorem A and can be expected to
perform about the same as our sweep-line algorithm, and since it only works for closed
knots.
2.5.1 Algorithms

Sweep-line algorithm. Theorem A expresses the writhing number of a knot K as the
sum of three terms. Accordingly, we compute the writhing number in three steps.

Step 1. Compute the directional writhing number for an arbitrary but fixed, non-critical direction z: DWr(z).
Step 2. Compute the winding number of z relative to the Gauss maps Γ and -Γ: w(z).
Step 3. Compute the average winding number by summing the signed areas of the spherical triangles: w(z) + (1/2π) Σ_i Φ_i.

Return Wr(K) = DWr(z) - w(z) + (average winding number).

Instead of using the algorithm described in Section 2.4, we implemented Step 1 using a
sweep-line algorithm [71], which reports the k crossing pairs formed by the n edges in
time O((n + k) log n). Steps 2 and 3 are both computed in a single traversal of the spherical
polygons Γ and -Γ, keeping track of the accumulated angle and the signed area as we go.
The running time of the traversal is only O(n).
Double-sum algorithm. We compare the implementation of the sweep-line algorithm
with two implementations of (2.3). Write K'(t_i) for the unnormalized tangent
vector at K(t_i). Following [33, 128], we discretize (2.3) to

    W₁ = (1/4π) Σ_i Σ_{j ≠ i} ((K'(t_i) × K'(t_j)) · (K(t_i) - K(t_j))) / |K(t_i) - K(t_j)|³.    (2.7)

We note that W₁ is not the writhing number of the polygonal knot, but it converges to the
writhing number of a smooth knot if the polygonal approximation is progressively refined
to approach that knot [43].
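As a concrete illustration, the following Python sketch evaluates one natural reading of (2.7) for a closed polygon, using edge vectors in place of the tangents K'(t_i) and edge midpoints in place of the points K(t_i); this particular discretization choice is an assumption, not taken from the thesis.

```python
import numpy as np

def discretized_writhe(vertices):
    # Approximate writhe W1 in the spirit of (2.7) for a closed polygon
    # (assumed discretization: edge vectors for tangents, midpoints for points)
    v = np.asarray(vertices, float)
    n = len(v)
    e = np.roll(v, -1, axis=0) - v           # edge vectors
    c = (np.roll(v, -1, axis=0) + v) / 2.0   # edge midpoints
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = c[i] - c[j]
            total += np.cross(e[i], e[j]) @ r / np.linalg.norm(r) ** 3
    return total / (4 * np.pi)
```

Two cheap sanity checks follow directly from the Gauss integral: every term vanishes for a planar polygon, and the value is invariant under reversing the orientation of the curve.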
Alternatively, we may discretize the double integral in such a way that the result is
the writhing number of the approximating polygonal knot. Given two edges e_i and e_j, we
measure the area of the two antipodal quadrangles in S² along whose directions we see the
edges cross. The area of one of the quadrangles is the sum of its four angles minus one full angle,
α₁ + α₂ + α₃ + α₄ - 2π. The absolute value of the signed area Ω_{ij} is the same, and its sign
depends on whether we see a positive or a negative crossing. We thus have

    W₂ = (1/4π) Σ_i Σ_{j ≠ i} Ω_{ij}.    (2.8)

Straightforward vector geometry and trigonometry can be used to derive analytical formulas
for the Ω_{ij} [28, 121].
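One such analytical formula, in the style of Klenin and Langowski, is sketched below in Python; the triple-product sign rule is an assumption chosen to match the quadrangle construction above, and the function names are hypothetical.

```python
import numpy as np

def _segment_pair_area(p1, p2, p3, p4):
    # Signed area Omega of the quadrangle of directions along which the
    # segment p1p2 is seen crossing the segment p3p4
    r12, r13, r14 = p2 - p1, p3 - p1, p4 - p1
    r23, r24, r34 = p3 - p2, p4 - p2, p4 - p3
    ns = [np.cross(r13, r14), np.cross(r14, r24),
          np.cross(r24, r23), np.cross(r23, r13)]
    norms = [np.linalg.norm(v) for v in ns]
    if min(norms) < 1e-12:               # degenerate configuration
        return 0.0
    n1, n2, n3, n4 = [v / l for v, l in zip(ns, norms)]
    asin = lambda x: np.arcsin(np.clip(x, -1.0, 1.0))
    area = asin(n1 @ n2) + asin(n2 @ n3) + asin(n3 @ n4) + asin(n4 @ n1)
    # assumed sign rule: orientation of the two segments and their connector
    return area * np.sign(np.cross(r34, r12) @ r13)

def polygon_writhe(vertices):
    # Exact W2 of a closed polygon: the (1/4pi) sum over ordered pairs in (2.8)
    # equals (1/2pi) times the sum over unordered pairs.
    v = [np.asarray(p, float) for p in vertices]
    n = len(v)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue                 # edges sharing a vertex contribute 0
            total += _segment_pair_area(v[i], v[(i + 1) % n],
                                        v[j], v[(j + 1) % n])
    return total / (2 * np.pi)
```

As with the approximate double sum, a planar polygon must have writhe zero, and the value must not change when the orientation of the curve is reversed.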
2.5.2 Comparison
We compare the three implementations using a sequence of polygonal approximations of
an artificially created smooth knot. It has the form of the infinity symbol, ∞, and is fairly
flat in R³, with only a small gap in the middle. Because the knots are fairly flat, most of
their parallel projections have one crossing and the writhing number is just a little smaller
than 1. Figure 2.9 shows that the algorithms that compute the exact writhing numbers
for polygonal approximations converge faster to the writhing number of the smooth knot
than the algorithm implementing (2.7). Figure 2.10 shows how much faster the sweep-line
algorithm is than both implementations of the double-sum algorithm. Let n be the number
of edges. The graphs suggest that the running time of the sweep-line algorithm is O(n) and
the running times of the two implementations of the double-sum algorithm are Θ(n²). We
observe the linear bound whenever we approximate a smooth knot by a polygon, since for
generic projections the number of crossings as well as the number of edges simultaneously
intersected by the sweep-line are independent of the total number of edges.

Figure 2.9: Comparing convergence rates between W₁ (upper curve) and W₂ (lower curve). For each tested approximation of the ∞-knot, we draw the number of vertices along the horizontal axis and the writhing number along the vertical axis.

Figure 2.10: Comparing the running times of the sweep-line algorithm (lower curve) and the two implementations of the double-sum algorithm: approximate (middle curve) and exact (upper curve). The x-axis and the y-axis represent the number of vertices in the curve and the running time of the algorithm, respectively.
Protein backbones. We present some preliminary experimental results obtained with
the three implementations. All experiments are carried out on a SUN workstation with
a 333 MHz UltraSPARC-IIi CPU and 256 MB of memory. Short of conformation data for
long DNA strands, we decided to run our algorithms on a modest collection of open knots
representing protein backbones, downloaded from the Protein Data Bank [2]. We modified
the algorithms to account for the missing edge in the data, as explained in Section 2.3.
Figure 2.11 displays the four backbones chosen for our experimental study. Table 2.1
presents some of our findings.
Figure 2.11: The open knots modeling the backbone of the protein conformations stored in thePDB files 1AUS.pdb (upper left), 1CDK.pdb (upper right), 1CJA.pdb (lower left), and 1EQZ.pdb(lower right).
Thick knots. Even though the writhing number of a polygonal knot can be as large as
quadratic in the number of edges, all four protein backbones in Figure 2.11 have writhing
numbers that are significantly smaller than the numbers of edges. If a knot is made out of
rope with non-zero thickness, then the quadratic bound can be achieved only if the ratio
of length over cross-section radius is sufficiently high. Specifically, the writhing number
of a knot of length L with an embedded tubular neighborhood of radius r is less than
a constant times (L/r)^{4/3} [44]. Such "thick" knots can be used to capture the fact that the edges of a
protein backbone are about as long as they are thick. A backbone with n edges thus has
writhing number at most some constant times n^{4/3}. Examples which show that the upper
bound is asymptotically tight can be found in [38, 45, 72].

    Data    n     k    t_swp   t_apx   t_ext    W₁      Wr
    1AUS   439   122   0.09    3.93    9.28    22.70   17.87
    1CDK   343   111   0.06    2.39    5.62     7.96    6.01
    1CJA   327   150   0.06    2.19    5.10    12.14   10.43
    1EQZ   125    18   0.02    0.31    0.73     4.78    3.37

Table 2.1: Four protein backbones modeled by open polygonal knots. The size of the problem is measured by the number of edges, n, and the number of crossings in the chosen projection, k. The time the sweep-line (t_swp), the approximate double-sum (t_apx), and the exact double-sum (t_ext) algorithms take is measured in seconds. W₁ is an approximation of the writhing number for polygonal data.
2.6 Notes and Discussion
In this chapter, we have developed an efficient algorithm to compute the writhing number
of a space curve in R³. A fast method is important as the writhing number of DNA strands
is computed at each step in some molecular simulations. Other than this computational
aspect, it would be interesting to further investigate the concept and see whether there is a
correlation between the writhing numbers and the common classification of protein folds.
As mentioned in Section 2.1, there has been some initial work in this direction [82, 145].
It seems that although the writhing number of a protein backbone describes the spatial
arrangements of its secondary structure elements, it alone is not discriminative enough in
classifying protein structures. One major reason is that the writhing number is mainly
effective in describing the global geometry of a given space curve. To solve this problem,
it might be necessary to consider backbones at a range of scales and compute
the writhing number as a function of scale. Another possible approach is to combine the
writhing number with other topological or geometric measures that describe different as-
pects, especially the local geometry, of protein structures.
Chapter 3
Backbone Simplification
3.1 Introduction
Protein structures are examined at different levels of detail in various applications. Simpler
structures are exploited either when the time complexity of working with the full structure is
too high, or when excessive details might obscure crucial features or principles that one would like to
observe. It is therefore desirable to build a level-of-detail (LOD) representation for protein
structures. One way to achieve such a representation for protein backbones is via curve
simplification.
Given a polygonal curve, the curve-simplification problem asks for another
polygonal curve that approximates the original curve under a predefined error criterion
and whose size is as small as possible. Beyond its potential use in simplifying
protein structures, curve simplification is widely applied in numerous areas,
such as geographic information systems (GIS), computer vision, computer graphics, and
data compression. Simplification helps to remove unnecessary clutter due to excessive
detail, to save the memory needed to store a curve, and to expedite the processing
of a curve. For example, one of the main problems in computational cartography is to
visualize geo-spatial information as a simple and easily readable map. To this end, curve
simplification is used to represent rivers, roads, coastlines, and other linear features at
an appropriate level of detail when a map of a large area is produced.
In this chapter, we study the curve-simplification problem under the so-called Fréchet
error measure, and propose the first near-linear time algorithm to simplify curves in R^d with
guaranteed quality. Below we first introduce the curve-simplification problem formally.
Problem definition. Let P denote a polygonal curve in R^d with ⟨p₁, p₂, ..., p_n⟩ as its sequence
of vertices. A polygonal curve P' = ⟨p_{i₁}, p_{i₂}, ..., p_{i_k}⟩ simplifies P if
1 = i₁ < i₂ < ... < i_k = n. Given an error measure E and a pair of indices 1 ≤ i ≤ j ≤ n,
let E(p_i p_j, P) denote the error of the segment p_i p_j with respect to P under the error measure E. Intuitively,
E(p_i p_j, P) measures how well p_i p_j approximates the portion of P between p_i and
p_j. The error of a simplification P' = ⟨p_{i₁}, ..., p_{i_k}⟩ of P is defined as

    E(P', P) = max_{1 ≤ l < k} E(p_{i_l} p_{i_{l+1}}, P).

We call P' an ε-simplification of P, under the error measure E, if E(P', P) ≤ ε. Let
κ_E(ε, P) denote the minimum number of vertices in an ε-simplification of P under the
error measure E. Given a polygonal curve P, an error measure E, and a parameter ε, the
curve-simplification problem asks for computing an ε-simplification of size κ_E(ε, P).

We now define the error measure we study in this chapter. Let d: R^d × R^d → R be
a distance function, e.g., the Euclidean distance, or the L₁ or L_∞ norm. Given two curves
P: [0, 1] → R^d and Q: [0, 1] → R^d, the Fréchet distance under the metric d, δ_F(P, Q), is
defined as

    δ_F(P, Q) = inf_{α, β} max_{t ∈ [0, 1]} d(P(α(t)), Q(β(t))),    (3.1)

where α and β range over continuous and monotonically non-decreasing functions with
α(0) = β(0) = 0 and α(1) = β(1) = 1. If α* and β* are the maps that realize the
Fréchet distance, then we refer to the map φ = (Q ∘ β*) ∘ (P ∘ α*)^{-1} as the Fréchet map from P to Q.
For a pair of indices 1 ≤ i ≤ j ≤ n, the Fréchet error of a segment p_i p_j is defined to be

    E_F(p_i p_j, P) = δ_F(p_i p_j, π(p_i, p_j)),

where π(p_i, p_j) denotes the portion of P from p_i to p_j.
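The infimum in (3.1) ranges over reparameterizations and is awkward to evaluate directly. A standard computational stand-in for polygonal curves is the discrete Fréchet distance of Eiter and Mannila, which couples the vertex sequences instead; the minimal Python sketch below is an illustration, not the algorithm of this chapter.

```python
from functools import lru_cache
import math

def discrete_frechet(P, Q):
    # Discrete Frechet distance (Eiter-Mannila) between two vertex sequences,
    # a standard discrete stand-in for the continuous distance in (3.1)
    def d(p, q):
        return math.dist(p, q)

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j): best achievable bottleneck over couplings of P[:i+1], Q[:j+1]
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)
```

For two parallel three-vertex chains at vertical distance 1 the value is exactly 1, and a detour vertex forces the bottleneck up to its distance from the nearer endpoints.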
Most previous work has focused on the so-called Hausdorff error measure. We
define it here as well, since we will compare simplification under the Fréchet and Hausdorff
error measures later in this chapter. If we define the distance between a point p and a line
segment s as d(p, s) = min_{q ∈ s} d(p, q), then the Hausdorff error under the metric d, also
referred to as the d-Hausdorff error measure, is defined as

    E_H(p_i p_j, P) = max_{p ∈ π(p_i, p_j)} d(p, p_i p_j).

An ε-simplification under the Hausdorff error measure and κ_H(ε, P) are defined similarly to
the case of the Fréchet error measure.
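For a polygonal subcurve the maximum in E_H is attained at a vertex, because the distance to a fixed segment is convex along every edge of π(p_i, p_j); this makes the Hausdorff error of a single segment easy to evaluate exactly. A small Python sketch, assuming planar points and the Euclidean metric:

```python
import math

def point_segment_dist(p, a, b):
    # Euclidean distance from the point p to the segment ab in the plane
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    if L2 == 0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / L2))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def hausdorff_error(P, i, j):
    # E_H(p_i p_j, P): since the distance to a fixed segment is convex, its
    # maximum over each edge of pi(p_i, p_j) is attained at a vertex, so
    # checking the vertices p_i, ..., p_j is exact.
    return max(point_segment_dist(P[l], P[i], P[j]) for l in range(i, j + 1))
```

For the three-vertex tent P = ⟨(0,0), (1,1), (2,0)⟩ the error of the shortcut p₁p₃ is the height of the middle vertex, i.e. 1.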
If we remove the constraint that the vertices of P' are a subset of the vertices of P, then
P' is called a weak ε-simplification of P. Let κ_W(ε, P) denote the minimum number of
vertices in a weak Fréchet ε-simplification of P.
3.2 Prior and New Work

Previous work. The problem of approximating a polygonal curve P has been studied
extensively during the last two decades; see [100, 167] for surveys. Imai and Iri [109]
formulated the curve-simplification problem as computing a shortest path between two
nodes in a directed acyclic graph G_ε: each vertex of P corresponds to a node in G_ε, and
there is an edge between two nodes p_i, p_j if E(p_i p_j, P) ≤ ε. A shortest path from p₁
to p_n in G_ε corresponds to an optimal ε-simplification of P under the error measure E.
In R², under the Hausdorff measure with the so-called uniform metric¹, their algorithm
takes O(n² log n) time. Chin and Chan [50], and Melkman and O'Rourke [134] improve
the running time of their algorithm to quadratic. Agarwal and Varadarajan [3] improve the
running time to O(n^{4/3+δ}) for the L₁ and uniform-Hausdorff error measures, for an arbitrarily
small constant δ > 0, by implicitly representing the graph G_ε. In dimensions higher than
two, Barequet et al. compute the optimal ε-simplification under the L₁- or L_∞-Hausdorff error
measure in quadratic time [29]. For the L₂-Hausdorff error measure, an optimal simplification
can be computed in near-quadratic time for d = 3 and in slightly superquadratic time in R^d
for d > 3.

¹The uniform metric in R² is defined as follows: given two points p = (p₁, p₂) and q = (q₁, q₂), d(p, q) = |p₂ - q₂| if p₁ = q₁, and d(p, q) = ∞ otherwise.
Curve simplification using the Fréchet error measure was first proposed by Godau [92],
who showed that κ_F(ε, P) ≤ κ_W(ε/7, P). Alt and Godau [16] also proposed an O(mn)-time
algorithm to determine whether δ_F(P, Q) ≤ ε for given polygonal curves P and Q of
size m and n, respectively, and for a given error parameter ε > 0. Following the approach
of Imai and Iri [109], an ε-simplification of P, under the Fréchet error measure, of size
κ_F(ε, P) can be computed in O(n³) time.

The problem of developing an optimal near-linear ε-simplification algorithm remains
elusive. Among the several heuristics that have been proposed over the years, the most
widely used is the Douglas-Peucker method [75] (together with its variants). Originally
proposed for simplifying curves under the Hausdorff error measure, its worst-case running
time is O(n²) in R^d. For R² the running time is improved by Snoeyink et al. [101]
to O(n log n). However, the Douglas-Peucker heuristic does not offer any guarantee on
the size of the simplified curve: it can return an ε-simplification of size Θ(n) even if
κ(ε, P) = O(1).

Much work has been done on computing a weak ε-simplification of a polygonal curve
P. Imai and Iri [108] give an optimal O(n)-time algorithm for finding an optimal weak
ε-simplification (under the Hausdorff error measure) of an x-monotone curve in R². As for
weak ε-simplification of planar curves under the Fréchet distance, Guibas et al. [96] proposed
an O(n log n)-time factor-2 approximation algorithm and an O(n²)-time exact algorithm.
They also proposed linear-time algorithms for approximating some other variants of weak
simplifications.
The problem of simplifying curves becomes much harder when additional constraints
such as topology preservation and non-intersection requirements are introduced. Given
a set of non-intersecting curves, the problem of simplifying the curves optimally so that
the simplified curves are also non-intersecting is NP-hard; in fact, it is hard to approximate
within a factor of n^{1/5-δ}, for any δ > 0, where n is the total number of vertices of
the curves [83]. Guibas et al. [96] show that the problem of computing an optimal non-
intersecting simplification of a simple polygon is NP-hard, and that computing the optimal
weak simplification of a set of non-intersecting curves is also NP-hard.
Our results. Let P be a polygonal curve in R^d, and let ε > 0 be a parameter. In Section
3.3, we present simple, near-linear algorithms for computing ε-simplifications of P
with size at most κ_F(ε/2, P) under the Fréchet error measure.

Theorem 3.2.1 Let P be a polygonal curve in R^d with n vertices, and let ε > 0 be a
parameter. We can compute in O(n log n) time a simplification P' of P of size at most
κ_F(ε/2, P) so that E_F(P', P) ≤ ε, assuming that the distance between points is measured
in any L_p-metric.

To our knowledge, this is the first simple, near-linear approximation algorithm for curve
simplification with guaranteed quality that extends to R^d for arbitrary curves. We
illustrate its simplicity and efficiency by comparing its performance with the Douglas-Peucker
and exact algorithms in Section 3.4. Our experimental results on various data sets show
that our algorithm is efficient and produces ε-simplifications of near-optimal size.

Also in Section 3.3, we compare curve simplification under the Hausdorff and Fréchet
error measures, and we show that κ_F(ε, P) ≤ κ_W(ε/4, P), thereby improving the result by
Godau [92].
3.3 Fréchet Simplification

Let P = ⟨p₁, ..., p_n⟩ be a polygonal curve in R^d, and let ε > 0 be a parameter. In this
section, we first prove a few properties of the Fréchet error measure. We then present an
approximation algorithm for simplification under the Fréchet error measure. At the end of
this section, we compare Fréchet simplification with some other versions of simplification.
Let δ_F(·, ·) be as defined in (3.1), and let d(·, ·) denote the Euclidean distance between
two points in R^d.

Lemma 3.3.1 Given two directed segments ab and cd in R^d,

    δ_F(ab, cd) = max{d(a, c), d(b, d)}.

PROOF. Let δ = max{d(a, c), d(b, d)}. First, δ_F(ab, cd) ≥ δ, since a (resp. b) has to
be matched to c (resp. d). Assume the natural parameterization A: [0, 1] → ab for the segment ab,
such that A(t) = a + t(b - a). Similarly, define C: [0, 1] → cd for
the segment cd, such that C(t) = c + t(d - c). For any two matched points A(t) and C(t), let

    f(t) = d(A(t), C(t)).

Since f(t) is a convex function, f(t) ≤ max{f(0), f(1)} = δ for any t ∈ [0, 1]. Therefore δ_F(ab, cd) ≤ δ.
Lemma 3.3.2 Given a polygonal curve P in R^d and two directed segments ab and cd,

    |δ_F(ab, P) - δ_F(cd, P)| ≤ δ_F(ab, cd).

PROOF. Assume that A: [0, 1] → ab is the natural parameterization of ab, and let P(t),
t ∈ [0, 1], be a parameterization of the polygonal curve P such that A(t) and P(t) realize
the Fréchet distance between P and ab. As in the proof of Lemma 3.3.1, let C: [0, 1] → cd
be the natural parameterization of cd, so that A(t) and C(t) are matched points. By the triangle
inequality, for any t ∈ [0, 1], d(C(t), P(t)) ≤ d(C(t), A(t)) + d(A(t), P(t)) ≤ δ_F(ab, cd) + δ_F(ab, P), yielding the
lemma.
Figure 3.1: The dashed curve is π(p_i, p_j); the vertices p_k and p_l are mapped to the points p̂_k
and p̂_l on the segment p_i p_j, respectively.
Lemma 3.3.3 Let P = ⟨p_1, p_2, ..., p_n⟩ be a polygonal curve in ℝ^d. For i ≤ k ≤ l ≤ j,
δ_F(p_k p_l, π(p_k, p_l)) ≤ 2 δ_F(p_i p_j, π(p_i, p_j)).
PROOF. Let δ = δ_F(p_i p_j, π(p_i, p_j)). Let α : π(p_i, p_j) → p_i p_j be the Fréchet map from π(p_i, p_j) to
p_i p_j (see Section 3.1 for the definition). For any vertex p_l of π(p_i, p_j), set p̂_l = α(p_l);
see Figure 3.1 for an illustration. By definition, δ_F(p̂_k p̂_l, π(p_k, p_l)) ≤ δ. In particular,
d(p_k, p̂_k), d(p_l, p̂_l) ≤ δ. By Lemma 3.3.1, δ_F(p_k p_l, p̂_k p̂_l) ≤ δ. It then follows from
Lemma 3.3.2 that
δ_F(p_k p_l, π(p_k, p_l)) ≤ δ_F(p̂_k p̂_l, π(p_k, p_l)) + δ_F(p_k p_l, p̂_k p̂_l) ≤ 2δ.

3.3.1 Algorithm
Our simplification algorithm is a greedy approach (Figure 3.2). Suppose we have already
added ⟨p_{i_1}, p_{i_2}, ..., p_{i_k}⟩ to P′. We then find an index j ≥ i_k such that
(i) δ_F(p_{i_k} p_j, π(p_{i_k}, p_j)) ≤ ε and (ii) δ_F(p_{i_k} p_{j+1}, π(p_{i_k}, p_{j+1})) > ε.
We set i_{k+1} = j and add p_{i_{k+1}} to P′. We repeat this process
until we encounter p_n. We then add p_n to P′.
ALGORITHM GreedyFrechetSimp ( P, ε )
Input: P = ⟨p_1, ..., p_n⟩; ε > 0
Output: an ε-simplification P′ ⊆ P of P
begin
  k := 1; i_1 := 1; P′ := ⟨p_1⟩;
  while ( i_k < n ) do
    r := 0;
    while ( i_k + 2^(r+1) ≤ n and δ_F(p_{i_k} p_{i_k + 2^(r+1)}, π(p_{i_k}, p_{i_k + 2^(r+1)})) ≤ ε ) do
      r := r + 1;
    end while
    low := 2^r; high := min( 2^(r+1), n − i_k );
    while ( low < high ) do
      mid := ⌈(low + high) / 2⌉;
      if ( δ_F(p_{i_k} p_{i_k + mid}, π(p_{i_k}, p_{i_k + mid})) ≤ ε ) low := mid;
      else high := mid − 1;
    end while
    i_{k+1} := i_k + low; P′ := P′ ∘ ⟨p_{i_{k+1}}⟩; k := k + 1;
  end while
end

Figure 3.2: Computing an ε-simplification under the Fréchet error measure.
Alt and Godau [16] have developed an algorithm that, given a pair (i, j) and ε, can
determine in O(j − i) time whether δ_F(p_i p_j, π(p_i, p_j)) ≤ ε. Therefore, a first approach would be
to add vertices greedily one by one, starting with the first vertex p_1, and testing each edge
p_{i_k} p_j, for j > i_k, by invoking the Alt-Godau algorithm, until we find the index j. However,
the overall algorithm could take Ω(n²) time. To limit the number of times that the Alt-Godau
algorithm is invoked when computing the index i_{k+1}, we proceed as follows.
First, by an exponential search, we determine an integer r so that
δ_F(p_{i_k} p_{i_k + 2^r}, π(p_{i_k}, p_{i_k + 2^r})) ≤ ε and
δ_F(p_{i_k} p_{i_k + 2^(r+1)}, π(p_{i_k}, p_{i_k + 2^(r+1)})) > ε. Next, by performing a binary search in the
interval [2^r, 2^(r+1)], we determine an integer l ∈ [2^r, 2^(r+1)] such that
δ_F(p_{i_k} p_{i_k + l}, π(p_{i_k}, p_{i_k + l})) ≤ ε and
δ_F(p_{i_k} p_{i_k + l + 1}, π(p_{i_k}, p_{i_k + l + 1})) > ε.
Note that in the worst case, the asymptotic costs of the exponential and binary searches
are the same. Set i_{k+1} = i_k + l. See Figure 3.2 for the pseudo-code of this algorithm. Since
computing the value of i_{k+1} requires invoking the Alt-Godau algorithm O(log(i_{k+1} − i_k)) times,
each with a pair (i, j) such that j − i ≤ 2(i_{k+1} − i_k), the total time spent in computing the value of i_{k+1}
is O((i_{k+1} − i_k) log(i_{k+1} − i_k)). Hence, the overall running time of the algorithm is O(n log n).
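The exponential-plus-binary search above can be sketched in Python. This is a toy
illustration (the function names are ours, not from the thesis), and it swaps the Alt-Godau
decision procedure for a simpler stand-in: the deviation of the subchain from the segment
under a fixed arc-length matching, which upper-bounds the Fréchet error. The search
structure, however, is the one described above.

```python
import numpy as np

def seg_error(P, i, j):
    """Surrogate for the Alt-Godau test: deviation of the subchain P[i..j]
    from segment P[i]P[j], matching each vertex to the point of the segment
    at the same fraction of arc length (upper-bounds the Frechet error)."""
    if j - i < 2:
        return 0.0
    sub = P[i:j + 1]
    steps = np.linalg.norm(np.diff(sub, axis=0), axis=1)
    arc = np.concatenate(([0.0], np.cumsum(steps)))
    t = (arc / arc[-1])[:, None] if arc[-1] > 0 else np.zeros((len(sub), 1))
    proj = (1 - t) * P[i] + t * P[j]
    return float(np.linalg.norm(sub - proj, axis=1).max())

def greedy_frechet_simp(P, eps):
    """Greedy simplification: from vertex i, exponential search doubles the
    jump while the segment error stays <= eps; binary search then pins down
    an admissible jump length."""
    n = len(P)
    out, i = [0], 0
    while i < n - 1:
        if seg_error(P, i, n - 1) <= eps:
            break
        step = 1
        while i + 2 * step < n and seg_error(P, i, i + 2 * step) <= eps:
            step *= 2
        lo, hi = step, min(2 * step, n - 1 - i)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if seg_error(P, i, i + mid) <= eps:
                lo = mid
            else:
                hi = mid - 1
        i += lo
        out.append(i)
    if out[-1] != n - 1:
        out.append(n - 1)
    return out

P = np.array([[k, 0.1 * (-1) ** k] for k in range(20)], dtype=float)
coarse = greedy_frechet_simp(P, 0.25)   # zigzag flattens to one segment
fine = greedy_frechet_simp(P, 0.05)     # no segment can span the noise
```

Every segment the sketch outputs satisfies the ε-test, mirroring invariant (i) of the greedy
scheme; only the oracle differs from the one used in the thesis.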
Theorem 3.3.4 Given a polygonal curve P = ⟨p_1, ..., p_n⟩ in ℝ^d and a parameter ε ≥ 0,
we can compute in O(n log n) time an ε-simplification P′ of P under the Fréchet error
measure so that |P′| ≤ κ_F(ε/2, P).
PROOF. Compute P′ by the greedy algorithm described above. By construction, P′ is an
ε-simplification of P, so it suffices to prove that |P′| ≤ κ_F(ε/2, P). Let P′ = ⟨p_{i_1}, p_{i_2}, ..., p_{i_m}⟩, and let
P″ = ⟨p_{j_1}, p_{j_2}, ..., p_{j_s}⟩ be an optimal (ε/2)-simplification of P of size s = κ_F(ε/2, P).
We claim that i_l ≥ j_l for all l. This would imply that |P′| ≤ κ_F(ε/2, P).
We prove the above claim by induction on l. For l = 1, the claim is obviously
true because i_1 = j_1 = 1. Suppose i_{l−1} ≥ j_{l−1}. If i_l ≥ j_l, we are done. So
assume that i_l < j_l. Since P″ is an (ε/2)-simplification, δ_F(p_{j_{l−1}} p_{j_l}, π(p_{j_{l−1}}, p_{j_l})) ≤ ε/2.
Lemma 3.3.3 implies that for all j_{l−1} ≤ u ≤ v ≤ j_l, δ_F(p_u p_v, π(p_u, p_v)) ≤ ε. But by
construction, δ_F(p_{i_{l−1}} p_{i_l + 1}, π(p_{i_{l−1}}, p_{i_l + 1})) > ε; since j_{l−1} ≤ i_{l−1} and
i_l + 1 ≤ j_l, this is a contradiction, and thus i_l ≥ j_l.
Remark. Our algorithm works within the same running time even if we measure the
distance between two points in any L_p-metric.
3.3.2 Comparisons
Figure 3.3: (a) Polygonal chain (a piece of protein backbone) composed of three alpha-helices, (b)
its Fréchet ε-simplification and (c) its Hausdorff ε-simplification.
Hausdorff vs. Fréchet. One natural question is to compare the quality of simplifications
produced under the Hausdorff and the Fréchet error measures. Given a curve P and indices
i ≤ j, it is not too hard to show that δ_H(p_i p_j, π(p_i, p_j)) ≤ δ_F(p_i p_j, π(p_i, p_j)) under any
L_p-metric, which implies that κ_H(ε, P) ≤ κ_F(ε, P). The converse, however, does not
hold, and there are polygonal curves P and values of ε for which κ_H(ε, P) = O(1) and
κ_F(ε, P) = Ω(n).
The Fréchet error measure takes the order along the curve into account, and hence is
more useful in cases where the order of the curve is important (such as for
curves derived from protein backbones). Figure 3.3 illustrates a substructure of a protein
backbone, where the ε-simplification under the Fréchet error measure preserves the overall
structure, while the ε-simplification under the Hausdorff error measure is unable to preserve
it.
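The difference is easy to reproduce numerically. The sketch below (our own code, using the
discrete Fréchet distance as a computable stand-in for the continuous one) builds a curve that
backtracks along a segment: the Hausdorff distance to dense samples of the segment is
essentially zero, while the Fréchet distance stays large because a monotone matching cannot
follow the backtrack.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def discrete_frechet(A, B):
    """Discrete Frechet distance by dynamic programming."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = D.shape
    C = np.full((n, m), np.inf)
    C[0, 0] = D[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(C[i - 1, j] if i > 0 else np.inf,
                       C[i, j - 1] if j > 0 else np.inf,
                       C[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            C[i, j] = max(prev, D[i, j])
    return C[-1, -1]

def sample_chain(V, step=0.1):
    """Dense samples along a polygonal chain with vertices V."""
    pts = []
    for a, b in zip(V[:-1], V[1:]):
        k = max(int(np.ceil(np.linalg.norm(b - a) / step)), 1)
        for t in np.linspace(0, 1, k, endpoint=False):
            pts.append((1 - t) * a + t * b)
    pts.append(V[-1])
    return np.array(pts)

# Curve that runs 0 -> 9 -> 1 -> 10 along the x-axis, vs. the segment 0 -> 10.
curve = sample_chain(np.array([[0.0, 0.0], [9.0, 0.0], [1.0, 0.0], [10.0, 0.0]]))
seg = sample_chain(np.array([[0.0, 0.0], [10.0, 0.0]]))
h = hausdorff(curve, seg)         # ~0: every point lies on the segment
f = discrete_frechet(curve, seg)  # ~4: the matching must park near x = 5
```

Here the Hausdorff distance is essentially zero while the Fréchet distance is about 4: the
monotone matching cannot follow the curve back from x = 9 to x = 1.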
We remark that the Douglas-Peucker algorithm is also based on the Hausdorff error measure.
Therefore the above discussion applies to it as well.
Weak-Fréchet vs. Fréchet. In Section 3.3.1 we described a fast approximation algorithm
for computing an ε-simplification of P under the Fréchet error measure, where we used the
Fréchet measure in a local manner: we restrict the subcurve π(p_i, p_j) to match to the line
segment p_i p_j. We can remove this restriction to make the measure more global by
considering weak ε-simplification. More precisely, given P and P′ = ⟨q_1, q_2, ..., q_k⟩, where q_i
does not necessarily lie on P, P′ is a weak ε-simplification of P under the Fréchet error measure if
δ_F(P, P′) ≤ ε. The following theorem shows that for the Fréchet error measure, the size of the
optimal ε-simplification (κ_F(ε, P)) can be bounded in terms of the size of the optimal weak
ε-simplification (κ̃_F(ε, P)):
Figure 3.4: Relationship between κ_F(ε, P) and κ̃_F(ε, P): (a) the vertices w_i, w_{i+1} of the weak
simplification W and the corresponding vertices v_i, v_{i+1} of P; (b) the map ν, which sends ŵ_i to v_i
and w_i to ν(w_i); (c) the points x = ν(w_i) and y = ν′(w_{i+1}) on the line through v_i v_{i+1}.
Theorem 3.3.5 Given a polygonal curve P,
κ_F(ε, P) ≤ 2 κ̃_F(ε/4, P) ≤ 2 κ_F(ε/4, P).
PROOF. The second inequality is obvious, so we prove the first one. Set δ = ε/4, and let
W = ⟨w_1, w_2, ..., w_k⟩ be an optimal weak δ-simplification of P, and let α : P → W be a Fréchet map so
that d(x, α(x)) ≤ δ for all x ∈ P (see Section 3.1 for the definition). For 1 ≤ i ≤ k,
let e_i = p_{c_i} p_{c_i + 1} be the edge of P whose image under α contains w_i, and let v_i be the
endpoint of e_i whose image is closer to w_i. We set P′ = ⟨v_1, v_2, ..., v_k⟩; we remove a vertex from
this sequence if it is the same as its predecessor. Clearly, P′ ⊆ P, so P′ is a simplification of P.
Next we show that δ_F(v_i v_{i+1}, π(v_i, v_{i+1})) ≤ ε for all 1 ≤ i < k, which then implies the theorem.
Let ŵ_i = α(v_i) and ŵ_{i+1} = α(v_{i+1}). See Figure 3.4 (a) for an illustration.
Claim 3.3.6 δ_F(v_i v_{i+1}, ŵ_i ŵ_{i+1}) ≤ δ, and δ_F(ŵ_i ŵ_{i+1}, w_i w_{i+1}) ≤ 2δ.
PROOF. By construction, d(v_i, ŵ_i), d(v_{i+1}, ŵ_{i+1}) ≤ δ, so the first bound follows from
Lemma 3.3.1. On the other hand, since W is a weak δ-simplification of P and v_i (resp. v_{i+1}) is an
endpoint of the edge matched to w_i (resp. w_{i+1}), we have d(ŵ_i, w_i), d(ŵ_{i+1}, w_{i+1}) ≤ 2δ; the
second bound then follows from Lemmas 3.3.1 and 3.3.2.
Let β : π(v_i, v_{i+1}) → W(ŵ_i, ŵ_{i+1}) be a Fréchet map such that d(x, β(x)) ≤ δ for all
x ∈ π(v_i, v_{i+1}), where W(ŵ_i, ŵ_{i+1}) denotes the portion of W between ŵ_i and ŵ_{i+1}.
Let ℓ be the line containing v_i v_{i+1}. We define a map ν : ŵ_i w_i → ℓ that maps a point x ∈ ŵ_i w_i
to the intersection point of ℓ with the line through x and parallel to the segment ŵ_i v_i. See
Figure 3.4 (b) for an illustration. Note that ν(ŵ_i) = v_i, and for any x ∈ ŵ_i w_i,
d(x, ν(x)) ≤ max( d(ŵ_i, v_i), d(w_i, ν(w_i)) ); this is the case depicted in Figure 3.4. Hence, we can
infer that

δ_F( ŵ_i w_i, v_i ν(w_i) ) ≤ max( d(ŵ_i, v_i), d(w_i, ν(w_i)) ).   (3.2)

Similarly, we define a map ν′ : w_{i+1} ŵ_{i+1} → ℓ that maps a point x ∈ w_{i+1} ŵ_{i+1} to the
intersection point of ℓ with the line through x and parallel to the segment ŵ_{i+1} v_{i+1}. As above,

δ_F( w_{i+1} ŵ_{i+1}, ν′(w_{i+1}) v_{i+1} ) ≤ max( d(w_{i+1}, ν′(w_{i+1})), d(ŵ_{i+1}, v_{i+1}) ).   (3.3)

Let x (resp. y) denote the point ν(w_i) (resp. ν′(w_{i+1})); see Figure 3.4 (c).
Claim 3.3.7 δ_F(v_i v_{i+1}, x y) ≤ δ.
PROOF. Since d(v_i, x), d(v_{i+1}, y) ≤ δ, the claim follows from Lemma 3.3.1.
Claim 3.3.8 δ_F( ⟨v_i, x, y, v_{i+1}⟩, W(ŵ_i, ŵ_{i+1}) ) ≤ 2δ.
PROOF. By the definition of the Fréchet distance,
δ_F( ⟨v_i, x, y, v_{i+1}⟩, W(ŵ_i, ŵ_{i+1}) ) ≤
max{ δ_F(v_i x, ŵ_i w_i), δ_F(x y, w_i w_{i+1}), δ_F(y v_{i+1}, w_{i+1} ŵ_{i+1}) }.
Observe that d(w_i, x) = d(w_i, ν(w_i)) and d(w_{i+1}, y) = d(w_{i+1}, ν′(w_{i+1})). It follows from the
definition of the map ν and Claim 3.3.6 that d(w_i, ν(w_i)), d(w_{i+1}, ν′(w_{i+1})) ≤ 2δ. Hence, by
(3.2) and (3.3), the first and the third terms are at most 2δ, and by Lemma 3.3.1 the second term is at
most 2δ as well. This proves the claim.
Claims 3.3.7 and 3.3.8 along with Lemma 3.3.2 imply that δ_F(v_i v_{i+1}, π(v_i, v_{i+1})) ≤ 4δ = ε, and
therefore κ_F(ε, P) ≤ |P′| ≤ 2 κ̃_F(ε/4, P). This completes the proof of the theorem.
3.4 Experiments
We have implemented our simplification algorithm, GreedyFrechetSimp, and the
� � � � -time optimal Frechet-simplification algorithm (referred to as Exact) by construct-
ing the shortest path in certain graphs, outlined in Section 3.2, that computes an � -simplification
of � of size � ��� ����� . In this section, we measure the performance of our algorithm in terms
of the output size and the running time.
Data sets. We test our algorithms on two different types of data sets, each of which is a
family of polygonal curves in ℝ³.
• Protein backbones. The first set of curves is derived from protein backbones by
adding small random “noise” vertices along the edges of the backbone. We have
chosen two backbones from the protein data bank [2]. Protein A is the first backbone,
with 327 vertices. Protein B is the second backbone with random vertices added, for a
total of 9,777 vertices: the new vertices are uniformly distributed along the curve
edges, and then perturbed slightly in a random direction.
• Stock-index curves. The second family of curves is generated from the daily NASDAQ
index values over the period January 1971 to January 2003 (the data is obtained
from [1]). We take a pair of index curves x(t) and y(t) and generate the curve
γ(t) = (x(t), y(t), t) in ℝ³. In particular, we take the telecommunication index and the
bio-technology index as the x- and y-coordinates and time as the z-coordinate to
construct the curve Tel-Bio in ℝ³. In the second case, we take the transportation index, the
telecommunication index, and time as the x-, y-, and z-coordinates, respectively, to construct
the curve Trans-Tel in ℝ³.
These two families of curves have different structures. The stock-index curves ex-
hibit an easily distinguishable global trend; however, locally there is a lot of noise. The
protein curves, though coiled and irregular-looking, exhibit local patterns that represent the
structural elements of the protein backbones (commonly referred to as the secondary struc-
tures). In each of these cases, simplification helps identify certain patterns (e.g., secondary
structure elements) and trends in the data.
Output size. We compare the quality (size) of the simplifications produced by our algorithm
(GreedyFrechetSimp) and the optimal algorithm (Exact) in Figure 3.5 for curves from
the above two families. The simplifications produced by our algorithm are
almost always close to the optimal size.
To provide a visual picture of the simplifications produced by various (commonly used)
algorithms for curves in ℝ³, Figure 3.8 shows the simplifications of Protein A computed
by GreedyFrechetSimp, the exact (i.e., optimal) Fréchet simplification algorithm, and
the Douglas-Peucker heuristic (using the Hausdorff error measure under the L₂ metric).
Running time. As the running time of the optimal algorithm is orders of magnitude
larger than that of our algorithm, we instead compare the efficiency of GreedyFrechetSimp with
the widely used Douglas-Peucker simplification algorithm under the Hausdorff measure –
Output size

            Protein A (327)           Protein B (9,777)
  ε     GreedyFrechetSimp  Exact  GreedyFrechetSimp  Exact
 0.05         327           327         6786          6431
 0.12         327           327         1537           651
 1.20         254           249          178           168
 1.60         220           214          140           132
 2.00         134           124          115            88
 5.00          37            36           41            39
 10.0          22            22           24            20
 20.0          10             8            8             6
 50.0           2             2            2             2

(a)

Output size

            Trans-Tel (7,057)         Tel-Bio (1,559)
  ε     GreedyFrechetSimp  Exact  GreedyFrechetSimp  Exact
 0.05        6882          6880         1558          1558
 0.50        4601          4469         1473          1471
 1.20        2811          2637         1292          1279
 3.00        1396          1228          974           942
 5.00         890           732          772           720
 10.0         414           329          490           402
 20.0         168           124          243           200
 50.0          47            35           94            73

(b)

Figure 3.5: The sizes of Fréchet simplifications on (a) protein data and (b) stock-index data.
Running time (ms.)

            Protein A (327)           Protein B (9,777)
  ε     GreedyFrechetSimp   DP    GreedyFrechetSimp   DP
 0.05          3             16          146          772
 0.50          3             16          171          524
 1.20          4             16          176          488
 1.60          5             12          202          394
 2.00          5             11          210          354
 5.00          5             11          209          356
 10.0          5             10          222          329
 20.0          5              8          233          263
 50.0          2              1           87           50

(a)

Running time (ms.)

            Trans-Tel (7,057)         Tel-Bio (1,559)
  ε     GreedyFrechetSimp   DP    GreedyFrechetSimp   DP
 0.05         82            599          16           113
 0.50        103            580          17           114
 1.20        113            559          19           113
 3.00        119            510          22           109
 5.00        121            472          24           109
 10.0        127            411          25            96
 20.0        146            360          27            85
 50.0        162            271          27            71

(b)

Figure 3.6: The running times of GreedyFrechetSimp and the Douglas-Peucker algorithm on (a)
protein data and (b) stock-index data.
[Two plots: running time (secs) versus error ε ∈ [0, 50] for the DP and FS (GreedyFrechetSimp)
algorithms on Protein B.]
Figure 3.7: Comparison of running time of GreedyFrechetSimp and DP algorithms for varying ε on
Protein B.
we can extend the Douglas-Peucker algorithm to simplify curves under the Fréchet error
measure; however, such an extension is inefficient and can take Ω(n²) time in the worst
case.
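For reference, the classic Douglas-Peucker recursion under the Hausdorff-style
vertex-to-segment error can be sketched in a few lines (our own illustrative code); it
recurses only when the farthest vertex violates ε, which is why its running time drops as ε
grows.

```python
import numpy as np

def point_seg_dist(p, a, b):
    """Distance from point p to segment ab."""
    ab = b - a
    denom = float(ab.dot(ab))
    t = 0.0 if denom == 0.0 else float(np.clip((p - a).dot(ab) / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def douglas_peucker(P, eps):
    """Keep the vertex farthest from the chord P[0]P[-1]; recurse on the two
    halves only if that distance exceeds eps."""
    if len(P) < 3:
        return list(P)
    d = [point_seg_dist(P[k], P[0], P[-1]) for k in range(1, len(P) - 1)]
    k = int(np.argmax(d)) + 1
    if d[k - 1] <= eps:
        return [P[0], P[-1]]
    left = douglas_peucker(P[:k + 1], eps)
    right = douglas_peucker(P[k:], eps)
    return left[:-1] + right

P = [np.array([k, 0.1 * (-1) ** k]) for k in range(10)]
flat = douglas_peucker(P, 0.5)    # noise below eps: single segment survives
full = douglas_peucker(P, 0.05)   # every zigzag vertex is retained
```

As soon as a chord fits, a whole subchain is discarded without further work, so larger ε
means fewer recursive calls, matching the timing trend discussed in the text.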
Figure 3.6 illustrates the running times of the two algorithms. Note that as ε increases,
resulting in smaller simplified curves, the running time of Douglas-Peucker decreases. This
phenomenon is further illustrated in Figure 3.7, which compares the running time of our
algorithm with that of Douglas-Peucker on Protein B (with artificial noise added), which has
9,777 vertices. The phenomenon is due to the fact that, at each step, the DP algorithm
determines whether a line segment p_i p_j simplifies π(p_i, p_j). The algorithm recursively solves
two subproblems only if δ(p_i p_j, π(p_i, p_j)) > ε. Thus, as ε increases, it needs to make fewer
recursive calls. Our algorithm, however, proceeds in a linear fashion from the first vertex
to the last vertex using exponential and binary search. Suppose the algorithm returns
P′ = ⟨p_{i_1}, p_{i_2}, ..., p_{i_m}⟩ for an input polygonal curve P = ⟨p_1, ..., p_n⟩. The exponential
search takes O(n) time in total, while the binary search takes O(Σ_j μ_j log μ_j) time, where
μ_j = i_{j+1} − i_j and m is the number of vertices of P′. Thus as ε increases, each μ_j increases,
and therefore the time for binary search increases, as Figure 3.7 illustrates. Note, however, that if ε
is so large that m = 2, i.e., the simplification is just one line segment connecting p_1 to p_n, the
algorithm does not perform any binary search and is much faster, as the case ε = 50 illustrates
in Figure 3.7.
3.5 Notes and Discussions
In this chapter, we have proposed and developed a simple near-linear time curve simplification
algorithm. Besides being the first near-linear simplification algorithm for curves
in ℝ^d, our algorithm tends to preserve long but relatively skinny features (Figure 3.3), as we
use the Fréchet error measure. This property is desirable when simplifying protein backbones,
as it helps to maintain a trace of secondary structure elements such as alpha-helices
and beta-strands in the simplified structures.
It would be interesting to see whether our algorithm can help to produce an automatic
method to identify protein secondary structure elements. It is also possible to generate
a level-of-detail representation of a protein backbone via simplification and compute its
writhing number (or other shape descriptors) at different scales, in order to characterize
protein structures. Such a level-of-detail representation can also be used when comparing
protein backbones, as current structural alignment methods are usually of high computational
complexity.
We end this chapter by mentioning a few open problems related to curve simplification.
(i) Does there exist a near-linear algorithm for computing an ε-simplification of size at
most c · κ_F(ε, P) for a polygonal curve P, where c is a constant?
(ii) Is it possible to compute the optimal ε-simplification under the Hausdorff error measure
in near-linear time, or under the Fréchet error measure in sub-cubic time?
(iii) Is there any provably efficient exact/approximation algorithm for curve simplification
in ℝ^d that returns a simple curve if the input curve is simple?
[Image grid: simplifications of Protein A computed by GreedyFrechetSimp, EXACT, and DP at
increasing values of ε.]
Figure 3.8: Simplifications of a protein (Protein A) backbone.
Chapter 4
Elevation Function
4.1 Introduction
The starting point of the work described in this chapter is the desire to identify features
that are useful in finding a fit between solid shapes in ℝ³. We are looking for cavities
and protrusions and for a way to measure their size. The problem is made difficult by the
interaction of these features, which typically exist at various scale levels. We therefore
take an indirect approach, defining a real-valued function on the surface that is sensitive to
the features of the shape. We call this the elevation function because it has similarities to
the elevation measured on the surface of the Earth, but the problem for general surfaces is
more involved and the analogy is not perfect.
Related work in protein docking. The primary motivation for designing the elevation function
to characterize protein surfaces is protein docking, which is the computational approach
to predicting protein interactions, a biophysical phenomenon at the very core of life.
The phenomenon is clearly important, and the interest in protein docking is correspondingly
widespread. The related work on attacking the docking problem will be surveyed in Chapter
6; here we only mention some survey articles on docking algorithms [81, 97, 114].
The idea of docking by matching cavities with protrusions goes back to Crick [69] and
Connolly [67]. Connolly also introduced the idea of using the critical points of a real-
valued function defined on the protein surface to identify cavities and protrusions. The
particular function he used is the fraction of a fixed-size sphere that is buried inside the
protein volume as we move the sphere center on the protein surface. In the limit, when
the size of the sphere goes to zero, this function has the same critical points as the mean
curvature function [48]. A similar but different function suggested for the same purpose
is the atomic density [136]. Here we take the buried fraction of the ball bounded by the
sphere but we also vary its radius from zero to about ten Angstrom. At every point of the
protein surface, the function value is the fraction of buried volume averaged over the balls
centered at that point.
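As a toy illustration of this construction (our own sketch — the three-ball “molecule” and the
chosen surface point are made up for the example, not data from this thesis), the buried
fraction of a ball B(x, r) inside a union of atom balls can be estimated by Monte Carlo
sampling and then averaged over a range of radii:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "molecule": a union of three balls (centers, radii).
centers = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.7, 1.2, 0.0]])
radii = np.array([1.0, 0.9, 0.8])

def buried_fraction(x, r, n=20000):
    """Monte Carlo estimate of vol(B(x, r) ∩ molecule) / vol(B(x, r))."""
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    u = rng.random(n) ** (1.0 / 3.0)       # radii ~ r * U^(1/3): uniform in the ball
    pts = x + (r * u)[:, None] * v
    dist = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    return float((dist <= radii).any(axis=1).mean())

x = np.array([-1.0, 0.0, 0.0])             # a point on the boundary of the first ball
fractions = [buried_fraction(x, r) for r in np.linspace(0.5, 3.0, 6)]
elev_like = float(np.mean(fractions))      # buried volume averaged over the radii
```

The outer average mirrors the idea of averaging the buried fraction over balls of growing
radius; real input would use atom centers and radii read from a PDB file.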
Our results. The main contribution of this chapter is the description and computation
of a new type of feature points that mark extreme cavities and protrusions on a surface
embedded in ℝ³. More specifically,
• we extend the concept of topological persistence [77] to form a pairing between all
critical points of a function on a 2-manifold embedded in ℝ³;
• we use the pairings obtained for a 2-parameter family of height functions to define
the elevation function on the 2-manifold;
• we classify the generic local maxima of the elevation function into four types;
• we develop and implement an algorithm that computes all local maxima of the elevation
function.
The elevation function differs from Connolly’s function and from the atomic density function in two
major ways: it is independent of scale, and it provides, beyond location, estimates for the direction
and size of shape features. Both additional pieces of information are useful in shape characterization
and matching. The four generic types of local maxima are illustrated in Figure 4.1.
In each but the first case, the maximum is obtained at an ambiguity in the pairing of critical
points. In all cases, the endpoints of the legs share the same normal line, and the legs have
the same length if measured along that line. The case analysis is delicate and aided by a
transformation of the original 2-manifold to its pedal surface, which maps tangent planes
to points and thus expresses points with common tangent planes as self-intersections of the
Figure 4.1: From left to right: a one-, two-, three-, and four-legged local maximum of the elevationfunction. In the examples shown, the outer normals at the endpoints of the legs are all parallel (thesame). Each of the four types also exists with anti-parallel outer normals in various combinations.
pedal surface. The algorithm we describe for enumerating all local maxima is inspired by
our analysis of the smooth case but works on piecewise linear data.
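The pedal construction can be written down directly: with respect to a chosen origin, a
surface point x with unit normal n is sent to the foot of the perpendicular from the origin
onto the tangent plane at x, i.e. to (x · n) n. A minimal sketch (our own code) on a sampled
sphere:

```python
import numpy as np

def pedal(points, normals):
    """Pedal transform w.r.t. the origin: each surface point x with unit
    normal n maps to the foot of the perpendicular from the origin to the
    tangent plane at x, i.e. (x . n) n."""
    h = np.einsum('ij,ij->i', points, normals)   # signed plane offsets <x, n>
    return h[:, None] * normals

# Sample an origin-centered sphere of radius 2: its tangent planes are all at
# distance 2 from the origin, so every pedal point again has norm 2.
theta = np.linspace(0.1, np.pi - 0.1, 8)
phi = np.linspace(0, 2 * np.pi, 16, endpoint=False)
T, Ph = np.meshgrid(theta, phi)
n = np.stack([np.sin(T) * np.cos(Ph),
              np.sin(T) * np.sin(Ph),
              np.cos(T)], axis=-1).reshape(-1, 3)
pts = 2.0 * n                                    # for this sphere, x = R n
ped = pedal(pts, n)
```

For an origin-centered sphere the pedal surface coincides with the sphere itself; for a general
surface, points with a common tangent plane map to the same pedal point, which is how shared
tangencies become self-intersections of the pedal surface.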
Outline. Section 4.2 defines the pairing of the critical points, based on which we then
introduce the height and elevation as functions on a 2-manifold. Section 4.3 describes a
dual view of these concepts based on the pedal surface of the 2-manifold. Section 4.4 uses
surgery to make elevation continuous and to define a stratified Morse function on the new
2-manifold. We then characterize the four types of generic local maxima of the continuous
elevation function. Section 4.5 sketches an algorithm for enumerating all local maxima.
Section 4.6 presents experimental results for protein data.
4.2 Defining Elevation
4.2.1 Pairing
The elevation function is based on a canonical pairing of the critical points, which we
describe in this section.
Traditional persistence. Let M be a connected and orientable 2-manifold and f : M → ℝ
a smooth function.¹ A point x ∈ M is critical if the derivative of f at x is identically zero,
and it is non-degenerate if the Hessian at the point is invertible. It is convenient to assume
that f is generic:

¹ We remark that the algorithms we describe below work for 2-manifolds with multiple components as well.
We assume there is only one component for simplicity.
I. all critical points are non-degenerate;
II. the critical points have different function values.
A function that satisfies Conditions I and II is usually referred to as a Morse function
[135]. It has three types of critical points: minima, saddles and maxima distinguished by
the number of negative eigenvalues of the Hessian. Imagine we sweep $\mathbb{M}$ in the direction of increasing function value, proceeding along the level sets, each a collection of closed curves. We write $\mathbb{M}_r = f^{-1}(-\infty, r]$ for the swept portion of the 2-manifold. This portion changes its topology whenever the level set passes through a critical point. A component of $\mathbb{M}_r$ starts at a minimum and ends when it merges with another, older component at a saddle. A hole in the 2-manifold starts at a saddle and ends when it is closed off at a maximum. After observing that each saddle either merges two components or starts a new hole, but not both, it is natural to pair up the critical point that starts a component or a hole with the critical point that ends it. This is the main idea of topological persistence introduced in [77]. It is clear that a small perturbation of the function that preserves the sequence of critical events does not affect the pairing, other than by perturbing each pair locally. The method pairs all critical points except for the first minimum, the last maximum, and the $2g$ saddles starting the $2g$ cycles that remain when the sweep is complete. Here $g$ is the genus of $\mathbb{M}$. These $2g + 2$ unpaired critical points are the reason we need an extension to the method, which we describe next.
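The component-merging part of this pairing can be sketched with a union-find sweep. The following illustration (not the thesis implementation; `persistence_pairs` and its input conventions are hypothetical) pairs the minimum that starts a component with the critical point that ends it, for a piecewise-linear function given on the vertices of a graph:

```python
# Illustration (not the thesis implementation): persistence pairing of the
# component-merging events for a piecewise-linear function on the vertices
# of a graph.  A component is born at a local minimum; when the sweep reaches
# a vertex joining two components, the younger (higher) minimum dies there.
def persistence_pairs(values, edges):
    n = len(values)
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    parent = {}   # union-find forest over the already-swept vertices
    birth = {}    # component representative -> index of its oldest minimum

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    pairs = []
    for v in sorted(range(n), key=lambda i: values[i]):   # sweep by value
        parent[v] = v
        birth[v] = v
        for w in adj[v]:
            if w not in parent:             # neighbor not swept yet
                continue
            rv, rw = find(v), find(w)
            if rv == rw:                    # v closes a cycle: ignored here
                continue
            if values[birth[rv]] > values[birth[rw]]:
                young, old = rv, rw
            else:
                young, old = rw, rv
            pairs.append((birth[young], v))  # younger minimum dies at v
            parent[young] = old
    return [(b, d) for b, d in pairs if b != d]   # drop zero-persistence pairs
```

For values [0, 3, 1, 4] on the path 0-1-2-3 this returns [(2, 1)]: the minimum at index 2 is paired with the merge vertex at index 1, while the global minimum stays unpaired, matching the description above.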
Extended persistence. It is natural to pair the remaining minimum with the remaining maximum. The remaining $2g$ saddles contain $g$ up-forking and $g$ down-forking saddles. We wish to pair up-forking saddles with down-forking ones, and this can be achieved in a way that reflects how they introduce cycles during the sweep. This pairing is best described using the Reeb graph, obtained by mapping each component of each level set to a point, as illustrated in Figure 4.2. As proved in [63], the Reeb graph has a basis of $g$
Figure 4.2: Left: a 2-manifold whose points are mapped to the distance above a horizontal plane. Middle: the Reeb graph, in which the critical points of the function appear as degree-1 and degree-3 nodes. The labels indicate the pairing. Right: the tree representing the Reeb graph, built by sweeping from slightly above the top downwards.
cycles such that each cycle is the sum (modulo 2) of a subset of basis cycles. Each cycle has a unique lowest and a unique highest point, referred to as its lo-point and hi-point. We say the lo- and hi-point span this cycle, but note that they may span more than one cycle. There is a one-to-one correspondence between lo- (hi-) points and up- (down-) forking saddles, thereby giving exactly $g$ lo-points and $g$ hi-points. We pair each lo-point $x$ with the lowest hi-point $y$ that spans a cycle with $x$. Note that $x$ is also the highest lo-point that spans a cycle with $y$. Indeed, if this were not the case, then we could add the cycle spanned by $x$ and $y$ to the cycle spanned by $y$ and a lo-point higher than $x$ to get a cycle spanned by $x$ and a hi-point lower than $y$, a contradiction. This implies that each lo-point and each hi-point belongs to exactly one pair, giving a total of $g$ pairs between up- and down-forking saddles, as desired.
The Reeb graph of a piecewise-linear function on a triangulation with $n$ edges can be constructed in time $O(n \log n)$ using the algorithm in [63]. We now describe an algorithm that computes both the traditional persistence pairing and the extended persistence pairing introduced above, given the Reeb graph $R$ of $f$. (The algorithm can in fact construct the Reeb graph and the pairing simultaneously in one sweep; we assume the Reeb graph is given for simplicity.) It simulates the sweep of $f$, maintaining a forest for $\mathbb{M}_r$ during the course. In particular, it takes the following steps upon reaching a critical point $v$ (i.e., a node in $R$), merging two arcs across a degree-2 node whenever one is created.
Case 1: $v$ is a minimum. We add a new tree, consisting of a single node, to the forest.

Case 2: $v$ is an up-forking saddle. We turn the corresponding leaf into an internal node, adding two new leaves as its children.

Case 3: $v$ is a down-forking saddle, connecting leaves $u$ and $w$. We glue the two downward paths from $u$ and $w$ towards their roots, ending the gluing at a node $z$. In one case, $z$ is the root of one tree (the higher one); $z$ is a minimum, which we now pair with $v$ (this corresponds to a traditional persistence pairing). In the other case, $z$ is the lowest common ancestor of $u$ and $w$; $z$ is an up-forking saddle, and we pair it with $v$ (this corresponds to an extended persistence pairing).

Case 4: $v$ is a maximum. We pair it with its parent $z$ and remove the joining edge together with the two nodes; $z$ can be either an up-forking saddle, producing a traditional persistence pairing, or a minimum, producing an extended persistence pairing.
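The four cases can be simulated with a naive forest of explicit parent pointers standing in for the linking and cutting trees used below. The sketch that follows is an illustration under that simplification (its input format, with nodes as (height, kind) tuples and arcs as index pairs, is hypothetical), and it emits the traditional and extended pairs together:

```python
# A naive quadratic-time simulation of the four cases above, with explicit
# parent pointers and linear path walks in place of linking and cutting trees.
# Hypothetical input conventions: nodes[i] = (height, kind) with kind in
# {'min', 'up', 'down', 'max'}; arcs[a] = (lower_node, upper_node).
def sweep_pairing(nodes, arcs):
    n = len(nodes)
    height = lambda i: nodes[i][0]
    lower_arcs = {v: [] for v in range(n)}
    upper_arcs = {v: [] for v in range(n)}
    for a, (lo, hi) in enumerate(arcs):
        upper_arcs[lo].append(a)
        lower_arcs[hi].append(a)

    parent = {}   # forest over the unpaired critical points swept so far
    tip = {}      # arc -> topmost unpaired node below that arc
    pairs = []

    def path_to_root(v):
        p = [v]
        while parent[v] is not None:
            v = parent[v]
            p.append(v)
        return p

    def splice_out(z, substitute):        # remove a newly paired node
        for v in list(parent):
            if parent[v] == z:
                parent[v] = substitute
        for a in tip:
            if tip[a] == z:
                tip[a] = substitute
        del parent[z]

    for v in sorted(range(n), key=height):
        kind = nodes[v][1]
        if kind == 'min':                             # Case 1
            parent[v] = None
            tip[upper_arcs[v][0]] = v
        elif kind == 'up':                            # Case 2
            parent[v] = tip[lower_arcs[v][0]]
            tip[upper_arcs[v][0]] = tip[upper_arcs[v][1]] = v
        elif kind == 'down':                          # Case 3
            u, w = (tip[a] for a in lower_arcs[v])
            pu, pw = path_to_root(u), path_to_root(w)
            if pu[-1] != pw[-1]:
                # different trees: z is the higher root, a minimum
                z = max(pu[-1], pw[-1], key=height)
            else:
                # same tree: z is the lowest common ancestor, an up-fork
                z = [a for a, b in zip(pu[::-1], pw[::-1]) if a == b][-1]
            pairs.append((z, v))
            chain = sorted((set(pu) | set(pw)) - {z}, key=height)
            below = [c for c in chain if height(c) < height(z)]
            splice_out(z, below[-1] if below else None)
            if chain:                                 # glue into one sorted path
                parent[chain[0]] = None
                for lo2, hi2 in zip(chain, chain[1:]):
                    parent[hi2] = lo2
                tip[upper_arcs[v][0]] = chain[-1]
        else:                                         # Case 4: maximum
            z = tip[lower_arcs[v][0]]
            pairs.append((z, v))
            splice_out(z, parent[z])
    return pairs
```

For an upright torus (minimum, up-fork, down-fork, maximum at heights 0 through 3) this yields the pairs (up-fork, down-fork) and (minimum, maximum), both extended, as expected.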
In order to perform these operations efficiently, we use the linking and cutting tree data structure proposed by Sleator and Tarjan [152]. It decomposes the forest into a family of vertex-disjoint paths, and each path is represented using a biased binary search tree. By maintaining a linking and cutting tree, Cases 1, 2 and 4 can be handled in $O(n \log n)$ overall time, so we focus only on Case 3. Given an instance of Case 3, assume that the common ancestor $z$ of $u$ and $w$ exists (the case where it does not exist can be processed similarly). We can find $z$ in $O(\log n)$ time using the operations supported by the linking and cutting tree data structure. The only extra operation we need is to glue the path from $u$ to $z$ with that from $w$ to $z$. Let $k$ and $l$ be the lengths of these two paths, with $k \le l$. We can perform the gluing operation by inserting each node from the shorter path into the longer one, which takes $O(k \log n)$ time as each path is represented using a weighted binary search tree. (In fact, for our algorithm, a balanced binary search tree suffices to achieve the bound below.) Assume that there is a sequence of $m$ gluing operations during the entire sweep, and that the $i$'th operation glues a path of length $k_i$ with one of length $l_i$, where $k_i \le l_i$. The overall time complexity is $\sum_{i=1}^{m} O(k_i \log n)$, which we argue next is $O(n \log^2 n)$.
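The key inequality behind the analysis that follows, $\sum_i k_i \le \log_2 n!$, holds for any sequence of gluings that inserts the shorter path into the longer one. A small simulation (a sketch, not thesis code; `total_shorter_lengths` is a hypothetical helper) can check it:

```python
# Sketch (not thesis code): check the inequality sum_i k_i <= log2(n!) for a
# sequence of gluings that always inserts the shorter path into the longer.
import math
import random

def total_shorter_lengths(n, seed=0):
    rng = random.Random(seed)
    lengths = [1] * n            # start from n single-node paths
    total = 0
    while len(lengths) > 1:
        a = lengths.pop(rng.randrange(len(lengths)))
        b = lengths.pop(rng.randrange(len(lengths)))
        total += min(a, b)       # k_i: cost of inserting the shorter path
        lengths.append(a + b)    # the glued path
    return total

assert total_shorter_lengths(200) <= math.log2(math.factorial(200))
```

The assertion succeeds for every merge order, since the product of the binomial coefficients $\binom{k_i + l_i}{k_i}$ over all gluings is at most $n!$.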
If we regard the parent of each node as its successor, then the forest induces a partial order on the $n$ nodes of $R$. Let $N_i$ be the number of total orders that are extensions of the partial order induced by the forest after the $i$'th gluing operation. Since the nodes are incomparable initially and form a single path after all operations, $N_0 \le n!$ and $N_m = 1$. The $i$'th gluing operation merges two paths of length $k_i$ and $l_i$, so $N_{i-1} \ge \binom{k_i + l_i}{k_i} N_i$, and $\binom{k_i + l_i}{k_i} \ge 2^{k_i}$ because $k_i \le l_i$. Therefore,

  $\log_2 N_i \le \log_2 N_{i-1} - k_i$.

Hence,

  $\sum_{i=1}^{m} k_i \le \sum_{i=1}^{m} (\log_2 N_{i-1} - \log_2 N_i) = \log_2 N_0 - \log_2 N_m \le \log_2 n! = O(n \log n)$,

implying that the overall time for computing the persistence pairing is $O(n \log^2 n)$.

Symmetry. The negative function, $-f$, defined by $(-f)(x) = -f(x)$, has the same critical points as $f$. We claim that it also generates the same pairing.
SYMMETRY LEMMA. Critical points $x$ and $y$ are paired for $f$ iff they are paired for $-f$.
PROOF. The claim is true for the first minimum, $x$, and the last maximum, $y$. Every other pair of $f$ contains at least one saddle. We assume without loss of generality that $x$ is a saddle and that $f(x) < f(y)$. Consider again the sweep of the 2-manifold in the direction of increasing values of $f$. When we pass $r = f(x)$ we split a cycle in the level set into two. The two cycles belong to the boundary of $\mathbb{M}^r$, the set of points with function value $r$ or higher. If the two cycles belong to the same component of $\mathbb{M}^r$, such as for the point labeled 2 in Figure 4.2, then $x$ is a lo-point and $y$ is the lowest hi-point that spans a cycle with $x$. The claim follows because $x$ is also the highest lo-point that spans a cycle with $y$. If, on the other hand, the two cycles belong to two different components of $\mathbb{M}^r$, such as for the other type of saddle in Figure 4.2, then $y$ is the lower of the two maxima that complete the two components. In the backward sweep (the forward sweep for $-f$), $y$ starts a component that merges into the other, older component at $x$. Again $x$ and $y$ are also paired for $-f$, which implies the claimed symmetry.
4.2.2 Height and Elevation
In this section, we define the elevation as a real-valued function on a 2-manifold in $\mathbb{R}^3$.

Measuring height and elevation on Earth. Even on Earth, defining the elevation of a point $x$ on the surface is a non-trivial task. Traditionally, it is defined relative to the mean sea level (MSL) in the direction of the measured point. In other words, the MSL elevation of a point $x$ is the difference between the distance of $x$ from the center of mass and the distance of the MSL from the center of mass in the direction of $x$. The difficulty of measuring height
in the middle of a continent was overcome by introducing the geoid, which is a level surface
of the Earth’s gravitational potential and roughly approximates the MSL while extending
it across land. The orthometric height above (or below) the geoid is thus more general
and about the same as the MSL elevation. It is perhaps surprising that the geoid differs
significantly from its best ellipsoidal approximation due to non-uniform density of the
Earth’s crust [87]. Standard global positioning systems (GPS) indeed return the ellipsoidal
height, which is elevation relative to a standard ellipsoidal representation of the Earth’s
surface. They also include knowledge of the geoid height relative to the ellipsoid and
compute the orthometric height of $x$ as its ellipsoidal height minus the ellipsoidal height of the geoid in the direction of $x$.
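As a small numerical illustration (the numbers are hypothetical, not from the thesis), this last step amounts to a subtraction:

```python
# Hypothetical numbers, for illustration only: the orthometric height is the
# GPS ellipsoidal height minus the geoid's ellipsoidal height at that point.
def orthometric_height(h_ellipsoidal_m, geoid_height_m):
    return h_ellipsoidal_m - geoid_height_m

# e.g. GPS reports h = 120.3 m, with the geoid 48.1 m above the ellipsoid
assert abs(orthometric_height(120.3, 48.1) - 72.2) < 1e-6
```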
A simplifying factor in the discussion of height and elevation on Earth is the existence
of a canonical core point, the center of mass. For general surfaces, distance measurements
from a fixed center make much less sense. We are interested in this general case, which
includes surfaces with non-zero genus for which there is no simple notion of core. As
on Earth, we define the elevation of a point $x$ as the difference between two distances,
except we no longer use a reference surface, such as the mean sea level or the geoid, but
instead measure relative to a canonically associated other point on the surface. To explain
how this works, we give different meanings to the 'height' of a point $x$, which we define
for every direction, and the ‘elevation’ of the point, which is the difference between two
heights. While height depends on an arbitrarily chosen origin, we will see that elevation
is independent of that choice. Indeed, the technical concept of elevation, as introduced
shortly, will be similar in spirit to the idea of orthometric height, with the exception that it
substitutes the canonical associated point for a globally defined reference surface.
Height, persistence and elevation. Let $\mathbb{M}$ be a smoothly embedded 2-manifold in $\mathbb{R}^3$. We assume that $\mathbb{M}$ is generic, but it is too early to say what exactly that should mean. We define the height in a given direction as the signed distance from the plane normal to that direction and passing through the origin. Formally, for every unit vector $u \in \mathbb{S}^2$, we call $h_u(x) = \langle x, u \rangle$ the height of $x$ in the direction $u$. This defines a 2-parameter family of height functions,

  $\mathrm{Height}: \mathbb{M} \times \mathbb{S}^2 \to \mathbb{R}$,

where $\mathrm{Height}(x, u) = h_u(x)$. The height is a Morse function on $\mathbb{M}$ for almost all directions. We pair the critical points of $h_u$ as described in Section 4.2.1. Following [76], we define the persistence of a critical point as the absolute difference in height to the paired point: $\mathrm{pers}(x) = \mathrm{pers}(y) = |h_u(x) - h_u(y)|$.

Each point $x \in \mathbb{M}$ is critical for exactly two height functions, namely for the ones in the direction of its outer and inner normals: $\pm n_x$. We proved in Section 4.2.1 that the
pairs we get for the two opposite directions are the same. Hence, each point $x \in \mathbb{M}$ has a unique persistence, which we use to introduce the elevation function,

  $\mathrm{Elevation}: \mathbb{M} \to \mathbb{R}$,

defined by $\mathrm{Elevation}(x) = \mathrm{pers}(x)$. We note that the elevation is invariant under translation and rotation of $\mathbb{M}$ in $\mathbb{R}^3$.
Two-dimensional example. We illustrate the definitions of the height and elevation functions for a smoothly embedded 1-manifold $\mathbb{M}$ in $\mathbb{R}^2$. The critical points of $h_u: \mathbb{M} \to \mathbb{R}$ are the points $x \in \mathbb{M}$ with normal vectors $n_x = \pm u$. Figure 4.3 illustrates a sweep in the vertical upward direction $u$. Each critical point of $h_u$ starts a component, ends a component by merging it into an older component, or closes the curve. The critical points that start components get paired with the other critical points. The elevation is zero at inflexion
Figure 4.3: A 1-manifold with marked critical points of the vertical height function. The shaded strips along the curve connect paired critical points. The black and grey dots mark two- and one-legged elevation maxima.
points and increases as we move away in either direction. The function may experience a
discontinuity at points that share tangent lines with others, such as endpoints of segments
that belong to the boundary of the convex hull. On the way towards a discontinuity, the
elevation may go up and down, possibly several times. The elevation may reach a local
maximum at points that either maximize the distance to a shared tangent line or the distance
to another critical point in the normal direction. Examples of the first case are the black
dots in Figure 4.3, where the elevation peaks in a non-differentiable manner. An example
of the second case is the grey point, where the elevation forms a smooth maximum.
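As a concrete (hypothetical) instance of this two-dimensional picture, consider an ellipse: the two critical points of each height function are antipodal, so the elevation at a point is the height difference to its antipode, measured along the normal. A short sketch:

```python
# Sketch (not thesis code): elevation on an ellipse.  The point at parameter
# t is critical for the height function in its normal direction; the other
# critical point of that height function is the antipodal point at t + pi,
# so the elevation is the height difference between the two along the normal.
import math

def elevation_on_ellipse(a, b, t):
    x = (a * math.cos(t), b * math.sin(t))
    n = (b * math.cos(t), a * math.sin(t))          # outward (unnormalized) normal
    norm = math.hypot(n[0], n[1])
    u = (n[0] / norm, n[1] / norm)                  # unit normal direction
    antipode = (-x[0], -x[1])                       # the critical point for -u
    return abs((x[0] - antipode[0]) * u[0] + (x[1] - antipode[1]) * u[1])

# the ends of the major and minor axes realize elevations 2a and 2b
assert math.isclose(elevation_on_ellipse(2.0, 1.0, 0.0), 4.0)
assert math.isclose(elevation_on_ellipse(2.0, 1.0, math.pi / 2), 2.0)
```

The major-axis endpoints maximize the elevation, in line with the smooth one-legged maximum marked grey in Figure 4.3.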
Singular tangencies. The elevation is continuous on $\mathbb{M}$, except possibly at points with singular tangencies. These points correspond to transitional violations of the two genericity conditions of Morse functions. Such violations are unavoidable, as Height is a 2-parameter family within which we can transition from one Morse function to another:

- two critical points may converge and meet at a birth-death point, where they cancel each other;

- two critical points may interchange their positions in the ordering by height, passing a direction at which they share the same height.
The first transition corresponds to an inflexion point of a geodesic on $\mathbb{M}$. Such points are referred to as flat or parabolic, indicating that their Gaussian curvature is zero. The second transition corresponds to two points $x \ne y$ that share the same tangent plane, $T_x = T_y$.
Both types of singularities are forced by varying one degree of freedom and are turned into
curves by varying the second degree of freedom. These curves pass through co-dimension
two singularities formed by two simultaneous violations of the two genericity conditions.
There can be two concurrent birth-death points, a birth-death point concurrent with an
interchange, or two concurrent interchanges. In each case, the singularity is defined by
two pairs of critical points and we get two types each because these pairs may be disjoint
or share one of the points. See Table 4.1 for the features on $\mathbb{M}$ that correspond to the six
types of co-dimension two singularities. We can now be more precise about what we mean
by a generic 2-manifold.
GENERICITY ASSUMPTION A. The 2-parameter family of height functions on $\mathbb{M}$ has no violations of Conditions I and II for Morse functions other than the ones mentioned above (and enumerated in Table 4.1 below).
Some of these violations will be discussed in more detail later as they can be locations of
maximum elevation. A second genericity assumption referring specifically to the elevation
function will be stated in Section 4.4.1.
4.3 Pedal Surface
In this section, we take a dual view of the height and elevation functions, based on a transformation of $\mathbb{M}$ to another surface in $\mathbb{R}^3$. We take this view to help our understanding of the singularities of Height, but it is of course also possible to study them directly using
standard results in the field [23, 99].
Pedal function. Recall that $T_x$ is the plane tangent to $\mathbb{M}$ that passes through the point $x \in \mathbb{M}$. The pedal $p$ of $x$ is the orthogonal projection of the origin onto $T_x$. We write $p = \mathrm{Pedal}(x)$ and obtain a function

  $\mathrm{Pedal}: \mathbb{M} \to \mathbb{R}^3$,

whose image $P = \mathrm{Pedal}(\mathbb{M})$ is the pedal surface of $\mathbb{M}$ [36]. If the line from the origin through $x$ is normal to $\mathbb{M}$ then $p = x$. More generally, we can construct $p$ by drawing the diameter sphere with center $x/2$ passing through $0$ and $x$. This sphere intersects $T_x$ in a circle with center $(x + p)/2$ that passes through $x$ and $p = \mathrm{Pedal}(x)$. In fact, $P$ is the evolute of the diameter spheres defined by the origin and the points $x \in \mathbb{M}$, as illustrated in Figure 4.4. The following three properties are useful in understanding the correspondence between $\mathbb{M}$ and its pedal surface:

- points on $\mathbb{M}$ have parallel and anti-parallel normal vectors iff their images under the pedal function lie on a common line passing through the origin;
Figure 4.4: A smoothly embedded closed curve (boldface solid) and the image of the pedal function (solid), constructed as the evolute of the diameter circles (dotted) between the curve and the origin.
- the height of a point $x \in \mathbb{M}$ in the direction of its normal vector is equal to plus or minus the distance of $\mathrm{Pedal}(x)$ from the origin;

- from a point $p \in P$ and the angle $\varphi$ between the vector $p$ and the normal $n_p$ of $P$ at $p$, we can compute the radius $r = |p| / (2 \cos \varphi)$ of the corresponding diameter sphere and the preimage $x$ at distance $|p| \tan \varphi$ from $p$, in the direction normal to $p$ within the plane spanned by $p$ and $n_p$.
The third property implies that the pedal surface determines the 2-manifold.
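The pedal map itself is a one-line computation. The sketch below (illustration only, with a hypothetical point and unit normal) also checks the second property above, that the height of $x$ in its normal direction equals plus or minus the distance of $\mathrm{Pedal}(x)$ from the origin:

```python
# Sketch (illustration, not thesis code): the pedal of a surface point x with
# unit normal n is the orthogonal projection of the origin onto the tangent
# plane at x, i.e. Pedal(x) = <x, n> n.  We verify that the height of x in
# its normal direction equals (up to sign) the distance |Pedal(x)|.
import math

def pedal(x, n):
    d = sum(xi * ni for xi, ni in zip(x, n))   # <x, n> = signed height h_n(x)
    return tuple(d * ni for ni in n)

x = (1.0, 2.0, 2.0)
n = (0.0, 0.6, 0.8)                            # hypothetical unit normal at x
p = pedal(x, n)
height = sum(xi * ni for xi, ni in zip(x, n))  # h_n(x) = <x, n>
assert math.isclose(abs(height), math.dist(p, (0.0, 0.0, 0.0)))
```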
Tangents, heights, and pedals. We are interested in singularities of the pedal function, as they correspond to directions along which the height function is not generic. For example, a birth-death point of Height corresponds to a cusp point of $P$. To see this, recall that the birth-death point corresponds to a flat point $x \in \mathbb{M}$. A generic geodesic curve through this point has an inflexion at $x$, causing the tangent plane to reverse the direction of its rotating motion as we pass through $x$. Similarly, it causes a sudden reversal of the motion of the image point, thus forming a cusp at $\mathrm{Pedal}(x)$. In contrast, an interchange of Height, which corresponds to a plane tangent to $\mathbb{M}$ in two points, maps to a point of self-intersection (a xing) of $P$. These two cases exhaust the co-dimension one singularities of Height, which are listed in the upper block of Table 4.1.
Co-dimension two singularities.

Dictionary of Singularities

    $\mathbb{M}$             Height                      pedal surface $P$
    flat point               birth-death (bd) point      cusp
    double tangency          interchange                 xing

    Jacobi point             2 bd-points                 dovetail point
    triple tangency          3 interchanges              triple point
                             bd-pt. and interchange      cusp xing
                             2 bd-points                 cusp-cusp overpass
                             2 interchanges              xing-xing overpass
                             bd-pt. and interchange      cusp-xing overpass

Table 4.1: Correspondence between singularities of tangents of the manifold, the 2-parameter family of height functions, and the pedal surface. There are two singularities of co-dimension one: curves of cusps and curves of self-intersections (xings). There are six singularities of co-dimension two.

There are six types of co-dimension two singularities, listed in the lower block of Table 4.1. Perhaps the most interesting is formed by two
concurrent birth-death points that share a critical point. As illustrated in Figure 4.5, left, the corresponding dovetail point in the pedal surface is the endpoint of two cusps but also of a self-intersection curve. The second most interesting type is formed by two concurrent

Figure 4.5: Left: a portion of the pedal surface in which a self-intersection and two cusps end at a dovetail point. Middle: three sheets of the pedal surface intersecting in a triple point. Right: a cusp intersecting another sheet of the pedal surface.

interchanges that share a critical point and therefore force a third concurrent interchange of the other two critical points. It corresponds to three self-intersection curves formed by three sheets of $P$ that intersect in a triple point, as shown in Figure 4.5, middle. Third, we may have a concurrent birth-death point and interchange that share a critical point. As illustrated in Figure 4.5, right, this corresponds to a cusp curve that passes through another sheet of the pedal surface. There are three parallel types, in which the concurrency happens in the same direction $u$ but not in space. They correspond to two curves on the pedal surface that cross each other as seen from the origin but do not meet in $\mathbb{R}^3$. As before, a birth-death point corresponds to a cusp curve and an interchange to a curve of self-intersections.
4.4 Capturing Elevation Maxima
4.4.1 Continuity
We are interested in the local maxima of the elevation function, which are the counterparts
of mountain peaks and deepest points in the sea. But they are not well defined because the
elevation can be discontinuous. We remedy this shortcoming through surgery, and establish
a stratified Morse function on the new 2-manifold.
Discontinuities at interchanges. As mentioned in Section 4.2.1, the pairs vary con-
tinuously as long as the height function varies without passing through interchanges and
birth-death points (Conditions I and II). It follows that the elevation is continuous in re-
gions where this is guaranteed. Around a birth-death point, the elevation is necessarily
small and goes to zero as we approach the birth-death point. The only remaining possibil-
ity for discontinuous elevation is thus at interchanges, which happen when two points share
the same tangent plane. As mentioned in Table 4.1, this corresponds to a point at which
the pedal surface intersects itself. Figure 4.6 shows that discontinuities in the elevation
can indeed arise at co-tangent points. We see four points with common vertical normal direction, of which $x$ and $y$ are co-tangent. Consider a small neighborhood of the vertical direction, $u$, and observe that the critical points vary in neighborhoods of their locations for nearby directions. The critical point near $w$ changes its partner from the right side of $x$ to the left side of $y$ as it varies from left to right in the neighborhood of $w$. Similarly, the critical point near $z$ changes its partner from the right side of $y$ to the left side of $x$ as it varies from left to right in the neighborhood of $z$. Since the height difference is the same at the time of the interchange, the elevation at $w$ and $z$ is still continuous. However, it is not continuous at $x$ and at $y$, which both change their partners, either from $w$ to $z$ or the other way round. Not all interchanges cause discontinuities, only those that affect the pairing. These are the interchanges that affect a common topological feature arising during the sweep of $\mathbb{M}$ in the height direction.

Figure 4.6: The four white points share the same normal direction, as do the four light shaded and the four dark shaded points. The strips indicate the pairing, which switches when the height function passes through the vertical direction. The insert on the right illustrates the effect of surgery at $x$ and $y$ on the pedal curve.
Continuity through surgery. We apply surgery to $\mathbb{M}$ to obtain another 2-manifold $\mathbb{N}$ on which the elevation function is continuous. Specifically, we cut $\mathbb{M}$ along the curves at which $\mathrm{Elevation}: \mathbb{M} \to \mathbb{R}$ is discontinuous, resulting in a 2-manifold with boundary, $\mathbb{K}$. Then we glue $\mathbb{K}$ along its boundary, making sure that glued points have the same elevation. Formally, we cut by applying the inverse of a surjective map from $\mathbb{K}$ to $\mathbb{M}$, and glue by applying a surjective map from $\mathbb{K}$ to $\mathbb{N}$:

  $\mathbb{M} \xleftarrow{\ \mathrm{cut}\ } \mathbb{K} \xrightarrow{\ \mathrm{glue}\ } \mathbb{N}$.

As argued above, the boundary curves of $\mathbb{K}$ occur in pairs, and each pair is defined by an interchange, thus corresponding to a self-intersection curve (a xing) of the pedal surface. The latter view is perhaps the most direct one, in which surgery means cutting along xings and gluing the resulting four sheets in a pairing that resolves the self-intersection. This is illustrated in Figure 4.6, where on the right we see a self-intersection being resolved by cutting the two curves and gluing the upper and lower two ends. In the original boldface curve on the left, this operation corresponds to cutting at $x$ and $y$ and gluing the four ends to form two closed curves: one passing through $w$ and the other through $z$.
As mentioned earlier, not all xings correspond to discontinuities, and we perform surgery only on the subset that do. In general, a discontinuity follows a xing until it runs into a dovetail or a triple point. In the former case, the xing and the discontinuity both end. In the latter case, the xing continues through the triple point and the discontinuity may follow, turn, or even branch to other xings passing through the same triple point. Two possible configurations created by surgery in the neighborhood of a triple point are illustrated in Figures 4.8 and 4.9. Their particular significance in the recognition of local maxima will be discussed shortly. Whatever the situation, the subset of xings along which the elevation is discontinuous, together with the gluing pattern across these xings, provides a complete picture of how to use surgery to change $P$ into a new surface, $Q$. The 2-manifold $\mathbb{N}$ is the one for which this is the pedal surface: $Q = \mathrm{Pedal}(\mathbb{N})$. That $\mathbb{N}$ is indeed a manifold can be shown by (tediously) enumerating and examining all cases of cut-and-glue patterns that may occur. After surgery, we have a continuous function $\mathrm{Elevation}: \mathbb{N} \to \mathbb{R}$. Furthermore, we have continuously varying pairs of critical points. To formalize this idea, we introduce a new map

  $\mathrm{Pair}: \mathbb{N} \to \mathbb{N}$

that maps a point $x$ to its paired point $y = \mathrm{Pair}(x)$. The function Pair is a homeomorphism and its own inverse. We note in passing that we could construct yet another 2-manifold by identifying antipodal points. Each local maximum of the elevation function on this new manifold corresponds to a pair of equally high maxima in $\mathbb{N}$. This construction is the reason we will blur the difference between maxima and antipodal pairs of maxima in the next few sections.
Smoothness of Elevation. The elevation function on $\mathbb{N}$ is smooth almost everywhere. To describe the violations of smoothness, let $\partial\mathbb{K}$ denote the boundary of the intermediate manifold. Let $G' = \mathrm{glue}(\partial\mathbb{K})$ and define $G = G' \cup \mathrm{Pair}(G')$; $G$ is the set of points at which the elevation function is not smooth. By Genericity Assumption A, $G$ is a graph, consisting of nodes and arcs. We have degree-1 and degree-3 nodes that correspond to dovetail points and triple points in the pedal surface, respectively, as well as degree-4 nodes that correspond to overpasses between xings. Each degree-4 node is the crossing of an arc in $G'$ and an arc in the antipodal image of $G'$. We think of this construction as a stratification of $\mathbb{N}$. Its strata are

- the three kinds of nodes;

- the open and closed arcs;

- the open connected regions in $\mathbb{N} - G$.

Figure 4.7 illustrates the construction by showing what such a stratification may look like. When restricted to every stratum, the elevation function is smooth, but still not a Morse function. For example, all points on lines of inflexion have elevation identically 0, forming lines of local minima of Elevation. We now complete our description of what we mean by a generic 2-manifold.
Figure 4.7: Stratification of the 2-sphere obtained by overlaying a spherical tetrahedron with its antipodal image. The (shaded) degree-4 nodes are crossings between the graph and its antipodal image.
GENERICITY ASSUMPTION B. The local maxima of Elevation on $\mathbb{N}$ are isolated.

The implication of this assumption becomes clearer after we enumerate the generic types of local maxima of the elevation function in the next section. In particular, it means that surfaces such as spheres and cylinders are not generic under this assumption.
4.4.2 Elevation Maxima
In this section, we enumerate the generic types of local maxima of the elevation function.
They come in pairs in $\mathbb{N}$ which, by inverse surgery, form multi-legged creatures in $\mathbb{M}$.

Classification of local maxima. Depending on its location, a point $x \in \mathbb{N}$ can have one, two, or three preimages under surgery. We call this number its multiplicity, $\mu(x)$. Specifically, $x$ has multiplicity three if it is a node of the graph $G$, it has multiplicity two if it lies on an arc of $G$, and it has multiplicity one otherwise. Degree-4 nodes in the stratification correspond to antipodal pairs of points with multiplicity two each. Let now $x \in \mathbb{N}$ be a local maximum of the elevation function. We know that $x$ is not a flat point of $\mathbb{M}$, else its elevation would be zero. This simple observation eliminates five of the eight singularities in Table 4.1. Furthermore, the assumption of a generic 2-manifold $\mathbb{M}$ implies that the sum of the multiplicity of $x$ and that of $y = \mathrm{Pair}(x)$ is at most four (where the xings intersect each other transversally; otherwise, we can deform the manifold slightly to enforce this). This leaves the following four possible types of local maxima $x$:

- one-legged if $\mu(x) = \mu(y) = 1$;
- two-legged if $\mu(x) = 1$ and $\mu(y) = 2$;
- three-legged if $\mu(x) = 1$ and $\mu(y) = 3$;
- four-legged if $\mu(x) = \mu(y) = 2$,

where $y = \mathrm{Pair}(x)$; see Figure 4.1. We sometimes call the preimages of $x$ the heads and those of $y$ the feet of the maximum. The most exotic of the four types is perhaps the four-legged maximum, which corresponds to an overpass of two xings in the pedal surface or, equivalently, a degree-4 node in the stratification. The image under Pedal of $x$ lies on one xing and the image of $y$ lies on the other. Both maxima have two preimages under surgery, which makes for a complete bipartite graph with two heads, two feet, and four legs.
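The classification by multiplicities can be summarized in a small lookup (a sketch; `leg_type` is a hypothetical helper, not thesis code):

```python
# Sketch of the classification above: the type of a local maximum follows
# from the multiplicities of x and its partner y = Pair(x) under surgery.
def leg_type(mu_x, mu_y):
    mu_x, mu_y = sorted((mu_x, mu_y))   # the roles of x and y are symmetric
    kinds = {(1, 1): 'one-legged', (1, 2): 'two-legged',
             (1, 3): 'three-legged', (2, 2): 'four-legged'}
    if (mu_x, mu_y) not in kinds:
        # generic maxima have multiplicities summing to at most four
        raise ValueError('not a generic local maximum')
    return kinds[(mu_x, mu_y)]

assert leg_type(1, 3) == 'three-legged'
assert leg_type(2, 2) == 'four-legged'
```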
Neighborhood patterns. Given a point $x \in \mathbb{M}$, take an open neighborhood of $x$ on $\mathbb{M}$. Denote by $N(n_x) \subseteq \mathbb{S}^2$ the image of this neighborhood under the Gauss map (which takes each point of $\mathbb{M}$ to its normal vector on $\mathbb{S}^2$), and refer to it as the neighborhood of $n_x$. If $x$ is not a flat point (i.e., the Gaussian curvature at $x$ is not zero), then $N(n_x)$ is homeomorphic to an open disk, and there is a one-to-one map from the neighborhood of $x$ to that of $n_x$ under the Gauss map. In the following discussion, we study only non-flat points, since flat points cannot be maxima of the elevation function.
It is instructive to look at the local neighborhood of a maximum $x$ in $\mathbb{N}$. Most interesting is the three-legged type, with feet $y_1, y_2, y_3$. A small perturbation of the normal direction can change the ambiguous pairing of $x$ with all three feet into an unambiguous pairing of a point in the neighborhood of $x$ with a point in the neighborhood of one of the feet. We indicate this by labeling the points in the neighborhood of $n_x$ (i.e., $N(n_x)$) with the indices of the feet, as shown in Figure 4.8. The three curves passing through $n_x$ correspond to the
Figure 4.8: The three sheets of the pedal surface after cutting and gluing the neighborhood of a triple point, at the top, and the corresponding pairing patterns in the neighborhood of $n_x$, at the bottom. The (shaded) Mercedes star is necessary for a three-legged maximum.
three xings passing through the corresponding triple point of the pedal surface. Note that in generic cases such curves pass through each other at $n_x$ in a transversal manner, as long as $x$ is not a flat point. Hence, they decompose the neighborhood into six slices corresponding to the six permutations of the three feet. The labeling indicates the pairing and reflects the surgery at these feet and, equivalently, at the corresponding triple point in the pedal surface. Only the rightmost pattern in Figure 4.8 corresponds to a maximum, the reason for which will become clear later, after we introduce and prove the necessary projection conditions for elevation maxima. We call this pattern the Mercedes star property of three-legged maxima.
There are in fact two ways to apply surgery at a three-legged maximum, one of which
is already shown in Figure 4.8. We illustrate the neighborhood patterns of the other in
Figure 4.9. The neighborhood pictures for the remaining three types of maxima are simpler.
For a one-legged maximum we have an undivided disk, which requires no surgery. For a
two- (resp. four-) legged maximum we have a disk divided into two halves (resp. quarters)
and there is only one way to do the surgery (see Figure 4.10).
Figure 4.9: The second type of surgery pattern at a triple point: the three sheets of the pedal surface after cutting and gluing the neighborhood of a triple point, at the top, and the corresponding pairing patterns in the neighborhood of $n_x$, at the bottom.
Figure 4.10: Neighborhood patterns for (a) a two-legged maximum and (b) a four-legged maximum, where each region is marked by the index pair of the sheets that are paired in it.
Necessary projection conditions. Given a maximum p of the elevation function with feet
x_1, ..., x_k, let ±u be the two directions at which the heads (the y_i's) and feet (the x_i's)
of this maximum on the surface are critical. Recall that all the x_i's (resp. y_i's) are
co-tangent. Furthermore, they satisfy the following necessary properties, stated as
projection conditions:
PROJECTION CONDITIONS. The point p is a maximum of the elevation function only if

#legs = 1: u is parallel or anti-parallel to p − x_1;

#legs = 2: u, p − x_1, and p − x_2 are linearly dependent, and the orthogonal projection
of p onto the line of the two feet lies between x_1 and x_2;

#legs = 3: the orthogonal projection of p onto the plane of the three feet lies inside
the triangle spanned by x_1, x_2 and x_3;

#legs = 4: the orthogonal projections of the segments y_1y_2 and x_1x_2 onto a plane
parallel to both have a non-empty intersection.
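The two- and three-legged conditions reduce to elementary projection tests. The following is a minimal illustrative sketch (not the thesis's code; the function names are ours, and points are assumed to be numpy arrays):

```python
import numpy as np

def projects_inside_triangle(p, x1, x2, x3, eps=1e-12):
    """Three-legged condition: does the orthogonal projection of p onto
    the plane of the feet x1, x2, x3 lie inside their triangle?
    Uses barycentric coordinates of the projected point."""
    u, v = x2 - x1, x3 - x1
    w = p - x1
    uu, uv, vv = u @ u, u @ v, v @ v
    wu, wv = w @ u, w @ v
    den = uu * vv - uv * uv
    if abs(den) < eps:
        return False            # degenerate (collinear feet)
    s = (vv * wu - uv * wv) / den
    t = (uu * wv - uv * wu) / den
    return s > 0 and t > 0 and s + t < 1

def projects_between_feet(p, x1, x2, eps=1e-12):
    """Two-legged condition: the orthogonal projection of p onto the
    line of x1 and x2 lies strictly between the two feet."""
    d = x2 - x1
    lam = ((p - x1) @ d) / (d @ d)
    return eps < lam < 1 - eps
```

The barycentric solve uses the normal equations of the least-squares fit of p − x1 by the edge vectors, which is equivalent to projecting p onto the plane first.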
In summary, p is a local maximum only if u is either a positive or a negative linear combi-
nation of the vectors p − x_i. Below we first prove the necessary conditions for one-legged
maxima. We then sketch the proof for all the remaining types of maxima.
For a one-legged maximum p, assume without loss of generality that p lies at the origin,
that u = (0, 0, 1), and that the foot is x_1. By definition, Elev(p) = h_u(p) − h_u(x_1),
which is the height difference between p and x_1 in the vertical direction. We parameterize
the directions in a neighborhood of u on the sphere by (φ, θ), as illustrated in Figure 4.11,
where 0 ≤ φ < 2π and 0 ≤ θ ≤ θ_0 for a sufficiently small θ_0 > 0. Next, let p(φ, θ)
(resp. x(φ, θ)) be the point in the neighborhood of p (resp. x_1) with normal n(φ, θ). Denote
by E(φ, θ) the height difference between p(φ, θ) and x(φ, θ) in direction n(φ, θ), i.e.,

E(φ, θ) = h_{n(φ,θ)}(p(φ, θ)) − h_{n(φ,θ)}(x(φ, θ)),

and set D(φ, θ) = E(φ, θ) − E(0, 0). Obviously, p is a maximum if and only if

D(φ, θ) ≤ 0 for all φ and all sufficiently small θ > 0.    (4.1)
Moreover, write r_p(φ) (resp. r_x(φ)) for the radius of curvature at p (resp. x_1) in the
direction parameterized by φ.

Figure 4.11: Illustration of the normal u and the parameterization of its neighborhood on the sphere of directions, a spherical patch of S^2.

Expanding the heights of p(φ, θ) and x(φ, θ) in direction n(φ, θ) to second order in θ
then gives

D(φ, θ) = E(φ, θ) − E(0, 0)
        = [ −E(0, 0)(1 − cos θ) + (r_p(φ) + r_x(φ))(1 − cos θ) + d sin θ cos ψ(φ) ],    (4.2)

where d is the horizontal distance between p and x_1, and ψ(φ) is the angle from the
vector p − x_1 to the vector n(φ) in the xy-plane. The third term in the bracket dominates
the other two for sufficiently small θ, as sin θ/(1 − cos θ) tends to infinity in that case.
Hence, (4.1) implies that p is a maximum if and only if (i) d = 0, i.e., p and x_1 are
vertically aligned; and (ii) for all φ, r_p(φ) + r_x(φ) ≤ E(0, 0). Note that the projection
condition for one-legged maxima is the same
as (i) and is thus indeed necessary. Furthermore, if we add (ii), then the conditions are also
sufficient. In fact, the necessary Projection Conditions for all other types of maxima can be
made sufficient by adding appropriate constraints on the curvature of the surface at the x_i's
and at the y_i's; if p is three-legged, it also needs the Mercedes star property. We sketch the
proofs for the remaining types of maxima below. As the curvature conditions are not used
in our algorithm for piecewise-linear manifolds, we omit terms related to curvature in
the following sketch for simplicity. Hence, the above discussion suggests that it suffices to
consider only the term

G(φ) = sin θ cos ψ(φ),

which determines the sign of D(φ, θ) as θ tends to zero.
For a k-legged maximum p with k > 1, the complication is that a point in the
neighborhood of p might be paired with points from the neighborhoods of different x_i's,
as illustrated by the neighborhood patterns in Figures 4.8–4.10. The disk in the neigh-
borhood pattern corresponds to the projection of the spherical patch in direction u. Therefore,
we can parametrize it using φ as well, and each region R in the pattern corresponds to an
interval of φ-values, referred to as the range of this region.

First consider k = 2 or 3, and assume again that p is at the origin and u = (0, 0, 1).
Define D_i(φ, θ), ψ_i, and G_i(φ) as before by substituting x_1 with x_i. Obviously, p is a
maximum if and only if for each region R in the neighborhood pattern,

D_i(φ, θ) ≤ 0 for all φ in the range of R, where x_i is the foot paired in R.    (4.3)
For a two-legged maximum, the length of the range of each region is π. Hence, (4.3)
imposes that the graphs of G_1 and G_2 behave as shown in Figure 4.12 (a). In other words,
we have ψ_2 = ψ_1 + π, implying that the u-projection of the segment x_1x_2 contains the
origin (i.e., the projection of p).

Figure 4.12: The solid (dotted) curve corresponds to the graph of G_1 (G_2). (a) is necessary for p to be a maximum. In (b), there exists a direction φ (hollow dot) such that both G_1(φ) > 0 and G_2(φ) > 0; thus p cannot be a maximum.
Figure 4.13: The dark, light, and dotted curves are the graphs of G_1, G_2, and G_3, respectively. (a) is necessary for p to be a maximum. In (b), where the third curve is moved slightly, there exists a direction φ (hollow dot) such that G_i(φ) > 0 for i = 1 and 3; thus p cannot be a maximum.
For a three-legged maximum, first note that the Mercedes star property is necessarily
true. This is because for any other neighborhood pattern in Figure 4.8 or 4.9, there exist
pairs of antipodal normals that are marked by the same index. For example, the middle
pattern in Figure 4.8 contains such a pair, both marked by index 3. As G_i(φ) and G_i(φ + π)
have opposite signs (unless both are zero), p cannot be a maximum in this case.
Furthermore, (4.3) imposes that the angles ψ_1, ψ_2, and ψ_3 interleave as in Figure 4.13.
This implies that the u-projection of the triangle x_1x_2x_3 covers the origin (i.e., the
projection of p).
The case of a four-legged maximum is slightly more involved, as there are two heads
y_1 and y_2 as well as two feet x_1 and x_2. Assume that y_1 is at the origin and u = (0, 0, 1).
By construction, the segment y_1y_2 is parallel to the horizontal plane, and so is the segment
x_1x_2. We define the ψ_i's and G_i's for i = 1, 2 as before by substituting x_1 with x_i.
Denote by D'(φ, θ) the height of y_2(φ, θ) minus that of y_1(φ, θ) in direction n(φ, θ). By
(4.2), D'(φ, θ) has the same sign as G'(φ) = sin θ cos ψ', where ψ' is the angle from the
vector y_1 − y_2 to the vector n(φ). If the pair is a maximum, then for any φ, either
G_1(φ) ≤ 0 or G'(φ) ≤ 0 holds. This is because y_2(φ, θ) is higher than y_1(φ, θ) when
G'(φ) > 0, so the height difference between y_2(φ, θ) and x_1(φ, θ) can only be larger than
D_1(φ, θ). It then follows that the u-projection of the segment y_1y_2 intersects, in its
interior, the line passing through the projections of x_1 and x_2. By switching the roles of
the y_i's and x_i's in the above argument, we can show that the segment x_1x_2 also
intersects, in its interior, the line passing through the projections of y_1 and y_2. This
proves the necessary condition for four-legged maxima.
4.5 Algorithm
In this section, we describe an algorithm for constructing all points with locally maximum
elevation. The input is a piecewise linear 2-manifold embedded in R^3. The running time
of the algorithm is polynomial in the size of the 2-manifold.
Smooth vs. piecewise linear. We consider the case in which the input is a two-dimensional
simplicial complex K in R^3. This data violates some of the assumptions we used in our
mathematical considerations above. This causes difficulties which, with some effort, can be
overcome. For example, it makes sense to require that K be a 2-manifold but not that it
be smoothly embedded. The 2-parameter family of height functions is well-defined and
continuous but not smooth. The definition of the elevation function is more delicate, as it
makes reference to point pairs in all possible directions. For any given direction, we get a
well-defined collection of pairs, but how can we be sure that the pairs for different direc-
tions are consistent? The difficulty is rooted in the fact that a vertex of K can be critical for
more than one direction, and it may be paired with different vertices in different directions.
To rationalize this phenomenon, we follow [76] and think of K as the limit of an infinite
sequence of smoothly embedded 2-manifolds. A vertex of K gets resolved into a small patch
with a two-dimensional variety of normal directions. Even as the patch shrinks toward the
vertex, the variety of normal directions may remain fixed or at least not contract. For dif-
ferent directions in this variety, the corresponding points on the patch may be paired with
points from different other patches. It thus seems natural that in the limit a vertex would
be paired to more than one other point.
To make this idea concrete, we introduce a combinatorial notion of the variety of nor-
mal directions. Let σ be a simplex in K (a vertex, edge, or triangle), let a be a point in the
interior of σ, and let u ∈ S^2 be a direction. We say a is critical for the height function in
the direction u if

(i) h_u(x) = h_u(a) for all points x of σ;

(ii) the lower link of a is not contractible to a point.

For example, the empty lower link of a minimum and the complete circle of a maximum
are both not contractible. Let S(a) ⊆ S^2 be the set of directions along which a is critical.
Generically, the set S for a point inside a triangle of K is an antipodal pair of points,
that for a point on an edge is an antipodal pair of open great-circle arcs, and that for a
vertex is an antipodal pair of open spherical polygons. Here, the word 'generic' applies to
a simplicial complex in R^3, where it simply means that the vertices are in general position.
Computationally, this assumption can be simulated by a symbolic perturbation [78]. We
write S(a, b, ...) for the common intersection of the sets S of a, b, and so on.
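For a vertex of a PL 2-manifold and a fixed generic direction, criticality can be read off the cyclic sequence of link heights: the lower link is a union of arcs, and the vertex is regular exactly when there is a single such arc. A small illustrative sketch (our own naming, not the thesis's code; heights are assumed to be measured in the chosen direction):

```python
def lower_link_arcs(vertex_height, ring_heights):
    """Number of connected arcs of the lower link of a vertex on a PL
    2-manifold, given the heights of its link vertices in cyclic order.
    Returns -1 if the lower link is the whole circle."""
    lower = [h < vertex_height for h in ring_heights]
    if all(lower):                  # full circle: a maximum
        return -1
    # count maximal cyclic runs of 'lower' vertices
    arcs = 0
    for i in range(len(lower)):
        if lower[i] and not lower[i - 1]:
            arcs += 1
    return arcs

def classify(vertex_height, ring_heights):
    """Classify a vertex by the topology of its lower link."""
    a = lower_link_arcs(vertex_height, ring_heights)
    if a == -1: return "maximum"    # lower link not contractible
    if a == 0:  return "minimum"    # empty lower link, not contractible
    if a == 1:  return "regular"    # one arc, contractible
    return f"{a}-fold saddle"       # two or more arcs, not contractible
```

Sweeping over all directions in which the vertex is level with some neighbor traces out the boundary of the spherical polygon S(a) described above.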
Finite candidate sets. Given a candidate for a maximum, we can use the extended per-
sistence algorithm to decide whether or not it really is a maximum. More specifically, we
need a point a and a direction u along which the sweep defining the pairing proceeds.
The details of this decision algorithm will be discussed shortly. We use the Projection
Conditions, which are necessary for local maxima, to get four kinds of candidates:

#legs = 1: pairs of points a and b on K with the direction u = (a − b)/||a − b|| contained in
S(a, b);

#legs = 2: triplets of points a, b_1, b_2 such that the orthogonal projection c of a onto the
line of b_1 and b_2 lies between the two points and the direction u = (a − c)/||a − c|| is
contained in S(a, b_1, b_2);

#legs = 3: quadruplets of points a, b_1, b_2, b_3 such that the orthogonal projection c of a
onto the plane of b_1, b_2, b_3 lies inside the triangle and the direction u = (a − c)/||a − c||
is contained in S(a, b_1, b_2, b_3);

#legs = 4: quadruplets of points a_1, a_2, b_1, b_2 such that the shortest line segment s con-
necting the lines of a_1a_2 and b_1b_2 also connects the two line segments and the
direction of s is contained in S(a_1, a_2, b_1, b_2).

With the assumption of a generic simplicial complex K, we get a finite set of candidates
of each kind. Since this might not be entirely obvious, we discuss the one-legged case
in some detail. Let σ and τ be two simplices and a and b points in their interiors. For
a generic K, the intersection of normal directions, S(a, b), is non-empty only if one of
the two simplices is a vertex or both are edges. If σ is a vertex then a = σ, and b is
necessarily the orthogonal projection of a onto τ, which may or may not exist. If σ and τ
are both edges then ab is necessarily the line segment connecting σ and τ and forming a
right angle with both, which again may or may not exist. In the end, we get a set of O(n^2)
candidate pairs a and b, where n is the number of edges in K. For the two-legged case, we
get O(n^3) candidates, each a triplet of vertices or a pair of vertices together with a point
on an edge. For the three- and four-legged cases, we get O(n^4) candidates each, a
quadruplet of vertices, giving a total of O(n^4) candidates.
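For the edge–edge subcase of the one-legged candidates, the connecting segment that forms a right angle with both edges can be computed from the standard closest-point parameters of the two supporting lines. A hedged sketch (our own helper, not the thesis's implementation; endpoints are assumed to be numpy arrays):

```python
import numpy as np

def common_normal_candidate(p1, p2, q1, q2, eps=1e-12):
    """For two edges p1p2 and q1q2, find the segment realizing a right
    angle with both supporting lines; return (A, B, u) if it connects
    the two open edges, else None."""
    d1, d2 = p2 - p1, q2 - q1
    r = p1 - q1
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ r, d2 @ r
    den = a * c - b * b
    if abs(den) < eps:
        return None                 # parallel edges: no isolated candidate
    s = (b * e - c * d) / den       # parameter on edge p1p2
    t = (a * e - b * d) / den       # parameter on edge q1q2
    if not (0 < s < 1 and 0 < t < 1):
        return None                 # foot falls outside an open edge
    A, B = p1 + s * d1, q1 + t * d2
    u = A - B
    n = np.linalg.norm(u)
    return (A, B, u / n) if n > eps else None
```

The parameters s and t solve the two orthogonality equations d1 · (A − B) = 0 and d2 · (A − B) = 0; the candidate survives only if the segment connects the interiors of both edges, mirroring the "may or may not exist" caveat above.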
Verifying candidates. Let (a, b) be a pair of points whose heads and feet all have
parallel or anti-parallel normal directions. In the smooth case, the necessary and sufficient
conditions for a and b to define an elevation maximum consist of three parts:

(a) the Projection Conditions of Section 4.4.1;

(b) the requirement that a and b be paired with each other by the antipodality map;

(c) the curvature constraint alluded to in Section 4.4.1.
We subsume the Mercedes star property in (b), since it depends on the antipodality map
or, equivalently, on the pairing by extended persistence. In the piecewise linear case, we
only have (a) and (b), because the concentration of the curvature at the edges and vertices
renders (c) redundant. We have seen above how to translate (a) to the piecewise linear
case. It remains to test (b), which reduces to answering a constant number of antipodality
queries: given a direction u and a critical point a of h_u, find the paired critical point b. This
is part of what the algorithm described in Section 4.2.1 computes if applied to a sweep of
K in the direction u. More precisely, the algorithm computes one of the possible pairs
if applied in non-generic directions, in which two or more vertices share the same height.
Most of our candidates generate non-generic directions, and we cope with this situation by
running the algorithm several times, namely once for each combination of permutations
of the heads and of the feet. Each combination corresponds to a generic direction that
is infinitesimally close to the non-generic direction. The largest number of combinations
is six, which we get for three-legged maxima. This is also how we decide the Mercedes
star property: each of the three feet is the answer to exactly two of the six antipodality
queries. Letting n be the number of edges, the algorithm takes time O(n log n) to answer
an antipodality query. Since we have O(n^4) candidates to test, this amounts to a total
running time of O(n^5 log n).
4.6 Experiments
We implemented the algorithm described in Section 4.5 and used it on surface representa-
tions of a few protein structures. We describe the findings to illustrate how the concepts
introduced in the earlier sections might be applied.
Elevation on the surface. We discuss the experimental findings for one chain of the protein
complex with PDB id 1brs, which we downloaded from the Protein Data Bank. It contains
864 atoms, not counting the hydrogens, which are too small to be resolved in the X-ray
experiment and are not part of the structure. The particular surface representation we use
is the molecular skin [55], which is similar to the better-known molecular surface [65]. The
reason for our choice is the availability of triangulating software and the guaranteed smooth
embedding. The computed triangulation, displayed in Figure 4.16, has slightly more than
fifty thousand vertices after some simplification. Given the triangulation of the skin surface,
we compute the elevation for each of its vertices and visualize it in Figure 4.16. Recall that
each vertex has a range of directions associated with it that make it critical, from which
we choose an arbitrary one to compute its elevation value.
Number of maxima. Table 4.2 gives the number of maxima of each type computed for
the skin triangulation of protein 1brs. We show in a separate row the number of additional
maxima paired by extended persistence, as introduced in Section 4.2.1. Since the genus of
this particular surface is zero, all these maxima lie on the convex hull of the surface.
#legs          one   two     three   four
#max (trad.)   5     3,617   728     1,103
#max (addl.)   15    0       6       0

Table 4.2: The number of maxima for the molecular skin of the 1brs protein structure obtained via traditional persistence (second row) and the additional maxima obtained by its extension (third row).
We notice that there are significantly more two-legged maxima than maxima of the other
types. The reason is perhaps the particular shape of molecules, in which covalently bonded atoms
Figure 4.14: The percentage of maxima with elevation exceeding the threshold marked on the vertical axis. From top to bottom: the curves for the three-legged, four-legged, and two-legged maxima.
form small dumbbells which invite two-legged maxima with one foot on each atom. These
dumbbells are rotationally symmetric and form surface patches with non-generic elevation
function, which further contributes to the abundance of two-legged maxima. The con-
figurations required to form one-legged and three-legged maxima are considerably more
demanding, but when they occur the maxima tend to have higher elevation. This obser-
vation is quantified in Figure 4.14, which sorts the maxima in the order of decreasing
elevation. We see that for each threshold, the fraction of three-legged maxima higher than
Figure 4.15: The one hundred pairs of maxima with the highest elevation. The heads are marked by light and the feet by dark dots.
that elevation is significantly larger than the fractions of two- and four-legged maxima.
The difference is even more pronounced for one-legged maxima of which four of the five
have elevation exceeding 5 Å. The statistics for other proteins are similar.
High elevation maxima. We are indeed mostly interested in high elevation maxima as
the others are likely consequences of insignificant surface fluctuations or artifacts of the
piecewise linear nature of the data. Figure 4.15 shows the top one hundred maxima on the
skin surface of 1brs. Each antipodal pair of maxima is represented by its one or two heads
and one, two, or three feet.
One might expect that the binding site of a protein would perhaps have more or higher
maxima. We did not observe any such trend in the few cases we studied. It seems that
maxima are more or less uniformly distributed over the surface. This should be contrasted
with the finding that in many cases the pocket with the largest volume identifies the location
of the binding site [130]. The elevation is indeed a less specific measurement with respect
to a single surface, and we expect its primary use to be in the study of interactions between
two or more shapes.
Meshes of different resolution. To see how the resolution of a surface mesh influences
the behavior of the elevation function, we generate two meshes for the molecular surface
of chain E of the protein complex with PDB id 1cho, using the msms program (also available
as part of the VMD package [105]). These two meshes, K_1 and K_2, have different numbers
of vertices, K_2 being the coarser of the two. Let M_1 (resp. M_2) be the set of maxima of
the elevation function from surface K_1 (resp. K_2), and M_1(δ) ⊆ M_1 (resp. M_2(δ) ⊆ M_2)
the subset of maxima with an elevation greater than some threshold δ. We show in Table 4.3
the sizes of M_1(δ) and M_2(δ) for various δ's. We note that the size of M_2(δ) (from the
coarser mesh) differs less from that of M_1(δ) as δ increases. Furthermore, Table 4.4 shows
that points from M_1(δ) roughly cover points from M_2(δ), and vice versa. In particular, a
point p ∈ M_2(δ) is λ-covered if there is some point q ∈ M_1(δ) such that ||p − q|| ≤ λ,
where λ > 0 is the covering radius. Table 4.4 shows the percentage of points from M_2(δ) that are
Threshold δ   0.0    0.1    0.2    0.3   0.5   0.8   1.0   2.0   4.0
|M_1(δ)|      2292   1445   1010   734   500   345   272   115   26
|M_2(δ)|      3714   2233   1290   931   489   357   287   104   34

Table 4.3: The number of maxima whose elevation is greater than the threshold, from surfaces K_1 and K_2.
λ-covered by M_1(δ), called the covering density of M_2(δ) over M_1(δ), for various δ's and
for a fixed covering radius λ (in Å). Similarly, we consider the covering density of M_1(δ)
over M_2(δ). The covering density increases in general as δ increases, indicating that larger
features are preserved better than smaller ones as the surface mesh becomes coarser.
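The covering density just described is simple to compute directly for point sets of the sizes in Table 4.3. A minimal sketch (our own naming, assuming maxima given as coordinate arrays):

```python
import numpy as np

def covering_density(P, Q, lam):
    """Fraction of the points in P that are lam-covered by Q, i.e. that
    have some point of Q within Euclidean distance lam."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if len(P) == 0:
        return 1.0
    # all pairwise distances; quadratic but fine for a few thousand maxima
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return float((d.min(axis=1) <= lam).mean())
```

With this convention, the density of M_2(δ) over M_1(δ) is `covering_density(M2_delta, M1_delta, lam)`.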
δ                                 0.1  0.2  0.3  0.5  0.8  1.0  2.0  4.0
density of M_2(δ) over M_1(δ)     …
density of M_1(δ) over M_2(δ)     …

Table 4.4: As δ changes (upper row), the covering density of M_2(δ) over M_1(δ) (middle row) and that of M_1(δ) over M_2(δ) (lower row).
4.7 Notes and Discussion
The main contribution of this chapter is the definition of elevation as a real-valued function
on a 2-manifold embedded in R^3 and the computation of all of its local maxima. The definition
of this function can be extended to a d-manifold embedded in R^{d+1}.

The logical next step in this research is the exploitation of the maxima in protein dock-
ing and other shape-matching problems. We will describe one such approach in Chapter 6.
It would also be worth exploring extensions of our results to manifolds with boundary. A
crucial first step will have to be the generalization of the concept of extended persistence
to these more general topological spaces. Another interesting direction of research is iden-
tifying "features" (such as those computed by the elevation function) directly from a point
cloud sampled from some underlying surface: surface reconstruction in general is hard for
undersampled and/or noisy point sets. However, it is easy to construct a simplicial complex
K (which may not be a manifold) that roughly describes the surface and may have a
different topology. By computing the points with maximal elevation on K, and keeping
only those with high elevation, it is still possible to identify important features of the
underlying surface.
Finally, the algorithm presented in Section 4.5 enumerates all local maxima of the
elevation function without computing the elevation function itself, other than at a collec-
tion of candidate points. This approach is suggested by the ambiguities that arise in the
definition of the elevation function for piecewise linear data. Unfortunately, it implies the
fairly high running time of O(n^5 log n) in the worst case. Can the maxima be enumerated
more efficiently than that? Is there an algorithm that enumerates all maxima above some
elevation threshold without computing the maxima below the threshold?
Figure 4.16: Visualization of elevation on the skin surface of protein 1brs. Roughly, the higher the elevation, the darker the color.
Chapter 5
Matching via Hausdorff Distance
5.1 Introduction
The problem of shape matching in two and three dimensions arises in a variety of applica-
tions, including computer graphics, computer vision, pattern recognition, computer aided
design, and molecular biology [17, 97, 148]. For example, proteins with similar shapes
are likely to have similar functionalities; therefore, classifying proteins (or their fragments)
based on their shapes is an important problem in computational biology. Similarly, the
proclivity of two proteins to bind with each other also depends on their shapes, so shape
matching is central to the protein docking problem in molecular biology [97].
Informally, the shape-matching problem can be described as follows: Given a distance
measure between two sets of objects in R^2 or R^3, determine a transformation, from an al-
lowed set, that minimizes the distance between the sets. In many applications the allowed
transformations are all possible rigid motions. However, in certain applications there are
constraints on the allowed transformations. For example, if matching the pieces of a jigsaw
puzzle, it is important that no two pieces overlap each other in their matched positions. An-
other example is the aforementioned docking problem, where two molecules bind together
to form a compound, and, clearly, at this docking position the molecules should occupy
disjoint portions of space [97]. Moreover, because of efficiency considerations, one some-
times restricts further the set of allowed transformations, most typically to translations
only.
Several distance measures between objects have been proposed, varying with the kind
of input objects and the application. One common distance measure is the Hausdorff dis-
tance [17], originally proposed for point sets. In this chapter we adopt this measure, extend
it to non-point objects (mainly, disks and balls), and apply it to several variants of the shape
matching problem, with and without constraints on the allowed transformations. We are
primarily interested in the case of balls because of molecular-biology applications, where
a molecule is typically modeled as a set of balls, with each atom represented by a ball [19].
Problem statement. Let A and B be two (possibly infinite) sets of geometric objects
(e.g., points, balls, simplices) in R^d, and let d(a, b) be a distance function between
the objects in A and in B. For a ∈ A, we define σ(a, B) = min_{b ∈ B} d(a, b). Similarly, we
define σ(b, A) = min_{a ∈ A} d(a, b), for b ∈ B. The directional Hausdorff distance between A
and B is defined as

h(A, B) = max_{a ∈ A} σ(a, B),

and the Hausdorff distance between A and B is defined as

H(A, B) = max { h(A, B), h(B, A) }.

(It is important to note that in this definition each object in A or in B is considered as a
single entity, and not as the set of its points.) In order to measure similarity between A
and B, we compute the minimum value of the Hausdorff distance over all translates of A
within a given set T ⊆ R^d of allowed translation vectors. Namely, we define

H_T(A, B) = min_{t ∈ T} H(A + t, B),

where A + t = { a + t | a ∈ A }. In our applications, T will either be the entire R^d or
F, the set of collision-free translates of A, at which none of its objects intersects any object of
B. The collision-free matching between objects is useful for applications (like the docking
problem) in which the goal is to determine a transformation t so that the shape of A + t
best complements that of B. We write H^F(A, B) for H_T(A, B) with T = F.
As already mentioned, our definition of the (directional) Hausdorff distance is slightly dif-
ferent from the one typically used in the literature [17], in which one considers the two
unions ∪A and ∪B as two (possibly infinite) point sets and computes the standard Hausdorff
distance between them:

H(∪A, ∪B) = max { h(∪A, ∪B), h(∪B, ∪A) },  where  h(∪A, ∪B) = max_{p ∈ ∪A} min_{q ∈ ∪B} ||p − q||.

We will denote ∪A (resp. ∪B) by Â (resp. B̂), and use the notation H(Â, B̂) for this
distance. Analogous meanings hold for the notations h(Â, B̂) and H_T(Â, B̂).
A drawback of the directional Hausdorff distance (and thus of the Hausdorff distance)
is its sensitivity to outliers in the given data. One possible approach to circumvent this
problem is to use "partial matching" [57], but then one has to determine how many (and
which) of the objects in A should be matched to B. Another possible approach is to use
the root-mean-square (rms, for brevity) Hausdorff distance between A and B, defined by

h_2(A, B) = ( (1/|A|) ∫_{a ∈ A} σ(a, B)^2 da )^{1/2},
H_2(A, B) = max { h_2(A, B), h_2(B, A) },

with an appropriate definition of integration (usually, summation over a finite set or
Lebesgue integration over infinite point sets). Define

H_{2,T}(A, B) = min_{t ∈ T} H_2(A + t, B).

Finally, we define the summed Hausdorff distance to be

h_1(A, B) = (1/|A|) ∫_{a ∈ A} σ(a, B) da,
H_1(A, B) = max { h_1(A, B), h_1(B, A) },

and similarly define H_{1,T}. Informally, H_1(A, B) can be regarded as an L_1-distance
over the sets of objects A and B. The two new definitions replace the L_∞ norm implicit
in the Hausdorff distance by the L_2 and L_1 norms, respectively.
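For finite point sets, the L_2 and L_1 variants just defined can again be computed directly. A hedged sketch (our naming; sums normalized by the number of points, matching the finite-sum reading of the integrals above):

```python
import math

def sigma(a, B):
    """Distance from a to its nearest neighbor in B."""
    return min(math.dist(a, b) for b in B)

def rms_hausdorff(A, B):
    """H_2: max of the two directional root-mean-square distances."""
    def h2(P, Q):
        return math.sqrt(sum(sigma(p, Q) ** 2 for p in P) / len(P))
    return max(h2(A, B), h2(B, A))

def summed_hausdorff(A, B):
    """H_1: max of the two directional averaged (L1) distances."""
    def h1(P, Q):
        return sum(sigma(p, Q) for p in P) / len(P)
    return max(h1(A, B), h1(B, A))
```

Unlike the plain Hausdorff distance, a single outlier changes these averaged variants by only O(1/|A|) times its distance, which is the robustness motivation given above.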
Prior work. It is beyond the scope of this section to discuss all the results on shape
matching. We refer the reader to [17, 97, 148] and references therein for a sample of
known results. Here we summarize known results on shape matching using the Hausdorff
distance measure.
Most of the early work on computing the Hausdorff distance focused on finite point sets.
Let A and B be two families of m and n points, respectively. In the plane, H(A, B)
can be computed in O((m + n) log(m + n)) time using Voronoi diagrams [14]. In R^3, it
can be computed using the range-searching data structure of Agarwal and Matoušek [9].
Huttenlocher et al. [107] showed how to compute the minimum Hausdorff distance under
translation, H_T(A, B), in the plane and in R^3, and Chew et al. [57] presented an
algorithm to compute it in R^d for any d. The minimum Hausdorff distance between
A and B under rigid motions in the plane can also be computed in polynomial time [106].
Faster approximation algorithms to compute H_T(A, B) were first proposed by Goodrich
et al. [95]. Aichholzer et al. proposed a framework of approximation algorithms using
reference points [12]. In the plane, their algorithm approximates the optimal Hausdorff dis-
tance over all translations within a constant factor in O((m + n) log(m + n)) time, and it
also handles rigid motions at a slightly higher cost. The reference-point approach can be
extended to higher dimensions. However, it neither approximates the directional Hausdorff
distance over a set of transformations, nor can it cope with the partial-matching problem.
Indyk et al. [110] study the partial matching problem: given a query threshold, com-
pute the maximum number of points of A that can be matched to points of B within
that threshold. They present algorithms for approximating the maximum-size partial
matching over the set of rigid motions in R^2 and in R^3, with running times depending
polynomially on the spread Δ, where Δ is the maximum of the spreads of the two point
sets¹. Their algorithm can be extended to approximate the minimum summed Hausdorff
distance over rigid motions. Similar results were independently achieved in [46] via a
different technique.
Algorithms have also been developed for computing h(A, B) and/or H(A, B) where A
and B are sets of segments in the plane, or sets of simplices in higher dimensions
[11, 14, 15]. Atallah computed H(A, B) for two convex polygons [25]. Agarwal et al. [11]
provide an algorithm for computing the minimum Hausdorff distance under translation for
sets of segments in the plane, and for the case where any rigid motion is allowed, the
minimum Hausdorff distance can likewise be computed in polynomial time (Chew et
al. [58]). Aichholzer et al. approximate the minimum Hausdorff distance under different
families of transformations for sets of points and segments in R^2, and sets of triangles
in R^3, using reference points [12]. Other than that, little is known about computing
h_T(A, B) or H_T(A, B) where A and B are simplices or other geometric shapes in higher
dimensions.
Our results. In this chapter, we develop efficient algorithms for computing H_F(A, B) and H_F^{k,l}(A, B) for sets of balls, and for approximating the minimum rms and summed Hausdorff distances, σ_T(A, B) and Σ_T(A, B), for sets of points in R^d. Consequently, the chapter consists of three parts: the first two deal with the two variants of the Hausdorff distance for balls, and the third studies the rms and summed Hausdorff-distance problems for point sets.
Let b(p, r) denote the ball in R^d of radius r centered at p. Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two families of balls in R^d, where a_i = b(p_i, r_i) and b_j = b(q_j, s_j), for each i and j. Let F be the set of all translation vectors t ∈ R^d such that no ball of A ⊕ t intersects any ball of B.
Section 5.2 considers the problem of computing the Hausdorff distance between two sets A and B of balls under the collision-free constraint, where the distance between two
¹The spread of a set of points is the ratio of its diameter to its closest-pair distance.
disjoint balls a = b(p, r) and b = b(q, s) is defined as ρ(a, b) = ||p − q|| − r − s. We can regard this distance as an additively weighted Euclidean distance between the centers of a and b, and it is a common way of measuring the distance between atoms in molecular biology [97]. In Section 5.2 we describe algorithms for computing H_F(A, B) in two and three dimensions. The running time is O(mn(m+n) log^2(mn)) in R^2, and O(m^2 n^2 (m+n) log^2(mn)) in R^3. The approach can be extended to solve the (collision-free) partial-matching problem under this variant of the Hausdorff distance at a slightly higher asymptotic cost.
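The collision-free distance between balls and the resulting Hausdorff distance can be stated compactly in code. The sketch below is a brute-force O(mn) evaluator for a fixed placement — not the chapter's algorithm — and the function names and the representation of a ball as a (center, radius) tuple are our own illustrative choices.

```python
import math

def rho(a, b):
    """Distance between two disjoint balls a = (center, r) and b = (center, s):
    the Euclidean distance between the centers minus both radii."""
    (p, r), (q, s) = a, b
    d = math.dist(p, q) - r - s
    assert d >= 0, "balls must be disjoint in the collision-free model"
    return d

def directed_hausdorff(A, B):
    """h(A, B) = max over a in A of min over b in B of rho(a, b)."""
    return max(min(rho(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max{h(A, B), h(B, A)}."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [((0.0, 0.0), 1.0), ((4.0, 0.0), 0.5)]
B = [((0.0, 3.0), 1.0), ((4.0, 3.0), 1.0)]
print(hausdorff(A, B))
```

Evaluating this for every candidate placement is of course infeasible; the point of the chapter is to search the space of translations without enumerating it.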
Section 5.3 considers the problem of computing H_T(A, B) and H_F(A, B), i.e., the Hausdorff distance between the union U_A of the balls in A and the union U_B of the balls in B, minimized over all translates of A in R^d or in F, respectively. We first describe an O(mn(m+n) log^2(mn))-time algorithm for computing H_T(A, B) and H_T^{k,l}(A, B) in R^2, which relies on several geometric properties of the union of disks. A straightforward extension of our algorithm to R^3 is harder to analyze and does not yield efficient bounds on its running time, so we consider approximation algorithms. In particular, given a parameter ε > 0, we compute a translation t, in time O((1/ε^2)(m+n) log^2(m+n)) in R^2 and in time O((1/ε^3)(m+n)^2 log(m+n)) in R^3, such that H(U_A ⊕ t, U_B) ≤ (1 + ε) H_T(A, B).
We also present a "pseudo-approximation" algorithm for computing H_F(A, B): given an ε > 0, the algorithm computes a region R ⊆ R^3, an ε-approximation of F (in a sense defined formally in Section 5.3), and returns a placement t ∈ R such that
H(U_A ⊕ t, U_B) ≤ (1 + ε) min_{t′ ∈ R} H(U_A ⊕ t′, U_B),
in time O(((m+n)^2/ε^3) log((m+n)/ε)) in R^3. This variant of approximation makes sense in applications where the data is noisy and shallow penetrations between objects are allowed, as is the case in the docking problem [97].
Finally, let A and B be two sets of points in R^d. Section 5.4 describes an algorithm
that computes an ε-approximation of σ_T(A, B) in time O((mn/ε^d) log(mn/ε)).² It also provides a data structure that can return in O(log(mn/ε)) time an ε-approximation of σ(A ⊕ t, B) for a query vector t ∈ R^d. In fact, we solve a more general problem, which is interesting in its own right. Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, we construct a decomposition of R^d into O(n/ε^d) cells that is an ε-approximation of each of the Voronoi diagrams of P_1, ..., P_k, in the sense defined in [24, 98]. Moreover, given a semigroup operation ⊛, we can preprocess this decomposition in O((n/ε^d) log(n/ε)) time, so that an ε-approximation of ⊛_{1≤i≤k} d(q, P_i), for a query point q, can be computed in O(log(n/ε)) time. We also extend the approach to ε-approximating the minimum summed Hausdorff distance Σ_T(A, B), at the cost of additional factors of 1/ε and log(mn) in the running time. This result relies on a dynamic data structure, which we propose, for maintaining an ε-approximation of the 1-median of a point set in R^d under insertion and deletion of points.
5.2 Collision-Free Hausdorff Distance between Sets of Balls
Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two sets of balls in R^d, d = 2, 3. For two disjoint balls a_i = b(p_i, r_i) and b_j = b(q_j, s_j), we define
ρ(a_i, b_j) = ||p_i − q_j|| − r_i − s_j,
namely, the (minimum) distance between a_i and b_j as point sets. Let F be the set of placements t of A such that no ball in A ⊕ t intersects any ball of B. In this section, we describe an exact algorithm for computing H_F(A, B), and show that it can be extended to partial matching.
²Indyk et al. [110] outline an approximation algorithm for computing Σ_T(A, B) without providing any details. We believe that, were the details of their algorithm worked out, the running time of our algorithm would be better. Moreover, our algorithm is more direct.
5.2.1 Computing H_F(A, B) in 2D and 3D
As is common in geometric optimization, we first present an algorithm for the decision problem: given a parameter δ > 0, we wish to determine whether H_F(A, B) ≤ δ. We then use the parametric-searching technique [11, 133] to compute H_F(A, B). Given δ > 0, for 1 ≤ i ≤ m, let V_i ⊆ R^d be the set of vectors t ∈ R^d such that
(V1) a_i ⊕ t does not intersect the interior of any b_j ∈ B;
(V2) min_{1≤j≤n} ρ(a_i ⊕ t, b_j) ≤ δ.
Let K_ij = {t ∈ R^d | ρ(a_i ⊕ t, b_j) ≤ δ} and L_ij = {t ∈ R^d | (a_i ⊕ t) ∩ int(b_j) ≠ ∅}; K_ij is the ball b(q_j − p_i, r_i + s_j + δ), and L_ij is the ball b(q_j − p_i, r_i + s_j). Then K_i = ∪_{j=1}^n K_ij is the set of vectors that satisfy (V2), and the interior of L_i = ∪_{j=1}^n L_ij is the set of vectors that violate (V1). Clearly, V_i = K_i \ int(L_i). Let
Φ(A, B) = ∩_{i=1}^m V_i = ∩_{i=1}^m ( K_i \ int(L_i) ).
See Figure 5.1 for an illustration. By definition, Φ(A, B) ⊆ F is the set of vectors t ∈ F such that h(A ⊕ t, B) ≤ δ. Similarly, we define
Φ(B, A) = {t ∈ F | h(B, A ⊕ t) ≤ δ}.
Thus H_F(A, B) ≤ δ if and only if Φ(A, B) ∩ Φ(B, A) ≠ ∅.
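For intuition, membership of a single candidate translation t in the feasible region can be tested directly from the definitions: for every ball of A, the shifted ball must stay at least r_i + s_j away (center to center) from every ball of B, and come within r_i + s_j + δ of at least one of them. The sketch below (with hypothetical names) checks exactly that for one vector t; the actual algorithm instead constructs the whole region Φ(A, B) at once.

```python
import math

def in_phi(t, A, B, delta):
    """Check whether translation t is feasible: for every ball (p, r) of A,
    the shifted ball avoids the interior of every ball (q, s) of B (no
    collision) and lies within collision-free distance delta of some ball."""
    for (p, r) in A:
        shifted = tuple(pc + tc for pc, tc in zip(p, t))
        hits_near = False
        for (q, s) in B:
            d = math.dist(shifted, q)
            if d < r + s:           # collision: t lies inside some L_ij
                return False
            if d <= r + s + delta:  # condition (V2) holds for this ball of A
                hits_near = True
        if not hits_near:           # a_i + t is farther than delta from all of B
            return False
    return True
```

This per-translation test takes O(mn) time; the decision procedure of this section avoids testing translations one by one by manipulating the regions K_i and L_i globally.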
Lemma 5.2.1 The combinatorial complexity of Φ(A, B) in R^2 is O(m^2 n).
PROOF. If an edge of ∂Φ(A, B) is not adjacent to any vertex, then it is the entire circle bounding a disk K_ij or L_ij. There are O(mn) such disks, so it suffices to bound the number of vertices of Φ(A, B). Let v be a vertex of Φ(A, B); v is either a vertex of V_i, for some 1 ≤ i ≤ m, or an intersection point of an edge of V_i and an edge of V_{i′}, for some 1 ≤ i < i′ ≤ m. In the latter case,
v ∈ ∂V_i ∩ ∂V_{i′} ⊆ ( ∂K_i ∪ ∂L_i ) ∩ ( ∂K_{i′} ∪ ∂L_{i′} ).
(a) (b)
Figure 5.1: (a) The disks L_ij (inner) and K_ij (outer) centered at q_j − p_i; (b) an example of V_i (dark region), the difference between K_i (the whole union) and the interior of L_i (inner light region).
In other words, such a vertex of Φ(A, B) lies on two of the boundaries ∂K_i, ∂L_i, ∂K_{i′}, ∂L_{i′}. Observe that a point of ∂K_i ∩ ∂K_{i′} that appears on ∂V_i ∩ ∂V_{i′} is also a vertex of K_i ∪ K_{i′}, and similarly for the other three pairs. Therefore, every vertex of Φ(A, B) of the latter kind is a vertex of K_i ∪ K_{i′}, K_i ∪ L_{i′}, L_i ∪ K_{i′}, or L_i ∪ L_{i′}, for some 1 ≤ i < i′ ≤ m. Since each of K_i, L_i, K_{i′}, L_{i′} is the union of a set of n disks, each of these four sets is the union of a set of 2n disks and thus has O(n) vertices [120]. Hence, Φ(A, B) has O(m^2 n) vertices.
Lemma 5.2.2 The combinatorial complexity of Φ(A, B) in R^3 is O(m^3 n^2).
PROOF. The number of faces or edges of Φ(A, B) that do not contain any vertex is O(m^2 n^2), since each such face or edge is defined by at most two balls in a family of O(mn) balls. We therefore focus on the number of vertices of Φ(A, B). As in the proof of Lemma 5.2.1, any vertex v of Φ(A, B) satisfies
v ∈ ( ∂K_i ∪ ∂L_i ) ∩ ( ∂K_{i′} ∪ ∂L_{i′} ) ∩ ( ∂K_{i″} ∪ ∂L_{i″} ),
for some 1 ≤ i < i′ < i″ ≤ m. Again, such a vertex is also a vertex of X_1 ∪ X_2 ∪ X_3, where X_1 (resp., X_2, X_3) is K_i or L_i (resp., K_{i′} or L_{i′}, K_{i″} or L_{i″}). Since the union of N balls in R^3 has O(N^2) vertices, X_1 ∪ X_2 ∪ X_3, being a union of 3n balls, has O(n^2) vertices, thereby implying that Φ(A, B) has O(m^3 n^2) vertices.
Similarly, we can prove that the complexity of Φ(B, A) is O(mn^2) in R^2 and O(m^2 n^3) in R^3. Extending the preceding arguments a little, we obtain the following.
Lemma 5.2.3 Φ(A, B) ∪ Φ(B, A) has combinatorial complexity O(mn(m+n)) in R^2, and O(m^2 n^2 (m+n)) in R^3.
Remark. The above argument in fact bounds the complexity of the arrangement A(K) of K = {K_ij, L_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ n}: since K consists of O(mn) disks, any two of whose bounding circles cross in at most two points, the entire arrangement has O(m^2 n^2) vertices in R^2.
We exploit a divide-and-conquer approach, combined with a plane sweep, to compute Φ(A, B), Φ(B, A), and their intersection in R^2. For example, to compute Φ(A, B) = ∩_i V_i, we recursively compute Φ_1 = ∩_{i ≤ m/2} V_i and Φ_2 = ∩_{i > m/2} V_i, and merge Φ(A, B) = Φ_1 ∩ Φ_2 by a plane sweep. The overall running time is O(mn(m+n) log(mn)).
To decide whether Φ(A, B) ∩ Φ(B, A) ≠ ∅ in R^3, it suffices to check whether
X_ij = ∂K_ij ∩ Φ(A, B) ∩ Φ(B, A)
is empty for every ball K_ij, L_ij ∈ K. Using the fact that the other balls of K meet the sphere ∂K_ij in a collection of spherical caps, we can compute X_ij in time O(mn(m+n) log(mn)), by the same divide-and-conquer approach as for computing Φ(A, B) ∩ Φ(B, A) in R^2. Therefore, we can determine in O(m^2 n^2 (m+n) log(mn)) time whether H_F(A, B) ≤ δ in R^3.
Finally, the optimization problem can be solved by the parametric-search technique [11]. In order to apply parametric search, we need a parallel version of the above decision procedure. However, the above divide-and-conquer paradigm uses a plane sweep during the merge stage, which is not easy to parallelize. We therefore instead adopt the algorithm of [11] for computing the union/intersection of two planar or spherical regions. It yields a parallel algorithm that determines whether Φ(A, B) ∩ Φ(B, A) is empty in O(log(mn)) time, using O(mn(m+n)) processors in R^2 and O(m^2 n^2 (m+n)) processors in R^3. This implies the following.
Theorem 5.2.4 Given two sets A and B of m and n disks (or balls), we can compute H_F(A, B) in time O(mn(m+n) log^2(mn)) in R^2, and in time O(m^2 n^2 (m+n) log^2(mn)) in R^3.
5.2.2 Partial matching
Extending the definition of partial matching in [110], we define the partial collision-free Hausdorff distance problem as follows.
Given an integer k, let h^k(A, B) denote the k-th largest value in the set {ρ(a_i, B) | 1 ≤ i ≤ m}; note that h(A, B) = h^1(A, B). We define h^l(B, A) in a fully symmetric manner, and then define H^{k,l}(A, B) and H_F^{k,l}(A, B) as above. The preceding algorithm can be extended to compute H_F^{k,l}(A, B) at a slightly higher asymptotic cost. We briefly illustrate the two-dimensional case. Let K = {K_ij, L_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ n} be as defined above, and let A(K) be the arrangement of K. For each cell C of A(K), let κ(C) be the number of V_i's that contain C. Note that for any vector t in a cell C ⊆ F with κ(C) ≥ m − k + 1, we have h^k(A ⊕ t, B) ≤ δ, and vice versa. Hence, we compute A(K) and κ(C) for each cell C of A(K), and then discard every cell C for which κ(C) < m − k + 1, as well as the cells not contained in F. The remaining cells form the set Φ_k = {t ∈ F | h^k(A ⊕ t, B) ≤ δ}. By the Remark following Lemma 5.2.3, A(K) has O(m^2 n^2) vertices, and κ can be computed for all cells by traversing the arrangement. Therefore, Φ_k can be computed in O(m^2 n^2 log(mn)) time. Similarly, we can compute Φ′_l = {t ∈ F | h^l(B, A ⊕ t) ≤ δ} in O(m^2 n^2 log(mn)) time, and we can then determine in O(m^2 n^2 log(mn)) time whether Φ_k ∩ Φ′_l ≠ ∅. Similar arguments solve the partial-matching problem in R^3. Putting everything together, we obtain the following.
Theorem 5.2.5 Let A and B be two families of m and n balls, respectively, and let k, l be integers. We can compute H_F^{k,l}(A, B) in O(m^2 n^2 log(mn)) time in R^2, and in O(m^3 n^3 log(mn)) time in R^3.
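For a fixed placement, the partial directed distance h^k is just an order statistic of the m ball-to-set distances, and can be sanity-checked by brute force in O(mn + m log m) time. The sketch below uses our own illustrative names; it is a stand-in for verification, not the arrangement-based algorithm.

```python
import math

def partial_directed_hausdorff(A, B, k):
    """h^k(A, B): the k-th largest of the values rho(a_i, B) = min_j rho(a_i, b_j),
    so up to k - 1 outlier balls of A are effectively ignored; k = 1 gives h(A, B).
    Balls are (center, radius) tuples; radius 0 gives plain point sets."""
    def rho(a, b):
        (p, r), (q, s) = a, b
        return math.dist(p, q) - r - s
    vals = sorted((min(rho(a, b) for b in B) for a in A), reverse=True)
    return vals[k - 1]
```

With radius-zero balls this reproduces the point-set partial Hausdorff distance of [110].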
5.3 Hausdorff Distance between Unions of Balls
In Section 5.3.1 we describe the algorithm for computing H_T(A, B) in R^2. The same approach can be extended to compute H_T^{k,l}(A, B) within the same asymptotic time complexity. In Section 5.3.2, we present approximation algorithms for the same problem in R^2 and R^3.
5.3.1 The exact 2D algorithm
Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two sets of disks in the plane. Write, as above, a_i = b(p_i, r_i), for i = 1, ..., m, and b_j = b(q_j, s_j), for j = 1, ..., n. Let U_A (resp., U_B) be the union of the disks in A (resp., B). As in Section 5.2, we focus on the decision problem for a given parameter δ > 0.
For any point x, we have
d(x, U_B) = min_{y ∈ U_B} ||x − y|| = min_{1≤j≤n} d(x, b_j) = min_{1≤j≤n} max{ ||x − q_j|| − s_j, 0 }.
This value is greater than δ if and only if
min_{1≤j≤n} ||x − q_j|| − s_j > δ.
In other words, h(U_A ⊕ t, U_B) ≤ δ if and only if every point x ∈ U_A satisfies x + t ∈ ∪_{j=1}^n b^δ_j, where b^δ_j = b(q_j, s_j + δ) is the disk b_j expanded by δ. Let
T = {t | h(U_A ⊕ t, U_B) ≤ δ};
T is the set of all translations t such that U_A ⊕ t ⊆ U^δ_B, where U^δ_B = ∪_j b^δ_j. Our decision procedure computes the set T and the analogously defined set
T′ = {t | h(U_B, U_A ⊕ t) ≤ δ},
and then tests whether T ∩ T′ ≠ ∅. To understand the structure of T, we first study the case in which A consists of just one disk a, with center p and radius r. For simplicity of notation, we temporarily denote U^δ_B by U. Let V be the set of vertices of ∂U and E the set of (relatively open) edges of ∂U; |V| + |E| = O(n) [120].
Consider the Voronoi diagram Vor(V ∪ E) of the boundary features of U, clipped to within U. This is a decomposition of U into cells, so that, for each w ∈ V ∪ E, the cell τ(w) is the set of points x ∈ U such that d(x, w) ≤ d(x, w′) for all w′ ∈ V ∪ E. The diagram is closely related to the medial axis of ∂U. See Figure 5.2(a). For each e ∈ E, let σ(e) denote the circular sector spanned by e within the expanded disk b^δ_j whose boundary contains e, and let S = ∪_{e ∈ E} σ(e). The diagram has the following structure; a variant of the following lemma was observed in [19].
Lemma 5.3.1 (a) For each e ∈ E, we have τ(e) = σ(e).
(b) For each v ∈ V, we have τ(v) = Vor(v) \ S, where Vor(v) is the Voronoi cell of v in the Voronoi diagram Vor(V) of V. Moreover, τ(v) is a convex polygon.
Lemma 5.3.1 implies that Vor(V ∪ E) yields a convex decomposition of U of linear size. Returning to the structure of T: by definition, a ⊕ t ⊆ U if and only if p + t ∈ U and d(p + t, w) ≥ r, where w is the feature of ∂U whose cell contains p + t. This implies that the set T(a) of all translations t of a for which a ⊕ t ⊆ U is given by
T(a) = ∪_{w ∈ V ∪ E} τ_r(w),
where
τ_r(w) = {t | p + t ∈ τ(w), d(p + t, w) ≥ r}.
(a) (b) (c)
Figure 5.2: (a) Medial axis (dotted segments) of the union of four disks centered at the solid points: the Voronoi diagram decomposes the union into cells; (b) shrinking by r the Voronoi cell τ(w) of each boundary element w of the union; (c) a light-colored disk bounds a convex arc, and a dark disk bounds a concave arc.
For e ∈ E, τ_r(e) is the sector obtained from τ(e) by shrinking it by distance r towards its center. For v ∈ V, τ_r(v) = τ(v) \ b(v, r). See Figure 5.2(b) for an illustration.
Now return to the original case in which A consists of m disks; we obtain
T = ∩_{i=1}^m T(a_i).
Note that each T(a_i) is bounded by O(n) circular arcs, some of which are convex (those bounding shrunk sectors) and some concave (those bounding shrunk Voronoi cells of vertices). Each convex arc lies on a circle bounding a disk b(q_j − p_i, s_j + δ − r_i), for some 1 ≤ j ≤ n, while each concave arc lies on a circle bounding a disk b(v − p_i, r_i), for some v ∈ V. Furthermore, since T(a_i) is obtained by removing all translations that bring p_i within distance less than r_i of ∂U, we have: (i) b(q_j − p_i, s_j + δ − r_i) ⊆ T(a_i); and (ii) int b(v − p_i, r_i) ∩ T(a_i) = ∅. See Figure 5.2(c) for an illustration.
Lemma 5.3.2 For any pair of disks a_1, a_2 ∈ A, the complexity of T(a_1) ∩ T(a_2) is O(n).
PROOF. Clearly, T(a_1) ∩ T(a_2) is bounded by circular arcs, whose endpoints are either vertices of T(a_1) or T(a_2), or intersection points between an arc of ∂T(a_1) and an arc of ∂T(a_2). It suffices to estimate the number of vertices of the latter kind.
Consider the set D of the O(n) disks
D = { b(q_j − p_1, s_j + δ − r_1), b(q_j − p_2, s_j + δ − r_2) | 1 ≤ j ≤ n } ∪ { b(v − p_1, r_1), b(v − p_2, r_2) | v ∈ V }.
We claim that any intersection point between two arcs, from ∂T(a_1) and ∂T(a_2) respectively, lies on ∂(∪D). Indeed, assume that x is such an intersection point that does not lie on ∂(∪D). Then there is a disk D ∈ D that contains x in its interior. There are two possibilities for the choice of D.
(i) D = b(q_j − p_1, s_j + δ − r_1) (resp., D = b(q_j − p_2, s_j + δ − r_2)) for some 1 ≤ j ≤ n. Such a disk bounds some convex arc of ∂T(a_1) (resp., ∂T(a_2)), and D ⊆ T(a_1) (resp., D ⊆ T(a_2)). As such, x cannot appear on the boundary of T(a_1) (resp., T(a_2)), contrary to assumption.
(ii) D = b(v − p_1, r_1) (resp., D = b(v − p_2, r_2)) for some v ∈ V. Recall that V is the set of vertices on the boundary ∂U. Since x lies in the interior of D, the vertex v lies in the interior of a_1 ⊕ x (resp., a_2 ⊕ x), so a_1 ⊕ x (resp., a_2 ⊕ x) cannot be fully contained in U, implying that x ∉ T(a_1) (resp., x ∉ T(a_2)). Contradiction.
Thus we have proved the claim by contradiction. It then follows, using the bound of [120], that the number of intersections under consideration is at most the complexity of ∪D, which is O(|D|) = O(n).
Each vertex of T is also a vertex of some T(a_i) ∩ T(a_{i′}). Applying the preceding lemma to all O(m^2) pairs a_i, a_{i′}, we obtain the following.
Lemma 5.3.3 The complexity of T is O(m^2 n), and it can be computed in O(m^2 n log(mn)) time.
Similarly, the set T′ has complexity O(mn^2) and can be computed in time O(mn^2 log(mn)). Finally, we can determine whether T ∩ T′ ≠ ∅, by a plane sweep, in time O(mn(m+n) log(mn)). Using parametric search as in [11], H_T(A, B) can be computed in O(mn(m+n) log^2(mn)) time.
To compute H_T^{k,l}(A, B), we follow the same approach as for computing H_F^{k,l}(A, B) in the preceding section. Combined with an argument similar to the one above, we can show the following.
Theorem 5.3.4 Given two families A and B of m and n disks in R^2, we can compute both H_T(A, B) and H_T^{k,l}(A, B) in time O(mn(m+n) log^2(mn)).
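As a concrete, heavily simplified stand-in for the sweep-based machinery, the directed Hausdorff distance between unions of disks can be estimated by sampling U_A on a dense grid and evaluating the closed-form distance to U_B; the result is accurate only to roughly one grid step, and all names below are our own.

```python
import math

def dist_to_union(x, B):
    """d(x, U_B) for a union of disks B = [(center, radius), ...]:
    min over disks of max(||x - q|| - s, 0)."""
    return min(max(math.dist(x, q) - s, 0.0) for (q, s) in B)

def sampled_directed_union_hausdorff(A, B, t, step=0.05):
    """Estimate h(U_A + t, U_B) by maximizing d(., U_B) over grid samples of
    U_A -- a brute-force check, not the exact plane-sweep algorithm."""
    best = 0.0
    for (p, r) in A:
        k = max(1, int(2 * r / step))
        for i in range(k + 1):
            for j in range(k + 1):
                x = (p[0] - r + 2 * r * i / k, p[1] - r + 2 * r * j / k)
                if math.dist(x, p) <= r:  # keep only samples inside the disk
                    xt = (x[0] + t[0], x[1] + t[1])
                    best = max(best, dist_to_union(xt, B))
    return best
```

Note that the maximum of d(·, U_B) over U_A need not lie on ∂U_A, which is why the sketch samples the full disks rather than only their boundaries.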
5.3.2 Approximation algorithms
No good bounds are known for the complexity of the Voronoi diagram of the boundary of the union of n balls in R^3, or, more precisely, for the complexity of the portion of the diagram inside the union [19]. Hence, a naïve extension of the preceding exact algorithm to R^3 yields an algorithm whose running time is hard to calibrate, and only rather weak upper bounds can be derived. We therefore resort to approximation algorithms.
Approximating H_T(A, B) in R^2 and R^3. Given a parameter ε > 0, we wish to compute a translation t of A such that H(U_A ⊕ t, U_B) ≤ (1 + ε) H_T(A, B), i.e., H(U_A ⊕ t, U_B) is an ε-approximation of H_T(A, B). Our approximation algorithm for H_T(A, B) follows the same approach as the one used in [12, 14]. That is, let ref(A) (resp., ref(B)) denote the bottom-left corner, called the reference point, of the axis-parallel bounding box of U_A (resp., U_B). Set t_0 = ref(B) − ref(A). It is shown in [14] that H(U_A ⊕ t_0, U_B) ≤ c_d · H_T(A, B), for a constant c_d that depends only on the dimension. Computing t_0 takes O(m + n) time. We compute H(U_A ⊕ t_0, U_B) using the parametric-search technique [11], which is based on the following simple implementation of the decision procedure.
Put A^δ = { b(p_i, r_i + δ) | 1 ≤ i ≤ m } and B^δ = { b(q_j, s_j + δ) | 1 ≤ j ≤ n }. For a given parameter δ > 0, we observe that H(U_A ⊕ t_0, U_B) ≤ δ if and only if U_A ⊕ t_0 ⊆ U_{B^δ} and U_B ⊆ U_{A^δ} ⊕ t_0.
To test whether U_A ⊕ t_0 ⊆ U_{B^δ}, we compute the union of the balls in (A ⊕ t_0) ∪ B^δ and check whether any ball of A ⊕ t_0 appears on its boundary. If none does, then U_A ⊕ t_0 ⊆ U_{B^δ}. Similarly, we test whether U_B ⊆ U_{A^δ} ⊕ t_0. The total time spent is O((m+n) log(m+n)) in R^2, and O((m+n)^2) in R^3.
In order to compute an ε-approximation of H_T(A, B) from this constant-factor approximation, we use the standard trick of placing a grid in the neighborhood of t_0 and returning the smallest H(U_A ⊕ t, U_B), where t ranges over the grid points. We conclude the following.
Theorem 5.3.5 Given two sets A and B of m and n balls, respectively, and ε > 0, an ε-approximation of H_T(A, B) can be computed in O((1/ε^2)(m+n) log^2(m+n)) time in R^2, and in O((1/ε^3)(m+n)^2 log(m+n)) time in R^3.
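The constant-factor-seed-then-grid strategy can be sketched as follows. Both helper names are our own, and the cost callback is an illustrative stand-in: in the actual algorithm each candidate is scored by an exact, parametric-search-based Hausdorff computation.

```python
import math

def reference_point(S):
    """Bottom-left corner of the axis-parallel bounding box of a union of
    disks S = [(center, radius), ...]."""
    return (min(p[0] - r for (p, r) in S), min(p[1] - r for (p, r) in S))

def refine_over_grid(A, B, cost, radius, step):
    """Seed with the constant-factor translation t0 = ref(B) - ref(A), then
    return the best translation on a grid of spacing `step` within distance
    `radius` of t0; `cost` scores one candidate translation."""
    ax, ay = reference_point(A)
    bx, by = reference_point(B)
    t0 = (bx - ax, by - ay)
    k = int(radius / step)
    candidates = [(t0[0] + i * step, t0[1] + j * step)
                  for i in range(-k, k + 1) for j in range(-k, k + 1)]
    return min(candidates, key=cost)
```

Choosing the grid radius proportional to the seed's cost and the spacing proportional to ε times that cost is what turns the constant factor into a 1 + ε factor.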
Pseudo-approximation for H_F(A, B). Currently we do not have an efficient algorithm to ε-approximate H_F(A, B) in R^3. Instead, we present a "pseudo-approximation" algorithm in the following sense.
Let N = {t | (U_A ⊕ t) ∩ U_B ≠ ∅} be the set of all placements of A at which U_A intersects U_B, and let F = cl(R^3 \ N). Clearly, H_F(A, B) = min_{t ∈ F} H(U_A ⊕ t, U_B). For a parameter ε > 0, let
N_ε = {t | ||p_i + t − q_j|| < (1 − ε)(r_i + s_j) for some i, j},
the set of placements at which some pair a_i ⊕ t, b_j penetrates to depth more than ε(r_i + s_j), and let F_ε = cl(R^3 \ N_ε) ⊇ F. We call a region R ⊆ R^3 ε-free if F ⊆ R ⊆ F_ε.
This notion of approximating F is motivated by applications in which the data is noisy and/or shallow penetration is allowed. For example, each atom in a protein is in fact a "fuzzy" ball instead of a hard ball [97]. We can model this fuzziness by allowing any atom b(q, s) ∈ B to be intersected by other atoms, but only within the shell b(q, s) \ b(q, (1 − ε)s), for some ε > 0. In this way, the atoms of two docking molecules may penetrate a little in the desired placement. Although F can have large complexity, namely up to Θ(m^2 n^2) in R^3, we present a technique for constructing an ε-free region R of considerably smaller complexity. We then compute R and a placement t_R ∈ R such that
H(U_A ⊕ t_R, U_B) ≤ (1 + ε) min_{t ∈ R} H(U_A ⊕ t, U_B).
We refer to such an approximation H(U_A ⊕ t_R, U_B) as a pseudo-ε-approximation of H_F(A, B).
Lemma 5.3.6 An ε-free region R of size O(mn/ε^3) can be computed in time O((mn/ε^3) log(mn/ε)).
PROOF. Let K = { K_ij = b(q_j − p_i, r_i + s_j) | 1 ≤ i ≤ m, 1 ≤ j ≤ n }; N is the union of the balls in K. We insert each ball K_ij ∈ K into an oct-tree T. Let C_v be the cube associated with a node v of T. In order to insert K_ij, we visit T in a top-down manner. Suppose we are at a node v. If C_v ⊆ K_ij, we mark v black and stop. If C_v ∩ K_ij ≠ ∅ and the size of C_v is at least (ε/c)(r_i + s_j), for an appropriate constant c, then we recursively visit the children of v. Otherwise, we stop. After we insert all balls from K, if all eight children of a node are marked black, we mark that node black as well. Let M = {v_1, v_2, ...} be the set of highest marked nodes, i.e., each v ∈ M is marked black but none of its ancestors is. It is easy to verify that each K_ij marks at most O(1/ε^3) nodes black, as the nodes it marks are disjoint and of size at least (ε/2c)(r_i + s_j); thus |M| = O(mn/ε^3). The whole construction takes O((mn/ε^3) log(mn/ε)) time and, with a proper choice of c, N_ε ⊆ ∪_{v ∈ M} C_v ⊆ N. Set R = cl(R^3 \ ∪_{v ∈ M} C_v); it is an ε-free region, as claimed.
Furthermore, let ref(A), ref(B), and t_0 = ref(B) − ref(A) be as defined earlier in this section. We prove the following result.
Lemma 5.3.7 Let t_1 be the point of R closest to t_0. Then
H(U_A ⊕ t_1, U_B) = O(H_F(A, B)).
PROOF. Let δ* = H_F(A, B), and let t* ∈ F be the placement with H(U_A ⊕ t*, U_B) = δ*. Then
H(U_A ⊕ t_1, U_B) ≤ H(U_A ⊕ t*, U_B) + ||t_1 − t*|| ≤ δ* + ||t_1 − t_0|| + ||t_0 − t*||.
A result in [12] implies that ||t_0 − t*|| ≤ c · δ*, for a constant c. On the other hand, since t* ∈ F ⊆ R, we have ||t_1 − t_0|| ≤ ||t* − t_0|| ≤ c · δ*. Hence,
H(U_A ⊕ t_1, U_B) ≤ (1 + 2c) · δ* = O(H_F(A, B)).
The point t_1 of R closest to t_0 can be computed as follows. Recall that, in Lemma 5.3.6, R = cl(R^3 \ ∪_{v ∈ M} C_v). The set Q = ∪_{v ∈ M} C_v consists of cubes that are disjoint except along their boundaries. We first check whether t_0 ∈ Q by a point-location operation. If the answer is no, then t_0 ∈ R, and we return t_1 = t_0. Otherwise, t_1 is the point of ∂Q closest to t_0; in this case, t_1 is either a vertex of a cube of M, or lies in the relative interior of an edge or a face of a cube of M. Given a node v ∈ M, for each boundary feature g of C_v — that is, a face, an edge, or a vertex of C_v — we compute the point of g closest to t_0; let P_v be the resulting set of closest points. Next, for each x ∈ P_v, we check whether x ∈ ∂Q by visiting the neighboring marked nodes that also contain x; this can again be achieved by point-location operations. Finally, we traverse all nodes of M and, among all those points of the P_v's that lie on ∂Q (and thus on ∂R), return the one closest to t_0. There are O(mn/ε^3) cubes, and each has a constant number of boundary features. Furthermore, at most a constant number of nodes of M contain any given point, and each point location takes O(log(mn/ε)) time. Hence, t_1 can be computed in O((mn/ε^3) log(mn/ε)) time.
We can compute H(U_A ⊕ t_1, U_B) in O((m+n)^2 log(m+n)) time, as described earlier in this section, so we obtain a constant-factor pseudo-approximation of H_F(A, B) within this time bound. Furthermore, we can again draw a grid around t_1 and compute an ε-approximation of min_{t ∈ R} H(U_A ⊕ t, U_B). We obtain the following.
Theorem 5.3.8 Given A and B in R^3 and ε > 0, we can compute, in O((mn/ε^3) log(mn/ε) + ((m+n)^2/ε^3) log(m+n)) time, an ε-free region R ⊆ R^3 and a placement t_R ∈ R of A such that H(U_A ⊕ t_R, U_B) ≤ (1 + ε) min_{t ∈ R} H(U_A ⊕ t, U_B).
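The cube-marking step behind the ε-free region can be illustrated with a uniform grid in two dimensions: cells lying wholly inside some forbidden ball b(q_j − p_i, r_i + s_j) are marked, and the complement of the marked cells plays the role of R. This is an illustrative stand-in only — a real implementation would use the hierarchical oct-tree so that the number of marked cells stays near O(mn/ε^3) — and the names are our own.

```python
import math

def eps_free_region_cells(A, B, bound, cell):
    """Mark the cells of a uniform grid over [-bound, bound]^2 (cell size
    `cell`) that lie entirely inside some forbidden ball
    b(q - p, r + s) of translation space; the unmarked cells form an
    approximately collision-free region."""
    marked = set()
    k = int(2 * bound / cell)
    half_diag = cell * math.sqrt(2) / 2
    for (p, r) in A:
        for (q, s) in B:
            center = (q[0] - p[0], q[1] - p[1])
            radius = r + s
            for i in range(k):
                for j in range(k):
                    z = (-bound + (i + 0.5) * cell, -bound + (j + 0.5) * cell)
                    if math.dist(z, center) + half_diag <= radius:
                        marked.add((i, j))  # cell fully inside forbidden ball
    return marked
```

Shrinking the cell size in proportion to ε·(r + s) is what bounds the uncovered shell near each forbidden sphere, mirroring the stopping rule of the oct-tree insertion.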
5.4 RMS and Summed Hausdorff Distance between Point Sets
We first establish a result on the simultaneous approximation of the Voronoi diagrams of several point sets, which we believe to be of independent interest, and then apply this result to approximate σ_T(A, B) and Σ_T(A, B) for point sets A = {p_1, ..., p_m} and B = {q_1, ..., q_n} in any dimension.
5.4.1 Simultaneous approximation of Voronoi diagrams
Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, and a parameter ε > 0, we wish to construct a subdivision of R^d so that, for any q ∈ R^d, we can quickly compute points φ_i(q) ∈ P_i, for all 1 ≤ i ≤ k, with ||q − φ_i(q)|| ≤ (1 + ε) d(q, P_i), where d(q, P_i) is, as defined earlier, min_{p ∈ P_i} ||q − p||. Our data structure is based on a recent result of Arya and Malamatos [24]: given a set P of n points and a parameter ε > 0, they construct a partition Π of R^d into O(n/ε^d) cells — each cell Δ ∈ Π is the region lying between two nested hypercubes (the inner hypercube may be empty) — and associate with each cell Δ a point φ(Δ) ∈ P, so that for any point q ∈ Δ, ||q − φ(Δ)|| ≤ (1 + ε) d(q, P). Π is the partition induced by the leaves of a compressed quadtree T [146], built on an initial hypercube H that contains P. Π and T can be constructed in O((n/ε^d) log(n/ε)) time, and the cell of Π containing a query point can be located in O(log(n/ε)) time.
Let H be a hypercube containing ∪_{i=1}^k P_i. For each i, we construct the above compressed quadtree T_i for the point set P_i, and let Π_i be the resulting subdivision. We then merge T_1, ..., T_k into a single compressed quadtree T [146], and thus effectively overlay Π_1, ..., Π_k. In particular, we start with T_1 and insert the cells of Π_i one by one, for i = 2, ..., k. We refine T after each insertion so that we still maintain a compressed-quadtree structure [24]. Since all the T_i's are built on the same initial hypercube H, the hypercubes involved in any two cells encountered during this process are either disjoint or nested. Hence, each insertion creates at most a constant number of new leaves. Let Π be the resulting overlay of Π_1, ..., Π_k; Π is a refinement of each Π_i, and |Π| = O(n/ε^d). Since the merged tree T is also a compressed quadtree, the cell of Π containing any query point can be computed in O(log(n/ε)) time.
For any cell Δ ∈ Π, let φ_i(Δ) = φ(Δ_i) denote the point associated with the cell Δ_i ∈ Π_i that contains Δ. Recall that, for any point q ∈ Δ, φ_i(Δ) is an ε-nearest neighbor of q in P_i, i.e.,
||q − φ_i(Δ)|| ≤ (1 + ε) d(q, P_i).
If we stored all k points φ_1(Δ), ..., φ_k(Δ) at each cell Δ ∈ Π (i.e., in the leaf nodes of T), we would need Θ(kn/ε^d) space, which we cannot afford. So we instead store the φ_i's at appropriate internal nodes of T. More specifically, for a fixed 1 ≤ i ≤ k and a cell Δ_i ∈ Π_i, let u be the node of the merged tree T associated with Δ_i, and let T_u be the subtree of T rooted at u. For any cell Δ ∈ Π associated with a node of T_u, φ_i(Δ) = φ(Δ_i). We therefore store φ(Δ_i) at u, instead of storing it at all the leaf nodes of T_u. The total storage needed for the φ_i's is O(Σ_{i=1}^k |Π_i|) = O(n/ε^d). To answer a query for a point q lying in a cell Δ ∈ Π, we collect the stored points φ_i(Δ), 1 ≤ i ≤ k, while traversing the path from the root of T to the leaf associated with Δ. As each φ_i is stored only once along any root-to-leaf path of T, we conclude the following.
Theorem 5.4.1 Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, and a parameter ε > 0, we can compute in O((n/ε^d) log(n/ε)) time a subdivision Π of R^d of size O(n/ε^d), so that, for any point q ∈ R^d, one can ε-approximate d(q, P_i), for all
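A toy version of the simultaneous approximate-Voronoi structure can be built with a uniform grid in place of the compressed quadtree: each cell stores, for every point set, the point nearest to the cell center, which is within one cell diagonal (additively) of the true nearest neighbor and hence a (1+ε)-nearest neighbor for queries sufficiently far from the set. All names here are illustrative, and the uniform grid forgoes the size guarantee of the quadtree construction.

```python
import math

def build_avd(point_sets, bound, cell):
    """For every cell of a uniform grid over [-bound, bound]^2, store, for
    each point set P_i, the point of P_i nearest to the cell center."""
    k = int(2 * bound / cell)
    table = {}
    for i in range(k):
        for j in range(k):
            z = (-bound + (i + 0.5) * cell, -bound + (j + 0.5) * cell)
            table[(i, j)] = [min(P, key=lambda p: math.dist(z, p))
                             for P in point_sets]
    return table

def query_avd(table, bound, cell, q):
    """Return, for each point set, the stored representative of q's cell."""
    i = int((q[0] + bound) / cell)
    j = int((q[1] + bound) / cell)
    return table[(i, j)]
```

The point of Theorem 5.4.1 is precisely to avoid this table's Θ(k) storage per cell, by hanging each representative at one internal quadtree node instead of at every leaf below it.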
1 ≤ i ≤ k, in O(log(n/ε) + k) time.
5.4.2 Approximating σ_T(A, B)
For 1 ≤ i ≤ m, let B_i = { q_j − p_i | 1 ≤ j ≤ n }; note that d(p_i + t, B) = d(t, B_i) for any translation t. We construct the preceding decomposition, denoted Π_A, and the associated compressed quadtree T_A, for the family {B_1, ..., B_m}, with the given parameter ε; |Π_A| = O(mn/ε^d). Define
s_A(t) = Σ_{i=1}^m d^2(p_i + t, B) = Σ_{i=1}^m d^2(t, B_i),
define s_B(t) = Σ_{j=1}^n d^2(q_j, A ⊕ t) symmetrically, and let
σ(A ⊕ t, B) = ( (s_A(t) + s_B(t)) / (m + n) )^{1/2}.
For each cell Δ ∈ Π_A, define
F_Δ(t) = Σ_{i=1}^m ||t − φ_i(Δ)||^2.
By construction, for any t ∈ Δ,
s_A(t) ≤ F_Δ(t) = Σ_{i=1}^m ||t − φ_i(Δ)||^2 ≤ (1 + ε)^2 Σ_{i=1}^m d^2(t, B_i) = (1 + ε)^2 s_A(t),
implying that F_Δ approximates s_A on Δ up to the factor (1 + ε)^2. Hence, it suffices to store F_Δ at each cell Δ ∈ Π_A. Since F_Δ is a quadratic polynomial in the coordinates of t, it can be stored using O(1) space (where the constant depends on d) and updated in O(1) time for each change of a single φ_i(Δ).
If we computed F_Δ for each cell Δ ∈ Π_A independently, the total time would be O(m · |Π_A|) = O(m^2 n / ε^d). We therefore proceed as follows. We perform an in-order traversal of the compressed
quadtree T_A. For the cell Δ associated with the first leaf of T_A visited by the traversal, we compute F_Δ in O(m) time. For each subsequent leaf, we compute F_Δ from the value computed previously. Suppose we are currently visiting a cell Δ of Π_A; let Δ′ be the previous cell visited by the traversal, let z (resp., z′) be the leaf associated with Δ (resp., Δ′), and let
I_Δ = { i | φ_i(Δ) ≠ φ_i(Δ′) }.
The values φ_i(Δ), for all i ∈ I_Δ, are stored along the path from the nearest common ancestor of z and z′ to the leaf z. Since
F_Δ(t) = F_{Δ′}(t) + Σ_{i ∈ I_Δ} ( ||t − φ_i(Δ)||^2 − ||t − φ_i(Δ′)||^2 ),
we can compute F_Δ from F_{Δ′} in O(|I_Δ| + 1) time. As Σ_Δ |I_Δ| = O(|Π_A|), the total time required to compute all the F_Δ's is O(mn/ε^d).
Next, we compute, in � � � � � � time, a subdivision � � on the family � � � � � � � � �� � � � � , for
� � � �� , and a quadratic function
�� � - � for each cell � � � � so that�� � - � � � � � � � � � � � � � � � � � We overlay � � and � � . The same argument as the one used to
bound the complexity of � � shows that the resulting overlay � has � � � � � � cells and
that it can be computed in � � � � � � � ��� � � � � � � time. Finally, for each cell � in the
overlay, we compute
� � � %(' � � # $�*) � � %�� �
� �� � - � � � � �� � � �� � - � � � � � � �� %(' � � # $
�*) � � � � � � � ��� � � �����
and return
� #%$� )�� � ���� � � � ����� � � ��� � � � ����������
�
Hence, we obtain the following.
Theorem 5.4.2 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can:
i. compute a vector � � � � � in � � � � � � � ��� � � � � � � time, so that
� ���� � � � � ��� � � ��� � � � ���������
ii. construct a data structure of size � � � � � � , in time � � � � � � � ��� � � � � � � , so that
for any query vector � � � � , we can � -approximate � ��� � � ����� in � ��� � � � � � �
time.
5.4.3 Approximating the summed-Hausdorff distance
Modifying the above scheme, we approximate � � ����� ��� as follows. Let � � , � � , � � , and
� � be as defined above. We define
� � � � � ���� � �
� � � ��� � � � � � � ��� � � ����� �
� � � � � ���� � �
� � � � � � � � �� � ����� � � � � � �
For each cell � � � � , let
�� � - � � � � ���� � �
� � � � 1 � � � � � � � � � � � � ����� � � �����
and for each cell � � � � , let
�� � - � � � � ���� � �
� � � � 1 � � � � � � � � � � � � � � � � � � � � � �
As above, we overlay � � and � � . For each cell � in the overlay, we wish to compute
� � � � # $�*) � �
%�� � � �� � - � � � � �
�
�
�� � - � � � ��� �
However, since�� � - � and
�� � - � are not simple algebraic functions, we do not know how to
compute, store and update � � efficiently. Nevertheless, we can compute an � -approximation
of � � that is easier to handle. More precisely, for a given set � of points in � � , define the
1-median function
� � � � � � � � �� ) � �
� � � � � �
For any � � � � ,�� � - � � � � � � � ��� � � � � � , where
� � � � � �,1 � � � � � � � � � � . The
same is true for�� � - � , where � � � � . In Section 5.4.4, we describe a dynamic data
structure that, given a point set � of size � , maintains an � -approximation of the function
� � � � � � � as a function consisting of � � � � � � � ��� � � � � � � pieces; the domain of each piece
is a � -dimensional (or the complement of a � -dimensional) hypercube. A point can be
inserted into or deleted from � in � ��� � � � � ��� � � � � � � � � � � time. Furthermore, given
two point sets � and � in � � , this data structure can maintain an � -approximation of
� %�� � � � � � � � � � � � � � � � � � � within the same time bound.
Using this data structure, we can traverse all cells of the overlay of � � and � � , as in
Section 5.4.2, and compute an � -approximation of�� � - � and
�� � - � , (thus roughly a ( � )-
approximation of � � ) for each cell � of the overlay. However, given two adjacent leaves
during the traversal, associated with cells � and � � respectively, we now spend
� � � - � � � � � � � � � �� ��� � � � � � � � �
time to compute an � -approximation of�� � - � from that of
�� � - � . Putting everything to-
gether, we conclude the following.
Theorem 5.4.3 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can compute:
i. a vector � ��� � � in � � ���� � �� ��� � � � �� � � time so that
� � ��� � � � � ��� � � � � � � � �����������
ii. a data structure of � � ���� � � ��� � � � �� � � size in time � � ��
�� � � ��� � � � �� � � , so
that for any query vector � � � � , we can � -approximate � � ��� � � ����� in time
� � �� ��� � � � �� � .
5.4.4 Maintaining the 1-median function
Let � be a set of � points in � � and let � � be a parameter. For � � � � , define
the 1-median function �
� � � ��� � � � )�� � � � � � � , as above. We describe a dynamic data
structure that maintains a function � � ��� � � as the points are inserted into or deleted
from � so that
� � � � � � � � ����� � � � ��� � � � � � � � � � � � � � � � �We maintain a height-balanced binary tree
with � leaves, each storing a point of � .
For a node � , let � � " � be the set of points stored at the leaves of the subtree rooted
at u; set � �� � � � � . For each node u of height � (leaves have height 0), set � � � � � � � ,
where � (= ��� � ) is the height of the tree
. We associate with a node u, at height � , a
function � � that is a � � -approximation of �� � ��� , i.e.,
� � � ��� � � � � � � � ��� � � � � � � � � � � � ��� ��� � ��� � � � � � . The description complexity of � � is � � � � � � � ��� � � � � � � . Finally, we maintain a function � that is an � -approximation of �
� � � � � � with description complexity � � � � � � � ��� � � � � � � .
Figure 5.3: (a) An exponential grid with 3 layers. (b) The larger (resp. smaller) cube is � (resp. �� ), and the set of hollow dots are ��� .
More specifically, if a leaf stores the point � � � , then set � � ��� � � � � � � � � . For all
internal nodes u, we compute � � in a bottom-up manner as follows. Let v and w be the
children of u. By induction, suppose we have already computed the functions ��� and � � , each of descriptive complexity � � � � � � � ��� � � � � � � . Set � � � � � � ����� � � � ��� ��� � . Since
� � � ��� � � � � � � � ��� � � � � � � � ��� ��� � , by induction hypothesis, � � is an � � � � � � � � � � -approximation of �
� � ��� . However, the description complexity of � � is more than what we
desire. We therefore approximate � � by a simpler function � � as follows. For � � � � and
� � � , let� ��� � � � be the hypercube of side length � centered at � . For simplicity, let
� � � �� � . We compute �� %)' � � # $
� � � ��� � and set � � � � � � � . Let� � � � � � � �� � � � � � � �
for� � � � ��� � � � ��� . Partition each cubic shell
� � � � � � � into hypercubes by a � -
dimensional grid � � in which each cell has side length �� � � � � � � � . The union of
� � ’s is an exponential grid with � � � � � �)� � � � � � � cells that covers the hypercube� �
� � � �� � � ��� � � � � . See Figure 5.3 (a) for an illustration. In each cell � � � � , pick an
arbitrary point � �� and set
��� � � � � � � � � � � � � � ��� � �� � (5.1)
For points outside�
, we set
����� � �
� ��� � � ��� � � � ��� � � � � � � � (5.2)
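The exponential-grid lookup behind (5.1) and (5.2) can be sketched concretely as follows (hedged: the planar setting, the layer indexing, and the constant in the cell side are illustrative choices, not the thesis's exact parameters). The key property is that a point is snapped to a cell representative whose distance from it is O(ε) times its distance from the grid center.

```python
import math

def exp_grid_rep(x, q, r, eps, layers):
    """Snap x to the representative point (cell centre) of its cell in an
    exponential grid centred at q. Hedged 2D sketch: layer i covers points
    at L-infinity distance about 2**(i-1) * r from q, and its cells have
    side proportional to eps * 2**i * r, so the snapping error is O(eps)
    times the distance from q. Returns None outside the outermost layer,
    where the exact quadratic expression would be used instead."""
    dx, dy = x[0] - q[0], x[1] - q[1]
    d = max(abs(dx), abs(dy))        # L_inf distance picks the layer
    i = 0 if d <= r / 2.0 else math.floor(math.log2(2.0 * d / r)) + 1
    if i > layers:
        return None
    side = eps * (2 ** i) * r / 2.0  # grid cell side within layer i
    return (q[0] + (math.floor(dx / side) + 0.5) * side,
            q[1] + (math.floor(dy / side) + 0.5) * side)
```

Because the cell side doubles with each layer while the distance from the center also doubles, the relative snapping error stays O(ε) across all layers.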
Hence, the function � � is piecewise-constant inside�
and a quadratic function outside�
. The description complexity of � � is � � � � � � � �)� � � � � � � � � � � � � � � ��� � � � � � � . Since
��� and ��� have the same structure, the point � can be computed by evaluating the func-
tion � � at the vertices of the exponential grids drawn for ��� and ��� . At each point � , we
can evaluate � � ��� � in � ��� � � � � � � time by simply locating � in the two exponential
grids. Hence, we can compute the point � in time � � � � � � � ��� � � � � � � � . We spend an-
other � � � � � � � �)� � � � � � � time to compute � � . That � � � � � is indeed a � � -approximation of
� � � � � � � � is proved in Lemmas 5.4.4 and 5.4.5. This finishes the induction step. Using the
same procedure, we compute an ��� � � � -approximation, � , of � root, of descriptive complexity
� � � � � � � �)� � � � � � � . By construction, for all � � � � ,
� � � � ��� � � ����� � � � � � � � � � � root��� �
� � ��� � � � � � ��� � � � � � � � ��� � � � � � � � � � � � ��� � � . Obviously, the size of the above data structure is � � � � � � � ��� � � ��� � � � � � � . To insert
or delete a point � , we follow a path from the leaf � storing � to the root of
and recompute
�� at all nodes along this path, and then compute � from � root. Hence, the update time is
� �)� � � � � ��� � � � � � � � � � � , and the only missing component now is to show that � � , as
constructed above at each node � , is indeed a � � -approximation of �� � ��� .
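The leaf-to-root update scheme can be skeletonised as follows (a hedged sketch: a crude fixed-size point sample stands in for the ε-approximate 1-median summaries maintained by the thesis, and the number of leaf slots is assumed to be a power of two). The point it illustrates is that changing one leaf touches only the O(log n) summaries on its leaf-to-root path.

```python
class SummaryTree:
    """Skeleton of the update scheme: a complete binary tree (heap-style
    array, n leaf slots with n a power of two) where every internal node
    stores a summary built only from its children's summaries. Changing one
    leaf therefore recomputes O(log n) summaries along a single
    leaf-to-root path. The fixed-size sample and the thinning rule are
    illustrative stand-ins for the thesis's approximate 1-median pieces."""

    def __init__(self, n, k=8):
        self.n, self.k = n, k
        self.node = [[] for _ in range(2 * n)]

    def _merge(self, a, b):
        s = a + b
        if len(s) <= self.k:
            return s
        step = len(s) / self.k           # thin back down to k points
        return [s[int(i * step)] for i in range(self.k)]

    def set_leaf(self, i, pts):
        j = i + self.n
        self.node[j] = list(pts)
        j //= 2
        while j >= 1:                    # recompute summaries on the path
            self.node[j] = self._merge(self.node[2 * j], self.node[2 * j + 1])
            j //= 2
        return self.node[1]              # summary at the root
```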
Lemma 5.4.4 Let u be a node of the tree at height � . For any � � � � ��� � � � ��� and for any
� � � � " �,
� � � � � ��� � � � � � � � � � � � � � � � � � � � ��� � ��� � �� �
PROOF. The triangle inequality implies that for any � � � � � � ,
� � � � � � � � � � � � � � � � � � � � � � � ��� � � � � � � � � � ��� � � � � � � � � � � � (5.3)
Therefore, by construction of the exponential grid,
� � � � ��� � � � � � � � ��� � � � � � �� ��
� � � � � �� � (5.4)
Equation (5.1) and (5.3) imply that
����� � � � � � � � �
�� � � � � ��� � � � � � � � � � � � � ��� � � � �
Substituting � for � in (5.3), we obtain
� � � � � ��� � � � � ��� � � � � � � � � � � � �
� � � � � � � � � � � (5.5)
Hence, for any � �� ,
��� � � � � � � � � � � � � � � � � � � � � � � �
�
� � � � � � � � � � � � � � � �� � � � � � � � �
�
� ��� � � � � � � � � � �� ��
� � � � � � � ( using (5.4) )
� � � � � � � � ��
� � � � � � ��� � � � � � � � � �� � � � � � � � �
�
� � �
� � � � � ��� ��� � ( using (5.5) )
� � � � � �
� � � � � ��� � � � � � � � � � � � � � ��� � � � �
Lemma 5.4.5 Let u be a node of the tree at height � . Then for any � � ��� � � ,
� � � ��� ��� � � � � � � � � � ��� � � � � � � ��� � � � �
PROOF. By (5.3),
� � � � ��� ��� � � � � � ��� � � � � � � ��� � � � � � � � � � ��� ��� � � � � (5.6)
The first inequality of the lemma is now immediate because
� � � ��� ��� � � � � � ��� � � � � � � � ��� � � � � � � � ��� � � � � � � � � � � � � . As for the second inequality, we first upper-bound � in terms of �
� � ��� ��� � . Let �� �
� � � � � � ��� � � � � and �� �� �
�� �� (see Figure 5.3 (b)). Then
� � � � � � � � � � �� )���� �
� � � � � �� -)����� �
� � � � � � � � � �� � � � �� � �
�
Therefore
� �� � � � � � � � � � � � � � � � � � � �
On the other hand, for any � � � � � � and � � ��
, we have that
� ��� � � � � � � � � � � � � � � � � � � � � � � � . Hence, as long as � � � � � , we have
� � � � � � � � �� ) ���� �
� � � � � � � �� � � � � �
� � �
���
thereby implying that � � � � � � ��� � � � . Using (5.2) and (5.6),
����� � �
� ��� ��� � � � � � � � � � � � ��� � � �� � � � � � ��� � � � � � � � � � � � � � ��� � � � � � � � � � � � � �
5.4.5 Randomized algorithm
We briefly describe below a simple randomized algorithm to approximate � ���������� . The
algorithm for approximating ��������� ��� is similar. Let � � be the optimal translation, i.e.,
� ��� � � � � ��� � � ���������� .
Lemma 5.4.6 For a random point � from � , � � � � � � ����� � � ���������� , with probability greater than 1/2. The same claim holds for � ����������� .
PROOF. Let � be a random point from � , where each point of � is chosen with equal
probability. Let � be the random variable � � � � � � � � ����� . Then �� � � � �
� �
� � � � � � �� � ����� . Moreover,
� ��������� � � ���� � � � ����� ��
��� � �
� � � � � � ����� � �� � � �
The lemma now follows immediately from Markov’s inequality.
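The Markov-inequality step can be checked concretely. In the sketch below (hedged; the names and the planar setting are illustrative), we compute directly the fraction of points of � whose nearest-neighbor distance after a given translation is within twice the average; Markov's inequality guarantees this fraction always exceeds one half.

```python
import math

def markov_fraction(A, B, t_opt):
    """Lemma 5.4.6 in miniature (hedged sketch). With d(p) = distance from
    the translated point p + t_opt to its nearest neighbour in B, the mean
    of d over A is the summed cost divided by |A|; by Markov's inequality,
    more than half of the points of A satisfy d(p) <= 2 * mean, which is
    why a random p works with constant probability."""
    d = [min(math.dist((p[0] + t_opt[0], p[1] + t_opt[1]), q) for q in B)
         for p in A]
    mean = sum(d) / len(d)
    good = sum(1 for x in d if x <= 2.0 * mean)
    return good, len(A)
```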
Choose a random point � � � . Let � � � � � � � and � � � � ��� � � � � ��� , for
� � � �� . It then follows from Lemma 5.4.6 and the same argument as in Lemma 5.3.7
that �#%$ � � � is a constant-factor approximation of � ��������� , with probability greater than
� �� . Computing � � exactly is expensive in � � , therefore we compute an approximate value
of � � , for� � � �
� , in time � � � � � ��� � � , by performing approximate nearest-
neighbor queries [24]. We can improve this constant-factor approximation algorithm to
compute a � � � � � -approximation of �� ���������� using the same technique as in Section 5.3.
We thus obtain the following result.
Theorem 5.4.7 Given two sets A and B of m and n points, respectively, in R^d and a parameter ε > 0, we can compute, in � � � � � � � ��� � � time, two translation vectors � � and � � , such that with probability greater than
� � ,
� ��� � � � ����� � � ��� � � � ���������� and � ����� � � � � ��� � � � � � � � � ����������
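The overall randomized scheme of this section can be sketched as follows (hedged: planar points, a one-sided summed cost, and brute-force nearest neighbors standing in for the approximate nearest-neighbor queries of [24]). A random point of the first set is, with probability at least 1/2, one whose cost under the optimal translation is within twice the average, so one of the candidate translations derived from it yields a constant-factor approximation; a few independent trials boost the success probability.

```python
import math
import random

def random_translation_search(A, B, trials=3, seed=0):
    """Hedged sketch of the randomized algorithm: pick a random p in A,
    try every candidate translation q - p with q in B, and keep the best.
    `summed_cost` is the (one-sided) summed distance from the translated
    A to B, evaluated here by brute force."""
    rng = random.Random(seed)

    def summed_cost(t):
        return sum(min(math.dist((a[0] + t[0], a[1] + t[1]), b) for b in B)
                   for a in A)

    best_t, best = (0.0, 0.0), summed_cost((0.0, 0.0))
    for _ in range(trials):
        p = rng.choice(A)
        for q in B:
            t = (q[0] - p[0], q[1] - p[1])
            c = summed_cost(t)
            if c < best:
                best_t, best = t, c
    return best_t, best
```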
5.5 Notes and Discussion
In this chapter we provided an initial study of various problems related to minimizing the Hausdorff distance between sets of points, disks, and balls. One natural question following
our study is to compute exactly or approximately the smallest Hausdorff distance over all
possible rigid motions in � � and � � . Given two sets of points � and � of size m and n,
respectively, let � be the maximum of the diameters of point sets � and � . We believe that
there is a randomized algorithm with roughly � � expected time that approximates the
optimal summed-Hausdorff distance (or rms-Hausdorff distance) under rigid motions in
the plane. The algorithm combines our randomized approach from Section 5.4.5, a frame-
work to convert the original problem to a pattern matching problem [110], and a result by
Amir et al. on string matching [21]. However, this approach does not extend to families of
balls. We leave the problem of computing the smallest Hausdorff distance between sets of
points or balls under rigid motions as an open question for further research. Another ques-
tion is to approximate efficiently the best Hausdorff distance under certain transformations
when partial matching is allowed. The traditional approaches using reference points break
down with partial matching.
Chapter 6
Coarse Docking via Features
6.1 Introduction
Proteins perform many of their functions by interacting with other molecules, and these
interactions are made possible by molecules binding to form either transient or static com-
plexes. We focus in this chapter on the problem of predicting the binding (or docking)
configurations for two large protein molecules, which we refer to as the protein-protein
docking problem (see Figure 6.1). This problem is important because the docked complex
Figure 6.1: Given the protein structures in (a) and (b), the docking problem predicts the docking configuration in (c).
has functional consequences (e.g., signal transduction) and because it is usually hard to crystal-
lize complexes. Many of the more than 25,000 proteins in Protein Data Bank (PDB) [31]
are able to form protein-protein complexes; however, there are only a few hundred non-obligate crystallized complexes¹; see the discussion in Chapter 1 for more motivation.
¹Obligate complexes are permanent multimers whose components do not exist independently.
When two proteins bind, their shapes (molecular surfaces) roughly complement each
other at the interface [113, 116]. This is the main justification for a geometric approach
in predicting the docking configurations, which is also the basis for our approach. The
docking problem that we study is as follows: Given two proteins, � and � , assume that
the native docked configurations are � and � � . The goal is to develop an efficient algo-
rithm to compute configurations for � that are close to � � , or in other words, to search
for configurations for � that complement � best geometrically. To measure the goodness
of the complementary fit between � and some configuration � � of � , a scoring function Score is needed. Note that in the above formulation, we assume that both proteins are rigid bodies during the docking process, which is usually referred to as the bound
conformation, is not considered in this chapter, but will be the focus of future work. Of
course in nature, more than two proteins might bind and form one complex. The work in
this chapter serves as a starting point for attacking these more general problems.
Prior work. There are mainly two categories of interactions that have been extensively
studied: protein-ligand interactions, which occur between a protein and a small molecule, and protein-protein interactions. Much of the earlier work in this area has focused on protein-ligand interactions, mainly motivated by drug design. With the number of available protein
structures increasing rapidly, there is considerable research on understanding interactions
between large proteins. Despite some similarities, the known approaches for predicting
these two types of interactions are different in several aspects. For the case of protein-
ligand docking, both chemical information and the flexibility of the ligand (and sometimes
that of the receptor as well) are considered at a quite detailed level. For protein-protein docking, in contrast, geometry seems to play a more important role, and because of
their large size, these proteins may not be as flexible as small molecules during the docking
process. Even if we ignore the flexibility, predicting protein-protein docking configurations
is computationally more demanding than the case of protein-ligand docking because of
the high complexity of protein molecules. We do not review the known results for protein-
ligand docking in this section. Interested readers are referred to [85, 102, 164].
Current research on protein-protein docking is focused on either bound docking, or
unbound docking with fairly small protein conformational changes. Most approaches
for unbound docking use bound docking as a subroutine. They usually consist of two
stages [153]:
(i) bound-docking stage, which produces a set of potential docking configurations by
considering only rigid transformation; and
(ii) refinement stage, which allows certain flexibility.
Two essential components are involved in both stages:
(i) a scoring function that can discriminate near-native docking configura-
tions from incorrect ones; and
(ii) a search algorithm to find (approximately) the best configuration under the score
function used.
In general, approaches to the bound docking stage rely mainly on geometric comple-
mentarity. Each of the input proteins can either be represented as a surface model [86, 127],
a union of balls (with each ball representing an atom) [32, 61], or a set of voxels [53, 143].
For the last case, the space is divided into a set of regular grid cells (voxels), where each
cell is marked as inside, outside, or on the surface of the protein. To search for the best geo-
metric fit, the most straightforward approach, called exhaustive search, discretizes the transformational space into a six-dimensional grid and computes the score for each grid point [32].
Although this approach produces good near-native docking configurations with few false positives², it is too expensive. Several approaches have been proposed to reduce the time complexity of this search procedure.
²A false positive is a docking configuration with a high score that is far from the native configuration.
One popular technique is the fast Fourier transform (FFT) method, which was first used
in molecular docking in [119]. The FFT method represents molecules as voxels. By de-
signing the scoring function appropriately and discretizing the translational space into a� � � � � grid, for a fixed rotation, the scores of all relative translations can be evaluated si-
multaneously in � � � ��� � � time. It is still necessary to search the three-dimensional space
of rotations, which is typically done exhaustively. There are several properties of FFT-type
approaches that make them rather attractive: besides surface complementarity, chemical
properties such as hydrophobicity can be correlated as well, and it is fast to perform low-
resolution FFT-searches, which is useful in producing coarse fits. Available docking pro-
grams exploiting the FFT method include FTDock [90], 3D-Dock [137], GRAMM [163],
and ZDock [53]. Nevertheless, for reasonably high resolution docking, the time complex-
ity for FFT-type approaches is still fairly high. The approach proposed by Ritchie and
Kemp [143, 144] addresses this problem by using spherical harmonics to represent both
the molecular surface and the electric field. Complementarity between surfaces in different
orientations is calculated by Fourier correlations between the expansion coefficients of ba-
sis functions. A table of overlap integrals independent of protein identities is pre-calculated
to expedite the docking algorithm.
Another widely used class of approaches reduces the number of transformations inspected by aligning so-called features on molecular surfaces. The idea goes back to Connolly [67], who also proposed to identify point features by taking the minima and maxima of the so-called Connolly function. One representative approach is by Fischer et al. [86], who use a geometric hashing technique to align points computed by a variant of the Connolly function. Variants and improvements of their approach include [93, 127]. Such
algorithms are usually faster, but require post-processing to remove steric clashes and to
refine the geometric fit.
Other approaches for bound docking include genetic algorithms to conduct the search [39,
91], and fast bit-manipulation routines to expedite the evaluation of scores [140].
Although good in recombining separate components of a known complex, geometric fit
is not sufficient to dock unbound proteins. It is commonly believed that the native complex
is at the global minimum of ΔG, the difference between the free energy of the complex
and that of its separate components. Hence the refinement stage is usually modeled as
an energy-minimization problem. The scoring functions here focus more on the thermo-
dynamic aspects of the interaction. Flexibility of side chains, or even that of backbones,
is usually considered during the search process. One common way to incorporate side
chain flexibility is by considering only known populated rotamers of the side chains. Un-
fortunately, this still produces an exponential number of combinations. Techniques such
as iterative mean-field minimization [111], dead-end elimination [18], and genetic algo-
rithms [115] have been used to reduce the time complexity. Recently, Vajda et al. have
proposed a hierarchical, progressive refinement protocol [41, 42]. It follows the intuition
that a simplified energy landscape is sufficient for far-apart proteins, and as two proteins
get closer, a more detailed energy landscape should be used. Their algorithm is able to reliably converge to a near-native configuration (with rmsd below a small threshold, in Å) starting with one with a considerably larger initial rmsd.
Little success has been achieved for including backbone conformational changes. It
seems that a significant portion of this larger motion is the class of hinge-link inter-domain
movements [124]. Some initial investigation has been made to docking proteins consider-
ing such motions [147].
New work. Since each step in the refinement stage is quite costly, it is crucial to gen-
erate a small and reliable set of potential configurations during the bound docking stage.
On the other hand, even though only rigid motions are considered in this stage, the time
complexity of current approaches is still not satisfactory, preventing us from experiment-
ing with larger data sets. In this chapter, we present an efficient algorithm for the bound-
docking stage. We use geometric complementarity to guide us in the search for a small
set of rigid motions that fit the two proteins loosely into each other. Such a set of poten-
tial configurations can be further refined independently, using both chemical information
and flexibility [42, 61], to obtain more accurate docking predictions. We remark that for
the case of unbound docking, it is especially important to consider coarse (not tight) fits
between input components in order to provide enough tolerance for flexibility in the later
refinement procedure.
We describe our coarse alignment algorithm in Section 6.2. It relies on the ability
to describe meaningful “features” on molecular surfaces, such as protrusions and cavities,
using a succinct set of pairs of points computed from the elevation function, defined in
Chapter 4. We then align such pairs and evaluate the resulting alignments by a simple and
rapid scoring function. Compared to similar approaches [86, 127] that align the so-called
feature points, our algorithm inspects orders of magnitude fewer transformations by align-
ing only meaningful features to produce a reliable set of potential docking positions. In
Section 6.3, we demonstrate the performance of our approach by testing a set of 25 bound
protein-complexes obtained from the Protein Data Bank [31]. We also show that by com-
bining our coarse alignment algorithm with the local improvement procedure described in [61],
we are able to efficiently find accurate near-native docking positions for all but one case
without any false positives. Additionally, we have tested our algorithm on the unbound
protein docking benchmark [54], demonstrating that � � %)' ��� � # � $ can generate useful pro-
tein poses that can serve as input for refinement methods that take protein flexibility into
account.
6.2 Algorithm
Assume that we are given two proteins � � � � � � � � � ��� � � � � � and � � � � � ��� � � ����� ��� � � ,where each � � � � � � � � � (resp. � � � ��� � � � � � ) represents an atom centered at �� � � � (resp.
� � � � � ) with van der Waals radius� � (resp. � � ). Let
Score ��������� be a scoring function that we will describe shortly. Ideally, we would like to find a transformation � for � that maximizes Score ����� � � ��� � . In this section, we describe an algorithm that finds a set of
potentially good transformations for � . Below, we first explain the scoring function we
use. We then present an algorithm that produces a set of transformations for � by aligning
pairs of points computed from the elevation function.
At a high level, we first construct a set�� (resp.
�� ) of features based on the elevation
function as defined in Chapter 4, each consisting of two points along with a normal di-
rection, to characterize the molecular surface of protein � (resp. � ). These two sets of
features,�� and
�� , are the inputs to our coarse alignment algorithm, which outputs a
list of possible configurations sorted by their scores. Below, we first describe the scoring
function that we use. Then after explaining how to construct features from the elevation
function, we present the coarse alignment algorithm.
6.2.1 Scoring function
A good scoring function should produce a higher score for (near) native configurations
than for non-native ones. We design our scoring function to describe the geometric com-
plementarity between � and � . In particular, for a pair of atoms � � � � � � � , both the pairwise score Score � � � � � � � and the clash indicator Collide � � � � � � � are determined by the quantity � � � � � � � � � � � � � � , the distance between the two atom centers minus the two van der Waals radii: a pair contributes positively to the score when this quantity is nonnegative and at most � , a prefixed constant that we refer to as the contact-threshold; it counts as a clash when the quantity is negative; and it contributes nothing otherwise. The scoring function between � and � then involves two components: the total score Score ��������� , summed over all atom pairs, and the collision number Collide ��������� , the total number of clashes. A configuration ��������� is valid if Collide ��������� � � , where the collision-threshold � defines the maximum number of clashes that can be tolerated.
This definition is rather similar to the one used in [32] and in [53]. The main difference
is that in addition to counting collisions, we use them as a reason to lower the score. The
reason behind this is that as our algorithm generates coarse alignments between the input
proteins, we need a large tolerance for collision ( � � � in our experiments, in contrast to
� � �in [32]). However, a high collision number will also increase the number of pairs of
atoms that are in contact, resulting in high scores. By giving a penalty to the score when a
clash happens, we intend to counterbalance the consequential increase in score.
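A minimal sketch of this scoring function follows (hedged: the contact-threshold value and the clash penalty are illustrative assumptions, not the thesis's constants).

```python
import math

def score_and_collide(P, Q, alpha=1.5, penalty=1.0):
    """Hedged sketch of the contact-based scoring function. Atoms are
    (x, y, z, r) tuples. For each pair, delta = |ab| - r_a - r_b: a value
    in [0, alpha] counts as a contact (+1), while delta < 0 is a steric
    clash that increments the collision number and lowers the score by
    `penalty`."""
    score, collide = 0.0, 0
    for a in P:
        for b in Q:
            delta = math.dist(a[:3], b[:3]) - a[3] - b[3]
            if delta < 0:
                collide += 1
                score -= penalty
            elif delta <= alpha:
                score += 1.0
    return score, collide
```

The collision penalty is what counterbalances the spurious contacts that a deeply interpenetrating configuration would otherwise accumulate.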
6.2.2 Computing features.
Given protein � , we compute its surface and the elevation function defined on it
(as defined in Chapter 4). For docking purposes, we are interested in points with
locally maximal elevation, as they represent locally most significant features. Recall that
there are four types of maxima, which are illustrated in Figure 6.2 (a), each describing a
different type of features on molecular surfaces.
(a) (b)
Figure 6.2: (a) From left to right: a one-, two-, three-, and four-legged local maximum of theelevation function. (b) A three-legged maximum generates six features as indicated by shadedbars.
We collect �� , the set of features for protein � , as follows: for each maximum � of the elevation function, add all possible pairs of points among the head(s) and feet of � into
�� . See Figure 6.2
(b) for an example. With each feature generated from the maximum � , we associate the
normal direction � � (i.e., all head(s) and feet of � are critical points of the height function in direction � � ), and assume that � � always points towards the exterior of the surface at
� . Thus a feature is a pair of points along with the unit vector � � . Given a feature, we
refer to the distance between its two points as its length, and the elevation of � as its elevation.
The length and elevation of a feature indicate its importance, thereby providing a way to
distinguish less from more meaningful features.
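The feature-generation rule can be sketched directly (hedged; the flat tuple representation is an illustrative choice).

```python
from itertools import combinations

def features_from_maximum(heads, feet, normal):
    """All features generated by one maximum of the elevation function:
    every unordered pair among its head(s) and feet, tagged with the
    maximum's outward normal direction. A three-legged maximum has four
    such points and therefore yields the six features of Figure 6.2 (b)."""
    points = list(heads) + list(feet)
    return [(p, q, normal) for p, q in combinations(points, 2)]
```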
We remark that previous alignment-based approaches describe features by points resid-
ing at a protrusion or a cavity. This is one of the main differences between our approach and those methods. Points provide much less information for identifying specific protrusions or
cavities as compared to our representation of features. Therefore our alignment algorithm
(as described below) is able to inspect orders of magnitude fewer configurations than those
algorithms. We elaborate on this difference more in Section 6.4 at the end of this chapter.
6.2.3 Coarse alignment algorithm
Given two proteins � and � , along with their feature sets�� and
�� , respectively, our
algorithm, sketched in Figure 6.3, computes a set of potential coarse alignments. The
rationale behind the algorithm is that good fits between the input proteins should have some features aligned, such as a protrusion from one surface fitting inside a cavity on the other
(see Figure 6.4 (a)). Hence if we align all “features” from � to those from � , we will
cover potentially good fits. Here features are defined as in Figure 6.2 (b). As we wish to
align only important features, we preprocess�� and
�� by removing features with small
elevation value or short length. We now explain steps (1), (2) and (3) from the above
ALGORITHM CoarseAlign
    Preprocess the feature sets of P and Q;
    for each feature f of P and each feature g of Q
    (1)    if ( Feasible(f, g) )
    (2)        T = Align(f, g);
    (3)        compute ( Score, Collide ) for ( P, T(Q) );
               if ( Collide(P, T(Q)) is at most the collision-threshold )
                   add T to the list L;   // T is valid
    Sort L by Score;
end
Figure 6.3: The coarse alignment algorithm.
Figure 6.4: (a) If two surfaces complement each other well, then features at the interface should align well too. (b) Points p and q do not match each other: they should have opposite criticality (maximum with minimum w.r.t. their own normals).
algorithm.
Function Align: The function Align(f, g) computes a rigid motion � that aligns the pair of points � � � � � � � � � � with � � � � � � � � � � . In particular, assume that � � and � � are the
normals associated with ��� and � � respectively. Let
� � � ��� � � �� � � � � � and � � � � ���
0 � � 0 �� 0 � � 0 � ���
To obtain the transformation, we first translate the midpoint of segment � � � � to the mid-
point of segment � � � � . Next, we rotate segment � � � � so that (i) segment � � � � lies on the
line passing through � � � � ; and (ii) � � coincides with � � . See Figure 6.5 for an illustration.
� � . See Figure 6.5 for an illustration.
Note that there is ambiguity in (i) as vector��� � � � can either be in the same direction or in
Figure 6.5: First move the midpoints (two empty dots) together. Then rotate so that the two segments coincide. Last, rotate about the segments' common line so that the two normals coincide.
opposite direction as vector ��� � � � . As such, the function Align in fact returns two distinct transformations, although for simplicity we pretend it returns only one.
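The Align construction can be sketched by attaching an orthonormal frame to each feature and mapping one frame onto the other (hedged: the helper names are illustrative, only one of the two segment orientations is produced, and the normal is assumed not to be parallel to the segment).

```python
import math

def _sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def _dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def _frame(p1, p2, normal):
    # Orthonormal frame for a feature: first axis along the segment, second
    # along the part of the normal orthogonal to it, plus the segment
    # midpoint as origin.
    e1 = _unit(_sub(p2, p1))
    t = _dot(normal, e1)
    e2 = _unit((normal[0] - t * e1[0], normal[1] - t * e1[1],
                normal[2] - t * e1[2]))
    e3 = _cross(e1, e2)
    mid = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
    return (e1, e2, e3), mid

def align(feat_p, feat_q):
    """Rigid motion taking feature feat_q = (q1, q2, nq) onto
    feat_p = (p1, p2, np): midpoints coincide, segments become collinear,
    and the two normals end up in the same half-plane through the segment.
    Returns a callable x -> R (x - mid_q) + mid_p."""
    Fp, mp = _frame(*feat_p)
    Fq, mq = _frame(*feat_q)

    def apply(x):
        v = _sub(x, mq)
        c = tuple(_dot(e, v) for e in Fq)       # coordinates in Q's frame
        return tuple(mp[i] + sum(Fp[k][i] * c[k] for k in range(3))
                     for i in range(3))         # rebuild in P's frame
    return apply
```

Flipping the first frame axis of one feature gives the second transformation of the ambiguity mentioned above.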
Function filter: Obviously, if f1 and f2 are fairly "dissimilar", then they will
not align well with each other in any good configuration. By "dissimilar", we mean that

(F1) we are trying to align a protrusion (resp. cavity) from one surface with a protrusion
(resp. cavity) from the other (see Figure 6.4 (b)), or

(F2) the length of f1 is too different from that of f2.

Given feature pairs f1 = (p1, p2) and f2 = (q1, q2), associated with normals n1 and n2
respectively, the function filter(f1, f2) returns false if either (F1) or (F2) happens. In
particular, (F1) happens if p1 (or p2) is a local minimum (or local maximum) with respect
to n1 and q1 (or q2) is of the same criticality with respect to n2. Let l be the length of the
shorter feature pair and d the difference between the lengths of f1 and f2. If d > epsilon * l, we
consider that (F2) happens, where epsilon is a threshold on the ratio of the two lengths.
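The two rejection tests can be written down directly. In this hypothetical sketch a feature is reduced to its length and the criticalities of its two endpoints (+1 for a local maximum, -1 for a local minimum w.r.t. the feature normal); the representation is an assumption for illustration only.

```python
def feasible(f1, f2, eps):
    """Return False for dissimilar feature pairs, per tests (F1) and (F2).

    Each feature is (length, (c1, c2)) with criticalities c in {+1, -1}."""
    len1, (a1, b1) = f1
    len2, (a2, b2) = f2
    # (F1): matched endpoints must have opposite criticality
    # (a protrusion on one surface must meet a cavity on the other).
    if a1 == a2 or b1 == b2:
        return False
    # (F2): the length difference must stay below eps times the shorter length.
    if abs(len1 - len2) > eps * min(len1, len2):
        return False
    return True
```

Because both tests are constant-time, filtering costs nothing compared with evaluating a configuration.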
Computing score and collision: By definition, we can compute both score(A, tau(B)) and
collision(A, tau(B)) in O(mn) time, where m and n are the number of atoms in A and B
respectively. Here we present a simple algorithm that computes them in O((m + n) log m)
time by building a hierarchical data structure for A as follows.

Let P be the set of centers of the atoms of A; the diameter of P is O(mr), where
r is the smallest radius of an atom from A (between one and two angstroms). We build a
standard oct-tree T on P. Points of P are stored at the leaves of T. For each internal node
v, let P_v denote the set of centers in the subtree of v and A_v the set of balls from A with
center in P_v. We associate with node v the smallest enclosing ball of A_v, denoted by B_v.
The depth of T is O(log m), therefore T can be constructed in O(m log m) time.
We now describe the computation of collision(A, tau(B)); that of score(A, tau(B)) is simi-
lar. In particular, for each atom b of B, we compute collision(A, tau(b)), the number of atoms
of A intersecting tau(b), by a top-down traversal of T. At an internal node v, if B_v and tau(b)
intersect, we recurse down the subtree rooted at v. Otherwise, we return. At a leaf node, if
tau(b) intersects the atom whose center is stored at the leaf node, we increment a counter that
records the number of collisions seen so far by 1. The resulting number after the traversal
is collision(A, tau(b)).
It is easy to verify that the above traversal for a specific tau(b) takes O(log m) time: at
any level of T, tau(b) intersects at most a constant number of nodes of T, and the depth of
the tree is O(log m). Hence it takes O((m + n) log m) time to compute collision(A, tau(B))
(and similarly score(A, tau(B))) for a particular configuration (A, tau(B)).
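The tree and its traversal can be sketched as follows. This is an illustrative stand-in, not the thesis implementation: instead of an oct-tree with smallest enclosing balls it splits atoms by octant around the cell center and stores a simple bounding ball per node, which preserves the prune-or-recurse logic.

```python
import math

def dist(u, v):
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in range(3)))

class Node:
    """Hierarchy over atoms; each atom is (center, radius)."""
    def __init__(self, atoms, leaf_size=1):
        # Bounding ball of the subtree (stands in for the enclosing ball B_v).
        lo = [min(a[0][i] for a in atoms) for i in range(3)]
        hi = [max(a[0][i] for a in atoms) for i in range(3)]
        self.center = [(lo[i] + hi[i]) / 2 for i in range(3)]
        self.radius = max(dist(self.center, c) + r for c, r in atoms)
        self.atoms, self.children = atoms, []
        if len(atoms) > leaf_size:
            buckets = {}
            for c, r in atoms:                   # octant of c w.r.t. the cell center
                key = tuple(c[i] >= self.center[i] for i in range(3))
                buckets.setdefault(key, []).append((c, r))
            if len(buckets) > 1:                 # avoid infinite recursion
                self.children = [Node(b, leaf_size) for b in buckets.values()]

def collisions(node, ball):
    """Number of atoms in the tree intersecting the query ball (center, radius)."""
    c, r = ball
    if dist(node.center, c) > node.radius + r:
        return 0                                 # prune: bounding balls are disjoint
    if not node.children:                        # leaf: test the stored atoms
        return sum(1 for ac, ar in node.atoms if dist(ac, c) <= ar + r)
    return sum(collisions(child, ball) for child in node.children)
```

Summing `collisions(tree, tau(b))` over all atoms b of B yields collision(A, tau(B)), matching the traversal described above.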
In practice, we build a similar tree for B and traverse both trees simultaneously
to compute collision(A, tau(B)) (as well as score(A, tau(B))) directly, instead of computing each
collision(A, tau(b)) one by one. Although this variant proves to be more efficient
in practice, it has the same asymptotic time complexity as the one described above. The
overall time complexity of CoarseAlign is O(k_A k_B (m + n) log m), where k_A and k_B
denote the number of features of A and B that survive preprocessing.
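The simultaneous traversal amounts to descending two ball hierarchies together, splitting whichever node is larger. The sketch below assumes the same illustrative node layout as before, here written as plain tuples (center, radius, atoms, children); the names are hypothetical.

```python
import math

def dist(u, v):
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in range(3)))

def dual_collisions(na, nb):
    """Count intersecting atom pairs by descending two hierarchies together.

    Each node is (center, radius, atoms, children), atoms a list of (center, radius)."""
    ca, ra, atoms_a, kids_a = na
    cb, rb, atoms_b, kids_b = nb
    if dist(ca, cb) > ra + rb:
        return 0                                  # bounding balls disjoint: prune
    if not kids_a and not kids_b:                 # two leaves: test atoms pairwise
        return sum(1 for p, rp in atoms_a for q, rq in atoms_b
                   if dist(p, q) <= rp + rq)
    if kids_a and (not kids_b or ra >= rb):       # descend into the larger node
        return sum(dual_collisions(k, nb) for k in kids_a)
    return sum(dual_collisions(na, k) for k in kids_b)
```

Whole subtrees of both proteins are pruned in one distance test, which is where the practical speed-up over the one-atom-at-a-time queries comes from.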
6.3 Experiments
In this section, we first provide a detailed experimental study of one protein complex. Next
we test our docking algorithm on a diverse data set of 25 bound protein complexes chosen
from the Protein Data Bank [31], which contains both easy docking problems where large
protrusions fill deep pockets and more difficult problems where the interface is relatively
flat. Finally, we provide results for testing the unbound protein benchmark [54].
A case study. We take the protein complex barnase/barstar (pdb-id 1brs, chains A and D)
as an example. The two chains have 864 and 693 atoms respectively. We use the msms
program (also available as part of the VMD software [105]) to generate a triangulation
of the molecular surface for each input chain; each resulting triangulation has several
thousand vertices. The left of Table 6.1 shows the number of features generated
        2-legged  3-legged  4-legged
    A   1044      696       156
    D   828       510       154

        2-legged  3-legged  4-legged
    A   112       205       50
    D   68        160       49

Table 6.1: k-legged features for chains A and D of 1brs on the left. Only the large ones are on the right.
from the four different types of maxima of the elevation function: a k-legged feature is
derived from a k-legged maximum. On the right, we show the number of large features,
namely those whose length and elevation exceed fixed thresholds (elevation at least 0.2):
these features are the inputs to our coarse alignment algorithm. We note that a significantly
higher percentage of 3-legged features are large compared to those of 4-legged and
especially 2-legged features.
Given the two sets of large features, algorithm CoarseAlign generates a family C of
valid configurations ranked by score. The running time is around 3 minutes on a single-processor
PIII 1GHz machine. Each configuration in C corresponds to a transformation for chain D.
For each transformation tau computed from C, we compute the rmsd (root mean squared
deviation) between the centers of all atoms from chain D and those of tau(D), and use it
to measure how good our predictions are: the smaller the rmsd of a configuration, the
closer it is to the native docking position. The rmsd for the top-ranked configuration in C
is 1.59 A, and 6 of the top 10 configurations have comparably small rmsd. Next consider
only a subset C', the top 100 ranked configurations from C. We refine each configuration
in C' by applying the local improvement heuristic [61] and re-rank C' afterwards based on
        After LocalImprove        Before LocalImprove
rank    score    rmsd             rank    #coll    rmsd
1       359      0.54             12      24       3.23
2       338      0.80             5       48       2.42
3       328      0.72             1       23       1.59
4       314      0.80             4       49       3.57
5       311      0.91             2       39       1.70
6       310      0.78             59      12       2.84
7       307      1.50             3       29       2.32
8       281      1.47             11      18       3.07
9       251      2.09             14      16       3.00
10      213      39.96            76      29       39.39

Table 6.2: Performance of the algorithm (including refinement) on the protein complex barnase/barstar. Only configurations with at most a fixed number of collisions are kept after the local improvement heuristic. The right side shows the corresponding ranking and number of collisions before applying the local improvement heuristic.
the new scores. The results, shown in Table 6.2, demonstrate that CoarseAlign generates
multiple useful poses for chains A and D that can be refined to yield a near-native final
configuration.
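The rmsd used throughout these experiments is the standard root mean squared deviation over corresponding atom centers; a minimal sketch, with points as 3-tuples:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equal-length lists of 3D points."""
    assert len(coords_a) == len(coords_b) and coords_a
    s = sum(sum((a[i] - b[i]) ** 2 for i in range(3))
            for a, b in zip(coords_a, coords_b))
    return math.sqrt(s / len(coords_a))
```

Here `coords_a` would hold the atom centers of chain D in the native complex and `coords_b` those of tau(D); no re-superposition is performed, since the configurations are compared in the frame of the fixed chain.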
More bound protein complexes. We extend our experiments to a diverse set of 25
bound protein complexes obtained from the PDB. After computing the elevation function for
all protein surfaces, we compute the set of features for each chain and remove the features
that are not large. Usually, the number of remaining features for each chain is roughly
the same as the number of atoms of that chain. Next, we apply CoarseAlign to these
feature sets. In Table 6.3, we show a low-rmsd (less than 5 A) configuration for each
protein complex, as well as its rank (by score) in the set of configurations returned by
CoarseAlign, using the same parameter settings (feature-length threshold and collision
bound) for all complexes. Note that in all but one case, we have at least one configuration
with low rmsd among the top 100 configurations. The last column shows the running time
of algorithm CoarseAlign. It does not include the time to compute the molecular surfaces
and the elevation function.
A configuration is of type i-j if it is produced by aligning an i-legged feature from one
chain with a j-legged feature from the other. Our experimental results indicate that
4-legged features seldom give rise to good configurations (i.e., those with rmsd less than
pdbid          rank   #coll   rmsd   time
1A22 (A, B)    2      23      2.75   20
1BI8 (A, B)    12     43      2.48   26
1BRS (A, D)    1      11      1.52   3
1BUH (A, B)    5      14      1.85   2
1BXI (B, A)    3      34      2.54   8
1CHO (E, I)    1      14      2.71   3
1CSE (E, I)    2      22      2.21   9
1DFJ (I, E)    78     11      3.09   27
1F47 (B, A)    15     1       1.49   1
1FC2 (D, C)    5      49      4.13   6
1FIN (A, B)    11     44      3.70   41
1FS1 (B, A)    1      29      1.62   5
1JAT (A, B)    522    20      1.20   9
1JLT (A, B)    8      23      3.64   10
1MCT (A, I)    1      27      3.49   3
1MEE (A, I)    1      23      1.33   9
1STF (E, I)    1      43      1.18   8
1TEC (E, I)    9      54      3.07   7
1TGS (Z, I)    1      46      2.61   6
1TX4 (A, B)    2      4       3.35   14
2PTC (E, I)    1      18      4.55   6
3HLA (A, B)    1      19      1.87   16
3SGB (E, I)    1      38      3.21   5
3YGS (C, P)    6      7       1.07   6
4SGB (E, I)    10     33      2.33   4

Table 6.3: Performance of CoarseAlign on 25 protein complexes. Column 2 shows the rank of the first configuration with low rmsd (less than 5 A). The last column shows the running time of CoarseAlign in minutes.
        After Improve               Before Improve
pdbid   rank   score   rmsd        rank   score   #coll   rmsd
1A22    1      475     1.08        2      270     23      2.75
1BI8    1      234     29.88       62     211     10      30.0
1BRS    1      349     1.07        2      389     37      3.01
1BUH    1      256     0.61        5      209     14      1.85
1BXI    1      289     0.63        16     217     21      5.59
1CHO    1      305     0.99        1      243     14      2.71
1CSE    1      317     0.82        23     273     36      2.57
1DFJ    1      220     1.28        78     178     11      3.09
1F47    1      221     0.56        15     129     1       1.49
1FC2    2      200     1.33        5      356     49      4.13
1FIN    1      413     0.61        34     382     54      9.94
1FS1    1      326     0.89        2      309     27      1.59
1JAT    1      288     0.87        522    168     21      1.20
1JLT    1      310     1.77        3      232     14      6.17
1MCT    1      322     0.32        84     233     34      3.57
1MEE    1      372     0.57        1      373     23      1.33
1STF    1      314     0.79        1      408     43      1.18
1TEC    1      304     1.28        10     362     51      4.51
1TGS    1      348     0.44        2      227     13      2.71
1TX4    1      355     0.36        80     243     25      4.34
2PTC    1      314     0.66        1      265     18      4.55
3HLA    1      416     0.70        1      246     19      1.97
3SGB    1      257     2.24        1      320     38      3.21
3YGS    1      209     0.85        6      216     7       1.03
4SGB    1      266     2.50        10     260     33      2.33

Table 6.4: Performance of the algorithm on the 25 protein complexes after and before local improvement. Columns 2 to 4 show the first configuration with low rmsd after locally improving the top k coarse alignments and re-ranking them. Columns 5 to 8 show the corresponding configuration before local improvement and re-ranking.
5 A). In fact, no 4-legged feature is involved in producing the best configuration in 24 of the 25
cases (the only exception is 2PTC), and the percentage of good configurations involving
4-legged features is usually small. On the other hand, the computation of 4-legged
features is the most expensive (refer to Chapter 4).
For each protein complex, we apply the local improvement heuristic [61] to the top k
configurations, and then re-rank them based on the new scores. The results are illustrated in
Table 6.4, where we choose k = 100 (except for 1JAT, where we choose a larger k, since
its first low-rmsd coarse configuration has rank 522). We consider only those configurations
with at most a fixed number of collisions. In all but one case, the top-ranked (or, for 1FC2,
the second-ranked) configuration after local improvement has a small rmsd. In the only
remaining case, we can obtain a small rmsd for the top-ranked configuration if we relax
the bound on the number of collisions. In other words, in 23 out of these 25 test complexes,
our coarse alignment algorithm in conjunction with the local-search heuristic [61] can
predict an accurate near-native docking configuration without false positives.
Unbound protein benchmark. We further tested CoarseAlign on the protein-protein
docking benchmark provided in [54]. We omitted the seven complexes classified as difficult
in [54] because they have significantly different conformations in the unbound vs. bound
structures. We also omitted complexes 1IAI, 1WQ1 and 2PCC because of difficulties in
generating molecular surfaces of the required quality. Of the 49 remaining complexes, 25
are so-called bound-unbound cases in which one of the components is rigid. For each
complex, we fix one chain as A, which is the rigid chain for the bound-unbound cases, or
the receptor for the unbound-unbound cases. We generate a set C of potential positions
for the other component, B. For each transformation tau computed from C, we measure
the rmsd between the interface C-alpha atoms of B and those of tau(B), and refer to it as
rmsd_I. (For unbound protein complexes, rmsd_I serves as a better measure than the rmsd
used before, due to the flexibility, and thus the unreliability, of the positions of non-
C-id     Top 2000 (C')           All outputs (C)
         rmsd_I    hits     rank      rmsd_I    rmsd     size
1ACB     3.70      20       3,951     1.75      1.81     14,426
1AVW     5.51      8        4,698     5.42      6.38     23,565
1BRC     4.66      35       1,629     4.66      5.72     12,770
1BRS     1.60      7        426       1.60      1.54     11,607
1CGI     3.04      5        695       3.04      3.32     10,135
1CHO     2.35      27       92        2.35      2.69     11,815
1CSE     3.15      7        15,271    2.74      2.53     21,068
1DFJ     6.44      2        1,433     6.44      6.09     35,231
1FSS     7.65      2        10,721    5.15      5.45     25,609
1MAH     2.78      4        1,561     2.78      3.45     25,402
1TGS     5.27      18       543       5.27      6.07     11,383
1UGH     7.95      3        8,268     7.16      7.40     14,656
2KAI     6.55      26       2,560     3.41      4.71     13,478
2PTC     4.55      32       4,983     4.16      7.85     13,929
2SIC     4.04      27       76        4.04      7.71     20,065
2SNI     6.34      10       4,894     4.58      4.72     15,830
1PPE     4.13      10       37        4.13      5.07     7,660
1STF     1.41      8        1         1.41      1.92     15,082
1TAB     3.78      3        48        3.78      5.48     8,296
1UDI     4.50      3        1,124     4.50      7.38     21,133
2TEC     1.42      5        6         1.42      1.53     21,134
4HTC     5.94      2        396       5.94      7.39     14,032
1AHW     9.38      1        2,781     4.37      10.44    32,919
1BVK     1.95      5        1,189     1.95      3.69     24,611
1DQJ     4.59      7        710       4.59      6.21     28,694
1MLC     3.71      7        6,949     3.32      7.13     29,747
1WEJ     6.27      3        4,659     5.86      6.01     18,194
1BQL     6.98      11       10,388    4.39      4.64     23,308
1EO8     2.31      1        11        2.31      3.11     45,512
1FBI     6.49      8        11,783    2.30      2.08     26,036
1JHL     3.47      18       14,185    2.61      3.27     32,091
1KXQ     5.99      2        1,495     5.99      15.86    37,218
1KXT     4.52      12       153       4.52      10.90    39,240
1KXV     2.48      7        321       2.48      3.54     46,368
1MEL     2.21      8        73        2.21      2.55     17,741
1NCA     1.75      7        621       1.75      1.92     49,600
1NMB     7.18      7        14,202    2.72      5.11     42,066
1QFU     1.97      4        12        1.97      3.07     47,693
2JEL     3.46      19       115       3.46      4.39     34,072
2VIR     1.08      11       1         1.08      1.86     40,813
1AVZ     4.06      8        4,243     3.52      4.08     7,895
1L0Y     2.75      2        1,136     2.75      3.83     34,044
2MTA     2.91      40       19,167    2.07      2.16     36,903
1A0O     5.95      3        3,950     4.35      5.20     9,113
1ATN     1.52      8        1         1.52      2.33     50,729
1GLA     -         -        25,307    2.82      2.83     33,879
1IGC     2.48      3        3,260     2.06      2.59     25,303
1SPB     2.83      3        617       2.83      3.03     13,728
2BTF     5.02      2        10,132    3.28      3.63     33,480

Table 6.5: Results on the protein-protein docking benchmark. C-id means complex-id.
backbone atoms.) As the unbound structures (i.e., the input to our algorithm) provided in
the benchmark are superimposed onto their crystallized counterparts, this value is close
to the rmsd_I measured between tau(B) and the crystallized structure of B.
Now take C', the top 2000 ranked configurations from C. The results are shown
in Table 6.5: column 2 gives the smallest rmsd_I generated from C', and column 3 the
number of configurations from C' with an rmsd_I smaller than 10 A. Columns 4 to 6
provide information (rank, rmsd_I, and corresponding rmsd) on the configuration in
C with the smallest rmsd_I, and the size of C is shown in column 7.
Our results demonstrate a number of favorable characteristics of our algorithm. First,
within the relatively small set of 2000 top-scoring configurations (C'), 38 out of 49 com-
plexes yield a configuration below 6 A rmsd_I. All but one complex yield a configuration
below the 10 A cut-off needed as input for the hierarchical, progressive refinement protocol
in [41, 42]. The fact that most complexes generate multiple hits (column 3) increases the
probability that a local refinement will not be trapped in a local minimum and will instead
find a correct solution. Second, within all the configurations generated (C, at most 50,000),
47 out of 49 complexes yield a configuration below 6 A, typically within the top 10,000
scores. All 49 generate at least one configuration below 7.2 A, in each case within the
top 26,000 configurations. How these coarse alignments can be re-ranked to yield high-scoring
solutions with low rmsd remains to be investigated. We also remark that it is possible to
further reduce the size of C by clustering similar configurations [86].
6.4 Notes and discussion

We have presented in this chapter an efficient alignment-based algorithm to compute a set
of coarse configurations for two rigid input proteins. We have shown in Section 6.3
that, when combined with the local improvement heuristic from [61], our algorithm can
predict an accurate near-native docking configuration for 23 out of 25 test bound protein
complexes, without producing any false positives. When tested on the unbound protein
docking benchmark [54], our algorithm rapidly produces a relatively short list
of potential configurations that can serve as inputs to other local improvement methods that
allow protein flexibility and thus have the potential to solve the unbound protein docking
problem.
Comparisons. Current approaches for the bound docking stage differ in how they sample
the search space and how they evaluate the docking score. FFT-type methods discretize
the search space in a rather uniform way. They produce more accurate predictions for
recombining known complexes, but at a much higher computational cost. For the case of
unbound docking, it is possible to run those algorithms at a low resolution to provide some
tolerance for flexibility. However, if the resolution is too low, there is a danger of missing
good alignments completely, while if the resolution is high, too many configurations
will be generated, and selecting a small set of good candidates for the refinement
stage is not a trivial problem. Alignment-based methods tend to sample the search space
in a more selective manner guided by shape complementarity. Methods in this category
designed prior to our algorithm align feature points residing at protrusions and cavities.
In particular, given two sets of such points representing A and B, they align all possible
pairs of points from one set with all possible pairs from the other to generate potential
rigid motions. Since the number of feature pairs computed in our algorithm is similar to
the number of feature points computed in their algorithms, they inspect significantly more
transformations than ours (on the order of k^4 vs. k^2, where k is the number of feature
pairs or points), many of them meaningless or duplicates. In contrast, by aligning features
computed from the elevation function, we sample the transformation space in a much
sparser manner than previous approaches, focusing only on potentially good docking
locations. The size of the output of our algorithm is much smaller (on the order of 10^4
configurations without clustering for 1BRS), and as the configurations are generated by
fitting features, we expect to capture reasonable configurations unless the proteins undergo
dramatic conformational changes. We also comment that it is possible to further improve
the speed of our algorithm by the geometric hashing technique, as in [86].
Bibliography
[1] http://www.marketdata.nasdaq.com/mr4b.html.
[2] Protein data bank. http://www.rcsb.org/pdb/.
[3] P. Agarwal and K. R. Varadarajan. Efficient algorithms for approximating polygonal chains. Discrete Comput. Geom., 23:273–291, 2000.

[4] P. K. Agarwal, H. Edelsbrunner, J. Harer, and Y. Wang. Extreme elevation on a 2-manifold. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 357–365, 2004.

[5] P. K. Agarwal, H. Edelsbrunner, and Y. Wang. Computing the writhing number of a knot. Discrete Comput. Geom., 32:37–53, 2004.

[6] P. K. Agarwal, L. J. Guibas, A. Nguyen, D. Russel, and L. Zhang. Collision detection for deforming necklaces. To appear.

[7] P. K. Agarwal, S. Har-Peled, N. H. Mustafa, and Y. Wang. Near-linear time approximation algorithms for curve simplification in two and three dimensions. Algorithmica, to appear.

[8] P. K. Agarwal, S. Har-Peled, M. Sharir, and Y. Wang. Hausdorff distance under translation for points and balls. In Proc. 19th Annu. ACM Sympos. Comput. Geom., pages 282–291, 2003.

[9] P. K. Agarwal and J. Matousek. Ray shooting and parametric search. SIAM J. Comput., 22:540–570, 1993.

[10] P. K. Agarwal and J. Matousek. On range searching with semialgebraic sets. Discrete Comput. Geom., 11:393–418, 1994.

[11] P. K. Agarwal, M. Sharir, and S. Toledo. Applications of parametric searching in geometric optimization. J. Algorithms, 17:292–318, 1994.

[12] O. Aichholzer, H. Alt, and G. Rote. Matching shapes with a reference point. Intl. J. Comput. Geom. and Appl., 7:349–363, 1997.

[13] J. Aldinger, I. Klapper, and M. Tabor. Formulae for the calculation and estimation of writhe. J. Knot Theory and Its Ramifications, 4:343–372, 1995.
[14] H. Alt, B. Behrends, and J. Bloemer. Approximate matching of polygonal shapes. Annu. Math. Artif. Intell., 13:251–266, 1995.

[15] H. Alt, P. Brass, M. Godau, C. Knauer, and C. Wenk. Computing the Hausdorff distance of geometric patterns and shapes. Discrete Comput. Geom. – the Goodman-Pollack Festschrift, pages 65–76, 2003.

[16] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry, pages 75–91, 1995.

[17] H. Alt and L. Guibas. Discrete geometric shapes: matching, interpolation, and approximation. Handbook of Computational Geometry (J.-R. Sack and J. Urrutia, eds.), 1999.

[18] E. Althaus, O. Kohlbacher, H. P. Lenhof, and P. Muller. A combinatorial approach to protein docking with flexible side-chains. In Proceedings of the Fourth International Conference on Computational Molecular Biology (RECOMB), pages 8–11, 2000.

[19] N. Amenta and R. Kolluri. The medial axis of a union of balls. Comput. Geom: Theory Appl., 20:25–37, 2001.

[20] A. M. Amilibia and J. J. N. Ballesteros. The self-linking number of a closed curve in R^3. Journal of Knot Theory and Its Ramifications, 9(4):491–503, 2000.

[21] A. Amir, E. Porat, and M. Lewenstein. Approximate subset matching with Don't Cares. In Proc. 12th ACM-SIAM Symp. Discrete Algorithms, pages 305–306, 2001.

[22] M. Ankerst, G. Kastenmuller, H. Kriegel, and T. Seidl. 3D shape histograms for similarity search and classification in spatial databases. In Proc. of the 6th Int. Sympos. on Spatial Databases, volume 1651, pages 207–226, 1999.

[23] V. I. Arnold. Catastrophe Theory. Springer-Verlag, Berlin, Germany, 1984.

[24] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 147–155, 2002.

[25] M. J. Atallah. A linear time algorithm for the Hausdorff distance between convex polygons. Inform. Process. Lett., 17:207–209, 1983.

[26] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294(5):93–96, 2001.
[27] Y. A. Ban, H. Edelsbrunner, and J. Rudolph. Interface surfaces for protein-protein complexes. In RECOMB, pages 205–212, 2004.

[28] T. Banchoff. Self-linking numbers of space polygons. Indiana Univ. Math. J., 25:1171–1188, 1976.

[29] G. Barequet, D. Z. Chen, O. Daescu, M. T. Goodrich, and J. Snoeyink. Efficiently approximating polygonal paths in three and higher dimensions. Algorithmica, 33(2):150–167, 2002.

[30] W. R. Bauer, F. H. C. Crick, and J. H. White. Supercoiled DNA. Scientific American, 243:118–133, 1980.

[31] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. H. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res., 28:235–242, 2000.

[32] S. Bespamyatnikh, V. Choi, H. Edelsbrunner, and J. Rudolph. Accurate protein docking by shape complementarity alone, 2004.

[33] K. Brakke. Surface Evolver software documentation. http://www.geom.umn.edu/software/evolver/.

[34] C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing Inc., 2nd edition, 1999.

[35] M. L. Bret. Catastrophic variation of twist and writhing of circular DNAs with constraint? Biopolymers, 18:1709–1725, 1979.

[36] J. W. Bruce and P. J. Giblin. Curves and Singularities. Cambridge Univ. Press, England, second edition, 1992.

[37] D. Brutlag. DNA topology and topoisomerases, 2000. http://cmgm.stanford.edu/biochem201/Handouts/DNAtopo.html.

[38] G. Buck. Four-thirds power law for knots and links. Nature, 392:238–239, 1998.

[39] R. M. Burnett and J. S. Taylor. DARWIN: A program for docking flexible molecules. Proteins: Structure, Function, and Genetics, 41:173–191, 2000.

[40] G. Calugareanu. Sur les classes d'isotopie des noeuds tridimensionnels et leurs invariants. Czechoslovak Math. J., 11:588–625, 1961.
[41] C. J. Camacho, D. W. Gatchell, S. R. Kimura, and S. Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40:525–537, 2000.

[42] C. J. Camacho and S. Vajda. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci., 98:10636–10641, 2001.

[43] J. Cantarella. On comparing the writhe of a smooth curve to the writhe of an inscribed polygon, 2002.

[44] J. Cantarella, D. DeTurck, and H. Gluck. Upper bounds for the writhing of knots and the helicity of vector fields. In Proceedings of the Conference in Honor of the 70th Birthday of Joan Birman, J. Gilman, X. Lin, W. Menasco (eds.), 2000.

[45] J. Cantarella, R. Kusner, and J. Sullivan. Tight knot values deviate from linear relation. Nature, 392:237–238, 1998.

[46] D. Cardoze and L. Schulman. Pattern matching for spatial point sets. In Proc. 39th Annu. IEEE Sympos. Found. Comput. Sci., pages 156–165, 1998.

[47] O. Carugo and S. Pongor. Protein fold similarity estimated by a probabilistic approach based on Cα–Cα distance comparison. Journal of Molecular Biology, 315:887–898, 2002.

[48] F. Cazals, F. Chazal, and T. Lewiner. Molecular shape analysis based upon the Morse-Smale complex and the Connolly function. In Proc. 19th Annu. ACM Sympos. Comput. Geom., 2003.

[49] H. S. Chan and K. A. Dill. The effects of internal constraints on the configurations of chain molecules. Journal of Chemical Physics, 92(5):3118–3135, 1990.

[50] W. S. Chan and F. Chin. Approximation of polygonal curves with minimum number of line segments. In Proc. 3rd Annu. Internat. Sympos. Algorithms Comput., volume 650 of Lecture Notes Comput. Sci., pages 378–387. Springer-Verlag, 1992.

[51] B. Chazelle. Cutting hyperplanes for divide-and-conquer. Discrete Comput. Geom., 9:145–158, 1993.

[52] B. Chazelle, H. Edelsbrunner, L. Guibas, M. Sharir, and J. Stolfi. Lines in space: combinatorics and algorithms. Algorithmica, 15:428–447, 1996.
[53] R. Chen, L. Li, and Z. Weng. ZDOCK: An initial-stage protein docking algorithm. Proteins, 52(1):80–87, 2003.

[54] R. Chen, J. Mintseris, J. Janin, and Z. Weng. A protein-protein docking benchmark. Proteins, 52:88–91, 2003.

[55] H.-L. Cheng, T. K. Dey, H. Edelsbrunner, and J. Sullivan. Dynamic skin triangulation. Discrete Comput. Geom., 25:525–568, 2001.

[56] S. S. Chern. Curves and surfaces in Euclidean space. Studies in Global Geometry and Analysis, pages 16–56, 1967.

[57] L. P. Chew, D. Dor, A. Efrat, and K. Kedem. Geometric pattern matching in d-dimensional space. Discrete Comput. Geom., 21:257–274, 1999.

[58] L. P. Chew, M. T. Goodrich, D. P. Huttenlocher, K. Kedem, J. M. Kleinberg, and D. Kravets. Geometric pattern matching under Euclidean motion. Comput. Geom. Theory Appl., 7:113–124, 1997.

[59] L. P. Chew, D. Huttenlocher, K. Kedem, and J. Kleinberg. Fast detection of common geometric substructure in proteins. In Proc. 3rd Int. Conf. Comput. Mol. Biol., 1999.

[60] I. Choi, J. Kwon, and S. Kim. Local feature frequency profile: A method to measure structural similarity in proteins. PNAS, 101(11):3797–3802, 2004.

[61] V. Choi, P. K. Agarwal, H. Edelsbrunner, and J. Rudolph. Local search heuristic for rigid protein docking. In 4th Workshop on Algorithms in Bioinformatics, 2004.

[62] D. Cimasoni. Computing the writhe of a knot. Journal of Knot Theory and Its Ramifications, 10(3):387–395, 2001.

[63] K. Cole-McLaughlin, H. Edelsbrunner, J. Harer, V. Natarajan, and V. Pascucci. Loops in Reeb graphs of 2-manifolds. Discrete Comput. Geom. To appear.

[64] M. L. Connolly. Molecular surface review.

[65] M. L. Connolly. Analytical molecular surfaces calculation. Journal of Applied Crystallography, 16:548–558, 1983.

[66] M. L. Connolly. Measurement of protein surface shape by solid angles. J. Mol. Graphics, 4:3–6, 1986.
[67] M. L. Connolly. Shape complementarity at the hemoglobin α1β1 subunit interface. Biopolymers, 25:1229–1247, 1986.

[68] T. E. Creighton. Proteins: structures and molecular properties. W. H. Freeman and Company, New York, second edition, 1993.

[69] F. H. C. Crick. The packing of alpha-helices: simple coiled coils. Acta Crystallography, 6:689–697, 1953.

[70] F. H. C. Crick. Linking numbers and nucleosomes. Proc. Natl. Acad. Sci. USA, 73(8):2639–2643, 1976.

[71] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational geometry: algorithms and applications. Springer, 1997.

[72] Y. Diao and C. Ernst. The complexity of lattice knots. Topology Appl., pages 1–9, 1998.

[73] K. A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24(6):1501–1509, 1985.

[74] K. A. Dill. Dominant forces in protein folding. Biochemistry, 29(31):7132–7135, 1990.

[75] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer, 10(2):112–122, Dec. 1973.

[76] H. Edelsbrunner, J. Harer, and A. Zomorodian. Hierarchical Morse complexes for piecewise linear 2-manifolds. Discrete Comput. Geom., 30:87–108, 2003.

[77] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[78] H. Edelsbrunner and E. P. Mucke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graphics, 9:66–104, 1990.

[79] M. H. Eggar. On White's formula. Journal of Knot Theory and Its Ramifications, 9(5):611–615, 2000.
[80] I. Eidhammer, I. Jonassen, and W. R. Taylor. Structure comparison and structure patterns. J. Comput. Biol., 7:685–716, 2000.

[81] A. H. Elcock, D. Sept, and J. A. McCammon. Computer simulation of protein-protein interactions. J. Phys. Chem., 105:1504–1518, 2001.

[82] M. A. Erdmann. Protein similarity from knot theory and geometric convolution. In RECOMB, pages 195–204, 2004.

[83] R. Estkowski and J. S. B. Mitchell. Simplifying a polygonal subdivision while keeping it simple. In Proc. 17th Annu. ACM Sympos. Comput. Geom., pages 40–49, 2001.

[84] A. Fersht. Structure and mechanism in protein science. W. H. Freeman and Company, New York, third edition, 2000.

[85] P. Finn and L. Kavraki. Computational approaches to drug design. Algorithmica, 25:347–371, 1999.

[86] D. Fischer, S. L. Lin, H. Wolfson, and R. Nussinov. A geometry-based suite of molecular docking processes. J. Mol. Biol., 248:459–477, 1995.

[87] W. Fraczek. Mean sea level, GPS, and the geoid. ArcUser Online, 2003. ESRI web site: www.esri.com/news/arcuser/0703/summer-2003.html.

[88] F. B. Fuller. The writhing number of a space curve. In Proc. Natl. Acad. Sci. USA, volume 68, pages 815–819, 1971.

[89] F. B. Fuller. Decomposition of the linking number of a closed ribbon: a problem from molecular biology. In Proc. Natl. Acad. Sci. USA, volume 75, pages 3557–3561, 1978.

[90] H. A. Gabb, R. M. Jackson, and M. J. Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272(1):106–120, 1997.

[91] E. J. Gardiner, P. Willett, and P. J. Artymiuk. Protein docking using a genetic algorithm. Proteins: Structure, Function, and Genetics, 44:44–56, 2001.

[92] M. Godau. A natural metric for curves: Computing the distance for polygonal chains and approximation algorithms. In Proc. of the 8th Annual Symposium on Theoretical Aspects of Computer Science, pages 127–136, 1991.
[93] B. B. Goldman and W. T. Wipke. QSD: quadratic shape descriptors. 2. Molecular docking using quadratic shape descriptors (QSDock). Proteins, 38:79–94, 2000.
[94] D. Goldman, S. Istrail, and C. H. Papadimitriou. Algorithmic aspects of protein structure similarity. In IEEE Symposium on Foundations of Computer Science, pages 512–522, 1999.
[95] M. T. Goodrich, J. S. Mitchell, and M. W. Orletsky. Practical methods for approximate geometric pattern matching under rigid motion. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 103–112, 1994.
[96] L. J. Guibas, J. E. Hershberger, J. S. Mitchell, and J. Snoeyink. Approximating polygons and subdivisions with minimum link paths. Internat. J. Comput. Geom. Appl., 3(4):383–415, Dec. 1993.
[97] I. Halperin, B. Ma, H. Wolfson, and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Genetics, 47:409–443, 2002.
[98] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94–103, 2001.
[99] A. Hatcher and J. Wagoner. Pseudo-isotopies of compact manifolds. Société Mathématique de France, 1973.
[100] P. Heckbert and M. Garland. Survey of polygonal surface simplification algorithms. In SIGGRAPH 97 Course Notes: Multiresolution Surface Modeling, 1997.
[101] J. Hershberger and J. Snoeyink. An O(n log n) implementation of the Douglas-Peucker algorithm for line simplification. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 383–384, 1994.
[102] A. Hillisch and R. Hilgenfeld, editors. Modern Methods of Drug Discovery. Springer-Verlag, 2003.
[103] L. Holm and C. Sander. Mapping the protein universe. Science, 273:595–602, 1996.
[104] L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Research, 25(1):231–234, 1997.
[105] W. Humphrey, A. Dalke, and K. Schulten. VMD – Visual Molecular Dynamics. J. Molec. Graphics, 14:33–38, 1996.
[106] D. P. Huttenlocher, K. Kedem, and J. M. Kleinberg. On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane. In Proc. 8th Annu. ACM Sympos. Comput. Geom., pages 110–120, 1992.
[107] D. P. Huttenlocher, K. Kedem, and M. Sharir. The upper envelope of Voronoi surfaces and its applications. Discrete Comput. Geom., 9:267–291, 1993.
[108] H. Imai and M. Iri. An optimal algorithm for approximating a piecewise linear function. Journal of Information Processing, 9(3):159–162, 1986.
[109] H. Imai and M. Iri. Polygonal approximations of a curve – formulations and algorithms. In G. T. Toussaint, editor, Computational Morphology, pages 71–86. North-Holland, Amsterdam, Netherlands, 1988.
[110] P. Indyk, R. Motwani, and S. Venkatasubramanian. Geometric matching under noise: Combinatorial bounds and algorithms. In Proc. 10th Annu. ACM-SIAM Sympos. Discrete Alg., pages 457–465, 1999.
[111] R. M. Jackson, H. A. Gabb, and M. J. E. Sternberg. Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J. Mol. Biol., 276:265–285, 1998.
[112] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe. Protein flexibility predictions using graph theory. Proteins: Structure, Function, and Genetics, 44:150–165, 2001.
[113] J. Janin and C. Chothia. The structure of protein-protein recognition sites. J. Biol. Chem., 265:16027–16030, 1990.
[114] J. Janin and S. J. Wodak. The structural basis of macromolecular recognition. Adv. Protein Chem., 61:9–73, 2002.
[115] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol., 267:727–748, 1997.
[116] S. Jones and J. M. Thornton. Principles of protein-protein interactions. Proc. Natl. Acad. Sci., 93(1):13–20, 1996.
[117] R. D. Kamien. Local writhing dynamics. Eur. Phys. J. B, 1:1–4, 1998.
[118] G. Kastenmuller, H. Kriegel, and T. Seidl. Similarity search in 3D protein databases, 1998.
[119] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci., 89:2195–2199, 1992.
[120] K. Kedem, R. Livne, J. Pach, and M. Sharir. On the union of Jordan regions and collision-free translational motion amidst polygonal obstacles. Discrete Comput. Geom., 1:59–71, 1986.
[121] K. Klenin and J. Langowski. Computation of writhe in modeling of supercoiled DNA. Biopolymers, 54:307–317, 2000.
[122] P. Koehl. Protein structure similarities. Curr. Opin. Struct. Biol., 11:348–353, 2001.
[123] P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature Structural Biology, 6(2):108–111, 1999.
[124] W. G. Krebs and M. Gerstein. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Res., 28:1665–1675, 2000.
[125] A. R. Leach. Molecular Modelling: Principles and Applications. Pearson Education Limited, 1996.
[126] B. Lee and F. M. Richards. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol., 55:379–400, 1971.
[127] H. Lenhof. An algorithm for the protein docking problem. Bioinformatics: From Nucleic Acids and Proteins to Cell Metabolism, pages 125–139, 1995.
[128] M. Levitt. Protein folding by restrained energy minimization and molecular dynamics. J. Mol. Biol., 170:723–764, 1983.
[129] M. Levitt and M. Gerstein. A unified statistical framework for sequence comparison and structure comparison. Proc. Nat. Acad. Sci., 95:5913–5920, 1998.
[130] J. Liang, H. Edelsbrunner, and C. Woodward. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci., 7:1884–1897, 1998.
[131] S. Loncaric. A survey of shape analysis techniques. Pattern Recognition, 31(8):983–1001, 1998.
[132] A. C. Martin, C. A. Orengo, E. G. Hutchinson, S. Jones, M. Karmirantzou, R. A. Laskowski, J. B. Mitchell, C. Taroni, and J. M. Thornton. Protein folds and functions. Struct. Fold. Des., 6:875–884, 1998.
[133] N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. J. ACM, 30:852–865, 1983.
[134] A. Melkman and J. O'Rourke. On polygonal chain approximation. In G. T. Toussaint, editor, Computational Morphology, pages 87–95. North-Holland, Amsterdam, Netherlands, 1988.
[135] J. Milnor. Morse Theory. Princeton Univ. Press, New Jersey, 1963.
[136] J. C. Mitchell, R. Kerr, and L. F. T. Eyck. Rapid atomic density measures for molecular shape characterization. J. Mol. Graph. Model., 19:324–329, 2001.
[137] G. Moont and M. J. E. Sternberg. Modelling protein-protein and protein-DNA docking. Bioinformatics – From Genomes to Drugs, 1:361–404, 2001.
[138] A. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540, 1995.
[139] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH – A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
[140] P. N. Palma, L. Krippahl, J. E. Wampler, and J. J. G. Moura. BIGGER: A new soft docking algorithm for predicting protein interactions. Proteins: Structure, Function, and Genetics, 39:178–194, 2000.
[141] F. M. G. Pearl, D. Lee, J. E. Bray, I. Sillitoe, A. E. Todd, A. P. Harrison, J. M. Thornton, and C. A. Orengo. Assigning genomic sequences to CATH. Nucleic Acids Research, 28(1):277–282, 2000.
[142] W. F. Pohl. The self-linking number of a closed space curve. J. Math. Mech.,17:975–985, 1968.
[143] D. W. Ritchie. Evaluation of protein docking predictions using Hex 3.1 in CAPRI rounds 1 and 2. Proteins, 52(1):98–106, 2003.
[144] D. W. Ritchie and G. J. L. Kemp. Protein docking using spherical polar Fourier correlations. Proteins, 39:178–194, 2000.
[145] P. Rogen and B. Fain. Automatic classification of protein structure by using Gauss integrals. PNAS, 100(1):119–124, 2003.
[146] H. Samet. Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods. Addison-Wesley, Reading, MA, 1989.
[147] B. Sandak, R. Nussinov, and H. J. Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J. Comput. Biol., 5:631–654, 1998.
[148] S. Seeger and X. Laboureux. Feature extraction and registration: An overview. Principles of 3D Image Analysis and Synthesis, pages 153–166, 2002.
[149] I. N. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11(9):739–747, 1998.
[150] M. L. Sierk and W. R. Pearson. Sensitivity and selectivity in protein structure comparison. Protein Science, 13:773–785, 2004.
[151] A. P. Singh and D. L. Brutlag. Protein structure alignment: a comparison of methods. 2001.
[152] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Comput. Sys.Sci., 26(3):362–391, 1983.
[153] G. R. Smith and M. J. E. Sternberg. Prediction of protein-protein interactions by docking methods. Current Opinion in Structural Biology, 12:29–35, 2002.
[154] B. Solomon. Tantrices of spherical curves. Amer. Math. Monthly, 103:30–39, 1996.
[155] R. Srinivasan and G. D. Rose. LINUS: A hierarchic procedure to predict the fold of a protein. Proteins: Structure, Function, and Genetics, 22:1143–1159, 1995.
[156] M. J. E. Sternberg. Protein Structure Prediction: A Practical Approach. Oxford University Press, 1996.
[157] D. Swigon, B. D. Coleman, and I. Tobias. The elastic rod model for DNA and its application to the tertiary structure of DNA minicircles in mononucleosomes. Biophysical Journal, 74:2515–2530, 1998.
[158] M. B. Swindells, C. A. Orengo, D. T. Jones, E. G. Hutchinson, and J. M. Thornton. Contemporary approaches to protein structure classification. BioEssays, 20:849–891, 1998.
[159] M. Tabor and I. Klapper. The dynamics of knots and curves (Part I). Nonlinear Science Today, 4(1):7–13, 1994.
[160] W. R. Taylor, A. C. W. May, N. P. Brown, and A. Aszodi. Protein structure: geometry, topology and classification. Reports on Progress in Physics, 64:517–590, 2001.
[161] M. L. Teodoro, G. N. Phillips, and L. E. Kavraki. Understanding protein flexibility through dimensional reduction. Journal of Computational Biology, 10:617–634, 2003.
[162] A. Sali, E. Shakhnovich, and M. Karplus. How does a protein fold? Nature, 369:248–251, 1994.
[163] I. A. Vakser. Protein docking for low-resolution structures. Protein Engineering, 8:371–377, 1995.
[164] P. Veerapandian. Structure-based drug design. Marcel Dekker, 1997.
[165] R. C. Veltkamp and M. Hagedoorn. State of the art in shape matching. In M. S. Lew, editor, Principles of Visual Information Retrieval, Series in Advances in Pattern Recognition, 2001.
[166] A. V. Vologodskii, V. V. Anshelevich, A. V. Lukashin, and M. D. Frank-Kamenetskii. Statistical mechanics of supercoils and the torsional stiffness of the DNA double helix. Nature, 280:294–298, 1979.
[167] R. Weibel. Generalization of spatial data: principles and selected algorithms. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, editors, Algorithmic Foundations of Geographic Information Systems. Springer-Verlag, Berlin Heidelberg New York, 1997.
[168] J. White. Self-linking and the Gauss integral in higher dimensions. Amer. J. Math.,XCI:693–728, 1969.
[169] D. L. Wild and M. A. S. Saqi. Structural proteomics: Inferring function from protein structure. Current Proteomics, 1:59–65, 2004.
Biography
Yusu Wang was born on June 28, 1976 in Shanxi Province, China. After receiving her BS from Tsinghua University in 1998, she joined the Department of Computer Science at Duke University, where she received her MS in 2000 and is now pursuing a PhD in the area of geometric computing. Her research focuses on designing efficient computational methods for shape analysis problems, especially for protein structure analysis, by combining both geometry and topology. She is currently involved in BioGeometry, an interdisciplinary collaborative project among Duke University, Stanford University, UNC-Chapel Hill, and NC A&T, which addresses fundamental computational problems in representing, searching, simulating, analyzing, and visualizing biological structures.
Related Publications.
1. P. K. AGARWAL, H. EDELSBRUNNER AND Y. WANG. Computing the writhing number of a knot. Discrete Comput. Geom. 32 (2004), 37–53.
2. S. HAR-PELED AND Y. WANG. Shape fitting with outliers. SIAM J. Comput., to appear.
3. P. K. AGARWAL, S. HAR-PELED, N. MUSTAFA AND Y. WANG. Near-linear time approximation algorithms for curve simplification. Algorithmica, to appear.
4. P. K. AGARWAL, H. EDELSBRUNNER, J. HARER AND Y. WANG. Extreme elevation on a 2-manifold. In "Proc. 20th Annu. Sympos. Comput. Geom., 2004", 357–365.
5. P. K. AGARWAL, Y. WANG AND H. YU. A 2D kinetic triangulation with near-quadratic topological changes. In "Proc. 20th Annu. Sympos. Comput. Geom., 2004", 180–189.
6. P. K. AGARWAL, S. HAR-PELED, M. SHARIR AND Y. WANG. Hausdorff distance under translation for points and balls. In "Proc. 19th Annu. Sympos. Comput. Geom., 2003", 282–291.