GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS
by
Yusu Wang
Department of Computer Science
Duke University
Date:
Approved:
Prof. Pankaj K. Agarwal, Supervisor
Prof. Herbert Edelsbrunner, Co-advisor
Prof. John Harer
Prof. Johannes Rudolph
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of
Duke University
2004
ABSTRACT
GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS
by
Yusu Wang
Department of Computer Science
Duke University
Date:
Approved:
Prof. Pankaj K. Agarwal, Supervisor
Prof. Herbert Edelsbrunner, Co-advisor
Prof. John Harer
Prof. Johannes Rudolph
An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of
Duke University
2004
Abstract
Biology provides some of the most important and complex scientific challenges of our
time. With the recent success of the Human Genome Project, one of the main challenges in
molecular biology is the determination and exploitation of the three-dimensional structures of proteins and their functions. The ability of proteins to perform their numerous functions
is made possible by the diversity of their three-dimensional structures, which are capable
of highly specific molecular recognition. Hence, to attack the key problems involved, such
as protein folding and docking, geometry and topology become important tools. Despite
their essential roles, geometric and topological methods are relatively uncommon in com-
putational biology, partly due to a number of modeling and algorithmic challenges. This
thesis describes efficient computational methods for characterizing and comparing molecu-
lar structures by combining both geometric and topological approaches. Although most of
the work described here focuses on biological applications, the techniques developed can
be applied to other fields, including computer graphics, vision, databases, and robotics.
Geometrically, the shape of a molecule can be modeled as (i) a set of weighted points, representing the centers of atoms and their van der Waals radii; (ii) a polygonal curve, corresponding to a protein backbone or a DNA strand; or (iii) a polygonal mesh corresponding to a molecular surface. Each such representation emphasizes different aspects
of molecular structures at various scales, the choice of which depends on the underlying
applications. Characterizing molecular shapes represented in various ways is an important
step toward better understanding or manipulating molecular structures. In the first part of
the thesis, we study three geometric descriptions: the writhing number of DNA strands, the
level-of-details representation of protein backbones via simplification, and the elevation of
molecular surfaces.
The writhing number of a curve measures how many times a curve coils around itself
in space. It describes the so-called supercoiling phenomenon of double stranded DNA,
which influences DNA replication, recombination, and transcription. It is also used to
characterize protein backbones. This thesis proposes the first subquadratic algorithm for
computing the writhing number of a polygonal curve. It also presents an algorithm that
is easy to implement and runs in near-linear time on inputs that are typical in practice,
including DNA strands; this is significantly faster than the quadratic time needed by the algorithms used in current DNA simulation software.
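For concreteness, the quadratic-time baseline that the thesis improves upon can be sketched directly from the Gauss integral: sum, over all non-adjacent segment pairs, the signed solid angle given by the closed-form expression of Klenin and Langowski. The sketch below is illustrative only (the function names are mine); it is the slow baseline, not the subquadratic algorithm of Chapter 2.

```python
import math

def _sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def _cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def _dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def _unit(v):
    n = math.sqrt(_dot(v, v))
    return (v[0]/n, v[1]/n, v[2]/n) if n > 1e-12 else v

def _pair_angle(p1, p2, p3, p4):
    # Signed solid-angle contribution of the segment pair (p1p2, p3p4),
    # following Klenin & Langowski's closed-form expression.
    r12, r34 = _sub(p2, p1), _sub(p4, p3)
    r13, r14 = _sub(p3, p1), _sub(p4, p1)
    r23, r24 = _sub(p3, p2), _sub(p4, p2)
    n1 = _unit(_cross(r13, r14)); n2 = _unit(_cross(r14, r24))
    n3 = _unit(_cross(r24, r23)); n4 = _unit(_cross(r23, r13))
    s = sum(math.asin(max(-1.0, min(1.0, _dot(a, b))))
            for a, b in ((n1, n2), (n2, n3), (n3, n4), (n4, n1)))
    d = _dot(_cross(r34, r12), r13)   # zero for coplanar pairs: no contribution
    return s * (0.0 if d == 0 else math.copysign(1.0, d))

def writhe(points):
    """Writhing number of a closed polygonal curve: brute-force quadratic
    evaluation of the Gauss integral over all non-adjacent segment pairs."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # segments adjacent through the closing edge
            total += _pair_angle(points[i], points[(i + 1) % n],
                                 points[j], points[(j + 1) % n])
    return total / (2.0 * math.pi)
```

A planar curve has writhe zero, and mirroring a curve negates its writhe, which makes for quick sanity checks of the implementation.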
The level-of-detail (LOD) representation of a protein backbone helps to extract its main features. We compute LOD representations via curve simplification under the so-called
Fréchet error measure. This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve the global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. In this thesis, we present a simple approximation algorithm to simplify curves under the Fréchet error measure; it is the first simplification algorithm with guaranteed quality that runs in near-linear time in dimensions higher than two.
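To make the flavor of such a procedure concrete, here is a hedged sketch of mine (not the thesis's near-linear algorithm): a greedy pass extends each shortcut while its error stays within a tolerance, with the segment-to-subchain Fréchet error approximated by the discrete Fréchet distance against a sampled version of the shortcut. This brute-force version runs in quadratic time or worse.

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two chains of points, via the
    classic O(|P||Q|) dynamic program."""
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = cost
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], cost)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], cost)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), cost)
    return ca[n - 1][m - 1]

def _shortcut_error(P, i, j):
    # Error of replacing the subchain P[i..j] by the segment (P[i], P[j]),
    # approximated by sampling the segment with as many points as the subchain.
    sub = P[i:j + 1]
    k = len(sub)
    a, b = P[i], P[j]
    seg = [tuple(a[d] + (b[d] - a[d]) * t / (k - 1) for d in range(len(a)))
           for t in range(k)]
    return discrete_frechet(sub, seg)

def simplify(P, eps):
    """Greedy simplification: extend each shortcut while its (approximate)
    Fréchet error stays within eps."""
    keep = [0]
    i = 0
    while i < len(P) - 1:
        j = i + 1
        while j + 1 < len(P) and _shortcut_error(P, i, j + 1) <= eps:
            j += 1
        keep.append(j)
        i = j
    return [P[k] for k in keep]
```

For a gently zigzagging chain, a tolerance larger than the zigzag amplitude collapses the curve to its two endpoints, while a zero tolerance keeps every vertex.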
We propose a continuous elevation function on the surface of a molecule to capture
its geometric features such as protrusions and cavities. To define the function, we follow
the example of elevation as defined on Earth, but we go beyond this simpler concept to
accommodate general 2-manifolds. Our function is invariant under rigid motions; it scales with the surface and provides, beyond the location, also the direction and size of shape features. We present an algorithm for computing the points with locally maximum elevation. These points correspond to the locally most significant features. This succinct representation of features can be applied to aligning shapes, and we present one such application in the
second part of the thesis.
The second part of the thesis focuses on molecular shape matching algorithms. The
importance of shape matching, both similarity matching and complementarity matching,
arises from the general belief that the structure of a protein decides its function. Efficient
algorithms to measure the similarity between shapes help identify new types of protein
architecture, discover evolutionary relations, and provide biologists with computational
tools to organize the fast growing set of known protein structures. By modeling a molecule
as the union of balls, we study the similarity between two such unions by (variants of) the
widely used Hausdorff distance, and propose algorithms to find (approximately) the best translation under the Hausdorff distance measure.
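As an illustration (not the thesis's algorithms; all function names are mine), the Hausdorff distance between two sets of atom centers can be computed brute-force, and a reasonable initial translation can be obtained by aligning bounding-box corners: under the optimal translation, each bounding-box coordinate of the two sets differs by at most the optimal Hausdorff distance, so this alignment is within a factor 1 + sqrt(d) of optimal in d dimensions.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets (brute force)."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def align_bbox_translation(A, B):
    """Translation aligning the lower corners of the bounding boxes of A and B.
    A simple provable constant-factor heuristic for matching under translation."""
    d = len(A[0])
    return tuple(min(b[k] for b in B) - min(a[k] for a in A) for k in range(d))

def translate(P, t):
    # Apply a translation vector t to every point of P.
    return [tuple(x + dt for x, dt in zip(p, t)) for p in P]
```

When B is an exact translate of A, the bounding-box heuristic recovers the translation and the resulting Hausdorff distance is zero.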
Complementarity matching is crucial to understanding or simulating protein docking, the process in which two or more protein molecules bind to form a compound structure.
From a geometric perspective, protein docking can be considered as the problem of search-
ing for configurations with maximum complementarity between two molecular surfaces.
Using the feature information generated by the elevation function, we describe an efficient
algorithm to find promising initial relative placements of the proteins. The outputs can later be refined independently to locate docking positions, using a heuristic that improves the fit locally based on geometric, and possibly also chemical and biological, information.
And indeed there will be time
To wonder, “Do I dare?” and, “Do I dare?”
Time to turn back and descend the stair,
With a bald spot in the middle of my hair
…
Do I dare
Disturb the universe?
— T. S. Eliot, The Love Song of J. Alfred Prufrock
Acknowledgements
I came to take of your wisdom:
And behold I have found that which is greater than wisdom.
— Kahlil Gibran, The prophet
It is not without regret that I am writing this acknowledgment — while being grateful
for all those who made my life in the past few years a joyful and fruitful one, I know sadly
that our lives will part soon. The path towards obtaining a PhD was a struggle for me in
many ways. I can’t imagine how it would have been without their support.
It has been a great opportunity to have worked under the supervision of Profs. Pankaj
K. Agarwal and Herbert Edelsbrunner. The experience helped shape my attitude and ap-
proaches towards research both in computational geometry and in general. Pankaj led
me into the world of computational geometry with his broad knowledge. Besides sup-
port and guidance, he gave me great freedom in doing research, and is always patient and
understanding. It is hard to overestimate how much I have benefited from the numerous
discussions with him. Herbert showed me the “friendly” side of computational topology,
with his deep insights accompanied by illustrative explanation. His philosophy and vision
in research have greatly influenced me. I am deeply indebted to both of them for their
guidance and inspiration throughout the course of this dissertation. I would also like to
thank Profs. John Harer and Johannes Rudolph not only for being on my committee of
this thesis, but also for various discussions and collaborations. Support for this work was
provided by NSF under the grant NSF-CCR-00-86013 (the BioGeometry project).
The Duke CS department is a wonderful place. In particular, I wish to thank Drs. Lars Arge and Ron Parr, who were always open and ready to help with my career concerns. Dr.
Sariel Har-Peled, now a post-postdoc (well, an assistant professor) at UIUC, has been a
tremendous mentor and friend for me, especially during a period when I was swinging
among various career choices, and at a time when I was learning to walk on the ropes of
research. I learned from him to have an approximate perspective towards problems both in
computer science and in life.
I would like to thank all the graduate students and postdocs in the theory group who
provided a vibrant research environment that I have enjoyed and benefited from so much,
especially Nabil Mustafa, Hai Yu, Peng Yin, and Vijay Natarajan. I had a lot of fun both in
research and in life with friends such as Vicky Choi, David Cohen-Steiner, Ho-lun Cheng,
Ashish Gehani, Sathish Govindarajan, Jingquan Jia, Tingting Jiang, Dmitriy Morozov,
Nabil Mustafa, Vijay Natarajan, Jeff Phillips, Nan Tian, Eric Zhang, Hai Yu, and Haifeng
Yu. Those inspiring discussions with Nabil, Sathish and David will always mark my mem-
ories of the PhD life. Special thanks to my best friend Peng Yin. His humor, energy,
understanding and advice accompany me to this day. I also want to thank his wife Xia
Wu for kindly feeding me uncountable times.
I wish to thank all the staff in the department for being so friendly and helpful, especially
Ms. Celeste Hodges and Ms. Diane Riggs.
Last, but not least, I would like to thank my family for their love, support and confi-
dence in me. My parents and grandparents encouraged me to pursue my own dreams from childhood, and have never tried to pressure me into any life that others may consider successful. My sister and brother-in-law have always been there for me, full of under-
standing and support. What I have achieved was possible only because they were all by
my side. This thesis is dedicated to them.
Contents

Abstract

Acknowledgements

List of Tables

List of Figures

1 Introduction
   1.1 Protein Structure and Geometric Models
   1.2 Related Research Areas
   1.3 Shape Analysis in Molecular Biology
       1.3.1 Describing Shapes
       1.3.2 Matching Shapes
   1.4 Main Contributions

2 Writhing Number
   2.1 Introduction
   2.2 Prior and New Work
   2.3 Writhing and Winding
       2.3.1 Closed knots
       2.3.2 Open knots
   2.4 Computing Directional Writhing
   2.5 Experiments
       2.5.1 Algorithms
       2.5.2 Comparison
   2.6 Notes and Discussion

3 Backbone Simplification
   3.1 Introduction
   3.2 Prior and New Work
   3.3 Fréchet Simplification
       3.3.1 Algorithm
       3.3.2 Comparisons
   3.4 Experiments
   3.5 Notes and Discussions

4 Elevation Function
   4.1 Introduction
   4.2 Defining Elevation
       4.2.1 Pairing
       4.2.2 Height and Elevation
   4.3 Pedal Surface
   4.4 Capturing Elevation Maxima
       4.4.1 Continuity
       4.4.2 Elevation Maxima
   4.5 Algorithm
   4.6 Experiments
   4.7 Notes and Discussion

5 Matching via Hausdorff Distance
   5.1 Introduction
   5.2 Collision-Free Hausdorff Distance between Sets of Balls
       5.2.1 Computing the collision-free Hausdorff distance in 2D and 3D
       5.2.2 Partial matching
   5.3 Hausdorff Distance between Unions of Balls
       5.3.1 The exact 2D algorithm
       5.3.2 Approximation algorithms
   5.4 RMS and Summed Hausdorff Distance between Points
       5.4.1 Simultaneous approximation of Voronoi diagrams
       5.4.2 Approximating the RMS distance
       5.4.3 Approximating the summed Hausdorff distance
       5.4.4 Maintaining the 1-median function
       5.4.5 Randomized algorithm
   5.5 Notes and Discussion

6 Coarse Docking via Features
   6.1 Introduction
   6.2 Algorithm
       6.2.1 Scoring function
       6.2.2 Computing features
       6.2.3 Coarse alignment algorithm
   6.3 Experiments
   6.4 Notes and discussion

Bibliography

Biography

List of Tables

2.1 Comparisons on protein data
4.1 Table of singularities
4.2 Number of maxima for 1brs
4.3 Number of maxima with different resolutions
4.4 Covering density
6.1 Index-k features
6.2 Complex 1brs
6.3 25 test cases
6.4 Two-step for 25 test cases
6.5 Unbound benchmark

List of Figures

1.1 Protein structure
1.2 Protein models
1.3 Protein folding
1.4 Protein docking
2.1 DNA supercoiling
2.2 Sign of crossings
2.3 Worst case of writhe
2.4 Critical directions
2.5 Winding number
2.6 Spherical triangle
2.7 Open knot
2.8 Oriented edges
2.9 Convergence rate
2.10 Running time
2.11 Protein backbones
3.1 Fréchet matching
3.2 Fréchet simplification
3.3 Comparison between Fréchet and Hausdorff simplifications
3.4 Relation between the Fréchet and Hausdorff error measures
3.5 Results of Fréchet simplification
3.6 Running time
3.7 Comparisons between DP and GreedyFrechetSimp algorithms
3.8 Simplification of a protein
4.1 Four types of maxima
4.2 Extended persistence
4.3 Elevation on a 1-manifold
4.4 Pedal curve
4.5 Codimension-2 singularities
4.6 Discontinuity in elevation
4.7 Stratification
4.8 Mercedes star property
4.9 Another neighborhood pattern for a triple point
4.10 Other neighborhood patterns
4.11 Parameterization of a Gaussian neighborhood
4.12 Height difference for a 2-legged maximum
4.13 Height difference for a 3-legged maximum
4.14 Decay of maxima
4.15 Top 100 maxima for 1brs
4.16 Elevation on 1brs
5.1 Valid and forbidden regions
5.2 Voronoi diagram for a union of balls
5.3 Exponential grid
6.1 Predicting docking configurations
6.2 Maximum types again
6.3 Coarse alignment
6.4 Align features
6.5 Align pairs
Chapter 1
Introduction
If a living cell is viewed as a biochemical factory, then its main workers are protein
molecules, acting as catalysts, transporting small molecules, forming cellular structures,
and carrying signals, among other roles. As Jacques Monod states in his book Chance and
Necessity, “. . . it is in proteins that lies the secret of life.” Their functional diversity
is made possible by the diversity of their three-dimensional structures. Understanding or
simulating molecular processes involved in the formation of protein structures and their
biological functions is a major challenge of molecular biology. For most of the key prob-
lems involved in this challenge, such as protein folding, docking, structure classification,
and structure prediction, geometry and topology naturally play important roles. However, geometric methods are currently neither fully utilized nor fully investigated in attacking these key problems, partly due to a number of representational and algorithmic challenges.
To close this gap, in this thesis, we study shape analysis problems arising in molecular
biology by combining both geometric and topological approaches. In particular, we focus
on algorithms for describing and matching protein structures. Note that, in general, shape
characterization and matching are central to various application areas other than structural
biology, including computer vision, pattern recognition, and robotics [17, 131, 165]. Most
of the techniques that we have developed are applicable to these other fields as well.
In the remainder of this chapter, we first give a brief biological background on protein
structures and introduce some related research areas. More details can be found in standard
textbooks [34, 68, 125]. We then describe shape analysis problems arising in molecular biology from a computational perspective. We state our main contributions at the end of this
chapter.
1.1 Protein Structure and Geometric Models
A protein is a polymer consisting of a long chain of small building blocks, called amino
acids or residues. All amino acids have a 3-atom backbone, N-Cα-C, to which a side chain (denoted by R1 and R2 in Figure 1.1 (a)) is attached. Besides the side chain, a hydrogen atom is bonded to the backbone nitrogen atom, and an oxygen is doubly bonded to the carboxy carbon. There are 20 standard amino-acid residues, distinguishable by their side chains. The amino end (N) of an amino acid connects to the carboxy end (C) of the
preceding amino acid, forming a peptide bond. Thus the chemical structure of a protein
molecule can be viewed as a linear sequence of amino acids interconnected by peptide
bonds.
Figure 1.1: (a) Protein structure, with the backbone structure in the dotted boxes. (b) The folded state of a protein; each atom is modeled as a ball.
Though a linear sequence, a protein molecule folds into a compact and typically unique
three-dimensional structure under certain physiological conditions (see Figure 1.1 (b)).
This is the result of various atomic interactions, such as van der Waals and electrostatic
forces. The resulting structure, referred to as the native structure or the folded state, is how
a protein molecule exists in nature, and is the conformation in which a protein is able to
perform its physiological functions. In fact, given a protein molecule, its three-dimensional structure decides its functionality to a large extent [68, 125]. (For example, disruption of the native structures of proteins is the primary cause of several neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease.) Therefore, knowledge of the
protein structures is essential for understanding the principles that govern their functions
in nature. This three-dimensional structure is the focus of our study. In general, protein
structures are examined at different levels, referred to as protein structure architecture [34].
We present a brief description below.
- primary structure: the amino acid composition of the protein, i.e., the linear sequence of amino acids;
- secondary structure: common patterns in the conformation of protein backbones observed in nature; there are four major types of secondary structure elements (SSEs): α-helices, β-sheets, β-turns, and coils;
- supersecondary structure: the higher-scale structures organized by secondary structure elements, e.g., how SSEs are connected;
- tertiary structure: the global folded three-dimensional structure of the protein;
- quaternary structure: the structure of a complex of two or more proteins bound together.
How to model proteins appropriately is a crucial first step before we can visualize or
manipulate them. Several models have been proposed, depending on the objectives of
the underlying applications and/or what information one would like to emphasize. In the
literature, the term modeling refers to a broad collection of methods to describe not only
the geometric structures, but also the energetic aspects of a molecule [84, 125]. (For example, by using quantum mechanics, one can describe in detail the energy of any arrangement of atoms and molecules in a particular system.) In this thesis, we focus on geometric models of the three-dimensional structure of proteins [64].
Geometric shapes typically refer to a finite set of points, a space curve, or a surface.
In the context of molecular biology, a set of points corresponds to the set of centers of
atoms, possibly weighted by the van der Waals radii of the atoms. Sometimes, points
are connected by “sticks” that represent covalent bonds between atoms (Figure 1.2 (a)).
Such representations specify not only the position of each atom in a molecule, but also the chemical information.
Figure 1.2: A protein molecule represented as (a) the set of atom centers connected by sticks representing covalent bonds; (b) a space curve (the main-chain representation); and (c) the (van der Waals) surface of the union of atoms, each represented by a ball.
Sometimes, the details presented in the above representation are not necessary, or even
undesirable. The main chain representation is often exploited in such situations, where a
protein molecule is modeled as a curve in ℝ³ following the trace of the backbone atoms of
the amino acids (see Figure 1.2 (b)). Such a representation emphasizes the linear nature
of protein molecules, and shows clearly how this linear sequence of amino acids folds in
space. It provides a much simplified representation of a protein, while still maintaining its
main structural features. Consequently, this representation is popular in many applications,
especially those of high computational complexity, such as protein folding and structure
classification.
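As a concrete illustration of the main-chain model (a sketch of mine, not part of the thesis), the Cα trace of a protein can be read directly from the fixed-column ATOM records of the standard PDB file format:

```python
def ca_trace(pdb_lines):
    """Extract the Cα backbone curve (a list of 3D points) from PDB-format
    lines, using the fixed-column layout of ATOM records: atom name in
    columns 13-16, coordinates in columns 31-54 (1-indexed)."""
    pts = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            pts.append((float(line[30:38]),
                        float(line[38:46]),
                        float(line[46:54])))
    return pts
```

The resulting point sequence is exactly the polygonal curve on which backbone simplification (Chapter 3) and the writhing number (Chapter 2) operate.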
A surface representation of proteins is useful when the object of study is the space occu-
pied by a protein molecule, or when the global structure of the molecule is more important
than its local geometry. There are many ways to represent the surface of a molecule. If
we model each atom by a ball in ℝ³ with its van der Waals radius, then the surface of the union of these balls is referred to as the van der Waals surface (Figure 1.2 (c)). The solvent
accessible surface, originally proposed by Lee and Richards [126], is the surface traced out
by the center of a probe sphere (which typically represents the water molecule) rolling on
top of the VDW surface. The surface traced out by the inward-facing surface of this probe
sphere is called the molecular surface. The skin surface developed by Cheng et al. [55] is
more complicated, but has many elegant (mathematical) properties.
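For readers who want a numerical handle on the solvent-accessible surface, it can be approximated by the classical Shrake-Rupley sampling scheme: inflate every atom by the probe radius, scatter sample points on each inflated sphere, and keep only the points not buried inside any other inflated sphere. This is a standard method, not an algorithm of this thesis; the code is a sketch of mine.

```python
import math

def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral lattice)."""
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden * k
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def sasa(atoms, probe=1.4, n_samples=200):
    """Approximate solvent-accessible surface area via Shrake-Rupley sampling.
    atoms: list of (x, y, z, vdw_radius); probe is the solvent radius."""
    dirs = sphere_points(n_samples)
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        R = ri + probe                      # inflated radius for this atom
        exposed = 0
        for dx, dy, dz in dirs:
            px, py, pz = xi + R * dx, yi + R * dy, zi + R * dz
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i)
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * R * R * exposed / n_samples
    return total
```

An isolated atom recovers the full area of its inflated sphere, and an atom entirely contained in a larger one contributes nothing, which gives simple consistency checks.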
1.2 Related Research Areas
Despite the important role that protein structures play in understanding life, early research
in computational biology, or bioinformatics, focused on sequence analysis, rather than on
protein structures. One of the key reasons is that it is significantly easier to both acquire and
manage sequence data. (For example, the Human Genome Project has made massive amounts of protein sequence data available, while the output of experimentally determined protein structures, typically obtained by time-consuming and relatively expensive X-ray crystallography and NMR spectroscopy, is lagging far behind.) Nevertheless, with the tremendous success in sequence analysis,
the study of protein structures has become increasingly critical. For example, in the post-
genomic era, a major obstacle to the exploitation of the large volume of genome sequence
data is the functional characterization of the gene products (protein structures). Since
the three-dimensional structures of proteins are more conserved than the corresponding
sequences, many large-scale protein structure determination projects have been initiated
recently [169] to help analyze the space of functionally unannotated protein sequences.
These initiatives are widely referred to as structural genomics or structural proteomics. In
this subsection, we briefly mention several (not necessarily disjoint) research areas related
to protein structures. (It is impossible to survey these areas in a comprehensive manner in this thesis; each would require a whole book to do so!)
Protein folding and protein structure prediction. Predicting a protein’s structure
from its amino acid sequence is one of the most significant tasks tackled in computational
biology (Figure 1.3). Solving this problem will have enormous impacts on rational drug
design, cell modeling, and genetic engineering. It is therefore not surprising that it is
Figure 1.3: From left to right, snapshots of the B domain of staphylococcal protein A (in main-chain representation) at different stages of the folding process (http://parasol.tamu.edu/dsmft/research/folding).
considered the “holy grail” in the structural biology community; see, e.g., [123].
Two issues are involved here: (i) to understand the mechanism behind the protein folding process, i.e., how a protein folds in nature; and (ii) to predict the final folded conformation given a sequence of amino acids. These two aspects are obviously related, but not
necessarily equivalent: several successful approaches to predict protein structures do not
mimic the folding process, but rely on the knowledge of known protein structures.
There has been a long history of tackling the folding problem. Current approaches
can be classified into two categories, which we mention without going into any detail
here. For more information, refer to [26, 123, 156]. The first class, including comparative
modeling and threading, start by using a known template structure or known folds. The
second class, de novo or ab initio methods, predict structure from sequence directly, using
principles of atomic interactions and protein architecture. Despite the success of these
methods for some cases, especially in predicting structures of small protein molecules, the
protein folding problem remains largely unsolved. The main reasons include that the structure is defined by a large number of degrees of freedom (as highlighted by the Levinthal paradox: proteins fold into their specific three-dimensional conformations in milliseconds, a time span significantly shorter than would be expected if the molecule actually searched the entire conformation space for the lowest-energy state), and that the physical basis of protein
structural stability is not fully understood. The vigor of this field can be seen in the high participation in, and strong performance achieved at, the Critical Assessment of Structure Prediction (CASP) experiments (http://predictioncenter.llnl.gov).
Protein interactions. Two or more molecules interact with each other by forming an
intermolecular complex (either in a stable manner or temporarily), a process called docking or receptor-ligand recognition and binding. Such interactions are critical
to various biological processes, such as cell-cell recognition and enzyme catalysis and in-
hibition. The target macromolecules (receptors) are usually large, mostly proteins, while
the ligands can be either large, such as proteins, or small, such as drugs or cofactors (see
Figure 1.4). The sites where binding happens are called active sites or binding sites. They
Figure 1.4: Examples of (a) protein-small molecule docking: mainchain representation of HIV-1 protease bound to an inhibitor (in VDW representation); and (b) protein-protein docking: human growth hormone.
are usually places on the surfaces of proteins where chemical reactions or conformational
changes happen. Hence knowledge of the interaction between molecules is crucial in un-
derstanding, and even manipulating, their functions. As an example, many drug molecules
work by acting as inhibitors: they bind to the receptor proteins to block the active sites,
thus stopping undesired chemical reactions or molecular processes from happening. As a
result, efficient algorithms for docking drug molecules to target receptor proteins are one of the major ingredients in a rational drug design scheme [85, 102, 164].
The docking problem has attracted great attention from computer scientists as well
as biochemists, due to its strong geometric and algorithmic flavor [102, 81, 114]. Much
success has been achieved for docking a protein with a small molecule, or docking two
rigid proteins (i.e., each protein can only undergo rigid transformations). However, the
field remains rather open, especially in the case of protein-protein docking without the
“rigid” assumption [97]. In this case, as protein structures are complicated, modeling their
conformational changes introduces many degrees of freedom. Nevertheless, progress is
being made, and interested readers should refer to the results of the CAPRI experiments,
i.e., the Critical Assessment of PRedicted Interactions, for the newest advances in this
field (http://capri.ebi.ac.uk).
Protein structure comparison and classification. As a protein molecule with some
functional role evolves in the context of a living cell, its overall three-dimensional struc-
ture tends to remain unaltered, even when all sequence memory may have been lost [80].
This evolutionary resilience of protein three-dimensional structures is the fundamental rea-
son for comparing protein structures in molecular biology. Numerous comparison methods
have been proposed and developed in the past 20 years [122, 160]. The problem, however,
is difficult and remains unsolved. There is typically no clear definition for structural sim-
ilarity. Structural similarity is of interest at many levels: from the fine detail of backbone
and side-chain conformation at the residue level to the coarse similarity at the tertiary struc-
ture level. Besides, many situations require one to capture local similarity, which is hard
to describe.
Moreover, more and more protein structure data are becoming available: at the time of this writing, there are tens of thousands of protein structures in the Protein Data Bank [31],
and the number almost doubles every 18 months. It is therefore crucial to bring certain
order into protein structures by classifying them into families. Other than organizing the
large structure database, such classification can aid our understanding of the relationships
between structures and functions. For example, it has been shown that almost all enzymes have the so-called α/β folds [132], i.e., they have both α-helices and β-sheets in their structures. Furthermore, while each sequence typically generates a unique three-dimensional
structure, multiple sequences may produce similar folded structures, or folds. A natural
question arising is then: how many different folds are there in nature? Classification helps
to answer this question, and its solution is useful in annotating the sequence space by struc-
tures, thus functions, which is a central aspect in structural genomics that we mentioned
earlier [26]. Classification also enables us to experimentally determine far fewer protein structures: we can afford to determine only those that are likely to produce novel folds [160].
Currently, the three most popular classifications are SCOP [138], CATH [139, 141], and
FSSP [104], all of which are accessible via the world wide web. Similar to structure
comparison, a main difficulty of the classification problem arises from the fact that there is no consensus on how to organize the different categories. Classifying protein structures in a fully automatic way thus remains a daunting problem.
1.3 Shape Analysis in Molecular Biology
Above we have sketched some key research areas closely related to protein structures. Two
issues appear repeatedly there — how to describe and characterize structures and how to
develop efficient computational methods (algorithms). These two issues are obviously not
new to many research fields in computer science. In this section, we address these two
issues by identifying shape-analysis problems in molecular biology and describing them
from a computational perspective. Shape analysis problems have been studied extensively
in many fields including computer graphics, vision, geometric computing, robotics, and so
on. On the one hand, many of the techniques there can be adapted to attack problems in
molecular biology directly. On the other hand, protein structure analysis has many unique
properties, and new techniques are greatly needed.⁶
At a high level, we classify shape analysis problems into two broad categories, each
including many subtopics. Though our classification below is tailored towards molecular biological applications, one should note that the techniques developed are not necessarily
constrained to biological applications. Once again, we will only sample a few techniques
exploited in attacking these problems, as a full enumeration will go beyond the scope of this
thesis. For surveys on a subset of topics in shape analysis in general, refer to [17, 131, 165].
1.3.1 Describing Shapes
Modeling flexibility. We have introduced some basic geometric representations for
three-dimensional structures of proteins in Section 1.1. In some applications, more sophis-
ticated representations are required: protein molecules are in constant motion (vibration)
in solution, and they might undergo significant conformational changes at times (such as
in a protein-protein docking process). Therefore, it is important to incorporate flexibility
in modeling protein structures. The question then is of course how to model flexibility, which is complex because protein structures have many degrees of freedom.
On the one hand, special data structures are desirable to efficiently support changes in con-
formations: For example, in [6], a chain hierarchy has been proposed which can detect
collisions for deforming protein backbones efficiently. On the other hand, it is important to
characterize motions: what are the types of motions molecules undergo and where do they
happen. Techniques from robotics, motion planning, and graph theory have been exploited
successfully in several cases for either identifying possible motions [112] or for reducing
the degrees of freedom of the motions [85, 161].
Simplified representations. In some applications, simplified structures are needed to
⁶ We remark here that shape analysis problems are extremely hard for protein structures because the connection between protein structures and their functions is not yet well understood. In many situations, it is not clear what aspects of a structure give rise to a particular functionality.
help to manage complex problems. Hence many approaches use simplified protein struc-
tures, such as representing the backbone of a protein molecule as a set of fragments, each
corresponding to a secondary structure element [160, 155]. As another example, one model
proposed by Dill [49, 74] simplifies the protein backbone as beads chained together on a
unit lattice. The beads can either be hydrophobic or hydrophilic, with contacts between
hydrophobic beads being favored. Although fairly simplistic, this model yields results
surprisingly similar to those derived from experimental data when applied to the protein
folding problem [73, 162].
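Although this thesis does not use the lattice model, its simplicity makes it easy to state precisely. The following sketch computes the HP-model energy of a conformation under the common convention that each contact between non-consecutive hydrophobic beads contributes −1; the function name and the scoring convention are our own illustrative choices, not taken from the cited work.

```python
# A minimal sketch of the HP lattice model described above: the backbone
# is a self-avoiding walk on the unit grid, each residue is hydrophobic
# ('H') or polar ('P'), and every contact between non-consecutive H
# beads lowers the energy by one (an assumed, though common, convention).

def hp_energy(sequence, path):
    """sequence: string over {'H','P'}; path: list of (x, y) lattice points."""
    assert len(sequence) == len(path)
    assert len(set(path)) == len(path), "walk must be self-avoiding"
    index = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, p in enumerate(path):
        if sequence[i] != 'H':
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            j = index.get((p[0] + dx, p[1] + dy))
            # count each H-H contact once, skipping chain neighbors
            if j is not None and j > i + 1 and sequence[j] == 'H':
                energy -= 1
    return energy

# A 4-residue walk folded into a unit square: the two end beads touch.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```

On small instances, the model is typically applied to the folding problem by enumerating self-avoiding walks and minimizing this energy.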
Shape descriptors/signature. One aspect of shape characterization is to extract key
features or information of a given shape. For example, this is central to many approaches
that compare protein structures: The key information is typically stored in a shape de-
scriptor or signature, and similarity between two shapes can now be measured by some
distance between the corresponding descriptors. As another example, in order to understand protein-protein interactions, there has been much research on characterizing the interface where the two proteins interact with each other [27, 130]. Features used to describe such interfaces include the buried surface area, the tightness of the binding, hydrophobicity, and so on.
Extracting features and generating shape descriptors are widely used in graphics, vi-
sion, and robotics. Many techniques borrowed from there can be applied to applications in
molecular biology: statistical methods (such as histograms and harmonic maps) [118, 22],
geometry-based methods (such as turning angles) [59], and topology-based methods (such as the Connolly function) [66, 136] have all been exploited to generate shape descriptors
in structural biological applications.
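As a concrete instance of the statistical methods just cited, the sketch below builds a histogram of pairwise distances between sample points, in the spirit of shape-distribution descriptors, and compares two shapes by the L1 distance between histograms. The bin count, the distance cap, and the L1 comparison are illustrative choices of ours, not taken from the cited references.

```python
# A sketch of a histogram-style shape descriptor: the normalized
# distribution of pairwise distances between sample points. It is
# translation- and rotation-invariant by construction. The bin count
# and distance cap are arbitrary illustrative parameters.
import math, random

def distance_histogram(points, bins=8, dmax=4.0):
    hist = [0] * bins
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            hist[min(int(d / dmax * bins), bins - 1)] += 1
    total = n * (n - 1) // 2
    return [h / total for h in hist]

def descriptor_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(100)]
moved = [(x + 5.0, y, z) for x, y, z in cloud]   # same shape, translated
print(descriptor_distance(distance_histogram(cloud), distance_histogram(moved)))
```

The printed value is essentially zero, since pairwise distances are unchanged by translation; this invariance is exactly what makes such descriptors attractive for comparison without alignment.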
1.3.2 Matching Shapes
Similarity matching. Measuring similarity between protein structures is essential to
protein structure classification, and is needed when applying comparative modeling meth-
ods for predicting protein structures. There are in general two types of approaches for this
problem. The first type of methods are alignment-based. They consider matching as an
optimization problem by finding the best alignment (i.e., relative placement) of two input
structures under some scoring function (where the score evaluates how similar structures
are). Examples of this approach include DALI [103], STRUCTAL [129], and CE [149].
How to define the scoring function, or the distance between two structures, is an intriguing
problem in itself, and has received much attention [80, 158].
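For a fixed residue-to-residue correspondence, the optimization at the heart of such alignment-based scores can be illustrated by optimal rigid superposition. The sketch below uses the Kabsch SVD method and reports the RMSD after superposition; methods such as DALI, STRUCTAL, and CE also search over correspondences and use richer scoring functions, which is omitted here.

```python
# A sketch of scoring by optimal rigid superposition: given two
# equal-length lists of corresponding coordinates, find the rotation
# (Kabsch algorithm, via SVD) and translation minimizing the RMSD,
# and return that minimum RMSD as the similarity score.
import numpy as np

def kabsch_rmsd(P, Q):
    """P, Q: (n, 3) arrays of corresponding points."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = Pc @ R - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Identical shapes in different poses superimpose to (near) zero RMSD.
P = np.array([[0., 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]])
Q = P @ np.array([[0., -1, 0], [1, 0, 0], [0, 0, 1]]) + np.array([5., -2, 1])
print(kabsch_rmsd(P, Q) < 1e-8)  # True
```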
The second approach computes the similarity/distance between input structures di-
rectly, without producing any alignment to superimpose them. Many of the methods in this
category exploit shape descriptors [22, 145]. Another example is the contact-map over-
lay approach, which converts each protein structure into some special type of graph, and
similarity is measured as the size of the largest common subgraph between two such
graphs [94].
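The contact-map idea can be made concrete in a few lines. In the sketch below, a structure given by its C-alpha coordinates becomes a set of residue pairs within a distance cutoff (8 angstroms is a typical, but here assumed, threshold), and two equal-length chains are compared by counting shared contacts under the identity correspondence; the hard maximization over correspondences that the overlap method actually requires is not attempted.

```python
# A sketch of the contact-map representation: residues i and j (not
# consecutive along the chain) are "in contact" when their C-alpha
# atoms lie within a cutoff. The 8.0 cutoff is an assumed typical value.
import math

def contact_map(ca_coords, cutoff=8.0):
    n = len(ca_coords)
    return {(i, j)
            for i in range(n) for j in range(i + 2, n)   # skip chain neighbors
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff}

def shared_contacts(cm1, cm2):
    """Overlap under the identity correspondence (equal-length chains only)."""
    return len(cm1 & cm2)

coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (3.8, 3.8, 0)]
print(sorted(contact_map(coords)))  # [(0, 2), (0, 3), (1, 3)]
```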
In general, alignment-based methods involve searching in a large configuration space,
and thus have higher time complexity than the second type of approaches. They are also
less efficient when querying in a structure database. However, they are more reliable and
discriminative in measuring similarities. Refer to [150, 151] for a comparison between
current popular matching approaches in structural biology.
Complementarity matching. The main motivation to study complementarity match-
ing in molecular biology is to understand protein-ligand interactions. A simple geometric
formulation for the protein-protein docking problem is the following: given proteins A and B, find the transformation of B such that the two molecules best complement each other. In other words, this is partial surface matching under the constraint that the two surfaces do not intersect. The two main issues involved here are:⁷ (i) to evaluate the alignments generated, i.e., to find a good score function that produces few false positives [97, 153]; and (ii) to reduce the
complexity of the search procedure, e.g., by exploiting more efficient computation (such
as FFT or spherical harmonics) [53, 143], by better searching strategies (such as by ge-
netic algorithms) [39, 91], or by reducing the number of transformations inspected (such
as geometric hashing) [86].
Of course, in nature both molecules can change their conformation during the docking process. For large protein-protein interactions, it is complex to model flexibility in the matching procedure, and this is one of the main foci of current research [97, 153].
Classification and structure database. We mentioned protein structure classification
earlier for the purpose of organizing the rapidly-expanding collection of protein structures
available. There is also a need to organize structures into a database that can support efficient queries (i.e., given a query structure, return one or more structures from the database that contain it, in the case of a motif query, or that are similar to it). The pairwise structure
comparison that we discussed above (in similarity matching) is obviously a fundamen-
tal component in classification and query problems, and a straightforward way to classify
protein structures is by all-against-all comparisons. This is the method adopted by most
current classifications of protein structures, such as CATH and FSSP [139, 104]. It is,
however, rather inefficient, especially when combined with the alignment-based pairwise
matching procedures. Part of the reason this straightforward approach is used despite its inefficiency is that most past work has focused on how to classify protein structures in a reliable and automatic way (a problem that is still not satisfactorily solved) [122]. With better understanding of protein structures, and with the number
of known structures increasing rapidly, efficient clustering techniques become essential.
⁷ References are for molecular biological applications.
Several recently developed protein structure comparison techniques aim at developing a
similarity measure that satisfies the triangle inequality [60, 47, 145], so that many known clustering algorithms can be applied. In particular, in [60], Choi et al. exploit techniques from information retrieval in building their classification system.
1.4 Main Contributions
Our research touches both the shape description and the shape matching categories. We focus on developing efficient computational methods for describing or matching structures. Our approaches rely on both geometric and topological techniques.⁸ Software produced from this thesis work is available at the BioGeometry website (http://biogeometry.duke.edu/).
Part I. Shape description.
(1) Writhing number: The writhing number of a curve measures how many times a curve
coils around itself in space. It characterizes the so-called supercoiling phenomenon of dou-
ble stranded DNA, which influences DNA replication, recombination, and transcription. It
is also used to characterize protein backbones. We establish a relationship between the
writhing number of a space curve and the winding number, a topological concept. This
enables us to develop the first subquadratic algorithm for computing the writhing number
of a polygonal curve. We have also implemented a simpler algorithm that runs in near-
linear time on inputs that are typical in practice [5], including protein backbones and DNA
strands, in contrast to the quadratic-time algorithms used by current software.
(2) Simplification: The level-of-detail (LOD) representation of a protein backbone helps to single out its main features. One way to obtain LOD representations is via curve simplification. We study the simplification problem under the so-called Fréchet error measure.
⁸ We remark here that in the molecular biology literature, the word topology is typically used with a different meaning from ours: it mainly refers to the topology of the molecule itself, such as how the elements of a molecule are interconnected, while we exploit knowledge and tools from classical topology, such as Morse theory, in our approaches.
This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. We propose and implement a simple algorithm to simplify curves under the Fréchet error measure [7], which is the
first simplification algorithm that runs in near-linear time in dimensions higher than two
with guaranteed quality.
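Our algorithm concerns the continuous Fréchet measure, but the flavor of the "dog-leash" distance is easiest to convey through its discrete variant. The sketch below is the standard Eiter-Mannila dynamic program for the discrete Fréchet distance between two polygonal chains; it is illustrative only and is not the near-linear simplification algorithm of [7].

```python
# A sketch of the discrete Frechet distance between two chains P and Q,
# computed by the classic dynamic program: the shortest leash that works
# when both endpoints walk monotonically along their chains, vertex to vertex.
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

# Two parallel three-vertex chains one unit apart.
print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```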
(3) Elevation function: Given a molecular surface, to capture geometric features such
as protrusions and cavities, we design a continuous elevation function on the surface, and
compute the points with locally maximum elevation [4]. The intuition of the function
follows from elevation on Earth, by which we identify mountain peaks and valleys. But the concept is technically more involved to extend to general 2-manifolds. This function is scale-independent and provides, beyond the location, also the direction and size of shape features. Using the elevation function, we can describe the above geometric features in a reliable and succinct manner, which can aid in attacking the protein docking problem.
Part II. Shape matching.
(4) Matching via Hausdorff distance: By modeling a molecule as the union of a set of
balls (each representing an atom), we measure the similarity between two molecules by
variants of Hausdorff distance. In particular, we present algorithms that compute exactly
or approximately the minimum Hausdorff distances between two such unions under all
possible translations [8]. We also investigate the version in which we are constrained to
only translations under which the two sets remain collision-free (i.e., no ball from one set intersects the other set).
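For intuition, the sketch below evaluates the symmetric Hausdorff distance between two finite point sets, e.g. atom centers, at one fixed placement by brute force; the results of [8] concern unions of balls and, much harder, the minimization of this quantity over all translations.

```python
# A brute-force sketch of the Hausdorff distance between finite point
# sets A and B at a fixed placement: the largest distance from a point
# of either set to the nearest point of the other set.
import math

def directed_hausdorff(A, B):
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

print(hausdorff([(0, 0), (1, 0)], [(0, 0)]))  # 1.0
```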
(5) Docking via features: As mentioned earlier, from a geometric perspective, protein
docking can be considered as the problem of searching for configurations with the maxi-
mum complementarity between two molecular surfaces. Our goal is to efficiently compute
a small set of potentially good docking configurations based on the geometry of the two
structures. Given such a set, more sophisticated procedures can then be performed on each
of its members independently to locate the “real” docking configuration. To find such a
potential set, we would like to align cavities from one protein with protrusions from the
other, and these “meaningful” features are captured by the elevation function we have de-
signed. Our approach can compute important matching positions while inspecting many
fewer configurations than the exhaustive search or earlier geometric-hashing approaches.
Chapter 2
Writhing Number
2.1 Introduction
The writhing number is an attempt to capture the physical phenomenon that a cord tends to
form loops and coils when it is twisted. We model the cord by a knot, which we define to
be an oriented closed curve in three-dimensional space. We consider its two-dimensional
family of parallel projections. In each projection, we count���
or � � for each crossing,
depending on whether the overpass requires a counterclockwise or a clockwise rotation
(an an angle between 0 and � ) to align with the underpass. The writhing number is then
the signed number of crossings averaged over all parallel projections. It is a conformal
invariant of the knot and useful as a measure of its global geometry.
The writhing number attracted much attention after the relationship between the linking
number of a closed ribbon and the writhing number of its axis, expressed by the White
formula, was formally discovered independently by Calugareanu [40], Fuller [88], Pohl
[142], and White [168].
    Lk = Tw + Wr .    (2.1)
Here the linking number, Lk, is half the signed number of crossings between the two boundary curves of the ribbon, and the twisting number, Tw, is half the average signed number of local crossings between the two curves. The non-local crossings between the two curves correspond to crossings of the ribbon axis, which are counted by the writhing number, Wr. The linking number is a topological invariant, while the twisting number and
the writhing number are not. A small subset of the mathematical literature on the subject
can be found in [20, 79].
Besides the mathematical interest, the White Formula and the writhing number have
received attention both in physics and in biochemistry [70, 117, 128, 157]. For example,
they are relevant in understanding various geometric conformations we find for circular
DNA in solution, as illustrated in Figure 2.1 taken from [37]. By representing DNA as
Figure 2.1: Circular DNA takes on different supercoiling conformations in solution.
a ribbon, the writhing number of its axis measures the amount of supercoiling, which
characterizes some of the DNA’s chemical and biological properties [30].
As another example, the writhing number and some of its variants have also been ap-
plied to protein backbones, modeled as open curves, as shape descriptors to classify pro-
tein structures [128, 145]. The intuition for such approaches follows from the fact that the
writhing number of a space curve measures the relative position between any two points in
the curve and the relative orientation between the tangents at those points. (This view will
become clearer after we introduce Equation (2.3) in the next section.) When extended
to a polygonal curve, this means that the writhing number measures the relative position
and orientation between any two edges of the curve. Hence two protein backbones with
similar arrangements of secondary structure elements produce similar writhing numbers.
This chapter studies algorithms for computing the writhing number of a polygonal
knot. Section 2.2 introduces background work and states our results. Section 2.3 relates
the writhing number of a knot with the winding number of its Gauss map. Section 2.4
shows how to compute the writhing number in time less than quadratic in the number of
edges of the knot. Section 2.5 discusses a simpler sweep-line algorithm and presents initial
experimental results.
2.2 Prior and New Work
In this section, we formally define the writhing number of a knot and review prior al-
gorithms used to compute or approximate that number. We conclude by presenting our
results.
Definitions. A knot is a continuous injection K : S¹ → R³ or, equivalently, an oriented closed curve embedded in R³. We use the two-dimensional sphere of directions, S², to represent the family of parallel projections in R³. Given a knot K and a direction u ∈ S², the projection of K is an oriented, possibly self-intersecting, closed curve in a plane normal to u. We assume u to be generic, that is, each crossing of K in the direction u is simple and identifies two oriented intervals along K, of which the one closer to the viewer is the overpass and the other is the underpass. We count the crossing as +1 if we can align the two orientations by rotating the overpass in counterclockwise order by an angle between 0 and π. Similarly, we count the crossing as −1 if the necessary rotation is in clockwise order. Both cases are illustrated in Figure 2.2.

Figure 2.2: The two types of crossings when two oriented intervals intersect.

The Tait or directional writhing number of K in the direction u, denoted DWr(u), is the sum of crossings counted as +1 or −1 as explained. The writhing number is the averaged directional writhing number, taken over all directions u ∈ S²,

    Wr = (1/(4π)) ∫_{S²} DWr(u) du .    (2.2)

We note that a crossing in the projection along u also exists in the opposite direction, along −u, and that it has the same sign. Hence DWr(u) = DWr(−u), which implies that the writhing number can be obtained by averaging the directional writhing number over all points of the projective plane or, equivalently, over all antipodal point pairs (u, −u) of the sphere.
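The definitions above translate directly into a brute-force procedure for DWr(u): project the edges of the polygonal knot onto a plane normal to u and sum the signs of the crossings between non-adjacent edges. In the sketch below, the choice of orthonormal basis, the depth convention (a larger component along u means closer to the viewer), and the four-vertex example are our own illustrative assumptions; inspecting every edge pair is exactly the quadratic-time behavior that the algorithms of this chapter avoid.

```python
# A brute-force illustration of the directional writhing number DWr(u):
# project the edges of a closed polygon onto a plane normal to u and sum
# the signed crossings between non-adjacent edges. Conventions here
# (basis, depth, sample knot) are illustrative assumptions.
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dwr(vertices, u):
    """Signed crossing count of the closed polygon `vertices` (3D tuples)
    projected along the (assumed generic) direction u."""
    u = _unit(u)
    a = (1.0, 0.0, 0.0) if abs(u[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _unit(_cross(a, u))
    e2 = _cross(u, e1)                        # (e1, e2, u) is right-handed
    pts = [tuple(sum(v[k] * e[k] for k in range(3)) for e in (e1, e2, u))
           for v in vertices]
    n, total = len(pts), 0
    for i in range(n):
        ax, ay, az = pts[i]
        bx, by, bz = pts[(i + 1) % n]
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:         # adjacent around the cycle
                continue
            cx, cy, cz = pts[j]
            dx, dy, dz = pts[(j + 1) % n]
            rx, ry = bx - ax, by - ay         # 2D edge vectors
            sx, sy = dx - cx, dy - cy
            denom = rx * sy - ry * sx
            if denom == 0:
                continue                      # parallel in projection
            t = ((cx - ax) * sy - (cy - ay) * sx) / denom
            s = ((cx - ax) * ry - (cy - ay) * rx) / denom
            if not (0 < t < 1 and 0 < s < 1):
                continue
            sign = 1 if denom > 0 else -1     # sign of cross2(edge_i, edge_j)
            over_i = az + t * (bz - az) > cz + s * (dz - cz)
            total += sign if over_i else -sign
    return total

# One crossing: the diagonal at height 1 passes over the one at height 0.
knot = [(0, 0, 1), (2, 2, 1), (2, 0, 0), (0, 2, 0)]
print(dwr(knot, (0, 0, 1)))  # 1
```

On the sample knot, one diagonal passes over the other, giving a single positive crossing, and, as noted above, DWr(u) = DWr(−u).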
Computing the writhing number. Several approaches to computing the writhing number
of a smooth knot exactly or approximately have been developed. Consider an arc-length parameterization K : [0, L] → R³, and use p(s) and t(s) to denote the position and the unit tangent vectors at parameter s ∈ [0, L]. The following double integral formula for the writhing number can be found in [142, 159]:

    Wr = (1/(4π)) ∫∫ ⟨ t(s) × t(s'), p(s) − p(s') ⟩ / ‖p(s) − p(s')‖³ ds ds' .    (2.3)
If the smooth knot is approximated by a polygonal knot, we can turn the right hand side of
(2.3) into a double sum and approximate the writhing number of the smooth knot [33, 128].
This can also be done in a way so that the double sum gives the exact writhing number of
the polygonal knot [28, 121, 166].
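The discretized double sum mentioned above can be sketched as follows: applying the midpoint rule to the Gauss integral over the edges of a polygonal knot approximates the writhing number of the smooth knot. The exact polygonal formulas of [28, 121, 166] require more care; this numerical version is only an illustration.

```python
# A sketch of the midpoint-rule discretization of the Gauss double
# integral (2.3): sum the integrand over pairs of distinct edges,
# using edge midpoints, unit tangents, and edge lengths.
import math

def writhe_double_sum(vertices):
    n = len(vertices)
    mids, tans, lens = [], [], []
    for i in range(n):
        p, q = vertices[i], vertices[(i + 1) % n]
        e = tuple(b - a for a, b in zip(p, q))
        L = math.sqrt(sum(c * c for c in e))
        mids.append(tuple((a + b) / 2 for a, b in zip(p, q)))
        tans.append(tuple(c / L for c in e))
        lens.append(L)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = tuple(a - b for a, b in zip(mids[i], mids[j]))
            d2 = sum(c * c for c in r)
            t1, t2 = tans[i], tans[j]
            cx = (t1[1] * t2[2] - t1[2] * t2[1],
                  t1[2] * t2[0] - t1[0] * t2[2],
                  t1[0] * t2[1] - t1[1] * t2[0])
            total += (sum(a * b for a, b in zip(cx, r))
                      / d2 ** 1.5 * lens[i] * lens[j])
    return total / (4 * math.pi)

# A planar curve has writhing number zero: every term vanishes.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(abs(writhe_double_sum(square)) < 1e-12)  # True
```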
Alternatively, we may base the computation of the writhing number on the directional
version of the White formula, Lk = DTw(u) + DWr(u) for all u ∈ S². Recall that both the linking number and the twisting number are defined over the two boundary curves of a closed ribbon. Similar to the definition of DWr(u), the directional twisting number, DTw(u), is defined as half the sum of crossings between the two curves, each counted as +1 or −1 as described in Figure 2.2. We get (2.1) by integrating over S² and noting that the linking number does not depend on the direction. This implies

    Wr = Lk − Tw = DWr(u) + DTw(u) − Tw .    (2.4)
To compute the directional and the (average directional) twisting numbers, we expand K to a ribbon, which amounts to constructing a second knot that runs alongside but is disjoint from K. Expressions for these numbers that depend on how we construct this second knot can be found in [121]. Le Bret [35] suggests to fix a direction u and define the second knot such that in the projection it always runs to the left of K. In this case we have DTw(u) = 0, and the writhing number is the directional writhing number for u minus the twisting number.
A third approach to computing the writhing number is based on a result by Cima-
soni [62], which states that the writhing number is the directional writhing number for a
fixed direction u₀, plus the average deviation of the other directional writhing numbers from DWr(u₀). By observing that DWr(x) is the same for all directions x in a cell C of the decomposition of S² formed by the Gauss maps Γ and −Γ (also referred to as the tangent indicatrix or tantrix in the literature [56, 154]), we get

    Wr = DWr(u₀) + (1/(4π)) Σ_C Area(C) · (DWr_C − DWr(u₀)) ,    (2.5)

where DWr_C is DWr(x) for any one point x in the interior of C, and Area(C) is the area of C.
If applied to a polygonal knot, all three algorithms take time that is at least proportional
to the square of the number of edges in the worst case.
Our results. We present two new results. The first result can be viewed as a variation of
(2.4) and a stronger version of (2.5). For a direction u ∈ S² not on Γ and not on −Γ, let w(u) be its winding number with respect to Γ and −Γ. As explained in Section 2.3, this means that Γ and −Γ wind w(u) times around u.

THEOREM A. For a knot K and a direction u, we have

    Wr = DWr(u) − w(u) + (1/(4π)) ∫_{S²} w(x) dx .
Observe the similarity of this formula with (2.4), which suggests that the winding number
can be interpreted as the directional twisting number for a ribbon one of whose two boundary curves is K. We will prove Theorem A in Section 2.3. We will also extend the relation
in Theorem A to open knots and give an algorithm that computes the average winding
number in time proportional to the number of edges. Our second result is an algorithm that
computes the directional writhing number for a polygonal knot in time sub-quadratic in the
number of edges.
THEOREM B. Given a polygonal knot K with n edges and a direction u ∈ S², DWr(u) can be computed in time O(n^{4/3+ε}), where ε is an arbitrarily small positive constant.
Figure 2.3: A knot whose directional writhing number is quadratic in the number of edges.
Theorems A and B imply that the writhing number for a polygonal knot can be computed
in time O(n^{4/3+ε}). As shown in Figure 2.3, the number of crossings in a projection can be as large as quadratic in n. The sub-quadratic running time is achieved because the
algorithm avoids checking each crossing explicitly. We also present a simpler sweep-line
algorithm that checks each crossing individually and therefore does not achieve the worst-
case running time of the algorithm in Theorem B. It is, however, fast when there are few
crossings.
2.3 Writhing and Winding
In this section, we develop our geometric understanding of the relationship between the
writhing number of a knot and the winding number of its Gauss map. We define the Gauss
map as the curve of critical directions, prove Theorem A, and give a fast algorithm for
computing the average winding number.
2.3.1 Closed knots
Critical directions. We specify a polygonal knot K by the cyclic sequence of its vertices, x₀, x₁, …, x_{n−1} in R³. We use indices modulo n and write tᵢ = (x_{i+1} − xᵢ)/‖x_{i+1} − xᵢ‖ for the unit vector along the edge xᵢ x_{i+1}. Note that tᵢ is also a direction in R³ and a point in S². Any two consecutive points tᵢ and t_{i+1} determine a unique arc, which, by definition, is the shorter piece of the great circle that connects them. The cyclic sequence t₀, t₁, …, t_{n−1} thus defines an oriented closed curve Γ in S². We also need the antipodal curve, −Γ, which is the central reflection of Γ through the origin.
Figure 2.4: In all three cases, the viewing direction slides from left to right over the orientedgreat circle of directions defined by the hollow vertex and the solid edge. The directional writhingnumber changes only in the third case, where we lose a positive crossing.
The directions u on Γ and −Γ are critical, in the sense that the directional writhing number changes when we pass through u along a generic path in S², and these are the only critical directions [62]. We sketch the proof of this claim for the polygonal case. It is clear that u ∈ S² is critical only if it is parallel to a line that passes through a vertex xᵢ and a point on an edge xⱼ x_{j+1} of the knot that is not adjacent to xᵢ. There are n(n − 2) such vertex-edge pairs, each defining a great circle in S². First, we note that only n of these great circles actually carry critical points, namely, the great circles that correspond to j = i + 1 and j = i − 2. The reason for this is shown in Figure 2.4, where we see that the writhing number does not change unless xᵢ is separated from xⱼ x_{j+1} by only one edge along the knot. Second, assuming j = i + 1, we observe that the subset of directions along which xᵢ projects onto x_{i+1} x_{i+2} is the arc from tᵢ to the direction v = (x_{i+2} − xᵢ)/‖x_{i+2} − xᵢ‖ in S², and symmetrically the arc from −tᵢ to −v. The subset of directions along which x_{i+2} projects onto xᵢ x_{i+1} are the arcs from t_{i+1} to v and from −t_{i+1} to −v. The points tᵢ, v, and t_{i+1} lie on a common great circle, and v lies on the arc tᵢ t_{i+1}. This implies that the concatenation of the arcs tᵢ v and v t_{i+1} is the arc tᵢ t_{i+1}, and that of the arcs (−tᵢ)(−v) and (−v)(−t_{i+1}) is the arc (−tᵢ)(−t_{i+1}). It follows that Γ and −Γ indeed comprise all critical directions.
Decomposition. The curves Γ and −Γ are both oriented, which is essential. We say a direction u ∈ S² lies to the left of an oriented arc a if it lies in the open hemisphere to the left of the oriented great circle that contains a. Equivalently, u sees that great circle oriented in counterclockwise order. If u passes from the left of an arc a of Γ to its right, then we either lose a positive crossing (as in the third row of Figure 2.4), or we pick up a negative crossing. Either way the directional writhing number decreases by one. This motion corresponds to −u passing from the right of the arc −a of −Γ to its left. Since the directional writhing numbers at u and −u are the same, we decrease the directional writhing number by one in the opposite view as well. In other words, if u moves from the left of an arc of −Γ to its right, then the effect on the directional writhing number is the opposite from what it is for an arc of Γ. These simple rules allow us to keep track of the directional writhing number while moving around in S². The curves Γ and −Γ decompose S² into cells within which the directional writhing number is invariant. We can thus rewrite (2.2) as

    Wr = (1/(4π)) Σ_C Area(C) · DWr_C ,

where the sum ranges over all cells C of the decomposition, and DWr_C is the directional writhing number of any one point in the interior of C. Equation (2.5) of Cimasoni can now be obtained by subtracting DWr(u₀) from DWr_C inside the sum and adding it outside the sum.
This reformulation provides an algorithm for computing the writhing number.

Step 1. Compute DWr(z) for an arbitrary but fixed direction z.
Step 2. Construct the decomposition of S² into cells, label each cell c with DWr(c) - DWr(z), and form the sum as in (2.5).

The running time for Step 2 is quadratic in the worst case, as the 2n arcs can create quadratically many
cells. We improve the running time to O(n) and, at the same time, simplify the algorithm.
First we prove Theorem A.
Winding numbers. We now introduce a function w over S² that may be different from DWr
but changes in the same way. In other words, w(y) - w(z) = DWr(y) - DWr(z)
for all y, z ∈ S². This function is the winding number of a point x ∈ S² with respect
to the two curves Γ and -Γ that do not contain x. Observe that the space obtained by
removing two points from the two-dimensional sphere is topologically an annulus. We
fix non-critical, antipodal directions z and -z and define w(x) equal to the number of
times Γ winds around the annulus obtained by removing x and -z plus the number of
times -Γ winds around the annulus obtained by removing x and z. This is illustrated in
Figure 2.5, where w(z) = w(-z) = 0 and w(x) = 1. Here we count the winding of Γ in
counterclockwise order as seen from x positive, and winding in clockwise order negative.
Symmetrically, we count the winding of -Γ in clockwise order as seen from x positive,
and winding in counterclockwise order negative. Imagine moving a point p along Γ and
connecting p to x with a circular arc. Specifically, we use the circle that passes through
p, x, and -z and the arc with endpoints p and x that avoids -z. Symmetrically, we move
a point p' along -Γ and connect x to p' with the appropriate arc of the circle passing through
x, p', and z.

Figure 2.5: The winding number counts the number of times Γ separates x from -z and -Γ separates x from z.

Locally at x we observe continuous movements of the two arcs. Clockwise
and counterclockwise movements cancel, and w(x) is the number of times the first arc
rotates in counterclockwise order around x plus the number of times the second arc rotates
in clockwise order around x. The winding number of x is always an integer but can be
negative.
Observe that w indeed changes in the same way as DWr does. Specifically, w drops by
1 if x crosses Γ from left to right, and it increases by 1 if x crosses -Γ from left to right.
Starting from the definition (2.2) of the writhing number, we thus get

    Wr(K) = (1/4π) ∫_{u ∈ S²} DWr(u) du
          = (1/4π) ∫_{u ∈ S²} (DWr(z) - w(z) + w(u)) du
          = DWr(z) - w(z) + (1/4π) ∫_{u ∈ S²} w(u) du,

which completes the proof of Theorem A.
Signed area modulo 2. Observe that the writhing number changes continuously under
deformations of the knot, as long as K does not pass through itself. When K performs a
small motion during which it passes through itself there is a ±2 jump in DWr(z), while the
average winding number changes only slightly. We use these observations to give a new
proof of Fuller's relation [13, 89],

    Wr(K) + 1 ≡ Area(Γ)/2π (mod 2),    (2.6)

where Area(Γ) is the signed area enclosed by the curve Γ in S². Note first that
the fractional parts of Wr(K) and of the average winding number (1/4π) ∫ w(u) du agree,
because both DWr(z) and w(z) in Theorem A are integers. We start with K
being a circle in R², in which case (2.6) holds because Wr(K) = 0 and Area(Γ) = 2π. Other
than continuous changes, we observe jumps of ±4π in Area(Γ) when K passes through itself.
Theorem A together with the fact that the fractional parts of Area(Γ)/2π and of the average winding number are the
same implies that (2.6) is maintained during the deformation. Fuller's relation follows
because every knot can be obtained from the circle by continuous deformation.
Computing the average winding number. Three generic points a, b, c ∈ S² define three
arcs, which bound the spherical triangle abc. Recall that the area of abc is the sum of its angles
minus π. We define the signed area of abc as Φ(abc) = area(abc) if c lies to the left of
the oriented arc ab, and as Φ(abc) = -area(abc) if it lies to the right. Let z ∈ S² be a
non-critical direction. As shown in Figure 2.6, every arc a_i of Γ forms a unique spherical
triangle together with z. Let Φ_i be its signed area. The corresponding arc -a_i of -Γ forms
the antipodal spherical triangle with signed area -Φ_i.

Figure 2.6: The two spherical triangles defined by an arc of Γ and its antipodal arc of -Γ.

The winding number of a direction x ∈ S² can be obtained by counting the number of spherical triangles that contain
it. To be more specific, we call a spherical triangle positive if its signed area is positive and
negative if its signed area is negative. Let P(x) and N(x) be the numbers of positive and
negative spherical triangles formed by arcs of Γ that contain x, and similarly let P'(x) and N'(x) be
the numbers of positive and negative antipodal triangles formed by arcs of -Γ that contain x. Then

    w(x) = w(z) + P(x) - N(x) - P'(x) + N'(x).

To see this note that the equation is correct for a point x near z and remains correct as x
moves around and crosses arcs of Γ and of -Γ. The average winding number is thus

    (1/4π) ∫_{x ∈ S²} w(x) dx = w(z) + (1/4π) Σ_i Φ_i + (1/4π) Σ_i Φ_i
                              = w(z) + (1/2π) Σ_i Φ_i.

Computing the sum in this equation is straightforward and takes only time O(n).
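The only nontrivial primitive in this O(n) computation is the signed area of a single spherical triangle. The Python sketch below (an illustration, not the thesis implementation) evaluates Φ via Girard's theorem; determining the sign from the orientation determinant det[a, b, c] is an assumption chosen to match the left/right convention above for generic triangles.

```python
import numpy as np

def _tangent(u, v):
    # unit tangent at u of the great-circle arc from u toward v
    w = v - (u @ v) * u
    return w / np.linalg.norm(w)

def signed_spherical_area(a, b, c):
    # Signed area Phi(abc) of the spherical triangle with unit vertices a, b, c:
    # Girard's theorem (sum of angles minus pi), signed by the orientation
    # determinant det[a, b, c] (an assumed stand-in for the left/right rule).
    a, b, c = (np.asarray(p, float) for p in (a, b, c))
    def angle(u, v, w):
        return np.arccos(np.clip(_tangent(u, v) @ _tangent(u, w), -1.0, 1.0))
    area = angle(a, b, c) + angle(b, c, a) + angle(c, a, b) - np.pi
    orient = np.linalg.det(np.stack([a, b, c]))
    return np.copysign(area, orient) if orient != 0 else 0.0
```

For the positively oriented octant triangle with vertices (1,0,0), (0,1,0), (0,0,1) all three angles are π/2, so the signed area is π/2; reversing the orientation flips the sign.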
2.3.2 Open knots

We define an open knot as a continuous injection K: [0, 1] → R³. Equivalently, it is an
oriented curve, embedded in R³, with endpoints. The directional writhing number of K is
well-defined, and the writhing number is the directional writhing number averaged over
all parallel projections, as before. Assume K is a polygon specified by the sequence of its
vertices, t₀, t₁, ..., t_{n-1}, and let K̄ be the knot obtained by adding the edge t_{n-1} t₀. The
critical directions of K differ in two ways from those of K̄:

(i) there are critical directions of K̄ that are not critical for K, namely the ones whose
definition includes a point of the added edge t_{n-1} t₀;

(ii) there are new critical directions, namely those defined by an endpoint (t₀ or t_{n-1})
and another point of the polygon but not on the two adjacent edges.

To see that the directions in (ii) are indeed critical for K, examine the first two rows of
Figure 2.4. The hollow vertex is now an endpoint of K, so we remove one of the two
dashed edges. Because of this change, the directional writhing number changes at the
moment the hollow vertex passes over the solid edge. Changing the critical curve Γ̄ of K̄
to the critical curve Γ of K can thus be achieved by removing the arcs of Case (i) and adding
the arcs of Case (ii). We illustrate this process in Figure 2.7. To describe the process, we
Figure 2.7: The critical curves of the knot K̄ are marked by hollow vertices, and the additions required for the critical curves of the open knot K are marked by solid black vertices.

define u_i, v_i, and w_i as the unit directions determined by an endpoint of K (t₀ or t_{n-1})
and the vertex t_i; the precise arcs they delimit are indicated in Figure 2.7. We get the critical curve Γ from Γ̄ by

1. removing the arcs of Γ̄ contributed by the added edge t_{n-1} t₀, together with the partial arcs adjacent to them,
2. adding the new paths through the directions u_i and v_i defined by the endpoints t₀ and t_{n-1}.

Note that Step 2 adds a piece of -Γ̄ to the new
critical curve Γ. Symmetrically, we get -Γ from -Γ̄. Everything we said earlier about the
winding number of the critical curve Γ̄ of K̄ applies equally well to the critical curve Γ of
K. Similarly, all algorithms described in the subsequent sections apply to knots as well as
to open knots.
2.4 Computing Directional Writhing
In this section, we present an algorithm that computes the directional writhing number of
a polygonal knot with n edges in time roughly proportional to n^{4/3}. The algorithm uses
complicated subroutines that may not lend themselves to an easy implementation.

Reduction to five dimensions. Assume without loss of generality that we view the knot
K from above, that is, in the direction of z = (0, 0, -1). Each edge e_i = t_i t_{i+1} of
K is oriented. Another edge e_j that crosses e_i in the projection either passes
above or below and it either passes from left to right or from right to left. The four cases
are illustrated in Figure 2.8 and classified as positive and negative crossings according to
Figure 2.2. Letting p_i and n_i be the numbers of edges that form positive and negative
crossings with e_i, the directional writhing number is

    DWr(z) = (1/2) Σ_i p_i - (1/2) Σ_i n_i.

Figure 2.8: The four ways an oriented edge can cross another.

To compute the sums of the p_i and n_i efficiently, we map edges in R³ to points and half-
spaces in R⁵. Specifically, let ℓ_i be the oriented line that contains the oriented edge e_i and
use Plücker coordinates as explained in [52] to map ℓ_i to a point π_i ∈ R⁵ or alternatively to
a half-space h_i in R⁵. The mapping has the property that ℓ_i and ℓ_j form a positive crossing
if and only if π_j lies in the interior of h_i. We use this correspondence to compute Σ_i p_i in
two stages: first we collect the ordered pairs of oriented lines that form positive crossings,
and second we count among them the pairs of edges that cross.
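Before turning to this sub-quadratic machinery, note that the quantity itself is easy to compute by brute force in O(n²) time: project the edges, find the crossings, and add up the signs. The Python sketch below is an illustration, not the thesis code; its crossing-sign convention (positive when the over-strand and under-strand directions form a counterclockwise frame) is an assumption, so only its magnitudes and invariances should be trusted.

```python
import numpy as np

def _crossing_params(a0, a1, b0, b1):
    # 2D parameters (s, t) with a0 + s(a1-a0) == b0 + t(b1-b0), both strictly
    # interior, or None if the projected segments do not cross
    d1, d2 = a1 - a0, b1 - b0
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:          # parallel projections
        return None
    r = b0 - a0
    s = (r[0] * d2[1] - r[1] * d2[0]) / denom
    t = (r[0] * d1[1] - r[1] * d1[0]) / denom
    return (s, t) if 0 < s < 1 and 0 < t < 1 else None

def directional_writhing(vertices, closed=False):
    # Brute-force DWr for the projection along the z-axis: one signed term per
    # crossing pair, matching DWr(z) = (1/2) sum_i (p_i - n_i).
    v = [np.asarray(p, float) for p in vertices]
    n = len(v)
    m = n if closed else n - 1
    edges = [(v[i], v[(i + 1) % n]) for i in range(m)]
    dwr = 0
    for i in range(m):
        for j in range(i + 1, m):
            a0, a1 = edges[i]
            b0, b1 = edges[j]
            hit = _crossing_params(a0[:2], a1[:2], b0[:2], b1[:2])
            if hit is None:
                continue
            s, t = hit
            za = a0[2] + s * (a1[2] - a0[2])
            zb = b0[2] + t * (b1[2] - b0[2])
            over, under = ((a1 - a0), (b1 - b0)) if za > zb else ((b1 - b0), (a1 - a0))
            # assumed sign convention: positive if (over, under) is a CCW frame
            dwr += int(np.sign(over[0] * under[1] - over[1] * under[0]))
    return dwr
```

A four-vertex open curve whose two non-adjacent edges cross once in the projection yields |DWr| = 1, and reversing the orientation of the curve leaves the value unchanged, as it must.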
Recursive algorithm. It is convenient to explain the algorithm in a slightly more general
setting, where E and F are sets of m and n oriented edges in R³. Let P(E, F) denote
the number of pairs (e, f) ∈ E × F that form positive crossings, and note that P(E, E) =
Σ_i p_i if E is the set of edges of the knot K and F = E. We map E to a set Π of
points and F to a set H of half-spaces in R⁵. Let r be a sufficiently large constant.
A (1/r)-cutting of Π and H is a collection of pairwise disjoint simplices covering R⁵ such that
each simplex intersects at most n/r hyperplanes bounding the half-spaces in H. We use
the algorithm in [10] to compute a (1/r)-cutting consisting of s simplices in time O(n r⁴),
where s is at most r⁵ times a constant independent of r. For each simplex Δ_j in the
cutting, we let Π_j ⊆ Π be the points contained in Δ_j, we let H_j ⊆ H be the half-spaces that
contain Δ_j in their interiors, and we let E_j ⊆ E and F_j ⊆ F be the sets of edges that correspond
to Π_j and to the half-spaces whose bounding hyperplanes intersect Δ_j. Letting
m_j = |Π_j| and n_j = |F_j|, we have Σ_j m_j = m and n_j ≤ n/r. By construction,
every pair in Π_j × H_j defines a pair of lines that form a positive crossing. For each
simplex Δ_j, we count the pairs among them whose edges actually cross in the projection, and let
c_j be the number of such pairs. Then

    P(E, F) = Σ_j c_j + Σ_j P(E_j, F_j).

Note that c_j is the number of crossings between projections of the line segments in E_j
and the segments corresponding to H_j. We can therefore use the algorithm in [51] to compute all numbers c_j, for
1 ≤ j ≤ s, in time Σ_j O(m_j^{2/3} n_j^{2/3} log n + (m_j + n_j) log n). We recurse to compute
the P(E_j, F_j) and stop the recursion when n_j ≤ 1. The running time of this algorithm is at
most

    T(m, n) ≤ Σ_j T(m_j, n/r) + Σ_j O(m_j^{2/3} n_j^{2/3} log n + (m_j + n_j) log n)
            = O(m^{1+δ} n^{2/3} + n^{1+δ}),

for any δ > 0, provided r = r(δ) is sufficiently large.
Improving the running time. We improve the running time of the algorithm by taking
advantage of the symmetry of the mapping to R⁵. Specifically, a point π_i lies in the interior
of a half-space h_j if and only if the point π_j lies in the interior of the half-space h_i. We
proceed as above, but switch the roles of points and half-spaces when m becomes less than
n. That is, if m < n then we map the edges in E to half-spaces and the edges in F to points.
By our above analysis, the running time is then less than O(m^{2/3} n^{1+δ} + m^{1+δ}).
The overall running time is thus less than

    T(m, n) ≤ Σ_j T(m_j, n/r) + c · ( m^{1+δ} n^{2/3}  if m ≥ n,
                                      m^{2/3} n^{1+δ}  if m < n )
            = O(m^{2/3} n^{2/3} (m^δ + n^δ)),

where c is a positive constant and δ is any real larger than 0. It follows that Σ_i p_i can
be computed in time O(n^{4/3+δ}), for any constant δ > 0. Similarly, Σ_i n_i and therefore
the directional writhing number, DWr(z), can be computed within the same time bound,
thereby proving Theorem B.

We remark that the technique described in this section can also be used to compute the
linking number between two polygonal knots with m and n edges in time O((m + n)^{4/3+δ}).
2.5 Experiments
In this section, we sketch a sweep-line algorithm that computes the writhing number of a
polygonal knot using Theorem A. We implemented the algorithm in C++ using the LEDA
software library and compared it with two versions of the algorithm based on the double
integral in (2.3). We did not implement any version of Le Bret’s algorithm mentioned in
Section 2.2 since it is based on a formula similar to Theorem A and can be expected to
perform about the same as our sweep-line algorithm, and since it only works for closed
knots.
2.5.1 Algorithms

Sweep-line algorithm. Theorem A expresses the writhing number of a knot K as the
sum of three terms. Accordingly, we compute the writhing number in three steps.

Step 1. Compute the directional writhing number for an arbitrary but fixed, non-critical direction z: DWr(z).
Step 2. Compute the winding number of z relative to the Gauss maps Γ and -Γ: w(z).
Step 3. Compute the average winding number by summing the signed areas of the spherical triangles: w(z) + (1/2π) Σ_i Φ_i.

Return Wr(K) = DWr(z) - w(z) + (average winding number).

Instead of using the algorithm described in Section 2.4, we implemented Step 1 using a
sweep-line algorithm [71], which reports the k crossing pairs formed by the n edges in
time O((n + k) log n). Steps 2 and 3 are both computed in a single traversal of the spherical
polygons Γ and -Γ, keeping track of the accumulated angle and the signed area as we go.
The running time of the traversal is only O(n).
Double-sum algorithm. We compare the implementation of the sweep-line algorithm
with two implementations of (2.3). Write K'(t_i) for the unnormalized tangent
vector at K(t_i). Following [33, 128], we discretize (2.3) to

    W₁ = (1/4π) Σ_i Σ_{j ≠ i} ((K'(t_i) × K'(t_j)) · (K(t_i) - K(t_j))) / |K(t_i) - K(t_j)|³.    (2.7)

We note that W₁ is not the writhing number of the polygonal knot, but it converges to the
writhing number of a smooth knot if the polygonal approximation is progressively refined
to approach that knot [43].
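As a concrete illustration, the following Python sketch evaluates one natural reading of (2.7) for a closed polygon, using edge vectors in place of the tangents K'(t_i) and edge midpoints in place of the points K(t_i); this particular discretization choice is an assumption, not taken from the thesis.

```python
import numpy as np

def discretized_writhe(vertices):
    # Approximate writhe W1 in the spirit of (2.7) for a closed polygon
    # (assumed discretization: edge vectors for tangents, midpoints for points)
    v = np.asarray(vertices, float)
    n = len(v)
    e = np.roll(v, -1, axis=0) - v           # edge vectors
    c = (np.roll(v, -1, axis=0) + v) / 2.0   # edge midpoints
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = c[i] - c[j]
            total += np.cross(e[i], e[j]) @ r / np.linalg.norm(r) ** 3
    return total / (4 * np.pi)
```

Two cheap sanity checks follow directly from the Gauss integral: every term vanishes for a planar polygon, and the value is invariant under reversing the orientation of the curve.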
Alternatively, we may discretize the double integral in such a way that the result is
the writhing number of the approximating polygonal knot. Given two edges e_i and e_j, we
measure the area of the two antipodal quadrangles in S² along whose directions we see the
edges cross. The area of one of the quadrangles is the sum of its four angles minus one full angle,
α₁ + α₂ + α₃ + α₄ - 2π. The absolute value of the signed area Ω_{ij} is the same, and its sign
depends on whether we see a positive or a negative crossing. We thus have

    W₂ = (1/4π) Σ_i Σ_{j ≠ i} Ω_{ij}.    (2.8)

Straightforward vector geometry and trigonometry can be used to derive analytical formulas
for the Ω_{ij} [28, 121].
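One such analytical formula, in the style of Klenin and Langowski, is sketched below in Python; the triple-product sign rule is an assumption chosen to match the quadrangle construction above, and the function names are hypothetical.

```python
import numpy as np

def _segment_pair_area(p1, p2, p3, p4):
    # Signed area Omega of the quadrangle of directions along which the
    # segment p1p2 is seen crossing the segment p3p4
    r12, r13, r14 = p2 - p1, p3 - p1, p4 - p1
    r23, r24, r34 = p3 - p2, p4 - p2, p4 - p3
    ns = [np.cross(r13, r14), np.cross(r14, r24),
          np.cross(r24, r23), np.cross(r23, r13)]
    norms = [np.linalg.norm(v) for v in ns]
    if min(norms) < 1e-12:               # degenerate configuration
        return 0.0
    n1, n2, n3, n4 = [v / l for v, l in zip(ns, norms)]
    asin = lambda x: np.arcsin(np.clip(x, -1.0, 1.0))
    area = asin(n1 @ n2) + asin(n2 @ n3) + asin(n3 @ n4) + asin(n4 @ n1)
    # assumed sign rule: orientation of the two segments and their connector
    return area * np.sign(np.cross(r34, r12) @ r13)

def polygon_writhe(vertices):
    # Exact W2 of a closed polygon: the (1/4pi) sum over ordered pairs in (2.8)
    # equals (1/2pi) times the sum over unordered pairs.
    v = [np.asarray(p, float) for p in vertices]
    n = len(v)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue                 # edges sharing a vertex contribute 0
            total += _segment_pair_area(v[i], v[(i + 1) % n],
                                        v[j], v[(j + 1) % n])
    return total / (2 * np.pi)
```

As with the approximate double sum, a planar polygon must have writhe zero, and the value must not change when the orientation of the curve is reversed.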
2.5.2 Comparison
We compare the three implementations using a sequence of polygonal approximations of
an artificially created smooth knot. It has the form of the infinity symbol, ∞, and is fairly
flat in R³, with only a small gap in the middle. Because the knots are fairly flat, most of
their parallel projections have one crossing and the writhing number is just a little smaller
than 1. Figure 2.9 shows that the algorithms that compute the exact writhing numbers
for polygonal approximations converge faster to the writhing number of the smooth knot
than the algorithm implementing (2.7). Figure 2.10 shows how much faster the sweep-line
algorithm is than both implementations of the double-sum algorithm. Let n be the number
of edges. The graphs suggest that the running time of the sweep-line algorithm is O(n) and
the running times of the two implementations of the double-sum algorithm are Θ(n²). We
observe the linear bound whenever we approximate a smooth knot by a polygon, since for
generic projections the number of crossings as well as the number of edges simultaneously
intersected by the sweep-line are independent of the total number of edges.

Figure 2.9: Comparing convergence rates between W₁ (upper curve) and W₂ (lower curve). For each tested approximation of the ∞-knot, we draw the number of vertices along the horizontal axis and the writhing number along the vertical axis.

Figure 2.10: Comparing the running times of the sweep-line algorithm (lower curve) and the two implementations of the double-sum algorithm: approximate (middle curve) and exact (upper curve). The x-axis and the y-axis represent the number of vertices in the curve and the running time of the algorithm, respectively.
Protein backbones. We present some preliminary experimental results obtained with
the three implementations. All experiments are carried out on a SUN workstation with
a 333 MHz UltraSPARC-IIi CPU and 256 MB of memory. Short of conformation data for
long DNA strands, we decided to run our algorithms on a modest collection of open knots
representing protein backbones, downloaded from the Protein Data Bank [2]. We modified
the algorithms to account for the missing edge in the data, as explained in Section 2.3.
Figure 2.11 displays the four backbones chosen for our experimental study. Table 2.1
presents some of our findings.
Figure 2.11: The open knots modeling the backbone of the protein conformations stored in thePDB files 1AUS.pdb (upper left), 1CDK.pdb (upper right), 1CJA.pdb (lower left), and 1EQZ.pdb(lower right).
Thick knots. Even though the writhing number of a polygonal knot can be as large as
quadratic in the number of edges, all four protein backbones in Figure 2.11 have writhing
numbers that are significantly smaller than the numbers of edges. If a knot is made out of
rope with non-zero thickness, then the quadratic bound can be achieved only if the ratio
of length over cross-section radius is sufficiently high. Specifically, the writhing number
of a knot of length L with an embedded tubular neighborhood of radius r is less than
a constant times (L/r)^{4/3} [44]. Such "thick" knots can be used to capture the fact that the edges of a
protein backbone are about as long as they are thick. A backbone with n edges thus has
writhing number at most some constant times n^{4/3}. Examples which show that the upper
bound is asymptotically tight can be found in [38, 45, 72].

    Data    n     k    t_swp   t_apx   t_ext    W₁      Wr
    1AUS   439   122   0.09    3.93    9.28    22.70   17.87
    1CDK   343   111   0.06    2.39    5.62     7.96    6.01
    1CJA   327   150   0.06    2.19    5.10    12.14   10.43
    1EQZ   125    18   0.02    0.31    0.73     4.78    3.37

Table 2.1: Four protein backbones modeled by open polygonal knots. The size of the problem is measured by the number of edges, n, and the number of crossings in the chosen projection, k. The time the sweep-line (t_swp), the approximate double-sum (t_apx), and the exact double-sum (t_ext) algorithms take is measured in seconds. W₁ is an approximation of the writhing number for polygonal data.
2.6 Notes and Discussion
In this chapter, we have developed an efficient algorithm to compute the writhing number
of a space curve in R³. A fast method is important as the writhing number of DNA strands
is computed at each step in some molecular simulations. Other than this computational
aspect, it would be interesting to further investigate the concept and see whether there is a
correlation between the writhing numbers and the common classification of protein folds.
As mentioned in Section 2.1, there has been some initial work in this direction [82, 145].
It seems that although the writhing number of a protein backbone describes the spatial
arrangements of its secondary structure elements, it alone is not discriminative enough in
classifying protein structures. One major reason is that the writhing number is mainly
effective in describing the global geometry of a given space curve. To solve this problem,
it might be necessary to consider backbones at a range of scales and compute
the writhing number as a function of scale. Another possible approach is to combine the
writhing number with other topological or geometric measures that describe different as-
pects, especially the local geometry, of protein structures.
Chapter 3
Backbone Simplification
3.1 Introduction
Protein structures are examined at different levels of detail in various applications. Simpler
structures are exploited either when the time complexity of working with the full structure is
too high, or when excessive details might obscure crucial features or principles that one would like to
observe. It is therefore desirable to build a level-of-detail (LOD) representation for protein
structures. One way to achieve such a representation for protein backbones is via curve
simplification.
Given a polygonal curve, the curve-simplification problem asks for another
polygonal curve that approximates the original curve under a predefined error criterion
and whose size is as small as possible. Beyond its potential use in simplifying
protein structures, curve simplification is widely applied in numerous areas,
such as geographic information systems (GIS), computer vision, computer graphics, and
data compression. Simplification helps to remove unnecessary clutter due to excessive
detail, to save the memory needed to store a curve, and to expedite the processing
of a curve. For example, one of the main problems in computational cartography is to
visualize geo-spatial information as a simple and easily readable map. To this end, curve
simplification is used to represent rivers, roads, coastlines, and other linear features at
an appropriate level of detail when a map of a large area is produced.
In this chapter, we study the curve-simplification problem under the so-called Fréchet
error measure, and propose the first near-linear time algorithm to simplify curves in R^d with
guaranteed quality. Below we first introduce the curve-simplification problem formally.
Problem definition. Let P denote a polygonal curve in R^d with ⟨p₁, p₂, ..., p_n⟩ as its sequence
of vertices. A polygonal curve P' = ⟨p_{i₁}, p_{i₂}, ..., p_{i_k}⟩ simplifies P if
1 = i₁ < i₂ < ... < i_k = n. Given an error measure E and a pair of indices 1 ≤ i ≤ j ≤ n,
let E(p_i p_j, P) denote the error of the segment p_i p_j with respect to P under the error measure E. Intuitively,
E(p_i p_j, P) measures how well p_i p_j approximates the portion of P between p_i and
p_j. The error of a simplification P' = ⟨p_{i₁}, ..., p_{i_k}⟩ of P is defined as

    E(P', P) = max_{1 ≤ l < k} E(p_{i_l} p_{i_{l+1}}, P).

We call P' an ε-simplification of P, under the error measure E, if E(P', P) ≤ ε. Let
κ_E(ε, P) denote the minimum number of vertices in an ε-simplification of P under the
error measure E. Given a polygonal curve P, an error measure E, and a parameter ε, the
curve-simplification problem asks for computing an ε-simplification of size κ_E(ε, P).

We now define the error measure we study in this chapter. Let d: R^d × R^d → R be
a distance function, e.g., the Euclidean distance, or the L₁ or L_∞ norm. Given two curves
P: [0, 1] → R^d and Q: [0, 1] → R^d, the Fréchet distance under the metric d, δ_F(P, Q), is
defined as

    δ_F(P, Q) = inf_{α, β} max_{t ∈ [0, 1]} d(P(α(t)), Q(β(t))),    (3.1)

where α and β range over continuous and monotonically non-decreasing functions with
α(0) = β(0) = 0 and α(1) = β(1) = 1. If α* and β* are the maps that realize the
Fréchet distance, then we refer to the map φ = (Q ∘ β*) ∘ (P ∘ α*)^{-1} as the Fréchet map from P to Q.
For a pair of indices 1 ≤ i ≤ j ≤ n, the Fréchet error of a segment p_i p_j is defined to be

    E_F(p_i p_j, P) = δ_F(p_i p_j, π(p_i, p_j)),

where π(p_i, p_j) denotes the portion of P from p_i to p_j.
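The infimum in (3.1) ranges over reparameterizations and is awkward to evaluate directly. A standard computational stand-in for polygonal curves is the discrete Fréchet distance of Eiter and Mannila, which couples the vertex sequences instead; the minimal Python sketch below is an illustration, not the algorithm of this chapter.

```python
from functools import lru_cache
import math

def discrete_frechet(P, Q):
    # Discrete Frechet distance (Eiter-Mannila) between two vertex sequences,
    # a standard discrete stand-in for the continuous distance in (3.1)
    def d(p, q):
        return math.dist(p, q)

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j): best achievable bottleneck over couplings of P[:i+1], Q[:j+1]
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)
```

For two parallel three-vertex chains at vertical distance 1 the value is exactly 1, and a detour vertex forces the bottleneck up to its distance from the nearer endpoints.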
Most previous work has focused on the so-called Hausdorff error measure. We
define it here as well, since we will compare simplification under the Fréchet and Hausdorff
error measures later in this chapter. If we define the distance between a point p and a line
segment s as d(p, s) = min_{q ∈ s} d(p, q), then the Hausdorff error under the metric d, also
referred to as the d-Hausdorff error measure, is defined as

    E_H(p_i p_j, P) = max_{p ∈ π(p_i, p_j)} d(p, p_i p_j).

An ε-simplification under the Hausdorff error measure and κ_H(ε, P) are defined similarly to
the case of the Fréchet error measure.
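For a polygonal subcurve the maximum in E_H is attained at a vertex, because the distance to a fixed segment is convex along every edge of π(p_i, p_j); this makes the Hausdorff error of a single segment easy to evaluate exactly. A small Python sketch, assuming planar points and the Euclidean metric:

```python
import math

def point_segment_dist(p, a, b):
    # Euclidean distance from the point p to the segment ab in the plane
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    if L2 == 0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / L2))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def hausdorff_error(P, i, j):
    # E_H(p_i p_j, P): since the distance to a fixed segment is convex, its
    # maximum over each edge of pi(p_i, p_j) is attained at a vertex, so
    # checking the vertices p_i, ..., p_j is exact.
    return max(point_segment_dist(P[l], P[i], P[j]) for l in range(i, j + 1))
```

For the three-vertex tent P = ⟨(0,0), (1,1), (2,0)⟩ the error of the shortcut p₁p₃ is the height of the middle vertex, i.e. 1.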
If we remove the constraint that the vertices of P' are a subset of the vertices of P, then
P' is called a weak ε-simplification of P. Let κ_W(ε, P) denote the minimum number of
vertices in a weak Fréchet ε-simplification of P.
3.2 Prior and New Work

Previous work. The problem of approximating a polygonal curve P has been studied
extensively during the last two decades; see [100, 167] for surveys. Imai and Iri [109]
formulated the curve-simplification problem as computing a shortest path between two
nodes in a directed acyclic graph G_ε: each vertex of P corresponds to a node in G_ε, and
there is an edge between two nodes p_i, p_j if E(p_i p_j, P) ≤ ε. A shortest path from p₁
to p_n in G_ε corresponds to an optimal ε-simplification of P under the error measure E.
In R², under the Hausdorff measure with the so-called uniform metric¹, their algorithm
takes O(n² log n) time. Chin and Chan [50], and Melkman and O'Rourke [134] improve
the running time of their algorithm to quadratic. Agarwal and Varadarajan [3] improve the
running time to O(n^{4/3+δ}) for the L₁ and uniform-Hausdorff error measures, for an arbitrarily
small constant δ > 0, by implicitly representing the graph G_ε. In dimensions higher than
two, Barequet et al. compute the optimal ε-simplification under the L₁- or L_∞-Hausdorff error
measure in quadratic time [29]. For the L₂-Hausdorff error measure, an optimal simplification
can be computed in near-quadratic time for d = 3 and in slightly superquadratic time in R^d
for d > 3.

¹The uniform metric in R² is defined as follows: given two points p = (p₁, p₂) and q = (q₁, q₂), d(p, q) = |p₂ - q₂| if p₁ = q₁, and d(p, q) = ∞ otherwise.
Curve simplification using the Fréchet error measure was first proposed by Godau [92],
who showed that κ_F(ε, P) ≤ κ_W(ε/7, P). Alt and Godau [16] also proposed an O(mn)-time
algorithm to determine whether δ_F(P, Q) ≤ ε for given polygonal curves P and Q of
size m and n, respectively, and for a given error parameter ε > 0. Following the approach
of Imai and Iri [109], an ε-simplification of P, under the Fréchet error measure, of size
κ_F(ε, P) can be computed in O(n³) time.

The problem of developing an optimal near-linear ε-simplification algorithm remains
elusive. Among the several heuristics that have been proposed over the years, the most
widely used is the Douglas-Peucker method [75] (together with its variants). Originally
proposed for simplifying curves under the Hausdorff error measure, its worst-case running
time is O(n²) in R^d. For R² the running time is improved by Snoeyink et al. [101]
to O(n log n). However, the Douglas-Peucker heuristic does not offer any guarantee on
the size of the simplified curve: it can return an ε-simplification of size Θ(n) even if
κ(ε, P) = O(1).

Much work has been done on computing a weak ε-simplification of a polygonal curve
P. Imai and Iri [108] give an optimal O(n)-time algorithm for finding an optimal weak
ε-simplification (under the Hausdorff error measure) of an x-monotone curve in R². As for
weak ε-simplification of planar curves under the Fréchet distance, Guibas et al. [96] proposed
an O(n log n)-time factor-2 approximation algorithm and an O(n²)-time exact algorithm.
They also proposed linear-time algorithms for approximating some other variants of weak
simplifications.
The problem of simplifying curves becomes much harder when additional constraints
such as topology preservation and non-intersection requirements are introduced. Given
a set of non-intersecting curves, the problem of simplifying the curves optimally so that
the simplified curves are also non-intersecting is NP-hard; in fact, it is hard to approximate
within a factor of n^{1/5-δ}, for any δ > 0, where n is the total number of vertices of
the curves [83]. Guibas et al. [96] show that the problem of computing an optimal non-
intersecting simplification of a simple polygon is NP-hard, and that computing the optimal
weak simplification of a set of non-intersecting curves is also NP-hard.
Our results. Let P be a polygonal curve in R^d, and let ε > 0 be a parameter. In Section
3.3, we present simple, near-linear algorithms for computing ε-simplifications of P
with size at most κ_F(ε/2, P) under the Fréchet error measure.

Theorem 3.2.1 Let P be a polygonal curve in R^d with n vertices, and let ε > 0 be a
parameter. We can compute in O(n log n) time a simplification P' of P of size at most
κ_F(ε/2, P) so that E_F(P', P) ≤ ε, assuming that the distance between points is measured
in any L_p-metric.

To our knowledge, this is the first simple, near-linear approximation algorithm for curve
simplification with guaranteed quality that extends to R^d for arbitrary curves. We
illustrate its simplicity and efficiency by comparing its performance with the Douglas-Peucker
and exact algorithms in Section 3.4. Our experimental results on various data sets show
that our algorithm is efficient and produces ε-simplifications of near-optimal size.

Also in Section 3.3, we compare curve simplification under the Hausdorff and Fréchet
error measures, and we show that κ_F(ε, P) ≤ κ_W(ε/4, P), thereby improving the result by
Godau [92].
3.3 Fréchet Simplification

Let P = ⟨p₁, ..., p_n⟩ be a polygonal curve in R^d, and let ε > 0 be a parameter. In this
section, we first prove a few properties of the Fréchet error measure. We then present an
approximation algorithm for simplification under the Fréchet error measure. At the end of
this section, we compare Fréchet simplification with some other versions of simplification.
Let δ_F(·, ·) be as defined in (3.1), and let d(·, ·) denote the Euclidean distance between
two points in R^d.

Lemma 3.3.1 Given two directed segments ab and cd in R^d,

    δ_F(ab, cd) = max{d(a, c), d(b, d)}.

PROOF. Let δ = max{d(a, c), d(b, d)}. First, δ_F(ab, cd) ≥ δ, since a (resp. b) has to
be matched to c (resp. d). Assume the natural parameterization A: [0, 1] → ab for the segment ab,
such that A(t) = a + t(b - a). Similarly, define C: [0, 1] → cd for
the segment cd, such that C(t) = c + t(d - c). For any two matched points A(t) and C(t), let

    f(t) = d(A(t), C(t)).

Since f(t) is a convex function, f(t) ≤ max{f(0), f(1)} = δ for any t ∈ [0, 1]. Therefore δ_F(ab, cd) ≤ δ.
Lemma 3.3.2 Given a polygonal curve P in R^d and two directed segments ab and cd,

    |δ_F(ab, P) - δ_F(cd, P)| ≤ δ_F(ab, cd).

PROOF. Assume that A: [0, 1] → ab is the natural parameterization of ab, and let P(t),
t ∈ [0, 1], be a parameterization of the polygonal curve P such that A(t) and P(t) realize
the Fréchet distance between P and ab. As in the proof of Lemma 3.3.1, let C: [0, 1] → cd
be the natural parameterization of cd, so that A(t) and C(t) are matched points. By the triangle
inequality, for any t ∈ [0, 1], d(C(t), P(t)) ≤ d(C(t), A(t)) + d(A(t), P(t)) ≤ δ_F(ab, cd) + δ_F(ab, P), yielding the
lemma.
Figure 3.1: The dashed curve is π(p_i, p_j); the vertices p_k and p_l are mapped to the points p̂_k
and p̂_l on the segment p_i p_j, respectively.
Lemma 3.3.3 Let P = ⟨p_1, p_2, ..., p_n⟩ be a polygonal curve in ℝ^d. For i ≤ k ≤ l ≤ j,
δ_F(p_k p_l, π(p_k, p_l)) ≤ 2 δ_F(p_i p_j, π(p_i, p_j)).
PROOF. Let δ = δ_F(p_i p_j, π(p_i, p_j)). Let α : π(p_i, p_j) → p_i p_j be the Fréchet map from π(p_i, p_j) to
p_i p_j (see Section 3.1 for the definition). For any vertex p_l of π(p_i, p_j), set p̂_l = α(p_l);
see Figure 3.1 for an illustration. By definition, δ_F(p̂_k p̂_l, π(p_k, p_l)) ≤ δ. In particular,
d(p_k, p̂_k), d(p_l, p̂_l) ≤ δ. By Lemma 3.3.1, δ_F(p_k p_l, p̂_k p̂_l) ≤ δ. It then follows from
Lemma 3.3.2 that
δ_F(p_k p_l, π(p_k, p_l)) ≤ δ_F(p̂_k p̂_l, π(p_k, p_l)) + δ_F(p_k p_l, p̂_k p̂_l) ≤ 2δ.

3.3.1 Algorithm
Our simplification algorithm is a greedy approach (Figure 3.2). Suppose we have already
added ⟨p_{i_1}, p_{i_2}, ..., p_{i_k}⟩ to P′. We then find an index j ≥ i_k such that
(i) δ_F(p_{i_k} p_j, π(p_{i_k}, p_j)) ≤ ε and (ii) δ_F(p_{i_k} p_{j+1}, π(p_{i_k}, p_{j+1})) > ε.
We set i_{k+1} = j and add p_{i_{k+1}} to P′. We repeat this process
until we encounter p_n. We then add p_n to P′.
ALGORITHM GreedyFrechetSimp ( P, ε )
Input: P = ⟨p_1, ..., p_n⟩; ε > 0
Output: an ε-simplification P′ ⊆ P of P
begin
  k := 1; i_1 := 1; P′ := ⟨p_1⟩;
  while ( i_k < n ) do
    r := 0;
    while ( i_k + 2^(r+1) ≤ n and δ_F(p_{i_k} p_{i_k + 2^(r+1)}, π(p_{i_k}, p_{i_k + 2^(r+1)})) ≤ ε ) do
      r := r + 1;
    end while
    low := 2^r; high := min( 2^(r+1), n − i_k );
    while ( low < high ) do
      mid := ⌈(low + high) / 2⌉;
      if ( δ_F(p_{i_k} p_{i_k + mid}, π(p_{i_k}, p_{i_k + mid})) ≤ ε ) low := mid;
      else high := mid − 1;
    end while
    i_{k+1} := i_k + low; P′ := P′ ∘ ⟨p_{i_{k+1}}⟩; k := k + 1;
  end while
end

Figure 3.2: Computing an ε-simplification under the Fréchet error measure.
Alt and Godau [16] have developed an algorithm that, given a pair (i, j) and ε, can
determine in O(j − i) time whether δ_F(p_i p_j, π(p_i, p_j)) ≤ ε. Therefore, a first approach would be
to add vertices greedily one by one, starting with the first vertex p_1, and testing each edge
p_{i_k} p_j, for j > i_k, by invoking the Alt-Godau algorithm, until we find the index j. However,
the overall algorithm could take Ω(n²) time. To limit the number of times that the Alt-Godau
algorithm is invoked when computing the index i_{k+1}, we proceed as follows.
First, by an exponential search, we determine an integer r so that
δ_F(p_{i_k} p_{i_k + 2^r}, π(p_{i_k}, p_{i_k + 2^r})) ≤ ε and
δ_F(p_{i_k} p_{i_k + 2^(r+1)}, π(p_{i_k}, p_{i_k + 2^(r+1)})) > ε. Next, by performing a binary search in the
interval [2^r, 2^(r+1)], we determine an integer l ∈ [2^r, 2^(r+1)] such that
δ_F(p_{i_k} p_{i_k + l}, π(p_{i_k}, p_{i_k + l})) ≤ ε and
δ_F(p_{i_k} p_{i_k + l + 1}, π(p_{i_k}, p_{i_k + l + 1})) > ε.
Note that in the worst case, the asymptotic costs of the exponential and binary searches
are the same. Set i_{k+1} = i_k + l. See Figure 3.2 for the pseudo-code of this algorithm. Since
computing the value of i_{k+1} requires invoking the Alt-Godau algorithm O(log(i_{k+1} − i_k)) times,
each with a pair (i, j) such that j − i ≤ 2(i_{k+1} − i_k), the total time spent in computing the value of i_{k+1}
is O((i_{k+1} − i_k) log(i_{k+1} − i_k)). Hence, the overall running time of the algorithm is O(n log n).
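The exponential-plus-binary search above can be sketched in Python. This is a toy
illustration (the function names are ours, not from the thesis), and it swaps the Alt-Godau
decision procedure for a simpler stand-in: the deviation of the subchain from the segment
under a fixed arc-length matching, which upper-bounds the Fréchet error. The search
structure, however, is the one described above.

```python
import numpy as np

def seg_error(P, i, j):
    """Surrogate for the Alt-Godau test: deviation of the subchain P[i..j]
    from segment P[i]P[j], matching each vertex to the point of the segment
    at the same fraction of arc length (upper-bounds the Frechet error)."""
    if j - i < 2:
        return 0.0
    sub = P[i:j + 1]
    steps = np.linalg.norm(np.diff(sub, axis=0), axis=1)
    arc = np.concatenate(([0.0], np.cumsum(steps)))
    t = (arc / arc[-1])[:, None] if arc[-1] > 0 else np.zeros((len(sub), 1))
    proj = (1 - t) * P[i] + t * P[j]
    return float(np.linalg.norm(sub - proj, axis=1).max())

def greedy_frechet_simp(P, eps):
    """Greedy simplification: from vertex i, exponential search doubles the
    jump while the segment error stays <= eps; binary search then pins down
    an admissible jump length."""
    n = len(P)
    out, i = [0], 0
    while i < n - 1:
        if seg_error(P, i, n - 1) <= eps:
            break
        step = 1
        while i + 2 * step < n and seg_error(P, i, i + 2 * step) <= eps:
            step *= 2
        lo, hi = step, min(2 * step, n - 1 - i)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if seg_error(P, i, i + mid) <= eps:
                lo = mid
            else:
                hi = mid - 1
        i += lo
        out.append(i)
    if out[-1] != n - 1:
        out.append(n - 1)
    return out

P = np.array([[k, 0.1 * (-1) ** k] for k in range(20)], dtype=float)
coarse = greedy_frechet_simp(P, 0.25)   # zigzag flattens to one segment
fine = greedy_frechet_simp(P, 0.05)     # no segment can span the noise
```

Every segment the sketch outputs satisfies the ε-test, mirroring invariant (i) of the greedy
scheme; only the oracle differs from the one used in the thesis.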
Theorem 3.3.4 Given a polygonal curve P = ⟨p_1, ..., p_n⟩ in ℝ^d and a parameter ε ≥ 0,
we can compute in O(n log n) time an ε-simplification P′ of P under the Fréchet error
measure so that |P′| ≤ κ_F(ε/2, P).
PROOF. Compute P′ by the greedy algorithm described above. By construction, P′ is an
ε-simplification of P, so it suffices to prove that |P′| ≤ κ_F(ε/2, P). Let P′ = ⟨p_{i_1}, p_{i_2}, ..., p_{i_m}⟩, and let
P″ = ⟨p_{j_1}, p_{j_2}, ..., p_{j_s}⟩ be an optimal (ε/2)-simplification of P of size s = κ_F(ε/2, P).
We claim that i_l ≥ j_l for all l. This would imply that |P′| ≤ κ_F(ε/2, P).
We prove the above claim by induction on l. For l = 1, the claim is obviously
true because i_1 = j_1 = 1. Suppose i_{l−1} ≥ j_{l−1}. If i_l ≥ j_l, we are done. So
assume that i_l < j_l. Since P″ is an (ε/2)-simplification, δ_F(p_{j_{l−1}} p_{j_l}, π(p_{j_{l−1}}, p_{j_l})) ≤ ε/2.
Lemma 3.3.3 implies that for all j_{l−1} ≤ u ≤ v ≤ j_l, δ_F(p_u p_v, π(p_u, p_v)) ≤ ε. But by
construction, δ_F(p_{i_{l−1}} p_{i_l + 1}, π(p_{i_{l−1}}, p_{i_l + 1})) > ε; since j_{l−1} ≤ i_{l−1} and
i_l + 1 ≤ j_l, this is a contradiction, and thus i_l ≥ j_l.
Remark. Our algorithm works within the same running time even if we measure the
distance between two points in any L_p-metric.
3.3.2 Comparisons
Figure 3.3: (a) Polygonal chain (a piece of protein backbone) composed of three alpha-helices, (b)
its Fréchet ε-simplification and (c) its Hausdorff ε-simplification.
Hausdorff vs. Fréchet. One natural question is to compare the quality of simplifications
produced under the Hausdorff and the Fréchet error measures. Given a curve P and indices
i ≤ j, it is not too hard to show that δ_H(p_i p_j, π(p_i, p_j)) ≤ δ_F(p_i p_j, π(p_i, p_j)) under any
L_p-metric, which implies that κ_H(ε, P) ≤ κ_F(ε, P). The converse, however, does not
hold, and there are polygonal curves P and values of ε for which κ_H(ε, P) = O(1) and
κ_F(ε, P) = Ω(n).
The Fréchet error measure takes the order along the curve into account, and hence is
more useful in cases where the order of the curve is important (such as for
curves derived from protein backbones). Figure 3.3 illustrates a substructure of a protein
backbone, where the ε-simplification under the Fréchet error measure preserves the overall
structure, while the ε-simplification under the Hausdorff error measure is unable to preserve
it.
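The difference is easy to reproduce numerically. The sketch below (our own code, using the
discrete Fréchet distance as a computable stand-in for the continuous one) builds a curve that
backtracks along a segment: the Hausdorff distance to dense samples of the segment is
essentially zero, while the Fréchet distance stays large because a monotone matching cannot
follow the backtrack.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def discrete_frechet(A, B):
    """Discrete Frechet distance by dynamic programming."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = D.shape
    C = np.full((n, m), np.inf)
    C[0, 0] = D[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(C[i - 1, j] if i > 0 else np.inf,
                       C[i, j - 1] if j > 0 else np.inf,
                       C[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            C[i, j] = max(prev, D[i, j])
    return C[-1, -1]

def sample_chain(V, step=0.1):
    """Dense samples along a polygonal chain with vertices V."""
    pts = []
    for a, b in zip(V[:-1], V[1:]):
        k = max(int(np.ceil(np.linalg.norm(b - a) / step)), 1)
        for t in np.linspace(0, 1, k, endpoint=False):
            pts.append((1 - t) * a + t * b)
    pts.append(V[-1])
    return np.array(pts)

# Curve that runs 0 -> 9 -> 1 -> 10 along the x-axis, vs. the segment 0 -> 10.
curve = sample_chain(np.array([[0.0, 0.0], [9.0, 0.0], [1.0, 0.0], [10.0, 0.0]]))
seg = sample_chain(np.array([[0.0, 0.0], [10.0, 0.0]]))
h = hausdorff(curve, seg)         # ~0: every point lies on the segment
f = discrete_frechet(curve, seg)  # ~4: the matching must park near x = 5
```

Here the Hausdorff distance is essentially zero while the Fréchet distance is about 4: the
monotone matching cannot follow the curve back from x = 9 to x = 1.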
We remark that the Douglas-Peucker algorithm is also based on the Hausdorff error measure.
Therefore the above discussion applies to it as well.
Weak-Fréchet vs. Fréchet. In Section 3.3.1 we described a fast approximation algorithm
for computing an ε-simplification of P under the Fréchet error measure, where we used the
Fréchet measure in a local manner: we restrict the subcurve π(p_i, p_j) to match to the line
segment p_i p_j. We can remove this restriction to make the measure more global by
considering weak ε-simplification. More precisely, given P and P′ = ⟨q_1, q_2, ..., q_k⟩, where q_i
does not necessarily lie on P, P′ is a weak ε-simplification of P under the Fréchet error measure if
δ_F(P, P′) ≤ ε. The following theorem shows that for the Fréchet error measure, the size of the
optimal ε-simplification (κ_F(ε, P)) can be bounded in terms of the size of the optimal weak
ε-simplification (κ̃_F(ε, P)):
Figure 3.4: Relationship between κ_F(ε, P) and κ̃_F(ε, P): (a) the vertices w_i, w_{i+1} of the weak
simplification W and the corresponding vertices v_i, v_{i+1} of P; (b) the map ν, which sends ŵ_i to v_i
and w_i to ν(w_i); (c) the points x = ν(w_i) and y = ν′(w_{i+1}) on the line through v_i v_{i+1}.
Theorem 3.3.5 Given a polygonal curve P,
κ_F(ε, P) ≤ 2 κ̃_F(ε/4, P) ≤ 2 κ_F(ε/4, P).
PROOF. The second inequality is obvious, so we prove the first one. Set δ = ε/4, and let
W = ⟨w_1, w_2, ..., w_k⟩ be an optimal weak δ-simplification of P, and let α : P → W be a Fréchet map so
that d(x, α(x)) ≤ δ for all x ∈ P (see Section 3.1 for the definition). For 1 ≤ i ≤ k,
let e_i = p_{c_i} p_{c_i + 1} be the edge of P whose image under α contains w_i, and let v_i be the
endpoint of e_i whose image is closer to w_i. We set P′ = ⟨v_1, v_2, ..., v_k⟩; we remove a vertex from
this sequence if it is the same as its predecessor. Clearly, P′ ⊆ P, so P′ is a simplification of P.
Next we show that δ_F(v_i v_{i+1}, π(v_i, v_{i+1})) ≤ ε for all 1 ≤ i < k, which then implies the theorem.
Let ŵ_i = α(v_i) and ŵ_{i+1} = α(v_{i+1}). See Figure 3.4 (a) for an illustration.
Claim 3.3.6 δ_F(v_i v_{i+1}, ŵ_i ŵ_{i+1}) ≤ δ, and δ_F(ŵ_i ŵ_{i+1}, w_i w_{i+1}) ≤ 2δ.
PROOF. By construction, d(v_i, ŵ_i), d(v_{i+1}, ŵ_{i+1}) ≤ δ, so the first bound follows from
Lemma 3.3.1. On the other hand, since W is a weak δ-simplification of P and v_i (resp. v_{i+1}) is an
endpoint of the edge matched to w_i (resp. w_{i+1}), we have d(ŵ_i, w_i), d(ŵ_{i+1}, w_{i+1}) ≤ 2δ; the
second bound then follows from Lemmas 3.3.1 and 3.3.2.
Let β : π(v_i, v_{i+1}) → W(ŵ_i, ŵ_{i+1}) be a Fréchet map such that d(x, β(x)) ≤ δ for all
x ∈ π(v_i, v_{i+1}), where W(ŵ_i, ŵ_{i+1}) denotes the portion of W between ŵ_i and ŵ_{i+1}.
Let ℓ be the line containing v_i v_{i+1}. We define a map ν : ŵ_i w_i → ℓ that maps a point x ∈ ŵ_i w_i
to the intersection point of ℓ with the line through x and parallel to the segment ŵ_i v_i. See
Figure 3.4 (b) for an illustration. Note that ν(ŵ_i) = v_i, and for any x ∈ ŵ_i w_i,
d(x, ν(x)) ≤ max( d(ŵ_i, v_i), d(w_i, ν(w_i)) ); this is the case depicted in Figure 3.4. Hence, we can
infer that

δ_F( ŵ_i w_i, v_i ν(w_i) ) ≤ max( d(ŵ_i, v_i), d(w_i, ν(w_i)) ).   (3.2)

Similarly, we define a map ν′ : w_{i+1} ŵ_{i+1} → ℓ that maps a point x ∈ w_{i+1} ŵ_{i+1} to the
intersection point of ℓ with the line through x and parallel to the segment ŵ_{i+1} v_{i+1}. As above,

δ_F( w_{i+1} ŵ_{i+1}, ν′(w_{i+1}) v_{i+1} ) ≤ max( d(w_{i+1}, ν′(w_{i+1})), d(ŵ_{i+1}, v_{i+1}) ).   (3.3)

Let x (resp. y) denote the point ν(w_i) (resp. ν′(w_{i+1})); see Figure 3.4 (c).
Claim 3.3.7 δ_F(v_i v_{i+1}, x y) ≤ δ.
PROOF. Since d(v_i, x), d(v_{i+1}, y) ≤ δ, the claim follows from Lemma 3.3.1.
Claim 3.3.8 δ_F( ⟨v_i, x, y, v_{i+1}⟩, W(ŵ_i, ŵ_{i+1}) ) ≤ 2δ.
PROOF. By the definition of the Fréchet distance,
δ_F( ⟨v_i, x, y, v_{i+1}⟩, W(ŵ_i, ŵ_{i+1}) ) ≤
max{ δ_F(v_i x, ŵ_i w_i), δ_F(x y, w_i w_{i+1}), δ_F(y v_{i+1}, w_{i+1} ŵ_{i+1}) }.
Observe that d(w_i, x) = d(w_i, ν(w_i)) and d(w_{i+1}, y) = d(w_{i+1}, ν′(w_{i+1})). It follows from the
definition of the map ν and Claim 3.3.6 that d(w_i, ν(w_i)), d(w_{i+1}, ν′(w_{i+1})) ≤ 2δ. Hence, by
(3.2) and (3.3), the first and the third terms are at most 2δ, and by Lemma 3.3.1 the second term is at
most 2δ as well. This proves the claim.
Claims 3.3.7 and 3.3.8 along with Lemma 3.3.2 imply that δ_F(v_i v_{i+1}, π(v_i, v_{i+1})) ≤ 4δ = ε, and
therefore κ_F(ε, P) ≤ |P′| ≤ 2 κ̃_F(ε/4, P). This completes the proof of the theorem.
3.4 Experiments
We have implemented our simplification algorithm, GreedyFrechetSimp, and the
� � � � -time optimal Frechet-simplification algorithm (referred to as Exact) by construct-
ing the shortest path in certain graphs, outlined in Section 3.2, that computes an � -simplification
of � of size � ��� ����� . In this section, we measure the performance of our algorithm in terms
of the output size and the running time.
Data sets. We test our algorithms on two different types of data sets, each of which is a
family of polygonal curves in ℝ³.
• Protein backbones. The first set of curves is derived from protein backbones by
adding small random “noise” vertices along the edges of the backbone. We have
chosen two backbones from the protein data bank [2]. Protein A is the first backbone,
with 327 vertices. Protein B is the second backbone with random vertices added, for a
total of 9,777 vertices: the new vertices are uniformly distributed along the curve
edges, and then perturbed slightly in a random direction.
• Stock-index curves. The second family of curves is generated from the daily NASDAQ
index values over the period January 1971 to January 2003 (the data is obtained
from [1]). We take a pair of index curves x(t) and y(t) and generate the curve
γ(t) = (x(t), y(t), t) in ℝ³. In particular, we take the telecommunication index and the
bio-technology index as the x- and y-coordinates and time as the z-coordinate to
construct the curve Tel-Bio in ℝ³. In the second case, we take the transportation index, the
telecommunication index, and time as the x-, y-, and z-coordinates, respectively, to construct
the curve Trans-Tel in ℝ³.
These two families of curves have different structures. The stock-index curves ex-
hibit an easily distinguishable global trend; however, locally there is a lot of noise. The
protein curves, though coiled and irregular-looking, exhibit local patterns that represent the
structural elements of the protein backbones (commonly referred to as the secondary struc-
tures). In each of these cases, simplification helps identify certain patterns (e.g., secondary
structure elements) and trends in the data.
Output size. We compare the quality (size) of the simplifications produced by our algorithm
(GreedyFrechetSimp) and the optimal algorithm (Exact) in Figure 3.5 for curves from
the above two families. The simplifications produced by our algorithm are
almost always close to the optimal size.
To provide a visual picture of the simplifications produced by various (commonly used)
algorithms for curves in ℝ³, Figure 3.8 shows the simplifications of Protein A computed
by GreedyFrechetSimp, the exact (i.e., optimal) Fréchet simplification algorithm, and
the Douglas-Peucker heuristic (using the Hausdorff error measure under the L₂ metric).
Running time. As the running time of the optimal algorithm is orders of magnitude
larger than that of our algorithm, we instead compare the efficiency of GreedyFrechetSimp with
the widely used Douglas-Peucker simplification algorithm under the Hausdorff measure –
Output size

            Protein A (327)           Protein B (9,777)
  ε     GreedyFrechetSimp  Exact  GreedyFrechetSimp  Exact
 0.05         327           327         6786          6431
 0.12         327           327         1537           651
 1.20         254           249          178           168
 1.60         220           214          140           132
 2.00         134           124          115            88
 5.00          37            36           41            39
 10.0          22            22           24            20
 20.0          10             8            8             6
 50.0           2             2            2             2

(a)

Output size

            Trans-Tel (7,057)         Tel-Bio (1,559)
  ε     GreedyFrechetSimp  Exact  GreedyFrechetSimp  Exact
 0.05        6882          6880         1558          1558
 0.50        4601          4469         1473          1471
 1.20        2811          2637         1292          1279
 3.00        1396          1228          974           942
 5.00         890           732          772           720
 10.0         414           329          490           402
 20.0         168           124          243           200
 50.0          47            35           94            73

(b)

Figure 3.5: The sizes of Fréchet simplifications on (a) protein data and (b) stock-index data.
Running time (ms.)

            Protein A (327)           Protein B (9,777)
  ε     GreedyFrechetSimp   DP    GreedyFrechetSimp   DP
 0.05          3             16          146          772
 0.50          3             16          171          524
 1.20          4             16          176          488
 1.60          5             12          202          394
 2.00          5             11          210          354
 5.00          5             11          209          356
 10.0          5             10          222          329
 20.0          5              8          233          263
 50.0          2              1           87           50

(a)

Running time (ms.)

            Trans-Tel (7,057)         Tel-Bio (1,559)
  ε     GreedyFrechetSimp   DP    GreedyFrechetSimp   DP
 0.05         82            599          16           113
 0.50        103            580          17           114
 1.20        113            559          19           113
 3.00        119            510          22           109
 5.00        121            472          24           109
 10.0        127            411          25            96
 20.0        146            360          27            85
 50.0        162            271          27            71

(b)

Figure 3.6: The running times of GreedyFrechetSimp and the Douglas-Peucker algorithm on (a)
protein data and (b) stock-index data.
[Two plots: running time (secs) versus error ε ∈ [0, 50] for the DP and FS (GreedyFrechetSimp)
algorithms on Protein B.]
Figure 3.7: Comparison of running time of GreedyFrechetSimp and DP algorithms for varying ε on
Protein B.
we can extend the Douglas-Peucker algorithm to simplify curves under the Fréchet error
measure; however, such an extension is inefficient and can take Ω(n²) time in the worst
case.
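For reference, the classic Douglas-Peucker recursion under the Hausdorff-style
vertex-to-segment error can be sketched in a few lines (our own illustrative code); it
recurses only when the farthest vertex violates ε, which is why its running time drops as ε
grows.

```python
import numpy as np

def point_seg_dist(p, a, b):
    """Distance from point p to segment ab."""
    ab = b - a
    denom = float(ab.dot(ab))
    t = 0.0 if denom == 0.0 else float(np.clip((p - a).dot(ab) / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def douglas_peucker(P, eps):
    """Keep the vertex farthest from the chord P[0]P[-1]; recurse on the two
    halves only if that distance exceeds eps."""
    if len(P) < 3:
        return list(P)
    d = [point_seg_dist(P[k], P[0], P[-1]) for k in range(1, len(P) - 1)]
    k = int(np.argmax(d)) + 1
    if d[k - 1] <= eps:
        return [P[0], P[-1]]
    left = douglas_peucker(P[:k + 1], eps)
    right = douglas_peucker(P[k:], eps)
    return left[:-1] + right

P = [np.array([k, 0.1 * (-1) ** k]) for k in range(10)]
flat = douglas_peucker(P, 0.5)    # noise below eps: single segment survives
full = douglas_peucker(P, 0.05)   # every zigzag vertex is retained
```

As soon as a chord fits, a whole subchain is discarded without further work, so larger ε
means fewer recursive calls, matching the timing trend discussed in the text.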
Figure 3.6 illustrates the running times of the two algorithms. Note that as ε increases,
resulting in smaller simplified curves, the running time of Douglas-Peucker decreases. This
phenomenon is further illustrated in Figure 3.7, which compares the running time of our
algorithm with that of Douglas-Peucker on Protein B (with artificial noise added), which has
9,777 vertices. The phenomenon is due to the fact that, at each step, the DP algorithm
determines whether a line segment p_i p_j simplifies π(p_i, p_j). The algorithm recursively solves
two subproblems only if δ(p_i p_j, π(p_i, p_j)) > ε. Thus, as ε increases, it needs to make fewer
recursive calls. Our algorithm, however, proceeds in a linear fashion from the first vertex
to the last vertex using exponential and binary search. Suppose the algorithm returns
P′ = ⟨p_{i_1}, p_{i_2}, ..., p_{i_m}⟩ for an input polygonal curve P = ⟨p_1, ..., p_n⟩. The exponential
search takes O(n) time in total, while the binary search takes O(Σ_j μ_j log μ_j) time, where
μ_j = i_{j+1} − i_j and m is the number of vertices of P′. Thus as ε increases, each μ_j increases,
and therefore the time for binary search increases, as Figure 3.7 illustrates. Note, however, that if ε
is so large that m = 2, i.e., the simplification is just one line segment connecting p_1 to p_n, the
algorithm does not perform any binary search and is much faster, as the case ε = 50 illustrates
in Figure 3.7.
3.5 Notes and Discussions
In this chapter, we have proposed and developed a simple near-linear time curve simplification
algorithm. Besides being the first near-linear simplification algorithm for curves
in ℝ^d, our algorithm tends to preserve long but relatively skinny features (Figure 3.3), as we
use the Fréchet error measure. This property is desirable when simplifying protein backbones,
as it helps to maintain a trace of secondary structure elements such as alpha-helices
and beta-strands in the simplified structures.
It would be interesting to see whether our algorithm can help to produce an automatic
method to identify protein secondary structure elements. It is also possible to generate
a level-of-detail representation of a protein backbone via simplification and compute its
writhing number (or other shape descriptors) at different scales, in order to characterize
protein structures. Such a level-of-detail representation can also be used when comparing
protein backbones, as current structural alignment methods are usually of high computational
complexity.
We end this chapter by mentioning a few open problems related to curve simplification.
(i) Does there exist a near-linear algorithm for computing an ε-simplification of size at
most c · κ_F(ε, P) for a polygonal curve P, where c is a constant?
(ii) Is it possible to compute the optimal ε-simplification under the Hausdorff error measure
in near-linear time, or under the Fréchet error measure in sub-cubic time?
(iii) Is there any provably efficient exact/approximation algorithm for curve simplification
in ℝ^d that returns a simple curve if the input curve is simple?
[Image grid: simplifications of Protein A computed by GreedyFrechetSimp, EXACT, and DP at
increasing values of ε.]
Figure 3.8: Simplifications of a protein (Protein A) backbone.
Chapter 4
Elevation Function
4.1 Introduction
The starting point of the work described in this chapter is the desire to identify features
that are useful in finding a fit between solid shapes in ℝ³. We are looking for cavities
and protrusions and for a way to measure their size. The problem is made difficult by the
interaction of these features, which typically exist at various scale levels. We therefore
take an indirect approach, defining a real-valued function on the surface that is sensitive to
the features of the shape. We call this the elevation function because it has similarities to
the elevation measured on the surface of the Earth, but the problem for general surfaces is
more involved and the analogy is not perfect.
Related work in protein docking. The primary motivation for designing the elevation function
to characterize protein surfaces is protein docking, which is the computational approach
to predicting protein interactions, a biophysical phenomenon at the very core of life.
The phenomenon is clearly important, and the interest in protein docking is correspondingly
widespread. The related work on attacking the docking problem will be surveyed in Chapter
6; here we only mention some survey articles on docking algorithms [81, 97, 114].
The idea of docking by matching cavities with protrusions goes back to Crick [69] and
Connolly [67]. Connolly also introduced the idea of using the critical points of a real-
valued function defined on the protein surface to identify cavities and protrusions. The
particular function he used is the fraction of a fixed-size sphere that is buried inside the
protein volume as we move the sphere center on the protein surface. In the limit, when
the size of the sphere goes to zero, this function has the same critical points as the mean
curvature function [48]. A similar but different function suggested for the same purpose
is the atomic density [136]. Here we take the buried fraction of the ball bounded by the
sphere but we also vary its radius from zero to about ten Angstrom. At every point of the
protein surface, the function value is the fraction of buried volume averaged over the balls
centered at that point.
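As a toy illustration of this construction (our own sketch — the three-ball “molecule” and the
chosen surface point are made up for the example, not data from this thesis), the buried
fraction of a ball B(x, r) inside a union of atom balls can be estimated by Monte Carlo
sampling and then averaged over a range of radii:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "molecule": a union of three balls (centers, radii).
centers = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.7, 1.2, 0.0]])
radii = np.array([1.0, 0.9, 0.8])

def buried_fraction(x, r, n=20000):
    """Monte Carlo estimate of vol(B(x, r) ∩ molecule) / vol(B(x, r))."""
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    u = rng.random(n) ** (1.0 / 3.0)       # radii ~ r * U^(1/3): uniform in the ball
    pts = x + (r * u)[:, None] * v
    dist = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    return float((dist <= radii).any(axis=1).mean())

x = np.array([-1.0, 0.0, 0.0])             # a point on the boundary of the first ball
fractions = [buried_fraction(x, r) for r in np.linspace(0.5, 3.0, 6)]
elev_like = float(np.mean(fractions))      # buried volume averaged over the radii
```

The outer average mirrors the idea of averaging the buried fraction over balls of growing
radius; real input would use atom centers and radii read from a PDB file.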
Our results. The main contribution of this chapter is the description and computation
of a new type of feature points that mark extreme cavities and protrusions on a surface
embedded in ℝ³. More specifically,
• we extend the concept of topological persistence [77] to form a pairing between all
critical points of a function on a 2-manifold embedded in ℝ³;
• we use the pairings obtained for a 2-parameter family of height functions to define
the elevation function on the 2-manifold;
• we classify the generic local maxima of the elevation function into four types;
• we develop and implement an algorithm that computes all local maxima of the elevation
function.
The elevation function differs from Connolly’s function and from the atomic density function in two
major ways: it is independent of scale, and it provides, beyond location, estimates for the direction
and size of shape features. Both additional pieces of information are useful in shape characterization
and matching. The four generic types of local maxima are illustrated in Figure 4.1.
In each but the first case, the maximum is obtained at an ambiguity in the pairing of critical
points. In all cases, the endpoints of the legs share the same normal line, and the legs have
the same length if measured along that line. The case analysis is delicate and aided by a
transformation of the original 2-manifold to its pedal surface, which maps tangent planes
to points and thus expresses points with common tangent planes as self-intersections of the
Figure 4.1: From left to right: a one-, two-, three-, and four-legged local maximum of the elevationfunction. In the examples shown, the outer normals at the endpoints of the legs are all parallel (thesame). Each of the four types also exists with anti-parallel outer normals in various combinations.
pedal surface. The algorithm we describe for enumerating all local maxima is inspired by
our analysis of the smooth case but works on piecewise linear data.
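The pedal construction can be written down directly: with respect to a chosen origin, a
surface point x with unit normal n is sent to the foot of the perpendicular from the origin
onto the tangent plane at x, i.e. to (x · n) n. A minimal sketch (our own code) on a sampled
sphere:

```python
import numpy as np

def pedal(points, normals):
    """Pedal transform w.r.t. the origin: each surface point x with unit
    normal n maps to the foot of the perpendicular from the origin to the
    tangent plane at x, i.e. (x . n) n."""
    h = np.einsum('ij,ij->i', points, normals)   # signed plane offsets <x, n>
    return h[:, None] * normals

# Sample an origin-centered sphere of radius 2: its tangent planes are all at
# distance 2 from the origin, so every pedal point again has norm 2.
theta = np.linspace(0.1, np.pi - 0.1, 8)
phi = np.linspace(0, 2 * np.pi, 16, endpoint=False)
T, Ph = np.meshgrid(theta, phi)
n = np.stack([np.sin(T) * np.cos(Ph),
              np.sin(T) * np.sin(Ph),
              np.cos(T)], axis=-1).reshape(-1, 3)
pts = 2.0 * n                                    # for this sphere, x = R n
ped = pedal(pts, n)
```

For an origin-centered sphere the pedal surface coincides with the sphere itself; for a general
surface, points with a common tangent plane map to the same pedal point, which is how shared
tangencies become self-intersections of the pedal surface.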
Outline. Section 4.2 defines the pairing of the critical points, based on which we then
introduce the height and elevation as functions on a 2-manifold. Section 4.3 describes a
dual view of these concepts based on the pedal surface of the 2-manifold. Section 4.4 uses
surgery to make elevation continuous and to define a stratified Morse function on the new
2-manifold. We then characterize the four types of generic local maxima of the continuous
elevation function. Section 4.5 sketches an algorithm for enumerating all local maxima.
Section 4.6 presents experimental results for protein data.
4.2 Defining Elevation
4.2.1 Pairing
The elevation function is based on a canonical pairing of the critical points, which we
describe in this section.
Traditional persistence. Let M be a connected and orientable 2-manifold and f : M → ℝ
a smooth function.¹ A point x ∈ M is critical if the derivative of f at x is identically zero,
and it is non-degenerate if the Hessian at the point is invertible. It is convenient to assume
that f is generic:

¹ We remark that the algorithms we describe below work for 2-manifolds with multiple components as well.
We assume there is only one component for simplicity.
I. all critical points are non-degenerate;
II. the critical points have different function values.
A function that satisfies Conditions I and II is usually referred to as a Morse function
[135]. It has three types of critical points: minima, saddles and maxima distinguished by
the number of negative eigenvalues of the Hessian. Imagine we sweep $\mathbb{M}$ in the direction of increasing function value, proceeding along the level sets, each a collection of closed curves. We write $\mathbb{M}_r = f^{-1}(-\infty, r]$ for the swept portion of the 2-manifold. This portion changes its topology whenever the level set passes through a critical point. A component of $\mathbb{M}_r$ starts at a minimum and ends when it merges with another, older component at a saddle. A hole in the 2-manifold starts at a saddle and ends when it is closed off at a maximum. After observing that each saddle either merges two components or starts a new hole, but not both, it is natural to pair up the critical point that starts a component or a hole with the critical point that ends it. This is the main idea of topological persistence introduced in [77]. It is clear that a small perturbation of the function that preserves the sequence of critical events does not affect the pairing, other than by perturbing each pair locally. The method pairs all critical points except for the first minimum, the last maximum, and the $2g$ saddles starting the $2g$ cycles that remain when the sweep is complete. Here $g$ is the genus of $\mathbb{M}$. These $2g + 2$ unpaired critical points are the reason we need an extension to the method, which we describe next.
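The component-merging part of this pairing can be sketched with a union-find sweep. The following illustration (not the thesis implementation; `persistence_pairs` and its input conventions are hypothetical) pairs the minimum that starts a component with the critical point that ends it, for a piecewise-linear function given on the vertices of a graph:

```python
# Illustration (not the thesis implementation): persistence pairing of the
# component-merging events for a piecewise-linear function on the vertices
# of a graph.  A component is born at a local minimum; when the sweep reaches
# a vertex joining two components, the younger (higher) minimum dies there.
def persistence_pairs(values, edges):
    n = len(values)
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    parent = {}   # union-find forest over the already-swept vertices
    birth = {}    # component representative -> index of its oldest minimum

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    pairs = []
    for v in sorted(range(n), key=lambda i: values[i]):   # sweep by value
        parent[v] = v
        birth[v] = v
        for w in adj[v]:
            if w not in parent:             # neighbor not swept yet
                continue
            rv, rw = find(v), find(w)
            if rv == rw:                    # v closes a cycle: ignored here
                continue
            if values[birth[rv]] > values[birth[rw]]:
                young, old = rv, rw
            else:
                young, old = rw, rv
            pairs.append((birth[young], v))  # younger minimum dies at v
            parent[young] = old
    return [(b, d) for b, d in pairs if b != d]   # drop zero-persistence pairs
```

For values [0, 3, 1, 4] on the path 0-1-2-3 this returns [(2, 1)]: the minimum at index 2 is paired with the merge vertex at index 1, while the global minimum stays unpaired, matching the description above.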
Extended persistence. It is natural to pair the remaining minimum with the remaining maximum. The remaining $2g$ saddles contain $g$ up-forking and $g$ down-forking saddles. We wish to pair up-forking saddles with down-forking ones, and this can be achieved in a way that reflects how they introduce cycles during the sweep. This pairing is best described using the Reeb graph, obtained by mapping each component of each level set to a point, as illustrated in Figure 4.2. As proved in [63], the Reeb graph has a basis of $g$
Figure 4.2: Left: a 2-manifold whose points are mapped to the distance above a horizontal plane. Middle: the Reeb graph, in which the critical points of the function appear as degree-1 and degree-3 nodes. The labels indicate the pairing. Right: the tree representing the Reeb graph, built by sweeping from slightly above the top downwards.
cycles such that each cycle is the sum (modulo 2) of a subset of basis cycles. Each cycle has a unique lowest and a unique highest point, referred to as its lo-point and hi-point. We say the lo- and hi-point span this cycle, but note that they may span more than one cycle. There is a one-to-one correspondence between lo- (hi-) points and up- (down-) forking saddles, thereby giving exactly $g$ lo-points and $g$ hi-points. We pair each lo-point $x$ with the lowest hi-point $y$ that spans a cycle with $x$. Note that $x$ is also the highest lo-point that spans a cycle with $y$. Indeed, if this were not the case, then we could add the cycle spanned by $x$ and $y$ to the cycle spanned by $y$ and a lo-point higher than $x$ to get a cycle spanned by $x$ and a hi-point lower than $y$, a contradiction. This implies that each lo-point and each hi-point belongs to exactly one pair, giving a total of $g$ pairs between up- and down-forking saddles, as desired.
The Reeb graph of a piecewise-linear function on a triangulation with $n$ edges can be constructed in time $O(n \log n)$ using the algorithm in [63]. We now describe an algorithm that computes both the traditional persistence pairing and the extended persistence pairing introduced above, given the Reeb graph $R$ of $f$. (The algorithm can in fact construct the Reeb graph and the pairing simultaneously in one sweep; we assume the Reeb graph is given for simplicity.) It simulates the sweep of $f$, maintaining a forest for $\mathbb{M}_r$ during the course. In particular, it takes the following steps upon reaching a critical point $v$ (i.e., a node in $R$), merging two arcs across a degree-2 node whenever one is created.
Case 1: $v$ is a minimum. We add a new tree, consisting of a single node, to the forest.

Case 2: $v$ is an up-forking saddle. We turn the corresponding leaf into an internal node, adding two new leaves as its children.

Case 3: $v$ is a down-forking saddle, connecting leaves $u$ and $w$. We glue the two downward paths from $u$ and $w$ towards their roots, ending the gluing at a node $z$. In one case, $z$ is the root of one tree (the higher one); $z$ is a minimum, which we now pair with $v$ (this corresponds to a traditional persistence pairing). In the other case, $z$ is the lowest common ancestor of $u$ and $w$; $z$ is an up-forking saddle, and we pair it with $v$ (this corresponds to an extended persistence pairing).

Case 4: $v$ is a maximum. We pair it with its parent $z$ and remove the joining edge together with the two nodes; $z$ can be either an up-forking saddle, producing a traditional persistence pairing, or a minimum, producing an extended persistence pairing.
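The four cases can be simulated with a naive forest of explicit parent pointers standing in for the linking and cutting trees used below. The sketch that follows is an illustration under that simplification (its input format, with nodes as (height, kind) tuples and arcs as index pairs, is hypothetical), and it emits the traditional and extended pairs together:

```python
# A naive quadratic-time simulation of the four cases above, with explicit
# parent pointers and linear path walks in place of linking and cutting trees.
# Hypothetical input conventions: nodes[i] = (height, kind) with kind in
# {'min', 'up', 'down', 'max'}; arcs[a] = (lower_node, upper_node).
def sweep_pairing(nodes, arcs):
    n = len(nodes)
    height = lambda i: nodes[i][0]
    lower_arcs = {v: [] for v in range(n)}
    upper_arcs = {v: [] for v in range(n)}
    for a, (lo, hi) in enumerate(arcs):
        upper_arcs[lo].append(a)
        lower_arcs[hi].append(a)

    parent = {}   # forest over the unpaired critical points swept so far
    tip = {}      # arc -> topmost unpaired node below that arc
    pairs = []

    def path_to_root(v):
        p = [v]
        while parent[v] is not None:
            v = parent[v]
            p.append(v)
        return p

    def splice_out(z, substitute):        # remove a newly paired node
        for v in list(parent):
            if parent[v] == z:
                parent[v] = substitute
        for a in tip:
            if tip[a] == z:
                tip[a] = substitute
        del parent[z]

    for v in sorted(range(n), key=height):
        kind = nodes[v][1]
        if kind == 'min':                             # Case 1
            parent[v] = None
            tip[upper_arcs[v][0]] = v
        elif kind == 'up':                            # Case 2
            parent[v] = tip[lower_arcs[v][0]]
            tip[upper_arcs[v][0]] = tip[upper_arcs[v][1]] = v
        elif kind == 'down':                          # Case 3
            u, w = (tip[a] for a in lower_arcs[v])
            pu, pw = path_to_root(u), path_to_root(w)
            if pu[-1] != pw[-1]:
                # different trees: z is the higher root, a minimum
                z = max(pu[-1], pw[-1], key=height)
            else:
                # same tree: z is the lowest common ancestor, an up-fork
                z = [a for a, b in zip(pu[::-1], pw[::-1]) if a == b][-1]
            pairs.append((z, v))
            chain = sorted((set(pu) | set(pw)) - {z}, key=height)
            below = [c for c in chain if height(c) < height(z)]
            splice_out(z, below[-1] if below else None)
            if chain:                                 # glue into one sorted path
                parent[chain[0]] = None
                for lo2, hi2 in zip(chain, chain[1:]):
                    parent[hi2] = lo2
                tip[upper_arcs[v][0]] = chain[-1]
        else:                                         # Case 4: maximum
            z = tip[lower_arcs[v][0]]
            pairs.append((z, v))
            splice_out(z, parent[z])
    return pairs
```

For an upright torus (minimum, up-fork, down-fork, maximum at heights 0 through 3) this yields the pairs (up-fork, down-fork) and (minimum, maximum), both extended, as expected.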
In order to perform these operations efficiently, we use the linking and cutting tree data structure proposed by Sleator and Tarjan [152]. It decomposes the forest into a family of vertex-disjoint paths, and each path is represented using a biased binary search tree. By maintaining a linking and cutting tree, Cases 1, 2 and 4 can be handled in $O(n \log n)$ overall time, so we focus only on Case 3. Given an instance of Case 3, assume that the common ancestor $z$ of $u$ and $w$ exists (the case where it does not exist can be processed similarly). We can find $z$ in $O(\log n)$ time using the operations supported by the linking and cutting tree data structure. The only extra operation we need is to glue the path from $u$ to $z$ with that from $w$ to $z$. Let $k$ and $l$ be the lengths of these two paths, with $k \le l$. We can perform the gluing operation by inserting each node from the shorter path into the longer one, which takes $O(k \log n)$ time as each path is represented using a weighted binary search tree. (In fact, for our algorithm, a balanced binary search tree suffices to achieve the bound below.) Assume that there is a sequence of $m$ gluing operations during the entire sweep, and that the $i$'th operation glues a path of length $k_i$ with one of length $l_i$, where $k_i \le l_i$. The overall time complexity is $\sum_{i=1}^{m} O(k_i \log n)$, which we argue next is $O(n \log^2 n)$.
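The key inequality behind the analysis that follows, $\sum_i k_i \le \log_2 n!$, holds for any sequence of gluings that inserts the shorter path into the longer one. A small simulation (a sketch, not thesis code; `total_shorter_lengths` is a hypothetical helper) can check it:

```python
# Sketch (not thesis code): check the inequality sum_i k_i <= log2(n!) for a
# sequence of gluings that always inserts the shorter path into the longer.
import math
import random

def total_shorter_lengths(n, seed=0):
    rng = random.Random(seed)
    lengths = [1] * n            # start from n single-node paths
    total = 0
    while len(lengths) > 1:
        a = lengths.pop(rng.randrange(len(lengths)))
        b = lengths.pop(rng.randrange(len(lengths)))
        total += min(a, b)       # k_i: cost of inserting the shorter path
        lengths.append(a + b)    # the glued path
    return total

assert total_shorter_lengths(200) <= math.log2(math.factorial(200))
```

The assertion succeeds for every merge order, since the product of the binomial coefficients $\binom{k_i + l_i}{k_i}$ over all gluings is at most $n!$.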
If we regard the parent of each node as its successor, then the forest induces a partial order on the $n$ nodes of $R$. Let $N_i$ be the number of total orders that are extensions of the partial order induced by the forest after the $i$'th gluing operation. Since the nodes are incomparable initially and form a single path after all operations, $N_0 \le n!$ and $N_m = 1$. The $i$'th gluing operation merges two paths of length $k_i$ and $l_i$, so $N_{i-1} \ge \binom{k_i + l_i}{k_i} N_i$, and $\binom{k_i + l_i}{k_i} \ge 2^{k_i}$ because $k_i \le l_i$. Therefore,

  $\log_2 N_i \le \log_2 N_{i-1} - k_i$.

Hence,

  $\sum_{i=1}^{m} k_i \le \sum_{i=1}^{m} (\log_2 N_{i-1} - \log_2 N_i) = \log_2 N_0 - \log_2 N_m \le \log_2 n! = O(n \log n)$,

implying that the overall time for computing the persistence pairing is $O(n \log^2 n)$.

Symmetry. The negative function, $-f$, defined by $(-f)(x) = -f(x)$, has the same critical points as $f$. We claim that it also generates the same pairing.
SYMMETRY LEMMA. Critical points $x$ and $y$ are paired for $f$ iff they are paired for $-f$.
PROOF. The claim is true for the first minimum, $x$, and the last maximum, $y$. Every other pair of $f$ contains at least one saddle. We assume without loss of generality that $x$ is a saddle and that $f(x) < f(y)$. Consider again the sweep of the 2-manifold in the direction of increasing values of $f$. When we pass $r = f(x)$ we split a cycle in the level set into two. The two cycles belong to the boundary of $\mathbb{M}^r$, the set of points with function value $r$ or higher. If the two cycles belong to the same component of $\mathbb{M}^r$, such as for the point labeled 2 in Figure 4.2, then $x$ is a lo-point and $y$ is the lowest hi-point that spans a cycle with $x$. The claim follows because $x$ is also the highest lo-point that spans a cycle with $y$. If, on the other hand, the two cycles belong to two different components of $\mathbb{M}^r$, such as for the other type of saddle in Figure 4.2, then $y$ is the lower of the two maxima that complete the two components. In the backward sweep (the forward sweep for $-f$), $y$ starts a component that merges into the other, older component at $x$. Again $x$ and $y$ are also paired for $-f$, which implies the claimed symmetry.
4.2.2 Height and Elevation
In this section, we define the elevation as a real-valued function on a 2-manifold in $\mathbb{R}^3$.

Measuring height and elevation on Earth. Even on Earth, defining the elevation of a point $x$ on the surface is a non-trivial task. Traditionally, it is defined relative to the mean sea level (MSL) in the direction of the measured point. In other words, the MSL elevation of a point $x$ is the difference between the distance of $x$ from the center of mass and the distance of the MSL from the center of mass in the direction of $x$. The difficulty of measuring height
in the middle of a continent was overcome by introducing the geoid, which is a level surface
of the Earth’s gravitational potential and roughly approximates the MSL while extending
it across land. The orthometric height above (or below) the geoid is thus more general
and about the same as the MSL elevation. It is perhaps surprising that the geoid differs
significantly from its best ellipsoidal approximation due to non-uniform density of the
Earth’s crust [87]. Standard global positioning systems (GPS) indeed return the ellipsoidal
height, which is elevation relative to a standard ellipsoidal representation of the Earth’s
surface. They also include knowledge of the geoid height relative to the ellipsoid and
compute the orthometric height of $x$ as its ellipsoidal height minus the ellipsoidal height of the geoid in the direction of $x$.
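As a small numerical illustration (the numbers are hypothetical, not from the thesis), this last step amounts to a subtraction:

```python
# Hypothetical numbers, for illustration only: the orthometric height is the
# GPS ellipsoidal height minus the geoid's ellipsoidal height at that point.
def orthometric_height(h_ellipsoidal_m, geoid_height_m):
    return h_ellipsoidal_m - geoid_height_m

# e.g. GPS reports h = 120.3 m, with the geoid 48.1 m above the ellipsoid
assert abs(orthometric_height(120.3, 48.1) - 72.2) < 1e-6
```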
A simplifying factor in the discussion of height and elevation on Earth is the existence
of a canonical core point, the center of mass. For general surfaces, distance measurements
from a fixed center make much less sense. We are interested in this general case, which
includes surfaces with non-zero genus for which there is no simple notion of core. As
on Earth, we define the elevation of a point $x$ as the difference between two distances,
except we no longer use a reference surface, such as the mean sea level or the geoid, but
instead measure relative to a canonically associated other point on the surface. To explain
how this works, we give different meanings to the 'height' of a point $x$, which we define
for every direction, and the ‘elevation’ of the point, which is the difference between two
heights. While height depends on an arbitrarily chosen origin, we will see that elevation
is independent of that choice. Indeed, the technical concept of elevation, as introduced
shortly, will be similar in spirit to the idea of orthometric height, with the exception that it
substitutes the canonical associated point for a globally defined reference surface.
Height, persistence and elevation. Let $\mathbb{M}$ be a smoothly embedded 2-manifold in $\mathbb{R}^3$. We assume that $\mathbb{M}$ is generic, but it is too early to say what exactly that should mean. We define the height in a given direction as the signed distance from the plane normal to that direction and passing through the origin. Formally, for every unit vector $u \in \mathbb{S}^2$, we call $h_u(x) = \langle x, u \rangle$ the height of $x$ in the direction $u$. This defines a 2-parameter family of height functions,

  $\mathrm{Height}: \mathbb{M} \times \mathbb{S}^2 \to \mathbb{R}$,

where $\mathrm{Height}(x, u) = h_u(x)$. The height is a Morse function on $\mathbb{M}$ for almost all directions. We pair the critical points of $h_u$ as described in Section 4.2.1. Following [76], we define the persistence of a critical point as the absolute difference in height to the paired point: $\mathrm{pers}(x) = \mathrm{pers}(y) = |h_u(x) - h_u(y)|$.

Each point $x \in \mathbb{M}$ is critical for exactly two height functions, namely for the ones in the direction of its outer and inner normals: $\pm n_x$. We proved in Section 4.2.1 that the
pairs we get for the two opposite directions are the same. Hence, each point $x \in \mathbb{M}$ has a unique persistence, which we use to introduce the elevation function,

  $\mathrm{Elevation}: \mathbb{M} \to \mathbb{R}$,

defined by $\mathrm{Elevation}(x) = \mathrm{pers}(x)$. We note that the elevation is invariant under translation and rotation of $\mathbb{M}$ in $\mathbb{R}^3$.
Two-dimensional example. We illustrate the definitions of the height and elevation functions for a smoothly embedded 1-manifold $\mathbb{M}$ in $\mathbb{R}^2$. The critical points of $h_u: \mathbb{M} \to \mathbb{R}$ are the points $x \in \mathbb{M}$ with normal vectors $n_x = \pm u$. Figure 4.3 illustrates a sweep in the vertical upward direction $u$. Each critical point of $h_u$ starts a component, ends a component by merging it into an older component, or closes the curve. The critical points that start components get paired with the other critical points. The elevation is zero at inflexion
Figure 4.3: A 1-manifold with marked critical points of the vertical height function. The shaded strips along the curve connect paired critical points. The black and grey dots mark two- and one-legged elevation maxima.
points and increases as we move away in either direction. The function may experience a
discontinuity at points that share tangent lines with others, such as endpoints of segments
that belong to the boundary of the convex hull. On the way towards a discontinuity, the
elevation may go up and down, possibly several times. The elevation may reach a local
maximum at points that either maximize the distance to a shared tangent line or the distance
to another critical point in the normal direction. Examples of the first case are the black
dots in Figure 4.3, where the elevation peaks in a non-differentiable manner. An example
of the second case is the grey point, where the elevation forms a smooth maximum.
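As a concrete (hypothetical) instance of this two-dimensional picture, consider an ellipse: the two critical points of each height function are antipodal, so the elevation at a point is the height difference to its antipode, measured along the normal. A short sketch:

```python
# Sketch (not thesis code): elevation on an ellipse.  The point at parameter
# t is critical for the height function in its normal direction; the other
# critical point of that height function is the antipodal point at t + pi,
# so the elevation is the height difference between the two along the normal.
import math

def elevation_on_ellipse(a, b, t):
    x = (a * math.cos(t), b * math.sin(t))
    n = (b * math.cos(t), a * math.sin(t))          # outward (unnormalized) normal
    norm = math.hypot(n[0], n[1])
    u = (n[0] / norm, n[1] / norm)                  # unit normal direction
    antipode = (-x[0], -x[1])                       # the critical point for -u
    return abs((x[0] - antipode[0]) * u[0] + (x[1] - antipode[1]) * u[1])

# the ends of the major and minor axes realize elevations 2a and 2b
assert math.isclose(elevation_on_ellipse(2.0, 1.0, 0.0), 4.0)
assert math.isclose(elevation_on_ellipse(2.0, 1.0, math.pi / 2), 2.0)
```

The major-axis endpoints maximize the elevation, in line with the smooth one-legged maximum marked grey in Figure 4.3.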
Singular tangencies. The elevation is continuous on $\mathbb{M}$, except possibly at points with singular tangencies. These points correspond to transitional violations of the two genericity conditions of Morse functions. Such violations are unavoidable, as Height is a 2-parameter family within which we can transition from one Morse function to another:

- two critical points may converge and meet at a birth-death point, where they cancel each other;

- two critical points may interchange their positions in the ordering by height, passing a direction at which they share the same height.
The first transition corresponds to an inflexion point of a geodesic on $\mathbb{M}$. Such points are referred to as flat or parabolic, indicating that their Gaussian curvature is zero. The second transition corresponds to two points $x \ne y$ that share the same tangent plane, $T_x = T_y$.
Both types of singularities are forced by varying one degree of freedom and are turned into
curves by varying the second degree of freedom. These curves pass through co-dimension
two singularities formed by two simultaneous violations of the two genericity conditions.
There can be two concurrent birth-death points, a birth-death point concurrent with an
interchange, or two concurrent interchanges. In each case, the singularity is defined by
two pairs of critical points and we get two types each because these pairs may be disjoint
or share one of the points. See Table 4.1 for the features on $\mathbb{M}$ that correspond to the six
types of co-dimension two singularities. We can now be more precise about what we mean
by a generic 2-manifold.
GENERICITY ASSUMPTION A. The 2-parameter family of height functions on $\mathbb{M}$ has no violations of Conditions I and II for Morse functions other than the ones mentioned above (and enumerated in Table 4.1 below).
Some of these violations will be discussed in more detail later as they can be locations of
maximum elevation. A second genericity assumption referring specifically to the elevation
function will be stated in Section 4.4.1.
4.3 Pedal Surface
In this section, we take a dual view of the height and elevation functions, based on a transformation of $\mathbb{M}$ to another surface in $\mathbb{R}^3$. We take this view to help our understanding of the singularities of Height, but it is of course also possible to study them directly using
standard results in the field [23, 99].
Pedal function. Recall that $T_x$ is the plane tangent to $\mathbb{M}$ that passes through the point $x \in \mathbb{M}$. The pedal $p$ of $x$ is the orthogonal projection of the origin onto $T_x$. We write $p = \mathrm{Pedal}(x)$ and obtain a function

  $\mathrm{Pedal}: \mathbb{M} \to \mathbb{R}^3$,

whose image $P = \mathrm{Pedal}(\mathbb{M})$ is the pedal surface of $\mathbb{M}$ [36]. If the line from the origin through $x$ is normal to $\mathbb{M}$ then $p = x$. More generally, we can construct $p$ by drawing the diameter sphere with center $x/2$ passing through $0$ and $x$. This sphere intersects $T_x$ in a circle with center $(x + p)/2$ that passes through $x$ and $p = \mathrm{Pedal}(x)$. In fact, $P$ is the evolute of the diameter spheres defined by the origin and the points $x \in \mathbb{M}$, as illustrated in Figure 4.4. The following three properties are useful in understanding the correspondence between $\mathbb{M}$ and its pedal surface:

- points on $\mathbb{M}$ have parallel and anti-parallel normal vectors iff their images under the pedal function lie on a common line passing through the origin;
Figure 4.4: A smoothly embedded closed curve (boldface solid) and the image of the pedal function (solid), constructed as the evolute of the diameter circles (dotted) between the curve and the origin.
- the height of a point $x \in \mathbb{M}$ in the direction of its normal vector is equal to plus or minus the distance of $\mathrm{Pedal}(x)$ from the origin;

- from a point $p \in P$ and the angle $\varphi$ between the vector $p$ and the normal $n_p$ of $P$ at $p$, we can compute the radius $r = |p| / (2 \cos \varphi)$ of the corresponding diameter sphere and the preimage $x$ at distance $|p| \tan \varphi$ from $p$, in the direction normal to $p$ within the plane spanned by $p$ and $n_p$.
The third property implies that the pedal surface determines the 2-manifold.
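The pedal map itself is a one-line computation. The sketch below (illustration only, with a hypothetical point and unit normal) also checks the second property above, that the height of $x$ in its normal direction equals plus or minus the distance of $\mathrm{Pedal}(x)$ from the origin:

```python
# Sketch (illustration, not thesis code): the pedal of a surface point x with
# unit normal n is the orthogonal projection of the origin onto the tangent
# plane at x, i.e. Pedal(x) = <x, n> n.  We verify that the height of x in
# its normal direction equals (up to sign) the distance |Pedal(x)|.
import math

def pedal(x, n):
    d = sum(xi * ni for xi, ni in zip(x, n))   # <x, n> = signed height h_n(x)
    return tuple(d * ni for ni in n)

x = (1.0, 2.0, 2.0)
n = (0.0, 0.6, 0.8)                            # hypothetical unit normal at x
p = pedal(x, n)
height = sum(xi * ni for xi, ni in zip(x, n))  # h_n(x) = <x, n>
assert math.isclose(abs(height), math.dist(p, (0.0, 0.0, 0.0)))
```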
Tangents, heights, and pedals. We are interested in singularities of the pedal function, as they correspond to directions along which the height function is not generic. For example, a birth-death point of Height corresponds to a cusp point of $P$. To see this, recall that the birth-death point corresponds to a flat point $x \in \mathbb{M}$. A generic geodesic curve through this point has an inflexion at $x$, causing the tangent plane to reverse the direction of its rotating motion as we pass through $x$. Similarly, it causes a sudden reversal of the motion of the image point, thus forming a cusp at $\mathrm{Pedal}(x)$. In contrast, an interchange of Height, which corresponds to a plane tangent to $\mathbb{M}$ in two points, maps to a point of self-intersection (a xing) of $P$. These two cases exhaust the co-dimension one singularities of Height, which are listed in the upper block of Table 4.1.
Co-dimension two singularities.

Dictionary of Singularities

    $\mathbb{M}$             Height                      pedal surface $P$
    flat point               birth-death (bd) point      cusp
    double tangency          interchange                 xing

    Jacobi point             2 bd-points                 dovetail point
    triple tangency          3 interchanges              triple point
                             bd-pt. and interchange      cusp xing
                             2 bd-points                 cusp-cusp overpass
                             2 interchanges              xing-xing overpass
                             bd-pt. and interchange      cusp-xing overpass

Table 4.1: Correspondence between singularities of tangents of the manifold, the 2-parameter family of height functions, and the pedal surface. There are two singularities of co-dimension one: curves of cusps and curves of self-intersections (xings). There are six singularities of co-dimension two.

There are six types of co-dimension two singularities, listed in the lower block of Table 4.1. Perhaps the most interesting is formed by two
concurrent birth-death points that share a critical point. As illustrated in Figure 4.5, left, the corresponding dovetail point in the pedal surface is the endpoint of two cusps but also of a self-intersection curve. The second most interesting type is formed by two concurrent

Figure 4.5: Left: a portion of the pedal surface in which a self-intersection and two cusps end at a dovetail point. Middle: three sheets of the pedal surface intersecting in a triple point. Right: a cusp intersecting another sheet of the pedal surface.

interchanges that share a critical point and therefore force a third concurrent interchange of the other two critical points. It corresponds to three self-intersection curves formed by three sheets of $P$ that intersect in a triple point, as shown in Figure 4.5, middle. Third, we may have a concurrent birth-death point and interchange that share a critical point. As illustrated in Figure 4.5, right, this corresponds to a cusp curve that passes through another sheet of the pedal surface. There are three parallel types, in which the concurrency happens in the same direction $u$ but not in space. They correspond to two curves on the pedal surface that cross each other as seen from the origin but do not meet in $\mathbb{R}^3$. As before, a birth-death point corresponds to a cusp curve and an interchange to a curve of self-intersections.
4.4 Capturing Elevation Maxima
4.4.1 Continuity
We are interested in the local maxima of the elevation function, which are the counterparts
of mountain peaks and deepest points in the sea. But they are not well defined because the
elevation can be discontinuous. We remedy this shortcoming through surgery, and establish
a stratified Morse function on the new 2-manifold.
Discontinuities at interchanges. As mentioned in Section 4.2.1, the pairs vary con-
tinuously as long as the height function varies without passing through interchanges and
birth-death points (Conditions I and II). It follows that the elevation is continuous in re-
gions where this is guaranteed. Around a birth-death point, the elevation is necessarily
small and goes to zero as we approach the birth-death point. The only remaining possibil-
ity for discontinuous elevation is thus at interchanges, which happen when two points share
the same tangent plane. As mentioned in Table 4.1, this corresponds to a point at which
the pedal surface intersects itself. Figure 4.6 shows that discontinuities in the elevation
can indeed arise at co-tangent points. We see four points with common vertical normal direction, of which $x$ and $y$ are co-tangent. Consider a small neighborhood of the vertical direction, $u$, and observe that the critical points vary in neighborhoods of their locations for nearby directions. The critical point near $w$ changes its partner from the right side of $x$ to the left side of $y$ as it varies from left to right in the neighborhood of $w$. Similarly, the critical point near $z$ changes its partner from the right side of $y$ to the left side of $x$ as it varies from left to right in the neighborhood of $z$. Since the height difference is the same at the time of the interchange, the elevation at $w$ and $z$ is still continuous. However, it is not continuous at $x$ and at $y$, which both change their partners, either from $w$ to $z$ or the other way round. Not all interchanges cause discontinuities, only those that affect the pairing. These are the interchanges that affect a common topological feature arising during the sweep of $\mathbb{M}$ in the height direction.

Figure 4.6: The four white points share the same normal direction, as do the four light shaded and the four dark shaded points. The strips indicate the pairing, which switches when the height function passes through the vertical direction. The insert on the right illustrates the effect of surgery at $x$ and $y$ on the pedal curve.
Continuity through surgery. We apply surgery to $\mathbb{M}$ to obtain another 2-manifold $\mathbb{N}$ on which the elevation function is continuous. Specifically, we cut $\mathbb{M}$ along the curves at which $\mathrm{Elevation}: \mathbb{M} \to \mathbb{R}$ is discontinuous, resulting in a 2-manifold with boundary, $\mathbb{K}$. Then we glue $\mathbb{K}$ along its boundary, making sure that glued points have the same elevation. Formally, we cut by applying the inverse of a surjective map from $\mathbb{K}$ to $\mathbb{M}$, and glue by applying a surjective map from $\mathbb{K}$ to $\mathbb{N}$:

  $\mathbb{M} \xleftarrow{\ \mathrm{cut}\ } \mathbb{K} \xrightarrow{\ \mathrm{glue}\ } \mathbb{N}$.

As argued above, the boundary curves of $\mathbb{K}$ occur in pairs, and each pair is defined by an interchange, thus corresponding to a self-intersection curve (a xing) of the pedal surface. The latter view is perhaps the most direct one, in which surgery means cutting along xings and gluing the resulting four sheets in a pairing that resolves the self-intersection. This is illustrated in Figure 4.6, where on the right we see a self-intersection being resolved by cutting the two curves and gluing the upper and lower two ends. In the original boldface curve on the left, this operation corresponds to cutting at $x$ and $y$ and gluing the four ends to form two closed curves: one passing through $w$ and the other through $z$.
As mentioned earlier, not all xings correspond to discontinuities, and we perform surgery only on the subset that do. In general, a discontinuity follows a xing until it runs into a dovetail or a triple point. In the former case, the xing and the discontinuity both end. In the latter case, the xing continues through the triple point and the discontinuity may follow, turn, or even branch to other xings passing through the same triple point. Two possible configurations created by surgery in the neighborhood of a triple point are illustrated in Figures 4.8 and 4.9. Their particular significance in the recognition of local maxima will be discussed shortly. Whatever the situation, the subset of xings along which the elevation is discontinuous, together with the gluing pattern across these xings, provides a complete picture of how to use surgery to change $P$ into a new surface, $Q$. The 2-manifold $\mathbb{N}$ is the one for which this is the pedal surface: $Q = \mathrm{Pedal}(\mathbb{N})$. That $\mathbb{N}$ is indeed a manifold can be shown by (tediously) enumerating and examining all cases of cut-and-glue patterns that may occur. After surgery, we have a continuous function $\mathrm{Elevation}: \mathbb{N} \to \mathbb{R}$. Furthermore, we have continuously varying pairs of critical points. To formalize this idea, we introduce a new map

  $\mathrm{Pair}: \mathbb{N} \to \mathbb{N}$

that maps a point $x$ to its paired point $y = \mathrm{Pair}(x)$. The function Pair is a homeomorphism and its own inverse. We note in passing that we could construct yet another 2-manifold by identifying antipodal points. Each local maximum of the elevation function on this new manifold corresponds to a pair of equally high maxima in $\mathbb{N}$. This construction is the reason we will blur the difference between maxima and antipodal pairs of maxima in the next few sections.
Smoothness of Elevation. The elevation function on $\mathbb{N}$ is smooth almost everywhere. To describe the violations of smoothness, let $\partial\mathbb{K}$ denote the boundary of the intermediate manifold. Let $G' = \mathrm{glue}(\partial\mathbb{K})$ and define $G = G' \cup \mathrm{Pair}(G')$; $G$ is the set of points at which the elevation function is not smooth. By Genericity Assumption A, $G$ is a graph, consisting of nodes and arcs. We have degree-1 and degree-3 nodes that correspond to dovetail points and triple points in the pedal surface, respectively, as well as degree-4 nodes that correspond to overpasses between xings. Each degree-4 node is the crossing of an arc in $G'$ and an arc in the antipodal image of $G'$. We think of this construction as a stratification of $\mathbb{N}$. Its strata are

- the three kinds of nodes;

- the open and closed arcs;

- the open connected regions in $\mathbb{N} - G$.

Figure 4.7 illustrates the construction by showing what such a stratification may look like. When restricted to every stratum, the elevation function is smooth, but still not a Morse function. For example, all points on lines of inflexion have elevation identically 0, forming lines of local minima of Elevation. We now complete our description of what we mean by a generic 2-manifold.
Figure 4.7: Stratification of the 2-sphere obtained by overlaying a spherical tetrahedron with its antipodal image. The (shaded) degree-4 nodes are crossings between the graph and its antipodal image.
GENERICITY ASSUMPTION B. The local maxima of Elevation on $\mathbb{N}$ are isolated.

The implication of this assumption becomes clearer after we enumerate the generic types of local maxima of the elevation function in the next section. In particular, it means that surfaces such as spheres and cylinders are not generic under this assumption.
4.4.2 Elevation Maxima
In this section, we enumerate the generic types of local maxima of the elevation function.
They come in pairs in $\mathbb{N}$ which, by inverse surgery, form multi-legged creatures in $\mathbb{M}$.

Classification of local maxima. Depending on its location, a point $x \in \mathbb{N}$ can have one, two, or three preimages under surgery. We call this number its multiplicity, $\mu(x)$. Specifically, $x$ has multiplicity three if it is a node of the graph $G$, it has multiplicity two if it lies on an arc of $G$, and it has multiplicity one otherwise. Degree-4 nodes in the stratification correspond to antipodal pairs of points with multiplicity two each. Let now $x \in \mathbb{N}$ be a local maximum of the elevation function. We know that $x$ is not a flat point of $\mathbb{M}$, else its elevation would be zero. This simple observation eliminates five of the eight singularities in Table 4.1. Furthermore, the assumption of a generic 2-manifold $\mathbb{M}$ implies that the sum of the multiplicity of $x$ and that of $y = \mathrm{Pair}(x)$ is at most four (where the xings intersect each other transversally; otherwise, we can deform the manifold slightly to enforce this). This leaves the following four possible types of local maxima $x$:

- one-legged if $\mu(x) = \mu(y) = 1$;
- two-legged if $\mu(x) = 1$ and $\mu(y) = 2$;
- three-legged if $\mu(x) = 1$ and $\mu(y) = 3$;
- four-legged if $\mu(x) = \mu(y) = 2$,

where $y = \mathrm{Pair}(x)$; see Figure 4.1. We sometimes call the preimages of $x$ the heads and those of $y$ the feet of the maximum. The most exotic of the four types is perhaps the four-legged maximum, which corresponds to an overpass of two xings in the pedal surface or, equivalently, a degree-4 node in the stratification. The image under Pedal of $x$ lies on one xing and the image of $y$ lies on the other. Both maxima have two preimages under surgery, which makes for a complete bipartite graph with two heads, two feet, and four legs.
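The classification by multiplicities can be summarized in a small lookup (a sketch; `leg_type` is a hypothetical helper, not thesis code):

```python
# Sketch of the classification above: the type of a local maximum follows
# from the multiplicities of x and its partner y = Pair(x) under surgery.
def leg_type(mu_x, mu_y):
    mu_x, mu_y = sorted((mu_x, mu_y))   # the roles of x and y are symmetric
    kinds = {(1, 1): 'one-legged', (1, 2): 'two-legged',
             (1, 3): 'three-legged', (2, 2): 'four-legged'}
    if (mu_x, mu_y) not in kinds:
        # generic maxima have multiplicities summing to at most four
        raise ValueError('not a generic local maximum')
    return kinds[(mu_x, mu_y)]

assert leg_type(1, 3) == 'three-legged'
assert leg_type(2, 2) == 'four-legged'
```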
Neighborhood patterns. Given a point $x \in \mathbb{M}$, take an open neighborhood of $x$ on $\mathbb{M}$. Denote by $N(n_x) \subseteq \mathbb{S}^2$ the image of this neighborhood under the Gauss map (which takes each point of $\mathbb{M}$ to its normal vector on $\mathbb{S}^2$), and refer to it as the neighborhood of $n_x$. If $x$ is not a flat point (i.e., the Gaussian curvature at $x$ is not zero), then $N(n_x)$ is homeomorphic to an open disk, and there is a one-to-one map from the neighborhood of $x$ to that of $n_x$ under the Gauss map. In the following discussion, we study only non-flat points, since flat points cannot be maxima of the elevation function.
It is instructive to look at the local neighborhood of a maximum $x$ in $\mathbb{N}$. Most interesting is the three-legged type, with feet $y_1, y_2, y_3$. A small perturbation of the normal direction can change the ambiguous pairing of $x$ with all three feet into an unambiguous pairing of a point in the neighborhood of $x$ with a point in the neighborhood of one of the feet. We indicate this by labeling the points in the neighborhood of $n_x$ (i.e., $N(n_x)$) with the indices of the feet, as shown in Figure 4.8. The three curves passing through $n_x$ correspond to the
Figure 4.8: The three sheets of the pedal surface after cutting and gluing the neighborhood of a triple point, at the top, and the corresponding pairing patterns in the neighborhood of $n_x$, at the bottom. The (shaded) Mercedes star is necessary for a three-legged maximum.
three xings passing through the corresponding triple point of the pedal surface. Note that in generic cases such curves pass through each other at $n_x$ in a transversal manner, as long as $x$ is not a flat point. Hence, they decompose the neighborhood into six slices corresponding to the six permutations of the three feet. The labeling indicates the pairing and reflects the surgery at these feet and, equivalently, at the corresponding triple point in the pedal surface. Only the rightmost pattern in Figure 4.8 corresponds to a maximum, the reason for which will become clear later, after we introduce and prove the necessary projection conditions for elevation maxima. We call this pattern the Mercedes star property of three-legged maxima.
There are in fact two ways to apply surgery at a three-legged maximum, one of which
is already shown in Figure 4.8. We illustrate the neighborhood patterns of the other in
Figure 4.9. The neighborhood pictures for the remaining three types of maxima are simpler.
For a one-legged maximum we have an undivided disk, which requires no surgery. For a
two- (resp. four-) legged maximum we have a disk divided into two halves (resp. quarters)
and there is only one way to do the surgery (see Figure 4.10).
Figure 4.9: The second type of surgery pattern at a triple point: the three sheets of the pedal surface after cutting and gluing the neighborhood of a triple point, at the top, and the corresponding pairing patterns in the neighborhood of $n_x$, at the bottom.
Figure 4.10: Neighborhood patterns for (a) a two-legged maximum and (b) a four-legged maximum, where each region is marked by the index pair of the sheets that are paired in it.
Necessary projection conditions. Given a maximum p of the elevation function with feet
x_1, ..., x_k, let ±u be the two directions at which the heads (the y_i's) and feet (the x_i's)
of this maximum on the surface are critical. Recall that all the x_i's (resp. y_i's) are
co-tangent. Furthermore, they satisfy the following necessary properties, stated as
projection conditions:
PROJECTION CONDITIONS. The point p is a maximum of the elevation function only if

#legs = 1: u is parallel or anti-parallel to p − x_1;

#legs = 2: u, p − x_1, and p − x_2 are linearly dependent, and the orthogonal projection
of p onto the line of the two feet lies between x_1 and x_2;

#legs = 3: the orthogonal projection of p onto the plane of the three feet lies inside
the triangle spanned by x_1, x_2 and x_3;

#legs = 4: the orthogonal projections of the segments y_1y_2 and x_1x_2 onto a plane
parallel to both have a non-empty intersection.
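The two- and three-legged conditions reduce to elementary projection tests. The following is a minimal illustrative sketch (not the thesis's code; the function names are ours, and points are assumed to be numpy arrays):

```python
import numpy as np

def projects_inside_triangle(p, x1, x2, x3, eps=1e-12):
    """Three-legged condition: does the orthogonal projection of p onto
    the plane of the feet x1, x2, x3 lie inside their triangle?
    Uses barycentric coordinates of the projected point."""
    u, v = x2 - x1, x3 - x1
    w = p - x1
    uu, uv, vv = u @ u, u @ v, v @ v
    wu, wv = w @ u, w @ v
    den = uu * vv - uv * uv
    if abs(den) < eps:
        return False            # degenerate (collinear feet)
    s = (vv * wu - uv * wv) / den
    t = (uu * wv - uv * wu) / den
    return s > 0 and t > 0 and s + t < 1

def projects_between_feet(p, x1, x2, eps=1e-12):
    """Two-legged condition: the orthogonal projection of p onto the
    line of x1 and x2 lies strictly between the two feet."""
    d = x2 - x1
    lam = ((p - x1) @ d) / (d @ d)
    return eps < lam < 1 - eps
```

The barycentric solve uses the normal equations of the least-squares fit of p − x1 by the edge vectors, which is equivalent to projecting p onto the plane first.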
In summary, p is a local maximum only if u is either a positive or a negative linear combi-
nation of the vectors p − x_i. Below we first prove the necessary conditions for one-legged
maxima. We then sketch the proof for all the remaining types of maxima.
For a one-legged maximum p, assume without loss of generality that p lies at the origin,
that u = (0, 0, 1), and that the foot is x_1. By definition, Elev(p) = h_u(p) − h_u(x_1),
which is the height difference between p and x_1 in the vertical direction. We parameterize
the directions in a neighborhood of u on the sphere by (φ, θ), as illustrated in Figure 4.11,
where 0 ≤ φ < 2π and 0 ≤ θ ≤ θ_0 for a sufficiently small θ_0 > 0. Next, let p(φ, θ)
(resp. x(φ, θ)) be the point in the neighborhood of p (resp. x_1) with normal n(φ, θ). Denote
by E(φ, θ) the height difference between p(φ, θ) and x(φ, θ) in direction n(φ, θ), i.e.,

E(φ, θ) = h_{n(φ,θ)}(p(φ, θ)) − h_{n(φ,θ)}(x(φ, θ)),

and set D(φ, θ) = E(φ, θ) − E(0, 0). Obviously, p is a maximum if and only if

D(φ, θ) ≤ 0 for all φ and all sufficiently small θ > 0.    (4.1)
Moreover, write r_p(φ) (resp. r_x(φ)) for the radius of curvature at p (resp. x_1) in the
direction parameterized by φ.

Figure 4.11: Illustration of the normal u and the parameterization of its neighborhood on the sphere of directions, a spherical patch of S^2.

Expanding the heights of p(φ, θ) and x(φ, θ) in direction n(φ, θ) to second order in θ
then gives

D(φ, θ) = E(φ, θ) − E(0, 0)
        = [ −E(0, 0)(1 − cos θ) + (r_p(φ) + r_x(φ))(1 − cos θ) + d sin θ cos ψ(φ) ],    (4.2)

where d is the horizontal distance between p and x_1, and ψ(φ) is the angle from the
vector p − x_1 to the vector n(φ) in the xy-plane. The third term in the bracket dominates
the other two for sufficiently small θ, as sin θ/(1 − cos θ) tends to infinity in that case.
Hence, (4.1) implies that p is a maximum if and only if (i) d = 0, i.e., p and x_1 are
vertically aligned; and (ii) for all φ, r_p(φ) + r_x(φ) ≤ E(0, 0). Note that the projection
condition for one-legged maxima is the same
as (i) and is thus indeed necessary. Furthermore, if we add (ii), then the conditions are also
sufficient. In fact, the necessary Projection Conditions for all other types of maxima can be
made sufficient by adding appropriate constraints on the curvature of the surface at the x_i's
and at the y_i's; if p is three-legged, it also needs the Mercedes star property. We sketch the
proofs for the remaining types of maxima below. As the curvature conditions are not used
in our algorithm for piecewise-linear manifolds, we omit terms related to curvature in
the following sketch for simplicity. Hence, the above discussion suggests that it suffices to
consider only the term

G(φ) = sin θ cos ψ(φ),

which determines the sign of D(φ, θ) as θ tends to zero.
For a k-legged maximum p with k > 1, the complication is that a point in the
neighborhood of p might be paired with points from the neighborhoods of different x_i's,
as illustrated by the neighborhood patterns in Figures 4.8–4.10. The disk in the neigh-
borhood pattern corresponds to the projection of the spherical patch in direction u. Therefore,
we can parametrize it using φ as well, and each region R in the pattern corresponds to an
interval of φ-values, referred to as the range of this region.

First consider k = 2 or 3, and assume again that p is at the origin and u = (0, 0, 1).
Define D_i(φ, θ), ψ_i, and G_i(φ) as before by substituting x_1 with x_i. Obviously, p is a
maximum if and only if for each region R in the neighborhood pattern,

D_i(φ, θ) ≤ 0 for all φ in the range of R, where x_i is the foot paired in R.    (4.3)
For a two-legged maximum, the length of the range of each region is π. Hence, (4.3)
imposes that the graphs of G_1 and G_2 behave as shown in Figure 4.12 (a). In other words,
we have ψ_2 = ψ_1 + π, implying that the u-projection of the segment x_1x_2 contains the
origin (i.e., the projection of p).

Figure 4.12: The solid (dotted) curve corresponds to the graph of G_1 (G_2). (a) is necessary for p to be a maximum. In (b), there exists a direction φ (hollow dot) such that both G_1(φ) > 0 and G_2(φ) > 0; thus p cannot be a maximum.
Figure 4.13: The dark, light, and dotted curves are the graphs of G_1, G_2, and G_3, respectively. (a) is necessary for p to be a maximum. In (b), where the third curve is moved slightly, there exists a direction φ (hollow dot) such that G_i(φ) > 0 for i = 1 and 3; thus p cannot be a maximum.
For a three-legged maximum, first note that the Mercedes star property is necessarily
true. This is because for any other neighborhood pattern in Figure 4.8 or 4.9, there exist
pairs of antipodal normals that are marked by the same index. For example, the middle
pattern in Figure 4.8 contains such a pair, both marked by index 3. As G_i(φ) and G_i(φ + π)
have opposite signs (unless both are zero), p cannot be a maximum in this case.
Furthermore, (4.3) imposes that the angles ψ_1, ψ_2, and ψ_3 interleave as in Figure 4.13.
This implies that the u-projection of the triangle x_1x_2x_3 covers the origin (i.e., the
projection of p).
The case of a four-legged maximum is slightly more involved, as there are two heads
y_1 and y_2 as well as two feet x_1 and x_2. Assume that y_1 is at the origin and u = (0, 0, 1).
By construction, the segment y_1y_2 is parallel to the horizontal plane, and so is the segment
x_1x_2. We define the ψ_i's and G_i's for i = 1, 2 as before by substituting x_1 with x_i.
Denote by D'(φ, θ) the height of y_2(φ, θ) minus that of y_1(φ, θ) in direction n(φ, θ). By
(4.2), D'(φ, θ) has the same sign as G'(φ) = sin θ cos ψ', where ψ' is the angle from the
vector y_1 − y_2 to the vector n(φ). If the pair is a maximum, then for any φ, either
G_1(φ) ≤ 0 or G'(φ) ≤ 0 holds. This is because y_2(φ, θ) is higher than y_1(φ, θ) when
G'(φ) > 0, so the height difference between y_2(φ, θ) and x_1(φ, θ) can only be larger than
D_1(φ, θ). It then follows that the u-projection of the segment y_1y_2 intersects, in its
interior, the line passing through the projections of x_1 and x_2. By switching the roles of
the y_i's and x_i's in the above argument, we can show that the segment x_1x_2 also
intersects, in its interior, the line passing through the projections of y_1 and y_2. This
proves the necessary condition for four-legged maxima.
4.5 Algorithm
In this section, we describe an algorithm for constructing all points with locally maximum
elevation. The input is a piecewise linear 2-manifold embedded in R^3. The running time
of the algorithm is polynomial in the size of the 2-manifold.
Smooth vs. piecewise linear. We consider the case in which the input is a two-dimensional
simplicial complex K in R^3. This data violates some of the assumptions we used in our
mathematical considerations above. This causes difficulties which, with some effort, can be
overcome. For example, it makes sense to require that K be a 2-manifold but not that it
be smoothly embedded. The 2-parameter family of height functions is well-defined and
continuous but not smooth. The definition of the elevation function is more delicate, as it
makes reference to point pairs in all possible directions. For any given direction, we get a
well-defined collection of pairs, but how can we be sure that the pairs for different direc-
tions are consistent? The difficulty is rooted in the fact that a vertex of K can be critical for
more than one direction, and it may be paired with different vertices in different directions.
To rationalize this phenomenon, we follow [76] and think of K as the limit of an infinite
sequence of smoothly embedded 2-manifolds. A vertex of K gets resolved into a small patch
with a two-dimensional variety of normal directions. Even as the patch shrinks toward the
vertex, the variety of normal directions may remain fixed or at least not contract. For dif-
ferent directions in this variety, the corresponding points on the patch may be paired with
points from different other patches. It thus seems natural that in the limit a vertex would
be paired to more than one other point.
To make this idea concrete, we introduce a combinatorial notion of the variety of nor-
mal directions. Let σ be a simplex in K (a vertex, edge, or triangle), let a be a point in the
interior of σ, and let u ∈ S^2 be a direction. We say a is critical for the height function in
the direction u if

(i) h_u(x) = h_u(a) for all points x of σ;

(ii) the lower link of a is not contractible to a point.

For example, the empty lower link of a minimum and the complete circle of a maximum
are both not contractible. Let S(a) ⊆ S^2 be the set of directions along which a is critical.
Generically, the set S for a point inside a triangle of K is an antipodal pair of points,
that for a point on an edge is an antipodal pair of open great-circle arcs, and that for a
vertex is an antipodal pair of open spherical polygons. Here, the word 'generic' applies to
a simplicial complex in R^3, where it simply means that the vertices are in general position.
Computationally, this assumption can be simulated by a symbolic perturbation [78]. We
write S(a, b, ...) for the common intersection of the sets S of a, b, and so on.
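For a vertex of a PL 2-manifold and a fixed generic direction, criticality can be read off the cyclic sequence of link heights: the lower link is a union of arcs, and the vertex is regular exactly when there is a single such arc. A small illustrative sketch (our own naming, not the thesis's code; heights are assumed to be measured in the chosen direction):

```python
def lower_link_arcs(vertex_height, ring_heights):
    """Number of connected arcs of the lower link of a vertex on a PL
    2-manifold, given the heights of its link vertices in cyclic order.
    Returns -1 if the lower link is the whole circle."""
    lower = [h < vertex_height for h in ring_heights]
    if all(lower):                  # full circle: a maximum
        return -1
    # count maximal cyclic runs of 'lower' vertices
    arcs = 0
    for i in range(len(lower)):
        if lower[i] and not lower[i - 1]:
            arcs += 1
    return arcs

def classify(vertex_height, ring_heights):
    """Classify a vertex by the topology of its lower link."""
    a = lower_link_arcs(vertex_height, ring_heights)
    if a == -1: return "maximum"    # lower link not contractible
    if a == 0:  return "minimum"    # empty lower link, not contractible
    if a == 1:  return "regular"    # one arc, contractible
    return f"{a}-fold saddle"       # two or more arcs, not contractible
```

Sweeping over all directions in which the vertex is level with some neighbor traces out the boundary of the spherical polygon S(a) described above.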
Finite candidate sets. Given a candidate for a maximum, we can use the extended per-
sistence algorithm to decide whether or not it really is a maximum. More specifically, we
need a point a and a direction u along which the sweep defining the pairing proceeds.
The details of this decision algorithm will be discussed shortly. We use the Projection
Conditions, which are necessary for local maxima, to get four kinds of candidates:

#legs = 1: pairs of points a and b on K with the direction u = (a − b)/||a − b|| contained in
S(a, b);

#legs = 2: triplets of points a, b_1, b_2 such that the orthogonal projection c of a onto the
line of b_1 and b_2 lies between the two points and the direction u = (a − c)/||a − c|| is
contained in S(a, b_1, b_2);

#legs = 3: quadruplets of points a, b_1, b_2, b_3 such that the orthogonal projection c of a
onto the plane of b_1, b_2, b_3 lies inside the triangle and the direction u = (a − c)/||a − c||
is contained in S(a, b_1, b_2, b_3);

#legs = 4: quadruplets of points a_1, a_2, b_1, b_2 such that the shortest line segment s con-
necting the lines of a_1a_2 and b_1b_2 also connects the two line segments and the
direction of s is contained in S(a_1, a_2, b_1, b_2).

With the assumption of a generic simplicial complex K, we get a finite set of candidates
of each kind. Since this might not be entirely obvious, we discuss the one-legged case
in some detail. Let σ and τ be two simplices and a and b points in their interiors. For
a generic K, the intersection of normal directions, S(a, b), is non-empty only if one of
the two simplices is a vertex or both are edges. If σ is a vertex then a = σ, and b is
necessarily the orthogonal projection of a onto τ, which may or may not exist. If σ and τ
are both edges then ab is necessarily the line segment connecting σ and τ and forming a
right angle with both, which again may or may not exist. In the end, we get a set of O(n^2)
candidate pairs a and b, where n is the number of edges in K. For the two-legged case, we
get O(n^3) candidates, each a triplet of vertices or a pair of vertices together with a point
on an edge. For the three- and four-legged cases, we get O(n^4) candidates each, a
quadruplet of vertices, giving a total of O(n^4) candidates.
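For the edge–edge subcase of the one-legged candidates, the connecting segment that forms a right angle with both edges can be computed from the standard closest-point parameters of the two supporting lines. A hedged sketch (our own helper, not the thesis's implementation; endpoints are assumed to be numpy arrays):

```python
import numpy as np

def common_normal_candidate(p1, p2, q1, q2, eps=1e-12):
    """For two edges p1p2 and q1q2, find the segment realizing a right
    angle with both supporting lines; return (A, B, u) if it connects
    the two open edges, else None."""
    d1, d2 = p2 - p1, q2 - q1
    r = p1 - q1
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ r, d2 @ r
    den = a * c - b * b
    if abs(den) < eps:
        return None                 # parallel edges: no isolated candidate
    s = (b * e - c * d) / den       # parameter on edge p1p2
    t = (a * e - b * d) / den       # parameter on edge q1q2
    if not (0 < s < 1 and 0 < t < 1):
        return None                 # foot falls outside an open edge
    A, B = p1 + s * d1, q1 + t * d2
    u = A - B
    n = np.linalg.norm(u)
    return (A, B, u / n) if n > eps else None
```

The parameters s and t solve the two orthogonality equations d1 · (A − B) = 0 and d2 · (A − B) = 0; the candidate survives only if the segment connects the interiors of both edges, mirroring the "may or may not exist" caveat above.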
Verifying candidates. Let (a, b) be a pair of points whose heads and feet all have
parallel or anti-parallel normal directions. In the smooth case, the necessary and sufficient
conditions for a and b to define an elevation maximum consist of three parts:

(a) the Projection Conditions of Section 4.4.1;

(b) the requirement that a and b be paired with each other by the antipodality map;

(c) the curvature constraint alluded to in Section 4.4.1.
We subsume the Mercedes star property in (b), since it depends on the antipodality map
or, equivalently, on the pairing by extended persistence. In the piecewise linear case, we
only have (a) and (b), because the concentration of the curvature at the edges and vertices
renders (c) redundant. We have seen above how to translate (a) to the piecewise linear
case. It remains to test (b), which reduces to answering a constant number of antipodality
queries: given a direction u and a critical point a of h_u, find the paired critical point b. This
is part of what the algorithm described in Section 4.2.1 computes if applied to a sweep of
K in the direction u. More precisely, the algorithm computes one of the possible pairs
if applied in non-generic directions, in which two or more vertices share the same height.
Most of our candidates generate non-generic directions, and we cope with this situation by
running the algorithm several times, namely once for each combination of permutations
of the heads and of the feet. Each combination corresponds to a generic direction that
is infinitesimally close to the non-generic direction. The largest number of combinations
is six, which we get for three-legged maxima. This is also how we decide the Mercedes
star property: each of the three feet is the answer to exactly two of the six antipodality
queries. Letting n be the number of edges, the algorithm takes time O(n log n) to answer
an antipodality query. Since we have O(n^4) candidates to test, this amounts to a total
running time of O(n^5 log n).
4.6 Experiments
We implemented the algorithm described in Section 4.5 and used it on surface representa-
tions of a few protein structures. We describe the findings to illustrate how the concepts
introduced in the earlier sections might be applied.
Elevation on the surface. We discuss the experimental findings for one chain of the protein
complex with PDB id 1brs, which we downloaded from the Protein Data Bank. It contains
864 atoms, not counting the hydrogens, which are too small to be resolved in the X-ray
experiment and are not part of the structure. The particular surface representation we use
is the molecular skin [55], which is similar to the better-known molecular surface [65]. The
reason for our choice is the availability of triangulating software and the guaranteed smooth
embedding. The computed triangulation, displayed in Figure 4.16, has slightly more than
fifty thousand vertices after some simplification. Given the triangulation of the skin surface,
we compute the elevation for each of its vertices and visualize it in Figure 4.16. Recall that
each vertex has a range of directions associated with it that make it critical, from which
we choose an arbitrary one to compute its elevation value.
Number of maxima. Table 4.2 gives the number of maxima of each type computed for
the skin triangulation of protein 1brs. We show in a separate row the number of additional
maxima paired by extended persistence, as introduced in Section 4.2.1. Since the genus of
this particular surface is zero, all these maxima lie on the convex hull of the surface.
#legs          one   two     three   four
#max (trad.)   5     3,617   728     1,103
#max (addl.)   15    0       6       0

Table 4.2: The number of maxima for the molecular skin of the 1brs protein structure obtained via traditional persistence (second row) and the additional maxima obtained by its extension (third row).
We notice that there are significantly more two-legged maxima than maxima of the other
types. The reason is perhaps the particular shape of molecules, in which covalently bonded atoms
Figure 4.14: The percentage of maxima with elevation exceeding the threshold marked on the vertical axis. From top to bottom: the curves for the three-legged, four-legged, and two-legged maxima.
form small dumbbells which invite two-legged maxima with one foot on each atom. These
dumbbells are rotationally symmetric and form surface patches with non-generic elevation
function, which further contributes to the abundance of two-legged maxima. The con-
figurations required to form one-legged and three-legged maxima are considerably more
demanding, but when they occur the maxima tend to have higher elevation. This obser-
vation is quantified in Figure 4.14, which sorts the maxima in the order of decreasing
elevation. We see that for each threshold, the fraction of three-legged maxima higher than
Figure 4.15: The one hundred pairs of maxima with the highest elevation. The heads are marked by light and the feet by dark dots.
that elevation is significantly larger than the fractions of two- and four-legged maxima.
The difference is even more pronounced for one-legged maxima of which four of the five
have elevation exceeding 5 Å. The statistics for other proteins are similar.
High elevation maxima. We are indeed mostly interested in high elevation maxima as
the others are likely consequences of insignificant surface fluctuations or artifacts of the
piecewise linear nature of the data. Figure 4.15 shows the top one hundred maxima on the
skin surface of 1brs. Each antipodal pair of maxima is represented by its one or two heads
and one, two, or three feet.
One might expect that the binding site of a protein would perhaps have more or higher
maxima. We did not observe any such trend in the few cases we studied. It seems that
maxima are more or less uniformly distributed over the surface. This should be contrasted
with the finding that in many cases the pocket with the largest volume identifies the location
of the binding site [130]. The elevation is indeed a less specific measurement with respect
to a single surface, and we expect its primary use to be in the study of interactions between
two or more shapes.
Meshes of different resolution. To see how the resolution of a surface mesh influences
the behavior of the elevation function, we generate two meshes for the molecular surface
of chain E of the protein complex with PDB id 1cho, using the msms program (also available
as part of the VMD package [105]). These two meshes, K_1 and K_2, have different numbers
of vertices, K_2 being the coarser of the two. Let M_1 (resp. M_2) be the set of maxima of
the elevation function from surface K_1 (resp. K_2), and M_1(δ) ⊆ M_1 (resp. M_2(δ) ⊆ M_2)
the subset of maxima with an elevation greater than some threshold δ. We show in Table 4.3
the sizes of M_1(δ) and M_2(δ) for various δ's. We note that the size of M_2(δ) (from the
coarser mesh) differs less from that of M_1(δ) as δ increases. Furthermore, Table 4.4 shows
that points from M_1(δ) roughly cover points from M_2(δ), and vice versa. In particular, a
point p ∈ M_2(δ) is λ-covered if there is some point q ∈ M_1(δ) such that ||p − q|| ≤ λ,
where λ > 0 is the covering radius. Table 4.4 shows the percentage of points from M_2(δ) that are
Threshold δ   0.0    0.1    0.2    0.3   0.5   0.8   1.0   2.0   4.0
|M_1(δ)|      2292   1445   1010   734   500   345   272   115   26
|M_2(δ)|      3714   2233   1290   931   489   357   287   104   34

Table 4.3: The number of maxima whose elevation is greater than the threshold, from surfaces K_1 and K_2.
λ-covered by M_1(δ), called the covering density of M_2(δ) over M_1(δ), for various δ's and
for a fixed covering radius λ (in Å). Similarly, we consider the covering density of M_1(δ)
over M_2(δ). The covering density increases in general as δ increases, indicating that larger
features are preserved better than smaller ones as the surface mesh becomes coarser.
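The covering density just described is simple to compute directly for point sets of the sizes in Table 4.3. A minimal sketch (our own naming, assuming maxima given as coordinate arrays):

```python
import numpy as np

def covering_density(P, Q, lam):
    """Fraction of the points in P that are lam-covered by Q, i.e. that
    have some point of Q within Euclidean distance lam."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if len(P) == 0:
        return 1.0
    # all pairwise distances; quadratic but fine for a few thousand maxima
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return float((d.min(axis=1) <= lam).mean())
```

With this convention, the density of M_2(δ) over M_1(δ) is `covering_density(M2_delta, M1_delta, lam)`.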
δ                                 0.1  0.2  0.3  0.5  0.8  1.0  2.0  4.0
density of M_2(δ) over M_1(δ)     …
density of M_1(δ) over M_2(δ)     …

Table 4.4: As δ changes (upper row), the covering density of M_2(δ) over M_1(δ) (middle row) and that of M_1(δ) over M_2(δ) (lower row).
4.7 Notes and Discussion
The main contribution of this chapter is the definition of elevation as a real-valued function
on a 2-manifold embedded in R^3 and the computation of all of its local maxima. The definition
of this function can be extended to a d-manifold embedded in R^{d+1}.

The logical next step in this research is the exploitation of the maxima in protein dock-
ing and other shape-matching problems. We will describe one such approach in Chapter 6.
It would also be worth exploring extensions of our results to manifolds with boundary. A
crucial first step will have to be the generalization of the concept of extended persistence
to these more general topological spaces. Another interesting direction of research is iden-
tifying "features" (such as those computed by the elevation function) directly from a point
cloud sampled from some underlying surface: surface reconstruction in general is hard for
undersampled and/or noisy point sets. However, it is easy to construct a simplicial complex
K (which may not be a manifold) that roughly describes the surface and may have a
different topology. By computing the points with maximal elevation on K, and keeping
only those with high elevation, it is still possible to identify important features of the
underlying surface.
Finally, the algorithm presented in Section 4.5 enumerates all local maxima of the
elevation function without computing the elevation function itself, other than at a collec-
tion of candidate points. This approach is suggested by the ambiguities that arise in the
definition of the elevation function for piecewise linear data. Unfortunately, it implies the
fairly high running time of O(n^5 log n) in the worst case. Can the maxima be enumerated
more efficiently than that? Is there an algorithm that enumerates all maxima above some
elevation threshold without computing the maxima below the threshold?
Figure 4.16: Visualization of elevation on the skin surface of protein 1brs. Roughly, the higher the elevation, the darker the color.
Chapter 5
Matching via Hausdorff Distance
5.1 Introduction
The problem of shape matching in two and three dimensions arises in a variety of applica-
tions, including computer graphics, computer vision, pattern recognition, computer aided
design, and molecular biology [17, 97, 148]. For example, proteins with similar shapes
are likely to have similar functionalities; therefore, classifying proteins (or their fragments)
based on their shapes is an important problem in computational biology. Similarly, the
proclivity of two proteins to bind with each other also depends on their shapes, so shape
matching is central to the protein docking problem in molecular biology [97].
Informally, the shape-matching problem can be described as follows: Given a distance
measure between two sets of objects in R^2 or R^3, determine a transformation, from an al-
lowed set, that minimizes the distance between the sets. In many applications the allowed
transformations are all possible rigid motions. However, in certain applications there are
constraints on the allowed transformations. For example, if matching the pieces of a jigsaw
puzzle, it is important that no two pieces overlap each other in their matched positions. An-
other example is the aforementioned docking problem, where two molecules bind together
to form a compound, and, clearly, at this docking position the molecules should occupy
disjoint portions of space [97]. Moreover, because of efficiency considerations, one some-
times restricts further the set of allowed transformations, most typically to translations
only.
Several distance measures between objects have been proposed, varying with the kind
of input objects and the application. One common distance measure is the Hausdorff dis-
tance [17], originally proposed for point sets. In this chapter we adopt this measure, extend
it to non-point objects (mainly, disks and balls), and apply it to several variants of the shape
matching problem, with and without constraints on the allowed transformations. We are
primarily interested in the case of balls because of molecular-biology applications, where
a molecule is typically modeled as a set of balls, with each atom represented by a ball [19].
Problem statement. Let A and B be two (possibly infinite) sets of geometric objects
(e.g., points, balls, simplices) in R^d, and let d(a, b) be a distance function between
the objects in A and in B. For a ∈ A, we define σ(a, B) = min_{b ∈ B} d(a, b). Similarly, we
define σ(b, A) = min_{a ∈ A} d(a, b), for b ∈ B. The directional Hausdorff distance between A
and B is defined as

h(A, B) = max_{a ∈ A} σ(a, B),

and the Hausdorff distance between A and B is defined as

H(A, B) = max { h(A, B), h(B, A) }.

(It is important to note that in this definition each object in A or in B is considered as a
single entity, and not as the set of its points.) In order to measure similarity between A
and B, we compute the minimum value of the Hausdorff distance over all translates of A
within a given set T ⊆ R^d of allowed translation vectors. Namely, we define

H_T(A, B) = min_{t ∈ T} H(A + t, B),

where A + t = { a + t | a ∈ A }. In our applications, T will either be the entire R^d or
F, the set of collision-free translates of A, at which none of its objects intersects any object of
B. The collision-free matching between objects is useful for applications (like the docking
problem) in which the goal is to determine a transformation t so that the shape of A + t
best complements that of B. We write H^F(A, B) for H_T(A, B) with T = F.
As already mentioned, our definition of the (directional) Hausdorff distance is slightly dif-
ferent from the one typically used in the literature [17], in which one considers the two
unions ∪A and ∪B as two (possibly infinite) point sets and computes the standard Hausdorff
distance between them:

H(∪A, ∪B) = max { h(∪A, ∪B), h(∪B, ∪A) },  where  h(∪A, ∪B) = max_{p ∈ ∪A} min_{q ∈ ∪B} ||p − q||.

We will denote ∪A (resp. ∪B) by Â (resp. B̂), and use the notation H(Â, B̂) for this
distance. Analogous meanings hold for the notations h(Â, B̂) and H_T(Â, B̂).
A drawback of the directional Hausdorff distance (and thus of the Hausdorff distance)
is its sensitivity to outliers in the given data. One possible approach to circumvent this
problem is to use "partial matching" [57], but then one has to determine how many (and
which) of the objects in A should be matched to B. Another possible approach is to use
the root-mean-square (rms, for brevity) Hausdorff distance between A and B, defined by

h_2(A, B) = ( (1/|A|) ∫_{a ∈ A} σ(a, B)^2 da )^{1/2},
H_2(A, B) = max { h_2(A, B), h_2(B, A) },

with an appropriate definition of integration (usually, summation over a finite set or
Lebesgue integration over infinite point sets). Define

H_{2,T}(A, B) = min_{t ∈ T} H_2(A + t, B).

Finally, we define the summed Hausdorff distance to be

h_1(A, B) = (1/|A|) ∫_{a ∈ A} σ(a, B) da,
H_1(A, B) = max { h_1(A, B), h_1(B, A) },

and similarly define H_{1,T}. Informally, H_1(A, B) can be regarded as an L_1-distance
over the sets of objects A and B. The two new definitions replace the L_∞ norm implicit
in the Hausdorff distance by the L_2 and L_1 norms, respectively.
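For finite point sets, the L_2 and L_1 variants just defined can again be computed directly. A hedged sketch (our naming; sums normalized by the number of points, matching the finite-sum reading of the integrals above):

```python
import math

def sigma(a, B):
    """Distance from a to its nearest neighbor in B."""
    return min(math.dist(a, b) for b in B)

def rms_hausdorff(A, B):
    """H_2: max of the two directional root-mean-square distances."""
    def h2(P, Q):
        return math.sqrt(sum(sigma(p, Q) ** 2 for p in P) / len(P))
    return max(h2(A, B), h2(B, A))

def summed_hausdorff(A, B):
    """H_1: max of the two directional averaged (L1) distances."""
    def h1(P, Q):
        return sum(sigma(p, Q) for p in P) / len(P)
    return max(h1(A, B), h1(B, A))
```

Unlike the plain Hausdorff distance, a single outlier changes these averaged variants by only O(1/|A|) times its distance, which is the robustness motivation given above.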
Prior work. It is beyond the scope of this section to discuss all the results on shape
matching. We refer the reader to [17, 97, 148] and references therein for a sample of
known results. Here we summarize known results on shape matching using the Hausdorff
distance measure.
Most of the early work on computing the Hausdorff distance focused on finite point sets.
Let A and B be two families of m and n points, respectively. In the plane, H(A, B)
can be computed in O((m + n) log(m + n)) time using Voronoi diagrams [14]. In R^3, it
can be computed using the range-searching data structure of Agarwal and Matoušek [9].
Huttenlocher et al. [107] showed how to compute the minimum Hausdorff distance under
translation, H_T(A, B), in the plane and in R^3, and Chew et al. [57] presented an
algorithm to compute it in R^d for any d. The minimum Hausdorff distance between
A and B under rigid motions in the plane can also be computed in polynomial time [106].
Faster approximation algorithms to compute H_T(A, B) were first proposed by Goodrich
et al. [95]. Aichholzer et al. proposed a framework of approximation algorithms using
reference points [12]. In the plane, their algorithm approximates the optimal Hausdorff dis-
tance over all translations within a constant factor in O((m + n) log(m + n)) time, and it
also handles rigid motions at a slightly higher cost. The reference-point approach can be
extended to higher dimensions. However, it neither approximates the directional Hausdorff
distance over a set of transformations, nor can it cope with the partial-matching problem.
Indyk et al. [110] study the partial matching problem: given a query threshold, com-
pute the maximum number of points of A that can be matched to points of B within
that threshold. They present algorithms for approximating the maximum-size partial
matching over the set of rigid motions in R^2 and in R^3, with running times depending
polynomially on the spread Δ, where Δ is the maximum of the spreads of the two point
sets¹. Their algorithm can be extended to approximate the minimum summed Hausdorff
distance over rigid motions. Similar results were independently achieved in [46] via a
different technique.
Algorithms have also been developed for computing h(A, B) and/or H(A, B) where A
and B are sets of segments in the plane, or sets of simplices in higher dimensions
[11, 14, 15]. Atallah computed H(A, B) for two convex polygons [25]. Agarwal et al. [11]
provide an algorithm for computing the minimum Hausdorff distance under translation for
sets of segments in the plane, and for the case where any rigid motion is allowed, the
minimum Hausdorff distance can likewise be computed in polynomial time (Chew et
al. [58]). Aichholzer et al. approximate the minimum Hausdorff distance under different
families of transformations for sets of points and segments in R^2, and sets of triangles
in R^3, using reference points [12]. Other than that, little is known about computing
h_T(A, B) or H_T(A, B) where A and B are simplices or other geometric shapes in higher
dimensions.
Our results. In this chapter, we develop efficient algorithms for computing H_F(A, B) and H_F^{k,l}(A, B) for sets of balls, and for approximating the minimum rms and summed Hausdorff distances, σ_T(A, B) and Σ_T(A, B), for sets of points in R^d. Consequently, the chapter consists of three parts: the first two deal with the two variants of the Hausdorff distance for balls, and the third studies the rms and summed Hausdorff-distance problems for point sets.
Let b(p, r) denote the ball in R^d of radius r centered at p. Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two families of balls in R^d, where a_i = b(p_i, r_i) and b_j = b(q_j, s_j), for each i and j. Let F be the set of all translation vectors t ∈ R^d such that no ball of A ⊕ t intersects any ball of B.
Section 5.2 considers the problem of computing the Hausdorff distance between two sets A and B of balls under the collision-free constraint, where the distance between two
¹The spread of a set of points is the ratio of its diameter to its closest-pair distance.
disjoint balls a = b(p, r) and b = b(q, s) is defined as ρ(a, b) = ||p − q|| − r − s. We can regard this distance as an additively weighted Euclidean distance between the centers of a and b, and it is a common way of measuring the distance between atoms in molecular biology [97]. In Section 5.2 we describe algorithms for computing H_F(A, B) in two and three dimensions. The running time is O(mn(m+n) log^2(mn)) in R^2, and O(m^2 n^2 (m+n) log^2(mn)) in R^3. The approach can be extended to solve the (collision-free) partial-matching problem under this variant of the Hausdorff distance at a slightly higher asymptotic cost.
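The collision-free distance between balls and the resulting Hausdorff distance can be stated compactly in code. The sketch below is a brute-force O(mn) evaluator for a fixed placement — not the chapter's algorithm — and the function names and the representation of a ball as a (center, radius) tuple are our own illustrative choices.

```python
import math

def rho(a, b):
    """Distance between two disjoint balls a = (center, r) and b = (center, s):
    the Euclidean distance between the centers minus both radii."""
    (p, r), (q, s) = a, b
    d = math.dist(p, q) - r - s
    assert d >= 0, "balls must be disjoint in the collision-free model"
    return d

def directed_hausdorff(A, B):
    """h(A, B) = max over a in A of min over b in B of rho(a, b)."""
    return max(min(rho(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max{h(A, B), h(B, A)}."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [((0.0, 0.0), 1.0), ((4.0, 0.0), 0.5)]
B = [((0.0, 3.0), 1.0), ((4.0, 3.0), 1.0)]
print(hausdorff(A, B))
```

Evaluating this for every candidate placement is of course infeasible; the point of the chapter is to search the space of translations without enumerating it.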
Section 5.3 considers the problem of computing H_T(A, B) and H_F(A, B), i.e., the Hausdorff distance between the union U_A of the balls in A and the union U_B of the balls in B, minimized over all translates of A in R^d or in F, respectively. We first describe an O(mn(m+n) log^2(mn))-time algorithm for computing H_T(A, B) and H_T^{k,l}(A, B) in R^2, which relies on several geometric properties of the union of disks. A straightforward extension of our algorithm to R^3 is harder to analyze and does not yield efficient bounds on its running time, so we consider approximation algorithms. In particular, given a parameter ε > 0, we compute a translation t, in time O((1/ε^2)(m+n) log^2(m+n)) in R^2 and in time O((1/ε^3)(m+n)^2 log(m+n)) in R^3, such that H(U_A ⊕ t, U_B) ≤ (1 + ε) H_T(A, B).
We also present a "pseudo-approximation" algorithm for computing H_F(A, B): given an ε > 0, the algorithm computes a region R ⊆ R^3, an ε-approximation of F (in a sense defined formally in Section 5.3), and returns a placement t ∈ R such that
H(U_A ⊕ t, U_B) ≤ (1 + ε) min_{t′ ∈ R} H(U_A ⊕ t′, U_B),
in time O(((m+n)^2/ε^3) log((m+n)/ε)) in R^3. This variant of approximation makes sense in applications where the data is noisy and shallow penetrations between objects are allowed, as is the case in the docking problem [97].
Finally, let A and B be two sets of points in R^d. Section 5.4 describes an algorithm
that computes an ε-approximation of σ_T(A, B) in time O((mn/ε^d) log(mn/ε)).² It also provides a data structure that can return in O(log(mn/ε)) time an ε-approximation of σ(A ⊕ t, B) for a query vector t ∈ R^d. In fact, we solve a more general problem, which is interesting in its own right. Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, we construct a decomposition of R^d into O(n/ε^d) cells that is an ε-approximation of each of the Voronoi diagrams of P_1, ..., P_k, in the sense defined in [24, 98]. Moreover, given a semigroup operation ⊛, we can preprocess this decomposition in O((n/ε^d) log(n/ε)) time, so that an ε-approximation of ⊛_{1≤i≤k} d(q, P_i), for a query point q, can be computed in O(log(n/ε)) time. We also extend the approach to ε-approximating the minimum summed Hausdorff distance Σ_T(A, B), at the cost of additional factors of 1/ε and log(mn) in the running time. This result relies on a dynamic data structure, which we propose, for maintaining an ε-approximation of the 1-median of a point set in R^d under insertion and deletion of points.
5.2 Collision-Free Hausdorff Distance between Sets of Balls
Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two sets of balls in R^d, d = 2, 3. For two disjoint balls a_i = b(p_i, r_i) and b_j = b(q_j, s_j), we define
ρ(a_i, b_j) = ||p_i − q_j|| − r_i − s_j,
namely, the (minimum) distance between a_i and b_j as point sets. Let F be the set of placements t of A such that no ball in A ⊕ t intersects any ball of B. In this section, we describe an exact algorithm for computing H_F(A, B), and show that it can be extended to partial matching.
²Indyk et al. [110] outline an approximation algorithm for computing Σ_T(A, B) without providing any details. We believe that, were the details of their algorithm worked out, the running time of our algorithm would be better. Moreover, our algorithm is more direct.
5.2.1 Computing H_F(A, B) in 2D and 3D
As is common in geometric optimization, we first present an algorithm for the decision problem: given a parameter δ > 0, we wish to determine whether H_F(A, B) ≤ δ. We then use the parametric-searching technique [11, 133] to compute H_F(A, B). Given δ > 0, for 1 ≤ i ≤ m, let V_i ⊆ R^d be the set of vectors t ∈ R^d such that
(V1) a_i ⊕ t does not intersect the interior of any b_j ∈ B;
(V2) min_{1≤j≤n} ρ(a_i ⊕ t, b_j) ≤ δ.
Let K_ij = {t ∈ R^d | ρ(a_i ⊕ t, b_j) ≤ δ} and L_ij = {t ∈ R^d | (a_i ⊕ t) ∩ int(b_j) ≠ ∅}; K_ij is the ball b(q_j − p_i, r_i + s_j + δ), and L_ij is the ball b(q_j − p_i, r_i + s_j). Then K_i = ∪_{j=1}^n K_ij is the set of vectors that satisfy (V2), and the interior of L_i = ∪_{j=1}^n L_ij is the set of vectors that violate (V1). Clearly, V_i = K_i \ int(L_i). Let
Φ(A, B) = ∩_{i=1}^m V_i = ∩_{i=1}^m ( K_i \ int(L_i) ).
See Figure 5.1 for an illustration. By definition, Φ(A, B) ⊆ F is the set of vectors t ∈ F such that h(A ⊕ t, B) ≤ δ. Similarly, we define
Φ(B, A) = {t ∈ F | h(B, A ⊕ t) ≤ δ}.
Thus H_F(A, B) ≤ δ if and only if Φ(A, B) ∩ Φ(B, A) ≠ ∅.
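For intuition, membership of a single candidate translation t in the feasible region can be tested directly from the definitions: for every ball of A, the shifted ball must stay at least r_i + s_j away (center to center) from every ball of B, and come within r_i + s_j + δ of at least one of them. The sketch below (with hypothetical names) checks exactly that for one vector t; the actual algorithm instead constructs the whole region Φ(A, B) at once.

```python
import math

def in_phi(t, A, B, delta):
    """Check whether translation t is feasible: for every ball (p, r) of A,
    the shifted ball avoids the interior of every ball (q, s) of B (no
    collision) and lies within collision-free distance delta of some ball."""
    for (p, r) in A:
        shifted = tuple(pc + tc for pc, tc in zip(p, t))
        hits_near = False
        for (q, s) in B:
            d = math.dist(shifted, q)
            if d < r + s:           # collision: t lies inside some L_ij
                return False
            if d <= r + s + delta:  # condition (V2) holds for this ball of A
                hits_near = True
        if not hits_near:           # a_i + t is farther than delta from all of B
            return False
    return True
```

This per-translation test takes O(mn) time; the decision procedure of this section avoids testing translations one by one by manipulating the regions K_i and L_i globally.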
Lemma 5.2.1 The combinatorial complexity of Φ(A, B) in R^2 is O(m^2 n).
PROOF. If an edge of ∂Φ(A, B) is not adjacent to any vertex, then it is the entire circle bounding a disk K_ij or L_ij. There are O(mn) such disks, so it suffices to bound the number of vertices of Φ(A, B). Let v be a vertex of Φ(A, B); v is either a vertex of V_i, for some 1 ≤ i ≤ m, or an intersection point of an edge of V_i and an edge of V_{i′}, for some 1 ≤ i < i′ ≤ m. In the latter case,
v ∈ ∂V_i ∩ ∂V_{i′} ⊆ ( ∂K_i ∪ ∂L_i ) ∩ ( ∂K_{i′} ∪ ∂L_{i′} ).
(a) (b)
Figure 5.1: (a) The disks L_ij (inner) and K_ij (outer) centered at q_j − p_i; (b) an example of V_i (dark region), the difference between K_i (the whole union) and the interior of L_i (inner light region).
In other words, such a vertex of Φ(A, B) lies on two of the boundaries ∂K_i, ∂L_i, ∂K_{i′}, ∂L_{i′}. Observe that a point of ∂K_i ∩ ∂K_{i′} that appears on ∂V_i ∩ ∂V_{i′} is also a vertex of K_i ∪ K_{i′}, and similarly for the other three pairs. Therefore, every vertex of Φ(A, B) of the latter kind is a vertex of K_i ∪ K_{i′}, K_i ∪ L_{i′}, L_i ∪ K_{i′}, or L_i ∪ L_{i′}, for some 1 ≤ i < i′ ≤ m. Since each of K_i, L_i, K_{i′}, L_{i′} is the union of a set of n disks, each of these four sets is the union of a set of 2n disks and thus has O(n) vertices [120]. Hence, Φ(A, B) has O(m^2 n) vertices.
Lemma 5.2.2 The combinatorial complexity of Φ(A, B) in R^3 is O(m^3 n^2).
PROOF. The number of faces or edges of Φ(A, B) that do not contain any vertex is O(m^2 n^2), since each such face or edge is defined by at most two balls in a family of O(mn) balls. We therefore focus on the number of vertices of Φ(A, B). As in the proof of Lemma 5.2.1, any vertex v of Φ(A, B) satisfies
v ∈ ( ∂K_i ∪ ∂L_i ) ∩ ( ∂K_{i′} ∪ ∂L_{i′} ) ∩ ( ∂K_{i″} ∪ ∂L_{i″} ),
for some 1 ≤ i < i′ < i″ ≤ m. Again, such a vertex is also a vertex of X_1 ∪ X_2 ∪ X_3, where X_1 (resp., X_2, X_3) is K_i or L_i (resp., K_{i′} or L_{i′}, K_{i″} or L_{i″}). Since the union of N balls in R^3 has O(N^2) vertices, X_1 ∪ X_2 ∪ X_3, being a union of 3n balls, has O(n^2) vertices, thereby implying that Φ(A, B) has O(m^3 n^2) vertices.
Similarly, we can prove that the complexity of Φ(B, A) is O(mn^2) in R^2 and O(m^2 n^3) in R^3. Extending the preceding arguments a little, we obtain the following.
Lemma 5.2.3 Φ(A, B) ∪ Φ(B, A) has combinatorial complexity O(mn(m+n)) in R^2, and O(m^2 n^2 (m+n)) in R^3.
Remark. The above argument in fact bounds the complexity of the arrangement A(K) of K = {K_ij, L_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ n}: since K consists of O(mn) disks, any two of whose bounding circles cross in at most two points, the entire arrangement has O(m^2 n^2) vertices in R^2.
We exploit a divide-and-conquer approach, combined with a plane sweep, to compute Φ(A, B), Φ(B, A), and their intersection in R^2. For example, to compute Φ(A, B) = ∩_i V_i, we recursively compute Φ_1 = ∩_{i ≤ m/2} V_i and Φ_2 = ∩_{i > m/2} V_i, and merge Φ(A, B) = Φ_1 ∩ Φ_2 by a plane sweep. The overall running time is O(mn(m+n) log(mn)).
To decide whether Φ(A, B) ∩ Φ(B, A) ≠ ∅ in R^3, it suffices to check whether
X_ij = ∂K_ij ∩ Φ(A, B) ∩ Φ(B, A)
is empty for every ball K_ij, L_ij ∈ K. Using the fact that the other balls of K meet the sphere ∂K_ij in a collection of spherical caps, we can compute X_ij in time O(mn(m+n) log(mn)), by the same divide-and-conquer approach as for computing Φ(A, B) ∩ Φ(B, A) in R^2. Therefore, we can determine in O(m^2 n^2 (m+n) log(mn)) time whether H_F(A, B) ≤ δ in R^3.
Finally, the optimization problem can be solved by the parametric-search technique [11]. In order to apply parametric search, we need a parallel version of the above decision procedure. However, the above divide-and-conquer paradigm uses a plane sweep during the merge stage, which is not easy to parallelize. We therefore instead adopt the algorithm of [11] for computing the union/intersection of two planar or spherical regions. It yields a parallel algorithm that determines whether Φ(A, B) ∩ Φ(B, A) is empty in O(log(mn)) time, using O(mn(m+n)) processors in R^2 and O(m^2 n^2 (m+n)) processors in R^3. This implies the following.
Theorem 5.2.4 Given two sets A and B of m and n disks (or balls), we can compute H_F(A, B) in time O(mn(m+n) log^2(mn)) in R^2, and in time O(m^2 n^2 (m+n) log^2(mn)) in R^3.
5.2.2 Partial matching
Extending the definition of partial matching in [110], we define the partial collision-free Hausdorff distance problem as follows.
Given an integer k, let h^k(A, B) denote the k-th largest value in the set {ρ(a_i, B) | 1 ≤ i ≤ m}; note that h(A, B) = h^1(A, B). We define h^l(B, A) in a fully symmetric manner, and then define H^{k,l}(A, B) and H_F^{k,l}(A, B) as above. The preceding algorithm can be extended to compute H_F^{k,l}(A, B) at a slightly higher asymptotic cost. We briefly illustrate the two-dimensional case. Let K = {K_ij, L_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ n} be as defined above, and let A(K) be the arrangement of K. For each cell C of A(K), let κ(C) be the number of V_i's that contain C. Note that for any vector t in a cell C ⊆ F with κ(C) ≥ m − k + 1, we have h^k(A ⊕ t, B) ≤ δ, and vice versa. Hence, we compute A(K) and κ(C) for each cell C of A(K), and then discard every cell C for which κ(C) < m − k + 1, as well as the cells not contained in F. The remaining cells form the set Φ_k = {t ∈ F | h^k(A ⊕ t, B) ≤ δ}. By the Remark following Lemma 5.2.3, A(K) has O(m^2 n^2) vertices, and κ can be computed for all cells by traversing the arrangement. Therefore, Φ_k can be computed in O(m^2 n^2 log(mn)) time. Similarly, we can compute Φ′_l = {t ∈ F | h^l(B, A ⊕ t) ≤ δ} in O(m^2 n^2 log(mn)) time, and we can then determine in O(m^2 n^2 log(mn)) time whether Φ_k ∩ Φ′_l ≠ ∅. Similar arguments solve the partial-matching problem in R^3. Putting everything together, we obtain the following.
Theorem 5.2.5 Let A and B be two families of m and n balls, respectively, and let k, l be integers. We can compute H_F^{k,l}(A, B) in O(m^2 n^2 log(mn)) time in R^2, and in O(m^3 n^3 log(mn)) time in R^3.
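For a fixed placement, the partial directed distance h^k is just an order statistic of the m ball-to-set distances, and can be sanity-checked by brute force in O(mn + m log m) time. The sketch below uses our own illustrative names; it is a stand-in for verification, not the arrangement-based algorithm.

```python
import math

def partial_directed_hausdorff(A, B, k):
    """h^k(A, B): the k-th largest of the values rho(a_i, B) = min_j rho(a_i, b_j),
    so up to k - 1 outlier balls of A are effectively ignored; k = 1 gives h(A, B).
    Balls are (center, radius) tuples; radius 0 gives plain point sets."""
    def rho(a, b):
        (p, r), (q, s) = a, b
        return math.dist(p, q) - r - s
    vals = sorted((min(rho(a, b) for b in B) for a in A), reverse=True)
    return vals[k - 1]
```

With radius-zero balls this reproduces the point-set partial Hausdorff distance of [110].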
5.3 Hausdorff Distance between Unions of Balls
In Section 5.3.1 we describe the algorithm for computing H_T(A, B) in R^2. The same approach can be extended to compute H_T^{k,l}(A, B) within the same asymptotic time complexity. In Section 5.3.2, we present approximation algorithms for the same problem in R^2 and R^3.
5.3.1 The exact 2D algorithm
Let A = {a_1, ..., a_m} and B = {b_1, ..., b_n} be two sets of disks in the plane. Write, as above, a_i = b(p_i, r_i), for i = 1, ..., m, and b_j = b(q_j, s_j), for j = 1, ..., n. Let U_A (resp., U_B) be the union of the disks in A (resp., B). As in Section 5.2, we focus on the decision problem for a given parameter δ > 0.
For any point x, we have
d(x, U_B) = min_{y ∈ U_B} ||x − y|| = min_{1≤j≤n} d(x, b_j) = min_{1≤j≤n} max{ ||x − q_j|| − s_j, 0 }.
This value is greater than δ if and only if
min_{1≤j≤n} ||x − q_j|| − s_j > δ.
In other words, h(U_A ⊕ t, U_B) ≤ δ if and only if every point x ∈ U_A satisfies x + t ∈ ∪_{j=1}^n b^δ_j, where b^δ_j = b(q_j, s_j + δ) is the disk b_j expanded by δ. Let
T = {t | h(U_A ⊕ t, U_B) ≤ δ};
T is the set of all translations t such that U_A ⊕ t ⊆ U^δ_B, where U^δ_B = ∪_j b^δ_j. Our decision procedure computes the set T and the analogously defined set
T′ = {t | h(U_B, U_A ⊕ t) ≤ δ},
and then tests whether T ∩ T′ ≠ ∅. To understand the structure of T, we first study the case in which A consists of just one disk a, with center p and radius r. For simplicity of notation, we temporarily denote U^δ_B by U. Let V be the set of vertices of ∂U and E the set of (relatively open) edges of ∂U; |V| + |E| = O(n) [120].
Consider the Voronoi diagram Vor(V ∪ E) of the boundary features of U, clipped to within U. This is a decomposition of U into cells, so that, for each w ∈ V ∪ E, the cell τ(w) is the set of points x ∈ U such that d(x, w) ≤ d(x, w′) for all w′ ∈ V ∪ E. The diagram is closely related to the medial axis of ∂U. See Figure 5.2(a). For each e ∈ E, let σ(e) denote the circular sector spanned by e within the expanded disk b^δ_j whose boundary contains e, and let S = ∪_{e ∈ E} σ(e). The diagram has the following structure; a variant of the following lemma was observed in [19].
Lemma 5.3.1 (a) For each e ∈ E, we have τ(e) = σ(e).
(b) For each v ∈ V, we have τ(v) = Vor(v) \ S, where Vor(v) is the Voronoi cell of v in the Voronoi diagram Vor(V) of V. Moreover, τ(v) is a convex polygon.
Lemma 5.3.1 implies that Vor(V ∪ E) yields a convex decomposition of U of linear size. Returning to the structure of T: by definition, a ⊕ t ⊆ U if and only if p + t ∈ U and d(p + t, w) ≥ r, where w is the feature of ∂U whose cell contains p + t. This implies that the set T(a) of all translations t of a for which a ⊕ t ⊆ U is given by
T(a) = ∪_{w ∈ V ∪ E} τ_r(w),
where
τ_r(w) = {t | p + t ∈ τ(w), d(p + t, w) ≥ r}.
(a) (b) (c)
Figure 5.2: (a) Medial axis (dotted segments) of the union of four disks centered at the solid points: the Voronoi diagram decomposes the union into cells; (b) shrinking by r the Voronoi cell τ(w) of each boundary element w of the union; (c) a light-colored disk bounds a convex arc, and a dark disk bounds a concave arc.
For e ∈ E, τ_r(e) is the sector obtained from τ(e) by shrinking it by distance r towards its center. For v ∈ V, τ_r(v) = τ(v) \ b(v, r). See Figure 5.2(b) for an illustration.
Now return to the original case in which A consists of m disks; we obtain
T = ∩_{i=1}^m T(a_i).
Note that each T(a_i) is bounded by O(n) circular arcs, some of which are convex (those bounding shrunk sectors) and some concave (those bounding shrunk Voronoi cells of vertices). Each convex arc lies on a circle bounding a disk b(q_j − p_i, s_j + δ − r_i), for some 1 ≤ j ≤ n, while each concave arc lies on a circle bounding a disk b(v − p_i, r_i), for some v ∈ V. Furthermore, since T(a_i) is obtained by removing all translations that bring p_i within distance less than r_i of ∂U, we have: (i) b(q_j − p_i, s_j + δ − r_i) ⊆ T(a_i); and (ii) int b(v − p_i, r_i) ∩ T(a_i) = ∅. See Figure 5.2(c) for an illustration.
Lemma 5.3.2 For any pair of disks a_1, a_2 ∈ A, the complexity of T(a_1) ∩ T(a_2) is O(n).
PROOF. Clearly, T(a_1) ∩ T(a_2) is bounded by circular arcs, whose endpoints are either vertices of T(a_1) or T(a_2), or intersection points between an arc of ∂T(a_1) and an arc of ∂T(a_2). It suffices to estimate the number of vertices of the latter kind.
Consider the set D of the O(n) disks
D = { b(q_j − p_1, s_j + δ − r_1), b(q_j − p_2, s_j + δ − r_2) | 1 ≤ j ≤ n } ∪ { b(v − p_1, r_1), b(v − p_2, r_2) | v ∈ V }.
We claim that any intersection point between two arcs, from ∂T(a_1) and ∂T(a_2) respectively, lies on ∂(∪D). Indeed, assume that x is such an intersection point that does not lie on ∂(∪D). Then there is a disk D ∈ D that contains x in its interior. There are two possibilities for the choice of D.
(i) D = b(q_j − p_1, s_j + δ − r_1) (resp., D = b(q_j − p_2, s_j + δ − r_2)) for some 1 ≤ j ≤ n. Such a disk bounds some convex arc of ∂T(a_1) (resp., ∂T(a_2)), and D ⊆ T(a_1) (resp., D ⊆ T(a_2)). As such, x cannot appear on the boundary of T(a_1) (resp., T(a_2)), contrary to assumption.
(ii) D = b(v − p_1, r_1) (resp., D = b(v − p_2, r_2)) for some v ∈ V. Recall that V is the set of vertices on the boundary ∂U. Since x lies in the interior of D, the vertex v lies in the interior of a_1 ⊕ x (resp., a_2 ⊕ x), so a_1 ⊕ x (resp., a_2 ⊕ x) cannot be fully contained in U, implying that x ∉ T(a_1) (resp., x ∉ T(a_2)). Contradiction.
Thus we have proved the claim by contradiction. It then follows, using the bound of [120], that the number of intersections under consideration is at most the complexity of ∪D, which is O(|D|) = O(n).
Each vertex of T is also a vertex of some T(a_i) ∩ T(a_{i′}). Applying the preceding lemma to all O(m^2) pairs a_i, a_{i′}, we obtain the following.
Lemma 5.3.3 The complexity of T is O(m^2 n), and it can be computed in O(m^2 n log(mn)) time.
Similarly, the set T′ has complexity O(mn^2) and can be computed in time O(mn^2 log(mn)). Finally, we can determine whether T ∩ T′ ≠ ∅, by a plane sweep, in time O(mn(m+n) log(mn)). Using parametric search as in [11], H_T(A, B) can be computed in O(mn(m+n) log^2(mn)) time.
To compute H_T^{k,l}(A, B), we follow the same approach as for computing H_F^{k,l}(A, B) in the preceding section. Combined with an argument similar to the one above, we can show the following.
Theorem 5.3.4 Given two families A and B of m and n disks in R^2, we can compute both H_T(A, B) and H_T^{k,l}(A, B) in time O(mn(m+n) log^2(mn)).
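As a concrete, heavily simplified stand-in for the sweep-based machinery, the directed Hausdorff distance between unions of disks can be estimated by sampling U_A on a dense grid and evaluating the closed-form distance to U_B; the result is accurate only to roughly one grid step, and all names below are our own.

```python
import math

def dist_to_union(x, B):
    """d(x, U_B) for a union of disks B = [(center, radius), ...]:
    min over disks of max(||x - q|| - s, 0)."""
    return min(max(math.dist(x, q) - s, 0.0) for (q, s) in B)

def sampled_directed_union_hausdorff(A, B, t, step=0.05):
    """Estimate h(U_A + t, U_B) by maximizing d(., U_B) over grid samples of
    U_A -- a brute-force check, not the exact plane-sweep algorithm."""
    best = 0.0
    for (p, r) in A:
        k = max(1, int(2 * r / step))
        for i in range(k + 1):
            for j in range(k + 1):
                x = (p[0] - r + 2 * r * i / k, p[1] - r + 2 * r * j / k)
                if math.dist(x, p) <= r:  # keep only samples inside the disk
                    xt = (x[0] + t[0], x[1] + t[1])
                    best = max(best, dist_to_union(xt, B))
    return best
```

Note that the maximum of d(·, U_B) over U_A need not lie on ∂U_A, which is why the sketch samples the full disks rather than only their boundaries.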
5.3.2 Approximation algorithms
No good bounds are known for the complexity of the Voronoi diagram of the boundary of the union of n balls in R^3, or, more precisely, for the complexity of the portion of the diagram inside the union [19]. Hence, a naïve extension of the preceding exact algorithm to R^3 yields an algorithm whose running time is hard to calibrate, and only rather weak upper bounds can be derived. We therefore resort to approximation algorithms.
Approximating H_T(A, B) in R^2 and R^3. Given a parameter ε > 0, we wish to compute a translation t of A such that H(U_A ⊕ t, U_B) ≤ (1 + ε) H_T(A, B), i.e., H(U_A ⊕ t, U_B) is an ε-approximation of H_T(A, B). Our approximation algorithm for H_T(A, B) follows the same approach as the one used in [12, 14]. That is, let ref(A) (resp., ref(B)) denote the bottom-left corner, called the reference point, of the axis-parallel bounding box of U_A (resp., U_B). Set t_0 = ref(B) − ref(A). It is shown in [14] that H(U_A ⊕ t_0, U_B) ≤ c_d · H_T(A, B), for a constant c_d that depends only on the dimension. Computing t_0 takes O(m + n) time. We compute H(U_A ⊕ t_0, U_B) using the parametric-search technique [11], which is based on the following simple implementation of the decision procedure.
Put A^δ = { b(p_i, r_i + δ) | 1 ≤ i ≤ m } and B^δ = { b(q_j, s_j + δ) | 1 ≤ j ≤ n }. For a given parameter δ > 0, we observe that H(U_A ⊕ t_0, U_B) ≤ δ if and only if U_A ⊕ t_0 ⊆ U_{B^δ} and U_B ⊆ U_{A^δ} ⊕ t_0.
To test whether U_A ⊕ t_0 ⊆ U_{B^δ}, we compute the union of the balls in (A ⊕ t_0) ∪ B^δ and check whether any ball of A ⊕ t_0 appears on its boundary. If none does, then U_A ⊕ t_0 ⊆ U_{B^δ}. Similarly, we test whether U_B ⊆ U_{A^δ} ⊕ t_0. The total time spent is O((m+n) log(m+n)) in R^2, and O((m+n)^2) in R^3.
In order to compute an ε-approximation of H_T(A, B) from this constant-factor approximation, we use the standard trick of placing a grid in the neighborhood of t_0 and returning the smallest H(U_A ⊕ t, U_B), where t ranges over the grid points. We conclude the following.
Theorem 5.3.5 Given two sets A and B of m and n balls, respectively, and ε > 0, an ε-approximation of H_T(A, B) can be computed in O((1/ε^2)(m+n) log^2(m+n)) time in R^2, and in O((1/ε^3)(m+n)^2 log(m+n)) time in R^3.
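The constant-factor-seed-then-grid strategy can be sketched as follows. Both helper names are our own, and the cost callback is an illustrative stand-in: in the actual algorithm each candidate is scored by an exact, parametric-search-based Hausdorff computation.

```python
import math

def reference_point(S):
    """Bottom-left corner of the axis-parallel bounding box of a union of
    disks S = [(center, radius), ...]."""
    return (min(p[0] - r for (p, r) in S), min(p[1] - r for (p, r) in S))

def refine_over_grid(A, B, cost, radius, step):
    """Seed with the constant-factor translation t0 = ref(B) - ref(A), then
    return the best translation on a grid of spacing `step` within distance
    `radius` of t0; `cost` scores one candidate translation."""
    ax, ay = reference_point(A)
    bx, by = reference_point(B)
    t0 = (bx - ax, by - ay)
    k = int(radius / step)
    candidates = [(t0[0] + i * step, t0[1] + j * step)
                  for i in range(-k, k + 1) for j in range(-k, k + 1)]
    return min(candidates, key=cost)
```

Choosing the grid radius proportional to the seed's cost and the spacing proportional to ε times that cost is what turns the constant factor into a 1 + ε factor.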
Pseudo-approximation for H_F(A, B). Currently we do not have an efficient algorithm to ε-approximate H_F(A, B) in R^3. Instead, we present a "pseudo-approximation" algorithm in the following sense.
Let N = {t | (U_A ⊕ t) ∩ U_B ≠ ∅} be the set of all placements of A at which U_A intersects U_B, and let F = cl(R^3 \ N). Clearly, H_F(A, B) = min_{t ∈ F} H(U_A ⊕ t, U_B). For a parameter ε > 0, let
N_ε = {t | ||p_i + t − q_j|| < (1 − ε)(r_i + s_j) for some i, j},
the set of placements at which some pair a_i ⊕ t, b_j penetrates to depth more than ε(r_i + s_j), and let F_ε = cl(R^3 \ N_ε) ⊇ F. We call a region R ⊆ R^3 ε-free if F ⊆ R ⊆ F_ε.
This notion of approximating F is motivated by applications in which the data is noisy and/or shallow penetration is allowed. For example, each atom in a protein is in fact a "fuzzy" ball instead of a hard ball [97]. We can model this fuzziness by allowing any atom b(q, s) ∈ B to be intersected by other atoms, but only within the shell b(q, s) \ b(q, (1 − ε)s), for some ε > 0. In this way, the atoms of two docking molecules may penetrate a little in the desired placement. Although F can have large complexity, namely up to Θ(m^2 n^2) in R^3, we present a technique for constructing an ε-free region R of considerably smaller complexity. We then compute R and a placement t_R ∈ R such that
H(U_A ⊕ t_R, U_B) ≤ (1 + ε) min_{t ∈ R} H(U_A ⊕ t, U_B).
We refer to such an approximation H(U_A ⊕ t_R, U_B) as a pseudo-ε-approximation of H_F(A, B).
Lemma 5.3.6 An ε-free region R of size O(mn/ε^3) can be computed in time O((mn/ε^3) log(mn/ε)).
PROOF. Let K = { K_ij = b(q_j − p_i, r_i + s_j) | 1 ≤ i ≤ m, 1 ≤ j ≤ n }; N is the union of the balls in K. We insert each ball K_ij ∈ K into an oct-tree T. Let C_v be the cube associated with a node v of T. In order to insert K_ij, we visit T in a top-down manner. Suppose we are at a node v. If C_v ⊆ K_ij, we mark v black and stop. If C_v ∩ K_ij ≠ ∅ and the size of C_v is at least (ε/c)(r_i + s_j), for an appropriate constant c, then we recursively visit the children of v. Otherwise, we stop. After we insert all balls from K, if all eight children of a node are marked black, we mark that node black as well. Let M = {v_1, v_2, ...} be the set of highest marked nodes, i.e., each v ∈ M is marked black but none of its ancestors is. It is easy to verify that each K_ij marks at most O(1/ε^3) nodes black, as the nodes it marks are disjoint and of size at least (ε/2c)(r_i + s_j); thus |M| = O(mn/ε^3). The whole construction takes O((mn/ε^3) log(mn/ε)) time and, with a proper choice of c, N_ε ⊆ ∪_{v ∈ M} C_v ⊆ N. Set R = cl(R^3 \ ∪_{v ∈ M} C_v); it is an ε-free region, as claimed.
Furthermore, let ref(A), ref(B), and t_0 = ref(B) − ref(A) be as defined earlier in this section. We prove the following result.
Lemma 5.3.7 Let t_1 be the point of R closest to t_0. Then
H(U_A ⊕ t_1, U_B) = O(H_F(A, B)).
PROOF. Let δ* = H_F(A, B), and let t* ∈ F be the placement with H(U_A ⊕ t*, U_B) = δ*. Then
H(U_A ⊕ t_1, U_B) ≤ H(U_A ⊕ t*, U_B) + ||t_1 − t*|| ≤ δ* + ||t_1 − t_0|| + ||t_0 − t*||.
A result in [12] implies that ||t_0 − t*|| ≤ c · δ*, for a constant c. On the other hand, since t* ∈ F ⊆ R, we have ||t_1 − t_0|| ≤ ||t* − t_0|| ≤ c · δ*. Hence,
H(U_A ⊕ t_1, U_B) ≤ (1 + 2c) · δ* = O(H_F(A, B)).
The point t_1 of R closest to t_0 can be computed as follows. Recall that, in Lemma 5.3.6, R = cl(R^3 \ ∪_{v ∈ M} C_v). The set Q = ∪_{v ∈ M} C_v consists of cubes that are disjoint except along their boundaries. We first check whether t_0 ∈ Q by a point-location operation. If the answer is no, then t_0 ∈ R, and we return t_1 = t_0. Otherwise, t_1 is the point of ∂Q closest to t_0; in this case, t_1 is either a vertex of a cube of M, or lies in the relative interior of an edge or a face of a cube of M. Given a node v ∈ M, for each boundary feature g of C_v — that is, a face, an edge, or a vertex of C_v — we compute the point of g closest to t_0; let P_v be the resulting set of closest points. Next, for each x ∈ P_v, we check whether x ∈ ∂Q by visiting the neighboring marked nodes that also contain x; this can again be achieved by point-location operations. Finally, we traverse all nodes of M and, among all those points of the P_v's that lie on ∂Q (and thus on ∂R), return the one closest to t_0. There are O(mn/ε^3) cubes, and each has a constant number of boundary features. Furthermore, at most a constant number of nodes of M contain any given point, and each point location takes O(log(mn/ε)) time. Hence, t_1 can be computed in O((mn/ε^3) log(mn/ε)) time.
We can compute H(U_A ⊕ t_1, U_B) in O((m+n)^2 log(m+n)) time, as described earlier in this section, so we obtain a constant-factor pseudo-approximation of H_F(A, B) within this time bound. Furthermore, we can again draw a grid around t_1 and compute an ε-approximation of min_{t ∈ R} H(U_A ⊕ t, U_B). We obtain the following.
Theorem 5.3.8 Given A and B in R^3 and ε > 0, we can compute, in O((mn/ε^3) log(mn/ε) + ((m+n)^2/ε^3) log(m+n)) time, an ε-free region R ⊆ R^3 and a placement t_R ∈ R of A such that H(U_A ⊕ t_R, U_B) ≤ (1 + ε) min_{t ∈ R} H(U_A ⊕ t, U_B).
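The cube-marking step behind the ε-free region can be illustrated with a uniform grid in two dimensions: cells lying wholly inside some forbidden ball b(q_j − p_i, r_i + s_j) are marked, and the complement of the marked cells plays the role of R. This is an illustrative stand-in only — a real implementation would use the hierarchical oct-tree so that the number of marked cells stays near O(mn/ε^3) — and the names are our own.

```python
import math

def eps_free_region_cells(A, B, bound, cell):
    """Mark the cells of a uniform grid over [-bound, bound]^2 (cell size
    `cell`) that lie entirely inside some forbidden ball
    b(q - p, r + s) of translation space; the unmarked cells form an
    approximately collision-free region."""
    marked = set()
    k = int(2 * bound / cell)
    half_diag = cell * math.sqrt(2) / 2
    for (p, r) in A:
        for (q, s) in B:
            center = (q[0] - p[0], q[1] - p[1])
            radius = r + s
            for i in range(k):
                for j in range(k):
                    z = (-bound + (i + 0.5) * cell, -bound + (j + 0.5) * cell)
                    if math.dist(z, center) + half_diag <= radius:
                        marked.add((i, j))  # cell fully inside forbidden ball
    return marked
```

Shrinking the cell size in proportion to ε·(r + s) is what bounds the uncovered shell near each forbidden sphere, mirroring the stopping rule of the oct-tree insertion.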
5.4 RMS and Summed Hausdorff Distance between Point Sets
We first establish a result on the simultaneous approximation of the Voronoi diagrams of several point sets, which we believe to be of independent interest, and then apply this result to approximate σ_T(A, B) and Σ_T(A, B) for point sets A = {p_1, ..., p_m} and B = {q_1, ..., q_n} in any dimension.
5.4.1 Simultaneous approximation of Voronoi diagrams
Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, and a parameter ε > 0, we wish to construct a subdivision of R^d so that, for any q ∈ R^d, we can quickly compute points φ_i(q) ∈ P_i, for all 1 ≤ i ≤ k, with ||q − φ_i(q)|| ≤ (1 + ε) d(q, P_i), where d(q, P_i) is, as defined earlier, min_{p ∈ P_i} ||q − p||. Our data structure is based on a recent result of Arya and Malamatos [24]: given a set P of n points and a parameter ε > 0, they construct a partition Π of R^d into O(n/ε^d) cells — each cell Δ ∈ Π is the region lying between two nested hypercubes (the inner hypercube may be empty) — and associate with each cell Δ a point φ(Δ) ∈ P, so that for any point q ∈ Δ, ||q − φ(Δ)|| ≤ (1 + ε) d(q, P). Π is the partition induced by the leaves of a compressed quadtree T [146], built on an initial hypercube H that contains P. Π and T can be constructed in O((n/ε^d) log(n/ε)) time, and the cell of Π containing a query point can be located in O(log(n/ε)) time.
Let H be a hypercube containing ∪_{i=1}^k P_i. For each i, we construct the above compressed quadtree T_i for the point set P_i, and let Π_i be the resulting subdivision. We then merge T_1, ..., T_k into a single compressed quadtree T [146], and thus effectively overlay Π_1, ..., Π_k. In particular, we start with T_1 and insert the cells of Π_i one by one, for i = 2, ..., k. We refine T after each insertion so that we still maintain a compressed-quadtree structure [24]. Since all the T_i's are built on the same initial hypercube H, the hypercubes involved in any two cells encountered during this process are either disjoint or nested. Hence, each insertion creates at most a constant number of new leaves. Let Π be the resulting overlay of Π_1, ..., Π_k; Π is a refinement of each Π_i, and |Π| = O(n/ε^d). Since the merged tree T is also a compressed quadtree, the cell of Π containing any query point can be computed in O(log(n/ε)) time.
For any cell Δ ∈ Π, let φ_i(Δ) = φ(Δ_i) denote the point associated with the cell Δ_i ∈ Π_i that contains Δ. Recall that, for any point q ∈ Δ, φ_i(Δ) is an ε-nearest neighbor of q in P_i, i.e.,
||q − φ_i(Δ)|| ≤ (1 + ε) d(q, P_i).
If we stored all k points φ_1(Δ), ..., φ_k(Δ) at each cell Δ ∈ Π (i.e., in the leaf nodes of T), we would need Θ(kn/ε^d) space, which we cannot afford. So we instead store the φ_i's at appropriate internal nodes of T. More specifically, for a fixed 1 ≤ i ≤ k and a cell Δ_i ∈ Π_i, let u be the node of the merged tree T associated with Δ_i, and let T_u be the subtree of T rooted at u. For any cell Δ ∈ Π associated with a node of T_u, φ_i(Δ) = φ(Δ_i). We therefore store φ(Δ_i) at u, instead of storing it at all the leaf nodes of T_u. The total storage needed for the φ_i's is O(Σ_{i=1}^k |Π_i|) = O(n/ε^d). To answer a query for a point q lying in a cell Δ ∈ Π, we collect the stored points φ_i(Δ), 1 ≤ i ≤ k, while traversing the path from the root of T to the leaf associated with Δ. As each φ_i is stored only once along any root-to-leaf path of T, we conclude the following.
Theorem 5.4.1 Given a family F = {P_1, ..., P_k} of point sets in R^d, with a total of n points, and a parameter ε > 0, we can compute in O((n/ε^d) log(n/ε)) time a subdivision Π of R^d of size O(n/ε^d), so that, for any point q ∈ R^d, one can ε-approximate d(q, P_i), for all
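A toy version of the simultaneous approximate-Voronoi structure can be built with a uniform grid in place of the compressed quadtree: each cell stores, for every point set, the point nearest to the cell center, which is within one cell diagonal (additively) of the true nearest neighbor and hence a (1+ε)-nearest neighbor for queries sufficiently far from the set. All names here are illustrative, and the uniform grid forgoes the size guarantee of the quadtree construction.

```python
import math

def build_avd(point_sets, bound, cell):
    """For every cell of a uniform grid over [-bound, bound]^2, store, for
    each point set P_i, the point of P_i nearest to the cell center."""
    k = int(2 * bound / cell)
    table = {}
    for i in range(k):
        for j in range(k):
            z = (-bound + (i + 0.5) * cell, -bound + (j + 0.5) * cell)
            table[(i, j)] = [min(P, key=lambda p: math.dist(z, p))
                             for P in point_sets]
    return table

def query_avd(table, bound, cell, q):
    """Return, for each point set, the stored representative of q's cell."""
    i = int((q[0] + bound) / cell)
    j = int((q[1] + bound) / cell)
    return table[(i, j)]
```

The point of Theorem 5.4.1 is precisely to avoid this table's Θ(k) storage per cell, by hanging each representative at one internal quadtree node instead of at every leaf below it.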
1 ≤ i ≤ k, in O(log(n/ε) + k) time.
5.4.2 Approximating σ_T(A, B)
For 1 ≤ i ≤ m, let B_i = { q_j − p_i | 1 ≤ j ≤ n }; note that d(p_i + t, B) = d(t, B_i) for any translation t. We construct the preceding decomposition, denoted Π_A, and the associated compressed quadtree T_A, for the family {B_1, ..., B_m}, with the given parameter ε; |Π_A| = O(mn/ε^d). Define
s_A(t) = Σ_{i=1}^m d^2(p_i + t, B) = Σ_{i=1}^m d^2(t, B_i),
define s_B(t) = Σ_{j=1}^n d^2(q_j, A ⊕ t) symmetrically, and let
σ(A ⊕ t, B) = ( (s_A(t) + s_B(t)) / (m + n) )^{1/2}.
For each cell Δ ∈ Π_A, define
F_Δ(t) = Σ_{i=1}^m ||t − φ_i(Δ)||^2.
By construction, for any t ∈ Δ,
s_A(t) ≤ F_Δ(t) = Σ_{i=1}^m ||t − φ_i(Δ)||^2 ≤ (1 + ε)^2 Σ_{i=1}^m d^2(t, B_i) = (1 + ε)^2 s_A(t),
implying that F_Δ approximates s_A on Δ up to the factor (1 + ε)^2. Hence, it suffices to store F_Δ at each cell Δ ∈ Π_A. Since F_Δ is a quadratic polynomial in the coordinates of t, it can be stored using O(1) space (where the constant depends on d) and updated in O(1) time for each change of a single φ_i(Δ).
If we computed F_Δ for each cell Δ ∈ Π_A independently, the total time would be O(m · |Π_A|) = O(m^2 n / ε^d). We therefore proceed as follows. We perform an in-order traversal of the compressed
quadtree T_A. For the cell Δ associated with the first leaf of T_A visited by the traversal, we compute F_Δ in O(m) time. For each subsequent leaf, we compute F_Δ from the value computed previously. Suppose we are currently visiting a cell Δ of Π_A; let Δ′ be the previous cell visited by the traversal, let z (resp., z′) be the leaf associated with Δ (resp., Δ′), and let
I_Δ = { i | φ_i(Δ) ≠ φ_i(Δ′) }.
The values φ_i(Δ), for all i ∈ I_Δ, are stored along the path from the nearest common ancestor of z and z′ to the leaf z. Since
F_Δ(t) = F_{Δ′}(t) + Σ_{i ∈ I_Δ} ( ||t − φ_i(Δ)||^2 − ||t − φ_i(Δ′)||^2 ),
we can compute F_Δ from F_{Δ′} in O(|I_Δ| + 1) time. As Σ_Δ |I_Δ| = O(|Π_A|), the total time required to compute all the F_Δ's is O(mn/ε^d).
Next, we compute, in � � � � � � time, a subdivision � � on the family � � � � � � � � �� � � � � , for
� � � �� , and a quadratic function
�� � - � for each cell � � � � so that�� � - � � � � � � � � � � � � � � � � � We overlay � � and � � . The same argument as the one used to
bound the complexity of � � shows that the resulting overlay � has � � � � � � cells and
that it can be computed in � � � � � � � ��� � � � � � � time. Finally, for each cell � in the
overlay, we compute
� � � %(' � � # $�*) � � %�� �
� �� � - � � � � �� � � �� � - � � � � � � �� %(' � � # $
�*) � � � � � � � ��� � � �����
and return
� #%$� )�� � ���� � � � ����� � � ��� � � � ����������
�
Hence, we obtain the following.
Theorem 5.4.2 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can:
i. compute a vector � � � � � in � � � � � � � ��� � � � � � � time, so that
� ���� � � � � ��� � � ��� � � � ���������
ii. construct a data structure of size � � � � � � , in time � � � � � � � ��� � � � � � � , so that
for any query vector � � � � , we can � -approximate � ��� � � ����� in � ��� � � � � � �
time.
5.4.3 Approximating the summed-Hausdorff distance
Modifying the above scheme, we approximate � � ����� ��� as follows. Let � � , � � , � � , and
� � be as defined above. We define
� � � � � ���� � �
� � � ��� � � � � � � ��� � � ����� �
� � � � � ���� � �
� � � � � � � � �� � ����� � � � � � �
For each cell � � � � , let
�� � - � � � � ���� � �
� � � � 1 � � � � � � � � � � � � ����� � � �����
and for each cell � � � � , let
�� � - � � � � ���� � �
� � � � 1 � � � � � � � � � � � � � � � � � � � � � �
As above, we overlay � � and � � . For each cell � in the overlay, we wish to compute
� � � � # $�*) � �
%�� � � �� � - � � � � �
�
�
�� � - � � � ��� �
However, since�� � - � and
�� � - � are not simple algebraic functions, we do not know how to
compute, store and update � � efficiently. Nevertheless, we can compute an � -approximation
of � � that is easier to handle. More precisely, for a given set � of points in � � , define the
1-median function
� � � � � � � � �� ) � �
� � � � � �
For any � � � � ,�� � - � � � � � � � ��� � � � � � , where
� � � � � �,1 � � � � � � � � � � . The
same is true for�� � - � , where � � � � . In Section 5.4.4, we describe a dynamic data
structure that, given a point set � of size � , maintains an � -approximation of the function
� � � � � � � as a function consisting of � � � � � � � ��� � � � � � � pieces; the domain of each piece
is a � -dimensional (or the complement of a � -dimensional) hypercube. A point can be
inserted into or deleted from � in � ��� � � � � ��� � � � � � � � � � � time. Furthermore, given
two point sets � and � in � � , this data structure can maintain an � -approximation of
� %�� � � � � � � � � � � � � � � � � � � within the same time bound.
Using this data structure, we can traverse all cells of the overlay of � � and � � , as in
Section 5.4.2, and compute an � -approximation of�� � - � and
�� � - � , (thus roughly a ( � )-
approximation of � � ) for each cell � of the overlay. However, given two adjacent leaves
during the traversal, associated with cells � and � � respectively, we now spend
� � � - � � � � � � � � � �� ��� � � � � � � � �
time to compute an � -approximation of�� � - � from that of
�� � - � . Putting everything to-
gether, we conclude the following.
Theorem 5.4.3 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can compute:
i. a vector � ��� � � in � � ���� � �� ��� � � � �� � � time so that
� � ��� � � � � ��� � � � � � � � �����������
ii. a data structure of � � ���� � � ��� � � � �� � � size in time � � ��
�� � � ��� � � � �� � � , so
that for any query vector � � � � , we can � -approximate � � ��� � � ����� in time
� � �� ��� � � � �� � .
5.4.4 Maintaining the 1-median function
Let � be a set of � points in � � and let � � be a parameter. For � � � � , define
the 1-median function �
� � � ��� � � � )�� � � � � � � , as above. We describe a dynamic data
structure that maintains a function � � ��� � � as the points are inserted into or deleted
from � so that
� � � � � � � � ����� � � � ��� � � � � � � � � � � � � � � � �We maintain a height-balanced binary tree
with � leaves, each storing a point of � .
For a node � , let � � " � be the set of points stored at the leaves of the subtree rooted
at u; set � �� � � � � . For each node u of height � (leaves have height 0), set � � � � � � � ,
where � (= ��� � ) is the height of the tree
. We associate with a node u, at height � , a
function � � that is a � � -approximation of �� � ��� , i.e.,
� � � ��� � � � � � � � ��� � � � � � � � � � � � ��� ��� � ��� � � � � � . The description complexity of � � is � � � � � � � ��� � � � � � � . Finally, we maintain a function � that is an � -approximation of �
� � � � � � with description complexity � � � � � � � ��� � � � � � � .
Figure 5.3: (a) An exponential grid with 3 layers. (b) The larger (resp. smaller) cube is � (resp. �� ), and the set of hollow dots are ��� .
More specifically, if a leaf stores the point � � � , then set � � ��� � � � � � � � � . For all
internal nodes u, we compute � � in a bottom-up manner as follows. Let v and w be the
children of u. By induction, suppose we have already computed the functions ��� and � � , each of descriptive complexity � � � � � � � ��� � � � � � � . Set � � � � � � ����� � � � ��� ��� � . Since
� � � ��� � � � � � � � ��� � � � � � � � ��� ��� � , by induction hypothesis, � � is an � � � � � � � � � � -approximation of �
� � ��� . However, the description complexity of � � is more than what we
desire. We therefore approximate � � by a simpler function � � as follows. For � � � � and
� � � , let� ��� � � � be the hypercube of side length � centered at � . For simplicity, let
� � � �� � . We compute �� %)' � � # $
� � � ��� � and set � � � � � � � . Let� � � � � � � �� � � � � � � �
for� � � � ��� � � � ��� . Partition each cubic shell
� � � � � � � into hypercubes by a � -
dimensional grid � � in which each cell has side length �� � � � � � � � . The union of
� � ’s is an exponential grid with � � � � � �)� � � � � � � cells that covers the hypercube� �
� � � �� � � ��� � � � � . See Figure 5.3 (a) for an illustration. In each cell � � � � , pick an
arbitrary point � �� and set
��� � � � � � � � � � � � � � ��� � �� � (5.1)
For points outside�
, we set
����� � �
� ��� � � ��� � � � ��� � � � � � � � (5.2)
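The exponential-grid lookup behind (5.1) and (5.2) can be sketched concretely as follows (hedged: the planar setting, the layer indexing, and the constant in the cell side are illustrative choices, not the thesis's exact parameters). The key property is that a point is snapped to a cell representative whose distance from it is O(ε) times its distance from the grid center.

```python
import math

def exp_grid_rep(x, q, r, eps, layers):
    """Snap x to the representative point (cell centre) of its cell in an
    exponential grid centred at q. Hedged 2D sketch: layer i covers points
    at L-infinity distance about 2**(i-1) * r from q, and its cells have
    side proportional to eps * 2**i * r, so the snapping error is O(eps)
    times the distance from q. Returns None outside the outermost layer,
    where the exact quadratic expression would be used instead."""
    dx, dy = x[0] - q[0], x[1] - q[1]
    d = max(abs(dx), abs(dy))        # L_inf distance picks the layer
    i = 0 if d <= r / 2.0 else math.floor(math.log2(2.0 * d / r)) + 1
    if i > layers:
        return None
    side = eps * (2 ** i) * r / 2.0  # grid cell side within layer i
    return (q[0] + (math.floor(dx / side) + 0.5) * side,
            q[1] + (math.floor(dy / side) + 0.5) * side)
```

Because the cell side doubles with each layer while the distance from the center also doubles, the relative snapping error stays O(ε) across all layers.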
Hence, the function � � is piecewise-constant inside�
and a quadratic function outside�
. The description complexity of � � is � � � � � � � �)� � � � � � � � � � � � � � � ��� � � � � � � . Since
��� and ��� have the same structure, the point � can be computed by evaluating the func-
tion � � at the vertices of the exponential grids drawn for ��� and ��� . At each point � , we
can evaluate � � ��� � in � ��� � � � � � � time by simply locating � in the two exponential
grids. Hence, we can compute the point � in time � � � � � � � ��� � � � � � � � . We spend an-
other � � � � � � � �)� � � � � � � time to compute � � . That � � � � � is indeed a � � -approximation of
� � � � � � � � is proved in Lemmas 5.4.4 and 5.4.5. This finishes the induction step. Using the
same procedure, we compute an ��� � � � -approximation, � , of � root, of descriptive complexity
� � � � � � � �)� � � � � � � . By construction, for all � � � � ,
� � � � ��� � � ����� � � � � � � � � � � root��� �
� � ��� � � � � � ��� � � � � � � � ��� � � � � � � � � � � � ��� � � . Obviously, the size of the above data structure is � � � � � � � ��� � � ��� � � � � � � . To insert
or delete a point � , we follow a path from the leaf � storing � to the root of
and recompute
�� at all nodes along this path, and then compute � from � root. Hence, the update time is
� �)� � � � � ��� � � � � � � � � � � , and the only missing component now is to show that � � , as
constructed above at each node � , is indeed a � � -approximation of �� � ��� .
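The leaf-to-root update scheme can be skeletonised as follows (a hedged sketch: a crude fixed-size point sample stands in for the ε-approximate 1-median summaries maintained by the thesis, and the number of leaf slots is assumed to be a power of two). The point it illustrates is that changing one leaf touches only the O(log n) summaries on its leaf-to-root path.

```python
class SummaryTree:
    """Skeleton of the update scheme: a complete binary tree (heap-style
    array, n leaf slots with n a power of two) where every internal node
    stores a summary built only from its children's summaries. Changing one
    leaf therefore recomputes O(log n) summaries along a single
    leaf-to-root path. The fixed-size sample and the thinning rule are
    illustrative stand-ins for the thesis's approximate 1-median pieces."""

    def __init__(self, n, k=8):
        self.n, self.k = n, k
        self.node = [[] for _ in range(2 * n)]

    def _merge(self, a, b):
        s = a + b
        if len(s) <= self.k:
            return s
        step = len(s) / self.k           # thin back down to k points
        return [s[int(i * step)] for i in range(self.k)]

    def set_leaf(self, i, pts):
        j = i + self.n
        self.node[j] = list(pts)
        j //= 2
        while j >= 1:                    # recompute summaries on the path
            self.node[j] = self._merge(self.node[2 * j], self.node[2 * j + 1])
            j //= 2
        return self.node[1]              # summary at the root
```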
Lemma 5.4.4 Let u be a node of the tree at height � . For any � � � � ��� � � � ��� and for any
� � � � " �,
� � � � � ��� � � � � � � � � � � � � � � � � � � � ��� � ��� � �� �
PROOF. The triangle inequality implies that for any � � � � � � ,
� � � � � � � � � � � � � � � � � � � � � � � ��� � � � � � � � � � ��� � � � � � � � � � � � (5.3)
Therefore, by construction of the exponential grid,
� � � � ��� � � � � � � � ��� � � � � � �� ��
� � � � � �� � (5.4)
Equation (5.1) and (5.3) imply that
����� � � � � � � � �
�� � � � � ��� � � � � � � � � � � � � ��� � � � �
Substituting � for � in (5.3), we obtain
� � � � � ��� � � � � ��� � � � � � � � � � � � �
� � � � � � � � � � � (5.5)
Hence, for any � �� ,
��� � � � � � � � � � � � � � � � � � � � � � � �
�
� � � � � � � � � � � � � � � �� � � � � � � � �
�
� ��� � � � � � � � � � �� ��
� � � � � � � ( using (5.4) )
� � � � � � � � ��
� � � � � � ��� � � � � � � � � �� � � � � � � � �
�
� � �
� � � � � ��� ��� � ( using (5.5) )
� � � � � �
� � � � � ��� � � � � � � � � � � � � � ��� � � � �
Lemma 5.4.5 Let u be a node of the tree at height � . Then for any � � ��� � � ,
� � � ��� ��� � � � � � � � � � ��� � � � � � � ��� � � � �
PROOF. By (5.3),
� � � � ��� ��� � � � � � ��� � � � � � � ��� � � � � � � � � � ��� ��� � � � � (5.6)
The first inequality of the lemma is now immediate because
� � � ��� ��� � � � � � ��� � � � � � � � ��� � � � � � � � ��� � � � � � � � � � � � � . As for the second inequality, we first upper-bound � in terms of �
� � ��� ��� � . Let �� �
� � � � � � ��� � � � � and �� �� �
�� �� (see Figure 5.3 (b)). Then
� � � � � � � � � � �� )���� �
� � � � � �� -)����� �
� � � � � � � � � �� � � � �� � �
�
Therefore
� �� � � � � � � � � � � � � � � � � � � �
On the other hand, for any � � � � � � and � � ��
, we have that
� ��� � � � � � � � � � � � � � � � � � � � � � � � . Hence, as long as � � � � � , we have
� � � � � � � � �� ) ���� �
� � � � � � � �� � � � � �
� � �
���
thereby implying that � � � � � � ��� � � � . Using (5.2) and (5.6),
����� � �
� ��� ��� � � � � � � � � � � � ��� � � �� � � � � � ��� � � � � � � � � � � � � � ��� � � � � � � � � � � � � �
5.4.5 Randomized algorithm
We briefly describe below a simple randomized algorithm to approximate � ���������� . The
algorithm for approximating ��������� ��� is similar. Let � � be the optimal translation, i.e.,
� ��� � � � � ��� � � ���������� .
Lemma 5.4.6 For a random point � from � , � � � � � � ����� � � ���������� , with probability greater than 1/2. The same claim holds for � ����������� .
PROOF. Let � be a random point from � , where each point of � is chosen with equal
probability. Let � be the random variable � � � � � � � � ����� . Then �� � � � �
� �
� � � � � � �� � ����� . Moreover,
� ��������� � � ���� � � � ����� ��
��� � �
� � � � � � ����� � �� � � �
The lemma now follows immediately from Markov’s inequality.
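The Markov-inequality step can be checked concretely. In the sketch below (hedged; the names and the planar setting are illustrative), we compute directly the fraction of points of � whose nearest-neighbor distance after a given translation is within twice the average; Markov's inequality guarantees this fraction always exceeds one half.

```python
import math

def markov_fraction(A, B, t_opt):
    """Lemma 5.4.6 in miniature (hedged sketch). With d(p) = distance from
    the translated point p + t_opt to its nearest neighbour in B, the mean
    of d over A is the summed cost divided by |A|; by Markov's inequality,
    more than half of the points of A satisfy d(p) <= 2 * mean, which is
    why a random p works with constant probability."""
    d = [min(math.dist((p[0] + t_opt[0], p[1] + t_opt[1]), q) for q in B)
         for p in A]
    mean = sum(d) / len(d)
    good = sum(1 for x in d if x <= 2.0 * mean)
    return good, len(A)
```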
Choose a random point � � � . Let � � � � � � � and � � � � ��� � � � � ��� , for
� � � �� . It then follows from Lemma 5.4.6 and the same argument as in Lemma 5.3.7
that �#%$ � � � is a constant-factor approximation of � ��������� , with probability greater than
� �� . Computing � � exactly is expensive in � � , therefore we compute an approximate value
of � � , for� � � �
� , in time � � � � � ��� � � , by performing approximate nearest-
neighbor queries [24]. We can improve this constant-factor approximation algorithm to
compute a � � � � � -approximation of �� ���������� using the same technique as in Section 5.3.
We thus obtain the following result.
Theorem 5.4.7 Given two sets A and B of m and n points, respectively, in R^d and a parameter ε > 0, we can compute, in � � � � � � � ��� � � time, two translation vectors � � and � � , such that with probability greater than
� � ,
� ��� � � � ����� � � ��� � � � ���������� and � ����� � � � � ��� � � � � � � � � ����������
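The overall randomized scheme of this section can be sketched as follows (hedged: planar points, a one-sided summed cost, and brute-force nearest neighbors standing in for the approximate nearest-neighbor queries of [24]). A random point of the first set is, with probability at least 1/2, one whose cost under the optimal translation is within twice the average, so one of the candidate translations derived from it yields a constant-factor approximation; a few independent trials boost the success probability.

```python
import math
import random

def random_translation_search(A, B, trials=3, seed=0):
    """Hedged sketch of the randomized algorithm: pick a random p in A,
    try every candidate translation q - p with q in B, and keep the best.
    `summed_cost` is the (one-sided) summed distance from the translated
    A to B, evaluated here by brute force."""
    rng = random.Random(seed)

    def summed_cost(t):
        return sum(min(math.dist((a[0] + t[0], a[1] + t[1]), b) for b in B)
                   for a in A)

    best_t, best = (0.0, 0.0), summed_cost((0.0, 0.0))
    for _ in range(trials):
        p = rng.choice(A)
        for q in B:
            t = (q[0] - p[0], q[1] - p[1])
            c = summed_cost(t)
            if c < best:
                best_t, best = t, c
    return best_t, best
```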
5.5 Notes and Discussion
In this chapter we provided an initial study of various problems related to minimizing the Hausdorff distance between sets of points, disks, and balls. One natural question following
our study is to compute exactly or approximately the smallest Hausdorff distance over all
possible rigid motions in � � and � � . Given two sets of points � and � of size m and n,
respectively, let � be the maximum of the diameters of point sets � and � . We believe that
there is a randomized algorithm with roughly � � expected time that approximates the
optimal summed-Hausdorff distance (or rms-Hausdorff distance) under rigid motions in
the plane. The algorithm combines our randomized approach from Section 5.4.5, a frame-
work to convert the original problem to a pattern matching problem [110], and a result by
Amir et al. on string matching [21]. However, this approach does not extend to families of
balls. We leave the problem of computing the smallest Hausdorff distance between sets of
points or balls under rigid motions as an open question for further research. Another ques-
tion is to approximate efficiently the best Hausdorff distance under certain transformations
when partial matching is allowed. The traditional approaches using reference points break
down with partial matching.
Chapter 6
Coarse Docking via Features
6.1 Introduction
Proteins perform many of their functions by interacting with other molecules, and these
interactions are made possible by molecules binding to form either transient or static com-
plexes. We focus in this chapter on the problem of predicting the binding (or docking)
configurations for two large protein molecules, which we refer to as the protein-protein
docking problem (see Figure 6.1). This problem is important because the docked complex
Figure 6.1: Given the protein structures in (a) and (b), the docking problem predicts the docking configuration in (c).
has functional consequences (e.g., signal transduction) and because it is usually hard to crystal-
lize complexes. Many of the more than 25,000 proteins in Protein Data Bank (PDB) [31]
are able to form protein-protein complexes; however, there are only a few hundred non-obligate crystallized complexes¹; see the discussion in Chapter 1 for more motivation.
¹Obligate complexes are permanent multimers whose components do not exist independently.
When two proteins bind, their shapes (molecular surfaces) roughly complement each
other at the interface [113, 116]. This is the main justification for a geometric approach
in predicting the docking configurations, which is also the basis for our approach. The
docking problem that we study is as follows: Given two proteins, � and � , assume that
the native docked configurations are � and � � . The goal is to develop an efficient algo-
rithm to compute configurations for � that are close to � � , or in other words, to search
for configurations for � that complement � best geometrically. To measure the goodness
of the complementary fit between � and some configuration � � of � , a scoring function Score is needed. Note that in the above formulation, we assume that both proteins are rigid bodies during the docking process, which is usually referred to as the bound
conformation, is not considered in this chapter, but will be the focus of future work. Of
course in nature, more than two proteins might bind and form one complex. The work in
this chapter serves as a starting point for attacking these more general problems.
Prior work. There are mainly two categories of interactions that have been extensively
studied: protein-ligand interactions, which occur between a protein and a small molecule, and protein-protein interactions. Much of the earlier work in this area has focused on protein-ligand interactions, mainly motivated by drug design. With the number of available protein
structures increasing rapidly, there is considerable research on understanding interactions
between large proteins. Despite some similarities, the known approaches for predicting
these two types of interactions are different in several aspects. For the case of protein-
ligand docking, both chemical information and the flexibility of the ligand (and sometimes
that of the receptor as well) are considered at a quite detailed level. For protein-protein docking, in contrast, geometry seems to play a more important role, and because of
their large size, these proteins may not be as flexible as small molecules during the docking
process. Even if we ignore the flexibility, predicting protein-protein docking configurations
is computationally more demanding than the case of protein-ligand docking because of
the high complexity of protein molecules. We do not review the known results for protein-
ligand docking in this section. Interested readers are referred to [85, 102, 164].
Current research on protein-protein docking is focused on either bound docking, or
unbound docking with fairly small protein conformational changes. Most approaches
for unbound docking use bound docking as a subroutine. They usually consist of two
stages [153]:
(i) bound-docking stage, which produces a set of potential docking configurations by
considering only rigid transformation; and
(ii) refinement stage, which allows certain flexibility.
Two essential components are involved in both stages:
(i) a scoring function that can discriminate near-native docking configura-
tions from incorrect ones; and
(ii) a search algorithm to find (approximately) the best configuration under the score
function used.
In general, approaches to the bound docking stage rely mainly on geometric comple-
mentarity. Each of the input proteins can either be represented as a surface model [86, 127],
a union of balls (with each ball representing an atom) [32, 61], or a set of voxels [53, 143].
For the last case, the space is divided into a set of regular grid cells (voxels), where each
cell is marked as inside, outside, or on the surface of the protein. To search for the best geo-
metric fit, the most straightforward approach, called exhaustive search, discretizes the transformational space into a six-dimensional grid and computes the score for each grid point [32].
Although this approach produces good near-native docking configurations with few false positives², it is too expensive. Several approaches have been proposed to reduce the time complexity of this search procedure.
²A false positive is a docking configuration with a high score that is far from the native configuration.
One popular technique is the fast Fourier transform (FFT) method, which was first used
in molecular docking in [119]. The FFT method represents molecules as voxels. By de-
signing the scoring function appropriately and discretizing the translational space into a� � � � � grid, for a fixed rotation, the scores of all relative translations can be evaluated si-
multaneously in � � � ��� � � time. It is still necessary to search the three-dimensional space
of rotations, which is typically done exhaustively. There are several properties of FFT-type
approaches that make them rather attractive: besides surface complementarity, chemical
properties such as hydrophobicity can be correlated as well, and it is fast to perform low-
resolution FFT-searches, which is useful in producing coarse fits. Available docking pro-
grams exploiting the FFT method include FTDock [90], 3D-Dock [137], GRAMM [163],
and ZDock [53]. Nevertheless, for reasonably high resolution docking, the time complex-
ity for FFT-type approaches is still fairly high. The approach proposed by Ritchie and
Kemp [143, 144] addresses this problem by using spherical harmonics to represent both
the molecular surface and the electric field. Complementarity between surfaces in different
orientations is calculated by Fourier correlations between the expansion coefficients of ba-
sis functions. A table of overlap integrals independent of protein identities is pre-calculated
to expedite the docking algorithm.
Another widely used class of approaches reduces the number of transformations inspected by aligning so-called features on molecular surfaces. The idea goes back to Connolly [67], who also proposed to identify point features by taking the minima and maxima of the so-called Connolly function. One representative approach is by Fischer et al. [86], who use a geometric hashing technique to align points computed by a variant of the Connolly function. Variants and improvements of their approach include [93, 127]. Such
algorithms are usually faster, but require post-processing to remove steric clashes and to
refine the geometric fit.
Other approaches for bound docking include genetic algorithms to conduct the search [39,
91], and fast bit-manipulation routines to expedite the evaluation of scores [140].
Although good in recombining separate components of a known complex, geometric fit
is not sufficient to dock unbound proteins. It is commonly believed that the native complex
is at the global minimum of ΔG, the difference between the free energy of the complex
and that of its separate components. Hence the refinement stage is usually modeled as
an energy-minimization problem. The scoring functions here focus more on the thermo-
dynamic aspects of the interaction. Flexibility of side chains, or even that of backbones,
is usually considered during the search process. One common way to incorporate side
chain flexibility is by considering only known populated rotamers of the side chains. Un-
fortunately, this still produces an exponential number of combinations. Techniques such
as iterative mean-field minimization [111], dead-end elimination [18], and genetic algo-
rithms [115] have been used to reduce the time complexity. Recently, Vajda et al. have
proposed a hierarchical, progressive refinement protocol [41, 42]. It follows the intuition
that a simplified energy landscape is sufficient for far-apart proteins, and as two proteins
get closer, a more detailed energy landscape should be used. Their algorithm is able to reliably converge to a near-native configuration (with rmsd below a small threshold, in Å) starting with one with a considerably larger initial rmsd.
Little success has been achieved for including backbone conformational changes. It
seems that a significant portion of this larger motion is the class of hinge-link inter-domain
movements [124]. Some initial investigation has been made to docking proteins consider-
ing such motions [147].
New work. Since each step in the refinement stage is quite costly, it is crucial to gen-
erate a small and reliable set of potential configurations during the bound docking stage.
On the other hand, even though only rigid motions are considered in this stage, the time
complexity of current approaches is still not satisfactory, preventing us from experiment-
ing with larger data sets. In this chapter, we present an efficient algorithm for the bound-
docking stage. We use geometric complementarity to guide us in the search for a small
set of rigid motions that fit the two proteins loosely into each other. Such a set of poten-
tial configurations can be further refined independently, using both chemical information
and flexibility [42, 61], to obtain more accurate docking predictions. We remark that for
the case of unbound docking, it is especially important to consider coarse (not tight) fits
between input components in order to provide enough tolerance for flexibility in the later
refinement procedure.
We describe our coarse alignment algorithm in Section 6.2. It relies on the ability
to describe meaningful “features” on molecular surfaces, such as protrusions and cavities,
using a succinct set of pairs of points computed from the elevation function, defined in
Chapter 4. We then align such pairs and evaluate the resulting alignments by a simple and
rapid scoring function. Compared to similar approaches [86, 127] that align the so-called
feature points, our algorithm inspects orders of magnitude fewer transformations by align-
ing only meaningful features to produce a reliable set of potential docking positions. In
Section 6.3, we demonstrate the performance of our approach by testing a set of 25 bound
protein-complexes obtained from the Protein Data Bank [31]. We also show that by com-
bining our coarse alignment algorithm with the local improvement procedure described in [61],
we are able to efficiently find accurate near-native docking positions for all but one case
without any false positives. Additionally, we have tested our algorithm on the unbound
protein docking benchmark [54], demonstrating that � � %)' ��� � # � $ can generate useful pro-
tein poses that can serve as input for refinement methods that take protein flexibility into
account.
6.2 Algorithm
Assume that we are given two proteins � � � � � � � � � ��� � � � � � and � � � � � ��� � � ����� ��� � � ,where each � � � � � � � � � (resp. � � � ��� � � � � � ) represents an atom centered at �� � � � (resp.
� � � � � ) with van der Waals radius� � (resp. � � ). Let
Score ��������� be a scoring function that we will describe shortly. Ideally, we would like to find a transformation � for � that maximizes Score ����� � � ��� � . In this section, we describe an algorithm that finds a set of
potentially good transformations for � . Below, we first explain the scoring function we
use. We then present an algorithm that produces a set of transformations for � by aligning
pairs of points computed from the elevation function.
At a high level, we first construct a set�� (resp.
�� ) of features based on the elevation
function as defined in Chapter 4, each consisting of two points along with a normal di-
rection, to characterize the molecular surface of protein � (resp. � ). These two sets of
features,�� and
�� , are the inputs to our coarse alignment algorithm, which outputs a
list of possible configurations sorted by their scores. Below, we first describe the scoring
function that we use. Then after explaining how to construct features from the elevation
function, we present the coarse alignment algorithm.
6.2.1 Scoring function
A good scoring function should produce a higher score for (near) native configurations
than for non-native ones. We design our scoring function to describe the geometric com-
plementarity between � and � . In particular, for a pair of atoms � � � � � � � , both the pairwise score Score � � � � � � � and the clash indicator Collide � � � � � � � are determined by the quantity � � � � � � � � � � � � � � , the distance between the two atom centers minus the two van der Waals radii: a pair contributes positively to the score when this quantity is nonnegative and at most � , a prefixed constant that we refer to as the contact-threshold; it counts as a clash when the quantity is negative; and it contributes nothing otherwise. The scoring function between � and � then involves two components: the total score Score ��������� , summed over all atom pairs, and the collision number Collide ��������� , the total number of clashes. A configuration ��������� is valid if Collide ��������� � � , where the collision-threshold � defines the maximum number of clashes that can be tolerated.
This definition is rather similar to the one used in [32] and in [53]. The main difference
is that in addition to counting collisions, we use them as a reason to lower the score. The
reason behind this is that as our algorithm generates coarse alignments between the input
proteins, we need a large tolerance for collision ( � � � in our experiments, in contrast to
� � �in [32]). However, a high collision number will also increase the number of pairs of
atoms that are in contact, resulting in high scores. By giving a penalty to the score when a
clash happens, we intend to counterbalance the consequential increase in score.
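A minimal sketch of this scoring function follows (hedged: the contact-threshold value and the clash penalty are illustrative assumptions, not the thesis's constants).

```python
import math

def score_and_collide(P, Q, alpha=1.5, penalty=1.0):
    """Hedged sketch of the contact-based scoring function. Atoms are
    (x, y, z, r) tuples. For each pair, delta = |ab| - r_a - r_b: a value
    in [0, alpha] counts as a contact (+1), while delta < 0 is a steric
    clash that increments the collision number and lowers the score by
    `penalty`."""
    score, collide = 0.0, 0
    for a in P:
        for b in Q:
            delta = math.dist(a[:3], b[:3]) - a[3] - b[3]
            if delta < 0:
                collide += 1
                score -= penalty
            elif delta <= alpha:
                score += 1.0
    return score, collide
```

The collision penalty is what counterbalances the spurious contacts that a deeply interpenetrating configuration would otherwise accumulate.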
6.2.2 Computing features.
Given protein � , we compute its surface and the elevation function defined on it
(as defined in Chapter 4). For docking purposes, we are interested in points with
locally maximal elevation, as they represent locally most significant features. Recall that
there are four types of maxima, which are illustrated in Figure 6.2 (a), each describing a
different type of features on molecular surfaces.
(a) (b)
Figure 6.2: (a) From left to right: a one-, two-, three-, and four-legged local maximum of theelevation function. (b) A three-legged maximum generates six features as indicated by shadedbars.
We collect �� , the set of features for protein � , as follows: for each maximum � of the elevation function, add all possible pairs of points among the head(s) and feet of � into
�� . See Figure 6.2
(b) for an example. With each feature generated from the maximum � , we associate the
normal direction � � (i.e., all head(s) and feet of � are critical points of the height function in direction � � ), and assume that � � always points towards the exterior of the surface at
� . Thus a feature is a pair of points along with the unit vector � � . Given a feature, we
refer to the distance between its two points as its length, and the elevation of � as its elevation.
The length and elevation of a feature indicate its importance, thereby providing a way to
distinguish less from more meaningful features.
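The feature-generation rule can be sketched directly (hedged; the flat tuple representation is an illustrative choice).

```python
from itertools import combinations

def features_from_maximum(heads, feet, normal):
    """All features generated by one maximum of the elevation function:
    every unordered pair among its head(s) and feet, tagged with the
    maximum's outward normal direction. A three-legged maximum has four
    such points and therefore yields the six features of Figure 6.2 (b)."""
    points = list(heads) + list(feet)
    return [(p, q, normal) for p, q in combinations(points, 2)]
```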
We remark that previous alignment-based approaches describe features by points resid-
ing at a protrusion or a cavity. This is one of the main differences between our approach and those methods. Points provide much less information for identifying specific protrusions or
cavities as compared to our representation of features. Therefore our alignment algorithm
(as described below) is able to inspect orders of magnitude fewer configurations than those
algorithms. We elaborate on this difference more in Section 6.4 at the end of this chapter.
6.2.3 Coarse alignment algorithm
Given two proteins � and � , along with their feature sets�� and
�� , respectively, our
algorithm, sketched in Figure 6.3, computes a set of potential coarse alignments. The
rationale behind the algorithm is that good fits between the input proteins should have some features aligned, such as a protrusion from one surface fitting inside a cavity on the other
(see Figure 6.4 (a)). Hence if we align all “features” from � to those from � , we will
cover potentially good fits. Here features are defined as in Figure 6.2 (b). As we wish to
align only important features, we preprocess�� and
�� by removing features with small
elevation value or short length. We now explain steps (1), (2) and (3) from the above
ALGORITHM CoarseAlign
    Preprocess the feature sets of P and Q;
    for each feature f of P and each feature g of Q
    (1)    if ( Feasible(f, g) )
    (2)        T = Align(f, g);
    (3)        compute ( Score, Collide ) for ( P, T(Q) );
               if ( Collide(P, T(Q)) is at most the collision-threshold )
                   add T to the list L;   // T is valid
    Sort L by Score;
end
Figure 6.3: The coarse alignment algorithm.
Figure 6.4: (a) If two surfaces complement each other well, then features at the interface should align well too. (b) Points p and q do not match each other: they should have opposite criticality (maximum with minimum w.r.t. their own normals).
algorithm.
Function Align: The function Align(f, g) computes a rigid motion � that aligns the pair of points � � � � � � � � � � with � � � � � � � � � � . In particular, assume that � � and � � are the
normals associated with ��� and � � respectively. Let
� � � ��� � � �� � � � � � and � � � � ���
0 � � 0 �� 0 � � 0 � ���
To obtain the transformation, we first translate the midpoint of segment � � � � to the mid-
point of segment � � � � . Next, we rotate segment � � � � so that (i) segment � � � � lies on the
line passing through � � � � ; and (ii) � � coincides with � � . See Figure 6.5 for an illustration.
� � . See Figure 6.5 for an illustration.
Note that there is ambiguity in (i) as vector��� � � � can either be in the same direction or in
Figure 6.5: First move the midpoints (two empty dots) together. Then rotate so that the two segments coincide. Last, rotate about the segments' common line so that the two normals coincide.
opposite direction as vector ��� � � � . As such, the function Align in fact returns two distinct transformations, although for simplicity we pretend it returns only one.
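The Align construction can be sketched by attaching an orthonormal frame to each feature and mapping one frame onto the other (hedged: the helper names are illustrative, only one of the two segment orientations is produced, and the normal is assumed not to be parallel to the segment).

```python
import math

def _sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def _dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def _frame(p1, p2, normal):
    # Orthonormal frame for a feature: first axis along the segment, second
    # along the part of the normal orthogonal to it, plus the segment
    # midpoint as origin.
    e1 = _unit(_sub(p2, p1))
    t = _dot(normal, e1)
    e2 = _unit((normal[0] - t * e1[0], normal[1] - t * e1[1],
                normal[2] - t * e1[2]))
    e3 = _cross(e1, e2)
    mid = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
    return (e1, e2, e3), mid

def align(feat_p, feat_q):
    """Rigid motion taking feature feat_q = (q1, q2, nq) onto
    feat_p = (p1, p2, np): midpoints coincide, segments become collinear,
    and the two normals end up in the same half-plane through the segment.
    Returns a callable x -> R (x - mid_q) + mid_p."""
    Fp, mp = _frame(*feat_p)
    Fq, mq = _frame(*feat_q)

    def apply(x):
        v = _sub(x, mq)
        c = tuple(_dot(e, v) for e in Fq)       # coordinates in Q's frame
        return tuple(mp[i] + sum(Fp[k][i] * c[k] for k in range(3))
                     for i in range(3))         # rebuild in P's frame
    return apply
```

Flipping the first frame axis of one feature gives the second transformation of the ambiguity mentioned above.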
Function filter: Obviously, if f1 and f2 are fairly "dissimilar", then they will
not align well with each other in any good configuration. By "dissimilar", we mean that

(F1) we are trying to align a protrusion (resp. cavity) from one surface with a protrusion
(resp. cavity) from the other (see Figure 6.4 (b)), or

(F2) the length of f1 is too different from that of f2.

Given feature pairs f1 = (p1, p2) and f2 = (q1, q2), associated with normals n1 and n2
respectively, the function filter(f1, f2) returns false if either (F1) or (F2) happens. In
particular, (F1) happens if p1 (or p2) is a local minimum (or local maximum) with respect
to n1 and q1 (or q2) is of the same criticality with respect to n2. Let l be the length of the
shorter feature pair and d the difference between the lengths of f1 and f2. If d > epsilon * l, we
consider that (F2) happens, where epsilon is a threshold on the ratio of the two lengths.
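The two rejection tests can be written down directly. In this hypothetical sketch a feature is reduced to its length and the criticalities of its two endpoints (+1 for a local maximum, -1 for a local minimum w.r.t. the feature normal); the representation is an assumption for illustration only.

```python
def feasible(f1, f2, eps):
    """Return False for dissimilar feature pairs, per tests (F1) and (F2).

    Each feature is (length, (c1, c2)) with criticalities c in {+1, -1}."""
    len1, (a1, b1) = f1
    len2, (a2, b2) = f2
    # (F1): matched endpoints must have opposite criticality
    # (a protrusion on one surface must meet a cavity on the other).
    if a1 == a2 or b1 == b2:
        return False
    # (F2): the length difference must stay below eps times the shorter length.
    if abs(len1 - len2) > eps * min(len1, len2):
        return False
    return True
```

Because both tests are constant-time, filtering costs nothing compared with evaluating a configuration.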
Computing score and collision: By definition, we can compute both score(A, tau(B)) and
collision(A, tau(B)) in O(mn) time, where m and n are the number of atoms in A and B
respectively. Here we present a simple algorithm that computes them in O((m + n) log m)
time by building a hierarchical data structure for A as follows.

Let P be the set of centers of the atoms of A; the diameter of P is O(mr), where
r is the smallest radius of an atom from A (between one and two angstroms). We build a
standard oct-tree T on P. Points of P are stored at the leaves of T. For each internal node
v, let P_v denote the set of centers in the subtree of v and A_v the set of balls from A with
center in P_v. We associate with node v the smallest enclosing ball of A_v, denoted by B_v.
The depth of T is O(log m), therefore T can be constructed in O(m log m) time.
We now describe the computation of collision(A, tau(B)); that of score(A, tau(B)) is simi-
lar. In particular, for each atom b of B, we compute collision(A, tau(b)), the number of atoms
of A intersecting tau(b), by a top-down traversal of T. At an internal node v, if B_v and tau(b)
intersect, we recurse down the subtree rooted at v. Otherwise, we return. At a leaf node, if
tau(b) intersects the atom whose center is stored at the leaf node, we increment a counter that
records the number of collisions seen so far by 1. The resulting number after the traversal
is collision(A, tau(b)).
It is easy to verify that the above traversal for a specific tau(b) takes O(log m) time: at
any level of T, tau(b) intersects at most a constant number of nodes of T, and the depth of
the tree is O(log m). Hence it takes O((m + n) log m) time to compute collision(A, tau(B))
(and similarly score(A, tau(B))) for a particular configuration (A, tau(B)).
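The tree and its traversal can be sketched as follows. This is an illustrative stand-in, not the thesis implementation: instead of an oct-tree with smallest enclosing balls it splits atoms by octant around the cell center and stores a simple bounding ball per node, which preserves the prune-or-recurse logic.

```python
import math

def dist(u, v):
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in range(3)))

class Node:
    """Hierarchy over atoms; each atom is (center, radius)."""
    def __init__(self, atoms, leaf_size=1):
        # Bounding ball of the subtree (stands in for the enclosing ball B_v).
        lo = [min(a[0][i] for a in atoms) for i in range(3)]
        hi = [max(a[0][i] for a in atoms) for i in range(3)]
        self.center = [(lo[i] + hi[i]) / 2 for i in range(3)]
        self.radius = max(dist(self.center, c) + r for c, r in atoms)
        self.atoms, self.children = atoms, []
        if len(atoms) > leaf_size:
            buckets = {}
            for c, r in atoms:                   # octant of c w.r.t. the cell center
                key = tuple(c[i] >= self.center[i] for i in range(3))
                buckets.setdefault(key, []).append((c, r))
            if len(buckets) > 1:                 # avoid infinite recursion
                self.children = [Node(b, leaf_size) for b in buckets.values()]

def collisions(node, ball):
    """Number of atoms in the tree intersecting the query ball (center, radius)."""
    c, r = ball
    if dist(node.center, c) > node.radius + r:
        return 0                                 # prune: bounding balls are disjoint
    if not node.children:                        # leaf: test the stored atoms
        return sum(1 for ac, ar in node.atoms if dist(ac, c) <= ar + r)
    return sum(collisions(child, ball) for child in node.children)
```

Summing `collisions(tree, tau(b))` over all atoms b of B yields collision(A, tau(B)), matching the traversal described above.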
In practice, we build a similar tree for B and traverse both trees simultaneously
to compute collision(A, tau(B)) (as well as score(A, tau(B))) directly, instead of computing each
collision(A, tau(b)) one by one. Although this variant proves to be more efficient
in practice, it has the same asymptotic time complexity as the one described above. The
overall time complexity of CoarseAlign is O(k_A k_B (m + n) log m), where k_A and k_B
denote the number of features of A and B that survive preprocessing.
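The simultaneous traversal amounts to descending two ball hierarchies together, splitting whichever node is larger. The sketch below assumes the same illustrative node layout as before, here written as plain tuples (center, radius, atoms, children); the names are hypothetical.

```python
import math

def dist(u, v):
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in range(3)))

def dual_collisions(na, nb):
    """Count intersecting atom pairs by descending two hierarchies together.

    Each node is (center, radius, atoms, children), atoms a list of (center, radius)."""
    ca, ra, atoms_a, kids_a = na
    cb, rb, atoms_b, kids_b = nb
    if dist(ca, cb) > ra + rb:
        return 0                                  # bounding balls disjoint: prune
    if not kids_a and not kids_b:                 # two leaves: test atoms pairwise
        return sum(1 for p, rp in atoms_a for q, rq in atoms_b
                   if dist(p, q) <= rp + rq)
    if kids_a and (not kids_b or ra >= rb):       # descend into the larger node
        return sum(dual_collisions(k, nb) for k in kids_a)
    return sum(dual_collisions(na, k) for k in kids_b)
```

Whole subtrees of both proteins are pruned in one distance test, which is where the practical speed-up over the one-atom-at-a-time queries comes from.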
6.3 Experiments
In this section, we first provide a detailed experimental study of one protein complex. Next
we test our docking algorithm on a diverse data set of 25 bound protein complexes chosen
from the Protein Data Bank [31], which contains both easy docking problems where large
protrusions fill deep pockets and more difficult problems where the interface is relatively
flat. Finally, we provide results for testing the unbound protein benchmark [54].
A case study. We take the protein complex barnase/barstar (pdb-id 1brs, chains A and D)
as an example. The two chains have 864 and 693 atoms respectively. We use the msms
program (also available as part of the VMD software [105]) to generate a triangulation
of the molecular surface for each input chain; each resulting triangulation has several
thousand vertices. The left of Table 6.1 shows the number of features generated
        2-legged  3-legged  4-legged
    A   1044      696       156
    D   828       510       154

        2-legged  3-legged  4-legged
    A   112       205       50
    D   68        160       49

Table 6.1: k-legged features for chains A and D of 1brs on the left. Only the large ones are on the right.
from the four different types of maxima of the elevation function: a k-legged feature is
derived from a k-legged maximum. On the right, we show the number of large features,
namely those whose length and elevation exceed fixed thresholds (elevation at least 0.2):
these features are the inputs to our coarse alignment algorithm. We note that a significantly
higher percentage of 3-legged features are large compared to those of 4-legged and
especially 2-legged features.
Given the two sets of large features, algorithm CoarseAlign generates a family C of
valid configurations ranked by score. The running time is around 3 minutes on a single-processor
PIII 1GHz machine. Each configuration in C corresponds to a transformation for chain D.
For each transformation tau computed from C, we compute the rmsd (root mean squared
deviation) between the centers of all atoms from chain D and those of tau(D), and use it
to measure how good our predictions are: the smaller the rmsd of a configuration, the
closer it is to the native docking position. The rmsd for the top-ranked configuration in C
is 1.59 A, and 6 of the top 10 configurations have comparably small rmsd. Next consider
only a subset C', the top 100 ranked configurations from C. We refine each configuration
in C' by applying the local improvement heuristic [61] and re-rank C' afterwards based on
        After LocalImprove        Before LocalImprove
rank    score    rmsd             rank    #coll    rmsd
1       359      0.54             12      24       3.23
2       338      0.80             5       48       2.42
3       328      0.72             1       23       1.59
4       314      0.80             4       49       3.57
5       311      0.91             2       39       1.70
6       310      0.78             59      12       2.84
7       307      1.50             3       29       2.32
8       281      1.47             11      18       3.07
9       251      2.09             14      16       3.00
10      213      39.96            76      29       39.39

Table 6.2: Performance of the algorithm (including refinement) on the protein complex barnase/barstar. Only configurations with at most a fixed number of collisions are kept after the local improvement heuristic. The right side shows the corresponding ranking and number of collisions before applying the local improvement heuristic.
the new scores. The results, shown in Table 6.2, demonstrate that CoarseAlign generates
multiple useful poses for chains A and D that can be refined to yield a near-native final
configuration.
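The rmsd used throughout these experiments is the standard root mean squared deviation over corresponding atom centers; a minimal sketch, with points as 3-tuples:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equal-length lists of 3D points."""
    assert len(coords_a) == len(coords_b) and coords_a
    s = sum(sum((a[i] - b[i]) ** 2 for i in range(3))
            for a, b in zip(coords_a, coords_b))
    return math.sqrt(s / len(coords_a))
```

Here `coords_a` would hold the atom centers of chain D in the native complex and `coords_b` those of tau(D); no re-superposition is performed, since the configurations are compared in the frame of the fixed chain.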
More bound protein complexes. We extend our experiments to a diverse set of 25
bound protein complexes obtained from the PDB. After computing the elevation function for
all protein surfaces, we compute the set of features for each chain and remove the features
that are not large. Usually, the number of remaining features for each chain is roughly
the same as the number of atoms of that chain. Next, we apply CoarseAlign to these
feature sets. In Table 6.3, we show a low-rmsd (less than 5 A) configuration for each
protein complex, as well as its rank (by score) in the set of configurations returned by
CoarseAlign, using the same parameter settings (feature-length threshold and collision
bound) for all complexes. Note that in all but one case, we have at least one configuration
with low rmsd among the top 100 configurations. The last column shows the running time
of algorithm CoarseAlign. It does not include the time to compute the molecular surfaces
and the elevation function.
A configuration is of type i-j if it is produced by aligning an i-legged feature from one
chain with a j-legged feature from the other. Our experimental results indicate that
4-legged features seldom give rise to good configurations (i.e., those with rmsd less than
pdbid          rank   #coll   rmsd   time
1A22 (A, B)    2      23      2.75   20
1BI8 (A, B)    12     43      2.48   26
1BRS (A, D)    1      11      1.52   3
1BUH (A, B)    5      14      1.85   2
1BXI (B, A)    3      34      2.54   8
1CHO (E, I)    1      14      2.71   3
1CSE (E, I)    2      22      2.21   9
1DFJ (I, E)    78     11      3.09   27
1F47 (B, A)    15     1       1.49   1
1FC2 (D, C)    5      49      4.13   6
1FIN (A, B)    11     44      3.70   41
1FS1 (B, A)    1      29      1.62   5
1JAT (A, B)    522    20      1.20   9
1JLT (A, B)    8      23      3.64   10
1MCT (A, I)    1      27      3.49   3
1MEE (A, I)    1      23      1.33   9
1STF (E, I)    1      43      1.18   8
1TEC (E, I)    9      54      3.07   7
1TGS (Z, I)    1      46      2.61   6
1TX4 (A, B)    2      4       3.35   14
2PTC (E, I)    1      18      4.55   6
3HLA (A, B)    1      19      1.87   16
3SGB (E, I)    1      38      3.21   5
3YGS (C, P)    6      7       1.07   6
4SGB (E, I)    10     33      2.33   4

Table 6.3: Performance of CoarseAlign on 25 protein complexes. Column 2 shows the rank of the first configuration with low rmsd (less than 5 A). The last column shows the running time of CoarseAlign in minutes.
        After Improve               Before Improve
pdbid   rank   score   rmsd        rank   score   #coll   rmsd
1A22    1      475     1.08        2      270     23      2.75
1BI8    1      234     29.88       62     211     10      30.0
1BRS    1      349     1.07        2      389     37      3.01
1BUH    1      256     0.61        5      209     14      1.85
1BXI    1      289     0.63        16     217     21      5.59
1CHO    1      305     0.99        1      243     14      2.71
1CSE    1      317     0.82        23     273     36      2.57
1DFJ    1      220     1.28        78     178     11      3.09
1F47    1      221     0.56        15     129     1       1.49
1FC2    2      200     1.33        5      356     49      4.13
1FIN    1      413     0.61        34     382     54      9.94
1FS1    1      326     0.89        2      309     27      1.59
1JAT    1      288     0.87        522    168     21      1.20
1JLT    1      310     1.77        3      232     14      6.17
1MCT    1      322     0.32        84     233     34      3.57
1MEE    1      372     0.57        1      373     23      1.33
1STF    1      314     0.79        1      408     43      1.18
1TEC    1      304     1.28        10     362     51      4.51
1TGS    1      348     0.44        2      227     13      2.71
1TX4    1      355     0.36        80     243     25      4.34
2PTC    1      314     0.66        1      265     18      4.55
3HLA    1      416     0.70        1      246     19      1.97
3SGB    1      257     2.24        1      320     38      3.21
3YGS    1      209     0.85        6      216     7       1.03
4SGB    1      266     2.50        10     260     33      2.33

Table 6.4: Performance of the algorithm on the 25 protein complexes after and before local improvement. Columns 2 to 4 show the first configuration with low rmsd after locally improving the top k coarse alignments and re-ranking them. Columns 5 to 8 show the corresponding configuration before local improvement and re-ranking.
5 A). In fact, no 4-legged feature is involved in producing the best configuration in 24 of the 25
cases (the only exception is 2PTC), and the percentage of good configurations involving
4-legged features is usually small. On the other hand, the computation of 4-legged
features is the most expensive (refer to Chapter 4).
For each protein complex, we apply the local improvement heuristic [61] to the top k
configurations, and then re-rank them based on the new scores. The results are illustrated in
Table 6.4, where we choose k = 100 (except for 1JAT, where we choose a larger k, since
its first low-rmsd coarse configuration has rank 522). We consider only those configurations
with at most a fixed number of collisions. In all but one case, the top-ranked (or, for 1FC2,
the second-ranked) configuration after local improvement has a small rmsd. In the only
remaining case, we can obtain a small rmsd for the top-ranked configuration if we relax
the bound on the number of collisions. In other words, in 23 out of these 25 test complexes,
our coarse alignment algorithm in conjunction with the local-search heuristic [61] can
predict an accurate near-native docking configuration without false positives.
Unbound protein benchmark. We further tested CoarseAlign on the protein-protein
docking benchmark provided in [54]. We omitted the seven complexes classified as difficult
in [54] because they have significantly different conformations in the unbound vs. bound
structures. We also omitted complexes 1IAI, 1WQ1 and 2PCC because of difficulties in
generating molecular surfaces of the required quality. Of the 49 remaining complexes, 25
are so-called bound-unbound cases in which one of the components is rigid. For each
complex, we fix one chain as A, which is the rigid chain for the bound-unbound cases, or
the receptor for the unbound-unbound cases. We generate a set C of potential positions
for the other component, B. For each transformation tau computed from C, we measure
the rmsd between the interface C-alpha atoms of B and those of tau(B), and refer to it as
rmsd_I. (For unbound protein complexes, rmsd_I serves as a better measure than the rmsd
used before, due to the flexibility, and thus the unreliability, of the positions of non-
C-id     Top 2000 (C')           All outputs (C)
         rmsd_I    hits     rank      rmsd_I    rmsd     size
1ACB     3.70      20       3,951     1.75      1.81     14,426
1AVW     5.51      8        4,698     5.42      6.38     23,565
1BRC     4.66      35       1,629     4.66      5.72     12,770
1BRS     1.60      7        426       1.60      1.54     11,607
1CGI     3.04      5        695       3.04      3.32     10,135
1CHO     2.35      27       92        2.35      2.69     11,815
1CSE     3.15      7        15,271    2.74      2.53     21,068
1DFJ     6.44      2        1,433     6.44      6.09     35,231
1FSS     7.65      2        10,721    5.15      5.45     25,609
1MAH     2.78      4        1,561     2.78      3.45     25,402
1TGS     5.27      18       543       5.27      6.07     11,383
1UGH     7.95      3        8,268     7.16      7.40     14,656
2KAI     6.55      26       2,560     3.41      4.71     13,478
2PTC     4.55      32       4,983     4.16      7.85     13,929
2SIC     4.04      27       76        4.04      7.71     20,065
2SNI     6.34      10       4,894     4.58      4.72     15,830
1PPE     4.13      10       37        4.13      5.07     7,660
1STF     1.41      8        1         1.41      1.92     15,082
1TAB     3.78      3        48        3.78      5.48     8,296
1UDI     4.50      3        1,124     4.50      7.38     21,133
2TEC     1.42      5        6         1.42      1.53     21,134
4HTC     5.94      2        396       5.94      7.39     14,032
1AHW     9.38      1        2,781     4.37      10.44    32,919
1BVK     1.95      5        1,189     1.95      3.69     24,611
1DQJ     4.59      7        710       4.59      6.21     28,694
1MLC     3.71      7        6,949     3.32      7.13     29,747
1WEJ     6.27      3        4,659     5.86      6.01     18,194
1BQL     6.98      11       10,388    4.39      4.64     23,308
1EO8     2.31      1        11        2.31      3.11     45,512
1FBI     6.49      8        11,783    2.30      2.08     26,036
1JHL     3.47      18       14,185    2.61      3.27     32,091
1KXQ     5.99      2        1,495     5.99      15.86    37,218
1KXT     4.52      12       153       4.52      10.90    39,240
1KXV     2.48      7        321       2.48      3.54     46,368
1MEL     2.21      8        73        2.21      2.55     17,741
1NCA     1.75      7        621       1.75      1.92     49,600
1NMB     7.18      7        14,202    2.72      5.11     42,066
1QFU     1.97      4        12        1.97      3.07     47,693
2JEL     3.46      19       115       3.46      4.39     34,072
2VIR     1.08      11       1         1.08      1.86     40,813
1AVZ     4.06      8        4,243     3.52      4.08     7,895
1L0Y     2.75      2        1,136     2.75      3.83     34,044
2MTA     2.91      40       19,167    2.07      2.16     36,903
1A0O     5.95      3        3,950     4.35      5.20     9,113
1ATN     1.52      8        1         1.52      2.33     50,729
1GLA     -         -        25,307    2.82      2.83     33,879
1IGC     2.48      3        3,260     2.06      2.59     25,303
1SPB     2.83      3        617       2.83      3.03     13,728
2BTF     5.02      2        10,132    3.28      3.63     33,480

Table 6.5: Results on the protein-protein docking benchmark. C-id means complex-id.
backbone atoms.) As the unbound structures (i.e., the input to our algorithm) provided in
the benchmark are superimposed onto their crystallized counterparts, this value is close
to the rmsd_I measured between tau(B) and the crystallized structure of B.
Now take C', the top 2000 ranked configurations from C. The results are shown
in Table 6.5: column 2 gives the smallest rmsd_I generated from C', and column 3 the
number of configurations from C' with an rmsd_I smaller than 10 A. Columns 4 to 6
provide information (rank, rmsd_I, and corresponding rmsd) on the configuration in
C with the smallest rmsd_I, and the size of C is shown in column 7.
Our results demonstrate a number of favorable characteristics of our algorithm. First,
within the relatively small set of 2000 top-scoring configurations (C'), 38 out of 49 com-
plexes yield a configuration below 6 A rmsd_I. All but one complex yield a configuration
below the 10 A cut-off needed as input for the hierarchical, progressive refinement protocol
in [41, 42]. The fact that most complexes generate multiple hits (column 3) increases the
probability that a local refinement will not be trapped in a local minimum and will instead
find a correct solution. Second, within all the configurations generated (C, at most 50,000),
47 out of 49 complexes yield a configuration below 6 A, typically within the top 10,000
scores. All 49 generate at least one configuration below 7.2 A, in each case within the
top 26,000 configurations. How these coarse alignments can be re-ranked to yield high-scoring
solutions with low rmsd remains to be investigated. We also remark that it is possible to
further reduce the size of C by clustering similar configurations [86].
6.4 Notes and discussion

We have presented in this chapter an efficient alignment-based algorithm to compute a set
of coarse configurations for two rigid input proteins. We have shown in Section 6.3
that, when combined with the local improvement heuristic from [61], our algorithm can
predict an accurate near-native docking configuration for 23 out of 25 test bound protein
complexes, without producing any false positives. When tested on the unbound protein
docking benchmark [54], our algorithm rapidly produces a relatively short list
of potential configurations that can serve as inputs to other local improvement methods that
allow protein flexibility and thus have the potential to solve the unbound protein docking
problem.
Comparisons. Current approaches for the bound docking stage differ in how they sample
the search space and how they evaluate the docking score. FFT-type methods discretize
the search space in a rather uniform way. They produce more accurate predictions for
recombining known complexes, but at a much higher computational cost. For the case of
unbound docking, it is possible to run those algorithms at a low resolution to provide some
tolerance for flexibility. However, if the resolution is too low, there is a danger of missing
good alignments completely, while if the resolution is high, too many configurations
will be generated, and selecting a small set of good candidates for the refinement
stage is not a trivial problem. Alignment-based methods tend to sample the search space
in a more selective manner guided by shape complementarity. Methods in this category
designed prior to our algorithm align feature points residing at protrusions and cavities.
In particular, given two sets of such points representing A and B, they align all possible
pairs of points from one set with all possible pairs from the other to generate potential
rigid motions. Since the number of feature pairs computed in our algorithm is similar to
the number of feature points computed in their algorithms, they inspect significantly more
transformations than ours (on the order of k^4 vs. k^2, where k is the number of feature
pairs or points), many of them meaningless or duplicates. In contrast, by aligning features
computed from the elevation function, we sample the transformation space in a much
sparser manner than previous approaches, focusing only on potentially good docking
locations. The size of the output of our algorithm is much smaller (on the order of 10^4
configurations without clustering for 1BRS), and as the configurations are generated by
fitting features, we expect to capture reasonable configurations unless the proteins undergo
dramatic conformational changes. We also comment that it is possible to further improve
the speed of our algorithm by the geometric hashing technique, as in [86].
Bibliography
[1] http://www.marketdata.nasdaq.com/mr4b.html.
[2] Protein data bank. http://www.rcsb.org/pdb/.
[3] P. Agarwal and K. R. Varadarajan. Efficient algorithms for approximating polygonal chains. Discrete Comput. Geom., 23:273–291, 2000.

[4] P. K. Agarwal, H. Edelsbrunner, J. Harer, and Y. Wang. Extreme elevation on a 2-manifold. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 357–365, 2004.

[5] P. K. Agarwal, H. Edelsbrunner, and Y. Wang. Computing the writhing number of a knot. Discrete Comput. Geom., 32:37–53, 2004.

[6] P. K. Agarwal, L. J. Guibas, A. Nguyen, D. Russel, and L. Zhang. Collision detection for deforming necklaces. To appear.

[7] P. K. Agarwal, S. Har-Peled, N. H. Mustafa, and Y. Wang. Near-linear time approximation algorithms for curve simplification in two and three dimensions. Algorithmica, to appear.

[8] P. K. Agarwal, S. Har-Peled, M. Sharir, and Y. Wang. Hausdorff distance under translation for points and balls. In Proc. 19th Annu. ACM Sympos. Comput. Geom., pages 282–291, 2003.

[9] P. K. Agarwal and J. Matousek. Ray shooting and parametric search. SIAM J. Comput., 22:540–570, 1993.

[10] P. K. Agarwal and J. Matousek. On range searching with semialgebraic sets. Discrete Comput. Geom., 11:393–418, 1994.

[11] P. K. Agarwal, M. Sharir, and S. Toledo. Applications of parametric searching in geometric optimization. J. Algorithms, 17:292–318, 1994.

[12] O. Aichholzer, H. Alt, and G. Rote. Matching shapes with a reference point. Intl. J. Comput. Geom. and Appl., 7:349–363, 1997.

[13] J. Aldinger, I. Klapper, and M. Tabor. Formulae for the calculation and estimation of writhe. J. Knot Theory and Its Ramifications, 4:343–372, 1995.
[14] H. Alt, B. Behrends, and J. Bloemer. Approximate matching of polygonal shapes. Annu. Math. Artif. Intell., 13:251–266, 1995.

[15] H. Alt, P. Brass, M. Godau, C. Knauer, and C. Wenk. Computing the Hausdorff distance of geometric patterns and shapes. Discrete Comput. Geom. – the Goodman-Pollack Festschrift, pages 65–76, 2003.

[16] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry, pages 75–91, 1995.

[17] H. Alt and L. Guibas. Discrete geometric shapes: matching, interpolation, and approximation. Handbook of Computational Geometry (J.-R. Sack and J. Urrutia, eds.), 1999.

[18] E. Althaus, O. Kohlbacher, H. P. Lenhof, and P. Muller. A combinatorial approach to protein docking with flexible side-chains. In Proceedings of the Fourth International Conference on Computational Molecular Biology (RECOMB), pages 8–11, 2000.

[19] N. Amenta and R. Kolluri. The medial axis of a union of balls. Comput. Geom: Theory Appl., 20:25–37, 2001.

[20] A. M. Amilibia and J. J. N. Ballesteros. The self-linking number of a closed curve in R^3. Journal of Knot Theory and Its Ramifications, 9(4):491–503, 2000.

[21] A. Amir, E. Porat, and M. Lewenstein. Approximate subset matching with Don't Cares. In Proc. 12th ACM-SIAM Symp. Discrete Algorithms, pages 305–306, 2001.

[22] M. Ankerst, G. Kastenmuller, H. Kriegel, and T. Seidl. 3D shape histograms for similarity search and classification in spatial databases. In Proc. of the 6th Int. Sympos. on Spatial Databases, volume 1651, pages 207–226, 1999.

[23] V. I. Arnold. Catastrophe Theory. Springer-Verlag, Berlin, Germany, 1984.

[24] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 147–155, 2002.

[25] M. J. Atallah. A linear time algorithm for the Hausdorff distance between convex polygons. Inform. Process. Lett., 17:207–209, 1983.

[26] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294(5):93–96, 2001.
[27] Y. A. Ban, H. Edelsbrunner, and J. Rudolph. Interface surfaces for protein-protein complexes. In RECOMB, pages 205–212, 2004.

[28] T. Banchoff. Self-linking numbers of space polygons. Indiana Univ. Math. J., 25:1171–1188, 1976.

[29] G. Barequet, D. Z. Chen, O. Daescu, M. T. Goodrich, and J. Snoeyink. Efficiently approximating polygonal paths in three and higher dimensions. Algorithmica, 33(2):150–167, 2002.

[30] W. R. Bauer, F. H. C. Crick, and J. H. White. Supercoiled DNA. Scientific American, 243:118–133, 1980.

[31] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. H. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res., 28:235–242, 2000.

[32] S. Bespamyatnikh, V. Choi, H. Edelsbrunner, and J. Rudolph. Accurate protein docking by shape complementarity alone, 2004.

[33] K. Brakke. Surface Evolver software documentation. http://www.geom.umn.edu/software/evolver/.

[34] C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing Inc., 2nd edition, 1999.

[35] M. L. Bret. Catastrophic variation of twist and writhing of circular DNAs with constraint? Biopolymers, 18:1709–1725, 1979.

[36] J. W. Bruce and P. J. Giblin. Curves and Singularities. Cambridge Univ. Press, England, second edition, 1992.

[37] D. Brutlag. DNA topology and topoisomerases, 2000. http://cmgm.stanford.edu/biochem201/Handouts/DNAtopo.html.

[38] G. Buck. Four-thirds power law for knots and links. Nature, 392:238–239, 1998.

[39] R. M. Burnett and J. S. Taylor. DARWIN: A program for docking flexible molecules. Proteins: Structure, Function, and Genetics, 41:173–191, 2000.

[40] G. Calugareanu. Sur les classes d'isotopie des noeuds tridimensionnels et leurs invariants. Czechoslovak Math. J., 11:588–625, 1961.
[41] C. J. Camacho, D. W. Gatchell, S. R. Kimura, and S. Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40:525–537, 2000.

[42] C. J. Camacho and S. Vajda. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci., 98:10636–10641, 2001.

[43] J. Cantarella. On comparing the writhe of a smooth curve to the writhe of an inscribed polygon, 2002.

[44] J. Cantarella, D. DeTurck, and H. Gluck. Upper bounds for the writhing of knots and the helicity of vector fields. In Proceedings of the Conference in Honor of the 70th Birthday of Joan Birman, J. Gilman, X. Lin, W. Menasco (eds.), 2000.

[45] J. Cantarella, R. Kusner, and J. Sullivan. Tight knot values deviate from linear relation. Nature, 392:237–238, 1998.

[46] D. Cardoze and L. Schulman. Pattern matching for spatial point sets. In Proc. 39th Annu. IEEE Sympos. Found. Comput. Sci., pages 156–165, 1998.

[47] O. Carugo and S. Pongor. Protein fold similarity estimated by a probabilistic approach based on Cα–Cα distance comparison. Journal of Molecular Biology, 315:887–898, 2002.

[48] F. Cazals, F. Chazal, and T. Lewiner. Molecular shape analysis based upon the Morse-Smale complex and the Connolly function. In Proc. 19th Annu. ACM Sympos. Comput. Geom., 2003.

[49] H. S. Chan and K. A. Dill. The effects of internal constraints on the configurations of chain molecules. Journal of Chemical Physics, 92(5):3118–3135, 1990.

[50] W. S. Chan and F. Chin. Approximation of polygonal curves with minimum number of line segments. In Proc. 3rd Annu. Internat. Sympos. Algorithms Comput., volume 650 of Lecture Notes Comput. Sci., pages 378–387. Springer-Verlag, 1992.

[51] B. Chazelle. Cutting hyperplanes for divide-and-conquer. Discrete Comput. Geom., 9:145–158, 1993.

[52] B. Chazelle, H. Edelsbrunner, L. Guibas, M. Sharir, and J. Stolfi. Lines in space: combinatorics and algorithms. Algorithmica, 15:428–447, 1996.
[53] R. Chen, L. Li, and Z. Weng. ZDOCK: An initial-stage protein docking algorithm. Proteins, 52(1):80–87, 2003.

[54] R. Chen, J. Mintseris, J. Janin, and Z. Weng. A protein-protein docking benchmark. Proteins, 52:88–91, 2003.

[55] H.-L. Cheng, T. K. Dey, H. Edelsbrunner, and J. Sullivan. Dynamic skin triangulation. Discrete Comput. Geom., 25:525–568, 2001.

[56] S. S. Chern. Curves and surfaces in Euclidean space. Studies in Global Geometry and Analysis, pages 16–56, 1967.

[57] L. P. Chew, D. Dor, A. Efrat, and K. Kedem. Geometric pattern matching in d-dimensional space. Discrete Comput. Geom., 21:257–274, 1999.

[58] L. P. Chew, M. T. Goodrich, D. P. Huttenlocher, K. Kedem, J. M. Kleinberg, and D. Kravets. Geometric pattern matching under Euclidean motion. Comput. Geom. Theory Appl., 7:113–124, 1997.

[59] L. P. Chew, D. Huttenlocher, K. Kedem, and J. Kleinberg. Fast detection of common geometric substructure in proteins. In Proc. 3rd Int. Conf. Comput. Mol. Biol., 1999.

[60] I. Choi, J. Kwon, and S. Kim. Local feature frequency profile: A method to measure structural similarity in proteins. PNAS, 101(11):3797–3802, 2004.

[61] V. Choi, P. K. Agarwal, H. Edelsbrunner, and J. Rudolph. Local search heuristic for rigid protein docking. In 4th Workshop on Algorithms in Bioinformatics, 2004.

[62] D. Cimasoni. Computing the writhe of a knot. Journal of Knot Theory and Its Ramifications, 10(3):387–395, 2001.

[63] K. Cole-McLaughlin, H. Edelsbrunner, J. Harer, V. Natarajan, and V. Pascucci. Loops in Reeb graphs of 2-manifolds. Discrete Comput. Geom. To appear.

[64] M. L. Connolly. Molecular surface review.

[65] M. L. Connolly. Analytical molecular surfaces calculation. Journal of Applied Crystallography, 16:548–558, 1983.

[66] M. L. Connolly. Measurement of protein surface shape by solid angles. J. Mol. Graphics, 4:3–6, 1986.
[67] M. L. Connolly. Shape complementarity at the hemoglobin α1β1 subunit interface. Biopolymers, 25:1229–1247, 1986.

[68] T. E. Creighton. Proteins: structures and molecular properties. W. H. Freeman and Company, New York, second edition, 1993.

[69] F. H. C. Crick. The packing of alpha-helices: simple coiled coils. Acta Crystallography, 6:689–697, 1953.

[70] F. H. C. Crick. Linking numbers and nucleosomes. Proc. Natl. Acad. Sci. USA, 73(8):2639–2643, 1976.

[71] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational geometry: algorithms and applications. Springer, 1997.

[72] Y. Diao and C. Ernst. The complexity of lattice knots. Topology Appl., pages 1–9, 1998.

[73] K. A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24(6):1501–1509, 1985.

[74] K. A. Dill. Dominant forces in protein folding. Biochemistry, 29(31):7132–7135, 1990.

[75] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer, 10(2):112–122, Dec. 1973.

[76] H. Edelsbrunner, J. Harer, and A. Zomorodian. Hierarchical Morse complexes for piecewise linear 2-manifolds. Discrete Comput. Geom., 30:87–108, 2003.

[77] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[78] H. Edelsbrunner and E. P. Mucke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graphics, 9:66–104, 1990.

[79] M. H. Eggar. On White's formula. Journal of Knot Theory and Its Ramifications, 9(5):611–615, 2000.
[80] I. Eidhammer, I. Jonassen, and W. R. Taylor. Structure comparison and structure patterns. J. Comput. Biol., 7:685–716, 2000.

[81] A. H. Elcock, D. Sept, and J. A. McCammon. Computer simulation of protein-protein interactions. J. Phys. Chem., 105:1504–1518, 2001.

[82] M. A. Erdmann. Protein similarity from knot theory and geometric convolution. In RECOMB, pages 195–204, 2004.

[83] R. Estkowski and J. S. B. Mitchell. Simplifying a polygonal subdivision while keeping it simple. In Proc. 17th Annu. ACM Sympos. Comput. Geom., pages 40–49, 2001.

[84] A. Fersht. Structure and mechanism in protein science. W. H. Freeman and Company, New York, third edition, 2000.

[85] P. Finn and L. Kavraki. Computational approaches to drug design. Algorithmica, 25:347–371, 1999.

[86] D. Fischer, S. L. Lin, H. Wolfson, and R. Nussinov. A geometry-based suite of molecular docking processes. J. Mol. Biol., 248:459–477, 1995.

[87] W. Fraczek. Mean sea level, GPS, and the geoid. ArcUser Online, 2003. ESRI web site: www.esri.com/news/arcuser/0703/summer-2003.html.

[88] F. B. Fuller. The writhing number of a space curve. In Proc. Natl. Acad. Sci. USA, volume 68, pages 815–819, 1971.

[89] F. B. Fuller. Decomposition of the linking number of a closed ribbon: a problem from molecular biology. In Proc. Natl. Acad. Sci. USA, volume 75, pages 3557–3561, 1978.

[90] H. A. Gabb, R. M. Jackson, and M. J. Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272(1):106–120, 1997.

[91] E. J. Gardiner, P. Willett, and P. J. Artymiuk. Protein docking using a genetic algorithm. Proteins: Structure, Function, and Genetics, 44:44–56, 2001.

[92] M. Godau. A natural metric for curves: Computing the distance for polygonal chains and approximation algorithms. In Proc. of the 8th Annual Symposium on Theoretical Aspects of Computer Science, pages 127–136, 1991.
[93] B. B. Goldman and W. T. Wipke. QSD: quadratic shape descriptors. 2. Molecular docking using quadratic shape descriptors (QSDock). Proteins, 38:79–94, 2000.
[94] D. Goldman, S. Istrail, and C. H. Papadimitriou. Algorithmic aspects of protein structure similarity. In IEEE Symposium on Foundations of Computer Science, pages 512–522, 1999.
[95] M. T. Goodrich, J. S. Mitchell, and M. W. Orletsky. Practical methods for approximate geometric pattern matching under rigid motion. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 103–112, 1994.
[96] L. J. Guibas, J. E. Hershberger, J. S. Mitchell, and J. Snoeyink. Approximating polygons and subdivisions with minimum link paths. Internat. J. Comput. Geom. Appl., 3(4):383–415, Dec. 1993.
[97] I. Halperin, B. Ma, H. Wolfson, and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Genetics, 47:409–443, 2002.
[98] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94–103, 2001.
[99] A. Hatcher and J. Wagoner. Pseudo-isotopies of compact manifolds. Société Mathématique de France, 1973.
[100] P. Heckbert and M. Garland. Survey of polygonal surface simplification algorithms. In SIGGRAPH 97 Course Notes: Multiresolution Surface Modeling, 1997.
[101] J. Hershberger and J. Snoeyink. An O(n log n) implementation of the Douglas-Peucker algorithm for line simplification. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 383–384, 1994.
[102] A. Hillisch and R. Hilgenfeld, editors. Modern Methods of Drug Discovery. Springer-Verlag, 2003.
[103] L. Holm and C. Sander. Mapping the protein universe. Science, 273:595–602, 1996.
[104] L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Research, 25(1):231–234, 1997.
[105] W. Humphrey, A. Dalke, and K. Schulten. VMD – Visual Molecular Dynamics. J. Molec. Graphics, 14:33–38, 1996.
[106] D. P. Huttenlocher, K. Kedem, and J. M. Kleinberg. On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane. In Proc. 8th Annu. ACM Sympos. Comput. Geom., pages 110–120, 1992.
[107] D. P. Huttenlocher, K. Kedem, and M. Sharir. The upper envelope of Voronoi surfaces and its applications. Discrete Comput. Geom., 9:267–291, 1993.
[108] H. Imai and M. Iri. An optimal algorithm for approximating a piecewise linear function. Journal of Information Processing, 9(3):159–162, 1986.
[109] H. Imai and M. Iri. Polygonal approximations of a curve – formulations and algorithms. In G. T. Toussaint, editor, Computational Morphology, pages 71–86. North-Holland, Amsterdam, Netherlands, 1988.
[110] P. Indyk, R. Motwani, and S. Venkatasubramanian. Geometric matching under noise: Combinatorial bounds and algorithms. In Proc. 10th Annu. ACM-SIAM Sympos. Discrete Alg., pages 457–465, 1999.
[111] R. M. Jackson, H. A. Gabb, and M. J. E. Sternberg. Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J. Mol. Biol., 276:265–285, 1998.
[112] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe. Protein flexibility predictions using graph theory. Proteins: Structure, Function, and Genetics, 44:150–165, 2001.
[113] J. Janin and C. Chothia. The structure of protein-protein recognition sites. J. Biol. Chem., 265:16027–16030, 1990.
[114] J. Janin and S. J. Wodak. The structural basis of macromolecular recognition. Adv. Protein Chem., 61:9–73, 2002.
[115] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol., 267:727–748, 1997.
[116] S. Jones and J. M. Thornton. Principles of protein-protein interactions. Proc. Natl. Acad. Sci., 93(1):13–20, 1996.
[117] R. D. Kamien. Local writhing dynamics. Eur. Phys. J. B, 1:1–4, 1998.
[118] G. Kastenmuller, H. Kriegel, and T. Seidl. Similarity search in 3D protein databases, 1998.
[119] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci., 89:2195–2199, 1992.
[120] K. Kedem, R. Livne, J. Pach, and M. Sharir. On the union of Jordan regions and collision-free translational motion amidst polygonal obstacles. Discrete Comput. Geom., 1:59–71, 1986.
[121] K. Klenin and J. Langowski. Computation of writhe in modeling of supercoiled DNA. Biopolymers, 54:307–317, 2000.
[122] P. Koehl. Protein structure similarities. Curr. Opin. Struct. Biol., 11:348–353, 2001.
[123] P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature Structural Biology, 6(2):108–111, 1999.
[124] W. G. Krebs and M. Gerstein. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Res., 28:1665–1675, 2000.
[125] A. R. Leach. Molecular Modelling: Principles and Applications. Pearson Education Limited, 1996.
[126] B. Lee and F. M. Richards. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol., 55:379–400, 1971.
[127] H. Lenhof. An algorithm for the protein docking problem. Bioinformatics: From Nucleic Acids and Proteins to Cell Metabolism, pages 125–139, 1995.
[128] M. Levitt. Protein folding by restrained energy minimization and molecular dynamics. J. Mol. Biol., 170:723–764, 1983.
[129] M. Levitt and M. Gerstein. A unified statistical framework for sequence comparison and structure comparison. Proc. Nat. Acad. Sci., 95:5913–5920, 1998.
[130] J. Liang, H. Edelsbrunner, and C. Woodward. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci., 7:1884–1897, 1998.
[131] S. Loncaric. A survey of shape analysis techniques. Pattern Recognition, 31(8):983–1001, 1998.
[132] A. C. Martin, C. A. Orengo, E. G. Hutchinson, S. Jones, M. Karmirantzou, R. A. Laskowski, J. B. Mitchell, C. Taroni, and J. M. Thornton. Protein folds and functions. Struct. Fold. Des., 6:875–884, 1998.
[133] N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. J. ACM, 30:852–865, 1983.
[134] A. Melkman and J. O'Rourke. On polygonal chain approximation. In G. T. Toussaint, editor, Computational Morphology, pages 87–95. North-Holland, Amsterdam, Netherlands, 1988.
[135] J. Milnor. Morse Theory. Princeton Univ. Press, New Jersey, 1963.
[136] J. C. Mitchell, R. Kerr, and L. F. T. Eyck. Rapid atomic density measures for molecular shape characterization. J. Mol. Graph. Model., 19:324–329, 2001.
[137] G. Moont and M. J. E. Sternberg. Modelling protein-protein and protein-DNA docking. Bioinformatics – From Genomes to Drugs, 1:361–404, 2001.
[138] A. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540, 1995.
[139] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH – A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
[140] P. N. Palma, L. Krippahl, J. E. Wampler, and J. J. G. Moura. BIGGER: A new soft docking algorithm for predicting protein interactions. Proteins: Structure, Function, and Genetics, 39:178–194, 2000.
[141] F. M. G. Pearl, D. Lee, J. E. Bray, I. Sillitoe, A. E. Todd, A. P. Harrison, J. M. Thornton, and C. A. Orengo. Assigning genomic sequences to CATH. Nucleic Acids Research, 28(1):277–282, 2000.
[142] W. F. Pohl. The self-linking number of a closed space curve. J. Math. Mech.,17:975–985, 1968.
[143] D. W. Ritchie. Evaluation of protein docking predictions using Hex 3.1 in CAPRI rounds 1 and 2. Proteins, 52(1):98–106, 2003.
[144] D. W. Ritchie and G. J. L. Kemp. Protein docking using spherical polar Fourier correlations. Proteins, 39:178–194, 2000.
[145] P. Rogen and B. Fain. Automatic classification of protein structure by using Gauss integrals. PNAS, 100(1):119–124, 2003.
[146] H. Samet. Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods. Addison-Wesley, Reading, MA, 1989.
[147] B. Sandak, R. Nussinov, and H. J. Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J. Comput. Biol., 5:631–654, 1998.
[148] S. Seeger and X. Laboureux. Feature extraction and registration: An overview. Principles of 3D Image Analysis and Synthesis, pages 153–166, 2002.
[149] I. N. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11(9):739–747, 1998.
[150] M. L. Sierk and W. R. Pearson. Sensitivity and selectivity in protein structure comparison. Protein Science, 13:773–785, 2004.
[151] A. P. Singh and D. L. Brutlag. Protein structure alignment: a comparison of methods. 2001.
[152] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Comput. Sys.Sci., 26(3):362–391, 1983.
[153] G. R. Smith and M. J. E. Sternberg. Prediction of protein-protein interactions by docking methods. Current Opinion in Structural Biology, 12:29–35, 2002.
[154] B. Solomon. Tantrices of spherical curves. Amer. Math. Monthly, 103:30–39, 1996.
[155] R. Srinivasan and G. D. Rose. LINUS: A hierarchic procedure to predict the fold of a protein. Proteins: Structure, Function, and Genetics, 22:1143–1159, 1995.
[156] M. J. E. Sternberg. Protein Structure Prediction: A Practical Approach. Oxford University Press, 1996.
[157] D. Swigon, B. D. Coleman, and I. Tobias. The elastic rod model for DNA and its application to the tertiary structure of DNA minicircles in mononucleosomes. Biophysical Journal, 74:2515–2530, 1998.
[158] M. B. Swindells, C. A. Orengo, D. T. Jones, E. G. Hutchinson, and J. M. Thornton. Contemporary approaches to protein structure classification. BioEssays, 20:849–891, 1998.
[159] M. Tabor and I. Klapper. The dynamics of knots and curves (Part I). Nonlinear Science Today, 4(1):7–13, 1994.
[160] W. R. Taylor, A. C. W. May, N. P. Brown, and A. Aszodi. Protein structure: geometry, topology and classification. Reports on Progress in Physics, 64:517–590, 2001.
[161] M. L. Teodoro, G. N. Phillips, and L. E. Kavraki. Understanding protein flexibility through dimensional reduction. Journal of Computational Biology, 10:617–634, 2003.
[162] A. Sali, E. Shakhnovich, and M. Karplus. How does a protein fold? Nature, 369:248–251, 1994.
[163] I. A. Vakser. Protein docking for low-resolution structures. Protein Engineering, 8:371–377, 1995.
[164] P. Veerapandian. Structure-based drug design. Marcel Dekker, 1997.
[165] R. C. Veltkamp and M. Hagedoorn. State of the art in shape matching. In M. S. Lew, editor, Principles of Visual Information Retrieval, Series in Advances in Pattern Recognition, 2001.
[166] A. V. Vologodskii, V. V. Anshelevich, A. V. Lukashin, and M. D. Frank-Kamenetskii. Statistical mechanics of supercoils and the torsional stiffness of the DNA double helix. Nature, 280:294–298, 1979.
[167] R. Weibel. Generalization of spatial data: principles and selected algorithms. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, editors, Algorithmic Foundations of Geographic Information Systems. Springer-Verlag, Berlin Heidelberg New York, 1997.
[168] J. White. Self-linking and the Gauss integral in higher dimensions. Amer. J. Math.,XCI:693–728, 1969.
[169] D. L. Wild and M. A. S. Saqi. Structural proteomics: Inferring function from protein structure. Current Proteomics, 1:59–65, 2004.
Biography
Yusu Wang was born on June 28, 1976 in Shanxi Province, China. After receiving her BS from Tsinghua University in 1998, she joined the Department of Computer Science at Duke University, where she received her MS in 2000 and is now pursuing a PhD in the area of geometric computing. Her research focuses on designing efficient computational methods for shape analysis problems, especially for protein structure analysis, by combining both geometry and topology. She is currently involved in BioGeometry, an interdisciplinary collaborative project among Duke University, Stanford University, UNC-Chapel Hill, and NC A&T, which addresses fundamental computational problems in representing, searching, simulating, analyzing, and visualizing biological structures.
Related Publications.
1. P. K. AGARWAL, H. EDELSBRUNNER AND Y. WANG. Computing the writhing number of a knot. Discrete Comput. Geom. 32 (2004), 37–53.
2. S. HAR-PELED AND Y. WANG. Shape fitting with outliers. SIAM J. Comput., to appear.
3. P. K. AGARWAL, S. HAR-PELED, N. MUSTAFA AND Y. WANG. Near-linear time approximation algorithms for curve simplification. Algorithmica, to appear.
4. P. K. AGARWAL, H. EDELSBRUNNER, J. HARER AND Y. WANG. Extreme elevation on a 2-manifold. In "Proc. 20th Annu. Sympos. Comput. Geom., 2004", 357–365.
5. P. K. AGARWAL, Y. WANG AND H. YU. A 2D kinetic triangulation with near-quadratic topological changes. In "Proc. 20th Annu. Sympos. Comput. Geom., 2004", 180–189.
6. P. K. AGARWAL, S. HAR-PELED, M. SHARIR AND Y. WANG. Hausdorff distance under translation for points and balls. In "Proc. 19th Annu. Sympos. Comput. Geom., 2003", 282–291.