
Eidgenössische Technische Hochschule Zürich
Institut für Computersysteme

Migration of the Molecular Dynamics Application CHARMM to the Widely Distributed Computing Platform of United Devices

Bennet Uk

Diploma Thesis

11th March 2002

Professors: Prof. Thomas M. Stricker and Prof. Amedeo Caflisch

Assistants: Michela Taufer and Dr. Giovanni Settanni


Abstract

In this study, we show the adaptation of molecular dynamics simulations using CHARMM (Chemistry at HARvard Macromolecular Mechanics) for protein folding research to the environment of widely distributed computing, and we investigate the performance and efficiency of such a migration. In particular, we address critical issues related to the migration, such as scheduling, fault tolerance and communication, and propose solutions to them. For the proof of feasibility, we aim at a realistic setting using commodity hardware and well-known software components. Since the idea of widely distributed computing is well established, we rely on existing middleware for the distribution of the computation. We therefore teamed up with the company United Devices, Inc. and use their software to investigate the feasibility of widely distributed molecular simulations.


Acknowledgements

First of all, I would like to thank Prof. Stricker and Prof. Caflisch for giving me the opportunity to do my diploma thesis work on this interesting interdisciplinary project. Many special thanks go to my advisor, Michela Taufer, for her time, help and encouragement, and for supporting me throughout the entire project.

This project would not have been possible without the cooperation of Prof. Caflisch's Biocomputing Group at the Department of Biochemistry at the University of Zürich. I would like to thank Dr. Giovanni Settanni in particular for sharing with me his ongoing research, allowing me to work on a real-world biochemical problem, and providing the possibility to directly compare the results obtained in our widely distributed setting to his work on a dedicated cluster. He also contributed a great deal of biological background information which found its way into this report.

Furthermore, thanks go to David McNett and Ashok Adiga of United Devices, Inc. for the provision of their MetaProcessor platform server software and development toolkit, for their valuable support during installation and their inputs concerning implementation details. I would also like to thank IBM Switzerland for providing a license for the usage of their DB2 relational database management system.

Finally, many thanks to the numerous people who generously donated their free CPU cycles by allowing me to run my tests on their machines.

Bennet Uk Zürich, March 2002


Contents

1 Introduction 7

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Platforms for High Performance Computing 11

2.1 Dedicated Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 The Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Classification of Applications . . . . . . . . . . . . . . . . . 12

2.2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Characterization of Applications . . . . . . . . . . . . . . . . . . . . 13

2.4 The United Devices MetaProcessor Platform . . . . . . . . . . . . . . 14

2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.3 Porting Applications to the UD MetaProcessor Platform . . . 18

3 The Application CHARMM 21

3.1 A Short Overview of CHARMM . . . . . . . . . . . . . . . . . . . . 21

3.2 Protein Folding Research with Molecular Dynamics Simulations . . . 22

3.2.1 The Biological System . . . . . . . . . . . . . . . . . . . . . 22


4 Migration of CHARMM to the UD MetaProcessor Platform 23

4.1 Preliminary Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Migration of the CHARMM Executable . . . . . . . . . . . . . . . . 24

4.2.1 Language Interface Library . . . . . . . . . . . . . . . . . . . 24

4.2.2 Input/Output Functionality . . . . . . . . . . . . . . . . . . . 24

4.3 Adaptation of the Simulation Algorithm . . . . . . . . . . . . . . . . 26

4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3.2 Work Partitioning and Scheduling . . . . . . . . . . . . . . . 27

4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.3.4 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Validation and Performance 31

5.1 Platform for Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.2 Tests Conducted and Validation of the Results . . . . . . . . . . . . . 32

5.3 Results of “Real World” Long-Running Tests . . . . . . . . . . . . . 33

5.4 Analysis of Test Results . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.4.1 Communication . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.4.2 Slow Machines vs. Fast Machines . . . . . . . . . . . . . . . 35

5.4.3 Effect of Work Unit Length . . . . . . . . . . . . . . . . . . 41

5.4.4 The Effect of the Working Set Size . . . . . . . . . . . . . . 42

5.4.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.5 Cluster vs. MetaProcessor Platform . . . . . . . . . . . . . . . . . . 47

5.5.1 Differences in Architecture . . . . . . . . . . . . . . . . . . . 47

5.5.2 Differences in the Algorithm . . . . . . . . . . . . . . . . . . 47

5.5.3 Comparison of Results . . . . . . . . . . . . . . . . . . . . . 47

5.6 Problems During the Tests and Validation . . . . . . . . . . . . . . . 48

5.6.1 Slow Communication Between Server and Controller Process 48

5.6.2 Monitoring Agent Machines . . . . . . . . . . . . . . . . . . 48

5.7 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.7.1 Reducing Communication . . . . . . . . . . . . . . . . . . . 49

5.7.2 Alternatives to Greedy Optimization . . . . . . . . . . . . . . 49


6 Conclusion 51

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.2 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

References 53


Chapter 1

Introduction

1.1 Motivation

Large Amounts of Computing Power are Available. Today, personal computers are in use everywhere. In many companies, every single employee has a PC sitting on the desk connected to a corporate network. PCs with Internet connections are also found in many households.

Much Potentially Useful Computing Power Goes Unused. Desktop machines in offices today are chronically under-used, typically utilizing only 18% of their power during normal use (word processing, web browsing, e-mail etc.) [8, 39, 31]. Often machines are left running at night, when utilization is even lower. In effect, countless processor cycles are wasted.

Many Problems are Computationally Expensive. Some computationally intensive problems are too large for even today's largest supercomputers to solve in reasonable amounts of time. We find such examples in many fields, such as the computational sciences, cryptography and mathematics.

The goal of distributed computing platforms like the United Devices MetaProcessor platform is to enable this potentially available, unused computing power to be utilized for such large-scale problems.

The possibility of harnessing idle processor power over the Internet has been proven by the success of projects like Seti@home for the analysis of radio telescope data [32] and distributed.net for the breaking of cryptographic keys [13]. These projects involve so-called embarrassingly parallel problems, meaning that each participating node performs work which does not depend on the work of other nodes.


Protein folding research is an area of computational biology that could greatly benefit from such computational power. Protein folding simulations by means of molecular dynamics require huge amounts of computing power. However, it is not possible to eliminate dependencies between work partitions, making the protein folding problem more challenging to parallelize than the above problems.

In this study we investigate the feasibility of migrating a protein folding simulation scenario to a widely distributed platform.

1.2 Contributions

In a first phase, we investigate the viability of migrating CHARMM, as well as a specific simulation scenario which has been tested on dedicated clusters, to the United Devices MetaProcessor platform. Linux has been chosen as the target operating system platform.

We then investigate some critical issues related to the migration, such as scheduling and work distribution, and the behaviour of our specific application on the MetaProcessor platform in a limited environment with relatively small numbers of nodes.

1.3 Overview

Chapter 2 gives an overview of high performance computing platforms, focussing on the United Devices MetaProcessor platform. The major features of the platform are described, as well as the process of porting applications to it. In Chapter 3, the application CHARMM is introduced and the specific biological problem of protein folding is outlined. Next, Chapter 4 describes the migration of the application to the MetaProcessor platform. This is done in two parts, one concerning the adaptation of the CHARMM executable and the other the implementation of our specific work generation and distribution model. In Chapter 5, we describe the tests conducted to validate the migration, and we provide interpretations for some interesting results. Finally, in Chapter 6, we draw some conclusions and provide an outlook on possible further work.

1.4 Related Work

This project is closely related to work being done by Dr. Giovanni Settanni in Prof. Caflisch's Biocomputing Group at the Department of Biochemistry at the University of Zürich. Settanni is performing very similar simulations on dedicated clusters [33].


A similar project is Folding@home, conducted by the Pande Group at Stanford University. Folding@home is a distributed computing project using widely distributed molecular dynamics to simulate protein folding [19, 42].

Other related work involves studies of CHARMM on other distributed platforms [27, 37, 28].


Chapter 2

Platforms for High Performance Computing

2.1 Dedicated Clusters

Even the highest-performance massively parallel supercomputers are limited in their maximum performance when used independently, as single nodes. Clustering provides an architecture to go beyond that limit, while providing a better cost-performance ratio [3].

A cluster is a system of multiple computers interconnected via a high-performance local area network. In most cases, the nodes communicate via message passing. Shared-memory clusters are less common, but do exist. Tandem introduced a 16-node cluster in 1975. SGI pioneered large, non-uniform memory access shared-memory clusters in the 1990s [3].

In 1994 the Beowulf Project was started, with the goal of building low-cost clusters using COTS (Commodity Off The Shelf) base systems [35, 4]. In summer 1994, Thomas Sterling and Don Becker presented a 16-node cluster of Intel DX4 processors connected by channel-bonded Ethernet. The machine, called Beowulf, was very successful, and the idea of using COTS machines and widely available software to build clusters spread quickly through the academic and research communities, leading to the Beowulf project as it exists today [4]. Beowulf clusters have brought do-it-yourself cluster computing to many groups for which such computing power was previously out of reach.

2.2 The Grid

The main idea of grid computing is to assemble a virtual supercomputer by connecting computing resources over long distances via high-speed networks [20, 1]. Such a computing infrastructure would be able to solve complex problems too large for a single supercomputer, enabling new classes of applications of many kinds.

2.2.1 Classification of Applications

Many different applications can potentially benefit from a grid architecture. In [20],five application classes are distinguished.

1. Distributed Supercomputing

Distributed supercomputing applications use grids to harness very large amounts of computational resources to solve problems which are too large for a single system. Examples include large-scale distributed simulations, from the simulation of complex physical processes to simulations of military battle scenarios. Distributed supercomputing presents many challenges, from the discovery and scheduling of often scarce and expensive resources to the scalability of protocols and algorithms to large numbers of nodes [20, 1].

2. High-Throughput Computing

In high-throughput computing, large numbers of uncoupled jobs are scheduled to many independent nodes (often idle workstations), with the goal of utilizing otherwise unused processor cycles. In contrast to distributed supercomputing, high-throughput computing involves few dependencies between jobs, leading to different types of problems and problem-solving methods [20, 1]. The United Devices MetaProcessor platform used in this study is suited to high-throughput computing.

3. On-Demand Computing

On-demand computing aims at using the grid to dynamically acquire resources for short-term usage. These resources can be software, data repositories, specialized sensors (for instance microscopes or telescopes) etc. These applications often focus on cost-effectiveness concerns rather than maximum raw performance [20].

4. Data-Intensive Computing

Data-intensive computing involves processing large amounts of data in distributed databases, online libraries or other repositories, seeking to extract new or interesting information. This can involve large amounts of computation as well as communication [20].

5. Collaborative Computing

Collaborative applications assist in human-to-human communication by providing shared virtual environments, allowing researchers to collaborate in real time regardless of physical distance [20].


2.2.2 Challenges

Grid computing poses many challenges to the systems and the infrastructure used to enable such wide-scale distribution of computational tasks [1].

Scalability. Grid computing environments can become arbitrarily large and complex. As well as adapting algorithms to scale well on large numbers of nodes, algorithms for the selection, reservation and utilization of computing resources must be developed, taking into account diverse criteria including cost, security, connectivity and reliability.

Heterogeneity. The computing power of nodes on a grid can range from desktop-class performance to large supercomputers. Similarly, network connectivity can span a very wide range. Furthermore, the resources accessible via a grid are not limited to computing power, but can also involve physical devices, data, or software.

Unpredictability. A grid infrastructure must be able to cope with many unknown factors, which furthermore can change without prior warning during resource usage. Changes in network latency and bandwidth or failures of resources can happen at essentially any time. This heterogeneity and unpredictability make it quite difficult to anticipate a computation's structure.

Multiple administrative domains. A grid consists of several geographically distributed resources which are administrated locally. Differences in security and maintenance policies can pose further challenges.

2.3 Characterization of Applications

Depending on the nature of a specific parallel task, it can be better or less well suited to running on a certain platform. We consider two main factors for characterizing applications.

Granularity. The granularity of a task is a measure of the amount of work a single node can complete before requiring communication with other nodes1. Coarse-grain parallelism implies long stretches of computation (in the order of minutes to hours, or even longer) between communication steps, whereas with fine-grain parallelism we are looking at computation units of seconds to fractions of seconds. A closely related factor is the computation-communication ratio.

1 Other nodes could be other computing nodes or, as in the case of the MetaProcessor platform, a "master" node of some sort.


Interdependencies of Work Units. In the simplest case, the work contained in one unit is completely independent from the other work units. Applications involving work units with some dependencies could be described as loosely coupled, and applications with strong dependencies are tightly coupled, requiring, for instance, frequent synchronization of all nodes.

Generally speaking, coarse-grain problems with no dependencies are the easiest to parallelize and perform well on many different platforms. Such applications are often described as "embarrassingly parallel", and several of these have been tackled in Internet distributed-computing projects [10, 13, 32, 38]. Many of these projects have been able to aggregate computing power much greater than that of the largest dedicated supercomputers.

At the other extreme, we find fine-grain, tightly coupled applications. These can often only be made to scale well on limited numbers of nodes, and even then specialized high-performance networking infrastructure is needed to prevent the communication overhead from annihilating the performance gains.

A further factor which needs consideration is the overhead for a controller process to generate work units and analyze results [40]. Naturally, this overhead must not be too large; otherwise it may not be possible to generate work quickly enough to keep all nodes busy, or the analysis of results may take too long, again decreasing the benefits of the parallelized computation.

2.4 The United Devices MetaProcessor Platform

2.4.1 Overview

The United Devices MetaProcessor Platform (MP Platform) [39, 40] provides an environment for running compute-intensive tasks distributed over many desktop-class machines in a corporate intranet or over the Internet. The MP Platform is well suited to problems with a low communication-to-computation ratio, involving coarse-grain parallelism with no dependencies between work units. Typical clusters are relatively limited in the number of nodes and usually consist of dedicated, homogeneous nodes connected via high-speed, low-latency networks. In the MetaProcessor model, by contrast, the nodes are not dedicated to use as distributed computing nodes, may have widely varying specifications (CPU speed, RAM, architecture, operating system etc.) and can be connected via a variety of network types (from high-performance LANs to dial-up modems). The MP server is also designed to scale up to very large numbers of nodes: in the Intel-United Devices Cancer Research Project, over a million PCs throughout the Internet are participating in a massive "virtual screening" effort [10].

Figure 2.1 shows the MetaProcessor platform architecture and components.


Figure 2.1: MetaProcessor Platform overview


Database. The database contains all relevant data about the participating devices, the tasks, jobs, work units, resident data and results. IBM's enterprise-level relational database management system DB2 is used as the primary supported database server for the MetaProcessor platform.

MP Server. The MP server is the link to the "outside world", i.e. to the participating agents. It is responsible for scheduling, as well as for the distribution of task modules, resident data and work units to the agents. The MP server also receives the results from the agents.

Management Server. The management server allows access to the internal data structures via the Management API (Application Programming Interface), which is defined in terms of XML-RPC calls [41]. The submission of work units and retrieval of results take place via the management API.

UD Agent. The UD Agent is a small program which runs on each participating device. The agent communicates with the MP server to request work units, resident data and task modules as needed, executes the task module on the participating device, and returns the task's results to the server.

Management Console. The management console provides administrators with a web-based interface to the MP platform. Client device management, server monitoring, task and job management, as well as client software release management, can be performed using the console.

The database, MP server and Management server form the heart of the system.

Operating Principle

The basic operating principle of the MP platform is rather simple: each participating device (or client, or node) must be running the UD agent. At startup, the agent contacts the MP server, requesting work. The agent then receives a work unit from the server, together with a task executable and "resident data", typically data which is needed for several work units. The agent then runs the task module with the received work unit as input. The task module interacts with the agent via a task API, which provides special I/O functions and other MetaProcessor-specific functionality. After the task completes execution, the generated output is sent back to the MP server as the work unit's result. The agent then obtains a new work unit. The task executable and resident data persist on the device, and are typically sent to each device only once. Note that all communication between the MP server and the participating devices is initiated by the agent running on each device.

The work units are generated by a controller program and submitted to the server via the management API, implemented using XML-RPC calls. The same API is used to retrieve results from the server.
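For readers unfamiliar with XML-RPC: a call is an HTTP POST carrying an XML envelope of the following shape. The method name `ud.create_workunit` and its parameters below are purely hypothetical illustrations, not the actual Management API; only the `methodCall`/`methodName`/`params` envelope structure is prescribed by XML-RPC itself.

```xml
<?xml version="1.0"?>
<methodCall>
  <methodName>ud.create_workunit</methodName>
  <params>
    <param><value><string>job-42</string></value></param>
    <param><value><base64>...work unit payload...</base64></value></param>
  </params>
</methodCall>
```

The server answers with a corresponding `<methodResponse>` envelope; result retrieval follows the same request/response pattern.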


Tasks, Jobs, Work Units and Resident Data

A task refers to a specific application program which has been ported to the MP Platform. To get actual work done, a job is created and associated with a specific task. Each job then contains multiple work units which are, in the end, sent to the participating devices.

Resident data is data associated with a task or job, which is typically needed for multiple work units of the corresponding task or job. Resident data is normally static and persists on the agent devices over many work units. Every work unit can have one associated resident data file.

Although work units and resident data each consist of essentially only one file, the UD Platform provides a packaging tool and a wrapper which allow applications to pack multiple files into each work unit and resident data file. Multiple output files, however, are not supported, so applications requiring such functionality must implement a workaround themselves.

2.4.2 Features

Scheduling

The MP platform features a workload scheduler to determine how work units get distributed to the participating devices. Basically, as long as work units are queued which have not been sent out yet, they are sent to devices on a first-come, first-served basis. Additionally, the scheduler supports device usage profiles, which specify when the UD agent is allowed to run and which tasks the device participates in. Prioritization of tasks is supported by assigning target service rates to jobs, and starting and ending times for jobs can also be defined. However, due to the uncontrollable nature of the MP platform environment, no service rate guarantees or hard deadlines can be provided.

Fault Tolerance

Fault tolerance is achieved by the MP server by means of eager scheduling [24]. When all work units have been assigned to devices and a client requests more work, the MP server sends out older work units which are already in process but have not yet returned results. This is somewhat similar to the concept of work-stealing [6], where processors which have finished their work take away work from slower processors. However, in the MP case of resending work units, the slower processors continue their work (except in the case of a failure), and eventually two results end up being sent back to the server. Due to the strict communication model (clients only communicate with the server, and all communication is initiated by the client), no cancellation of work units, or "migration" of work between clients as in other systems [7], is possible.

With eager scheduling, high fault tolerance is assured: the server does not need to be able to sense when a node has failed. A failed node can be regarded as a special case of a slow node (an infinitely slow one) and will be overtaken by faster processors. A possible drawback of this scheme is that slow machines may get overtaken too often, diminishing the effectiveness of their calculations.

In addition to the fault tolerance by rescheduling, it is also possible to specify a "redundancy level" for a job, causing each work unit of the job to be sent out until a certain number of results has been returned. This can be useful to cross-check results for plausibility; for some stochastic problems, multiple results may also increase the quality of the overall result.

Security

The MP platform is designed to provide end-to-end security in environments as hostile as the Internet. All communication over the network is encrypted, as are the files stored on the participating devices' disks. Encryption and decryption are performed transparently to the application when file access takes place via the task API.

Automatic Updates and Release Management

New versions of the UD agent, as well as task modules, are automatically rolled out to the participating devices whenever they communicate with the UD server. The MP platform also supports three phases of software releases (Test, Pilot and Production) [40]. Subsets of the devices can be defined to participate in one of the three phases, and each phase can be monitored separately.

2.4.3 Porting Applications to the UD MetaProcessor Platform

United Devices provides a software development toolkit (SDK) to enable programmers to develop applications for the MP platform. The SDK includes the task API library and documentation, as well as packaging tools and a "test-agent" to test ported applications.

Before starting any porting work, the developer must consider how to adapt the application to the MP paradigm. Important points are how the problem can be split up into work units, and how the results can be combined. The nature of the MP platform must be kept in mind: dependencies between work units should be kept to a minimum, and the computation time for a work unit should be in the order of at least several minutes, if not hours. After considering these points, we can proceed to adapt the application to the MP platform. This involves two separate parts: the task module, which runs on the agent devices, and a master controller, which submits work units to the MP server and retrieves results using the management API.

Task Module

The task module is implemented using the task API. This API defines functions for initialization, input/output, checkpointing and some other miscellaneous functionality. The task API functions replace standard I/O calls in the application. A C library implementing the task API is provided.

For applications written in languages other than C (such as Fortran), language-specific interface libraries are needed which enable the task API functions to be called.

The MP task API supports checkpointing, which enables a task module to continue work on a work unit if it is forced to terminate before completing the entire work unit. The task module can periodically save its state to a checkpoint file. When the agent terminates, the last checkpoint file is saved and the result file is rolled back to the point in time of the corresponding checkpoint. When the agent is restarted, the task module can restore its state from the checkpoint file and continue work at that point.
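The save/restore cycle described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the JSON file layout and the 100-step checkpoint interval are our own choices for illustration and not part of the actual UD task API.

```python
import json
import os

# Hypothetical file name; the real task API manages checkpoint files itself.
CHECKPOINT_FILE = "checkpoint.json"

def save_checkpoint(step, state):
    """Periodically persist progress so an interrupted work unit can
    resume instead of restarting from step 0."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic swap avoids torn checkpoints

def load_checkpoint():
    """On (re)start, resume from the last checkpoint if one exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            data = json.load(f)
        return data["step"], data["state"]
    return 0, None  # fresh work unit

def run_work_unit(total_steps, compute_step):
    """Run (or resume) a work unit, checkpointing every 100 steps."""
    step, state = load_checkpoint()
    while step < total_steps:
        state = compute_step(step, state)
        step += 1
        if step % 100 == 0:
            save_checkpoint(step, state)
    return state
```

If the process is killed and restarted, `run_work_unit` picks up at the last multiple of 100 steps rather than at the beginning, which mirrors the behaviour the task API provides.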

During the development phase, the application can be built as a stand-alone executable for testing purposes. When built as stand-alone, the task API calls are essentially replaced by stubs comprising calls to the standard C input/output library functions. The resulting executable program is not linked with the UD task library. This makes quick testing easy; however, not all "quirks" of the MP platform will be reproduced. After the application runs acceptably as a stand-alone build, it can be built using the real task API library. If the application uses the packager for multiple input files, the binary file must also be wrapped with the wrapper executable.

The MP SDK provides a "test-agent" to test applications built with the task API library. This test-agent essentially simulates the environment which a task module would find in the real MP system.

The working task module can then be uploaded to the MP server and associated with a task.

Master Controller

The master controller is responsible for submitting work units to and retrieving results from the MP server. The management API, implemented using XML-RPC over the HTTP protocol, is used for this purpose. How the master controller functions depends on the specific problem. For simple parallel problems, the controller could generate a number of work units for a job, wait until they have all been completed, and then retrieve all results, combining them in some way to give an overall result. Other applications can involve dependencies between the results of earlier work units and the input to later work units. In this case, the controller has to periodically retrieve results, analyze them and then generate subsequent work units based on the results obtained so far.


Chapter 3

The Application CHARMM

3.1 A Short Overview of CHARMM

CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a widely recognized scientific code for molecular dynamics and mechanics in computational biology [9, 28]. CHARMM was initially developed at the Department of Chemistry at Harvard University, Massachusetts, at the beginning of the 1980s. Since then many people have made contributions, extending the functionality of CHARMM and also adapting it to diverse parallel architectures. The code achieved good performance on many computer platforms in the recent past and was a considerable success in many areas of biology, including research on prion infections. In particular, the code runs well on the aging massively parallel multiprocessors manufactured by Cray Research Inc. in the mid 1990s (T3D and T3E) with more than 100 processors, and on low-cost clusters of PCs with up to 16 processors.

CHARMM is a powerful tool which can be used for many different tasks:

- energy minimization
- analysis of structural equilibria of molecules
- structure and energy comparisons of molecules
- molecular dynamics simulations
- statistical analysis of molecular dynamics simulations

In this study, we use CHARMM for molecular dynamics (MD) simulations to investigate the protein folding process.


3.2 Protein Folding Research with Molecular Dynamics Simulations

One of the most intriguing challenges of the post-genomic era is the characterization of the protein folding process, that is, the sequence of events that leads the swollen chain of amino acids immersed in the biological environment to its unique functional native (folded) structure (or conformation or state) [18]. Up to now, only a few aspects of this phenomenon have been clarified that allow a plausible protein folding scenario to be drawn [18]. The phase transition from the folded to the unfolded conformations (and vice versa) has a clear first-order character (at least for small globular proteins). However, although the native structure of many proteins has been fully characterized [5], very little is known about their unfolded state. Among the computational approaches to protein folding (with the exclusion of small peptides), most of the interesting insights have been obtained either using simplified models that imply knowledge of the native state structure (Go models) [22, 11, 34] or using more realistic molecular dynamics models for the simulation of the unfolding process (from native to unfolded conformations) [12, 23].

In this study, we use a molecular dynamics model with an implicit treatment of water [17]. Until now, such MD models have not successfully described the transition from completely unfolded to folded proteins, mainly because the time scale of the folding process is still some orders of magnitude larger than present standard computational capabilities. In other words, present standard computational approaches do not allow for an exhaustive exploration of the conformational space of unfolded proteins, and it is thus unlikely that folding events can be sampled using this kind of MD simulation.

In our scenario, we hope that the migration of standard MD software (i.e. CHARMM) to massively distributed computing facilities will allow the exploitation of much larger amounts of computational power, and thus a corresponding widening of the protein conformational space that can be explored. As a consequence, sampling protein folding events could become feasible.

3.2.1 The Biological System

The system that we chose for this study is the SH3 domain, a 56-residue protein whose folding characteristics have been thoroughly studied both from the experimental and the theoretical point of view [11, 23, 25, 30]. It is a natural choice because it serves as a sort of test bed for theories and algorithms dealing with protein folding. Its short length and the abundance of comparable studies make this protein the optimal target for our computations.


Chapter 4

Migration of CHARMM to the UD MetaProcessor Platform

4.1 Preliminary Study

The standard, unmodified CHARMM is a programmable application and takes as input a set of CHARMM commands in a text file. Usually a few auxiliary input files, such as parameter files and topology files, are also needed. Output information is written to the standard output stream. When performing molecular dynamics, additional files such as trajectory files may be created.

An effective way to adapt CHARMM to the MP platform is to let entire CHARMM programs, i.e. in our case an entire molecular dynamics run and subsequent analysis of the calculated trajectory, execute on the compute nodes. The input program is contained in the work unit input file. Additional input files can be packaged together either in the work unit input file (for data specific to each work unit) or in a resident data file (for data used for multiple work units), using the packaging tool included in the SDK. A mechanism for returning multiple files must be added to CHARMM, and the corresponding mechanism for extracting these files from the result file must be implemented in the controller.

The master controller which generates the work units and retrieves results is closely tailored to our specific problem. We have adapted Settanni's cluster algorithm to the environment of the MP platform. The algorithm and more implementation details are described in Section 4.3.


4.2 Migration of the CHARMM Executable

4.2.1 Language Interface Library

To port Fortran applications to the MP platform, a language-specific interface library is needed as a wrapper around the task API library. This library is written in C, with functions conforming to the calling conventions of Fortran 77 [16, 21]. The tool f2c [15], a Fortran-to-C conversion tool, is used to generate function stubs.

4.2.2 Input/Output Functionality

The largest change involved in porting applications to the MetaProcessor platform is the replacement of the standard I/O functionality usually provided by the operating system (file I/O and standard input/output streams) with I/O via the MetaProcessor task API. This is also the case for CHARMM, and it turned out to be a rather time-consuming process.

Whereas in C, file I/O is provided through regular function calls which can easily be replaced, Fortran I/O is embedded within the language, making it necessary to modify every single I/O statement. Although a Perl script is used to manipulate the Fortran source files, the many different types of I/O (formatted I/O, list-directed I/O, unformatted I/O [14, 36]) make it difficult to handle every case automatically, resulting in a great deal of manual work.

The following procedures were implemented in the language interface library as wrap-pers around the task API calls:

UDSTART, UDDONE perform necessary initialization and cleanup tasks.

UDOPEN, UDCLOSE provide functionality corresponding to the Fortran OPEN and CLOSE commands.

UDREAD, UDWRITE perform simple reading and writing to/from files, one line at a time.

We make use of internal files, which are essentially string buffers that can be read from and written to using the same Fortran commands and formatting statements as for file I/O. Essentially, we replace every READ command in the original source code by a call to the UDREAD procedure, reading one line from the input file into a buffer. Then the READ command is invoked using the buffer as an internal file. Similarly, WRITE commands are replaced by a WRITE into the buffer followed by a call to the UDWRITE procedure.


List-Directed I/O

List-directed I/O is a free-form method for I/O in Fortran. For list-directed input, data is read from a stream and parsed according to the variable list provided to the READ command. The input data is separated by blanks. As many data items are read as there are variables in the list. If the input data cannot be parsed (for example, when an integer variable is specified but non-numerical string data is read), an I/O error occurs, which leads to termination of the application unless the I/O error is trapped by error-handling code.

For output, the variables provided are written using standard formats depending on the variable type and length.

The following example reads from file unit IUNIT a string into HDR and an integer into ICNTRL (assume these variables have been declared beforehand). The original command is as follows:

READ(IUNIT,*) HDR,ICNTRL

This translates to the following code (UDBUF is a 1024-byte string buffer, and UDERR is an integer which is set to an error code if the operation is not successful):

CALL UDREAD(IUNIT, UDBUF, UDERR)
READ(UDBUF, *) HDR, ICNTRL

Formatted I/O

Formatted I/O uses format specifications to determine how input data is parsed, or how output data is formatted. Mostly, formatted I/O statements can simply be replaced by the UDREAD–READ and WRITE–UDWRITE sequences described above. In some cases, for example formats involving multiple lines or arrays of data, manual adaptation of the commands is required.

The following example writes a warning message to file unit OUTU. The format statement contains fixed strings and output specifications for the two integers IATOM and NSLCT.

The original code:

      WRITE(OUTU,100) IATOM,NSLCT
100   FORMAT(' ** WARNING ** IATOM=',I5,' NSLCT=',I5)

is converted to:

      WRITE(UDBUF,100) IATOM,NSLCT
      CALL UDWRITE(OUTU, UDBUF, UDERR)
100   FORMAT(' ** WARNING ** IATOM=',I5,' NSLCT=',I5)


Unformatted I/O

Unformatted I/O essentially transfers data in its raw binary format, i.e. the data is stored just as in memory, with added information about the size of the stored data. Because Fortran does not support unformatted I/O on internal files, special procedures were implemented in the language interface library as a workaround. A single Fortran statement thus becomes a sequence of procedure calls involving initialization of a buffer, one call per variable read or written, and a call signaling the end of the READ/WRITE statement. The procedures implemented for reading are UDUFREADSTART, UDUFREADINT, UDUFREADSTR and UDUFREADEND, and corresponding procedures were also implemented for writing.

The following statement writes the array Z(1..NATOM) to the unit IUNCRD:

WRITE(IUNCRD) (SNGL(Z(I)),I=1,NATOM)

Our implementation looks like this:

      CALL UDUFWRITESTART(IUNCRD)
      DO I=1,NATOM
        CALL UDUFWRITEINT(IUNCRD, SNGL(Z(I)))
      END DO
      CALL UDUFWRITEEND(IUNCRD)

Other Modifications

Restrictions of the task API require the wrapper library to handle temporarily generated files differently from input files and the main output file. To be able to return multiple files, we needed to add a command to CHARMM to embed files within the main output file. Each embedded file is preceded by a header containing the file name and length.
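A possible encoding of this embedding scheme is sketched below. The exact header layout used in the port is not spelled out above, so the marker line with file name and byte length is our own assumption for illustration.

```python
def embed_file(out, name, data):
    """Append one embedded file to the output stream: a one-line header
    with the file name and length, followed by the raw bytes."""
    out.extend(f"==FILE== {name} {len(data)}\n".encode())
    out.extend(data)

def extract_files(blob):
    """Recover {name: data} from a stream of embedded files, using the
    declared lengths so binary payloads may contain anything."""
    files, pos = {}, 0
    while pos < len(blob):
        eol = blob.index(b"\n", pos)
        _, name, length = blob[pos:eol].decode().split()
        start = eol + 1
        files[name] = blob[start:start + int(length)]
        pos = start + int(length)
    return files

# Round trip: two result files appended to the main output stream,
# then extracted again on the controller side.
buf = bytearray()
embed_file(buf, "traj.crd", b"\x00\x01binary trajectory")
embed_file(buf, "best.pdb", b"ATOM ...")
print(extract_files(bytes(buf))["best.pdb"])  # b'ATOM ...'
```

Because extraction is driven by the declared lengths rather than by delimiters, embedded binary data (such as trajectory files) needs no escaping.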

4.3 Adaptation of the Simulation Algorithm

4.3.1 Introduction

The algorithm for the generation and distribution of work units that we are implementing takes into consideration many of the unique properties of the MP platform. The main idea is that molecular dynamics trajectories that get closer to the native conformation are continued, while the others are stopped [33]. In each work unit, an MD simulation of a certain number of steps is performed. A quality factor is then calculated for each snapshot along each trajectory, and the best snapshot (the point with the smallest quality factor) is returned at the end. The quality factor is to be minimized, so essentially what we perform is a "greedy" minimization of the quality factor [29]. During run-time, new work units are generated using the best result obtained so far.

Figures 4.1 and 4.2 show examples of the path the quality factor takes along two different calculated trajectories. In the first case, the minimum occurs right at the beginning of the trajectory; in the second case, the best value occurs after around 28'000 steps. One property of this simulation is that it is not possible to predict if or when an improvement of the quality factor will occur. In practice, a balance has to be found between how many different trajectories will be explored, and how far, in terms of numbers of steps, we will calculate each trajectory before giving up. The number of steps calculated in each work unit is a factor that will be examined in more detail in Section 5.4.3. Another factor is the frequency of samples along the trajectory. For all tests conducted, we take one sample every 100 steps.

Figure 4.1: Quality factor along the trajectory of one work unit (50'000 steps, sample every 100 steps), with the minimum at the beginning.

4.3.2 Work Partitioning and Scheduling

Basically, what we do in one work unit is the following: starting from an input configuration with a certain quality factor, we run an MD simulation, calculating a trajectory of a certain length. After this calculation phase, we analyze the trajectory, looking for the point with the best (smallest) quality factor. We return this configuration, together with the newly calculated quality factor, as the result. This result can then be used to generate new work units.

Figure 4.2: Quality factor along the trajectory of one work unit (50'000 steps, sample every 100 steps), with the minimum in the second half of the trajectory.

A work unit can thus be regarded as taking an input configuration and producing as output a new configuration with a quality factor. The goal is to minimize the quality factor. To achieve this goal, we distribute many work units with identical starting configurations to many nodes; the work units differ only in the random seed used to calculate the random initial velocities with which the MD simulation is run. We hope that one of these work units will return a configuration better than the starting configuration, which we can then use as a starting point for the further search for even better configurations.

Figure 4.3: Distributed minimization algorithm with 3 machines of different speeds

Figure 4.3 schematically illustrates the algorithm with 3 machines of different speeds. Machine m1 is the fastest machine and m3 the slowest. At time t0, all machines begin work on work units with the starting quality factor q=0.6. At time t1, m1 returns a result with q=0.36. This result is better than q=0.6, so it is accepted as the new starting configuration. m1 now begins a new work unit starting with q=0.36. At time t2, m2 returns a result with q=0.42. This is not better than q=0.36, so the result is discarded, and m2 starts a new work unit with q=0.36. At time t3, m3 returns a result with q=0.32, which is again accepted as the new optimum, so m3 starts its new work at q=0.32. Next, m1 returns a result at t4, with q=0.39. This result also gets discarded and m1 resumes work starting with q=0.32.
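The accept/discard rule walked through above can be captured in a few lines. The sketch below is our own illustration of the decision logic, not the actual controller code; the event list reproduces the m1–m3 example.

```python
def process_result(best_q, result_q):
    """Return (new_best_q, accepted): a result is accepted as the new
    starting configuration only if it improves on the current best."""
    if result_q < best_q:
        return result_q, True
    return best_q, False

# Events from the example: (time, machine, quality factor of the result).
events = [("t1", "m1", 0.36), ("t2", "m2", 0.42),
          ("t3", "m3", 0.32), ("t4", "m1", 0.39)]

best = 0.60  # starting quality factor q=0.6 at t0
accepted = []
for t, machine, q in events:
    best, ok = process_result(best, q)
    if ok:
        accepted.append((t, machine, q))
    # In either case, the machine starts a new work unit from `best`.

print(best)      # 0.32
print(accepted)  # [('t1', 'm1', 0.36), ('t3', 'm3', 0.32)]
```

Only the results at t1 and t3 are accepted; the others are discarded, exactly as in the walkthrough.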

4.3.3 Implementation

We have implemented a controller process which generates work units and submits them to the MetaProcessor server via the management API. This same process periodically polls the server for results returned by the clients, analyzes these results and generates new work units based on the latest results.

Given that there is no direct coupling between the controller process and the agent devices, the controller process needs to ensure that at any time there are fresh work units on the server, ready to be sent to agents returning results and requesting new work. When agents return results, they immediately request a new work unit. In the version of the MetaProcessor server used in this study, if the server does not have ready work units which have not yet been sent out, it will send out older work units which are already being processed by other machines, but for which no results have been received yet (this is due to the eager scheduling described in Section 2.4.2). This behaviour is desirable in an environment where agent devices are not guaranteed to complete the work units they receive, and each specific result is needed for the completion of the entire job. In our case, this is not necessary, since work units which never return results do not hinder the entire job from making progress. United Devices has confirmed that this behaviour should be configurable, and will include such functionality in a future release.

In this first implementation, we take a simple approach by ensuring that the number of "active" work units (work units which have been submitted to the server, but for which no results have been retrieved yet) is always larger than the number of machines participating in the computation. This number of "spare" work units should be at least as large as the maximum expected number of agents returning a result between subsequent pollings by the controller process, to prevent the server from running out of work units. We use the term working set for the set of work units which have been submitted to the server but not returned yet (i.e. work units in process on clients, and work units waiting to be scheduled).
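This sizing rule can be stated numerically. The sketch below, including the example figures, is our own illustration of the lower bound implied above, not code from the implementation.

```python
import math

def min_working_set(n_machines, max_results_per_hour_per_machine,
                    polling_interval_minutes):
    """Lower bound on the working set size: one work unit per
    participating machine, plus enough spares to cover all results
    that may arrive between two consecutive controller polls."""
    results_per_poll = (n_machines * max_results_per_hour_per_machine
                        * polling_interval_minutes / 60.0)
    return n_machines + math.ceil(results_per_poll)

# Assumed figures: 43 machines (as in the long-running test), at most
# 3 results per hour per machine, polling every 5 minutes.
print(min_working_set(43, 3, 5))  # 54
```

With these assumptions the bound is close to the working set of 50 actually used in the long-running test.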

4.3.4 Fault Tolerance

Our algorithm is quite resilient to failures of single nodes. Because work units generated with the same starting point differ only in the random number seed, the loss of work units has no noticeable influence on the entire job.


Chapter 5

Validation and Performance

5.1 Platform for Tests

The UD MetaProcessor server was installed on a machine at the CS Department at ETH Zürich, which was used exclusively for this task for the duration of the tests:

- PC with 400 MHz PII, 9+18 GB SCSI disks, 384 MB RAM, running RedHat Linux 6.2 (as recommended by UD) and IBM DB2 7.2. The server is running UD MetaProcessor server release 2625.

For the tests described below, we were able to run the UD Agent on several machines, mostly at the CS Department at ETH Zürich, but also on a few other machines at the University of Zürich and some home machines connected via the Internet. Although the UD Agent is available for several platforms, we only built the CHARMM executable for Linux on Intel x86 platforms, and were thus constrained to using PCs running Linux. Most machines were up and running 24 hours a day.

The following machines were available around the clock:

- 16 1 GHz Pentium III (Coppermine) machines (dedicated cluster at the CS Department, ETH Zürich)
- 14 400 MHz Pentium II (Deschutes) machines at the CS Department, ETH Zürich
- 4 500 MHz Pentium III (Katmai) machines at the CS Department, ETH Zürich
- 3 933 MHz–1.0 GHz Pentium III machines at the CS Department, ETH Zürich
- 2 1.4 GHz Athlon machines and 1 1.8 GHz Pentium 4 from Prof. Caflisch's group at the Department of Biochemistry, University of Zürich


- 5 home machines connected via the Internet, ranging from 350 MHz to 1 GHz

The following machines are multi-boot machines (Windows NT, Oberon and Linux) in student labs at the CS Department of ETH Zürich. During the day, they are subject to reboots and therefore not generally usable for our tests. We used them, mostly during night time, for specific tests, which are described in more detail below:

- 12 600 MHz and 12 500 MHz Pentium III (Katmai) machines
- 24 933 MHz Pentium III (Coppermine) machines

Some of the machines used are actually dual-processor machines; however, the UD Agent is geared toward single-processor machines and does not easily support the usage of more than one processor.

5.2 Tests Conducted and Validation of the Results

A long-running test using the machines available around the clock (43 machines) was conducted over a period of 20 days. The following parameters were chosen for this test:

- 100'000 simulation steps per work unit
- Working set of 50 work units
- Polling frequency: once every 5 minutes

This test was subject to a number of unforeseen disruptions (network failures, failures of machines, including the machine running the controller process, and a full disk on the server), which caused the test to stall occasionally.

Another long-running test with the same parameters was started in order to have at least one comparable set of results.

Additionally, a number of short-running tests of 12–24 hours were conducted on the 48 machines in the student labs. For each of these tests, different parameters were selected to investigate their influence on the performance and results. Two parameters were varied: the number of steps per work unit and the working set size. For each parameter, three values were chosen, resulting in the following nine tests:

10'000 steps, working set 70     20'000 steps, working set 70     50'000 steps, working set 70
10'000 steps, working set 120    20'000 steps, working set 120    50'000 steps, working set 120
10'000 steps, working set 200    20'000 steps, working set 200    50'000 steps, working set 200

Most of these tests were conducted twice, but due to time constraints not all tests could be repeated.


5.3 Results of “Real World” Long-Running Tests

The first test conducted was run over a period of 20 days on the set of machines available around the clock. This test, although involving a rather small number of nodes, nevertheless exhibits many properties which would be found on a "real-world" widely distributed grid platform:

Heterogeneity. The nodes used range in CPU speed from 350 MHz to 1.6 GHz, providing a wide performance spectrum. Also, a widely varying range of network connectivity technologies was in use, from machines in the same LAN segment as the server (100 Mbit/s switched Ethernet) to machines at home connected via ADSL (128 kbit/s).

Non-dedicated nodes. Most of the nodes were also used for normal day-to-day tasks during the course of the test. Because the UD agent spawns the task module processes at low priority, only otherwise unused processor cycles are consumed by the task.

Unreliability and unpredictability. Machines can be rebooted, switched off, and they can lose network connectivity at any time. As long as the server itself is not subject to such failures, the simulation continues with the remaining available nodes.

This first test was unfortunately also subject to some failures of the server. Nevertheless, the system recovered and the test was continued until around 10'000 work units had been calculated. The simulation reached a quality factor of 0.07837 after 3'400 completed work units. After that, no more improvement was made.

After completion of the first test, a second test with the same parameters was started and left to run for 11 days, after which approximately 6'000 work units had been completed. This test reached a quality factor of 0.07251 after 2'200 work units, making no improvements thereafter.

5.4 Analysis of Test Results

Due to the nature of the MP platform, it is difficult to reliably measure hard performance data. The performance of specific devices depends on many factors not under our control (mainly the utilization of resources by normal usage, since the machines are not dedicated to our computation), and the number of devices participating at any given moment can vary due to reboots, loss of network connectivity to the MP server, etc.

It is nevertheless possible to measure some numbers to estimate the kind of speed-up which could be expected as the number of computing nodes is increased.


Figure 5.1: Progression of quality factor for the two long-running tests


5.4.1 Communication

The amount of data communicated per work unit is around 15 kilobytes from the server to the client. In our current implementation, the amount of data returned by the client depends on whether the work unit actually generates an improvement of the quality factor. In the case of no improvement, the amount of data returned is around 170 kilobytes for a 100'000-step work unit; this data contains the best configuration achieved, as well as the values of the quality factor along the entire calculated trajectory. If an improvement is made, the entire trajectory is additionally returned, amounting to around 7 megabytes for the 100'000-step work unit. The amount of data returned could be drastically reduced in the case of no improvement of the quality factor, since these data are never actually needed for further calculations.

With large numbers of clients, the server's network bandwidth can become a bottleneck. Network latency, however, is much less likely to be a problem. Even on the Internet, where propagation delays are high due to large physical distances, and latency increases with each "hop" due to queuing delays in routers, the total latency is very unlikely to exceed 1 second, remaining negligible in comparison to the computation time [26].

Assuming that results always contain the entire trajectory, we can estimate the theoretical maximum number of results that can be returned per hour, which is limited by the server's bandwidth. Assuming that the server's bandwidth is the limiting factor, and the size of results is constant, the maximum number of results per hour will be:

    max. results per hour = (bandwidth [bit/s] × 3'600 s) / result size [bit]

So, for a server with a 100 Mbit/s LAN connection and work units of 50'000 steps length (generating 3'369 kilobytes of data per result, as can be seen in Figure 5.2), the maximum number of work units per hour is:

    (100'000'000 bit/s × 3'600 s) / (3'369 × 1'024 × 8 bit) ≈ 13'000 results per hour

As we can see in Figure 5.3, 1 GHz machines can produce around 3 such results per hour. It follows that for a homogeneous population of 1 GHz machines, the network connection of the server can handle around 13'000 / 3 ≈ 4'300 clients.
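The estimate can be reproduced numerically from the figures given in the text (100 Mbit/s server bandwidth, 3'369 kilobytes per result, about 3 results per hour per 1 GHz machine); the assumption that 1 kilobyte = 1'024 bytes is ours.

```python
def max_results_per_hour(bandwidth_bit_s, result_kbytes):
    """Upper bound on results per hour when the server's network
    bandwidth is the only limiting factor."""
    result_bits = result_kbytes * 1024 * 8  # assuming 1 KB = 1'024 bytes
    return bandwidth_bit_s * 3600 / result_bits

r = max_results_per_hour(100_000_000, 3369)  # 100 Mbit/s, 3'369 KB/result
print(round(r))      # 13044 results per hour
print(round(r / 3))  # 4348 clients at 3 results per hour each
```

The bound scales linearly with the server's bandwidth and inversely with the result size, which is why trimming the returned data for non-improving work units would directly raise the number of supportable clients.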

Figure 5.2: Amount of data returned for results including a trajectory

Figure 5.3: Calculation time per work unit for different numbers of steps, measured on machines with different CPU speeds.

5.4.2 Slow Machines vs. Fast Machines

An important question in heterogeneous environments like the one we are studying is whether slow machines are able to provide enough contributions to the overall computation to justify their participation. In our case, work units are generated based on the results of previous work units. It follows that the time at which a result is returned is a factor which cannot be ignored. It is possible that, during the processing time of a work unit on one machine, the state of the entire system has progressed so far that the result of the work unit of the considered machine is obsolete. A very simplistic definition of obsolete work could be the following:

A work unit's result is considered obsolete if, at the point in time that the result is returned, a configuration better than the starting configuration of the work unit in question has been accepted by the system, and the optimal configuration of the work unit in question is not better than this newly accepted configuration.

For example (assuming times t1 < t2 < t3 etc.), device #1 starts computing a work unit at time t1, with a starting configuration with q=1.0. Device #2 starts a work unit with the same starting configuration at time t2. At time t3, device #1 returns a result with q=0.9. This configuration is accepted as the new starting configuration for future work units. Now, when device #2 returns its result at time t4, this result is considered obsolete, unless it has a quality factor better than q=0.9.
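The rule in this example can be stated as a small predicate; the function name is ours, and a lower quality factor q is better:

```python
# Minimal sketch of the obsoleteness rule (hypothetical helper name;
# lower quality factor q is better).

def is_obsolete(result_q: float, start_q: float, best_accepted_q: float) -> bool:
    """True if a configuration better than the work unit's starting
    configuration was accepted meanwhile, and the result does not beat it."""
    superseded = best_accepted_q < start_q
    return superseded and result_q >= best_accepted_q

# Device #2's situation: started from q=1.0, but q=0.9 was accepted at t3.
late_result_obsolete = is_obsolete(result_q=0.95, start_q=1.0, best_accepted_q=0.9)
late_result_useful = is_obsolete(result_q=0.85, start_q=1.0, best_accepted_q=0.9)
```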

Under this definition of obsolete, every machine will return obsolete results, because for each accepted result, almost all other work units remaining in the working set, i.e. work units in process and in the server's queue, become obsolete. In the remaining working set, only those work units which themselves get accepted as new starting configurations will not be considered obsolete. What we wish to find out is how strongly the percentage of obsolete work generated by a device depends on the device's speed. If slow machines return a disproportionately large amount of obsolete results, their contribution to the entire calculation will be less valuable.

The results of the tests show that results returned by all machines, fast and slow, were accepted as good configurations. Figures 5.4 and 5.5 show the CPU speed for each result which was accepted in the long-running tests. Most results are from machines with CPU speeds between 910 MHz and 1010 MHz, because many such machines were used, and thus the total number of work units completed by these machines was large. Nevertheless, slower machines make contributions during all phases of the simulation. Note that in test 2, no machines faster than 1010 MHz participated in the calculation.

Figure 5.6 shows the proportion of results returned and results accepted by each group of nodes with the same speed. The data was collected from the short-running tests on the student lab cluster, with machines running at 500 MHz, 600 MHz and 933 MHz. Each chart shows the results for one test.

The charts show that the proportion of accepted work units corresponds roughly to the total amount of results returned. However, in many cases the fastest machines seem to produce a slightly larger proportion of accepted results than the rest of the nodes.

In Figure 5.7, the results of all nine tests were summed up, and the percentage of accepted work units for each CPU speed was calculated. The rate of accepted work units is significantly higher for the fast machines. This may have negative consequences if the speed differences between the slowest and fastest machines are too large.
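The per-speed acceptance percentages behind Figure 5.7 amount to a simple grouped calculation; a sketch with invented sample data (the real test logs are not reproduced here):

```python
# Grouped acceptance-rate calculation (the sample data is invented;
# real logs would supply one (cpu_mhz, accepted) pair per returned result).
from collections import defaultdict

def acceptance_rate_by_speed(results):
    """results: iterable of (cpu_mhz, accepted); returns {cpu_mhz: percent}."""
    returned = defaultdict(int)
    accepted = defaultdict(int)
    for mhz, ok in results:
        returned[mhz] += 1
        accepted[mhz] += int(ok)
    return {mhz: 100.0 * accepted[mhz] / returned[mhz] for mhz in returned}

sample = [(500, False), (500, False), (500, True),
          (933, True), (933, False)]
rates = acceptance_rate_by_speed(sample)
```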


Figure 5.4: CPU speed of nodes returning accepted results, for first long-running test

Figure 5.5: CPU speed of nodes returning accepted results, for second long-running test


Figure 5.6: Number of returned results vs. number of accepted results for different CPU speeds. Each chart shows data from one of the short-running tests, each with a different combination of work set size and number of steps per work unit.


Figure 5.7: Overall percentage of accepted results (accepted work units divided by the total number of returned work units for each CPU speed) for different CPU speeds, for all short-running tests.


To investigate whether any difference can be observed between the beginning of a test and later phases, in Figure 5.8 the accepted results of all nine tests are grouped by the point during the simulation at which they were generated. However, no clear pattern is apparent.

Figure 5.8: Accepted work units in different phases of the tests. Each bar shows the number of accepted work units by CPU speed, scaled to 100%, for a range of work units. Results of the nine short-running tests shown in Figure 5.6 are added up.

5.4.3 Effect of Work Unit Length

An important parameter of our simulations is the number of molecular dynamics steps calculated in each work unit. This is one of the variables chosen for the short-running tests on the student lab machines.

The number of steps per work unit of course has a direct effect on computation time and amount of communication, as discussed in Section 5.4.1. However, we would also like to investigate its effect not on a single work unit, but over the course of the entire simulation.

Figure 5.9 shows the progression of the quality factor for all chosen test parameters. Each chart shows the results for the three work unit lengths chosen, leaving the working set size constant. Where more than one test was conducted, the test with the better result was chosen.

We can see that in terms of work units completed before reaching a certain quality factor, the tests with 50'000 steps reach good quality factors earlier. The other tests also reached similar quality factors after larger amounts of processed work units, but in two of the three cases, the test with 50'000 steps still obtained the best quality factor at the end.

Since longer work units take longer to calculate, it is not possible to state that choosing a large number of steps will lead to good results in a shorter amount of time. In Figure 5.10, the paths were stretched according to the time required to process the work units, allowing an approximation of the path the quality factor takes in real time.

It immediately becomes less obvious which value is best. The tests with 10'000 steps reach small values slightly more quickly, but the ultimate quality factor achieved is worse than in the other tests, except for the case with a working set size of 120.

Because longer work units take more time on the nodes, choosing a high number of steps will allow a server to handle more nodes before its communication interface becomes a bottleneck. So in a setting with a larger number of participating nodes, more steps per work unit (up to a reasonable limit) will probably be preferred.

5.4.4 The Effect of the Working Set Size

Initially, the parameters for the conducted tests were chosen to keep the amount of "obsolete" work generated (as defined in Section 5.4.2) as small as possible. The first short-running tests were conducted with a working set size of 70. Interestingly, in the test with 50'000 steps per work unit the amount of "obsolete" work was very large, but nevertheless, the simulation progressed to a good quality factor rather quickly.

Furthermore, all of these short-running tests progressed to a quality factor significantly better than the one achieved by the long-running test.

We concluded that this may be due to the larger working set size chosen. By keeping the working set large, we are "delaying the greediness" of the algorithm, letting the system explore more trajectories from each starting configuration. A known problem of "greedy" algorithms is the chance of running into local minima [29], where no more improvements can be made during the length of a single subsequent work unit. The larger working set may allow the simulation to find better improvements on a given starting configuration, especially in phases of the simulation where the acceptance rate is high.
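The effect can be illustrated on a toy one-dimensional function (not the CHARMM system): each round, a whole working set of random trial moves is generated from the current best point, and only an improving move is accepted. A larger working set explores more trials before committing to one:

```python
# Toy illustration of "delayed greediness" (not the actual MD algorithm):
# per round, generate a working set of trial moves from the current best
# point and accept only the best one if it improves the quality factor.
import random

def delayed_greedy(f, x0, working_set_size, rounds, step=0.5, seed=0):
    rng = random.Random(seed)
    best_x, best_q = x0, f(x0)
    for _ in range(rounds):
        trials = [best_x + rng.uniform(-step, step) for _ in range(working_set_size)]
        candidate = min(trials, key=f)
        if f(candidate) < best_q:          # greedy acceptance criterion
            best_x, best_q = candidate, f(candidate)
    return best_q

quality = lambda x: (x - 1.0) ** 2         # stand-in "quality factor", minimum at x=1
q_small_set = delayed_greedy(quality, x0=5.0, working_set_size=2, rounds=40)
q_large_set = delayed_greedy(quality, x0=5.0, working_set_size=120, rounds=40)
```

On this toy landscape the large working set homes in on the minimum much more precisely, which mirrors the intuition above, though the real simulation is of course far higher-dimensional.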

Initially, the working set size was chosen to prevent the MP server from running out of new work units between updates by the controller process. Due to the observed results, we decided to conduct further tests with larger working set sizes.


Figure 5.9: Effect of number of steps per work unit for various working set sizes. Each point is a work unit which returned a result that was accepted due to improvement of the quality factor (q). In all cases, the tests with 50'000 steps/WU reached good quality factors after fewer work units than the tests with fewer steps/WU. Results of the nine short-running tests in Figure 5.6 are shown.


Figure 5.10: Progression of quality factor in time. Results of the nine short-running tests in Figure 5.6 are shown.


Figure 5.11: Effect of working set size. Results of the nine short-running tests in Figure 5.6 are shown.


Figure 5.11 shows the results obtained. Again, where more than one test with the same parameters was conducted, the test obtaining the best quality factor was chosen.

Figure 5.12: Quality factor reached using different working set sizes and numbers of steps per work unit, for all short-running tests.

For all work unit lengths, the test with working set size 120 reached the best quality factor. This implies that there may be an optimal working set size which leads to the best results. Figure 5.12 shows the quality factors obtained for different working set sizes. All short-running tests conducted were taken into account.

5.4.5 Scalability

In the tests conducted during this study, the number of computing nodes is rather small. Ultimately, we would like to find out if the algorithm is able to scale to an Internet-wide environment involving potentially thousands or even millions of nodes. The communication bottlenecks likely to arise in such a situation will probably be solvable, on the one hand by reducing the returned result size to an absolute minimum, and on the other hand by extending the single server to a multiple-server, load-balanced architecture.

However, taking into account the effects of the working set size observed during the tests conducted, it is likely that a single distributed simulation will not be able to make effective use of the large number of available processors. A possible solution would be to add a layer of hierarchy to the system, essentially running multiple simulations, each utilizing a subset of all available nodes. More work will have to go into investigating how the algorithm can be extended to allow such scalability.


5.5 Cluster vs. MetaProcessor Platform

Dr. Giovanni Settanni has been running very similar simulations involving the same biological system on a cluster of 32 dual-processor Athlon MP 1800+¹ machines at the Department of Biochemistry of the University of Zürich. We would like to compare this setting to the MetaProcessor platform, highlighting some key differences. A preliminary rough comparison of the results obtained is also provided [33].

5.5.1 Differences in Architecture

The cluster used by Settanni is a high-performance, dedicated, homogeneous cluster. All CPUs run at the same speed and can thus be expected to calculate work units of the same length in the same amount of time. The high-performance networking also allows the nodes to perform more tightly-coupled work.

5.5.2 Differences in the Algorithm

One key difference is that in the cluster setting, there is no concept of a work unit queue with an arbitrary working set size. This is not needed because the number of nodes is a known factor and can be assumed to be constant. There is thus no entity corresponding to the MP server in our setting. Instead, the master node interacts directly with the slave nodes via MPI.

The number of steps calculated in a single work unit can vary between 10'000 and 100'000. This is possible because the clients themselves can perform additional checks on the results in a shell script outside of CHARMM.

The other parameters used for the MD simulations are the same as those used in thisstudy.

5.5.3 Comparison of Results

Most of the simulations performed by Settanni produced similar values of the quality factor to the tests conducted in this study. One single run, however, reached a very small value of the quality factor. The fraction of accepted work units varies from 0.5% to 3%. The results were reached after approximately 48–72 hours of computation on 12–16 processors, with approximately 2'000 to 10'000 completed work units.

Due to the rather small number of tests made using the MP platform, it may well be possible that, after conducting more simulations, perhaps leaving them to run for a longer duration, results similar to the best result achieved on the cluster could be obtained.

¹ These processors actually run at 1.53 GHz, not at 1.8 GHz as the product name would imply.

The fact that the other results reach similar quality factors is a sign that the migrated algorithm is basically working well on the distributed platform. The working set on our platform may allow the simulations to select more optimal paths in the early phases of the tests, leading to good quality factors after fewer completed work units.

5.6 Problems During the Tests and Validation

5.6.1 Slow Communication Between Server and Controller Process

In the initial tests, communication between the server and the controller seemed to be very slow at times, taking up to a minute to retrieve a single result. Two measures were taken to reduce this problem. Firstly, the CHARMM input files were modified to reduce the amount of data returned, only returning the trajectory if the quality factor has actually improved during the run. This drastically reduced the amount of communication needed between the server and the controller, but of course also between clients and the server.

Furthermore, the communication between the controller process and the server was encrypted with SSL, introducing an overhead due to the SSL handshake and the computation needed for encryption of the data [2]. Also, for each XML-RPC call, a new connection to the server is initiated. After turning the encryption off, the performance of the controller process increased noticeably.
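The shape of such an unencrypted controller-style call can be sketched with any XML-RPC stack; here is a self-contained example using Python's standard library, where the method name and payload are invented and the real MP server API differs:

```python
# Minimal plain-HTTP XML-RPC round trip (hypothetical method and payload;
# the real MetaProcessor server API is different). With the plain http://
# endpoint there is no SSL handshake cost on each new connection.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda: [{"work_unit": 1, "q": 0.9}], "get_results")
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
proxy = ServerProxy(f"http://127.0.0.1:{port}")   # http://, not https://
results = proxy.get_results()                     # one XML-RPC call
server.shutdown()
```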

5.6.2 Monitoring Agent Machines

The MP platform provides no way to easily check whether a specific agent machine is still working. Because of this, failures of machines can go unnoticed for a long time. In an Internet setting, this feature is not that useful, since the administrators of the MP platform have no control over the participating machines. However, in a situation like the one in this study, it should be possible to show agent machines which have not returned any results in a certain time period, allowing the administrator to investigate possible problems and take appropriate action.
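Such a monitor could be built on top of the controller's own records; a sketch with a hypothetical data model (each agent id mapped to the timestamp of its last returned result):

```python
# Sketch of a stale-agent check (hypothetical data model: the controller
# records a unix timestamp for each agent's last returned result).
import time

def stale_agents(last_result_time, max_silence_s=3600, now=None):
    """Return the agents that have been silent longer than max_silence_s."""
    now = time.time() if now is None else now
    return sorted(agent for agent, t in last_result_time.items()
                  if now - t > max_silence_s)

seen = {"agent-01": 1_000_000, "agent-02": 1_003_000}
suspects = stale_agents(seen, max_silence_s=3600, now=1_004_000)
```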


5.7 Optimizations

5.7.1 Reducing Communication

The amount of data contained in a trajectory is relatively large (over 6 megabytes for a run of 100'000 steps with snapshots every 100 steps), and ultimately, only a very small fraction of the calculated trajectories is used. By allowing the nodes to decide whether to return the trajectory or not, the amount of communication needed can be greatly reduced. In our tests, we have already limited the cases where the trajectory is returned to those where the quality factor obtained is better than the quality factor of the "parent" configuration. However, at the beginning of a simulation, all results will fulfill this criterion, although only a few of these results will get accepted as new starting configurations. In other words, the optimal configuration will have improved during the calculation of the work unit on a node. If the client were able to obtain the current optimal configuration at the time the work unit has completed calculation, it could decide not to return the trajectory unless the obtained quality factor is even lower.
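The proposed client-side decision can be sketched as a small filter; the helper and its inputs are hypothetical, and a lower q is better:

```python
# Hypothetical client-side filter: return the large trajectory only if
# the run improved on its parent configuration and, when known, on the
# server's current best configuration (lower q is better).

def should_return_trajectory(run_q, parent_q, server_best_q=None):
    if run_q >= parent_q:          # no improvement over the parent
        return False
    if server_best_q is not None and run_q >= server_best_q:
        return False               # already superseded on the server
    return True

send_a = should_return_trajectory(0.8, parent_q=1.0, server_best_q=0.9)   # improved on both
send_b = should_return_trajectory(0.95, parent_q=1.0, server_best_q=0.9)  # superseded
```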

The MP platform currently restricts communication to the returning of results and requests for new work units, so such a query is not directly supported by the platform. United Devices is considering extending the communication model for future versions of their product, so this optimization may become possible in the future. The implementation of a workaround is already possible, for instance by opening a network connection via the available operating system calls on the node, bypassing the task API library. But it would be preferable for such functionality to be provided by the MP platform.

5.7.2 Alternatives to Greedy Optimization

The results observed suggest that the greedy method of optimization can lead to the simulation getting trapped in a "local minimum", where no more improvements of the quality factor can be made, or only after a very large number of calculated work units.

By increasing the working set size, we are already moving away from a strict greedy algorithm, and the results do indeed show better quality factors for larger working set sizes. However, the improvements made by this simple approach are likely to be limited as well.

Investigating other optimization strategies may allow even better minimization of the quality factor.


Chapter 6

Conclusion

6.1 Summary

The CHARMM code was successfully ported to the MetaProcessor platform, allowing first tests using algorithms for real biochemical problems to be conducted in a limited distributed computing setting involving up to fifty nodes.

The tests showed promising results. Depending on the parameters chosen, the simulations progressed to good results during tests on up to fifty PCs in a relatively short time (less than 24 hours). This compares favourably to most of the results observed on a dedicated cluster.

The system was able to tolerate failures of single nodes, and even the failure of the MetaProcessor server was handled gracefully, allowing the test to continue after recovery of the server.

The significance of two factors was investigated in more detail, namely the number of molecular dynamics steps calculated in each work unit, and the working set size. Both factors do indeed influence the behaviour of the distributed simulation.

The effect of the working set size on the progression of the simulation seems to suggest that the algorithm as implemented will not scale without modifications to very large numbers of nodes. Communication bottlenecks can likely be avoided by reducing the communication volume to an absolute minimum. However, the algorithm itself will probably need to be extended, for instance by introducing an additional hierarchy layer, to effectively exploit very large numbers of nodes.

6.2 Further Work

Due to the small number of tests performed during this study, more tests are necessary to verify the results obtained and the conclusions drawn. In particular, the short-running tests could be continued for a longer period of time to see whether any significant improvements in the quality factor can still be made.

The results of the tests conducted on a limited number of nodes are encouraging and prompt further studies into how the algorithm can be extended to scale to very large numbers of nodes. Changes in the distribution strategy as well as other approaches to minimization are important points to be explored.


References

[1] G. Aloisio, M. Cafaro, C. Kesselman, and R. Williams. Web access to supercomputing. Computing in Science & Engineering, 2001.

[2] G. Apostolopoulos, V. Peris, P. Pradhan, and D. Saha. Securing electronic commerce: Reducing the SSL overhead, 2000.

[3] G. Bell and J. Gray. High performance computing: Crays, clusters, and centers. What next? Technical report, Microsoft Research, August 2001.

[4] Beowulf project homepage. http://www.beowulf.org/.

[5] Protein data bank, 2002. http://www.rcsb.org/.

[6] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proc. 5th ACM Symposium on Principles and Practice of Parallel Prog. (PPoPP), pages 207–216, Santa Barbara, July 1995. ACM.

[7] R. Blumofe and D. Park. Scheduling large-scale parallel computations on networks of workstations. In Proceedings of the Third International Symposium on High-Performance Distributed Computing, pages 96–105, San Francisco, California, August 1994.

[8] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proceedings of the International Conference on Measurement and Modelling of Computer Systems, 2000.

[9] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4(2):187–217, 1983.

[10] Intel-United Devices cancer research project homepage. http://members.ud.com/projects/cancer/.

[11] C. Clementi, H. Nymeyer, and J. N. Onuchic. Topological and energetic factors: What determines the structural details of the transition state ensemble and "en-route" intermediates for protein folding? An investigation for small globular proteins. J. Mol. Biol., 298:937, 2000.

[12] V. Daggett, A. Li, L. S. Itzhaki, D. E. Otzen, and A. R. Fersht. Structure of the transition state for folding of a protein derived from experiment and simulation. J. Mol. Biol., 257:430–440, 1996.

[13] distributed.net project page. http://distributed.net/.

[14] T. M. R. Ellis. A Structured Approach To Fortran 77 Programming. Addison-Wesley Publishers Limited, 1982.

[15] S. Feldman, D. Gay, M. Maimone, and N. Schryer. A Fortran-to-C Converter. Technical Report 149, AT&T Bell Laboratories, March 1995.

[16] S. I. Feldman and P. J. Weinberger. A Portable Fortran 77 Compiler. In UNIX Time Sharing System Programmer's Manual, volume 2. AT&T Bell Laboratories, 1990.

[17] P. Ferrara, J. Apostolakis, and A. Caflisch. Evaluation of a fast implicit solvent model for molecular dynamics simulations. Proteins, 2002. In press.

[18] A. R. Fersht. Enzyme Structure, Mechanism, and Protein Folding. Freeman, New York, 1998.

[19] Folding@home project page. http://folding.stanford.edu/.

[20] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc., 1998.

[21] GNU Fortran Compiler 'g77' Manual.

[22] N. Go. Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng., 12:183, 1983.

[23] J. Gsponer and A. Caflisch. Role of native topology investigated by multiple unfolding simulations of four SH3 domains. J. Mol. Biol., 309:285–298, 2001.

[24] M. Karaul. Metacomputing and Resource Allocation on the World Wide Web. PhD thesis, New York University, May 1998.

[25] J. C. Martinez and L. Serrano. The folding transition state between SH3 domains is conformationally restricted and evolutionarily conserved. Nat. Struct. Biol., 6:1010, 1999.

[26] D. A. Menasce and V. Almeida. Capacity Planning for Web Services. Prentice Hall PTR, 2002.

[27] A. Natrajan, M. Crowley, N. Wilkins-Diehr, M. Humphrey, A. Fox, A. Grimshaw, and C. Brooks. Studying protein folding on the grid: Experiences using CHARMM on NPACI resources under Legion, 2001.

[28] E. Perathoner. Performance driven migration and optimization of a common molecular dynamics code (CHARMM) on different cluster platforms. Diploma thesis, 2001.

[29] L. S. Pitsoulis and M. G. C. Resende. Greedy randomized adaptive search procedures. Technical report, AT&T Labs, 2001.

[30] D. S. Riddle, V. P. Grantcharova, J. V. Santiago, E. Alm, I. Ruczinski, and D. Baker. Experiment and theory highlight role of native state topology in SH3 folding. Nat. Struct. Biol., 6:1016, 1999.

[31] Kyung Dong Ryu. Exploiting Idle Cycles in Networks of Workstations. PhD thesis, University of Maryland, 2001.

[32] Seti@home: Search for extraterrestrial intelligence at home. http://setiathome.ssl.berkeley.edu/.

[33] G. Settanni. Personal communication, 2001–2002.

[34] G. Settanni, A. Cattaneo, and A. Maritan. Role of native-state topology in the stabilization of intracellular antibodies. Biophysical Journal, 81:2935–2945, 2001.

[35] T. Sterling, D. Becker, J. Dorband, D. Savarese, U. Ranawake, and C. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings, International Conference on Parallel Processing, 1995.

[36] Sun Microsystems, Inc. Fortran 77 Language Reference, 1996.

[37] M. Taufer, E. Perathoner, T. Stricker, and A. Caflisch. Performance characterization of a molecular dynamics code on PC clusters - is there any easy parallelism in CHARMM?, 2001.

[38] M. Taufer, T. Stricker, G. Roos, and P. Güntert. On the Migration of the Scientific Code Dyana from SMPs to Clusters of PCs and on to the Grid. 2002.

[39] United Devices, Inc. Edge Distributed Computing with the MetaProcessor Platform, 2001. http://www.ud.com/products/documentation/.

[40] United Devices, Inc. MetaProcessor Platform, Version 2.1 Application Developer's Guide, 2001.

[41] Dave Winer. XML-RPC Specification, 1999. http://www.xml-rpc.com/spec.

[42] B. Zagrovic, E. Sorin, and V. Pande. Beta hairpin folding simulations in atomistic detail using an implicit solvent model. Journal of Molecular Biology, 2001.
