building and managing production bioclusters chris dagdigian biosilico vol2, no. 5 september 2004...

Building and managing production bioclusters

Chris Dagdigian

BIOSILICO Vol. 2, No. 5

September 2004

Ankur Dhanik

Computer Cluster

• A computer cluster consists of connected computers/servers/resources acting like a single system.

• System capable of performing tasks previously delegated to machines costing hundreds of thousands to millions of dollars.

• More efficient resource utilization.

• General benefit of a flexible research-computing infrastructure that can be tuned and adapted to meet changing research and user demands.

• Areas of scientific inquiry previously discarded as impossible are now feasible.

Biology and cluster configuration

• Cluster configuration influenced by intended application mix.

• In bioinformatics, serial computing problems, also referred to as “embarrassingly parallel”, are far more common than tightly coupled parallel ones.

• Problems of this type can be broken down into a series of independent steps, each of which can be completed in any order without affecting the result.

• Examples: large-scale bioinformatics sequence analysis and experimentation.

Biology and cluster configuration

• Sequence analysis

– compare one sequence against a database of many sequences.

– vastly increase performance by simply dividing the query sequences up and running multiple searches at the same time on separate machines.

• Experimentation

– run slight variations of the same program thousands or millions of times in a row.

– this can be dealt with via loosely coupled compute clusters, also known as compute farms.

• High-performance clusters such as Beowulf are not needed

– they are good for tightly coupled problems.

• The large number of embarrassingly parallel problems is the primary driver for widespread adoption of clusters.
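The "divide the query sequences up" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_search` is a hypothetical stand-in for a real search tool (it does plain substring matching), the sequences are made up, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_queries(queries, n_chunks):
    """Deal queries round-robin into n_chunks independent batches."""
    return [queries[i::n_chunks] for i in range(n_chunks)]


def toy_search(batch, database):
    """Hypothetical stand-in for a search tool: for each query, list the
    database entries that contain it as a substring."""
    return {q: [name for name, seq in database.items() if q in seq] for q in batch}


def search_all(queries, database, n_workers=3):
    """Run one toy_search per batch concurrently and merge the results.
    Batches share nothing, so the order in which they finish is irrelevant."""
    hits = {}
    batches = split_queries(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda b: toy_search(b, database), batches):
            hits.update(partial)
    return hits


# Made-up example data, not taken from the slides.
database = {"seqA": "ACGTACGT", "seqB": "TTGACCA", "seqC": "GGGACGTT"}
queries = ["ACGT", "GACC", "TTT", "GGG"]
hits = search_all(queries, database)
```

On a compute farm each batch would typically become its own job on its own node; the point of the sketch is only that the batches never need to talk to each other.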

A typical biocluster

[Diagram labels: users; software-based distributed resource management (DRM); Ethernet network; small inexpensive servers numbered 1, 2, ..., N]

A typical biocluster

[Diagram: portal architecture. Labels: public local area network; portal machine, aka ‘master’, ‘head’ or ‘login’ node; private cluster network; cluster compute elements; file server]

Design considerations

• Reliability, Availability, and Security

– compute nodes should be anonymous and interchangeable to support non-disruptive troubleshooting, maintenance and upgrade activities.

– critical failure points such as file servers and portal machines need to be duplicated or made resilient to failure.

• Flexibility and Scalability

– multiple competing users, workflows and projects should be supported simultaneously.

• Manageability

– administrative overhead should be minimized; this requires methodologies for automating or reducing administration tasks.

– the software DRM layer needs to ensure that business and scientific priorities can dynamically alter the allocation of computing resources.

Pre-purchase decisions

• DRM

– simplifies interaction with the cluster at both the user level and the administration level.

– an important decision.

– the DRM software suites most commonly seen in the life sciences are Sun Grid Engine (SGE) and Platform LSF.

– when it comes to flexible yet sophisticated resource sharing and job scheduling, especially among many different groups or projects, LSF still has the edge in functionality and ease of configuration.

– installing a sophisticated Grid Engine configuration can be an adventure.

– experience suggests that LSF requires the least amount of resources to install, configure and maintain over time.
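To make concrete what a DRM layer does when priorities decide who gets compute resources, here is a deliberately tiny sketch. It is not the SGE or LSF API: `ToyScheduler`, the job names and the priority numbers are all hypothetical. It only shows higher-priority work being placed first on interchangeable nodes.

```python
import heapq
import itertools


class ToyScheduler:
    """Hypothetical priority dispatcher; not modeled on any real DRM API."""

    def __init__(self, nodes):
        self.free_nodes = list(nodes)   # anonymous, interchangeable nodes
        self.pending = []               # heap of (-priority, seq, job name)
        self._seq = itertools.count()   # tie-breaker: keeps submission order

    def submit(self, job, priority):
        """Queue a job; a larger priority number means more urgent."""
        heapq.heappush(self.pending, (-priority, next(self._seq), job))

    def dispatch(self):
        """Place the most urgent pending jobs on whatever nodes are free."""
        placed = []
        while self.pending and self.free_nodes:
            _, _, job = heapq.heappop(self.pending)
            placed.append((job, self.free_nodes.pop(0)))
        return placed


sched = ToyScheduler(["node1", "node2"])
sched.submit("nightly-search", priority=1)
sched.submit("deadline-run", priority=10)
sched.submit("test-job", priority=1)
first_round = sched.dispatch()
```

With two free nodes, the later-submitted but higher-priority job is placed first and one low-priority job stays queued. Real DRM suites layer fair-share policies, queues and resource limits on top of this basic idea.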

Choosing hardware

• Science and scientific application demands should drive hardware configuration.

• Specific application benchmarks are often absent.

• Dual-processor Intel Xeon-based servers for the compute node configuration.

• Networking technology

– switched Gigabit Ethernet is affordable and should be the default interconnect for cluster systems.

– alternative cluster interconnects such as InfiniBand and Myrinet offer higher performance, but they come at a large cost, and few existing life science application codes can benefit from such technologies.

Choosing hardware

• Storage

– the speed of cluster storage is usually a performance bottleneck.

– network storage: storage devices that are accessed over a computer network rather than being directly connected to the computer, e.g. NAS (network attached storage), SAN (storage area network) or hybrid architectures.

– large internal disk drives within each compute node can be used to cache data needed for data-intensive cluster jobs.
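The local-disk caching idea can be sketched as a "copy on first use" helper. This is a minimal sketch, assuming a file on network storage is identified by its name alone; `cached_copy` and the file name are hypothetical, and temporary directories stand in for the file server and the node's internal disk.

```python
import shutil
import tempfile
from pathlib import Path


def cached_copy(network_path, cache_dir):
    """Return a local copy of network_path, copying it only on first use."""
    network_path = Path(network_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / network_path.name
    if not local_path.exists():
        # First request on this node: pull the file across the network once.
        shutil.copy2(network_path, local_path)
    return local_path


# Temporary directories stand in for shared storage and local scratch disk.
with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as scratch:
    db = Path(shared) / "sequences.fasta"
    db.write_text(">seq1\nACGT\n")
    first = cached_copy(db, scratch)
    db.unlink()                        # later reads no longer touch shared storage
    second = cached_copy(db, scratch)
    same_file = first == second
    cached_text = second.read_text()
```

The sketch never checks whether the cached copy is stale; a production setup would need some way to refresh local copies when the master data changes.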

Deploying, monitoring and management

• Maintenance methodology

– if a cluster node enters a faulted state, the node is wiped and reinstalled via the network.

– the power to the host is cycled.

– if the node fails to rejoin the cluster, it is disabled and considered failed.

– it is replaced later at the convenience of the operator.

• Prepackaged methods exist for handling remote unattended operating system installations and rebuilds, e.g. SystemImager for Linux clusters, NetBoot for Apple hardware and Mac OS X.
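The maintenance methodology above amounts to a short decision procedure, sketched here with the steps in the order the slide lists them. The three hooks (`reinstall`, `power_cycle`, `has_rejoined`) and the node names are hypothetical placeholders for whatever a site actually uses; they are not SystemImager or NetBoot calls.

```python
def recover_node(node, reinstall, power_cycle, has_rejoined):
    """Run the wipe/reinstall, power-cycle and rejoin check for one faulted node.

    The three callables are site-supplied hooks (hypothetical here).
    Returns "active" if the node rejoined the cluster, otherwise "failed".
    """
    reinstall(node)        # wipe and reinstall over the network
    power_cycle(node)      # cycle power to the host
    if has_rejoined(node):
        return "active"
    return "failed"        # disabled; replaced later at the operator's convenience


# Simulated hooks so the sketch runs without any real hardware.
log = []
healthy_after_rebuild = {"node07"}


def fake_reinstall(node):
    log.append(("reinstall", node))


def fake_power_cycle(node):
    log.append(("power_cycle", node))


def fake_has_rejoined(node):
    return node in healthy_after_rebuild


states = {}
for name in ["node07", "node12"]:
    states[name] = recover_node(name, fake_reinstall, fake_power_cycle, fake_has_rejoined)
```

Because nodes are treated as anonymous and interchangeable, no attempt is made to diagnose the fault: a node either comes back after a rebuild or is set aside.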

Conclusions

• Building high quality clusters for use in computational biology is a non-trivial task.

• It is important to

– understand user and application requirements.

– actively participate in the DRM selection process.

– avoid fixation on raw price/performance figures that might not reflect the true costs of deploying, managing and supporting distributed systems.

– beware of total solutions.

Questions & discussion

• How difficult or easy is it to detect failure modes – hardware, code, process?

• How difficult is it to run a cluster with mixed-architecture nodes?
