HPC in the Human Genome Project James Cuff james@sanger.ac.uk

Post on 18-Dec-2015


Page 1: HPC in the Human Genome Project James Cuff james@sanger.ac.uk

HPC in the Human Genome Project

James Cuff

james@sanger.ac.uk


• The Sanger Centre is a research centre funded primarily by the Wellcome Trust

• Located in 55 acres of parkland

Also on site are the:

• European Bioinformatics Institute (EBI)

• Human Genome Mapping Project Resource Centre (HGMP-RC)


The Sanger Centre

• Founded in 1993; now more than 570 staff members.

• Our purpose is to further the knowledge of the biology of organisms, particularly through large scale sequencing and analysis of their genomes.

• Our lead project is to sequence a third of the human genome as part of the international Human Genome Project.


Sanger Centre research programmes

• Pathogen sequencing programme

• Informatics

– support data collection

– analyse and present results

– develop methodology: algorithms and data resources

• Cancer genome project

• Human genetics programme: study genetic variation (SNPs) and find disease genes


The era of genome sequencing

Organism       Size (Mbases)  No. of genes  Gene density  Type             Completion date
H. influenzae  2              1,700         1/1kb         Bacterium        1995
Yeast          13             6,000         1/2kb         Eukaryotic cell  1996
Nematode       100            18,000        1/6kb         Animal           1998
Human          3000           ?40,000       1/60kb        Mammal           2000/3

Sequence data production increase of >2000%
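
The gene density figures in the table above can be roughly cross-checked from the size and gene-count columns. The following is a minimal sketch, not part of the original slides: the function name `kb_per_gene` is made up for illustration, and the slide's figures are rounded (the human gene count is marked as uncertain), so the computed values only approximately match the quoted densities.

```python
# Rows copied from the slide: (organism, genome size in Mbases, approx. gene count).
GENOMES = [
    ("H. influenzae", 2, 1_700),
    ("Yeast", 13, 6_000),
    ("Nematode", 100, 18_000),
    ("Human", 3_000, 40_000),  # the slide marks this count as uncertain ("?40,000")
]


def kb_per_gene(size_mbases: float, genes: int) -> float:
    """Average genome length per gene in kilobases (1 Mbase = 1,000 kb)."""
    return size_mbases * 1_000 / genes


if __name__ == "__main__":
    for name, size, genes in GENOMES:
        print(f"{name:<14} about 1 gene per {kb_per_gene(size, genes):.1f} kb")
```

With 40,000 genes the human figure comes out at 1 gene per 75 kb rather than the 1/60kb on the slide, which shows how sensitive the estimate is to the assumed gene count.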


The Sequencing Facility


Sanger I.T.

Sanger network: more than 1,600 devices

300 PCs (various)

150+ X-terms/Network Computers (NCD)

250 NT/Mac ABI Collection devices

Various other servers, Linux desktop systems, printers, etc.

Paracel, Compugen and Timelogic systems

>350 Compaq Alpha systems (DS10, DS20, ES40, 8400)

Plus a 440-node sequence annotation farm (PC/DS10/DS10L)

>750 Alpha processors in total


Systems architecture hierarchy

[Diagram labels: Compute Server Farm; RAID Storage (x2); F/C (x2); ASX-0BX (x3); ATM; Front-end Compute Servers (x2); Desktop workgroup systems (x2); LSF]

LSF - Load Sharing Facility by Platform Computing Ltd


Computer Systems Architecture

Fibre Channel/Memory Channel Tru64 Clusters

Implementing tightly coupled clustering with Tru64 V5.x

We get:

• Improved disk I/O (fibre channel) and scalability (multi-CPU, multi-terabyte)

• Improved manageability (single system image: whole clusters are managed as single entities)


ES40 Clusters, F.C Storage


Annotation Farms

• 32 x CableTron 100 Mb/s switches, 16 x RS232 terminal servers, 2 x 155 Mb/s ATM fibre uplinks back to the v5.0 cluster

• Two network subnets (multicast and backbone)

• 640 x 100 Mb/s Fast Ethernet ports

• 1,920 UTP cable crimps, 8 cabinets, ~100 kW of power

• 8 racks, each with 40 x Tru64 v5.0 Alpha DS10L: 320 x 466 MHz Alpha EV6.7, 1U high

• Total of 320 GB memory, spinning 19.2 TB of internal storage

• Roughly equivalent to 10 x GS320; performance around 355 Gflops
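
The annotation farm totals quoted above can be broken down per node with simple arithmetic. This is an illustrative sketch only: it assumes resources are spread evenly across nodes, which the slide does not state, and the function name `farm_summary` is hypothetical.

```python
# Totals as quoted on the slide.
RACKS = 8
NODES_PER_RACK = 40
ETHERNET_PORTS = 640
TOTAL_MEMORY_GB = 320
TOTAL_DISK_TB = 19.2
TOTAL_POWER_KW = 100  # approximate; also covers switches and terminal servers


def farm_summary() -> dict:
    """Derive per-node figures, assuming an even split across all nodes."""
    nodes = RACKS * NODES_PER_RACK
    return {
        "nodes": nodes,
        "ports_per_node": ETHERNET_PORTS / nodes,
        "memory_gb_per_node": TOTAL_MEMORY_GB / nodes,
        "disk_gb_per_node": TOTAL_DISK_TB * 1000 / nodes,
        # Upper bound per node, since the total includes network equipment.
        "power_w_per_node": TOTAL_POWER_KW * 1000 / nodes,
    }


if __name__ == "__main__":
    for key, value in farm_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions each node has two Ethernet ports, which would be consistent with one port per subnet (multicast and backbone), although the slide does not say so explicitly.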


Network Overview

• Highly available NFS (Tru64 CAA)

• Fast I/O (ATM > switched full duplex ethernet)

• Socket data transfer (via rdist, rcp, and MySQL DBI sockets)

• Segmented network architecture via two ELANs

[Diagram labels: 8 node ES40 M/C F/C cluster; ATM uplinks; 172.25; 172.27; Sanger; Farm]
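
As a small illustration of the segmented addressing shown in the network overview, the sketch below sorts addresses into the two segments labelled 172.25 and 172.27. The /16 prefix length is an assumption (the slide shows only the first two octets), and the helper name `elan_for` is hypothetical.

```python
import ipaddress
from typing import Optional

# Assumption: each segment is a /16; the slide gives only the prefixes.
ELANS = {
    "172.25": ipaddress.ip_network("172.25.0.0/16"),
    "172.27": ipaddress.ip_network("172.27.0.0/16"),
}


def elan_for(address: str) -> Optional[str]:
    """Return the label of the segment containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in ELANS.items():
        if ip in network:
            return label
    return None


if __name__ == "__main__":
    for addr in ("172.25.3.4", "172.27.10.20", "10.0.0.1"):
        print(addr, "->", elan_for(addr))
```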


Compute – systems architecture

[Diagram labels: Web server; 400 node S.A. farm; Pathogen; Sequence data processing; Informatics; Mapping; SNP; Ensembl; Blast server; Alta Vista; FTP; Large scale assembly & sequencing; Firewall DMZ External Services; ATM; Trace server]


Enterprise Clustering

• LSF is still key for job scheduling and batch operations

• LSF offers greater granularity of operation and functionality than Tru64 scheduling

• Scheduling on individual nodes, cluster-wide, and across clusters

• With LSF we still have the capability to use many of the 750+ compute nodes as a single Sanger Compute Engine

MODULAR SUPERCOMPUTING
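
To make the batch scheduling point concrete, here is a minimal sketch of how a job array submission to LSF might be assembled. It is not Sanger's actual setup: the queue name `normal`, the job name `annotate`, and the script `./annotate_chunk.sh` are hypothetical. The `bsub` options used (`-q` for the queue, `-J` with an index range for a job array, `-o` with `%J` and `%I` for per-task output files) are standard LSF options; the sketch only builds the command line and does not run it.

```python
import shlex
from typing import List


def build_bsub_array(queue: str, job_name: str, n_tasks: int,
                     command: List[str]) -> List[str]:
    """Build a bsub argument list for an LSF job array of n_tasks tasks."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return [
        "bsub",
        "-q", queue,                        # target queue (hypothetical name)
        "-J", f"{job_name}[1-{n_tasks}]",   # job array with indices 1..n_tasks
        "-o", f"{job_name}.%J.%I.out",      # %J = job ID, %I = array index
        *command,
    ]


if __name__ == "__main__":
    argv = build_bsub_array("normal", "annotate", 400, ["./annotate_chunk.sh"])
    # Print a shell-quoted command line for inspection instead of executing it.
    print(shlex.join(argv))
```

Inside each task, LSF sets the `LSB_JOBINDEX` environment variable, which a script can use to choose its share of the input.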


Projects

• Will involve thousands of CPUs

– Large numbers of PC farm nodes

– High-end, large memory SMP configurations

• All are computationally expensive

• Will require > 100 Terabytes of storage

• We need to continue scaling up and deal with the physical limitations


Immediate Future

[Diagram labels: The Sanger Centre (Genome Campus); The EBI (Genome Campus); ATM; Storage Area Network (SAN); LSF Clustering]

• Institute to Institute Clustering: closer collaborations between Sanger, EBI and other organisations bring the need for site-wide shared clusters.

• Implement Storage Area Network: install multi-TB storage to enable disk mirroring and controller-to-controller snapshots.


Longer Term Future

• Wide Area Clusters: needed for large scale collaborations.

• GRID Technology (Global Distributed Computing): international cluster collaborations with other scientific institutes

GLOBAL COMPUTE ENGINES

• Sanger is keen to keep abreast of this emerging technology


Questions?