
Page 1: HPC At PNNL, March 2004

R. Scott Studham, Associate Director, Advanced Computing

April 13, 2004

Page 2: HPC Systems at PNNL

Molecular Science Computing Facility
- 11.8 TF Linux-based supercomputer using Intel Itanium2 processors and an Elan4 interconnect
- A balance for our users: 500 TB disk, 6.8 TB memory

PNNL Advanced Computing Center
- 128-processor SGI Altix
- NNSA-ASC “Spray Cool” cluster

Page 3: William R. Wiley Environmental Molecular Sciences Laboratory

Who are we?
- A 200,000 square-foot U.S. Department of Energy national scientific user facility
- Operated by Pacific Northwest National Laboratory in Richland, Washington

What we provide for you
- Free access to over 100 state-of-the-art research instruments
- A peer-review proposal process
- Expert staff to assist or collaborate

Why use EMSL?
- EMSL provides - under one roof - staff and instruments for fundamental research on physical, chemical, and biological processes.

Page 4: HPCS2 Configuration

- 928 compute nodes with 1,976 next-generation Itanium® processors: 11.8 TF, 6.8 TB memory
- Elan4 and Elan3 interconnects
- 4 login nodes with 4Gb-Enet
- 2 system management nodes
- 2Gb SAN / 53 TB Lustre storage

The 11.8 TF system is in full operation now.
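As a rough consistency check (my arithmetic, not from the slide), the headline numbers above imply the per-node and per-processor ratios below; nothing beyond the quoted values is assumed.

```python
# Back-of-the-envelope check of the HPCS2 headline numbers quoted above.
# All inputs come from the slide; the derived ratios are illustrative only.

processors = 1976            # next-generation Itanium processors
compute_nodes = 928
peak_tflops = 11.8           # system peak, TF
memory_tb = 6.8              # total memory, TB

procs_per_node = processors / compute_nodes              # ~2.1 processors per compute node
peak_gflops_per_proc = peak_tflops * 1000 / processors   # ~6 Gflop/s peak per processor
memory_gb_per_proc = memory_tb * 1024 / processors       # ~3.5 GB memory per processor

print(f"{procs_per_node:.1f} processors/node, "
      f"{peak_gflops_per_proc:.1f} Gflop/s peak/processor, "
      f"{memory_gb_per_proc:.1f} GB memory/processor")
```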

Page 5: Who uses the MSCF, and what do they run? (FY02 numbers)

User communities: EMSL, PNNL (not EMSL), other DOE labs, other government agencies, academia, and private industry.

Project types: Grand Challenge, Pilot Project, and Support.

Codes run: NWChem (MD, plane-wave, and ab initio), VASP, ADF, Jaguar, Gaussian, climate codes, users' own codes, and other.

Page 6: MSCF is focused on grand challenges

- More than 67% of the usage is for large jobs.
- Demand for access to this resource is high.
- Fewer users, focused on longer, larger runs and big science.

[Chart: percent of node-hours used by job size (percent of the system used by a single job: <3%, 3-6%, 6-12%, 12-25%, 25-50%, >50%), FY98 through FY02.]
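A breakdown like the chart above comes from batch-accounting data; a minimal sketch of the computation is shown below. The record format, field names, and example jobs are hypothetical, not taken from the MSCF accounting system; only the 928-node machine size is carried over from the configuration slide.

```python
# Hypothetical sketch: bin node-hours by the fraction of the machine a job used.

TOTAL_NODES = 928  # HPCS2 compute nodes (from the configuration slide)

# (nodes used, node-hours consumed) -- made-up example records
jobs = [(8, 120.0), (64, 900.0), (512, 4000.0), (900, 12000.0)]

bins = [("<3%", 0.00, 0.03), ("3-6%", 0.03, 0.06), ("6-12%", 0.06, 0.12),
        ("12-25%", 0.12, 0.25), ("25-50%", 0.25, 0.50), (">50%", 0.50, 1.01)]

usage = {label: 0.0 for label, _, _ in bins}
for nodes, node_hours in jobs:
    frac = nodes / TOTAL_NODES
    for label, lo, hi in bins:
        if lo <= frac < hi:
            usage[label] += node_hours
            break

total = sum(usage.values())
for label, _, _ in bins:
    print(f"{label:>7}: {usage[label] / total:6.1%} of node-hours")
```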

Page 7: World-class science is enabled by systems that deliver the fastest time-to-solution for our science

- Significant improvement (25-45% for a moderate number of processors) in time to solution from upgrading the interconnect to Elan4.
- Improved efficiency and improved scalability.
- HPCS2 is a science-driven computer architecture with the fastest time-to-solution for our users' science of any system we have benchmarked.
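For context (my arithmetic, not the slide's): if the 25-45% figure is read as a reduction in wall-clock time, it corresponds to roughly a 1.3x to 1.8x speedup.

```python
# If "25-45% improvement" means time-to-solution shrank by that fraction,
# the equivalent speedup factors are (illustrative arithmetic only):
for reduction in (0.25, 0.45):
    print(f"{reduction:.0%} less time  ->  {1 / (1 - reduction):.2f}x speedup")
# 25% less time -> 1.33x speedup; 45% less time -> 1.82x speedup
```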

Page 8: Accurate binding energies for large water clusters

These results provide unique information on the transition from the cluster to the liquid and solid phases of water.

Code: NWChem
Kernel: MP2 (disk bound)
Sustained performance: ~0.6 Gflop/s per processor (10% of peak)
Choke point: sustained 61 GB/s of disk IO and 400 TB of scratch space used.

The calculation took only 5 hours on 1,024 CPUs of the HP cluster. This is a capability-class problem that could not be completed on any other system.
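To put the quoted rates in per-processor terms (derived only from the numbers above), a quick sketch:

```python
# Per-processor view of the MP2 water-cluster run described above.
# Inputs are the figures quoted on the slide; the breakdown is illustrative.

cpus = 1024
sustained_gflops_per_cpu = 0.6        # quoted as ~10% of peak
aggregate_disk_gb_per_s = 61.0
scratch_tb = 400.0
wall_hours = 5.0

total_sustained_tflops = cpus * sustained_gflops_per_cpu / 1000   # ~0.6 TF
disk_mb_per_s_per_cpu = aggregate_disk_gb_per_s * 1024 / cpus     # ~61 MB/s per CPU
scratch_gb_per_cpu = scratch_tb * 1024 / cpus                     # ~400 GB per CPU

print(f"~{total_sustained_tflops:.1f} TF sustained, "
      f"~{disk_mb_per_s_per_cpu:.0f} MB/s of disk IO per CPU, "
      f"~{scratch_gb_per_cpu:.0f} GB of scratch per CPU over {wall_hours:.0f} hours")
```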

Page 9: Energy calculation of a protein complex

The Ras-RasGAP protein complex is a key switch in the signaling network initiated by the epidermal growth factor (EGF). This signaling network controls cell death and differentiation, and mutations in the protein complex are responsible for 30% of all human tumors.

Code: NWChem
Kernel: Hartree-Fock
Time to solution: ~3 hours for one iteration on 1,400 processors

Computation of 107 residues of the full protein complex using approximately 15,000 basis functions. This is believed to be the largest calculation of its type.
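For a sense of scale (my estimate, not from the slide): with roughly 15,000 basis functions, each N x N matrix in a Hartree-Fock calculation occupies about 1.7 GB in double precision, so even single matrices must be distributed across the machine.

```python
# Rough size of one N x N matrix at the scale quoted above (illustrative only).
n_basis = 15_000
bytes_per_double = 8

matrix_gb = n_basis ** 2 * bytes_per_double / 1024 ** 3   # ~1.7 GB per matrix
per_proc_mb = matrix_gb * 1024 / 1400                     # ~1.2 MB per processor if distributed

print(f"One {n_basis} x {n_basis} matrix: ~{matrix_gb:.1f} GB "
      f"(~{per_proc_mb:.1f} MB per processor across 1,400 processors)")
```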

Page 10: Biogeochemistry: Membranes for Bioremediation

Molecular dynamics of a lipopolysaccharide (LPS):
- Classical molecular dynamics of the LPS membrane of Pseudomonas aeruginosa and mineral
- Quantum mechanical/molecular mechanics molecular dynamics of the membrane plus mineral

[Figure labels: HPCS1, HPCS2, HPCS3]

Page 11: A new trend is emerging

With the expansion into biology, the need for storage has drastically increased.

EMSL users have stored more than 50 TB in the past 8 months. More than 80% of the data is from experimentalists.

[Chart: projected growth trend for biology data in petabytes (log scale), comparing proteomic data, GenBank, archive, experimental, supercomputer, and computational storage.]

The MSCF provides a synergy between computational and experimental scientists.
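As a quick sanity check on the ingest rate implied above (derived only from the 50 TB / 8 months figure):

```python
# Average ingest rate implied by ">50 TB in the past 8 months" (illustrative).
tb_stored = 50.0
months = 8.0

tb_per_month = tb_stored / months                      # ~6.3 TB/month
gb_per_day = tb_stored * 1024 / (months * 30.4)        # ~210 GB/day
mb_per_s = gb_per_day * 1024 / 86400                   # ~2.5 MB/s sustained average

print(f"~{tb_per_month:.1f} TB/month, ~{gb_per_day:.0f} GB/day, ~{mb_per_s:.1f} MB/s average")
```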

Page 12: Storage Drivers - We support three different domains with different requirements

High Performance Computing - Chemistry
- Low storage volumes (10 TB)
- High-performance storage (>500 MB/s per client, GB/s aggregate)
- POSIX access

High Throughput Proteomics - Biology
- Large storage volumes (PBs) and exploding
- Write once, read rarely if used as an archive
- Modest latency okay (<10 s to data)
- If analysis could be done in place, it would require faster storage

Atmospheric Radiation Measurement - Climate
- Modest-sized storage requirements (100s of TB)
- Shared with the community and replicated to ORNL

Page 13: PNNL's Lustre Implementation

PNNL and the ASCI Tri-Labs are currently working with CFS and HP to develop Lustre. Lustre has been in full production since last August and is used for aggressive IO from our supercomputer.
- Highly stable
- Still hard to manage

We are expanding our use of Lustre to act as the filesystem for our archival storage, deploying a ~400 TB filesystem.

[Chart: aggregate bandwidth (GB/s) vs. number of clients (1, 2, 4, 8) for Lustre over Elan4, Lustre over Elan3, aggregate local IO, and NFS over GigE.]

At 660 MB/s from a single client with a simple “dd”, Lustre is faster than any local or global filesystem we have tested. We are finally in the era where global filesystems provide faster access than local ones.
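The 660 MB/s figure came from a simple dd-style streaming test; a rough Python equivalent is sketched below. The mount point is a placeholder, and a real measurement would need care with caching and with file sizes well beyond client memory.

```python
# Hypothetical single-client streaming-write test, in the spirit of the "dd"
# measurement quoted above. /mnt/lustre/benchfile is a placeholder path.
import os
import time

PATH = "/mnt/lustre/benchfile"
BLOCK = 1024 * 1024            # 1 MiB writes
TOTAL_BLOCKS = 8 * 1024        # 8 GiB total; should exceed client memory in practice

buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb", buffering=0) as f:
    for _ in range(TOTAL_BLOCKS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())       # make sure data actually reached the filesystem
elapsed = time.time() - start

mb_per_s = TOTAL_BLOCKS * BLOCK / elapsed / (1024 * 1024)
print(f"Wrote {TOTAL_BLOCKS * BLOCK / 2**30:.0f} GiB at ~{mb_per_s:.0f} MB/s")
```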

Page 14: Security

Open computing requires a trust relationship between sites. A user logs into siteA and ssh's to siteB; if siteA is compromised, the hacker has probably sniffed the password for siteB.
- Reaction #1: Teach users to minimize jumping through hosts they do not personally know are secure (why did the user trust siteA?)
- Reaction #2: Implement one-time passwords (SecureID)
- Reaction #3: Turn off open access (Earth Simulator?)

Page 15: Thoughts about one-time passwords

A couple of different hurdles to cross:
- We would like to avoid forcing our users to carry a different SecureID card for each site they have access to.
- However, the distributed nature of security (it is run by local site policy) will probably end up with something like this for the short term: lots of FedEx’ed SecureID cards.

As of April 8th, the MSCF has converted to the PNNL SecureID system for all remote ssh logins.

Page 16: Summary

- HPCS2 is running well, and the IO capabilities of the system are enabling chemistry and biology calculations that could not be run on any other system in the world.
- Storage for proteomics is on a super-exponential trend.
- Lustre is great: 660 MB/s from a single client, and we are building a 1/2 PB single filesystem.
- We rapidly implemented SecureID authentication last week.