Focus on the Science, Not the Server



Page 1

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Adam Hunter,

Solutions Architect, Public Sector ANZ

[email protected]

28th August 2019

Focus on the Science, Not the Server

Pawsey Supercomputing & HPC-AI Advisory Council

Page 2

IT’S ABOUT SCIENCE, NOT SERVERS.

#AWSresearchcloud

aws.amazon.com/rcp

Page 3

[Figure: cores in use versus time (days), with the data centre capacity limit marked]

* Source: Hyperion Research, 2018

The metric for success should be time-to-results

Page 4

1.1M vCPUs for machine learning

A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing. They broke the record for the largest high-performance cluster in the cloud by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS Region.

The graph highlights the elastic, automatic expansion of resources.

Clemson took advantage of the new per-second billing for Amazon EC2 instances.

The vCPU count usage is comparable to the core count on the largest supercomputers in the world.

[Diagram: provisioning and workflow automation software (CloudyCluster). Labels: job script, CloudyCluster APIs, login, scheduler, SLURM, CCQ, Auto Scaling, Spot Fleet, Amazon S3, DDB, Amazon VPC]

Page 5

Cost advantages

On-Premises: Capital Expense Model

▪ High upfront capital cost

▪ High cost of ongoing support

Amazon Web Services (AWS): Pay As You Go Model

▪ Use only what you need

▪ Multiple pricing models

Page 6

Cost Optimization

Spot: Up to 90% savings by using excess capacity, charged at a Spot price that fluctuates based on supply and demand. For high-scale, time-flexible workloads.

Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.

On-Demand: Pay for compute capacity by the hour, with per-second billing and no long-term commitments. For spiky workloads, or to define needs.

[Figure: stacked Reserved, On-Demand, and Spot capacity for three strategies: Conservative; Optimized; Optimized with scale-out (magnify the peak)]
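As a rough illustration of how the three purchase options combine, the sketch below estimates the cost of a blended fleet. The hourly rate, the discount fractions, and the instance-hour mixes are all hypothetical placeholders, not AWS prices; only the "up to 90%" ceiling for Spot comes from the slide.

```python
# Hypothetical On-Demand rate; real prices vary by instance type and Region.
ON_DEMAND_RATE = 4.00  # assumed $/instance-hour, for illustration only

# Assumed discounts relative to On-Demand. The slide only states that Spot
# savings are "up to 90%"; every number here is a made-up placeholder.
DISCOUNT = {"on_demand": 0.0, "reserved": 0.40, "spot": 0.70}


def blended_cost(instance_hours):
    """Return the total cost for a dict of {pricing_model: instance_hours}."""
    total = 0.0
    for model, hours in instance_hours.items():
        rate = ON_DEMAND_RATE * (1.0 - DISCOUNT[model])
        total += rate * hours
    return total


def savings_vs_on_demand(instance_hours):
    """Fraction saved compared with running every hour On-Demand."""
    all_hours = sum(instance_hours.values())
    baseline = ON_DEMAND_RATE * all_hours
    return 1.0 - blended_cost(instance_hours) / baseline


if __name__ == "__main__":
    # Two hypothetical 1,000 instance-hour mixes, named after the slide's strategies.
    mixes = {
        "conservative": {"reserved": 600, "on_demand": 400},
        "optimized": {"reserved": 600, "on_demand": 100, "spot": 300},
    }
    for name, mix in mixes.items():
        print(name, round(blended_cost(mix), 2), round(savings_vs_on_demand(mix), 3))
```

Swapping in current prices for a specific instance type and Region turns this into a quick sanity check before committing to a reservation.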

Page 7

HPC on AWS

A virtually unlimited number of architecture and deployment options to meet the demands of both your users and your applications.

Page 8

All Sorts of HPC Workloads in the Cloud

Life Sciences

Financial Services

Energy & Geo Sciences

Design & Engineering

Media & Entertainment

Autonomous Vehicles

Page 9

Compute: Amazon EC2 C5n

Page 10

C5n Instances with 100 Gbps Networking

▪ The first “network optimized” instances on AWS

▪ Intel Skylake CPUs

▪ Nitro System (hypervisor and ENA)

C5n

With EFA

Page 11

Launching c5n:

Choose Instance Type

Configure Instance

Page 12

Network: Elastic Fabric Adapter

Page 13

What is Elastic Fabric Adapter (EFA)

C5n P3dn

EFA: Elastic Fabric Adapter, best for large HPC workloads

Scale tightly-coupled HPC applications on AWS

Page 14

HPC software stack in Amazon Elastic Compute Cloud (Amazon EC2)

Userspace

Kernel

Without EFA With EFA

Page 15

What can EFA do?

Page 16

Getting Started with EFA

Step 1: Prepare an EFA-enabled Security Group

Step 2: Launch a Temporary Instance

Step 3: Install EFA Software Components

Step 4: Install your HPC Application

Step 5: Create an EFA-enabled AMI

Step 6: Launch EFA-enabled Instances into a Cluster Placement Group

Step 7: Terminate the Temporary Instance

{
  "LaunchTemplateName": "EFA",
  "LaunchTemplateData": {
    "CpuOptions": {
      "CoreCount": 36,
      "ThreadsPerCore": 1
    },
    "ImageId": "ami-XXXXXXXXXXXXXXXXX",
    "Placement": {
      "GroupName": "cfdinA"
    },
    "InstanceType": "c5n.18xlarge",
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "SubnetId": "subnet-XXXXXXXX",
        "InterfaceType": "efa",
        "Groups": [
          "sg-b2a50ad6"
        ]
      }
    ]
  }
}
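The launch template above can also be generated programmatically, which helps avoid unbalanced braces when it is edited by hand. This is a minimal sketch: the function name `efa_launch_template` and its parameters are illustrative, the field names are copied from the slide, and the IDs are placeholders.

```python
import json


def efa_launch_template(name, ami_id, subnet_id, security_group_id,
                        placement_group, instance_type="c5n.18xlarge",
                        core_count=36):
    """Build a launch-template document shaped like the one on the slide."""
    return {
        "LaunchTemplateName": name,
        "LaunchTemplateData": {
            # One thread per core means hyperthreading is off at launch.
            "CpuOptions": {"CoreCount": core_count, "ThreadsPerCore": 1},
            "ImageId": ami_id,
            # The cluster placement group from Step 6.
            "Placement": {"GroupName": placement_group},
            "InstanceType": instance_type,
            "NetworkInterfaces": [
                {
                    "DeviceIndex": 0,
                    "SubnetId": subnet_id,
                    "InterfaceType": "efa",  # request an EFA, not a plain ENI
                    "Groups": [security_group_id],
                }
            ],
        },
    }


if __name__ == "__main__":
    template = efa_launch_template(
        name="EFA",
        ami_id="ami-XXXXXXXXXXXXXXXXX",   # placeholder, as on the slide
        subnet_id="subnet-XXXXXXXX",      # placeholder, as on the slide
        security_group_id="sg-XXXXXXXX",  # placeholder
        placement_group="cfdinA",
    )
    print(json.dumps(template, indent=2))
```

Saved to a file, the printed JSON should be usable with `aws ec2 create-launch-template --cli-input-json file://<file>`, once the placeholders are replaced with real IDs.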

Page 17

Storage: Amazon FSx Lustre

Page 18

Amazon FSx Lustre

Fully managed third-party file systems optimized for a variety of workloads

Fully managed

High performing

Cost-effective

Parallel distributed file system

Massively scalable performance

• 100+ GiB/s throughput

• Millions of IOPS

• Consistent low latencies

• Lustre, an open-source parallel file system

Page 19

File system throughput and IOPS scale linearly with storage capacity

Each terabyte (TB) of storage provides 200 MB/second of file system throughput

File systems can scale to hundreds of GB/s and millions of IOPS
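The linear scaling rule lends itself to a quick sizing calculation. The sketch below uses only the 200 MB/second per TB figure from this slide; the function names are illustrative, and the code ignores any provisioning increments or limits the service may impose.

```python
import math

MB_PER_SEC_PER_TB = 200  # from the slide: each TB provides 200 MB/second


def fsx_throughput_mb_per_sec(capacity_tb):
    """Aggregate throughput implied by a given storage capacity in TB."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return capacity_tb * MB_PER_SEC_PER_TB


def capacity_for_throughput_tb(target_mb_per_sec):
    """Smallest whole number of TB whose throughput meets the target."""
    if target_mb_per_sec <= 0:
        raise ValueError("target_mb_per_sec must be positive")
    return math.ceil(target_mb_per_sec / MB_PER_SEC_PER_TB)


if __name__ == "__main__":
    for tb in (1, 10, 100):
        print(tb, "TB ->", fsx_throughput_mb_per_sec(tb), "MB/s")
    print("TB needed for 5000 MB/s:", capacity_for_throughput_tb(5000))
```

In other words, a workload that is throughput-bound rather than capacity-bound may need to provision more storage than its data set strictly requires.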

Page 20

Seamless integration with Amazon S3

Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then:

• Data stored in Amazon S3 is loaded to Amazon FSx for processing

• Output of processing returned to Amazon S3 for retention

When your workload finishes, simply delete your file system.

Page 21

Getting Started with FSx Lustre

[Steps 1 to 4: slide graphics, no text recovered]

Step 5: sudo lfs hsm_archive filename

Page 22

Orchestration: AWS ParallelCluster

Page 23

AWS ParallelCluster is now a supported open-source service!

Simplifies the deployment of HPC in the cloud:

▪ SLURM

▪ Grid Engine

▪ Torque

Incorporates the best of AWS HPC services:

▪ FSx Lustre

▪ EFA

▪ AWS Batch

Page 24

Many new features with AWS ParallelCluster

• Improved cluster scale up and scale down

• Filesystem support, including:
  ▪ FSx Lustre
  ▪ RAID
  ▪ Multiple NFS shares
  ▪ EFS

• Regional expansion:
  ▪ All Standard Regions
  ▪ China
  ▪ GovCloud

• Improved support for custom AMIs

Page 25

Getting started with AWS ParallelCluster

$ sudo pip install aws-parallelcluster

$ pcluster configure

$ pcluster ssh cfdcluster

$ pcluster list

$ pcluster delete cfdcluster

Try it with:

https://docs.aws.amazon.com/parallelcluster/

https://github.com/aws/aws-parallelcluster

Client / Cloud9
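For orientation, the sketch below writes the kind of configuration file that `pcluster configure` produces and that the other commands read. It assumes the INI-style format of the ParallelCluster 2.x releases current at the time of this talk. The section and key names are given as best understood here and should be checked against the documentation linked above. The helper name and every ID, key name, and bucket are placeholders.

```python
import configparser
import io


def build_pcluster_config(region, key_name, vpc_id, subnet_id, s3_import_path):
    """Return an INI-style cluster config as text (assumed 2.x key names)."""
    config = configparser.ConfigParser()
    config["global"] = {"cluster_template": "default"}
    config["aws"] = {"aws_region_name": region}
    config["cluster default"] = {
        "key_name": key_name,
        "scheduler": "slurm",                     # SLURM, as listed on the slides
        "compute_instance_type": "c5n.18xlarge",  # EFA-capable instance type
        "placement_group": "DYNAMIC",             # assumed value: create a group
        "enable_efa": "compute",                  # assumed value: EFA on compute nodes
        "vpc_settings": "public",
        "fsx_settings": "fs",
    }
    config["vpc public"] = {"vpc_id": vpc_id, "master_subnet_id": subnet_id}
    config["fsx fs"] = {
        "shared_dir": "/fsx",
        "storage_capacity": "3600",               # GB, placeholder size
        "import_path": s3_import_path,            # S3 data set to link
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_pcluster_config(
        region="us-east-1",
        key_name="my-key",               # placeholder
        vpc_id="vpc-XXXXXXXX",           # placeholder
        subnet_id="subnet-XXXXXXXX",     # placeholder
        s3_import_path="s3://my-bucket"  # placeholder
    ))
```

Generating the file from code is optional; the point is only to show how the scheduler, EFA, placement group, and FSx Lustre choices from the earlier slides come together in one place.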

Page 26

AWS Partners: Ronin

Page 30

https://ronin.cloud/

Page 31

Operations: AWS Systems Manager

Page 32

Secure Operations with AWS Systems Manager

Session Manager

• Browser-based shell and CLI for EC2 instances

• No need to open inbound ports or manage SSH keys

• Grant access through IAM

• Session auditing and logging

• Support for AWS PrivateLink

Available in all AWS regions including AWS GovCloud (US)

[Diagram labels: Shell or CLI, IAM, Access Control, Run Command, Session Manager, Tags, VPC, Compute Nodes]

Page 33

Getting Started with Session Manager

AWS Systems Manager

Page 34

Performance Considerations

Page 35

Performance Considerations for HPC on AWS

• Use Real World Test Cases

• Disable Hyperthreading

• Bind processes to cores

• Launch to a placement group

• Use the default clock source on c5/c5n

• Use an up-to-date OS

• Compile for the host (AVX512)
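To make "bind processes to cores" concrete, here is a minimal, Linux-oriented sketch that pins each process to one core based on its node-local rank. MPI launchers normally provide their own binding options, so this is an illustration of the idea rather than a recommended mechanism. The helper names are made up for the example, and `OMPI_COMM_WORLD_LOCAL_RANK` is the variable Open MPI sets; other launchers use different names.

```python
import os


def allowed_cores():
    """Cores this process may run on (falls back to a simple range)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def core_for_rank(local_rank, cores):
    """Pick one core per node-local rank, wrapping if oversubscribed."""
    if not cores:
        raise ValueError("cores must not be empty")
    return cores[local_rank % len(cores)]


def bind_current_process(core):
    """Pin the calling process to a single core where the OS supports it."""
    if hasattr(os, "sched_setaffinity"):  # available on Linux
        os.sched_setaffinity(0, {core})
        return True
    return False


if __name__ == "__main__":
    # Defaults to rank 0 when not started by an MPI launcher.
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    core = core_for_rank(local_rank, allowed_cores())
    bound = bind_current_process(core)
    print("local rank", local_rank, "core", core, "bound:", bound)
```

With hyperthreading disabled (one thread per core), each visible CPU corresponds to a physical core, so a one-rank-per-core mapping like this avoids two ranks sharing a core.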

Page 36

AWS Researcher’s handbook

Written by Amazon’s Research Computing community for scientists.

• Explains foundational concepts about how AWS can accelerate time-to-science in the cloud

• Step-by-step best practices for securing your environment to ensure your research data is safe and your privacy is protected

• Tools for budget management that will help you control your spending, limit costs, and prevent overruns

• Catalogue of scientific solutions from partners chosen for their outstanding work with scientists

aws.amazon.com/rcp

Page 37

aws.amazon.com/rcp

Page 38

Adam Hunter

Solutions Architect, Public Sector

[email protected]

https://aws.amazon.com/hpc/

Page 41

Fluid dynamics – Ansys Fluent

• C4.8xlarge instance type

• 140M cell model

• F1 car CFD benchmark

Page 42

Scaling Benchmarks

Page 43

Resources used in this study

Archer: Cray XC30 supercomputer

• Two 2.7 GHz, 12-core Intel E5-2697 v2 (Ivy Bridge) processors per node

AWS:

• z1d: 4.0 GHz Intel® Xeon® Scalable Processors; 24 cores per instance; 16 GB RAM per core; 25 Gigabit network bandwidth

• c5n: 3.0/3.5 GHz Intel® Xeon® Scalable Processors; 36 cores per instance; 5.3 GB RAM per core; 100 Gigabit network bandwidth; new Elastic Fabric Adapter (EFA) for fast networking

Page 44

Methodology

• OpenFOAM v1806 in Double Precision (pimpleFoam)

• Scotch decomposition for solving; hierarchical decomposition (i.e. constant x/y/z loading) for meshing

• SST-DDES Turbulence Model

• ANSA generated 143/280M cell unstructured mesh

• Time step = 5e-4 s with five inner iterations

• Preconditioned Conjugate Gradient Linear Solver

Page 45

Scaling Results

[Figure: seconds per time-step versus cells per core, for z1d, c5n, and Archer, each on the medium and fine meshes]
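The horizontal axis of the scaling plot, cells per core, follows directly from the mesh size and the core counts listed under "Resources used in this study". The sketch below reproduces that arithmetic. Mapping "medium" to the 143M-cell mesh and "fine" to the 280M-cell mesh is an assumption, and the function names are illustrative. No timing results are included because the slide text does not give them.

```python
import math

# Cores per node or instance, from the resources slide
# (Archer: two 12-core processors per node).
CORES_PER_NODE = {"archer": 24, "z1d": 24, "c5n": 36}

# Assumed mapping of the legend names onto the 143/280M cell meshes.
MESH_CELLS = {"medium": 143_000_000, "fine": 280_000_000}


def cells_per_core(mesh, platform, nodes):
    """Cells handled by each core when the mesh is spread over `nodes`."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return MESH_CELLS[mesh] / (CORES_PER_NODE[platform] * nodes)


def nodes_for_target(mesh, platform, target_cells_per_core):
    """Smallest node count that brings cells/core down to the target."""
    if target_cells_per_core <= 0:
        raise ValueError("target_cells_per_core must be positive")
    per_node = CORES_PER_NODE[platform] * target_cells_per_core
    return math.ceil(MESH_CELLS[mesh] / per_node)


if __name__ == "__main__":
    for nodes in (10, 40, 160):
        print("c5n, medium mesh,", nodes, "instances:",
              round(cells_per_core("medium", "c5n", nodes)), "cells/core")
    print("z1d instances for 200k cells/core on the fine mesh:",
          nodes_for_target("fine", "z1d", 200_000))
```

Reading the plot from right to left therefore corresponds to adding instances: fewer cells per core means more cores working on the same mesh.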

Page 46

Acknowledgement

Dr. Neil Ashton of Oxford University

Stephen Sachs, of AWS