TRANSCRIPT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adam Hunter,
Solutions Architect, Public Sector ANZ
28th August 2019
Focus on the Science, Not the Server
Pawsey Supercomputing & HPC-AI Advisory Council
IT’S ABOUT SCIENCE, NOT SERVERS.
#AWSresearchcloud
aws.amazon.com/rcp
[Chart: cores consumed over time (days), with workload peaks exceeding a fixed data centre capacity limit]
* Source: Hyperion Research, 2018
The metric for success should be time-to-results
1.1M vCPUs for machine learning
A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing. They broke the record for the largest high-performance cluster in the cloud by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS Region.
The graph highlights the elastic, automatic expansion of resources. Clemson took advantage of the new per-second billing for Amazon EC2 instances. The vCPU usage is comparable to the core count of the largest supercomputers in the world.
[Architecture diagram: a job script submitted through CloudyCluster APIs to a login node and a Slurm scheduler (CCQ), with provisioning and workflow automation software driving Auto Scaling and Spot Fleet inside an Amazon VPC, backed by Amazon S3 and DynamoDB]
Cost advantages
On-Premises (capital expense model):
▪ High upfront capital cost
▪ High cost of ongoing support
Amazon Web Services (AWS) (pay-as-you-go model):
▪ Use only what you need
▪ Multiple pricing models
Cost Optimization

On-Demand: pay for compute capacity by the hour, with per-second billing and no long-term commitments. For spiky workloads, or to define needs.

Reserved: make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.

Spot: up to 90% savings by using excess capacity, charged at a Spot price that fluctuates based on supply and demand. For high-scale, time-flexible workloads.

[Diagram: three purchasing mixes of On-Demand, Reserved, and Spot capacity: conservative, optimized, and optimized with scale-out (magnify the peak)]
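To make the trade-off concrete, here is a minimal sketch comparing a hypothetical 100-hour campaign under On-Demand and Spot pricing. The prices are illustrative placeholders, not real AWS rates, which vary by instance type, Region, and (for Spot) time:

```shell
# Illustrative only: prices below are hypothetical placeholders,
# not actual AWS rates.
od_cents_per_hr=389     # hypothetical On-Demand price, cents/hour
spot_cents_per_hr=117   # hypothetical Spot price at roughly 70% discount
hours=100

od_total=$(( od_cents_per_hr * hours / 100 ))       # dollars
spot_total=$(( spot_cents_per_hr * hours / 100 ))   # dollars
echo "On-Demand: \$${od_total}  Spot: \$${spot_total}"
```

The point of the arithmetic is simply that for time-flexible workloads the same core-hours can cost a small fraction of the On-Demand price.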
HPC on AWS
A virtually unlimited number of architecture and deployment options to meet the demands of both your users and your applications.
All Sorts of HPC Workloads in the Cloud
▪ Life Sciences
▪ Financial Services
▪ Energy & Geo Sciences
▪ Design & Engineering
▪ Media & Entertainment
▪ Autonomous Vehicles
Compute: Amazon EC2 C5n
C5n Instances with 100 Gbps Networking
▪ The first “network optimized” instances on AWS
▪ Intel Skylake CPUs
▪ Nitro System (hypervisor and ENA)
Launching c5n: choose the instance type, then configure the instance.
Network: Elastic Fabric Adapter
What is Elastic Fabric Adapter (EFA)?
EFA is a network interface for C5n and P3dn instances, best for large HPC workloads.
Scale tightly coupled HPC applications on AWS.
HPC software stack in Amazon Elastic Compute Cloud (Amazon EC2)
[Diagram: the userspace and kernel layers of the stack, compared without EFA and with EFA]
What can EFA do?
{
  "LaunchTemplateName": "EFA",
  "LaunchTemplateData": {
    "CpuOptions": {
      "CoreCount": 36,
      "ThreadsPerCore": 1
    },
    "ImageId": "ami-XXXXXXXXXXXXXXXXX",
    "Placement": {
      "GroupName": "cfdinA"
    },
    "InstanceType": "c5n.18xlarge",
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "SubnetId": "subnet-XXXXXXXX",
        "InterfaceType": "efa",
        "Groups": [
          "sg-b2a50ad6"
        ]
      }
    ]
  }
}
Getting Started with EFA
Step 1: Prepare an EFA-enabled Security Group
Step 2: Launch a Temporary Instance
Step 3: Install EFA Software Components
Step 4: Install your HPC Application
Step 5: Create an EFA-enabled AMI
Step 6: Launch EFA-enabled Instances into a Cluster Placement Group
Step 7: Terminate the Temporary Instance
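Steps 1 and 3 can be sketched at the command line. This is a sketch, not a complete walkthrough: the security-group ID is a placeholder, and the installer URL is the published EFA installer location.

```shell
# Step 1: the security group must allow all traffic to and from itself,
# since EFA traffic between instances stays inside the group.
aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXX \
    --protocol -1 --source-group sg-XXXXXXXX

# Step 3: on the temporary instance, install the EFA software components.
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
sudo ./efa_installer.sh -y

# Verify that the EFA provider is visible to libfabric.
fi_info -p efa
```

With the AMI baked (step 5), the JSON launch template shown earlier can be registered with `aws ec2 create-launch-template --cli-input-json` and used for step 6.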
Storage: Amazon FSx for Lustre
Amazon FSx for Lustre
Fully managed third-party file systems optimized for a variety of workloads: fully managed, cost-effective, and high performing.
A parallel distributed file system with massively scalable performance:
• 100+ GiB/s throughput
• Millions of IOPS
• Consistent low latencies
• Built on Lustre, an open-source parallel file system
File system throughput and IOPS scale linearly with storage capacity
Each terabyte (TB) of storage provides 200 MB/second of file system throughput
File systems can scale to hundreds of GB/s and millions of IOPS
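The linear scaling above makes sizing a back-of-the-envelope calculation. A minimal sketch, using the 200 MB/s-per-TB figure from this slide:

```shell
# Aggregate throughput scales linearly with storage capacity:
# roughly 200 MB/s per TB of storage (figure from the slide above).
capacity_tb=48
throughput_mb_s=$(( capacity_tb * 200 ))
echo "${capacity_tb} TB -> ${throughput_mb_s} MB/s aggregate throughput"
```

So a 48 TB file system yields on the order of 9,600 MB/s of aggregate throughput; to go faster, provision more capacity.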
Seamless integration with Amazon S3
Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then:
• Data stored in Amazon S3 is loaded to Amazon FSx for processing
• Output of processing is returned to Amazon S3 for retention
When your workload finishes, simply delete your file system.
Getting Started with FSx for Lustre
[Console walkthrough in five steps; step 5 exports results back to Amazon S3 with: sudo lfs hsm_archive filename]
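The same flow can be sketched with the AWS CLI. The bucket, subnet, and file-system names are placeholders, and the Lustre client must already be installed on the instance:

```shell
# Create a Lustre file system linked to an S3 bucket (IDs are placeholders).
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 3600 \
    --subnet-ids subnet-XXXXXXXX \
    --lustre-configuration ImportPath=s3://my-dataset-bucket

# On each instance: mount the file system once it is available.
sudo mount -t lustre fs-XXXXXXXXXXXXXXXXX.fsx.us-east-1.amazonaws.com@tcp:/fsx /fsx

# After processing, push a result file back to the linked S3 repository.
sudo lfs hsm_archive /fsx/results/output.dat
```

Because the file system is ephemeral by design, deleting it after `hsm_archive` completes is the normal end of a run.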
Orchestration: AWS ParallelCluster
AWS ParallelCluster is now a supported open-source service!
Simplifies the deployment of HPC
in the cloud:
▪ SLURM
▪ Grid Engine
▪ Torque
Incorporates the best of AWS HPC
Services:
▪ FSx Lustre
▪ EFA
▪ AWS Batch
Many new features with AWS ParallelCluster
• Improved cluster scale-up and scale-down
• File system support, including FSx for Lustre, RAID, multiple NFS shares, and EFS
• Regional expansion: all standard Regions, China, and GovCloud
• Improved custom AMI support
Getting started with AWS ParallelCluster
(run from your client machine or AWS Cloud9)
$ sudo pip install aws-parallelcluster
$ pcluster configure
$ pcluster create cfdcluster
$ pcluster ssh cfdcluster
$ pcluster list
$ pcluster delete cfdcluster
Try it with:
https://docs.aws.amazon.com/parallelcluster/
https://github.com/aws/aws-parallelcluster
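As a sketch of what `pcluster configure` produces, here is a minimal `~/.parallelcluster/config` in the v2 INI format. All IDs and names are placeholders, and the EFA and FSx sections assume a tightly coupled cluster like the one in this talk:

```ini
[aws]
aws_region_name = us-east-1

[global]
cluster_template = cfdcluster

[cluster cfdcluster]
key_name = my-keypair
vpc_settings = public
scheduler = slurm
master_instance_type = c5n.xlarge
compute_instance_type = c5n.18xlarge
initial_queue_size = 0
max_queue_size = 16
placement_group = DYNAMIC
enable_efa = compute
fsx_settings = fs

[vpc public]
vpc_id = vpc-XXXXXXXX
master_subnet_id = subnet-XXXXXXXX

[fsx fs]
shared_dir = /fsx
storage_capacity = 3600
```

With `initial_queue_size = 0`, compute nodes exist only while the scheduler has work, which is what makes the pay-as-you-go model effective.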
AWS Partners: Ronin
https://ronin.cloud/
Operations: AWS Systems Manager
Secure Operations with AWS Systems Manager
[Diagram: access controlled through IAM, with a shell or CLI session reaching compute nodes inside a VPC]
Available in all AWS Regions, including AWS GovCloud (US)
Run Command
Session Manager
• Browser-based shell and CLI for EC2 instances
• No need to open inbound ports or manage SSH keys
• Grant access through IAM
• Session auditing and logging
• Support for AWS PrivateLink
Getting Started with Session Manager (AWS Systems Manager)
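Beyond the console, a session can be opened from the CLI. A minimal sketch, assuming the Session Manager plugin for the AWS CLI is installed and the instance has an IAM instance profile that permits Systems Manager; the instance ID is a placeholder:

```shell
# Open an interactive shell on a compute node, with no inbound
# ports open and no SSH keys to manage.
aws ssm start-session --target i-XXXXXXXXXXXXXXXXX
```

The session is audited and logged like any other Systems Manager activity, which is why this suits locked-down research environments.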
Performance Considerations
Performance Considerations for HPC on AWS
• Use Real World Test Cases
• Disable Hyperthreading
• Bind processes to cores
• Launch to a placement group
• Use the default clock source on c5/c5n
• Use an up-to-date OS
• Compile for the host (AVX512)
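A few items from the list above can be sketched at the command line. This assumes Open MPI and GCC; the solver name is a placeholder:

```shell
# Confirm hyperthreading is off (1 thread per core, set via the
# EC2 CpuOptions shown in the EFA launch template earlier).
lscpu | grep 'Thread(s) per core'

# Bind one MPI rank per physical core.
mpirun -np 36 --bind-to core --map-by core ./my_solver

# Compile for the host CPU so AVX-512 instructions are generated.
gcc -O3 -march=native solver.c -o my_solver
```

Binding and host-targeted compilation together typically matter more than any single instance-type choice for tightly coupled codes.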
AWS Researcher’s handbook
Written by Amazon’s Research Computing community for scientists.
• Explains foundational concepts about how AWS can accelerate time-to-science in the cloud
• Step-by-step best practices for securing your environment to ensure your research data is safe and your privacy is protected
• Tools for budget management that will help you control your spending, limit costs, and prevent over-runs
• Catalogue of scientific solutions from partners chosen for their outstanding work with scientists
aws.amazon.com/rcp
Adam Hunter
Solutions Architect, Public Sector
https://aws.amazon.com/hpc/
Fluid dynamics – Ansys Fluent
• C4.8xlarge instance type
• 140M cell model
• F1 car CFD benchmark
Scaling Benchmarks
Resources used in this study
Archer: Cray XC30 supercomputer • two 2.7 GHz, 12-core Intel E5-2697 v2 (Ivy Bridge)
AWS:• z1d: 4.0 GHZ Intel®Xeon®Scalable Processors; 24 core per instance;
16GB Ram per core; 25 Gigabit network bandwidth
• c5n: 3.0/3.5 GHZ Intel®Xeon®Scalable Processors; 36 core per instance; 5.3 GB Ram per core; 100 Gigabit network bandwidth; New Elastic Fabric Adaptor (EFA) for fast networking
Methodology
• OpenFOAM v1806 in double precision (pimpleFoam)
• Scotch decomposition for solving; hierarchical (i.e. constant x/y/z loading) for meshing
• SST-DDES turbulence model
• ANSA-generated 143M/280M-cell unstructured meshes
• Time step = 5e-4 s with five inner iterations
• Preconditioned conjugate gradient linear solver
Scaling Results
[Plot: seconds per time step (0 to 18) against cells per core (0 to 8e5) for z1d, c5n, and Archer, each on the medium and fine meshes]
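As a rough guide to the cells-per-core axis, the arithmetic for the 143M-cell medium mesh on c5n.18xlarge (36 physical cores per instance, hyperthreading disabled) is simple division:

```shell
# Cells per core for the 143M-cell mesh at several cluster sizes.
cells=143000000
cores_per_instance=36
for instances in 4 8 16 32; do
  cores=$(( instances * cores_per_instance ))
  echo "${instances} instances (${cores} cores): $(( cells / cores )) cells/core"
done
```

Doubling the instance count halves the cells-per-core loading, which is how a fixed mesh sweeps left along the x-axis of the plot above.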
Acknowledgements
Dr. Neil Ashton of Oxford University
Stephen Sachs of AWS