Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster


Upload: amazon-web-services

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Adam Boeglin, HPC Solutions Architect

Monday, October 31, 2016

Launch a thousand core HPC cluster in minutes with AWS CfnCluster

Page 2: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Webinar Highlights

• What is CfnCluster and when to use it
• Architecture guidance to fit your security models
• How to install and configure CfnCluster
• Demo: Review of CfnCluster and managing compute at scale

Page 3: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Introduction to CfnCluster

• AWS CloudFormation + Cluster = CfnCluster
• Simple to install, easy to manage
• Everything you need to get a cluster up and running in minutes
  • Head node with scheduler
  • Shared NFS Storage
    • /home
    • /shared
  • OpenMPI
  • Compute nodes that grow and shrink on demand

Page 4: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Workloads Well Suited for CfnCluster

• Computational Fluid Dynamics
• Semiconductor Design
• Weather Modeling
• Genomics and Molecular Simulation
• Seismic and reservoir simulations
• 3D rendering and visualizations
• … anything that uses a traditional HPC scheduler

Page 5: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Cluster HPC and Grid HPC

Cluster HPC
Tightly coupled, latency sensitive applications.
Use larger EC2 compute instances, placement groups, Enhanced Networking.

Grid HPC
Loosely coupled, pleasingly parallel.
Requires very little node to node interaction.

Grids of Clusters
Use a grid strategy on the cloud to run a group of parallel, individually clustered HPC jobs.

Page 6: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Computational Fluid Dynamics
ANSYS Fluent

• AWS c4.8xlarge
• 140M cells
• F1 car CFD benchmark

http://www.ansys-blog.com/simulation-on-the-cloud/

Page 7: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

https://aws.amazon.com/hpc/cfncluster/

Page 8: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Configuration Options

• Operating System
  • Amazon Linux
  • CentOS 6
  • CentOS 7
  • Ubuntu 14.04
• Scheduler
  • Sun Grid Engine (SGE)
  • OpenLava
  • Torque
  • SLURM
• Storage Size & IOPS
• EBS & Instance Store Encryption
• Scaling Speed & Limits
• Provisioning Scripts
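
As a rough illustration of how these options come together, the sketch below sizes a fleet for a target core count and renders a minimal config file with Python's standard configparser. The key names (base_os, scheduler, compute_instance_type, initial_queue_size, max_queue_size), the values "alinux" and "sge", and the "cluster default" section name are assumptions to be checked against the CfnCluster configuration docs. The figure of 36 vCPUs per c4.8xlarge is passed in as a parameter, and how a scheduler counts slots on hyperthreaded instances may differ.

```python
import configparser
import io
import math


def nodes_for_cores(target_cores: int, cores_per_node: int) -> int:
    """Smallest number of compute nodes that provides at least target_cores."""
    return math.ceil(target_cores / cores_per_node)


def build_config(base_os: str, scheduler: str, instance_type: str, max_nodes: int) -> str:
    """Render a minimal INI config. Key and section names are assumptions."""
    cfg = configparser.ConfigParser()
    cfg["global"] = {"cluster_template": "default"}
    cfg["cluster default"] = {
        "base_os": base_os,                      # operating system option
        "scheduler": scheduler,                  # scheduler option
        "compute_instance_type": instance_type,
        "initial_queue_size": "0",               # start with no compute nodes
        "max_queue_size": str(max_nodes),        # upper scaling limit
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# 36 is the assumed number of vCPUs on one c4.8xlarge instance.
max_nodes = nodes_for_cores(1000, 36)
config_text = build_config("alinux", "sge", "c4.8xlarge", max_nodes)
print(config_text)
```

Under these assumptions, a thousand cores needs 28 compute nodes, which becomes the scaling limit in the rendered file.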

Page 9: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Many AWS services to tie it all together

• CloudFormation manages the state of the cluster
• Amazon CloudWatch & Auto Scaling let the compute fleet grow and shrink on demand
• Amazon SQS & Amazon SNS allow compute nodes to signal to the master when they’re online
• AWS Identity and Access Management (IAM) allows for fine-grained access control
• Amazon S3 for storage of CloudFormation templates

Page 10: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Standalone CfnCluster

[Diagram. Labels: Amazon S3, DynamoDB, Amazon SQS, CloudWatch, CloudFormation, Internet Gateway (IGW), region-1a, Master Server, Auto Scaling Compute Fleet]

Page 11: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Isolated CfnCluster

[Diagram. Labels: Amazon S3, DynamoDB, Amazon SQS, CloudWatch, CloudFormation, Internet Gateway (IGW), Public Subnet, VPC NAT gateway, Bastion Server, Private Subnet, Master Server, Auto Scaling Compute Fleet]

Private Subnet Route Table
VPC Traffic -> Local
0.0.0.0/0 -> NAT Gateway

Public Subnet Route Table
VPC Traffic -> Local
0.0.0.0/0 -> Internet Gateway

Page 12: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Isolated CfnCluster w/ VPN

[Diagram. Labels: Amazon S3, DynamoDB, Amazon SQS, CloudWatch, CloudFormation, Internet Gateway (IGW), Public Subnet, VPC NAT gateway, Private Subnet, Master Server, Auto Scaling Compute Fleet, Corporate Data Center, Engineer, VPN Connection]

Private Subnet Route Table
VPC Traffic -> Local
Corp IP Range -> VPN
0.0.0.0/0 -> NAT Gateway

Public Subnet Route Table
VPC Traffic -> Local
Corp IP Range -> VPN
0.0.0.0/0 -> Internet Gateway
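
To make the route tables concrete, here is a small sketch of how a destination address is resolved against the private subnet's routes. VPC route tables pick the most specific matching route (longest prefix), which is what the function below imitates. The CIDR blocks (10.0.0.0/16 for the VPC, 192.168.0.0/16 for the corporate range) and the target names are hypothetical values chosen for illustration; they do not come from the slides.

```python
import ipaddress

# Private subnet route table from the slide, with hypothetical CIDR blocks.
PRIVATE_SUBNET_ROUTES = [
    ("10.0.0.0/16", "local"),        # VPC traffic (hypothetical VPC CIDR)
    ("192.168.0.0/16", "vpn"),       # corp IP range (hypothetical)
    ("0.0.0.0/0", "nat-gateway"),    # default route
]


def next_hop(route_table, destination: str) -> str:
    """Return the target of the most specific route that matches destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        raise LookupError(f"no route to {destination}")
    # Longest prefix wins, so /16 routes beat the /0 default route.
    return max(matches)[1]


print(next_hop(PRIVATE_SUBNET_ROUTES, "10.0.0.44"))      # compute node in the VPC
print(next_hop(PRIVATE_SUBNET_ROUTES, "192.168.1.10"))   # host in the corp range
print(next_hop(PRIVATE_SUBNET_ROUTES, "54.239.28.85"))   # anything else
```

Swapping the default route's target from the NAT gateway to the VPN gives the behaviour of the fully private layout, where all outbound traffic leaves through the corporate network.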

Page 13: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Private CfnCluster w/ VPN & Proxy

[Diagram. Labels: Private Subnet, Master Server, Auto Scaling Compute Fleet, Amazon S3, DynamoDB, Amazon SQS, CloudWatch, CloudFormation, Corporate Data Center, Proxy Server, VPN Connection, Internet Connection]

Private Subnet Route Table
VPC Traffic -> Local
Corp IP Range -> VPN
0.0.0.0/0 -> VPN

Page 14: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Creating an IAM User

• Create an IAM user with Administrative privileges
  • Fine-grained access controls can be applied later
• Generate an access key and secret key and keep them safe

Page 15: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Create an SSH Key

• Generate or import the key you’ll use for user login

Page 16: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Installing the CfnCluster CLI

• On your desktop or a bastion server

$ sudo pip install cfncluster

Page 17: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Creating the Base Configuration

• First, create the base config required to start a cluster.

$ cfncluster configure

Page 18: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Edit the configuration file to meet your needs

• Reference the configuration docs
  • http://cfncluster.readthedocs.io/en/latest/configuration.html

$ vim ~/.cfncluster/config

Page 19: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Launch the Cluster

$ cfncluster create mycluster

• Cluster creation usually takes ~15 minutes

• Completely managed by CloudFormation

Page 20: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Submit your first job

[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -pe mpi 2
#$ -S /bin/bash
#
module load openmpi-x86_64
mpirun -np 2 hostname

[ec2-user@ip-10-0-0-17 ~]$ qsub hw.qsub
Your job 1 ("hw.qsub") has been submitted

[ec2-user@ip-10-0-0-17 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID
---------------------------------------------------------------------------------------------------
      1 0.55500 hw.qsub    ec2-user     r     02/01/2015 05:57:25 [email protected]   2

[ec2-user@ip-10-0-0-17 ~]$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb  1 05:57 hw.qsub
-rw-r--r-- 1 ec2-user ec2-user  26 Feb  1 05:57 hw.qsub.o1

[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub.o1
ip-10-0-0-44
ip-10-0-0-45
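
The job script above hard-codes two slots in both the `-pe mpi` request and the `mpirun -np` argument, and the two numbers have to agree. A minimal sketch of generating such a script for any slot count follows. The helper name `render_sge_job` is hypothetical, and the `mpi` parallel environment and `openmpi-x86_64` module name are taken from the slide and may differ on other clusters.

```python
def render_sge_job(slots: int, command: str, parallel_env: str = "mpi") -> str:
    """Build an SGE job script that runs command on the requested number of slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    lines = [
        "#!/bin/bash",
        "#$ -cwd",                          # run in the submission directory
        "#$ -j y",                          # merge stderr into stdout
        f"#$ -pe {parallel_env} {slots}",   # request slots in the parallel environment
        "#$ -S /bin/bash",                  # shell used to run the job
        "module load openmpi-x86_64",       # module name as shown on the slide
        f"mpirun -np {slots} {command}",    # process count matches the slot request
    ]
    return "\n".join(lines) + "\n"


script = render_sge_job(2, "hostname")
print(script)
```

The returned text can be written to a file such as hw.qsub and submitted with qsub as shown above.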

Page 21: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

EBS Snapshots for Software & Storage Management

• Install your applications and store any working data to /shared

• Create a snapshot of that volume

• Re-use that snapshot every time you launch your cluster

ebs_snapshot_id = snap-xxxxx

[Diagram. Labels: Master Server, Root & Home Volume (/ & /home), NFS Shared Volume (/shared), Amazon EBS Snapshot (snap-xxxxx)]

Page 22: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Upgrading Hardware is Easy!

• Simple upgrade from Ivy Bridge to Haswell

1. Let all compute nodes stop
2. Edit ~/.cfncluster/config and change
   compute_instance_type = c3.8xlarge
   to
   compute_instance_type = c4.8xlarge
3. Update the cluster

$ cfncluster update mycluster

[Diagram: C3 to C4]
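
Step 2 can also be scripted rather than done by hand in an editor. The sketch below changes compute_instance_type in config text using Python's standard configparser. The "cluster default" section name and the sample config are assumptions made for illustration, and the helper `set_compute_instance_type` is hypothetical. Note that configparser drops comments when it rewrites a file, so a hand edit may be preferable for a heavily commented config.

```python
import configparser
import io


def set_compute_instance_type(config_text: str, new_type: str,
                              section: str = "cluster default"):
    """Return (old_type, updated_config_text) with the instance type replaced."""
    cfg = configparser.ConfigParser()
    cfg.read_string(config_text)
    old_type = cfg.get(section, "compute_instance_type")
    cfg.set(section, "compute_instance_type", new_type)
    buf = io.StringIO()
    cfg.write(buf)
    return old_type, buf.getvalue()


# Sample config text; the section name is an assumption.
SAMPLE = """\
[cluster default]
compute_instance_type = c3.8xlarge
"""

old, updated = set_compute_instance_type(SAMPLE, "c4.8xlarge")
print(f"changed {old} -> c4.8xlarge")
print(updated)
```

After writing the updated text back to ~/.cfncluster/config, step 3 applies the change with cfncluster update.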

Page 23: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Demo: Launching a Cluster

Page 24: Launch a Thousand Core HPC Cluster in Minutes with AWS CfnCluster

Thank you!