
HPC/HTC and Cloud: Making them work together efficiently

Rajul Kumar

Northeastern University


Our group

Rajul Kumar
Northeastern University

Evan Weinberg
Boston University

Chris Hill
Massachusetts Institute of Technology

HPC and Cloud convergence

High Performance Computing (HPC)

• HPC users have effectively unbounded demand for resources

Cloud

• Overprovisioned to meet peak workloads, so resources mostly stay underutilized

Can we make HPC soak up these idle cycles without impacting the cloud workload?

Simple Case: Single node HTC jobs

• High Throughput Computing (HTC) jobs focus on efficient execution of loosely-coupled tasks

• Backfilled HTC jobs get killed to release resources for HPC workload

• Invested compute cycles are lost, and the job must be rerun from scratch

Instead: suspend the virtual machine running the jobs when its resources are needed, and resume it when resources become available again
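The cost difference between killing and suspending a backfilled job can be illustrated with a small back-of-the-envelope model. This is only a sketch: the function name, the preemption trace, and the assumption that suspend/resume itself costs nothing are all illustrative and not taken from the slides.

```python
def compute_hours_used(work_hours, runs_before_preemption, policy):
    """Total compute-hours consumed to finish one single-node job.

    work_hours             -- hours of compute the job needs in total
    runs_before_preemption -- hypothetical trace: hours the job runs
                              before each preemption
    policy                 -- "kill" (progress is lost on preemption) or
                              "suspend" (progress is kept)
    """
    if policy not in ("kill", "suspend"):
        raise ValueError("policy must be 'kill' or 'suspend'")

    done = 0.0  # progress that currently counts toward completion
    used = 0.0  # compute-hours actually consumed so far

    for run in runs_before_preemption:
        step = min(run, work_hours - done)
        used += step
        done += step
        if done >= work_hours:
            return used  # finished before this preemption hit
        if policy == "kill":
            done = 0.0  # killed job restarts from scratch

    # After the last preemption the job runs uninterrupted to the end.
    used += work_hours - done
    return used


if __name__ == "__main__":
    trace = [4, 6]  # preempted after 4 h, then again after 6 more hours
    print(compute_hours_used(10, trace, "kill"))     # 20.0
    print(compute_hours_used(10, trace, "suspend"))  # 10.0
```

In this toy trace the killed job burns twice the compute of the suspended one for the same result. A real deployment would also pay some overhead for each suspend and resume, which the model ignores.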

Implementation

[Diagram, built up over several slides. Components: HPC cluster, OpenStack cloud, resource monitors, control daemon, OpenVPN. Workloads shown: HPC, HTC, Cloud.]

Sequence shown across the slides:

• HPC job arrives

• HTC jobs moved to Cloud

• Cloud utilization increases

• HTC job suspended to release resources for cloud

• Cloud utilization goes low

• HTC jobs resumed on cloud
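The suspend/resume part of this sequence can be sketched as a small decision rule that a control daemon might apply to each utilization reading from the resource monitors. Everything here is an assumption made for illustration: the threshold values, the `HtcVm` class, and the function names do not come from the slides. In a real deployment the state change would be carried out through the cloud's compute service rather than by flipping a field.

```python
from dataclasses import dataclass

# Assumed thresholds (not from the slides). The gap between them gives
# hysteresis, so the VM does not flap when utilization hovers near one value.
SUSPEND_ABOVE = 0.80
RESUME_BELOW = 0.50


@dataclass
class HtcVm:
    """Hypothetical record for one cloud VM running backfilled HTC jobs."""
    name: str
    state: str = "RUNNING"  # "RUNNING" or "SUSPENDED"


def decide(vm, cloud_utilization):
    """Return the action for this reading: 'suspend', 'resume' or 'none'."""
    if vm.state == "RUNNING" and cloud_utilization >= SUSPEND_ABOVE:
        return "suspend"  # cloud needs the resources back
    if vm.state == "SUSPENDED" and cloud_utilization <= RESUME_BELOW:
        return "resume"   # idle cycles are available again
    return "none"


def step(vm, cloud_utilization):
    """Apply the decision to the VM record and return the action taken."""
    action = decide(vm, cloud_utilization)
    if action == "suspend":
        vm.state = "SUSPENDED"
    elif action == "resume":
        vm.state = "RUNNING"
    return action


def run_trace(vm, utilizations):
    """Feed a series of utilization readings through the control step."""
    return [step(vm, u) for u in utilizations]


if __name__ == "__main__":
    vm = HtcVm("htc-vm-0")
    print(run_trace(vm, [0.30, 0.85, 0.70, 0.40]))
    # ['none', 'suspend', 'none', 'resume']
```

The reading of 0.70 triggers nothing because it falls between the two thresholds, which is the intended effect of the hysteresis band.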

Modifications to Slurm

Slurm: a workload manager for HPC clusters

• Manages resources and job scheduling

• Marks an unreachable node DOWN and removes its jobs

• Treats a suspended virtual node the same way, since it also stops responding

Modified Slurm to manage the suspended node and keep the job states intact
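The difference between the two behaviors can be modeled in a few lines. This is a toy model only: it is not Slurm code, and the `Node` class, the `SUSPENDED` state label, and the `known_suspended` set are hypothetical names introduced for illustration. The slides do not describe how the actual modification is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a compute node as seen by the scheduler."""
    name: str
    state: str = "IDLE"
    jobs: list = field(default_factory=list)


def on_unreachable_stock(node):
    """Behavior described above: node goes DOWN and its jobs are removed."""
    removed = list(node.jobs)
    node.jobs.clear()
    node.state = "DOWN"
    return removed


def on_unreachable_modified(node, known_suspended):
    """Assumed modified behavior: a node known to be suspended keeps its jobs.

    known_suspended -- hypothetical set of node names that were suspended
                       deliberately (for example, by a control daemon)
    """
    if node.name in known_suspended:
        node.state = "SUSPENDED"  # job list is left untouched
        return []
    # A node that is unreachable for any other reason is handled as before.
    return on_unreachable_stock(node)


if __name__ == "__main__":
    vnode = Node("vnode1", "ALLOCATED", [101, 102])
    print(on_unreachable_modified(vnode, {"vnode1"}), vnode.state, vnode.jobs)

    failed = Node("vnode2", "ALLOCATED", [103])
    print(on_unreachable_modified(failed, {"vnode1"}), failed.state, failed.jobs)
```

Keeping the fallback to the stock path means that a genuinely failed node is still marked DOWN, so only deliberate suspensions get special treatment.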

Future prospects

• Harden the system and utilize full data center performance (hardware, network, etc.)

• Run multi-node jobs in a virtual environment

• Move jobs between virtual machines and bare-metal nodes

• Experiment with container frameworks

Conclusion

• Dynamic HPC/HTC cluster with minimal overhead and impact

• Better productive utilization of the HPC/HTC cluster

• Better resource utilization of the cloud

http://info.massopencloud.org
