HPC/HTC and Cloud: Making them work together efficiently
Rajul Kumar
Northeastern University
Our group
Rajul Kumar
Northeastern University
Evan Weinberg
Boston University
Chris Hill
Massachusetts Institute of Technology
HPC and Cloud convergence
High Performance Computing (HPC)
• HPC users have infinite demand for resources
Cloud
• Overprovisioned to meet peak workloads, so it stays underutilized most of the time
Can we make HPC soak up these idle cycles without impacting the cloud workload?
Simple Case: Single node HTC jobs
• High Throughput Computing (HTC) jobs focus on efficient execution of loosely coupled tasks
• Backfilled HTC jobs get killed to release resources for HPC workload
• The compute cycles already invested are lost, and the job has to be rerun from the start
Instead: suspend the virtual machine running the jobs when its resources are needed, and resume it when resources are available again
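To make the cost of killing versus suspending concrete, here is a minimal sketch for a single HTC job. The function name, the abstract unit of "cycles", and the sequence of idle windows are illustrative assumptions, not measurements or code from the project.

```python
def cycles_spent(work_needed, run_windows, policy):
    """Return the total compute cycles consumed before the job finishes.

    work_needed: cycles of useful work the job requires.
    run_windows: lengths of successive windows in which resources are free;
                 each window ends with a preemption.
    policy:      "kill" discards progress at each preemption,
                 "suspend" keeps it.
    Returns None if the job cannot finish within the given windows.
    """
    if policy not in ("kill", "suspend"):
        raise ValueError(f"unknown policy: {policy}")
    progress = 0
    spent = 0
    for window in run_windows:
        remaining = work_needed - progress
        if window >= remaining:
            return spent + remaining
        spent += window
        progress = progress + window if policy == "suspend" else 0
    return None


windows = [40, 30, 50, 100]  # hypothetical idle windows, in cycles
print(cycles_spent(100, windows, "kill"))     # 220
print(cycles_spent(100, windows, "suspend"))  # 100
```

Under these made-up numbers the killed job burns 120 cycles on work that is thrown away, while the suspended job spends only the 100 cycles it actually needs.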
Implementation
[Diagram, built up over several slides. Labeled components: HPC cluster, OpenStack cloud, resource monitors, OpenVPN, control daemon. Workload labels: HPC, HTC, Cloud.]
Sequence shown across the diagram slides:
• HPC job arrives
• HTC jobs moved to cloud
• Cloud utilization increases
• HTC job suspended to release resources for cloud
• Cloud utilization goes low
• HTC jobs resumed on cloud
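The suspend and resume steps in this sequence can be pictured as a small decision loop in the control daemon. The sketch below is an assumption about how such a loop might look, not the project's code: the thresholds, the `HtcVm` class, and the `control_step` function are hypothetical, and the state changes stand in for whatever calls the daemon makes to the cloud.

```python
from dataclasses import dataclass

# Hypothetical thresholds on cloud utilization (fraction of capacity in use).
SUSPEND_ABOVE = 0.80
RESUME_BELOW = 0.50


@dataclass
class HtcVm:
    name: str
    state: str = "ACTIVE"  # "ACTIVE" or "SUSPENDED"


def control_step(cloud_utilization, vms):
    """Apply one round of decisions and return a list of (vm name, action)."""
    actions = []
    for vm in vms:
        if cloud_utilization > SUSPEND_ABOVE and vm.state == "ACTIVE":
            vm.state = "SUSPENDED"  # release resources for the cloud workload
            actions.append((vm.name, "suspend"))
        elif cloud_utilization < RESUME_BELOW and vm.state == "SUSPENDED":
            vm.state = "ACTIVE"     # idle capacity is back, continue the HTC job
            actions.append((vm.name, "resume"))
    return actions


vms = [HtcVm("htc-vm-1"), HtcVm("htc-vm-2")]
for utilization in [0.30, 0.90, 0.65, 0.40]:
    print(utilization, control_step(utilization, vms))
```

Using two separate thresholds is a design choice made for this sketch: between them nothing changes, so a utilization that hovers around a single cut-off does not cause VMs to be suspended and resumed repeatedly.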
Modifications to Slurm
Slurm – A workload manager for HPC clusters
• Manages resources and job scheduling
• Marks an unreachable node DOWN and removes its jobs
• Does the same for a suspended virtual node
We modified Slurm to manage suspended nodes and keep their job states intact
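The difference between the stock and the modified behavior can be pictured with a toy model. This is not Slurm code and does not use Slurm's data structures; the `Node` class, the `handle_unreachable` function, its flags, and the "SUSPENDED" state name are all hypothetical and only illustrate the idea of keeping job state for a node that the cloud has suspended.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: str = "ALLOCATED"
    jobs: list = field(default_factory=list)


def handle_unreachable(node, suspended_by_cloud, keep_suspended_jobs):
    """Update a node that has stopped responding.

    keep_suspended_jobs=False models the stock behavior described above:
    the node is marked DOWN and its jobs are removed.
    keep_suspended_jobs=True models the modification: a node known to be
    suspended keeps its jobs so they can continue after resume.
    """
    if keep_suspended_jobs and suspended_by_cloud:
        node.state = "SUSPENDED"  # hypothetical state name
        return node
    node.state = "DOWN"
    node.jobs = []
    return node


stock = handle_unreachable(Node("vnode1", jobs=[101, 102]), True, False)
modified = handle_unreachable(Node("vnode1", jobs=[101, 102]), True, True)
print(stock.state, stock.jobs)
print(modified.state, modified.jobs)
```

In the toy model a node that is unreachable for any other reason is still marked DOWN, so the special case applies only to nodes known to be suspended.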
Future prospects
• Harden the system and make use of the full data center performance (hardware, network, etc.)
• Run multi-node jobs in a virtual environment
• Move jobs between virtual machines and bare-metal nodes
• Experiment with container frameworks
Conclusion
• Dynamic HPC/HTC cluster with minimal overhead and impact
• More productive utilization of the HPC/HTC cluster
• Better resource utilization of the cloud
http://info.massopencloud.org