HPC/HTC and Cloud: Making them work together efficiently

Rajul Kumar

Northeastern University

kumar.raju@husky.neu.edu

Our group

Rajul Kumar

Northeastern University

kumar.raju@husky.neu.edu

Evan Weinberg

Boston University

weinbe2@bu.edu

Chris Hill

Massachusetts Institute of Technology

cnh@mit.edu

HPC and Cloud convergence

High Performance Computing (HPC)

• HPC users have infinite demand for resources

• Cloud resources are overprovisioned to meet peak workloads and mostly stay underutilized

Can we make HPC soak up these idle cycles without impacting the cloud workload?

Simple Case: Single node HTC jobs

• High Throughput Computing (HTC) jobs focus on efficient execution of loosely-coupled tasks

• Backfilled HTC jobs get killed to release resources for HPC workload

• Invested compute cycles are lost and the job requires complete rework

Suspend and resume the Virtual Machine running the jobs as and when the resources are available
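The difference between killing and suspending a backfilled job can be illustrated with a small accounting sketch. This is a toy model, not the authors' measurement: the function name `total_compute_minutes`, the `run_slices` parameter, and the numbers in the usage note are all hypothetical and chosen only for illustration.

```python
def total_compute_minutes(job_minutes, run_slices, policy):
    """Total compute time spent before a single job finishes.

    job_minutes: compute time the job needs when run uninterrupted.
    run_slices:  how long the job runs before each preemption (hypothetical
                 input); after the last preemption it runs to completion.
    policy:      "kill" discards all progress on preemption,
                 "suspend" keeps the progress made so far.
    """
    if policy not in ("kill", "suspend"):
        raise ValueError("policy must be 'kill' or 'suspend'")

    progress = 0  # useful work that still counts toward completion
    spent = 0     # compute minutes consumed so far, useful or not

    for slice_minutes in run_slices:
        remaining = job_minutes - progress
        if slice_minutes >= remaining:
            # The job finishes before this preemption would happen.
            return spent + remaining
        spent += slice_minutes
        if policy == "suspend":
            progress += slice_minutes  # state is preserved
        else:
            progress = 0               # killed job starts over

    # No more preemptions: run whatever is left.
    return spent + (job_minutes - progress)
```

For a 100-minute job preempted after 30 and then 50 minutes of running, this model spends 180 minutes under "kill" and 100 minutes under "suspend"; the 80-minute difference is the rework the slide refers to.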

Implementation

Components shown in the architecture diagram: HPC cluster, OpenStack cloud, resource monitors, control daemon, OpenVPN

Sequence shown across the slides:

• HPC job arrives

• HTC jobs moved to Cloud

• Cloud utilization increases

• HTC job suspended to release resources for cloud

• Cloud utilization goes low

• HTC jobs resumed on cloud
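The suspend and resume decisions in the Implementation sequence can be sketched as a single reconciliation step of a control daemon. The slides do not describe the daemon's actual policy, so everything here is an assumption: the `HtcVm` class, the `reconcile` function, and the `high`/`low` utilization thresholds are hypothetical. In a real deployment each returned action would presumably map to an OpenStack server suspend or resume request.

```python
from dataclasses import dataclass


@dataclass
class HtcVm:
    """A cloud VM hosting HTC jobs (hypothetical model)."""
    name: str
    state: str = "ACTIVE"  # "ACTIVE" or "SUSPENDED"


def reconcile(vms, cloud_utilization, high=0.8, low=0.5):
    """Decide which HTC VMs to suspend or resume for one monitor reading.

    cloud_utilization: fraction of cloud capacity in use, 0.0 to 1.0, as a
                       resource monitor might report it.
    high, low:         assumed thresholds; the gap between them avoids
                       flapping when utilization hovers near one value.
    Returns a list of (action, vm_name) tuples and updates vm.state.
    """
    actions = []
    if cloud_utilization >= high:
        # Cloud workload needs the resources: suspend running HTC VMs.
        for vm in vms:
            if vm.state == "ACTIVE":
                vm.state = "SUSPENDED"
                actions.append(("suspend", vm.name))
    elif cloud_utilization <= low:
        # Cloud is idle again: resume suspended HTC VMs.
        for vm in vms:
            if vm.state == "SUSPENDED":
                vm.state = "ACTIVE"
                actions.append(("resume", vm.name))
    return actions
```

Calling `reconcile` with a reading of 0.9 suspends every active VM, a reading between the two thresholds changes nothing, and a reading of 0.3 resumes them.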

Modifications to Slurm

Slurm – A workload manager for HPC clusters

• Manages resources and job scheduling

• Marks an unreachable node DOWN and removes its jobs

• Does the same for a suspended virtual node

Modified Slurm to manage the suspended node and keep the job states intact
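The behavioral change described on this slide can be illustrated with a toy model. It is not Slurm code and does not reflect the authors' patch: the function `handle_unreachable`, its parameters, and the "SUSPENDED" node label are hypothetical names used only to contrast the two behaviors.

```python
def handle_unreachable(jobs_on_node, vm_suspended, keep_suspended_jobs):
    """Toy model of what happens when a node stops responding.

    jobs_on_node:        job IDs currently assigned to the node.
    vm_suspended:        True if the node is a VM that was deliberately
                         suspended (assumed to be known to the scheduler).
    keep_suspended_jobs: False models the stock behavior from the slide,
                         True models the modified behavior.
    Returns (node_state, jobs_still_assigned).
    """
    if vm_suspended and keep_suspended_jobs:
        # Modified behavior: the node is only paused, so its jobs stay
        # assigned and can continue when the VM resumes.
        return "SUSPENDED", list(jobs_on_node)
    # Stock behavior: an unreachable node is marked DOWN and loses its jobs.
    return "DOWN", []
```

With `keep_suspended_jobs=False` a suspended VM is treated like any unreachable node and its jobs are dropped; with `keep_suspended_jobs=True` the same node keeps its job list intact.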

Future prospects

• Harden and utilize full data center performance (hardware, network, etc.)

• Run multi-node jobs in a virtual environment

• Move jobs between virtual machine and bare-metal nodes

• Experiment with container frameworks

Conclusion

• Dynamic HPC/HTC cluster with minimal overhead and impact

• Better productive utilization of the HPC/HTC cluster

• Better resource utilization of the cloud

http://info.massopencloud.org
