
Batch Scheduler and File Management

Olivier Richard - Associate Professor, Mescal Team - LIG/INRIA

23 June 2010


Outline

1 Generalities on Batch Scheduler

2 Challenges, Recent Features and Trends

3 File System and File Management (Data Staging, Applications)

4 Toward Data-Aware Batch Scheduler

5 Benchmarks, Workloads, Experimentations

6 Conclusions


Background on our ongoing work

: A Versatile Resource Management System (Batch Scheduler).

: A Management System for Lightweight Grid and Bag-of-Tasks Execution.

: A Scalable Low-Level Deployment/Provisioning Tool.

: An Adaptive Parallel Launcher

: A Platform Dedicated to Experimentation (HPC, Dist. Sys., Net.)


Generalities on Batch Scheduler

Batch Scheduler : a Global Picture 1

[Figure: batch scheduler overview. Labels: Users, Submissions, Queues, Batch Scheduler, Nodes, Cluster.]

Goals: allocate resources to each application according to its requirements and the users' rights. Satisfy users (response time, reliability) and administrators (high resource utilization, efficiency, energy management...).

Examples: LoadLeveler (IBM), PBS, LSF, Slurm (LLNL), Torque, SGE, Condor, OAR (1)

(1) For a more exhaustive list, see: http://en.wikipedia.org/wiki/Job_scheduler


Batch Scheduler : a General Picture 2

[Figure: layered view. Labels: Users, Submissions, Queues, Job Management System (Scheduling), Resource Management System, Nodes, Cluster.]

Resource Management Layer : launching, cleaning, monitoring...

Job Management Layer : batch/interactive jobs, Backfilling (EASY or Conservative) Scheduling, Suspend/Resume, Preemption, Dependencies, Resubmission, Advance Reservation...
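As an illustration of the EASY backfilling policy named above, here is a minimal sketch in Python. It is not taken from any of the schedulers listed in these slides: the `Job` class, the `easy_backfill` function and its parameters are hypothetical, jobs are assumed to request only cores and to provide a walltime estimate, and every job is assumed to fit on the machine.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int      # cores requested
    walltime: int   # user-supplied runtime estimate


def easy_backfill(queue, running, total_cores, now=0):
    """Return the names of the queued jobs that may start at time `now`.

    queue:   waiting jobs in priority (FCFS) order
    running: list of (end_time, cores) for jobs already executing
    """
    free = total_cores - sum(cores for _, cores in running)
    running = list(running)
    waiting = list(queue)
    started = []

    # Plain FCFS: start jobs from the head of the queue while they fit.
    while waiting and waiting[0].cores <= free:
        job = waiting.pop(0)
        free -= job.cores
        running.append((now + job.walltime, job.cores))
        started.append(job.name)
    if not waiting:
        return started

    # Reservation for the blocked head job: earliest time (shadow time) at
    # which enough cores are released, and the cores left over at that time.
    head = waiting[0]
    shadow_time, extra = float("inf"), 0
    available = free
    for end_time, cores in sorted(running):
        available += cores
        if available >= head.cores:
            shadow_time, extra = end_time, available - head.cores
            break

    # Backfill: a later job may start now only if it cannot delay the head job.
    for job in waiting[1:]:
        if job.cores > free:
            continue
        if now + job.walltime <= shadow_time:
            free -= job.cores       # ends before the reservation starts
        elif job.cores <= extra:
            free -= job.cores       # uses cores the head job does not need
            extra -= job.cores
        else:
            continue
        started.append(job.name)
    return started
```

Conservative backfilling differs in that every waiting job, not only the head of the queue, gets a reservation that later jobs must not delay.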



Batch Scheduler : a General Picture 3


[Figure: layered view. Labels: Users, Submissions, Queues, Workload Management System, Job Management System, Scheduling, Nodes, Cluster.]

Workload/Job Management: more complete job scheduling policies

Fair sharing, Quality of Service (QoS), SLA (Service Level Agreement), Energy Saving, Time-Varying Policies (day/night, weekend, holidays...)

Dedicated software: Maui and Catalina.

Some systems, for instance Slurm and OAR, do not truly separate these layers.



Architecture and main components

[Figure: architecture. Labels: Users, Client, Server, Computing nodes, Submission, Scheduler, Matching of resources, Launching and control of execution, Log, Accounting, Monitoring.]

There are few components, but the number of job and resource states, the scheduling policies and the huge number of configurable parameters lead to great system complexity.

Tuning and optimizing a Batch Scheduler is not easy.


Challenges, Recent Features and Trends


Scalability (remains the number one issue)

Topology constraint (hierarchy, NUMA, I/O Bandwidth)

Energy Saving (node power on/off, DVFS, not so simple)

Dynamic jobs, massive submissions

Infrastructure diversity (virtual compute nodes, multi-cluster, GPGPU...)

Mastering the increase of (global) complexity

How to track the overall efficiency of the global computing infrastructure (and how to optimize it)?



Topology-aware Scheduling

[Figure: nodes (2*4 cores) connected through two switches, with link bandwidths X1 GB/s and X2 GB/s (X2 >> X1). One application placement is labeled "Better Perfs.", the other "Bottleneck".]

See [JP09] for problem definition

Two ways: 1) strictly respect the topology constraint (OAR). 2) If the constraint cannot be respected, the batch scheduler allocates the next contiguous set of resources (clever resource numbering in a 3D torus) (Slurm).

Trade-off between response time and execution time.
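The two policies above can be sketched as follows. This is a simplified model, not the actual OAR or Slurm implementation: the function names are hypothetical, nodes are plain integers, and the numbering is assumed to reflect physical proximity (a stand-in for the 3D torus numbering).

```python
def allocate_strict(free_by_switch, n):
    """Return n free nodes located under a single switch, or None.

    free_by_switch: switch name -> list of free node ids
    None means the job has to wait (strict topology constraint).
    """
    for switch in sorted(free_by_switch):
        nodes = sorted(free_by_switch[switch])
        if len(nodes) >= n:
            return nodes[:n]
    return None


def allocate_relaxed(free_by_switch, n):
    """Try the strict policy first, then fall back to the most compact
    window of n free node ids, which may span several switches."""
    strict = allocate_strict(free_by_switch, n)
    if strict is not None:
        return strict
    free = sorted(node for nodes in free_by_switch.values() for node in nodes)
    if len(free) < n:
        return None
    # Pick the window of n consecutive free ids with the smallest id span.
    start = min(range(len(free) - n + 1),
                key=lambda i: free[i + n - 1] - free[i])
    return free[start:start + n]
```

The strict variant may make a job wait longer (worse response time) but keeps it under one switch; the relaxed variant starts it sooner at the risk of crossing a slower link (worse execution time).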



Scalability

Which granularity for resource representation and manipulation?

core, thread (too fine)? (generally a flat data structure in the batch scheduler)

nodes (most used) (Slurm can manage up to 64K nodes, but how many cores?)

Add some policies for fine tuning (cpuset, cgroup, CPU affinity, Bulk I/O, (next step: bandwidth)...)

partitions (sets of nodes) (sometimes used in large clusters)

Other resource issues

Memory, network cards, L3 cache partitioning (Power 7), DVFS control...


File System and File Management


Some facts

Scaling up by adding I/O capacities is expensive.

The file system remains very sensitive to bottleneck effects.

At very large scale, checkpointing increases pressure on the file system.

SSDs will offer new options, but will they imply more complexity in system management?

File System and Batch Scheduler

Few studies address Batch Scheduler and File System interactions.

Up to now, administrators have used the Batch Scheduler's configurability to insert workarounds and to develop ad hoc (not portable) strategies.



Hierarchical Storage Management

Hierarchical storage management (HSM) provides mechanisms to automatically move, migrate and recall files between tape and disk.

[Figure: HSM architecture. Labels: Cluster Nodes, GPFS servers, HPSS Movers, HPSS server, GPFS/HPSS scheduler; storage tiers Disk/SSD?, Disk, Tape/Disk.]

HSM is typically used for archival purposes. Example of such a system: High Performance Storage System (HPSS) (2)

HPSS provides both striping (for large files) and file aggregation (for small ones).

Coupling HPSS / GPFS:

Automatic file migration via a rule-based system which identifies the files to migrate (ILM policy/engine).

(2) http://www.hpss-collaboration.org/


File System and File Management in HPC Cluster

2 main ways

Parallel File System

A shared parallel file system

Lustre, GPFS, PVFS ... NFSv4.1 (pNFS) protocol

Used on most systems

Implicit/No Control

Data Staging

The job's data requirements are identified and provided by the user in the submitted script.

Stage-in: input files are transferred to the local disks of the compute nodes before the job starts.

Stage-out: output files are transferred from the nodes to mass storage after execution.

Nowadays rarely used on clusters, mainly used in a Grid context

Explicit Control
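The stage-in / run / stage-out cycle described above can be sketched on a single machine as follows. This is only an illustration under simplifying assumptions: `run_with_staging` is a hypothetical helper, a temporary directory stands in for the local disk of a compute node, and an ordinary directory stands in for mass storage.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_staging(input_files, mass_storage, compute):
    """Stage in, run `compute(workdir)`, stage out.

    input_files:  paths listed by the user in the submitted script
    mass_storage: directory standing in for the mass storage system
    compute:      callable doing the job's work inside the scratch directory
    Returns the paths of the files copied back to mass storage.
    """
    mass_storage = Path(mass_storage)
    with tempfile.TemporaryDirectory(prefix="job_scratch_") as scratch:
        workdir = Path(scratch)

        # Stage-in: copy the declared inputs to the node-local scratch space.
        for src in input_files:
            shutil.copy2(src, workdir / Path(src).name)
        staged_in = {path.name for path in workdir.iterdir()}

        # Execution works only on local copies.
        compute(workdir)

        # Stage-out: copy back every file the job created.
        results = []
        for path in sorted(workdir.iterdir()):
            if path.is_file() and path.name not in staged_in:
                target = mass_storage / path.name
                shutil.copy2(path, target)
                results.append(target)
        return results
```

The control is explicit: the scheduler knows which files move and when, which a shared parallel file system does not tell it.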


File System and File Management Data Staging

Data Staging 1

Example 1: data staging approach proposed in LoadLeveler (a)

(a) Excerpt from the official documentation

1. A single replica of the data files needed by a job has to be created on a common file system.

2. A replica of the data files has to be created on every machine on which the job will run.

The time at which data staging occurs can be specified in the submitted script (via the AT SUBMIT and AT JUST IN TIME keywords)



Data Staging 2

Example 2 : on a large system managed with Slurm

Slurm does not directly support data staging.

Approach: introduce a helper job (HJ) to trigger data stage-out.

The HJ depends on the job which generates the output data.

The HJ keeps only one core per node to carry out the data transfer to the storage system. The other cores are released to the system (job dynamics or co-allocation).
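The core bookkeeping of the helper job can be modelled in a few lines. This is a toy model, not Slurm code: `start_helper_job` and the allocation dictionary are hypothetical, and the real mechanism (job dependency plus job dynamics or co-allocation) is site-specific.

```python
def start_helper_job(allocation):
    """Split a finished job's allocation between the helper job and the system.

    allocation: node name -> number of cores held by the job that produced
                the output data
    Returns (kept, released): the helper job keeps one core on each node for
    the transfer, all other cores go back to the scheduler.
    """
    kept = {node: 1 for node, cores in allocation.items() if cores >= 1}
    released = {node: cores - 1
                for node, cores in allocation.items() if cores > 1}
    return kept, released
```

With two 8-core nodes, the helper job holds 2 cores for the stage-out while 14 cores become available to other jobs.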


File System and File Management Applications

Applications

Libraries

Parallel formats: HDF5, NetCDF

MPI I/O (ROMIO, etc.)

Specialized environments with their own file storage/management runtime

à la MapReduce (Hadoop) (data mining, data analysis)

BlobSeer / Visualization (see Bougé, Antoniu)

Falkon/Swift (Many Loosely Coupled Tasks)

Applications with specific checkpointing runtimes and strategies

Workflow: to coordinate computing, visualization and post-processing analysis tasks (dedicated scheduling).

Can we obtain accurate I/O profiles and I/O requirements from applications?

What level of variability (impact of failure recovery)?

This kind of information is needed to allocate resources and I/O bandwidth between concurrent applications.
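To make the last point concrete, here is one possible way to split a file system's bandwidth once per-application I/O requirements are known. The slides do not prescribe a policy; max-min fairness is used here only as a familiar example, and `share_bandwidth` and its arguments are hypothetical.

```python
def share_bandwidth(demands, capacity):
    """Max-min fair split of `capacity` among concurrent applications.

    demands:  application name -> requested I/O bandwidth (e.g. GB/s)
    capacity: total bandwidth the file system can deliver
    Applications asking for less than the fair share get their full demand;
    the remainder is split equally among the others.
    """
    allocation = {}
    remaining = float(capacity)
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair_share = remaining / len(pending)
        app, demand = pending[0]
        if demand <= fair_share:
            # Small demand: satisfy it fully and redistribute what is left.
            allocation[app] = float(demand)
            remaining -= demand
            pending.pop(0)
        else:
            # Every remaining demand exceeds the fair share: cap them all.
            for name, _ in pending:
                allocation[name] = fair_share
            break
    return allocation
```

Without accurate I/O profiles the `demands` input is guesswork, which is why the question of obtaining them comes first.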


Toward Data-Aware Batch Scheduler


[Figure: the HSM architecture and the batch scheduler stack side by side. Labels: Cluster Nodes, GPFS servers, HPSS Movers, HPSS server, GPFS/HPSS scheduler, Disk/SSD?, Disk, Tape/Disk; Users, Submissions, Queues, Job Management System, Workload Management System, Scheduling; Data-Aware Scheduling.]

Goal: a Data-Aware Scheduler which can better interact with the Global File System and the Application/Runtime

Needs: accurate I/O profiles, I/O requirements

Identify possibilities of control: runtime, automatic data staging (BW)

I/O bandwidth, I/O nodes co-allocation (topology aware and locality preservation),

I/O Overlapping, Gang-Scheduling, Time-Sharing, specific scheduling algorithms...

Lots of factor, parameter and feature interactions...
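A very small sketch of what "data-aware" could mean in practice: treat I/O bandwidth as a second resource next to cores when deciding which jobs to start. This is an assumption-laden toy, not a proposal from the slides; the `Job` fields, `select_jobs` and the greedy queue-order policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    io_bw: float   # expected I/O bandwidth (GB/s), taken from its I/O profile


def select_jobs(queue, free_cores, free_io_bw):
    """Greedily start queued jobs that fit both in cores and in I/O bandwidth.

    A job is skipped (not started) when either resource is insufficient,
    even if enough cores are idle.
    """
    started = []
    for job in queue:
        if job.cores <= free_cores and job.io_bw <= free_io_bw:
            free_cores -= job.cores
            free_io_bw -= job.io_bw
            started.append(job.name)
    return started
```

A scheduler that only counts cores would start every job that fits and let them compete for the file system; here a job with a heavy I/O profile waits until bandwidth is free.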


Example of interaction: application/checkpoint/batch scheduler

[Figure: four placements, 1) to 4), of APPLICATION A and APPLICATION B and of their checkpoints (CHKPT A, CHKPT B) on the nodes. Legend: Unused Cores, Nodes Powered OFF, CHKPT IN RAM.]

Diskless checkpoint with RS encode/decode fragments (as proposed by Gomez et al.)

1) better resilience against failures (good chkpt distribution) but waste of resources

2) lower resilience but better use of resources

3) lower resilience

4) better resilience (good chkpt distribution)


Benchmarks, Workloads, Experimentations

Benchmarks, Workloads and Experimentations

What do we have at our disposal?

Parallel I/O Benchmarks

A list of benchmarks, applications and traces maintained by Rajeev Thakur of ANL (a)

FLASH I/O, IOR, mpiBLAST, NAS BT I/O, Qbox...

(a) http://www.mcs.anl.gov/~thakur/pio-benchmarks.html

Batch Schedulers

Only one benchmark : ESP, a system utilization benchmark [WOK00]

Parallel Workloads Archive: Standard Workload Format (a)

(a) http://www.cs.huji.ac.il/labs/parallel/workload/swf.html
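For readers who have not used the Standard Workload Format, the sketch below reads SWF job records. It relies on the published format (comment lines start with `;`, each job is one line of 18 whitespace-separated numeric fields, `-1` means "unknown"); the Python field names and the `parse_swf` helper are my own naming, not part of the format.

```python
from collections import namedtuple

# The 18 SWF fields, in file order (names chosen here for readability).
SWF_FIELDS = [
    "job_id", "submit_time", "wait_time", "run_time", "allocated_procs",
    "avg_cpu_time", "used_memory", "requested_procs", "requested_time",
    "requested_memory", "status", "user_id", "group_id", "executable",
    "queue", "partition", "preceding_job", "think_time",
]
SwfJob = namedtuple("SwfJob", SWF_FIELDS)


def parse_swf(lines):
    """Parse SWF lines into SwfJob records, skipping header comments."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue   # blank line or header comment
        values = [float(value) for value in line.split()]
        if len(values) != len(SWF_FIELDS):
            raise ValueError(f"expected {len(SWF_FIELDS)} fields: {line!r}")
        jobs.append(SwfJob(*values))
    return jobs


def mean_wait_time(jobs):
    """Average wait time over the jobs whose wait time is known (not -1)."""
    waits = [job.wait_time for job in jobs if job.wait_time >= 0]
    return sum(waits) / len(waits) if waits else None
```

Such a trace describes job arrivals and sizes only; it carries no I/O or failure information, which is the gap discussed on the next slide.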



Testbed

Testbeds such as the Grid'5000 dedicated platform

But we don't have:

An integrated workload/trace: Batch Scheduler + Parallel I/O System + failures

An HSM like HPSS (+GPFS) to experiment with (can we derive a model and build a simulator and an emulator of it?)

Obtaining or building a realistic testbed platform is a challenge


Conclusions

Conclusion

The increasing complexity of computing infrastructures is a big challenge for Batch Schedulers (and for Data-aware Batch Schedulers).

The file system can be a bottleneck at large scale in some situations.

Interactions between the Batch Scheduler and the File Management System have not been extensively studied in recent (and old?) literature.

There are surely opportunities to enhance overall infrastructure efficiency through interactions/coordination between BS/FS/Applications.

Experimentation raises several issues.

Workloads, benchmarks, failure injection, HSM emulation.

Testbed elaboration.



Thank you. Questions?



References I

[JP09] J. Pascual, J. Navaridas, and J. Miguel-Alonso.

Effects of topology-aware allocation policies on scheduling performance.

In Proceedings of the 14th Workshop on Job Scheduling Strategies for Parallel Processing, 2009.

[MTNM08] Yuya Machida, Shin'ichiro Takizawa, Hidemoto Nakada, and Satoshi Matsuoka.

Intelligent data staging with overlapped execution of grid applications.

Future Generation Comp. Syst., 24(5):425–433, 2008.

[PSCS07] Nilton Cezar Paula, Gisele Silva Craveiro, and Liria Matsumoto Sato.

Data transfer in advance on cluster.

In PaCT 2007: Proceedings of the 9th International Conference on Parallel Computing Technologies, pages 599–607, Berlin, Heidelberg, 2007. Springer-Verlag.

[Wikipedia] Wikipedia.

Job scheduler.

http://en.wikipedia.org/wiki/Job_scheduler, June 2010.

[WOK00] Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey.

ESP: A system utilization benchmark.

In Proceedings of the Supercomputing 2000 Conference, 2000.

