Batch Scheduler and File Management
Olivier Richard - Associate Professor
Mescal Team - LIG/INRIA
23 June 2010
Olivier Richard () Batch Scheduler and File Management 23 juin 2010 1 / 23
Outline
1 Generalities on Batch Scheduler
2 Challenges, Recent Features and Trends
3 File System and File Management: Data Staging, Applications
4 Toward Data-Aware Batch Scheduler
5 Benchmarks, Workloads, Experimentations
6 Conclusions
Background on our ongoing work
: A Versatile Resource Management System (Batch Scheduler).
: A Management System for Lightweight Grid and Bag-of-Tasks Execution.
: A Scalable Low-Level Deployment/Provisioning Tool.
: An Adaptive Parallel Launcher
: A Platform Dedicated to Experimentation (HPC, Dist. Sys., Net.)
Generalities on Batch Scheduler
Batch Scheduler : a Global Picture 1
[Figure: Batch Scheduler; labels: Users, Submissions, Queues, Nodes, Cluster]
Goals: allocate resources to each application with respect to its requirements and the users’ rights. Satisfy users (response time, reliability) and administrators (high resource utilization, efficiency, energy management...).
LoadLeveler (IBM), PBS, LSF, Slurm (LLNL), Torque, SGE, Condor, OAR 1
1. For a more exhaustive list, see: http://en.wikipedia.org/wiki/Job_scheduler
Generalities on Batch Scheduler
Batch Scheduler : a General Picture 2
[Figure: Users, Submissions, Queues, Job Management System (Scheduling), Resource Management System, Nodes, Cluster]
Resource Management Layer: launching, cleaning, monitoring...
Job Management Layer: batch/interactive jobs, backfilling (EASY or conservative) scheduling, suspend/resume, preemption, dependencies, resubmission, advance reservation...
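As an illustration of the backfilling mentioned above, here is a minimal sketch of one EASY scheduling pass. It is not taken from any particular batch scheduler: the `Job` class, the `easy_backfill` function and the tuple layout of `running` are hypothetical names chosen for this example, and jobs are assumed to request whole nodes and to provide a walltime estimate.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # number of nodes requested
    walltime: int   # requested run time (user estimate)


def easy_backfill(now, free_nodes, running, queue):
    """Return the names of the queued jobs that may start at time `now`.

    running: list of (end_time, nodes) for jobs already executing.
    queue:   list of Job in priority order (plain FCFS in this sketch).
    """
    started = []
    queue = list(queue)
    # Start jobs in order as long as the head of the queue fits.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running = running + [(now + job.walltime, job.nodes)]
        started.append(job.name)
    if not queue:
        return started
    # Reservation for the blocked head job: find the "shadow time" at which
    # enough nodes are released, and the nodes left over at that time.
    head = queue[0]
    avail = free_nodes
    shadow_time, extra = None, 0
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time, extra = end_time, avail - head.nodes
            break
    if shadow_time is None:
        return started  # the head job can never fit; nothing is backfilled
    # Backfill: a later job may start now only if it does not delay the head.
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        ends_before_shadow = now + job.walltime <= shadow_time
        if ends_before_shadow or job.nodes <= extra:
            free_nodes -= job.nodes
            if not ends_before_shadow:
                extra -= job.nodes  # it still holds these nodes at shadow time
            started.append(job.name)
    return started
```

With 4 free nodes, one running job holding 6 nodes until t=100 and a blocked head job asking for 8 nodes, a 4-node job with a walltime of 50 is backfilled because it ends before the head job's reservation. Conservative backfilling differs in that every queued job, not only the head, receives a reservation.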
Generalities on Batch Scheduler
Batch Scheduler : a General Picture 3
[Figure: Users, Submissions, Queues, Workload Management System, Job Management System, Scheduling, Resource Management System, Nodes, Cluster]
Workload/Job Management: more complete job scheduling policies
Fair sharing, Quality of Service (QoS), SLA (Service Level Agreement), energy saving, time-varying policies (day/night, weekends, holidays...)
Dedicated software: MAUI and Catalina.
In some systems, for instance Slurm and OAR, there is no true separation between these layers.
Generalities on Batch Scheduler
Architecture and main components
[Figure: Users; Client (Submission); Server (Scheduler, Matching of resource, Launching and control of execution, Log/Accounting, Monitoring); Computing nodes]
There are few components, but the number of job and resource states, the scheduling policies and a huge number of configurable parameters lead to great system complexity.
Tuning and optimizing a batch scheduler is not easy.
Challenges, Recent Features and Trends
Challenges, Recent Features and Trends
Scalability (remains the number one issue)
Topology constraints (hierarchy, NUMA, I/O bandwidth)
Energy saving (node power on/off, DVFS; not so simple)
Dynamic jobs, massive submission
Infrastructure diversity (virtual compute nodes, multi-cluster, GPGPU...)
Mastering the increase of (global) complexity
How to track the overall efficiency of the whole computing infrastructure (and how to optimize it)?
Challenges, Recent Features and Trends
Topology-aware Scheduling
[Figure: nodes (2*4 cores) connected through two switches, with link bandwidths X1 GB/s and X2 GB/s (X2 >> X1); one application placement is labeled "Better Perfs.", the other "Bottleneck".]
See [JP09] for the problem definition.
Two approaches: 1) strictly respect the topology constraint (OAR); 2) if the constraint cannot be respected, the batch scheduler allocates the next contiguous set of resources, relying on a clever numbering of resources in a 3D torus (Slurm).
Trade-off between response time and execution time.
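The two approaches above can be contrasted with a small sketch. It is an illustration only, not the OAR or Slurm implementation: the function names and the `free_by_switch` mapping are hypothetical, and the relaxed variant simply takes the free nodes that come first in numbering order, which is a simplification of "next contiguous set" that assumes node numbers follow the topology.

```python
def allocate_strict(free_by_switch, wanted):
    """Strict policy: all nodes must sit under one switch, otherwise wait.

    free_by_switch: dict mapping a switch name to the set of free node numbers.
    Returns a sorted list of node numbers, or None if the job must stay queued.
    """
    for _switch, free in sorted(free_by_switch.items()):
        if len(free) >= wanted:
            return sorted(free)[:wanted]
    return None


def allocate_relaxed(free_by_switch, wanted):
    """Relaxed policy: prefer one switch, else fall back to numbering order."""
    strict = allocate_strict(free_by_switch, wanted)
    if strict is not None:
        return strict
    # Fallback: lowest-numbered free nodes, possibly spanning several switches.
    all_free = sorted(n for free in free_by_switch.values() for n in free)
    return all_free[:wanted] if len(all_free) >= wanted else None
```

The strict variant can leave a job waiting although enough nodes are free (longer response time), while the relaxed variant starts it at once on a placement that may cross the slower link (longer execution time).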
Challenges, Recent Features and Trends
Scalability
Which granularity for resource representation and manipulation?
core, thread (too fine?) (generally a flat data structure in batch schedulers)
nodes (most used) (Slurm can manage up to 64K nodes, but how many cores?)
Add some policies for fine tuning (cpuset, cgroup, CPU affinity, bulk I/O, bandwidth as a next step...)
partitions (sets of nodes) (sometimes used in large clusters)
Other resource issues
Memory, network cards, L3 cache partitioning (Power 7), DVFS control...
File System and File Management
File System and File Management
Some facts
Scaling up by adding I/O capacities is expensive.
File systems remain very sensitive to bottleneck effects.
At very large scale, checkpointing increases the pressure on the file system.
SSDs will offer new options, but will they imply more complexity in system management?
File System and Batch Scheduler
Few studies address the interactions between batch scheduler and file system.
Up to now, administrators have used the batch scheduler's configurability to insert workarounds and to develop ad hoc (not portable) strategies.
File System and File Management
Hierarchical Storage Management
Hierarchical storage management (HSM) provides mechanisms to automatically move, migrate and recall files between tape and disk.
[Figure: Cluster Nodes; three GPFS servers and three HPSS movers; HPSS server; GPFS/HPSS scheduler; storage tiers labeled Disk/SSD?, Disk and Tape/Disk]
HSM is typically used for archival purposes. An example of such a system is the High Performance Storage System (HPSS) 2
HPSS provides both striping (for large files) and file aggregation (for small ones).
Coupling HPSS / GPFS:
Automatic file migration via a rule-based system which identifies the files to migrate (ILM policy/engine).
2. http://www.hpss-collaboration.org/
File System and File Management
File System and File Management in HPC Cluster
2 main ways
Parallel File System
A shared parallel file system
Lustre, GPFS, PVFS ... NFSv4.1 (pNFS) protocol
Used on most systems
Implicit/No Control
Data Staging
The job's data requirements are identified and provided by the user in the submitted script.
Stage-in: input files are transferred to the local disks of the compute nodes before the job starts.
Stage-out: output files are transferred from the nodes to mass storage after execution.
Nowadays rarely used on clusters, mainly used in Grid contexts
Explicit Control
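The stage-in / execute / stage-out cycle described above can be sketched in a few lines. This is a minimal single-node illustration, not the mechanism of any specific batch scheduler: `run_with_staging` and its parameters are hypothetical names, a temporary directory stands in for the local disk of a compute node, and a plain directory stands in for mass storage.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_staging(inputs, outputs, storage_dir, job):
    """Stage in the inputs, run `job(workdir)`, then stage out the outputs.

    inputs, outputs: file names, relative to storage_dir and to the work dir.
    job: callable that reads and writes files inside the work directory.
    """
    storage = Path(storage_dir)
    # The temporary directory plays the role of the node's local disk.
    with tempfile.TemporaryDirectory() as local:
        workdir = Path(local)
        for name in inputs:       # stage-in: storage -> local disk
            shutil.copy2(storage / name, workdir / name)
        job(workdir)              # execution works on local data only
        for name in outputs:      # stage-out: local disk -> storage
            shutil.copy2(workdir / name, storage / name)
```

The point of the explicit form is that the list of input and output files is known before the job starts, which is exactly the information an implicit shared file system does not give to the scheduler.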
File System and File Management Data Staging
Data Staging 1
Example 1: data staging approach proposed in LoadLeveler a
a. excerpt from the official documentation
1. A single replica of the data files needed by a job has to be created on a common file system.
2. A replica of the data files has to be created on every machine on which the job will run.
The time at which data staging occurs can be specified in the submitted script (via the AT SUBMIT and AT JUST IN TIME keywords).
File System and File Management Data Staging
Data Staging 2
Example 2 : on a large system managed with Slurm
Slurm does not directly support data staging.
Approach: introduce a helper job (HJ) to trigger the stage-out of data.
The HJ depends on the job that generates the output data.
The HJ keeps only one core per node to carry out the data transfer to the storage system. The other cores are released to the system (job dynamics or co-allocation).
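The core bookkeeping behind the helper job can be sketched as follows. This models only the split of an allocation and makes no Slurm call: `split_allocation` and the node/core naming are hypothetical, introduced for this example.

```python
def split_allocation(allocation):
    """Split a finished job's cores between the helper job and the system.

    allocation: dict mapping a node name to the list of core ids the job held.
    Returns (kept, released), two dicts with the same shape.
    """
    kept, released = {}, {}
    for node, cores in allocation.items():
        if not cores:
            continue
        cores = sorted(cores)
        kept[node] = cores[:1]            # one core per node does the transfer
        if cores[1:]:
            released[node] = cores[1:]    # the rest goes back to the scheduler
    return kept, released
```

For a job that held cores 0 to 3 on one node and cores 4 and 5 on another, the helper job keeps cores 0 and 4, and the four remaining cores become available to other jobs while the output data is still being transferred.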
File System and File Management Applications
Applications
Libraries
Parallel formats: HDF5, NetCDF
MPI I/O (ROMIO, etc.)
Specialized environments with their own file storage/management runtime
à la Map-Reduce (Hadoop) (data mining, data analysis)
Blobseer / Visualization (see Bouge, Antoniu)
Falkon/Swift (Many Loosely Coupled Tasks)
Applications with specific checkpointing runtimes and strategies
Workflow: to coordinate computing, visualization and post-processing analysis tasks (dedicated scheduling).
Can we obtain accurate I/O profiles and I/O requirements from applications?
What level of variability (impact of failure recovery)?
This kind of information is needed to allocate resources and I/O bandwidth between concurrent applications.
Toward Data-Aware Batch Scheduler
Toward Data-Aware Batch Scheduler
[Figure: the storage architecture (Cluster Nodes, GPFS servers, HPSS movers, HPSS server, GPFS/HPSS scheduler, tiers Disk/SSD?, Disk, Tape/Disk) shown together with the scheduling stack (Users, Submissions, Queues, Workload Management System, Job Management System, Resource Management System, Scheduling), joined under the label "Data-Aware Scheduling"]
Goal: a data-aware scheduler which can better interact with the global file system and the application/runtime
Needs: accurate I/O profiles, I/O requirements
Identify the possibilities of control: runtime, automatic data staging (BW),
I/O bandwidth, I/O node co-allocation (topology-aware and locality-preserving),
I/O overlapping, gang scheduling, time-sharing, specific scheduling algorithms...
Lots of interactions between factors, parameters and features...
Toward Data-Aware Batch Scheduler
Example of interaction: application/checkpoint/batch scheduler
[Figure: four placements, 1) to 4), of applications A and B and their in-RAM checkpoints (CHKPT A, CHKPT B) across nodes; labels: Unused Cores, Nodes Powered OFF]
Diskless checkpoint with RS encode/decode fragments (as proposed by Gomez et al.)
1) better resilience against failures (good chkpt distribution) but waste of resources
2) lower resilience but better use of resources
3) lower resilience
4) better resilience (good chkpt distribution)
Benchmarks, Workloads, Experimentations
Benchmarks, Workloads and Experimentations
What do we have at our disposal?
Parallel I/O Benchmarks
A list of benchmarks, applications and traces maintained by Rajeev Thakur of ANL a
FLASH I/O, IOR, mpiBLAST, NAS BT I/O, Qbox...
a. http://www.mcs.anl.gov/~thakur/pio-benchmarks.html
Batch Schedulers
Only one benchmark: ESP, a system utilization benchmark [WOK00]
Parallel Workloads Archive: Standard Workload Format a
a. http://www.cs.huji.ac.il/labs/parallel/workload/swf.html
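To give an idea of how such traces can be consumed, here is a minimal reader for Standard Workload Format lines. It is a sketch under stated assumptions: the names `SwfJob` and `parse_swf` are hypothetical, only the first five fields of each record are kept, and their order (job number, submit time, wait time, run time, allocated processors) is assumed to follow the SWF definition published by the archive, where lines starting with `;` are header comments.

```python
from typing import Iterable, Iterator, NamedTuple


class SwfJob(NamedTuple):
    job_id: int
    submit_time: int   # seconds since the start of the trace
    wait_time: int     # seconds spent in the queue
    run_time: int      # seconds of execution
    processors: int    # number of allocated processors


def parse_swf(lines: Iterable[str]) -> Iterator[SwfJob]:
    """Yield one SwfJob per data line, skipping blanks and ';' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        # Only the first five whitespace-separated fields are used here.
        yield SwfJob(*(int(float(f)) for f in fields[:5]))
```

A typical use is `jobs = list(parse_swf(open(path)))`, after which submit times, run times and processor counts can be replayed against a scheduler or a simulator. Such a trace still says nothing about I/O or failures.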
Benchmarks, Workloads, Experimentations
Benchmarks, Workloads and Experimentations
Testbed
Testbeds like the Grid’5000 dedicated platform
But we don’t have
An integrated workload/trace: batch scheduler + parallel I/O system + failures
An HSM like HPSS (+GPFS) to play with (can we determine a model and build a simulator and an emulator of it?)
Obtaining or building a realistic testbed platform is a challenge.
Conclusions
Conclusion
The increasing complexity of computing infrastructures is a big challenge for batch schedulers (and for data-aware batch schedulers).
The file system can be a bottleneck at large scale in some situations.
Interactions between the batch scheduler and the file management system have not been extensively studied in recent (and old?) literature.
There are surely opportunities to enhance overall infrastructure efficiency through interaction and coordination between batch scheduler, file system and application.
Experimentation raises several issues:
Workloads, benchmarks, failure injection, HSM emulation.
Testbed elaboration.
Conclusions
Thank You
Questions?
Conclusions
References I
[JP09] J. Pascual, J. Navaridas, and J. Miguel-Alonso.
Effects of topology-aware allocation policies on scheduling performance.
In Proceedings of the 14th Workshop on Job Scheduling Strategies for Parallel Processing, 2009.
[MTNM08] Yuya Machida, Shin’ichiro Takizawa, Hidemoto Nakada, and Satoshi Matsuoka.
Intelligent data staging with overlapped execution of grid applications.
Future Generation Comp. Syst., 24(5):425–433, 2008.
[PSCS07] Nilton Cezar Paula, Gisele Silva Craveiro, and Liria Matsumoto Sato.
Data transfer in advance on cluster.
In PaCT 2007: Proceedings of the 9th International Conference on Parallel Computing Technologies, pages 599–607, Berlin, Heidelberg, 2007. Springer-Verlag.
[Wikipedia] Wikipedia.
Job scheduler.
http://en.wikipedia.org/wiki/Job_scheduler, June 2010.
[WOK00] Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey.
ESP: A system utilization benchmark.
In Proceedings of the Supercomputing 2000 Conference, 2000.