Nomadic Grid Applications: The Cactus WORM
G. Lanfermann
Max Planck Institute for Gravitational Physics
Albert-Einstein-Institute, Golm
Dave Angulo
University of Chicago
Chicago, IL
Grid Application Development Software Project
Outline
The Worm - Migration on the Grid:
– Motivation
– Design
The Worm: Adaptive Migration
– Data Transfer using GridFTP
– Resource Selection using MDS-2 and ClassAds
– Contract Monitoring using MDS-2
– Intelligent Migration using GRAM
This talk: http://people.cs.uchicago.edu/~dangulo/grads/CactusGrADS-Aug7-GlobusRetreat.ppt
Other documents on GrADS in Cactus architecture:
http://people.cs.uchicago.edu/~dangulo/grads/arch/
http://www.cactuscode.org
Paper available in back of 2001 Globus Retreat book
Migration on the Grid
[Figure labels: Resource Broker; Requested & Available Resource; Payload; Migration; Your Grid]
Large Scale HPC Simulation: Daily Routine
The “daily” routine of doing large scale numerical simulations:
Take an educated guess at memory requirements, number of processors, disk space needed.
Start with the first parameter in a range of values to explore the behavior of your simulation.
– Select a machine and submit to the queuing system. Wait.
– Archive and analyze the data; make changes to the parameter file, resubmit to the queuing system. Wait.
For the large production run, increase resolution of your experiment, take educated guess at memory, ….
Select a big machine, submit to the queue. Wait 3-7 days.
Archive data & checkpoint file, resubmit to the queue. Wait 3-7 days.
Archive data & checkpoint file, resubmit. Wait 3-7 days.
Archive data & checkpoint file, resubmit. Wait 3-7 days. ….
Automating the Routine
Let the computer find out about the code’s resource requirements.
Automatically contact appropriate machines, stage executables and submit to the queuing system.
Let the computer monitor the quality of the requested resources as the simulation progresses.
Perform multiple simulations over a range of parameters automatically and in parallel.
Archive the data and give the user uniform access.
There is plenty of room to automate the way simulations are carried out today.
Cactus + Grid
Cactus based Application Thorns: The Physics: Initial Data, Evolution, Analysis, etc.
Grid Aware Application Thorns: Drivers for Contract Management, Dynamic Resource Detection, Simulation Relocation
Grid Enabled Communication Library: MPICH-G2 implementation of MPI, can run MPI programs across heterogeneous computing resources
Standard MPI
SingleProc
The Grid Layer Concept
Application Thorns provide: Initial Data, Analysis, Evolution
Grid Thorns provide: Migration & Resource Management
Grid Enabled Simulation
Grid Enabled Computational Framework
Cactus Computational Framework
Migrating Applications on the Grid
[Figure labels: Payload; Application Information Server (AIS); Migration Unit; Resource Management; Resource Selector Client; Worm Layer; Hibernation Storage; Off Site Data Server; Resource Broker]
Current Architecture
[Figure labels: Under Development; Resource Selection Client Thorn; External Resource Selection Service; “Worm” Migration Module; Cactus Worm Server; Thorns; Cactus Application Unit; Cactus Flesh; Performance Degradation Detection; User Supplied Application Payload; External Processes; Migration Logic Manager; GridFTP Client Thorn; External GridFTP Server (Source); External GridFTP Server (Destination); Data transfer; GRAM]
Migration of Checkpoint Files
Uses alpha version of GridFTP
Allows Third Party Transfer
– Without this, need to
  > do a GET to transfer files from source to Migrator
  > do a PUT to transfer files from Migrator to destination
Uses GSI security
– Allows grid-proxy with only a single sign-on while retaining tight security
Allows fast, efficient, reliable transfer
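As a rough illustration of the third-party transfer above, the sketch below builds a `globus-url-copy` command line in which both source and destination are `gsiftp://` URLs, so the data flows directly between the two GridFTP servers rather than through the Migrator. This is an assumption for illustration only: the slides do not say how the GridFTP Client Thorn invokes the transfer, and the host names and path are hypothetical.

```python
def third_party_copy_command(src_host, dst_host, path):
    """Build an argv list for a server-to-server GridFTP copy.

    Both URLs use the gsiftp scheme, so the client only orchestrates the
    transfer; the file is not staged through the machine running this code.
    The host names and path passed in are hypothetical placeholders.
    """
    src_url = f"gsiftp://{src_host}{path}"
    dst_url = f"gsiftp://{dst_host}{path}"
    return ["globus-url-copy", src_url, dst_url]


if __name__ == "__main__":
    cmd = third_party_copy_command(
        "source.example.org", "dest.example.org", "/scratch/run1/checkpoint.chk"
    )
    print(" ".join(cmd))
```

With a valid GSI proxy (for example one created by `grid-proxy-init`), the resulting list could be handed to `subprocess.run(cmd, check=True)`.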
Resource Selector Architecture (ClassAds)
[Figure labels: Resource Selection Client Thorn; ClassAds library; Resource Selection Engine; Request in ClassAds format; Response in XML; GIIS; NWS; Resources; UTk Project; GRISs; MDS-2]
MDS-2 Future Plans
Resource selector goes to GRIS directly after resources discovered
To investigate: strategies for managing update traffic
Would like persistent queries to support notification of changes in resource status
Resource Selection: Example Input (ClassAds format)
[
Type = "request";
Owner = "dangulo";
RequiredDomains = {"cs.uiuc.edu", "ucsd.edu"};
requirements = "other.opSys=='LINUX' &&
other.minMemSize > (100G/other.CPUCount) &&
Include(other.domains, RequiredDomains)
";
Rank = other.minCPUSpeed * other.CPUCount / (other.maxCPULoad+1);
]
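To make the matching logic of the request concrete, here is a minimal sketch in plain Python (not the ClassAds library) that filters candidate machine sets by the requirements and orders them by the Rank expression. The `MachineAd` type, its field names, and the reading of `Include(...)` as "at least one of the resource's domains is in RequiredDomains" are assumptions made for illustration.

```python
from dataclasses import dataclass

REQUIRED_DOMAINS = ("cs.uiuc.edu", "ucsd.edu")
TOTAL_MEMORY_GB = 100.0  # the "100G" in the request


@dataclass
class MachineAd:
    # Hypothetical container; fields mirror the "other.*" attributes above.
    name: str
    op_sys: str
    cpu_count: int
    min_mem_size_gb: float
    min_cpu_speed: float
    max_cpu_load: float
    domains: tuple


def meets_requirements(ad):
    """Mirror of the requirements expression (Include semantics assumed)."""
    return (
        ad.op_sys == "LINUX"
        and ad.min_mem_size_gb > TOTAL_MEMORY_GB / ad.cpu_count
        and any(d in REQUIRED_DOMAINS for d in ad.domains)
    )


def rank(ad):
    """Mirror of the Rank expression: faster, larger, less loaded is better."""
    return ad.min_cpu_speed * ad.cpu_count / (ad.max_cpu_load + 1)


def select(ads):
    """Return the ads that satisfy the requirements, best rank first."""
    eligible = [ad for ad in ads if meets_requirements(ad)]
    return sorted(eligible, key=rank, reverse=True)
```

The real Resource Selection Engine evaluates the ClassAd expressions itself; the sketch only shows what the request asks for.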
Resource Selection: Example output
<virtualMachine>
  <result statusCode="200" statusMessage="OK"/>
  <machineList>
    <machine dns="amajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="bmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="cmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="dmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="emajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="fmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="hmajor.cs.uiuc.edu" processor=" 1"/>
  </machineList>
</virtualMachine>
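A client receiving a response of this shape could extract the machine list along the following lines. This is a sketch using Python's standard `xml.etree.ElementTree`, assuming each `<machine>` element is well-formed (self-closed) and that a `statusCode` of "200" signals success; the function name is hypothetical and the sample is shortened to two machines.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<virtualMachine>
  <result statusCode="200" statusMessage="OK"/>
  <machineList>
    <machine dns="amajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="bmajor.cs.uiuc.edu" processor=" 1"/>
  </machineList>
</virtualMachine>
"""


def parse_machine_list(xml_text):
    """Return [(dns_name, processor_count), ...] from a selector response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    if result is None or result.get("statusCode") != "200":
        message = result.get("statusMessage") if result is not None else "missing <result>"
        raise RuntimeError(f"resource selection failed: {message}")
    # int() tolerates the leading space seen in the processor attribute.
    return [(m.get("dns"), int(m.get("processor"))) for m in root.iter("machine")]


if __name__ == "__main__":
    for dns, processors in parse_machine_list(SAMPLE):
        print(dns, processors)
```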
Performance Model
Working on putting Performance Model into ClassAds
Every processor is assigned to compute XYZ/N grid points.
Requested Memory > 16 (constant) + 512 * (10E-6) (constant) * (XYZ / N) (MB)
Time needed to perform an iteration = (computation time + communication time) * slowdown
800 floating point operations every grid point per iteration.
Computation time = 800 (constant) * (XYZ / N) / cpuspeed
– cpuspeed is FLOPS
Communication time = 1/G * 2 * (T1 + 2 * T2 * GXYR)
– T1 is the communication latency between two processors (latency from NWS)
– T2 is the transmit time for a word: T2 = 1 / (available bandwidth) (available bandwidth from NWS)
Slowdown = 1 + cpuload
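The formulas above translate directly into a few functions. The sketch below is a literal transcription under stated assumptions: the constants are copied as written on the slide (including `10E-6`), and `G` and `GXYR` are passed through as plain parameters because the slide does not define them. Function and argument names are hypothetical.

```python
def requested_memory_mb(xyz, n):
    """Lower bound on memory per processor in MB, for XYZ/N grid points."""
    return 16 + 512 * 10e-6 * (xyz / n)


def computation_time(xyz, n, cpuspeed):
    """Seconds per iteration: 800 floating point operations per grid point."""
    return 800 * (xyz / n) / cpuspeed


def communication_time(g, gxyr, latency, bandwidth):
    """Communication term; latency and bandwidth would come from NWS."""
    t1 = latency
    t2 = 1 / bandwidth  # transmit time for one word
    return (1 / g) * 2 * (t1 + 2 * t2 * gxyr)


def iteration_time(xyz, n, cpuspeed, g, gxyr, latency, bandwidth, cpuload):
    """(computation time + communication time) * slowdown."""
    slowdown = 1 + cpuload
    return (
        computation_time(xyz, n, cpuspeed)
        + communication_time(g, gxyr, latency, bandwidth)
    ) * slowdown


if __name__ == "__main__":
    # Hypothetical inputs: 1e6 grid points on 10 processors at 1e8 FLOPS.
    print(requested_memory_mb(1_000_000, 10))
    print(iteration_time(1_000_000, 10, 1e8, 1, 1000, 0.001, 1e6, 0.5))
```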
Contract Monitor
Driven by three user-controllable parameters
– Time quantum for “time per iteration”
– % degradation in time per iteration (relative to prior average) before noting violation
– Number of violations before migration
Potential causes of violation
– Competing load on CPU
– Computation requires more processing power: e.g., mesh refinement, new subcomputation
– Hardware problems
Contract Monitor Details
The end user specifies several variables.
These variables can be changed during runtime by contacting the application through an HTTP interface.
These variables include:
– time quantum
– % degradation
– number of violations before migration
The system will then calculate the average wall clock time per iteration for each time quantum.
If the average time per iteration in any time quantum is worse (by at least the percentage specified) than the average over all previous quanta, then a violation is noted.
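The violation rule can be sketched as a small class. This is an illustration, not the Cactus thorn itself: the class and method names are hypothetical, the caller is assumed to group iteration timings by time quantum, the baseline is taken as the mean of all earlier quantum averages (violating quanta included), and migration is signalled once the violation count exceeds the configured number.

```python
class ContractMonitor:
    """Tracks per-quantum average iteration time and counts violations."""

    def __init__(self, degradation_pct, max_violations):
        self.degradation_pct = degradation_pct  # e.g. 20 for 20 %
        self.max_violations = max_violations
        self.quantum_averages = []
        self.violations = 0

    def end_quantum(self, iteration_times):
        """Record one time quantum; return True if migration should start.

        iteration_times: wall clock seconds of each iteration in the quantum.
        """
        current = sum(iteration_times) / len(iteration_times)
        if self.quantum_averages:
            baseline = sum(self.quantum_averages) / len(self.quantum_averages)
            # A violation: this quantum is slower than the prior average
            # by more than the allowed percentage.
            if current > baseline * (1 + self.degradation_pct / 100):
                self.violations += 1
        self.quantum_averages.append(current)
        return self.violations > self.max_violations


if __name__ == "__main__":
    monitor = ContractMonitor(degradation_pct=20, max_violations=1)
    for times in ([1.0, 1.0], [1.1], [2.0], [2.0]):
        print(monitor.end_quantum(times))
```

Because the three parameters are plain attributes, a runtime change (such as one arriving over the HTTP interface) would amount to assigning new values to them.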
Actions Taken on Contract Violation
Occurs when more than the specified number of violations have been noted
New set of resources requested from the Resource Selector
Checkpoints the application
Moves checkpoint data to the new resources along with other data needed for restart
Restarts application on the new resources
Migration Manager
Allows resource selection to occur asynchronously
Makes an intelligent choice on whether migration will actually help
– Will not migrate to seemingly lower quality resources
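One simple way to express "will migration actually help" is to compare the measured time per iteration with a predicted time on the candidate resources. The sketch below is an assumption about how such a check might look, not the Migration Manager's actual logic: the function name is hypothetical, and the `migration_cost` term (time spent checkpointing, transferring and restarting) is an extra consideration not stated on the slide.

```python
def should_migrate(current_time, predicted_time, remaining_iterations,
                   migration_cost=0.0):
    """Decide whether moving to the candidate resources is worthwhile.

    current_time:    measured seconds per iteration on the current resources
    predicted_time:  estimated seconds per iteration on the candidates
    migration_cost:  assumed one-off overhead of the move, in seconds
    """
    # Never move to resources that look no better than the current ones.
    if predicted_time >= current_time:
        return False
    saved = (current_time - predicted_time) * remaining_iterations
    return saved > migration_cost


if __name__ == "__main__":
    print(should_migrate(2.0, 2.5, 100))
    print(should_migrate(2.0, 1.0, 100, migration_cost=50.0))
```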
Summary
The Worm gives researchers in physics and computational science easy adaptability to changing grid environments
Data Transfer using GridFTP
Resource Selection using MDS-2 and ClassAds
Contract Monitoring using MDS-2
Intelligent Migration using GRAM