NDGF CO2 Community Grid
Olli Tourunen
NORDUnet 2008
Espoo, Finland
April 10th 2008
NDGF CO2-Community Grid, NORDUnet 2008, Espoo, Finland, April 10th 2008
Topics
Project overview
First use case
Requirements and architecture
Implementation
Experiences
Statistics
Future
CO2-CG overview
The NDGF CO2 Community Grid (CO2-CG) project aims to build an application environment for scientists studying CO2 sequestration
CO2-CG was selected in the NDGF call for community projects along with BioGrid
  NDGF provides a project coordinator and half an FTE for application grid integration, plus funding for a full FTE for community software development
One-year project, started in fall 2007
  Project coordinator: Michael Gronager (NDGF)
  Project leader: Klaus Johannsen (BCCS, Bergen)
  Science specialist: Philip Binning (DTU, Copenhagen)
  Software developer: Csaba Anderlik (BCCS, Bergen)
  Grid specialist: Olli Tourunen (NDGF)
First use case for CO2-CG
Parameter study of different attributes of potential CO2 sequestration reservoirs
Software: MUFTE-UG, a general purpose simulator for multi-phase, multi-component flow in porous media
Pilot user: Andreas Kopp (University of Stuttgart)
First use case (contd.)
Order of hundreds of 32 to 64 processor parallel simulations, computationally bound (not data intensive)
One simulation covers a time frame of approximately 50 years starting from CO2 injection to the reservoir
Why parallel? Isn’t this a parameter study after all?
  A single 32 process run typically takes 3-4 days to complete
  With 16 processes we might be running over a week
Resources for these simulations are provided mainly by NOTUR, the Norwegian national infrastructure for computational science
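The run times quoted above can be turned into core-hours with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the function name is hypothetical, and "over a week" for 16 processes is taken as exactly 7 days, which makes that figure a lower bound.

```python
HOURS_PER_DAY = 24


def core_hours(processes, days):
    """Core-hours consumed by one run using `processes` cores for `days` days."""
    return processes * days * HOURS_PER_DAY


# Figures quoted on the slide: a 32 process run takes 3 to 4 days.
low_32 = core_hours(32, 3)
high_32 = core_hours(32, 4)

# Assumption: "over a week" is treated as exactly 7 days (a lower bound).
week_16 = core_hours(16, 7)

print(low_32, high_32, week_16)
```

Under these assumptions a single run costs a comparable number of core-hours either way, so the main gain from the higher process count is a shorter turnaround time per simulation.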
Requirements
Main target: Provide scientists with transparent access to computational resources in the grid
Input: User’s working directory containing MUFTE-UG source code and the input files for the simulation
Output: Simulation results returned to the user in a user specified directory
Support for NorduGrid ARC middleware
Standard grid credential handling to avoid the need for custom security policies with participating sites
Architecture overview
[Figure: architecture diagram. Components shown: Command line UI, DB and Grid Job Manager on the application server; job descriptions 1 to 3; ARC; Cluster A, Cluster B and Supercomputer C, each with a MUFTE Runtime Environment. Legend: S = Software, R = Results.]
Architecture
Command line UI (application server)
  Introduces one keyword ‘grid’ which can be invoked with different options à la openssl
  Example:
    User prepares the source code and input files for a simulation in a directory of her choice
    User issues a command like ‘grid submit -np 32’
    The submit module packages the simulation directory into the spool directory and inserts the parameters into the database
    User tracks the progress by running ‘grid status’
    The results are made available to the user when the job finishes
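The single-keyword, sub-command style described above could be sketched with Python's standard argparse module. This is a minimal illustration and not the project's actual code: only the 'submit' and 'status' sub-commands and the '-np' option come from the slide, while the '--directory' option and the returned strings are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a 'grid' command with sub-commands, in the style of openssl."""
    parser = argparse.ArgumentParser(prog="grid")
    sub = parser.add_subparsers(dest="command", required=True)

    submit = sub.add_parser("submit", help="queue the simulation directory")
    submit.add_argument("-np", type=int, required=True,
                        help="number of processes for the parallel run")
    # Hypothetical option: the slides do not say how the directory is chosen.
    submit.add_argument("--directory", default=".",
                        help="simulation directory to package")

    sub.add_parser("status", help="show the state of the user's jobs")
    return parser


def main(argv=None):
    """Parse the arguments and describe the action that would be taken."""
    args = build_parser().parse_args(argv)
    if args.command == "submit":
        # A real submit module would package the directory into the spool
        # directory and insert the parameters into the database here.
        return "queue {} with np={}".format(args.directory, args.np)
    return "list jobs"


print(main(["submit", "-np", "32"]))
print(main(["status"]))
```

Keeping every action behind one keyword means the user has a single command to remember, and new actions can be added later as further sub-commands.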
Architecture (contd.)
Grid Job Manager (application server)
  Scans the database for new jobs
  Prepares the new jobs for grid based on job parameters
  Submits the jobs into grid
  Keeps track of the grid jobs
  Downloads the results when a job is ready
  Downloads the evidence for autopsy when a job fails
MUFTE Runtime Environment (grid resource)
  Standard ARC Runtime Environment
  Compiles the software based on local configuration and environment
  Runs the simulation
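The Grid Job Manager duties listed above amount to a small state machine applied to each job on every pass. The sketch below illustrates that idea only: the state names, the Job fields and the FakeGrid stand-in are all hypothetical, and an in-memory list replaces the database.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    processes: int
    state: str = "NEW"   # hypothetical states: NEW, SUBMITTED, DONE, FAILED
    grid_id: str = ""


class FakeGrid:
    """Stand-in for the middleware, used only to exercise the sweep logic."""

    def __init__(self, failing=()):
        self.failing = set(failing)
        self.fetched = []

    def submit(self, job):
        return "grid-{}".format(job.job_id)

    def poll(self, grid_id):
        return "FAILED" if grid_id in self.failing else "FINISHED"

    def fetch(self, grid_id, what):
        self.fetched.append((grid_id, what))


def sweep(jobs, grid):
    """Advance every job by at most one step."""
    for job in jobs:
        if job.state == "NEW":
            job.grid_id = grid.submit(job)
            job.state = "SUBMITTED"
        elif job.state == "SUBMITTED":
            status = grid.poll(job.grid_id)
            if status == "FINISHED":
                grid.fetch(job.grid_id, "results")
                job.state = "DONE"
            elif status == "FAILED":
                # Keep the evidence so the failure can be examined later.
                grid.fetch(job.grid_id, "evidence")
                job.state = "FAILED"


jobs = [Job(1, 32), Job(2, 64)]
grid = FakeGrid(failing={"grid-2"})
sweep(jobs, grid)   # first pass submits both jobs
sweep(jobs, grid)   # second pass collects results or evidence
print([job.state for job in jobs])
```

Because each pass only moves a job one step forward, the function can be run repeatedly without keeping any state outside the job records themselves.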
Implementation
Grid Job Manager (GJM)
  There is one GJM instance per user
  “One sweep at a time” job, intended to be launched from cron
  Runs under user credentials
  Spools active jobs in /var/spool/co2-cg/<user>
  Written in Python
  Uses an object-relational mapper called SQLAlchemy
  Interacts with ARC grid middleware through standard user commands
  Python API for ARC is also available, might use that in the future
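Driving the middleware through its standard user commands could look roughly like the sketch below, which wraps command invocations with Python's subprocess module. The command names ngsub, ngstat and ngget are the ARC client commands of that period, but passing the job description file and job identifier as plain positional arguments is an assumption here, as are all function names and the example job identifier.

```python
import subprocess


def run_command(argv, runner=subprocess.run):
    """Run one middleware command and return its standard output as text."""
    completed = runner(argv, capture_output=True, text=True, check=True)
    return completed.stdout


def submit_argv(job_description_path):
    # Assumption: the job description file is given as a positional argument.
    return ["ngsub", job_description_path]


def status_argv(grid_job_id):
    # Assumption: the job identifier is given as a positional argument.
    return ["ngstat", grid_job_id]


def fetch_argv(grid_job_id):
    # Assumption: the job identifier is given as a positional argument.
    return ["ngget", grid_job_id]


# Hypothetical job identifier, shown without running anything.
print(status_argv("gsiftp://example.org/jobs/1"))
```

Separating the construction of the argument list from its execution makes the wrapper easy to test with a substitute runner, and would also ease a later switch to a Python API.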
Implementation (contd.)
Database
  Standard PostgreSQL relational database
  3 main tables plus some auxiliary ones
Runtime Environment
  Compilation is done on the ARC server host before the job is submitted using the user’s credentials
  Compilation and execution parameters are based on the job attributes in the DB
  Supported levels of parallelism are encoded in the RE name (e.g. MUFTE-MPI-64-1.0)
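The naming convention in the example MUFTE-MPI-64-1.0 can be handled with a small formatter and parser. The sketch below assumes the four fields are application, parallel flavour, process count and version; that reading is inferred from the single example on the slide, and the function and field names are hypothetical.

```python
import re

# Assumed layout: <application>-<flavour>-<processes>-<version>
RE_NAME = re.compile(
    r"^(?P<app>[A-Z]+)-(?P<flavour>[A-Z]+)-(?P<processes>\d+)-"
    r"(?P<version>\d+(?:\.\d+)*)$"
)


def format_re_name(processes, version="1.0", app="MUFTE", flavour="MPI"):
    """Build a runtime environment name for the given process count."""
    return "{}-{}-{}-{}".format(app, flavour, processes, version)


def parse_re_name(name):
    """Split a runtime environment name into its assumed fields."""
    match = RE_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised RE name: {!r}".format(name))
    fields = match.groupdict()
    fields["processes"] = int(fields["processes"])
    return fields


print(format_re_name(64))
print(parse_re_name("MUFTE-MPI-64-1.0"))
```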
Challenges
Transparent grid credential handling
  Balance the security policies and ease of use
Parallel run parameterization
  User needs vs. types of resources vs. available resources
  No explicit brokering support for this in ARC
  This can be done with clever RE naming
Database access rights management (not really an issue until this goes to a bigger scale)
  Lots of different possibilities to solve this if needed (DB level access rights, per user tablespaces, row change staging, n-layer architecture outside the DB…). So far applied KISS.
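The RE naming approach to brokering relies on a job only being eligible for clusters that advertise the runtime environment it requires. A minimal sketch of that matching follows; the cluster names and their RE lists are hypothetical and serve only to illustrate the selection.

```python
def eligible_clusters(required_re, advertised):
    """Return, sorted by name, the clusters advertising the required RE."""
    return sorted(
        name for name, res in advertised.items() if required_re in res
    )


# Hypothetical clusters and RE lists, for illustration only.
advertised = {
    "cluster-a": {"MUFTE-MPI-16-1.0", "MUFTE-MPI-32-1.0"},
    "cluster-b": {"MUFTE-MPI-32-1.0", "MUFTE-MPI-64-1.0"},
    "supercomputer-c": {"MUFTE-MPI-64-1.0"},
}

print(eligible_clusters("MUFTE-MPI-64-1.0", advertised))
```

With this scheme a site controls which job sizes it accepts simply by choosing which RE names to install, without any change to the broker.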
Experiences: User side
User can access a significant number of distributed resources in a transparent manner
  Peak so far: 512 cores simultaneously in use
Problems
  Memory specifications for the jobs
  Walltime specifications for the jobs
  Getting all the information to debug the jobs that have crashed
  Non-converging jobs
Experiences: Operator side
It takes around a day to set up the MUFTE RE in a new cluster
  If the site has experience in running MPI jobs through ARC, the process is quite straightforward
  In one case we have also had to set up a cross-compiling facility
AA is easy to configure since the users are managed in NDGF VOMS
Since there are not that many parallel jobs run in the grid, the ARC LRMS interface needed some tweaks in some clusters
Thanks to all the sysadmins that have helped us along the way!
Statistics
Since February 12th 2008, over 400 simulations of 16 to 64 processors have been run
Total compute time around 230000 hours
Disclaimer: measurements done from the application server side, not from resource accounting.
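As a rough check on scale, the two totals above give an average compute time per simulation. The sketch uses the rounded figures from the slide; since more than 400 simulations were run, the result is an upper bound on the true average.

```python
total_hours = 230000   # "around 230000 hours"
simulations = 400      # "over 400", so the true average is somewhat lower

average_hours = total_hours / simulations
print(average_hours)
```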
Statistics (contd.)
[Figure: Walltime usage 2008. Walltime hours (axis 0 to 80000) per week, weeks 1 to 15.]
Statistics (contd.)
[Figure: Walltime distribution. Walltime hours per cluster (axis 0 to 250000). Clusters: Titan (Oslo), Fyrkat (Aalborg), Fimm (Bergen), Stallo (Tromsö). Data labels: 213000, 24000, 3500, 600.]
Future developments
Switching focus to operation
  Software and application server hardening
  Automated tests for the runtime environments + blacklisting
  Cleanup procedures
  Integrate CO2-CG into the NDGF accounting system
  Track the simulations that are not converging
Easier certificate handling
Possibly a web portal for job tracking and collaboration
Include the new Cray XT4 in Bergen
Conclusions
With moderate effort, simple tools and an application-specific user interface, grid resource usage can be made easy for the end users
On-demand compilation works for selected applications
Parallel jobs can be run on a large scale in the grid with little effort
Thank you!
Questions, comments?