Dr. David Wallom
Experience of Setting up and Running a Production Grid on a University Campus
July 2004
2 Outline
• The Centre for e-Research Bristol & its place in national efforts
• University of Bristol Grid
• Available tool choices
• Support models for a distributed system
• Problems encountered
• Summary
3 Centre for e-Research Bristol
• Established as a Centre of Excellence in visualisation.
• Currently has one full time member of staff with several shared resources.
• Intended to lead the University e-Research effort including as many departments and non-traditional computational users as possible.
4 NGS (www.ngs.ac.uk) UK National Grid Service
• ‘Free’ dedicated resources accessible only through Grid interfaces, i.e. GSI-SSH, Globus Toolkit
• Compute clusters (York & Oxford)
– 64 dual CPU Intel 3.06 GHz nodes, 2GB RAM
– Gigabit & Myrinet networking
• Data clusters (Manchester & RAL)
– 20 dual CPU Intel 3.06 GHz nodes, 4GB RAM
– Gigabit & Myrinet networking
– 18TB Fibre SAN
• Also national HPC resources: HPC(x), CSAR
• Affiliates: Bristol, Cardiff, …
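• For example, a user with a valid UK e-Science certificate might reach an NGS node through these Grid interfaces (an illustrative sketch only; the hostname is hypothetical):

    # Create a short-lived proxy from the user's certificate
    grid-proxy-init
    # Log in to a head node over GSI-SSH
    gsissh grid-node.example.ac.uk
    # Or run a command remotely through the Globus gatekeeper
    globus-job-run grid-node.example.ac.uk/jobmanager-fork /bin/hostname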
5 The University of Bristol Grid
• Established as a way of leveraging extra use from existing resources.
• Planned to consist of ~400 CPUs from 1.2 to 3.2 GHz arranged in 6 clusters; currently about 100 CPUs in 3 clusters.
• Initially legacy OS though all now moving to Red Hat Enterprise Linux 3.
• Based in and maintained by several different departments.
6 The University of Bristol Grid
• Decided to construct a campus grid to gain experience with middleware & system management before formally joining NGS.
• Central services all run on Viglen servers.
– Resource Broker
– Monitoring and Discovery System & Systems Monitoring
– Virtual Organisation Management
– Storage Resource Broker Vault
– myProxy Server
• The choice of software to provide these was led by personal experience and other UK efforts to standardise.
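• As a sketch of how a user might interact with the myProxy server listed above (the server name and username here are hypothetical):

    # Store a medium-lived credential on the myProxy server
    myproxy-init -s myproxy.example.bris.ac.uk
    # Later, retrieve a short-lived proxy on another machine
    myproxy-get-delegation -s myproxy.example.bris.ac.uk -l username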
7 The University of Bristol Grid, 2
• Based in and maintained by several different departments.
• Each system with a different System Manager!
• Different OSs, initially just Linux & Windows, though others will come.
• Linux versions initially legacy, though all now moving to Red Hat Enterprise Linux.
8 The System Layout
9 System Installation Model
Draw it on the board!
10 Middleware
• Virtual Data Toolkit.
– Chosen for stability and support structure.
– Widely used in other European production grid systems.
• Contains the standard Globus Toolkit version 2.4 with several enhancements.
11 Resource Brokering
• Uses the Condor-G job distribution mechanism.
• Custom script for determination of resource priority.
• Integrated the Condor job submission system with the Globus Monitoring and Discovery Service (an illustrative MDS query is sketched below).
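• As an illustration only (not the actual Bristol script), a priority script might query a cluster's GRIS over the GT2 MDS port and rank resources on the returned CPU information; the hostname, LDAP filter and attribute choice below are assumptions:

    #!/bin/sh
    # Illustrative only: ask a cluster head node's GRIS (default MDS port 2135)
    # for CPU information that a ranking script could parse.
    grid-info-search -x -h cluster-head.example.bris.ac.uk -p 2135 \
        -b "mds-vo-name=local,o=grid" "(objectclass=MdsCpu)"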
13 Accessing the Grid with Condor-G
• Condor-G allows the user to treat the Grid as a local resource, and the same command-line tools perform basic job management such as:
– Submit a job, indicating an executable, input and output files, and arguments
– Query a job's status
– Cancel a job
– Be informed when events happen, such as normal job termination or errors
– Obtain access to detailed logs that provide a complete history of a job
• Condor-G extends basic Condor functionality to the Grid, providing resource management while retaining fault tolerance and exactly-once execution semantics.
14 How to submit a job to the system
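• A minimal, illustrative Condor-G submit description and the commands used to manage the job (the gatekeeper contact string and filenames are hypothetical):

    # myjob.submit -- illustrative Condor-G submit description
    universe        = globus
    globusscheduler = cluster-head.example.bris.ac.uk/jobmanager-pbs
    executable      = my_app
    arguments       = input.dat
    transfer_input_files = input.dat
    output          = my_app.out
    error           = my_app.err
    log             = my_app.log
    queue

    condor_submit myjob.submit   # submit the job
    condor_q                     # query its status
    condor_rm <job id>           # cancel it if necessary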
15 Limitations of Condor-G
• Submitting jobs to run under Globus has not yet been perfected. The following is a list of known limitations:
– No checkpointing.
– No job exit codes: exit codes are not returned to the submitter.
– Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX. HP-UX support will hopefully be available later.
16 Resource Broker Operation
17 Load Management
• Only the raw numbers of jobs running, idle and with problems are available.
• There is little measure of the relative performance of nodes within the grid.
• Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.
18 Provision of a Shared Filesystem
• Providing a Grid makes it beneficial to also provide a shared file system.
• The newest machines come with a minimum of 80GB hard drives, of which only a small part is needed for the OS and user scratch space.
• The system will have a 1TB Storage Resource Broker Vault as one of the core services.
– Take this one step further by partitioning the system drives on the core servers.
– Create a virtual disk of ~400GB using the spare space on them all.
– Install the SRB client on all machines so that they can directly access the shared storage (see the example session below).
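• An illustrative SRB Scommands session (filenames and paths are hypothetical):

    Sinit                   # start an SRB session using the settings in ~/.srb
    Sput results.dat .      # copy a local file into the current SRB collection
    Sls                     # list the collection contents
    Sget results.dat /tmp   # retrieve the file, e.g. from another machine
    Sexit                   # end the session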
19 Automation of Processes for Maintenance
• Installation
• Grid state monitoring
• System maintenance
• User control
• Grid testing
20 Individual System Installation
• Simple shell scripts for overall control (a sketch follows below).
• Ensures middleware, monitoring and user software are all installed in a consistent place.
• Ensures ease of system upgrades.
• Ensures system managers have a chance to review the installation method beforehand.
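• A hedged sketch of what such a wrapper script might look like (the paths and tarball names are assumptions, not the actual Bristol scripts):

    #!/bin/sh
    # Illustrative installation wrapper
    set -e
    GRID_ROOT=/opt/grid                       # agreed, consistent location on every node
    mkdir -p $GRID_ROOT
    tar xzf vdt-client.tar.gz -C $GRID_ROOT   # middleware (e.g. a VDT distribution)
    tar xzf bb-client.tar.gz  -C $GRID_ROOT   # monitoring client
    echo "Installation complete under $GRID_ROOT"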
21 Overall System Status and Status of the Grid
22 Ensuring the System Availability
• Uses the Big Brother™ system.
– Monitoring occurs through a server-client model.
– The server maintains limit settings and pings the listed resources.
– Clients record system information and report it to the server using a secure port.
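• As an illustration, each monitored host is listed in the server's bb-hosts file along with the tests to run against it (the address, hostname and test tags here are hypothetical):

    # bb-hosts entry on the Big Brother server
    137.222.10.21   cluster-head.example.bris.ac.uk   # conn ssh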
23 Big Brother™ Monitoring
24 Grid Middleware Testing
• Uses the Grid Interface Test Script (GITS) developed for the ETF.
– Tests the following:
• Globus gatekeeper running and available.
• Globus job-submission system.
• Presence of the machine within the Monitoring & Discovery Service.
• Ability to retrieve and distribute files through GridFTP.
• Run within the UoB grid every 3 hours.
• Latest results available on the Service webpage.
• The only downside is that it needs to run as a standard user rather than a system account.
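• For example, the three-hourly runs could be scheduled from an ordinary (non-root) account with a crontab entry like the following (the script path is hypothetical):

    # Run the GITS tests every 3 hours as a standard user
    0 */3 * * *  /home/gridtest/gits/run-gits.sh >> /home/gridtest/gits/gits.log 2>&1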
25 Grid Middleware Testing
26 What is currently running and how do I find out?
27 Authorisation And Authentication on the University of Bristol Grid
• Make use of the standard UK e-Science Certification Authority.
• Bristol is an authorised Registration Authority for this CA.
• Uses X.509 certificates and proxies for user AAA.
• May be replaced at a later date depending on how the current system scales.
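• In practice a user turns their certificate into a short-lived proxy before using the Grid; an illustrative session:

    grid-cert-info -subject      # show the certificate's Distinguished Name
    grid-proxy-init -hours 12    # create a 12-hour proxy for authentication
    grid-proxy-info -timeleft    # check how long the proxy has left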
28 User Management
• Globus uses a mapping between the Distinguished Name (DN) defined in a digital certificate and local usernames on each resource.
• Located in controlled disk space.
• It is important that, for each resource a user expects to use, their DN is mapped locally.
• Distributing this mapping is Virtual Organisation Management (an example entry is shown below).
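• An illustrative grid-mapfile entry (the DN and local username are hypothetical):

    # /etc/grid-security/grid-mapfile -- one DN-to-local-account mapping per line
    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=a n other" anothr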
29 Virtual Organisation Management and Resource Usage Monitoring/Accounting
30 Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2
• The server (previous slide) runs as a grid service using the ICENI framework.
• Clients located on machines that form part of the Virtual Organisation.
• The current drawback is that this service must run using a personal certificate instead of the machine certificate that would be ideal.
• Coming in new versions from OMII.
31 Locally Supporting a Distributed System
• Within the university the first point of contact is always the Information Services Helpdesk.
– Given a preset list of questions to ask and log files to check, if available.
– Not expected to do any actual debugging.
– Passes problems on to Grid experts, who then pass them on a system-by-system basis to the relevant maintenance staff.
• As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.
32 Supporting a Distributed System
• Having a system that is well defined simplifies the support model.
• Trying to define a Service Level Description from each department to the UoBGrid, as well as an overall UoBGrid Service Level Agreement to users.
– Defines hardware support levels and availability.
– Defines, at a basic level, the software support that will also be available.
33 Problems Encountered
• Some of the middleware that we have been trying to use has not been as reliable as we would have hoped.
– MDS is a prime example, where the need for reliability has defined our usage model.
– More software than we would like still has to operate as a user with an individual DN. This must change for a production system.
• Getting time and effort from some already overworked System Managers has been tricky, with sociological barriers: “Won’t letting other people use my system just mean I will have less available for me?”
34 Notes to think about!
• Choose your test application carefully.
• Choose your first test users even more carefully!
• One user with a bad experience outweighs 10 with good experiences.
– The Grid has been very over-hyped, so people expect it all to work first time, every time!
35 Future Directions within Bristol
• Make sure the rest of the University clusters are installed and running on the UoBGrid as quickly as possible.
• Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible. This will give ~800 CPUs.
• Start accepting users from outside the University as part of our commitment to the National Grid Service.
• Run the Bristol systems as part of the WUNGrid.
36 Further Information
• Centre for e-Research Bristol: http://escience.bristol.ac.uk
• Email: [email protected]
• Telephone: +44 (0)117 928 8769