Dr. David Wallom
Experience of Setting up and Running a Production Grid on a University Campus
July 2004

TRANSCRIPT
2 Outline
• The Centre for e-Research Bristol & its place in national efforts
• University of Bristol Grid
• Available tool choices
• Support models for a distributed system
• Problems encountered
• Summary
3 Centre for e-Research Bristol
• Established as a Centre of Excellence in visualisation.
• Currently has one full time member of staff with several shared resources.
• Intended to lead the University e-Research effort including as many departments and non-traditional computational users as possible.
4 NGS (www.ngs.ac.uk) UK National Grid Service
• ‘Free’ dedicated resources accessible only through Grid interfaces, i.e. GSI-SSH, Globus Toolkit
• Compute clusters (York & Oxford)
  – 64 dual-CPU Intel 3.06 GHz nodes, 2GB RAM
  – Gigabit & Myrinet networking
• Data clusters (Manchester & RAL)
  – 20 dual-CPU Intel 3.06 GHz nodes, 4GB RAM
  – Gigabit & Myrinet networking
  – 18TB Fibre SAN
• Also national HPC resources: HPC(x), CSAR
• Affiliates: Bristol, Cardiff, …
5 The University of Bristol Grid
• Established as a way of leveraging extra use from existing resources.
• Planned to consist of ~400 CPUs from 1.2 → 3.2GHz arranged in 6 clusters. Currently about 100 CPUs in 3 clusters.
• Initially legacy OS though all now moving to Red Hat Enterprise Linux 3.
• Based in and maintained by several different departments.
6 The University of Bristol Grid
• Decided to construct a campus grid to gain experience with middleware & system management before formally joining NGS.
• Central services all run on Viglen servers:
  – Resource Broker
  – Monitoring and Discovery System & systems monitoring
  – Virtual Organisation Management
  – Storage Resource Broker Vault
  – MyProxy server
• The choice of software to provide these was led by personal experience & other UK efforts to standardise.
7 The University of Bristol Grid, 2
• Based in and maintained by several different departments.
• Each system has a different system manager!
• Different OSs: initially just Linux & Windows, though others will come.
• Linux versions were initially legacy, though all are now moving to Red Hat Enterprise Linux.
8 The System Layout
9 System Installation Model
Draw it on the board!
10 Middleware
• Virtual Data Toolkit
  – Chosen for its stability and support structure.
  – Widely used in other European production grid systems.
• Contains the standard Globus Toolkit version 2.4 with several enhancements.
11 Resource Brokering
• Uses the Condor-G job distribution mechanism.
• Custom script for determination of resource priority.
• Integrated the Condor job submission system into the Globus Monitoring and Discovery Service.
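As a rough illustration of what a resource-priority script can do, the sketch below ranks clusters by free capacity, queue length, and node speed. The field names and the scoring heuristic are illustrative assumptions, not the actual Bristol script.

```python
# Hypothetical sketch of a broker's resource-ranking step.
# The fields (free_cpus, queued_jobs, relative_speed) and the
# scoring heuristic are assumptions for illustration only.

def score(resource):
    """Prefer clusters with free CPUs, penalise long queues,
    and weight by relative node speed."""
    free = resource["free_cpus"]
    queued = resource["queued_jobs"]
    speed = resource["relative_speed"]  # e.g. normalised CPU clock
    return (free * speed) / (1 + queued)

def rank_resources(resources):
    """Return resources ordered best-first for job placement."""
    return sorted(resources, key=score, reverse=True)

clusters = [
    {"name": "physics", "free_cpus": 12, "queued_jobs": 4, "relative_speed": 1.0},
    {"name": "chemistry", "free_cpus": 2, "queued_jobs": 0, "relative_speed": 1.6},
    {"name": "engineering", "free_cpus": 0, "queued_jobs": 9, "relative_speed": 0.8},
]
best = rank_resources(clusters)[0]["name"]  # the broker submits here first
```

A real broker would feed such a ranking from MDS data rather than a static list.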
13 Accessing the Grid with Condor-G
• Condor-G allows the user to treat the Grid as a local resource, and the same command-line tools perform basic job management:
  – Submit a job, indicating an executable, input and output files, and arguments
  – Query a job's status
  – Cancel a job
  – Be informed when events happen, such as normal job termination or errors
  – Obtain access to detailed logs that provide a complete history of a job
• Condor-G extends basic Condor functionality to the grid, providing resource management while still providing fault tolerance and exactly-once execution semantics.
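For context, a Condor-G job of this era is described in a submit file using the `globus` universe; the sketch below shows the general shape. The gatekeeper hostname and jobmanager are hypothetical.

```
# Sketch of a Condor-G submit description file (hostname is hypothetical).
universe        = globus
globusscheduler = gatekeeper.example.bris.ac.uk/jobmanager-pbs
executable      = my_simulation
arguments       = input.dat
output          = job.out
error           = job.err
log             = job.log
queue
```

The job is then managed with the usual Condor tools: `condor_submit` to submit, `condor_q` to query status, and `condor_rm` to cancel.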
14 How to submit a job to the system
15 Limitations of Condor-G
• Submitting jobs to run under Globus has not yet been perfected. The following is a list of known limitations:
  – No checkpointing.
  – No job exit codes: job exit codes are not available.
  – Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX. HP-UX support will hopefully be available later.
16 Resource Broker Operation
17 Load Management
• Only reports the raw numbers of jobs running, idle & with problems.
• Has little measure of the relative performance of nodes within the grid.
• Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.
18 Provision of a Shared Filesystem
• When providing a Grid it is beneficial to also provide a shared file system.
• The newest machines come with a minimum of 80GB hard drives, of which only a minimum is needed for the OS & user scratch space.
• The system will have a 1TB Storage Resource Broker Vault as one of the core services.
  – Take this one step further by partitioning the system drives on the core servers,
  – Create a virtual disk of ~400GB using the spare space on them all!
  – Install the SRB client on all machines so that they can directly access the shared storage.
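From the user's side, access to the SRB vault goes through the SRB client tools (the "Scommands"). The session below is an illustrative sketch; the filenames are examples.

```
# Illustrative SRB client (Scommands) session; filenames are examples.
Sinit                  # start an SRB session using the user's stored settings
Sput results.dat .     # copy a local file into the current SRB collection
Sls                    # list the collection to confirm the upload
Sget results.dat /tmp  # retrieve the same file, e.g. on another machine
Sexit                  # end the session
```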
19 Automation of Processes for Maintenance
• Installation
• Grid state monitoring
• System maintenance
• User control
• Grid testing
20 Individual System Installation
• Simple shell scripts for overall control.
• Ensures middleware, monitoring and user software are all installed in a consistent place.
• Ensures ease of system upgrades.
• Ensures system managers have a chance to review the installation method beforehand.
21 Overall System Status and Status of the Grid
22 Ensuring the System Availability
• Uses the Big Brother™ system.
  – Monitoring occurs through a server-client model.
  – The server maintains limit settings and pings the resources listed.
  – Clients record system information and report it to the server over a secure port.
23 Big Brother™ Monitoring
24 Grid Middleware Testing
• Uses the Grid Interface Test Script (GITS) developed for the ETF. It tests the following:
  – Globus gatekeeper running and available
  – Globus job-submission system
  – Presence of the machine within the Monitoring & Discovery Service
  – Ability to retrieve and distribute files through GridFTP
• Run within the UoB grid every 3 hours.
• Latest results are available on the Service webpage.
• The only downside is that it also needs to run as a standard user, not a system account.
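The checks listed above correspond roughly to the following Globus Toolkit 2 command-line operations; this is a sketch of the kind of thing GITS automates, and the hostnames are hypothetical.

```
# Manual equivalents of the GITS checks (hostnames are hypothetical).
grid-proxy-init                                       # create a proxy credential
globusrun -a -r gatekeeper.grid.bris.ac.uk            # gatekeeper authentication test
globus-job-run gatekeeper.grid.bris.ac.uk /bin/date   # simple job submission
globus-url-copy file:///tmp/gits-test.txt \
    gsiftp://gatekeeper.grid.bris.ac.uk/tmp/gits-test.txt   # GridFTP transfer
grid-info-search -h mds.grid.bris.ac.uk               # presence of the host in MDS
```

The need for `grid-proxy-init` is why the script must run as a standard user with a certificate, rather than as a system account.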
25 Grid Middleware Testing
26 What is currently running and how do I find out?
27 Authorisation And Authentication on the University of Bristol Grid
• Make use of the standard UK e-Science Certification Authority.
• Bristol is an authorised Registration Authority for this CA.
• Uses x509 type certificates and proxies for user AAA.
• May be replaced at a later date, dependent on how the current system scales.
28 User Management
• Globus uses a mapping between the Distinguished Name (DN), as defined in a digital certificate, and local usernames on resources.
• Located in controlled disk space.
• It is important that, for each resource a user is expecting to use, their DN is mapped locally.
• Distributing this mapping is Virtual Organisation Management.
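In Globus this mapping lives in the grid-mapfile (conventionally `/etc/grid-security/grid-mapfile`), one DN-to-username entry per line. The DNs below are invented for illustration.

```
# Example grid-mapfile entries (DN → local username); DNs are invented.
"/C=UK/O=eScience/OU=Bristol/L=IS/CN=jane researcher" jresearcher
"/C=UK/O=eScience/OU=Bristol/L=IS/CN=joe postgrad"    jpostgrad
```

Keeping these files consistent across every resource in the grid is exactly the distribution problem the Virtual Organisation Management service addresses.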
29 Virtual Organisation Management and Resource Usage Monitoring/Accounting
30 Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2
• The server (previous slide) runs as a grid service using the ICENI framework.
• Clients are located on the machines that form part of the Virtual Organisation.
• The current drawback is that this service must run using a personal certificate instead of the machine certificate that would be ideal.
• A fix is coming in new versions from OMII.
31 Locally Supporting a Distributed System
• Within the university, the first point of contact is always the Information Services Helpdesk.
  – Given a preset list of questions to ask and log files to check, if available.
  – Not expected to do any actual debugging.
  – Passes problems on to the Grid experts, who then pass them on a system-by-system basis to their own maintenance staff.
• As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.
32 Supporting a Distributed System
• Having a system that is well defined simplifies the support model.
• Trying to define a Service Level Description for each department to the UOBGrid, as well as an overall UOBGrid Service Level Agreement to users.
  – Defines hardware support levels and availability.
  – Defines, at a basic level, the software support that will also be available.
33 Problems Encountered
• Some of the middleware that we have been trying to use has not been as reliable as we would have hoped.
  – MDS is a prime example, where the necessity for reliability has defined our usage model.
  – More software than desired still has to make use of a user with an individual DN to operate. This must change for a production system.
• Getting time and effort from some already overworked system managers has been tricky, with sociological barriers: “Won’t letting other people use my system just mean I will have less available for me?”
34 Notes to think about!
• Choose your test application carefully.
• Choose your first test users even more carefully!
• One user with a bad experience is worth 10 with good experiences.
  – Grid has been very over-hyped, so people expect it all to work first time, every time!
35 Future Directions within Bristol
• Make sure the rest of the University clusters are installed and running on the UoBGrid as quickly as possible.
• Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible. This will give ~800 CPUs.
• Start accepting users from outside the University as part of our commitment to the National Grid Service.
• Run the Bristol systems as part of the WUNGrid.
36 Further Information
• Centre for e-Research Bristol: http://escience.bristol.ac.uk
• Email: [email protected]
• Telephone: +44 (0)117 928 8769