grid and e-science technologies
DESCRIPTION
Grid and e-Science Technologies. Simon Cox Technical Director Southampton Regional e-Science Centre. Summary. The Grid problem : Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations - PowerPoint PPT PresentationTRANSCRIPT
Grid and e-Science Technologies
Simon CoxTechnical Director
Southampton Regional e-Science Centre
Summary The Grid problem: Resource sharing & coordinated problem
solving in dynamic, multi-institutional virtual organizations Grid architecture: Protocol, service definition for
interoperability & resource sharing Grid Middleware
Globus Toolkit a source of protocol and API definitions—and reference implementations
Open Grid Services Architecture represents next step in evolution Condor High throughput Computing Web Services & W3C leveraging e-business
e-Science Projects applying Grid concepts to applications
Grid Computing
The Grid Problem“Flexible, secure, coordinated resource sharing among
dynamic collections of individuals, institutions, and resource”
- “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke
Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals - assuming the absence of … central location central control omniscience existing trust
Why Grids? (1) e-Science A biochemist exploits 10,000 computers to screen
100,000 compounds in an hour 1,000 physicists worldwide pool resources for peta-op
analyses of petabytes of data Civil engineers collaborate to design, execute, &
analyze shake table experiments Climate scientists visualize, annotate, & analyze
terabyte simulation datasets An emergency response team couples real time data,
weather model, population data
Grid Communities & Applications:Data Grids for High Energy Physics
Tier2 Centre ~1 TIPS
Online System
Offline Processor Farm
~20 TIPS
CERN Computer Centre
FermiLab ~4 TIPSFrance Regional Centre
Italy Regional Centre
Germany Regional Centre
InstituteInstituteInstituteInstitute ~0.25TIPS
Physicist workstations
~100 MBytes/sec
~100 MBytes/sec
~622 Mbits/sec
~1 MBytes/sec
There is a “bunch crossing” every 25 nsecs.
There are 100 “triggers” per second
Each triggered event is ~1 MByte in size
Physicists work on analysis “channels”.
Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
Physics data cache
~PBytes/sec
~622 Mbits/sec or Air Freight (deprecated)
Tier2 Centre ~1 TIPS
Tier2 Centre ~1 TIPS
Tier2 Centre ~1 TIPS
Caltech ~1 TIPS
~622 Mbits/sec
Tier 0Tier 0
Tier 1Tier 1
Tier 2Tier 2
Tier 4Tier 4
1 TIPS is approximately 25,000
SpecInt95 equivalents
www.griphyn.org www.ppdg.net www.eu-datagrid.org
Network for EarthquakeEngineering Simulation
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
On-demand access to experiments, data streams, computing, archives, collaboration
NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago
tomographic reconstruction
real-timecollection
wide-areadissemination
desktop & VR clients with shared controls
Advanced Photon Source
Online Access to Scientific Instruments
archival storage
Why Grids? (2) e-Business Engineers at a multinational company collaborate on
the design of a new product A multidisciplinary analysis in aerospace couples code
and data in four companies An insurance company mines data from partner
hospitals for fraud detection An application service provider offloads excess load to
a compute cycle provider An enterprise configures internal & external resources
to support e-Business workload
Grids: Why Now?
Moore’s law highly functional end-systems Ubiquitous Internet universal connectivity Network exponentials produce dramatic changes in
geometry and geography 9-month doubling: double Moore’s law! 1986-2001: x340,000; 2001-2010: x4000?
New modes of working and problem solving emphasize teamwork, computation
New business models and technologies facilitate outsourcing
Elements of the Problem Resource sharing
Computers, storage, sensors, networks, … Heterogeneity of device, mechanism, policy Sharing conditional: negotiation, payment, …
Coordinated problem solving Integration of distributed resources Compound quality of service requirements
Dynamic, multi-institutional virtual organisations Dynamic overlays on classic org structures Map to underlying control mechanisms
http://www.globus.org/research/papers/anatomy.pdf
The Grid World: Current Status Dozens of major Grid projects in scientific & technical
computing/research & education Deployment, application, technology
Some consensus on key concepts and technologies Open source Globus Toolkit™ a de facto standard for major
protocols & services Far from complete or perfect, but out there, evolving rapidly,
and large tool/user base
Global Grid Forum a significant force Industrial interest emerging rapidly
http://www.gridforum.org
Grid Middleware(coordinate and authenticate use of grid services)
Globus (and GGF grid-computing protocols) Security Infrastructure (GSI) Resource Allocation Mechanism (GRAM) Resource Information System (GRIS) Index Information Service (GIIS) Grid-FTP Metadirectory service (MDS 2.0+) coupled to LDAP server
Condor (distributed high performance throughput system) Condor-G allows us to handle dispatching jobs to our Globus system Active collaboration from with the Condor development team at
University of Wisconsin (Miron Livny)
The Globus ProjectMaking Grid computing a reality
Close collaboration with real Grid projects in science and industry Development and promotion of standard Grid protocols to enable
interoperability and shared infrastructure Development and promotion of standard Grid software APIs and SDKs
to enable portability and code sharing The Globus Toolkit: Open source, reference software base for building
grid infrastructure and applications Global Grid Forum: Development of standard protocols and APIs for
Grid computing
http://www.gridforum.orghttp://www.globus.org
Four Key Protocols The Globus Toolkit centers around four key protocols
Connectivity layer: Security: Grid Security Infrastructure (GSI)
Resource layer: Resource Management: Grid Resource Allocation Management
(GRAM) Information Services: Grid Resource Information Protocol
(GRIP) Data Transfer: Grid File Transfer Protocol (GridFTP)
User
Userprocess #1
Proxy
Authenticate & create proxy
credential
GSI(Grid
Security Infrastruc-
ture)
Gatekeeper(factory)
Reliable remote
invocation
GRAM(Grid Resource Allocation & Management)
Reporter(registry +discovery)
Userprocess #2Proxy #2
Create process Register
The Globus Toolkit in One Slide Grid protocols (GSI, GRAM, …) enable resource sharing within virtual orgs; toolkit provides reference implementation ( = Globus Toolkit services)
Protocols (and APIs) enable other tools and services for membership, discovery, data mgmt, workflow, …
Other service(e.g. GridFTP)
Other GSI-authenticated remote service
requests
GIIS: GridInformationIndex Server (discovery)
MDS-2(Meta Directory Service)
Soft stateregistration;
enquiry
Globus Toolkit: Evaluation (+)
Good technical solutions for key problems, e.g. Authentication and authorization Resource discovery and monitoring Reliable remote service invocation High-performance remote data access
This + good engineering is enabling progress Good quality reference implementation, multi-language
support, interfaces to many systems, large user base, industrial support
Growing community code base built on tools
Globus Toolkit: Evaluation (-) Protocol deficiencies, e.g.
Heterogeneous basis: HTTP, LDAP, FTP No standard means of invocation, notification, error
propagation, authorization, termination, …
Significant missing functionality, e.g. Databases, sensors, instruments, workflow, … Virtualization of end systems (hosting envs.)
Little work on total system properties, e.g. Dependability, end-to-end QoS, … Reasoning about system properties
4
http:/ / www.cs.wisc.edu/ condor
The Condor Project (Established ‘85)Distributed High Throughput Computing researchperformed by a team of ~25 faculty, full t ime staffand students who:
h f ace sof tware engineering challenges in adistributed UNI X/ Linux/ NT environment,
h are involved in national and internationalcollaborations,
h actively interact with academic and commercialusers,
h maintain and support a large distributedproduction environment,
h and educate and train students.Funding – US Govt. (DoD, DoE, NASA, NSF),AT&T, IBM, I NTEL, Microsoft UW- Madison
What is Condor? Condor converts collections of distributively owned
workstations and dedicated clusters into a distributed high-throughput computing facility.
Condor uses ClassAd Matchmaking to make sure that everyone is happy.
Features Unix and NT Operational since 1986 Manages more than 1300 CPUs at UW-Madison Software available free on the web More than 150 Condor installations worldwide in academia
and industry Non-dedicated resources Job checkpoint and migration
What is High-Throughput Computing?
High-performance: CPU cycles/second under ideal circumstances. “How fast can I run simulation X on this machine?”
High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. “How fast can I run simulation X on this machine?” “How many times can I run simulation X in the next
month using all available machines?”
Some HTC Challenges Condor does whatever it takes to run your jobs,
even if some machines… Crash (or are disconnected) Run out of disk space Don’t have your software installed Are frequently needed by others Are far away & managed by someone else
What is ClassAd Matchmaking? Condor uses ClassAd Matchmaking to make sure
that work gets done within the constraints of both users and owners.
Users (jobs) have constraints: “I need an Alpha with 256 MB RAM”
Owners (machines) have constraints: “Only run jobs when I am away from my desk and
never run jobs owned by Bob.”
Condor Pool Architecture
Mathematicians Solve NUG30 Looking for the solution to
the NUG30 quadratic assignment problem
An informal collaboration of mathematicians and computer scientists
Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)
14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
What Is Condor-G? Enhanced version of Condor that provides robust job
management for Globus Toolkit Robust replacement for globusrun Provides extensive fault-tolerance Brings Condor’s job management features to Globus jobs
Two Parts Globus Universe GlideIn
Excellent example of applying the general purpose Globus Toolkit to solve a particular problem (i.e. high-throughput computing) on the Grid
Why Use Condor-G Condor
Designed to run jobs within a single administrative domain
Globus Toolkit Designed to run jobs across many administrative domains
Condor-G Combine the strengths of both
Web Services Increasingly popular standards-based framework for accessing network
applications W3C standardization; Microsoft, IBM, Sun, others
XML and XML Schema Representing data in a portable format
WSDL: Web Services Description Language Interface Definition Language for Web services
SOAP: Simple Object Access Protocol XML-based RPC protocol; common WSDL target
WSDL (/ WS-Inspection) Conventions for locating service descriptions
UDDI: Universal Description, Discovery, & Integration Directory for Web services
Structure Structure (XML Schemas)(XML Schemas)
XML Web Services FrameworkXML Web Services Framework
WireWire DescriptionDescription DiscoveryDiscovery
Syntax (XML)Syntax (XML)
Envelope & Envelope & Extensibility Extensibility
(SOAP)(SOAP)InspectionInspection
(DISCO)(DISCO)
Directory (UDDI)Directory (UDDI)Service Service
DescriptionDescription(WSDL)(WSDL)
Process Process OrchestrationOrchestration
(XLANG)(XLANG)AttachmentsAttachments
W3C W3C RecRec
W3C WGW3C WG
FutureFuture
SecuritySecurity
RoutingRouting
ReliabilityReliability
ServiceServiceDescriptionDescription
(WSDL)(WSDL)
ProcessProcessOrchestrationOrchestration
(XLANG)(XLANG)
“New” GlobusOpen Grid Services Architecture (OGSA)
Service orientation to virtualize resources From Web services:
Standard interface definition mechanisms: multiple protocol bindings, multiple implementations, local/remote transparency
Building on Globus Toolkit: Grid service: semantics for service interactions Management of transient instances (& state) Factory, Registry, Discovery, other services Reliable and secure transport
Multiple hosting targets: J2EE, .NET, “C”, …http://www.globus.org/research/papers/ogsa.pdf
http://www.globus.org/research/papers/gsspec.pdf
OGSA Service Model System comprises (a typically few) persistent services &
(potentially many) transient services All services adhere to specified Grid service interfaces and
behaviours Reliable invocation, lifetime management, discovery,
authorization, notification, upgradeability, concurrency, manageability
Interfaces for managing Grid service instances Factory, registry, discovery, lifetime, etc.
=> Reliable, secure management of distributed state
Using OGSAto Construct Grid Environments
Factory RegistryService
FactoryH2R
Mapper
ServiceService Service ...
...
(a) Simple HostingEnvironment
Factory RegistryService
FactoryH2R
Mapper
ServiceService Service ...
...
F R
F M
SS S
F R
F M
SS S
(b) Virtual HostingEnvironment
E2EFactory
E2E Reg
E2E H2RMapper
...
F1
R
M
SS S
F2
R
M
SS S
E2E S E2E S E2E S
(c) Compound Services
In each case, Registry handle is effectively the uniquename for the virtual organization.
Evolution of Globus Initial exploration (1996-1999; Globus 1.0)
Extensive application experiments; core protocols
Data Grids (1999-??; Globus 2.0+) Large-scale data management and analysis
Open Grid Services Architecture (2001-??, Globus 3.0) Integration with Web services, hosting environments,
resource virtualization Databases, higher-level services
Radically scalable systems (2003-??) Sensors, wireless, ubiquitous computing
Summary The Grid problem: Resource sharing & coordinated problem
solving in dynamic, multi-institutional virtual organizations Grid architecture: Protocol, service definition for
interoperability & resource sharing Grid Middleware
Globus Toolkit a source of protocol and API definitions—and reference implementations
Open Grid Services Architecture represents next step in evolution Condor High throughput Computing Web Services & W3C leveraging e-business
e-Science Projects applying Grid concepts to applications