Session 33 - Production Grids
TRANSCRIPT
Overview of Production Grids
Steven Newhouse
Contents
• Open Science Grid
• DEISA
• NAREGI
• Nordic DataGrid Facility
• EGEE
• TeraGrid
• EGI
OPEN SCIENCE GRID
Ruth Pordes
Open Science Grid
Consortium - >100 member organizations contributing resources, software, applications, and services.
Project:
– Funded by DOE and NSF to deliver to the OSG Consortium for 5 years (2006-2011), 33 FTEs.
– VO science deliverables are OSG's milestones.
– Collaboratively focused: partnerships, international connections, multidisciplinary.
Satellites - independently funded projects contributing to the OSG Consortium program and vision:
• CI-Team User and Campus Engagement
• VOSS study of Virtual Organizations
• CILogon integration of end-point Shibboleth identity management into the OSG infrastructure
• Funding for students to the International Summer School for Grid Computing 2009
OSG & Internet2 Work Closely with Universities
• ~100 compute resources
• ~20 storage resources
• ~70 modules in the software stack
• ~35 user communities (VOs)
• 600,000-900,000 CPU hours/day, 200K-300K jobs/day, >2000 users
• ~5 other infrastructures, ~25 resource sites
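As a back-of-the-envelope check (not from the slides), these throughput figures imply fairly short jobs on average; the sketch below simply divides the quoted CPU-hours/day by the quoted jobs/day, using illustrative midpoints of the ranges.

```python
# Back-of-the-envelope estimate of mean OSG job length from the quoted rates.
# The midpoints chosen here are illustrative, not figures from the slides.
cpu_hours_per_day = (600_000 + 900_000) / 2   # quoted range: 600,000-900,000 CPU hours/day
jobs_per_day = (200_000 + 300_000) / 2        # quoted range: 200K-300K jobs/day

mean_job_length_hours = cpu_hours_per_day / jobs_per_day
print(f"~{mean_job_length_hours:.1f} CPU-hours per job on average")
```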
Users
• Nearly all applications are High Throughput; a small number of users are starting MPI production use.
• Major accounts: US LHC, LIGO.
• ATLAS, CMS: >3000 physicists each. US ATLAS & US CMS Tier-1, 17 Tier-2s, and a new focus on Tier-3s (~35 today, expect ~70 in a year). ALICE taskforce to show usability of the OSG infrastructure for their applications. LIGO Einstein@Home.
• US physics community: Tevatron - CDF & D0 at FNAL and remote sites. Other Fermilab users - neutrino, astro, simulation, theory. STAR. IceCube.
• Non-physics: ~6% of usage. ~25 single PIs or small groups from biology, molecular dynamics, chemistry, weather forecasting, mathematics, protein prediction. Campus infrastructures: ~7, including universities and labs.
• Non-physics use is highly cyclic.
Operations
• All hardware contributed by members of the Consortium.
• Distributed operations infrastructure including security, monitoring, registration, accounting services, etc.
• Central ticketing system, 24x7 problem reporting and triaging at the Grid Operations Center.
• Distributed set of Support Centers as the first line of support for VOs, services (e.g. software) and Sites.
• Security incident response teams include Site Security Administrators and VO Security Contacts.
• Software distribution, patches (security) and updates.
• Targeted Production, Site and VO support teams.
OSG Job Counts (2008-09)
[Chart: ~300K jobs/day; 100M jobs]
Software
OSG Virtual Data Toolkit (VDT): a packaged, tested, distributed, supported software stack used by multiple projects - OSG, EGEE, NYSGrid, TG, APAC, NGS.
• ~70 components covering Condor, Globus, security infrastructure, data movement, storage implementations, job management and scheduling, network monitoring tools, validation and testing, monitoring/accounting/information, and needed utilities such as Apache and Tomcat (a hedged job-submission sketch follows this slide).
• Server, User Client, and Worker-Node/Application Client releases.
• Built and regression tested using the University of Wisconsin-Madison Metronome system.
• Pre-release testing on 3 "VTB" sites - UofC, LBNL, Caltech.
• Post-release testing of major releases on the Integration Testbed.
• Distributed team at the University of Wisconsin, Fermilab, LBNL.
Improved support for incremental upgrades in the OSG 1.2 release, summer '09.
OSG configuration and validation scripts are distributed to use the VDT.
OSG does not develop software except for tools and contributions (extensions) to external software projects delivering to OSG stakeholder requirements.
Identified liaisons provide bi-directional support and communication between OSG and External Software Provider projects.
The OSG Software Tools Group oversees all software developed within the project.
Software vulnerability and auditing processes are in place.
VDT Progress (1.10.1 Just Released)
[Chart: VDT growth to ~70 components]
Partnerships and Collaborations
• Partnerships with network fabric and identity service providers - ESnet, Internet2.
• Continuing bridging work with EGEE, SURAgrid, TeraGrid.
• ~17 points of contact/collaboration with EGEE and WLCG.
• Partnership statement for EGI/NGIs.
• Emerging collaborations with TeraGrid on workforce training, software, and security.
• Creator (co-sponsor) of the successful e-weekly (International) Science Grid This Week.
• Co-sponsor of this ISSGC'09 school.
• Member of the Production Infrastructure Policy Group (OGF-affiliated).
Community Collaboratories
DEISA: ADVANCING SCIENCE IN EUROPE
H. Lederer, A. Streit, J. Reetz - DEISA
www.deisa.eu
DEISA consortium and partners
– Eleven Supercomputing Centres in Europe:
BSC, CSC, CINECA, ECMWF, EPCC, FZJ, HLRS, IDRIS, LRZ, RZG, SARA
– Four associated partners: CEA, CSCS, JSCC, KTH
Co-funded by the European Commission under DEISA2 contract RI-222919
Infrastructure and Services
• HPC infrastructure with heterogeneous resources
• State-of-the-art supercomputers:
– Cray XT4/5, Linux
– IBM Power5, Power6, AIX / Linux
– IBM BlueGene/P, Linux
– IBM PowerPC, Linux
– SGI Altix 4700 (Itanium2 Montecito), Linux
– NEC SX8/9 vector systems, Super-UX
• More than 1 PetaFlop/s of aggregated peak performance
• Dedicated network, 10 Gb/s links provided by GEANT2 and NRENs
• Continental shared high-performance filesystem (GPFS-MC, IBM)
• HPC systems are owned and operated by national HPC centres
• DEISA services are layered and operated on top
• Fixed fractions of the HPC resources are dedicated to DEISA
• Europe-wide coordinated expert teams for operation, technology developments, and application enabling and support
• HPC applications
– from various scientific fields: astrophysics, earth sciences, engineering, life sciences, materials sciences, particle physics, plasma physics
– require capability computing facilities (low-latency, high-throughput interconnect), often with application enabling and support
• Resources granted through:
– DEISA Extreme Computing Initiative (DECI, annual calls)
  • DECI call 2008: 42 proposals accepted, 50 million CPU-h granted*
  • DECI call 2009: 75 proposals, more than 200 million CPU-h requested* (proposals currently under review)
  *) normalized to IBM P4+; an illustrative normalization sketch follows this slide
  Over 160 universities and research institutes from 15 European countries, with co-investigators from four other continents, have already benefited.
– Virtual Science Community Support
  • 2008: EFDA, EUFORIA, VIROLAB
  • 2009: EFDA, EUFORIA, ENES, LFI-PLANCK, VPH/VIROLAB, VIRGO
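The CPU-hour totals above are normalized to an IBM P4+ reference system. The sketch below illustrates one plausible way such a normalization could be computed; the per-system conversion factors and hour counts are invented for illustration and are not DEISA's actual values.

```python
# Illustrative normalization of granted CPU-hours to a reference system (IBM P4+).
# The conversion factors and raw hour counts below are hypothetical placeholders.
conversion_to_p4plus = {
    "IBM Power6": 2.0,      # e.g. one Power6 core-hour counted as 2 P4+ core-hours
    "Cray XT4": 1.5,
    "SGI Altix 4700": 1.2,
}

granted = [("IBM Power6", 4_000_000), ("Cray XT4", 6_000_000)]  # (system, raw CPU-h)

normalized_total = sum(conversion_to_p4plus[system] * hours for system, hours in granted)
print(f"{normalized_total:,.0f} normalized CPU-hours (IBM P4+ equivalent)")
```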
[Chart: HPC resource usage]
Middleware
• Various services are provided at the middleware layer:
– DEISA Common Production Environment (DCPE): a homogeneous software environment layer for the heterogeneous HPC platforms
– High-performance data stage-in/-out to GPFS: GridFTP (a hedged staging sketch follows this slide)
– Workflow management: UNICORE
– Job submission:
  • UNICORE
  • WS-GRAM (optional)
  • Interactive usage of local batch systems
  • Remote job submission between IBM P6/AIX systems (LL-MC)
– Monitoring system: INCA
– Unified AAA: distributed LDAP and resource usage databases
• Only a few software components are developed within DEISA
– Focus on technology evaluation, deployment and operation
– Bugs are reported to the software maintainers
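As a hedged illustration of the GridFTP stage-in mentioned above, the sketch below wraps the standard globus-url-copy client; the host name and paths are hypothetical, and a valid grid proxy certificate is assumed to exist.

```python
# Sketch: stage a file into a GPFS project area with GridFTP via globus-url-copy.
# Hostname and paths are hypothetical; a valid X.509 proxy certificate is assumed.
import subprocess

source = "file:///home/user/input/config.dat"
destination = "gsiftp://gridftp.example-hpc-centre.eu/gpfs/project/abc/config.dat"

# -p 4 requests four parallel TCP streams for higher throughput on fast links.
subprocess.run(["globus-url-copy", "-p", "4", source, destination], check=True)
```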
Standards
• DEISA has a vital interest in the standardization of interfaces to HPC services
– Job submission, job and workflow management, data management, data access and archiving, networking and security (including AAA)
• DEISA supports OGF standardization groups
– JSDL-WG and OGSA-BES for job submission (a minimal JSDL sketch follows this slide)
– UR-WG and RUS-WG for accounting
– DAIS for data services
– Engagement in the Production Grid Infrastructure WG
• DEISA collaboration in standardization with other projects
– GIN community
– Infrastructure Policy Group (DEISA, EGEE, TeraGrid, OSG, NAREGI)
  Goal: achievement of seamless interoperation of leading Grid infrastructures worldwide
  - Authentication, Authorization, Accounting (AAA)
  - Resource allocation policies
  - Portal / access policies
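To make the JSDL/OGSA-BES item above concrete, here is a minimal sketch that builds a bare-bones JSDL job description with Python's standard library; the executable and argument are placeholders, and a real submission to a BES endpoint would wrap a document like this in the appropriate service call.

```python
# Minimal JSDL 1.0 job description built with the standard library.
# The executable and argument are placeholders for illustration.
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

job_definition = ET.Element(f"{{{JSDL}}}JobDefinition")
job_description = ET.SubElement(job_definition, f"{{{JSDL}}}JobDescription")
application = ET.SubElement(job_description, f"{{{JSDL}}}Application")
posix_app = ET.SubElement(application, f"{{{POSIX}}}POSIXApplication")
ET.SubElement(posix_app, f"{{{POSIX}}}Executable").text = "/bin/echo"
ET.SubElement(posix_app, f"{{{POSIX}}}Argument").text = "hello-deisa"

print(ET.tostring(job_definition, encoding="unicode"))
```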
STATUS OF CSI GRID (NAREGI)
Kento Aida, National Institute of Informatics
Overview
Current status: pilot operation started in May 2009.
Organization:
• Computer centers in 9 universities: resource providers
• National Institute of Informatics: network provider (SINET 3) and GOC
Funding: organizations' own funding
Operational Infrastructure (sites and roles)
• Information Initiative Center, Hokkaido University: resource provider (Linux cluster)
• Cyberscience Center, Tohoku University: resource provider (vector computer)
• Center for Computational Sciences, University of Tsukuba: resource provider (Linux cluster)
• Information Technology Center, University of Tokyo: resource provider (Linux cluster)
• Global Scientific Information and Computing Center, Tokyo Institute of Technology: resource provider (Linux cluster)
• Information Technology Center, Nagoya University: resource provider (Linux cluster)
• Academic Center for Computing and Media Studies, Kyoto University: resource provider (Linux cluster)
• Cybermedia Center, Osaka University: resource provider (Linux cluster, vector computer)
• Research Institute for Information Technology, Kyushu University: resource provider (Linux cluster)
• National Institute of Informatics: network provider (SINET 3), GOC (CA, helpdesk)
Middleware
• NAREGI middleware Ver. 1.1.3
• Developer: National Institute of Informatics ( http://middleware.naregi.org/Download/ )
• Platform: CentOS 5.2 + PBS Pro 9.1/9.2; OpenSUSE 10.3 + Sun Grid Engine v6.0
NORDIC DATAGRID FACILITY
Michael Gronager
NDGF Organization
• A co-operative Nordic data and computing grid facility
• Nordic production grid, leveraging national grid resources
• Common policy framework for the Nordic production grid
• Joint Nordic planning and coordination
• Operate the Nordic storage facility for major projects
• Co-ordinate & host major eScience projects (i.e., the Nordic WLCG Tier-1)
• Contribute to grid middleware and develop services
NDGF 2006-2010: funded (2 M€/year) by the National Research Councils of the Nordic countries
[Diagram: NOS-N and the national research councils of DK, FI, NO, SE and IS behind the Nordic Data Grid Facility]
[Figure: NDGF Facility - 2009Q1]
[Figure: NDGF People - 2009Q2]
Application Communities
• WLCG - the Worldwide LHC Computing Grid
• Bio-informatics sciences
• Screening of reservoirs suitable for CO2 sequestration
• Computational chemistry
• Material science
• And the more horizontal: common Nordic user administration, authentication, authorization & accounting
Operations
• Operation team of 5-7 people
• Collaboration between NDGF, SNIC and NUNOC
• Expert support 365 days a year; 24x7 coverage by the regional REN
• Distributed over the Nordics
• Runs: rCOD + ROC for the Nordic + Baltic region
• Distributed sites (T1, T2s); sysadmins well known by the operation team
• Continuous chatroom meetings
Middleware
Philosophy: we need tools to run an e-Infrastructure. Tools cost money or in-kind effort, and in kind means Open Source tools - hence we contribute to the things we use:
• dCache (storage) - a DESY, FNAL, NDGF ++ collaboration
• ARC (computing) - a collaboration between Nordic, Slovenian and Swiss institutes
• SGAS (accounting) and Confusa (client certificates from IdPs)
• BDII, WMS, SAM, AliEn, Panda - gLite/CERN tools
• MonAmi, Nagios (monitoring)
NDGF now and in the future
• The e-Infrastructure as a whole is important: resources count - capacity and capability computing, and different network and storage systems
• The infrastructure must support different access methods (grid, ssh, application portals, etc.) - note that the average grid use of shared resources is only around 10-25%
• Uniform user management, identity, access, accounting, policy enforcement, and resource allocation and sharing - independent of access method, for all users
ENABLING GRIDS FOR E-SCIENCE
Steven Newhouse
[Diagram: networking, middleware and service activities link new scientific communities with established user communities]
Effort distribution:
• Grid operations & networking support: 51%
• User community support: 19%
• Training: 8%
• Middleware engineering: 5%
• Integration and testing: 9%
• Management: 2%
• Dissemination & international cooperation: 6%
• Duration: 2 years
• Total budget:
– Staff: ~47 M€
– H/W: ~50 M€
– EC contribution: 32 M€
• Total effort:
– 9,132 person-months
– ~382 FTE
Project Overview
• 17,000 users
• 136,000 LCPUs (cores)
• 25 PB disk
• 39 PB tape
• 12 million jobs/month (+45% in a year)
• 268 sites (+5% in a year)
• 48 countries (+10% in a year)
• 162 VOs (+29% in a year)
Supporting Science
• Archeology
• Astronomy
• Astrophysics
• Civil Protection
• Comp. Chemistry
• Earth Sciences
• Finance
• Fusion
• Geophysics
• High Energy Physics
• Life Sciences
• Multimedia
• Material Sciences
Resource Utilisation
• Proportion of HEP usage: ~77%
End-user activity:
• 13,000 end-users in 112 VOs
• +44% users in a year
• 23 core VOs (a core VO has >10% of usage within its science cluster; see the sketch below)
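A small sketch of the "core VO" criterion stated above: a VO counts as core if it accounts for more than 10% of the usage within its science cluster. The usage figures below are invented purely to show the calculation.

```python
# Sketch of the "core VO" criterion: >10% of usage within the VO's science cluster.
# The usage figures here are invented for illustration only.
from collections import defaultdict

# (VO name, science cluster, CPU-hours used)
usage = [
    ("vo-a", "life-sciences", 8_000),
    ("vo-b", "life-sciences", 1_500),
    ("vo-c", "life-sciences", 500),
    ("vo-d", "fusion", 9_000),
]

cluster_totals = defaultdict(float)
for _, cluster, hours in usage:
    cluster_totals[cluster] += hours

core_vos = [vo for vo, cluster, hours in usage
            if hours / cluster_totals[cluster] > 0.10]
print(core_vos)  # vo-c falls below the 10% threshold within its cluster
```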
[Chart: non-HEP usage by discipline (Other Areas, Fusion, Earth Science, Astronomy & Astrophysics, Multidisciplinary, Life Sciences, Computational Chemistry) as a percentage of total usage, comparing March 2007 to February 2008 with March 2008 to February 2009]
Operations
• Monitored 24x7 on a regional basis
• Central help desk for all issues
– Filtered to regional and specialist support units
[Chart: site reliability over time - average reliability, reliability of the top 20% of sites, and reliability of the top 50% of sites, plotted against the target]
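One plausible reading of the "Top 20%" and "Top 50%" curves in the reliability chart is the mean reliability over the best-performing 20% (or 50%) of sites; the sketch below computes that reading over invented per-site figures, so treat the definition as an assumption rather than the project's own metric.

```python
# Assumed reading of the reliability metrics: mean reliability over the
# best-performing fraction of sites. Site values are invented for illustration.
def top_fraction_mean(reliabilities, fraction):
    ranked = sorted(reliabilities, reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return sum(ranked[:count]) / count

site_reliability = [0.98, 0.95, 0.93, 0.90, 0.85, 0.80, 0.75, 0.60, 0.55, 0.40]

print(f"average      : {sum(site_reliability) / len(site_reliability):.2f}")
print(f"top 20% sites: {top_fraction_mean(site_reliability, 0.20):.2f}")
print(f"top 50% sites: {top_fraction_mean(site_reliability, 0.50):.2f}")
```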
gLite Middleware
[Architecture diagram distinguishing EGEE-maintained components from external components]
• Security services: Virtual Organisation Membership Service (VOMS), Authz. Service, SCAS, Proxy Server, LCAS & LCMAPS
• Information services: BDII, MON
• General services: LHC File Catalogue, Hydra, Workload Management Service, File Transfer Service, Logging & Bookkeeping Service, AMGA
• Storage Element: Disk Pool Manager, dCache
• Compute Element: CREAM, LCG-CE, gLExec, BLAH, Worker Node
• User access: User Interface
• Physical resources
• Associated standards: JSDL & BES, GLUE 2.0, DMI, SRM, SAML, X.509 attributes
TERAGRID
Daniel S. Katz
TeraGrid Overview
• TeraGrid is run by 11 resource providers (RPs) and integrated by the Grid Infrastructure Group (GIG, at the University of Chicago)
– The TeraGrid Forum (made up of these 12 entities) decides policy by consensus (elected chair is John Towns, NCSA)
• Funding is by separate awards from the National Science Foundation to the 12 groups
– The GIG sub-awards integration funding to the 11 RPs and some additional groups
• Resources (distributed at the 11 RPs across the United States, connected by 10 Gbps paths):
– 14 HPC systems (1.6 PFlops, 310 TBytes memory)
– 1 HTC pool (105,000 CPUs)
– 7 storage systems (3.0 PBytes on-line, 60 PBytes off-line)
– 2 viz systems (128 tightly integrated CPUs, 14,000 loosely coupled CPUs)
– Special purpose systems (GPUs, FPGAs)
Applications Community (2008)
Use modality (community size, rough estimate of number of users):
• Batch computing on individual resources: 850
• Exploratory and application porting: 650
• Workflow, ensemble, and parameter sweep: 250
• Science gateway access: 500
• Remote interactive steering and visualization: 35
• Tightly-coupled distributed computation: 10 (2006)
Primarily HPC usage, but growing use of science gateways and workflows, and lesser HTC usage.
Operations Infrastructure
• Lots of services to keep this all together
– Keep most things looking like one system to the users, including:
  • Allocations, helpdesk, accounting, web site, portal, security, data movement, information services, resource catalog, science gateways, etc.
– Working on, but not yet in production:
  • Single global file system, identity management integrated with universities
• Services are supported by the GIG; resources are supported by the RPs
Middleware
• The Coordinated TeraGrid Software Stack is made of kits
– All but one (Core Integration) are optional for RPs
– Kits define a set of functionality and provide an implementation
– Optional kits: Data Movement, Remote Login Capability, Science Workflow Support, Parallel Application Capability, Remote Compute, Application Development and Runtime Support Capability, Metascheduling Capability, Data Movement Servers Capability, Data Management Capability, Data Visualization Support, Data Movement Clients Capability, Local Resource Provider HPC Software, Wide Area GPFS File Systems, Co-Scheduling Capability, Advance Reservation Capability, Wide Area Lustre File Systems, Science Gateway Kit
– Current status is shown as a matrix (kits along the top, resources along the left side; yellow means the kit is installed, white means it is not); a data-structure sketch of such a matrix follows below
– Some kits are now being rolled out (Science Gateway) and will become more widely used; some have limited functionality (Data Visualization Support) that only makes sense on some resources
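The installation-status matrix described above (kits along the top, resources down the side) can be pictured as a simple mapping; the resource names and kit selections below are illustrative placeholders, not the actual TeraGrid status table.

```python
# Sketch of the kit-installation matrix as a mapping from resource to installed kits.
# Resource names and kit selections are illustrative placeholders only.
kit_status = {
    "resource-a": {"Core Integration", "Remote Login Capability", "Data Movement"},
    "resource-b": {"Core Integration", "Data Movement", "Co-Scheduling Capability"},
    "resource-c": {"Core Integration", "Science Gateway Kit"},
}

def resources_with_kit(kit_name):
    """Return the resources on which a given kit is installed (the "yellow" cells)."""
    return sorted(r for r, kits in kit_status.items() if kit_name in kits)

print(resources_with_kit("Data Movement"))
```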
TeraGrid
• TeraGrid considers itself the world's largest open scientific computing infrastructure
– Usage is free; allocations are peer-reviewed and available to all US researchers and their collaborators
• TeraGrid is a platform on which others can build
– Application developers
– Science Gateways
• TeraGrid is a research project
– Learning how to do distributed, collaborative science on a continental-scale, federated infrastructure
– Learning how to run multi-institution shared infrastructure
Common Characteristics
• Operating a production grid requires WORK
– Monitoring, reporting, chasing, ...
• No 'off the shelf' software solution
– Plenty of components... but they need verified assembly!
• No central control
– Distributed expertise leads to distributed teams
– Resources are federated; ownership lies elsewhere
• No ownership by the grid of hardware resources
• All driven by delivering to user communities
The Future in Europe: EGI
• EGI: European Grid Initiative
• Result of the EGI Design Study
– A 2-year project to build community consensus
• Move from project-based to sustainable funding
– Leverage other sources of funding
• Build on National Grid Initiatives (NGIs)
– Provide the European 'glue' around independent NGIs
The EGI Actors
[Diagram: EGI.eu coordinates the National Grid Initiatives (NGI1, NGI2, ..., NGIn); each NGI brings together Resource Centres, Research Teams and Research Institutes]
EGI.eu and NGI Tasks
[Diagram: tasks are divided into EGI.eu tasks, NGI international tasks and NGI national tasks, spanning EGI.eu and the NGIs]
Differences between EGEE & EGI
[Diagram: EGEE activities map onto EGI.eu, NGI operations, Specialised Support Centres and the European Middleware Initiative (EMI)]
Middleware
• EGI will release the UMD
– Unified Middleware Distribution
– Components needed to build a production grid
– Initial main providers: ARC, gLite & UNICORE
– Expect components to evolve over time
– Interfaces have been defined to enable multiple providers
• EMI project from ARC, gLite & UNICORE
– Supports, maintains & harmonises the software
– Introduction & development of standards
Current Status
• 8th July: EGI Council meeting
– Confirmation of the Interim Director
– Establish the editorial team for the EC proposals
• 30th July: EC call opens
• 1st October: financial contributions to the EGI Collaboration due
• October/November: EGI.eu established
• 24th November: EC call closes
• December 2009/January 2010: EGI.eu startup phase
• Winter 2010: negotiation phase for EGI projects
• 1st May 2010: EGI projects launched
European Future?
• Sustainability
– E-Infrastructure is vital
– It will underpin many research activities
• Activity has to be driven by active stakeholders