Session 33 - Production Grids

Overview of Production Grids Steven Newhouse


TRANSCRIPT

Page 1: Session 33 - Production Grids

Overview of Production Grids

Steven Newhouse

Page 2: Session 33 - Production Grids

Contents

• Open Science Grid
• DEISA
• NAREGI
• Nordic DataGrid Facility
• EGEE
• TeraGrid
• EGI

Page 3: Session 33 - Production Grids

OPEN SCIENCE GRID
Ruth Pordes

Page 4: Session 33 - Production Grids

Open Science Grid

Consortium: >100 member organizations contributing resources, software, applications, services.

Project:
– Funded by DOE and NSF to deliver to the OSG Consortium for 5 years (2006-2011), 33 FTEs.
– VO science deliverables are OSG's milestones.
– Collaboratively focused: partnerships, international connections, multidisciplinary.

Satellites: independently funded projects contributing to the OSG Consortium program and vision:
• CI-Team User and Campus Engagement,
• VOSS study of Virtual Organizations,
• CILogon integration of end-point Shibboleth identity management into the OSG infrastructure,
• Funding for students to the International Summer School for Grid Computing 2009.

Page 5: Session 33 - Production Grids

[Slide credit: Paul Avery, TeraGrid'09 (Jun. 23, 2009)]

OSG & Internet2 work closely with universities:
• ~100 compute resources
• ~20 storage resources
• ~70 modules in the software stack
• ~35 user communities (VOs)
• 600,000-900,000 CPU-hours/day, 200K-300K jobs/day, >2000 users
• ~5 other infrastructures, ~25 resource sites

Page 6: Session 33 - Production Grids

Users

Nearly all applications are High Throughput. A small number of users are starting MPI production use. Major accounts: US LHC, LIGO.

ATLAS, CMS: >3000 physicists each. US ATLAS & US CMS Tier-1, 17 Tier-2s and a new focus on Tier-3s (~35 today, expect ~70 in a year). ALICE taskforce to show usability of the OSG infrastructure for their applications. LIGO: Einstein@Home.

US Physics Community: Tevatron - CDF & D0 at FNAL and remote sites. Other Fermilab users – neutrino, astro, simulation, theory. STAR. IceCube.

Non-Physics: ~6% of usage. ~25 single PIs or small groups from biology, molecular dynamics, chemistry, weather forecasting, mathematics, protein prediction. Campus infrastructures: ~7, including universities and labs.

Page 7: Session 33 - Production Grids

Non-physics use highly cyclic

Page 8: Session 33 - Production Grids

Operations

• All hardware contributed by members of the Consortium.
• Distributed operations infrastructure including security, monitoring, registration, accounting services, etc.
• Central ticketing system, 24x7 problem reporting and triaging at the Grid Operations Center.
• Distributed set of Support Centers as the first line of support for VOs, services (e.g. software) and sites.
• Security incident response teams include Site Security Administrators and VO Security Contacts.
• Software distribution, patches (security) and updates.
• Targeted Production, Site and VO support teams.

Page 9: Session 33 - Production Grids

[Chart: OSG Job Counts (2008-9) – 300K jobs/day, 100M jobs. Slide credit: Paul Avery, TeraGrid'09 (Jun. 23, 2009)]

Page 10: Session 33 - Production Grids

Software

OSG Virtual Data Toolkit (VDT): a packaged, tested, distributed, supported software stack used by multiple projects – OSG, EGEE, NYSGrid, TG, APAC, NGS.
• ~70 components covering Condor, Globus, security infrastructure, data movement, storage implementations, job management and scheduling, network monitoring tools, validation and testing, monitoring/accounting/information, and needed utilities such as Apache, Tomcat.
• Server, User Client, Worker-Node/Application Client releases.
• Build and regression tested using the U of Wisconsin-Madison Metronome system.
• Pre-release testing on 3 "VTB" sites – UofC, LBNL, Caltech.
• Post-release testing of major releases on the Integration Testbed.
• Distributed team at U of Wisconsin, Fermilab, LBNL.

Improved support for incremental upgrades in the OSG 1.2 release, summer '09. OSG configuration and validation scripts are distributed to use the VDT. OSG does not develop software except for tools and contributions (extensions) to external software projects delivering to OSG stakeholder requirements. Identified liaisons provide bi-directional support and communication between OSG and External Software Provider projects. The OSG Software Tools Group oversees all software developed within the project. Software vulnerability and auditing processes are in place.
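As an aside for readers unfamiliar with the stack: the VDT's job management layer is built around Condor. Purely as an illustrative sketch (not OSG's packaged configuration; file names are hypothetical, and an HTCondor client with condor_submit on the PATH is assumed), a minimal high-throughput job could be described and queued like this:

# Sketch: describe one vanilla-universe job and hand it to condor_submit.
# File names here are hypothetical examples.
import subprocess
from pathlib import Path

submit_description = """\
universe   = vanilla
executable = /bin/hostname
output     = job.out
error      = job.err
log        = job.log
queue
"""

Path("job.sub").write_text(submit_description)

# condor_submit parses the description and queues the job with the local scheduler;
# on a grid site the scheduling layer routes it onward.
subprocess.run(["condor_submit", "job.sub"], check=True)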

Page 11: Session 33 - Production Grids

[Chart: VDT Progress (1.10.1 just released) – ~70 components. Slide credit: Paul Avery, TeraGrid'09 (Jun. 23, 2009)]

Page 12: Session 33 - Production Grids

Partnerships and Collaborations

• Partnerships with network fabric and identity service providers – ESnet, Internet2.
• Continuing bridging work with EGEE, SuraGrid, TeraGrid.
• ~17 points of contact/collaboration with EGEE and WLCG.
• Partnership statement for EGI/NGIs.
• Emerging collaborations with TG on workforce training, software, security.
• Creator (co-sponsor) of the successful e-weekly (International) Science Grid This Week.
• Co-sponsor of this ISSGC'09 school.
• Member of the Production Infrastructure Policy Group (OGF affiliated).

Page 13: Session 33 - Production Grids

Community Collaboratories


Page 14: Session 33 - Production Grids

DEISA - ADVANCING SCIENCE IN EUROPE
H. Lederer, A. Streit, J. Reetz - DEISA

RI-222919

www.deisa.eu

Page 15: Session 33 - Production Grids

DEISA consortium and partners

– Eleven Supercomputing Centres in Europe:

BSC, CSC, CINECA, ECMWF, EPCC, FZJ, HLRS, IDRIS, LRZ, RZG, SARA

– Four associated partners: CEA, CSCS, JSCC, KTH


Co-Funded by the European Commission DEISA2 contract RI-222919


Page 16: Session 33 - Production Grids

Infrastructure and Services

• HPC infrastructure with heterogeneous resources
• State-of-the-art supercomputers:
– Cray XT4/5, Linux
– IBM Power5, Power6, AIX / Linux
– IBM BlueGene/P, Linux
– IBM PowerPC, Linux
– SGI ALTIX 4700 (Itanium2 Montecito), Linux
– NEC SX8/9 vector systems, Super-UX
• More than 1 PetaFlop/s of aggregated peak performance
• Dedicated network, 10 Gb/s links provided by GEANT2 and NRENs
• Continental shared high-performance filesystem (GPFS-MC, IBM)
• HPC systems are owned and operated by national HPC centres
• DEISA services are layered and operated on top
• Fixed fractions of the HPC resources are dedicated to DEISA
• Europe-wide coordinated expert teams for operation, technology developments, and application enabling and support


Page 17: Session 33 - Production Grids

HPC resource usage

• HPC Applications
– from various scientific fields: astrophysics, earth sciences, engineering, life sciences, materials sciences, particle physics, plasma physics
– require capability computing facilities (low latency, high throughput interconnect), often application enabling and support

• Resources granted through:
– DEISA Extreme Computing Initiative (DECI, annual calls)
  • DECI call 2008: 42 proposals accepted, 50 million CPU-h granted*
  • DECI call 2009 (proposals currently under review): 75 proposals, more than 200 million CPU-h requested*
  *) normalized to IBM P4+
  Over 160 universities and research institutes from 15 European countries, with co-investigators from four other continents, have already benefitted.
– Virtual Science Community Support
  • 2008: EFDA, EUFORIA, VIROLAB
  • 2009: EFDA, EUFORIA, ENES, LFI-PLANCK, VPH/VIROLAB, VIRGO

Page 18: Session 33 - Production Grids

Middleware

• Various services are provided on the middleware layer:
– DEISA Common Production Environment (DCPE) (homogeneous software environment layer for heterogeneous HPC platforms)
– High-performance data stage-in/-out to GPFS: GridFTP
– Workflow management: UNICORE
– Job submission:
  • UNICORE
  • WS-GRAM (optional)
  • Interactive usage of local batch systems
  • Remote job submission between IBM P6/AIX systems (LL-MC)
– Monitoring system: INCA
– Unified AAA: distributed LDAP and resource usage databases
• Only a few software components are developed within DEISA
– Focus on technology evaluation, deployment and operation
– Bugs are reported to the software maintainers

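To make the GridFTP stage-in step above concrete, here is a minimal sketch (the endpoint host and paths are hypothetical; the globus-url-copy client and an existing grid proxy are assumed) of pushing an input file to a site's GPFS area:

# Sketch: stage a local input file to a DEISA site's GPFS area over GridFTP.
# Hostname and remote path are hypothetical; requires globus-url-copy and a
# valid grid proxy created beforehand (e.g. with grid-proxy-init).
import subprocess

local_file = "file:///home/user/input.dat"
remote_url = "gsiftp://gridftp.example-site.eu:2811/gpfs/project/input.dat"

# -vb prints transfer performance while the copy runs.
subprocess.run(["globus-url-copy", "-vb", local_file, remote_url], check=True)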

Page 19: Session 33 - Production Grids

Standards

• DEISA has a vital interest in the standardization of interfaces to HPC services
– Job submission, job and workflow management, data management, data access and archiving, networking and security (including AAA)
• DEISA supports OGF standardization groups
– JSDL-WG and OGSA-BES for job submission
– UR-WG and RUS-WG for accounting
– DAIS for data services
– Engagement in the Production Grid Infrastructure WG
• DEISA collaboration in standardization with other projects
– GIN community
– Infrastructure Policy Group (DEISA, EGEE, TeraGrid, OSG, NAREGI)
  Goal: achievement of seamless interoperation of leading Grid infrastructures worldwide
  - Authentication, Authorization, Accounting (AAA)
  - Resource allocation policies
  - Portal / access policies
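To make the JSDL/OGSA-BES references concrete, the sketch below assembles a minimal JSDL 1.0 job description with Python's standard library; the job name and executable are illustrative, and submitting the document to a BES endpoint is not shown:

# Sketch: build a minimal JSDL 1.0 job description (job name and executable
# are hypothetical). A real submission would send this XML to an OGSA-BES
# endpoint, which is omitted here.
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

job = ET.Element(f"{{{JSDL}}}JobDefinition")
desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")

ident = ET.SubElement(desc, f"{{{JSDL}}}JobIdentification")
ET.SubElement(ident, f"{{{JSDL}}}JobName").text = "deisa-demo"  # hypothetical name

app = ET.SubElement(desc, f"{{{JSDL}}}Application")
posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
ET.SubElement(posix, f"{{{POSIX}}}Executable").text = "/bin/hostname"
ET.SubElement(posix, f"{{{POSIX}}}Output").text = "stdout.txt"

print(ET.tostring(job, encoding="unicode"))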

Page 20: Session 33 - Production Grids

STATUS OF CSI GRID (NAREGI)

Kento Aida
National Institute of Informatics

Page 21: Session 33 - Production Grids

Overview

Current Status
– Pilot operation started in May 2009.
Organization
– Computer centers in 9 universities: resource providers
– National Institute of Informatics: network provider (SINET 3) and GOC
Funding
– Organizations' own funding

Page 22: Session 33 - Production Grids

Operational Infrastructure (sites and roles)

– Information Initiative Center, Hokkaido University: resource provider (Linux cluster)
– Cyberscience Center, Tohoku University: resource provider (vector computer)
– Center for Computational Sciences, University of Tsukuba: resource provider (Linux cluster)
– Information Technology Center, University of Tokyo: resource provider (Linux cluster)
– Global Scientific Information and Computing Center, Tokyo Institute of Technology: resource provider (Linux cluster)
– Information Technology Center, Nagoya University: resource provider (Linux cluster)
– Academic Center for Computing and Media Studies, Kyoto University: resource provider (Linux cluster)
– Cybermedia Center, Osaka University: resource provider (Linux cluster, vector computer)
– Research Institute for Information Technology, Kyushu University: resource provider (Linux cluster)
– National Institute of Informatics: network provider (SINET 3), GOC (CA, helpdesk)

Page 23: Session 33 - Production Grids

Middleware

NAREGI middleware Ver. 1.1.3
– Developer: National Institute of Informatics (http://middleware.naregi.org/Download/)
– Platforms: CentOS 5.2 + PBS Pro 9.1/9.2; OpenSUSE 10.3 + Sun Grid Engine v6.0

Page 24: Session 33 - Production Grids

NORDIC DATAGRID FACILITY
Michael Gronager

Page 25: Session 33 - Production Grids


NDGF Organization

A Co-operative Nordic Data and Computing Grid facility
• Nordic production grid, leveraging national grid resources
• Common policy framework for Nordic production grid
• Joint Nordic planning and coordination
• Operate Nordic storage facility for major projects
• Co-ordinate & host major eScience projects (i.e., Nordic WLCG Tier-1)
• Contribute to grid middleware and develop services

NDGF 2006-2010: funded (2 M€/year) by the National Research Councils of the Nordic countries (NOS-N: DK, FI, NO, SE, IS).

Page 26: Session 33 - Production Grids

NDGF Facility - 2009Q1

Page 27: Session 33 - Production Grids

NDGF People - 2009Q2

Page 28: Session 33 - Production Grids

Application Communities

• WLCG – the Worldwide Large Hadron Collider Grid
• Bio-informatics sciences
• Screening of CO2-sequestration suitable reservoirs
• Computational Chemistry
• Material Science
And the more horizontal:
• Common Nordic User Administration, Authentication, Authorization & Accounting

Page 29: Session 33 - Production Grids

Operations

• Operation team of 5-7 people
• Collaboration btw. NDGF and SNIC and NUNOC
• Expert 365 days a year; 24x7 by Regional REN
• Distributed over the Nordics
• Runs: rCOD + ROC – for Nordic + Baltic
• Distributed sites (T1, T2s)
• Sysadmins well known by the operation team
• Continuous chatroom meetings

Page 30: Session 33 - Production Grids

Middleware

Philosophy: we need tools to run an e-Infrastructure. Tools cost money or in-kind effort; in kind means Open Source tools – hence we contribute to the things we use:
• dCache (storage) – a DESY, FNAL, NDGF ++ collaboration
• ARC (computing) – a collaboration btw. Nordic, Slovenian, Swiss institutes
• SGAS (accounting) and Confusa (client certificates from IdPs)
• BDII, WMS, SAM, AliEn, Panda – gLite/CERN tools
• MonAmi, Nagios (monitoring)
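For a flavour of how the ARC middleware is driven on NDGF resources, here is a hedged sketch (the computing element hostname is hypothetical; the ARC client tools and a valid grid proxy are assumed) of an xRSL job description submitted with arcsub:

# Sketch: write a minimal xRSL job description and submit it with arcsub.
# The computing element hostname is hypothetical; requires the ARC client
# tools and a valid grid proxy.
import subprocess
from pathlib import Path

xrsl = '&(executable="/bin/hostname")(stdout="out.txt")(jobname="ndgf-demo")'
Path("job.xrsl").write_text(xrsl)

# -c points the submission at one computing element instead of brokering.
subprocess.run(["arcsub", "-c", "ce.example.org", "job.xrsl"], check=True)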

Page 31: Session 33 - Production Grids

NDGF now and in the future

• e-Infrastructure as a whole is important: resources count, both capacity and capability – computing and different network and storage systems
• The infrastructure must support different access methods (grid, ssh, application portals, etc.) – note that the average grid use of shared resources is only around 10-25%
• Uniform user management, identity, access, accounting, policy enforcement and resource allocation and sharing – independent of access method, for all users

Page 32: Session 33 - Production Grids

ENABLING GRIDS FOR E-SCIENCE
Steven Newhouse

Page 33: Session 33 - Production Grids

Enabling Grids for E-sciencE (EGEE-III INFSO-RI-222667)
[Slide credit: Bob Jones, Project Status, EGEE-III First Review, 24-25 June 2009]

[Diagram: networking, middleware and service activities connect new scientific communities with the established user community.]

Effort breakdown:
• Grid operations & networking support: 51%
• User community support: 19%
• Training: 8%
• Middleware engineering: 5%
• Integration and testing: 9%
• Management: 2%
• Dissemination & international cooperation: 6%

• Duration: 2 years
• Total budget:
– Staff ~47 M€
– H/W ~50 M€
– EC contribution: 32 M€
• Total effort:
– 9132 person-months
– ~382 FTE

Page 34: Session 33 - Production Grids

Project Overview

• 17,000 users
• 136,000 LCPUs (cores)
• 25 PB disk
• 39 PB tape
• 12 million jobs/month (+45% in a year)
• 268 sites (+5% in a year)
• 48 countries (+10% in a year)
• 162 VOs (+29% in a year)

Page 35: Session 33 - Production Grids

Supporting Science

• Archeology
• Astronomy
• Astrophysics
• Civil Protection
• Comp. Chemistry
• Earth Sciences
• Finance
• Fusion
• Geophysics
• High Energy Physics
• Life Sciences
• Multimedia
• Material Sciences

Resource Utilisation

• Proportion of HEP usage ~77%

End-user activity
• 13,000 end-users in 112 VOs (+44% users in a year)
• 23 core VOs (a core VO has >10% of usage within its science cluster)

[Chart: non-HEP resource usage (%) by discipline – Computational Chemistry, Life Sciences, Multidisciplinary, Astronomy & Astrophysics, Earth Science, Fusion, Other Areas – comparing March 2007 to February 2008 with March 2008 to February 2009.]

Page 36: Session 33 - Production Grids

Operations

• Monitored 24x7 on a regional basis
• Central help desk for all issues
– Filtered to regional and specialist support units

[Chart: site reliability over time – average reliability, top 20% and top 50% reliability, plotted against the target.]

Page 37: Session 33 - Production Grids

gLite Middleware

[Diagram: EGEE-maintained and external components layered over the physical resources.
• Security Services: Virtual Organisation Membership Service, Authz. Service, SCAS, Proxy Server, LCAS & LCMAPS
• General Services: LHC File Catalogue, Hydra, Workload Management Service, File Transfer Service, Logging & Bookkeeping Service, AMGA
• Information Services: BDII, MON
• Storage Element: Disk Pool Manager, dCache
• Compute Element: CREAM, LCG-CE, gLExec, BLAH; Worker Node
• User Access: User Interface
Interface standards shown: JSDL & BES, GLUE 2.0, DMI, SRM, SAML, X.509 attributes.]
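As a hedged sketch of how a user exercises these components from the User Interface (the VO name and file names are hypothetical; a gLite UI installation and a user certificate are assumed), a job typically passes through VOMS proxy creation and WMS submission:

# Sketch: create a VOMS proxy, then submit a simple JDL job through the
# gLite Workload Management Service. VO name and file names are hypothetical.
import subprocess
from pathlib import Path

jdl = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""
Path("job.jdl").write_text(jdl)

# Obtain VO membership credentials (prompts for the certificate passphrase).
subprocess.run(["voms-proxy-init", "--voms", "myvo.example.org"], check=True)

# -a delegates the proxy automatically; the WMS then matches the job to a CE.
subprocess.run(["glite-wms-job-submit", "-a", "job.jdl"], check=True)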

Page 38: Session 33 - Production Grids

TERAGRID
Daniel S. Katz

Page 39: Session 33 - Production Grids

TeraGrid Overview

• TeraGrid is run by 11 resource providers (RPs) and integrated by the Grid Infrastructure Group (GIG, at University of Chicago)
– The TeraGrid Forum (made of these 12 entities) decides policy by consensus (elected chair is John Towns, NCSA)
• Funding is by separate awards from the National Science Foundation to the 12 groups
– GIG sub-awards integration funding to the 11 RPs and some additional groups
• Resources (distributed at the 11 RPs across the United States, connected by 10 Gbps paths)
– 14 HPC systems (1.6 PFlops, 310 TBytes memory)
– 1 HTC pool (105,000 CPUs)
– 7 storage systems (3.0 PBytes on-line, 60 PBytes off-line)
– 2 viz systems (128 tightly integrated CPUs, 14,000 loosely coupled CPUs)
– Special purpose systems (GPUs, FPGAs)

Page 40: Session 33 - Production Grids

Applications Community (2008)

Use Modality – Community Size (rough est., number of users):
– Batch Computing on Individual Resources: 850
– Exploratory and Application Porting: 650
– Workflow, Ensemble, and Parameter Sweep: 250
– Science Gateway Access: 500
– Remote Interactive Steering and Visualization: 35
– Tightly-Coupled Distributed Computation: 10

(2006)

Primarily HPC usage, but growing use of science gateways and workflows, lesser HTC usage.

Page 41: Session 33 - Production Grids

Operations Infrastructure

• Lots of services to keep this all together
– Keep most things looking like one to the users, including:
  • Allocations, helpdesk, accounting, web site, portal, security, data movement, information services, resource catalog, science gateways, etc.
– Working on, but don't have in production yet:
  • Single global file system, identity management integrated with universities
• Services supported by GIG, resources supported by RPs

Page 42: Session 33 - Production Grids

Middleware

• The Coordinated TeraGrid Software Stack is made of kits
– All but one (Core Integration) are optional for RPs
– Kits define a set of functionality and provide an implementation
– Optional kits: Data Movement, Remote Login Capability, Science Workflow Support, Parallel Application Capability, Remote Compute, Application Development and Runtime Support Capability, Metascheduling Capability, Data Movement Servers Capability, Data Management Capability, Data Visualization Support, Data Movement Clients Capability, Local Resource Provider HPC Software, Wide Area GPFS File Systems, Co-Scheduling Capability, Advance Reservation Capability, Wide Area Lustre File Systems, Science Gateway Kit
– Current status: [table of kits (along the top) against resources (along the left side); yellow means the kit is installed, white means it is not]
– Some kits are now being rolled out (Science Gateway) and will become more widely used; some have limited functionality (Data Visualization Support) that only makes sense on some resources

Page 43: Session 33 - Production Grids

TeraGrid

• TeraGrid considers itself the world's largest open scientific computing infrastructure
– Usage is free; allocations are peer-reviewed and available to all US researchers and their collaborators
• TeraGrid is a platform on which others can build
– Application developers
– Science Gateways
• TeraGrid is a research project
– Learning how to do distributed, collaborative science on a continental-scale, federated infrastructure
– Learning how to run multi-institution shared infrastructure

Page 44: Session 33 - Production Grids

Common Characteristics

• Operating a production grid requires WORK
– Monitoring, reporting, chasing, ...
• No 'off the shelf' software solution
– Plenty of components... but they need verified assembly!
• No central control
– Distributed expertise leads to distributed teams
– Resources are federated; ownership lies elsewhere
• No ownership by the Grid of hardware resources
• All driven by delivering to user communities

Page 45: Session 33 - Production Grids

The Future in Europe: EGI

• EGI: European Grid Initiative
• Result of the EGI Design Study
– 2-year project to build community consensus
• Move from project to sustainable funding
– Leverage other sources of funding
• Build on national grid initiatives (NGIs)
– Provide the European 'glue' around independent NGIs

Page 46: Session 33 - Production Grids

The EGI Actors

[Diagram: the EGI actors – EGI.eu coordinating the National Grid Initiatives (NGI1, NGI2, ..., NGIn), which bring together Resource Centres, Research Teams and Research Institutes.]

Page 47: Session 33 - Production Grids

EGI.eu and NGI Tasks

[Diagram: tasks split between EGI.eu tasks, NGI international tasks and NGI national tasks, distributed across EGI.eu and the NGIs.]

Page 48: Session 33 - Production Grids

Differences between EGEE & EGI

[Diagram: in the EGI model, EGI.eu coordinates NGI Operations, Specialised Support Centres and the European Middleware Initiative (EMI).]

Page 49: Session 33 - Production Grids

Middleware

• EGI will release UMD
– Unified Middleware Distribution
– Components needed to build a production grid
– Initial main providers: ARC, gLite & UNICORE
– Expect to evolve components over time
– Have defined interfaces to enable multiple providers
• EMI project from ARC, gLite & UNICORE
– Supports, maintains & harmonises software
– Introduction & development of standards

Page 50: Session 33 - Production Grids

[Slide credit: Steven Newhouse, Plans for Year II, EGEE-III First Review, 24-25 June 2009]

Current Status

• 8th July: EGI Council Meeting
– Confirmation of Interim Director
– Establish editorial team for the EC proposals
• 30th July: EC call opens
• 1st October: Financial contributions to the EGI Collaboration due
• October/November: EGI.eu established
• 24th November: EC call closes
• December 2009/January 2010: EGI.eu startup phase
• Winter 2010: Negotiation phase for EGI projects
• 1st May 2010: EGI projects launched

Page 51: Session 33 - Production Grids

European Future?

• Sustainability
– E-Infrastructure is vital
– Will underpin many research activities

• Activity has to be driven by active stakeholders