Site Report: The RHIC Computing Facility

HEPIX – Amsterdam

May 19-23, 2003

A. Chan

RHIC Computing Facility

Brookhaven National Laboratory

Outline

Background

Mass Storage

Central Disk Storage

Linux Farms

Software Development

Monitoring

Security

Other Services

Summary

Background

Brookhaven National Laboratory (BNL) is a U.S. government-funded multidisciplinary research laboratory

RCF formed in the mid-1990s to address the computing needs of the RHIC experiments

Became the U.S. Tier 1 Center for ATLAS in the late 1990s

RCF is a multi-purpose facility (NHEP and HEP)

Background (continued)

Currently 25 staff members (need more)

First RHIC collisions in 2000; now in year 3 of operations

5 RHIC experiments (BRAHMS, PHENIX, PHOBOS, PP2PP and STAR)

Mass Storage

4 StorageTek tape silos managed via HPSS (9940A and 9940B)

Peak raw data rate to silos 350 MB/s (can do better)

Peak data rate to/from Linux Farm 180 MB/s (can do better)

Experiments have accumulated 618 TB of raw data (capacity for 5x more)

5 staff members oversee Mass Storage operations
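To put the quoted rates in perspective, the following is a rough back-of-the-envelope sketch, not an RCF tool. It estimates how long it would take to stream the accumulated 618 TB at the two peak rates quoted above, assuming decimal units (1 TB = 1,000,000 MB) and a sustained peak rate, both of which are simplifying assumptions.

```python
# Figures quoted on the Mass Storage slide.
RAW_DATA_TB = 618            # raw data accumulated by the experiments
PEAK_SILO_RATE_MB_S = 350    # peak raw data rate to the silos
PEAK_FARM_RATE_MB_S = 180    # peak data rate to/from the Linux Farm

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000
SECONDS_PER_DAY = 86_400


def days_to_stream(volume_tb: float, rate_mb_s: float) -> float:
    """Days needed to move volume_tb at a sustained rate of rate_mb_s."""
    seconds = volume_tb * MB_PER_TB / rate_mb_s
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, rate in (("silos", PEAK_SILO_RATE_MB_S),
                        ("Linux Farm", PEAK_FARM_RATE_MB_S)):
        days = days_to_stream(RAW_DATA_TB, rate)
        print(f"{label}: {days:.1f} days at {rate} MB/s")
```

Under these assumptions, a full pass over the raw data takes roughly three weeks at the silo rate and nearly six weeks at the Linux Farm rate.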

The Mass Storage System (1)

The Mass Storage System (2)

Central Disk Storage

24 Sun E450 servers running Solaris 8

140 TB of disks managed by Sun servers via Veritas

Fast access to processed (DST) data via NFS (backup in HPSS)

Average aggregate data rate of 600 MB/s to/from Sun servers

5 staff members oversee Central Disk Storage operations

Central Disk Storage (1)

Central Disk Storage (2)

Linux Farms

Provide the majority of CPU power in the RCF

Used for mass processing of RHIC data

Listed as the 3rd largest cluster by http://www.clusters500.org

5 staff members oversee all Linux Farm operations

Linux Farm Hardware

Built with commercially available Intel-based servers

1097 rack-mounted, dual CPU servers

917,728 SpecInt2000

Reliable: 0.0052 hardware failures per machine per month (about 6 failures/month at current size)
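The "about 6 failures/month" figure follows directly from the per-machine rate and the farm size. A minimal sketch of that arithmetic, using only the numbers on this slide (the function names are illustrative):

```python
# Figures quoted on the Linux Farm Hardware slide.
SERVERS = 1097                        # rack-mounted, dual CPU servers
TOTAL_SPECINT2000 = 917_728           # aggregate SpecInt2000 rating
FAILURES_PER_MACHINE_MONTH = 0.0052   # hardware failures per machine per month


def expected_failures_per_month(servers: int, rate: float) -> float:
    """Expected hardware failures per month across the whole farm."""
    return servers * rate


def specint_per_server(total: float, servers: int) -> float:
    """Average SpecInt2000 rating per dual CPU server."""
    return total / servers


if __name__ == "__main__":
    failures = expected_failures_per_month(SERVERS, FAILURES_PER_MACHINE_MONTH)
    per_server = specint_per_server(TOTAL_SPECINT2000, SERVERS)
    print(f"Expected failures/month: {failures:.1f}")
    print(f"Average SpecInt2000 per server: {per_server:.0f}")
```

The product is about 5.7 failures per month, which the slide rounds to 6; the average rating works out to roughly 837 SpecInt2000 per server.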

The growth of the Linux Farm

[Figure: Linux Farm capacity in KSpecInt2000 (axis 0 to 1000) by year, 1999 to 2003]

The Linux Farm in the RCF (1)

The Linux Farm in the RCF (2)

Linux Farm Software

Red Hat Linux 7.2 (RHIC) and 7.3 (ATLAS)

Image installed with Kickstart

Support for compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel)

Support for network file systems (AFS, NFS)

Linux Farm Software (continued)

Support for LSF and RCF-designed batch software

System administration software to monitor & control hardware, software and infrastructure

GRID-like software (Ganglia, Condor, Globus, etc.)

Scalability is an important operational requirement

Batch jobs in the Linux Farm (1)

[Figure: CRS Batch Job Statistics. Total jobs submitted per month (axis 0 to 35000) by year, 1999 to 2003, for BRAHMS, PHENIX, PHOBOS and STAR]

Batch Jobs in the Linux Farm (2)

[Figure: CRS Batch Job Statistics. Efficiency (axis 0 to 100) by year, 1999 to 2003, for BRAHMS, PHENIX, PHOBOS and STAR]

Software Development

GRID-like services for RHIC and ATLAS

GRID monitoring tools

GRID user management issues

4 staff members involved

The USATLAS GRID Testbed

[Diagram: BNL US ATLAS Grid Configuration (slide 1 from CHEP 03, La Jolla, 4/15/03). Labels shown: Internet; Globus client; Submit Grid Jobs; Grid Job Requests; Gatekeeper / Job manager; LSF Server1; LSF Server2; HPSS; Disks (2 TB, 30 MB/s); atlas00; afs04,05; amds04; GridFTP server; Globus Replica catalog; GIIS Server (Grid Status)]

GRID Monitoring

[Diagram: Monitoring Framework (slide 6 from CHEP 03, La Jolla, 4/15/03). Labels shown: Sensors on the Server, HPSS, Network and Computing Nodes; four Information Providers (GRIS); Aggregate Service Index (GIIS); Grid-info-search; Data Collectors; DB Info. Providers; Monitoring Database (ODBC + MySQL, or RRD); Grid-View (Web Server)]
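The flow named in the Monitoring Framework diagram (sensors feeding data collectors, a monitoring database, and a view for display) can be sketched as follows. This is an illustrative toy, not the RCF implementation: the sensor callables, table layout and function names are all hypothetical, and an in-memory SQLite database stands in for the MySQL or RRD back end named on the slide.

```python
import sqlite3
import time
from typing import Callable, Dict, Optional

# Hypothetical sensor type: a callable returning one numeric reading.
Sensor = Callable[[], float]


def make_db() -> sqlite3.Connection:
    """Create the monitoring database (SQLite here as a stand-in for MySQL/RRD)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts REAL, resource TEXT, value REAL)")
    return conn


def collect(conn: sqlite3.Connection, sensors: Dict[str, Sensor],
            now: Optional[float] = None) -> int:
    """Data collector: poll every sensor once and store the readings."""
    ts = time.time() if now is None else now
    rows = [(ts, name, float(read())) for name, read in sensors.items()]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def latest(conn: sqlite3.Connection) -> Dict[str, float]:
    """Most recent value per resource, as a status page might display it."""
    cur = conn.execute(
        "SELECT r.resource, r.value FROM readings AS r "
        "WHERE r.ts = (SELECT MAX(i.ts) FROM readings AS i "
        "              WHERE i.resource = r.resource)"
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    db = make_db()
    # Placeholder sensors with made-up readings, for illustration only.
    collect(db, {"hpss": lambda: 0.42, "network": lambda: 0.10})
    print(latest(db))
```

Keeping collection and display separated by a database is one way to get the persistence and near real-time behavior the deck asks of its monitoring tools, since the view only ever reads stored readings.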

GRID User Management(1)

GUMS: A scalable Grid User Management System

[Diagram (slide 7 from CHEP 03, La Jolla, 4/15/03). Labels shown: Virtual Organization; User info (twice); UNM]

GRID User Management (2)

[Schematic Diagram (slide 8 from CHEP 03, La Jolla, 4/15/03). Labels shown: VO User Registry Database; VO #2 Database; VO #3 …; Regional Registration Authority?; Local Registration Authority; Site User Info Database; Local Policy; Local Account Management; grid-mapfile; Site; arrows labeled Push, Pull, Push]
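The schematic ends in a grid-mapfile, the Globus file that maps a certificate subject name to a local account. A minimal sketch of that last step is shown below. It is not the GUMS code: the function name, the VO-to-account policy table and the example subject names are hypothetical, and only the output line format (a quoted subject name followed by a local user name) follows the usual grid-mapfile convention.

```python
from typing import Dict, Iterable, List, Tuple


def build_grid_mapfile(vo_members: Iterable[Tuple[str, str]],
                       local_policy: Dict[str, str]) -> List[str]:
    """Turn (subject DN, VO name) pairs into grid-mapfile lines.

    local_policy maps a VO name to the local account its members share.
    Members of VOs with no local policy entry are skipped.
    """
    lines = []
    for subject_dn, vo_name in vo_members:
        account = local_policy.get(vo_name)
        if account is None:
            continue  # site policy does not admit this VO
        lines.append(f'"{subject_dn}" {account}')
    return sorted(lines)


if __name__ == "__main__":
    # Hypothetical registry contents and site policy, for illustration only.
    members = [
        ("/DC=org/DC=example/OU=People/CN=Alice Example", "atlas"),
        ("/DC=org/DC=example/OU=People/CN=Bob Example", "othervo"),
    ]
    policy = {"atlas": "usatlas1"}
    print("\n".join(build_grid_mapfile(members, policy)))
```

Generating the file from a registry plus a site policy table, rather than editing it by hand, is what lets the mapping scale with the number of users and virtual organizations.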

Monitoring

Mix of open-source, RCF-designed and vendor-provided monitoring software

Persistence and fault-tolerance features

Near real-time information

Scalability requirements

Mass Storage Monitoring

Central Disk Storage Monitoring

Linux Farm Monitoring

Batch Job Control & Monitoring

Infrastructure Monitoring

Security

Firewall to minimize unauthorized access

Most servers closed to direct, external access

User access through security-enhanced gateway systems

Security in the GRID environment is a big challenge

Security at the RCF

Other Services

E-mail

Limited printer support

Off-site data transfer services (bbftp, rftp, etc.)

Nightly backups of critical file systems

Summary

Implementation of GRID-like services increasing

Hardware & software scalability more important as RCF grows

Security in the GRID era is an important concern