
Page 1: RHIC/US ATLAS Tier 1  Computing Facility Site Report

RHIC/US ATLAS Tier 1 Computing Facility

Site Report

Christopher Hollowell
Physics Department
Brookhaven National Laboratory
[email protected]

HEPiX
Upton, NY, USA
October 18, 2004

Page 2: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Facility Overview

● Created in the mid-1990s to provide centralized computing services for the RHIC experiments

● Expanded our role in the late 1990s to act as the Tier 1 computing center for ATLAS in the United States

● Currently employ 28 staff members: planning to add 5 additional employees in the next fiscal year

Page 3: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Facility Overview (Cont.)

● Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway

● RHIC Run 5 scheduled to begin in late December 2004

Page 4: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Centralized Disk Storage

● 37 NFS Servers Running Solaris 9: recent upgrade from Solaris 8

● Underlying filesystems upgraded to VxFS 4.0
– Issue with quotas on filesystems larger than 1 TB in size

● ~220 TB of fibre channel SAN-based RAID5 storage available: added ~100 TB in the past year

Page 5: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Centralized Disk Storage (Cont.)

● Scalability issues with NFS (network-limited to ~70 MB/s max per server [75-90 MB/s max local I/O] in our configuration): testing of new network storage models including Panasas and IBRIX in progress
– Panasas tests look promising. 4.5 TB of storage on 10 blades available for evaluation by our user community. DirectFlow client in use on over 400 machines
– Both systems allow for NFS export of data
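The per-server throughput ceiling quoted above is the kind of figure that can be approximated with a timed sequential read against an NFS (or Panasas) mount. The sketch below is illustrative only: the mount point and file name are hypothetical, and the file must be larger than client memory so caching does not inflate the number.

```python
#!/usr/bin/env python
"""Rough sequential-read throughput estimate for a network filesystem mount.

Illustrative sketch only: /mnt/nfs01/bigfile is a hypothetical path, and a
real evaluation would use dedicated I/O benchmarking tools.
"""
import time

TEST_FILE = "/mnt/nfs01/bigfile"   # hypothetical file on an NFS/Panasas mount
BLOCK_SIZE = 4 * 1024 * 1024       # read in 4 MB chunks

def sequential_read_mb_per_s(path):
    total_bytes = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.time() - start
    return (total_bytes / (1024.0 * 1024.0)) / elapsed

if __name__ == "__main__":
    print("~%.1f MB/s sequential read" % sequential_read_mb_per_s(TEST_FILE))
```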


Page 7: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Centralized Disk Storage: AFS

● Moving servers from Transarc AFS running on AIX to OpenAFS 1.2.11 on Solaris 9

● The move from Transarc to OpenAFS motivated by Kerberos4/Kerberos5 issues and Transarc AFS end of life

● Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers running OpenAFS

● 2 Cells
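For reference, the cell a client belongs to and the cells it knows about can be checked with the standard OpenAFS command-line tools; a minimal wrapper, assuming the 'fs' utility is installed on the client, might look like this.

```python
#!/usr/bin/env python
"""Report basic AFS client cell configuration.

Minimal sketch assuming the OpenAFS 'fs' utility is installed on the client.
"""
import subprocess

def run(cmd):
    # Run a command and return its output as text.
    return subprocess.check_output(cmd).decode()

if __name__ == "__main__":
    print(run(["fs", "wscell"]))     # the cell this workstation belongs to
    print(run(["fs", "listcells"]))  # all cells listed in the client's CellServDB
```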

Page 8: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Mass Tape Storage

● Four STK Powderhorn silos provided, each capable of holding ~6000 tapes

● 1.7 PB data currently stored

● HPSS Version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5

● 45 tape drives available for use

● Latest STK tape technology: 200 GB/tape

● ~12 TB disk cache in front of the system
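As a back-of-the-envelope check of the numbers above, four silos of roughly 6000 tapes at 200 GB per tape correspond to about 4.8 PB of nominal raw capacity, against the 1.7 PB currently stored; the snippet below just restates that arithmetic.

```python
# Back-of-the-envelope tape capacity from the figures on this slide (illustrative only).
silos = 4
tapes_per_silo = 6000        # approximate
gb_per_tape = 200

raw_capacity_pb = silos * tapes_per_silo * gb_per_tape / 1.0e6   # GB -> PB (decimal)
print("Nominal raw capacity: ~%.1f PB" % raw_capacity_pb)        # ~4.8 PB
print("Currently stored:      1.7 PB")
```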

Page 9: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Mass Tape Storage (Cont.)

● PFTP, HSI and HTAR available as interfaces
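A typical way scripts interact with HPSS through HSI is to pass it a single command string; the sketch below shows that pattern. The file names and HPSS path are hypothetical, and it assumes 'hsi' is installed and the user is already authenticated.

```python
#!/usr/bin/env python
"""Store and retrieve a file in HPSS via the HSI command-line interface.

Minimal sketch: assumes 'hsi' is on PATH and already authenticated; the
local and HPSS paths below are hypothetical.
"""
import subprocess

def hsi_put(local_path, hpss_path):
    # HSI's 'put local : remote' form stores a local file into HPSS.
    subprocess.check_call(["hsi", "put %s : %s" % (local_path, hpss_path)])

def hsi_get(local_path, hpss_path):
    # HSI's 'get local : remote' form stages it back out of the tape system.
    subprocess.check_call(["hsi", "get %s : %s" % (local_path, hpss_path)])

if __name__ == "__main__":
    hsi_put("sample.root", "/hpss/user/sample.root")
    hsi_get("sample_copy.root", "/hpss/user/sample.root")
```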

Page 10: RHIC/US ATLAS Tier 1  Computing Facility Site Report

CAS/CRS Farm

● Farm of 1423 dual-CPU (Intel) systems
– Added 335 machines this year

● ~245 TB local disk storage (SCSI and IDE)

● Upgrade of RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (+updates) underway: should be complete before next RHIC run

Page 11: RHIC/US ATLAS Tier 1  Computing Facility Site Report

CAS/CRS Farm (Cont.)

● LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use. Upgrade to LSF 6.0 planned

● Kickstart used to automate node installation

● GANGLIA + custom software used for system monitoring

● Phasing out the original RHIC CRS Batch System: replacing with a system based on Condor

● Retiring 142 VA Linux 2U PIII 450 MHz systems after next purchase
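As an illustration of the Condor side of this, the fragment below writes a minimal vanilla-universe submit description and hands it to condor_submit. It is only a sketch of the kind of wrapper a Condor-based CRS replacement would build on; the executable and file names are placeholders.

```python
#!/usr/bin/env python
"""Submit a trivial vanilla-universe job to a Condor pool.

Minimal sketch: assumes 'condor_submit' is on PATH; the executable and
output file names are placeholders.
"""
import subprocess

SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = /bin/hostname
output     = job.out
error      = job.err
log        = job.log
queue
"""

if __name__ == "__main__":
    with open("job.sub", "w") as f:
        f.write(SUBMIT_DESCRIPTION)
    subprocess.check_call(["condor_submit", "job.sub"])
```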


Page 14: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Security

● Elimination of NIS, complete transition to Kerberos5/LDAP in progress

● Expect K5 TGT to X.509 certificate transition in the future: KCA?

● Hardening/monitoring of all internal systems

● Growing web service issues: unknown services accessed through port 80
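One small piece of the Kerberos 5 transition is making sure tools check for a valid TGT before doing anything else; a minimal check, relying on the MIT Kerberos convention that 'klist -s' exits non-zero when no valid credentials are cached, is sketched below.

```python
#!/usr/bin/env python
"""Check whether the current user holds a valid Kerberos 5 TGT.

Minimal sketch: relies on MIT Kerberos 'klist -s', which is silent and
signals ticket validity through its exit status.
"""
import subprocess

def have_k5_tgt():
    return subprocess.call(["klist", "-s"]) == 0

if __name__ == "__main__":
    if have_k5_tgt():
        print("Valid Kerberos 5 TGT found")
    else:
        print("No valid TGT; run kinit first")
```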

Page 15: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Grid Activities

● Brookhaven planning on upgrading external network connectivity to OC48 (2.488 Gbps) from OC12 (622 Mbps) to support ATLAS activity

● ATLAS Data Challenge 2: jobs submitted via Grid3

● GUMS (Grid User Management System)
– Generates grid-mapfiles for gatekeeper hosts
– In production since May 2004
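The grid-mapfiles GUMS generates are simple text files mapping a quoted certificate subject (DN) to a local account; the sketch below parses that format. The example DN and account name are made up.

```python
#!/usr/bin/env python
"""Parse a Globus-style grid-mapfile of the kind GUMS generates.

Minimal sketch: the example DN and account are made up.
"""
import shlex

EXAMPLE = '"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 12345" usatlas1\n'

def parse_gridmap(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)       # handles the quoted DN
        dn, account = parts[0], parts[1]
        mapping[dn] = account
    return mapping

if __name__ == "__main__":
    print(parse_gridmap(EXAMPLE))
```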

Page 16: RHIC/US ATLAS Tier 1  Computing Facility Site Report

Storage Resource Manager (SRM)

● SRM: middleware providing dynamic storage allocation and data management services
– Automatically handles network/space allocation failures

● HRM (Hierarchical Resource Manager)-type SRM server in production
– Accessible from within and outside the facility
– 350 GB cache
– Berkeley HRM 1.2.1
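From a user's point of view an SRM transfer looks like a simple copy between an srm:// URL and a local file, with the server handling staging and space behind the scenes. The sketch below assumes an srmcp-style command-line client; the endpoint host, port and paths are invented, and the exact local-URL form varies between clients.

```python
#!/usr/bin/env python
"""Copy a file out of an SRM-managed storage system.

Minimal sketch: assumes an 'srmcp'-style SRM copy client on PATH; the
endpoint, port and paths are invented.
"""
import subprocess

SOURCE = "srm://srm.example.org:8443/data/dc2/sample.root"  # hypothetical SRM URL
DEST = "file:///tmp/sample.root"                            # local-URL form varies by client

if __name__ == "__main__":
    subprocess.check_call(["srmcp", SOURCE, DEST])
```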

Page 17: RHIC/US ATLAS Tier 1  Computing Facility Site Report

dCache

● Provides global name space over disparate storage elements
– Hot spot detection
– Client software data access through libdcap library or libpdcap preload library

● ATLAS & PHENIX dCache pools
– PHENIX pool expanding performance tests to production machines
– ATLAS pool interacting with HPSS using HSI: no way of throttling data transfer requests as of yet
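For completeness, the usual ways into a dCache pool from a client are either the dccp copy tool built on libdcap, or running an unmodified application with the libpdcap preload library so its ordinary file opens are redirected to dCache. The sketch below shows the dccp path; the door host, port and pnfs path are hypothetical.

```python
#!/usr/bin/env python
"""Copy a file out of dCache with the dcap client tool.

Minimal sketch: 'dccp' comes with the dCache/libdcap client tools, but the
door host, port and pnfs path below are hypothetical.
"""
import subprocess

SOURCE = "dcap://door.example.org:22125/pnfs/example.org/data/sample.root"
DEST = "/tmp/sample.root"

if __name__ == "__main__":
    subprocess.check_call(["dccp", SOURCE, DEST])
```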