Site Report: The Linux Farm at the RCF
HEPIX-HEPNT, October 22-25, 2002
Ofer Rind
RHIC Computing Facility, Brookhaven National Laboratory
Ofer Rind - RHIC Computing Facility Site Report
RCF - Overview
Provide computing facilities for RHIC users:
➔General computing environment
●General interactive tasks (email, document processing, web)
➔Data analysis facility
●Computing infrastructure for RHIC experiments
➢Code development, repository & distribution
➢Raw data recording & reconstruction
➢Data analysis
ACF: US Atlas Tier 1 Computing Facility
➔ Shared infrastructure and synergy with RCF
Support staff: 25 FTEs (4 dedicated to Linux Farm)
RCF - Structure
RCF - Component Summary
Mass Storage Subsystem
➔StorageTek library managed by HPSS
●4 silos, 1.2PB capacity (expanding to 4.5PB)
●In Run-2, raw data recorded at a common rate of 70MB/sec for a total of 170TB
●Total data store ~300TB
Disk Storage
➔Fibre Channel SAN served by NFS
●~110TB RAID5
●14 Sun 450, Solaris 8 [2-02] (5 Sun 480 coming online)
●IBM AFS servers (AIX)
Linux Server Farm
Linux Farm Hardware
➔840 1U and 2U servers (pre-'99 towers have been retired)
➔69 kSPECint95, expanding to 100 kSPECint95 (2+ TFLOPS)
➔Most have 1GB mem (at least 500MB)
➔Local SCSI disks up to 140GB/node
➔Allocated by experiment
➔Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)

Vendor     CPU             Nodes   Date
VA Linux   PIII 450 MHz    148     Jun 99
VA Linux   PIII 700 MHz    48      Aug 00
VA Linux   PIII 800 MHz    168     Nov 00
IBM        PIII 1000 MHz   316     Aug 01
IBM        PIII 1400 MHz   160     Oct 02
Linux Farm Software Configuration
● RedHat 7.2 upgraded to 2.4.9-31 kernel
● Image(s) installed via Kickstart server and customized for RCF environment via rpm
● NFS + AFS home directory and file access
● Interactive login allowed on selected nodes
● Job management:
(CAS) LSF 4.2 - slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.
(CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.
● Expanding use of distributed disk models (rootd, ??)
● Atlas Grid testbed
Tracking LSF Usage
[Figure: Star queues weekly job statistics (week of Oct. 10). Panels: job starts/hr, avg runtime/hr, runtime.]
Security and Monitoring
Security:
●RCF firewall within BNL site firewall
●SSH2-only access through gateway bastion nodes (Solaris x86)
●User access restricted to a subset of systems (CAS only)
Monitoring:
●24 hr. on-call staff for critical systems during RHIC operation
●Cluster mgmt. software:
➔VACM (VA Linux)
➔xCAT (IBM, http://www.x-cat.org)
●Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)
●CTS system for problem reports
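The kind of check a node-cleaning cron job might perform can be sketched in a few lines. This is an illustration only, not the RCF scripts: the function name `disk_alerts`, the 90% threshold, and the choice of Python are all assumptions made for the example.

```python
import shutil


def disk_alerts(paths, max_used_fraction=0.90):
    """Return (path, used_fraction) for every path above the threshold.

    The 0.90 default is a hypothetical threshold chosen for illustration.
    """
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        used = usage.used / usage.total
        if used > max_used_fraction:
            alerts.append((path, used))
    return alerts


if __name__ == "__main__":
    # A cron entry could run this periodically and mail any output.
    for path, used in disk_alerts(["/"]):
        print(f"WARNING: {path} is {used:.0%} full")
```

A similar loop could cover the other failure modes named in the list (for example, memory consumption per process), each with its own threshold.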
Farm Alert System
Web-monitoring (user-accessible) plus paging/email alerts
Python scripts run locally and transfer node status information to a MySQL database.
Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
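A minimal sketch of this pattern follows: a script gathers node status, stores it in a database, and a query picks out nodes that need an alert. It is not the RCF code. The table name `node_status`, its columns, the load threshold, and all function names are hypothetical, and the standard-library `sqlite3` module stands in for MySQL so that the example runs without a database server.

```python
import os
import socket
import sqlite3
import time


def collect_status():
    """Gather a small status record for the local node (Unix only)."""
    load1, _, _ = os.getloadavg()  # 1-, 5-, 15-minute load averages
    return {"node": socket.gethostname(), "ts": int(time.time()), "load1": load1}


def store_status(conn, status):
    """Insert one status record; sqlite3 is a stand-in for MySQL here."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_status (node TEXT, ts INTEGER, load1 REAL)"
    )
    conn.execute("INSERT INTO node_status VALUES (:node, :ts, :load1)", status)
    conn.commit()


def overloaded_nodes(conn, max_load=10.0):
    """Return the sorted names of nodes with any sample above max_load."""
    rows = conn.execute(
        "SELECT DISTINCT node FROM node_status WHERE load1 > ? ORDER BY node",
        (max_load,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_status(conn, collect_status())
    print("Nodes over threshold:", overloaded_nodes(conn))
```

In a real deployment the alerting side would turn the query result into a page or an email; that step is left out here.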
Network Operation Status
Perl scripts monitor network service connectivity for all nodes (ssh, yp, etc.)
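The connectivity check can be illustrated with a short sketch. The RCF scripts are written in Perl; the version below uses Python instead, and the function names, the example host names, and the choice of TCP ports are assumptions made for the example (22 for ssh, and 111 for the RPC portmapper that yp/NIS depends on).

```python
import socket


def service_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False


def check_nodes(nodes, services, timeout=2.0):
    """Map (node, service name) -> reachable flag for every combination."""
    return {
        (node, name): service_up(node, port, timeout)
        for node in nodes
        for name, port in services.items()
    }


if __name__ == "__main__":
    # Hypothetical node names and service ports, for illustration only.
    services = {"ssh": 22, "portmap": 111}
    for key, ok in check_nodes(["node001", "node002"], services).items():
        print(key, "OK" if ok else "DOWN")
```

A TCP connect only shows that something is listening on the port; it does not confirm that the service behind it answers correctly.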
Load Monitoring and History
MySQL database for usage history
History available back to Sept. '01 via web interface.
CPU load averaged over 98 Phenix machines during the month of September.
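The averaging behind such a plot can be sketched as follows. It assumes the usage history is available as (timestamp, node, load) samples; that record layout and the function name `average_load_by_time` are hypothetical and are not taken from the RCF database schema.

```python
from collections import defaultdict


def average_load_by_time(samples):
    """Average the load across all nodes reporting at each timestamp.

    samples: iterable of (timestamp, node, load) tuples.
    Returns a dict {timestamp: mean load}, ordered by timestamp.
    """
    totals = defaultdict(lambda: [0.0, 0])  # timestamp -> [sum of loads, count]
    for ts, _node, load in samples:
        totals[ts][0] += load
        totals[ts][1] += 1
    return {ts: total / count for ts, (total, count) in sorted(totals.items())}


if __name__ == "__main__":
    history = [(1, "a", 1.0), (1, "b", 3.0), (2, "a", 2.0)]
    print(average_load_by_time(history))
```

Nodes that miss a sample simply drop out of that timestamp's average, so a node that is down does not pull the cluster average toward zero.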
Plans for the Near Future
● 160 newly delivered IBM nodes to be brought online
● Expect purchase bid to go out for ~220 more nodes at beginning of FY03 (pending funding approval)
● Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)
● Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services
● Expanding Atlas GRID services