CASPUR Site Report

Andrei Maslennikov

Sector Leader - Systems

Orsay, April 2001

Topics covered briefly:

• Central computers
• Storage
• Tape-related systems
• Central services
• AFS, SSH...
• External support
• CASPUR and HEP
• Projects for year 2001

Central computers

Alpha SMP - 4100/ES40 - 48 processors - DU 4.0F+
- Front-end: 4 CPU x 500 MHz/2 GB + 2 x 532 MHz/1 GB
- Parallel batch (GRD/Codine): 20 CPU x 500 MHz/2 GB + 20 x 400 MHz/1 GB
- Serial batch (GRD/Codine): 2 CPU x 667 MHz/1 GB
- Systems very stable for the past year
- Batch load: 100%

IBM SMP - Power3 - 72 processors - AIX 4.3.3 ML8 / PSSP 3.2+
- Front-end: 4 CPU x 375 MHz/4 GB
- Parallel batch (GRD/Codine): 64 CPU x 375 MHz/16 GB - 4 nodes on Colony Switch
- Serial batch: 4 CPU (2 x 375 MHz / 2 x 200 MHz)
- Batch load: 100%

Sun SMP - 3500/4500 - 22 processors - Solaris 7+
- Parallel batch (GRD/Codine) on 8 (336 MHz/2 GB) and 14 (336 MHz/3.6 GB) CPUs
- Not very popular, load < 40%

Storage

Scratch Areas
- Local RAID-0 scratch areas on Alphas and Sun (30-40 GB per node)
- IBM 2102 FC Storage Server for SP3 with 560 GB - GPFS

Large-File Network Data Areas (NFS)
- Two Network Appliance Filers: F540 (150 GB/FE) and F760 (600 GB/GE)
- F540: mainly used for tape staging and as temporary space on the LAN - is being decommissioned
- F760: dedicated to number-crunching nodes and is not saturated - 1 TB is being added

Small-File Network Data Areas (AFS)
- some 1.4 TB of Dothill RAID-5 on switched Fibre Channel SAN
- 4 servers on CASPUR LAN (4x Sun UltraSparc 440)
- 70 GB of commodity disk on WAN (Bari and Lecce, 2x Intel/Linux RH 7.0++/OAFS)

NAS and SAN
- NAS: F760, AFS servers <=> Stager and main number-crunchers: being migrated to GE
- SAN: 48 Brocade ports, 10 hosts, 6 disk systems, 6 tape drives: grows fast
- In discussion: tests of 4 Dothill 7120 systems of 0.7 TB each as GPFS core on SP3

Tape-related systems

Tape Drives and Robotics
- 9740 STK Library (494 slots) is no longer sufficient to host both CASPUR and BABAR data
- Just acquired a new LTO/FC 3584 system from IBM - with 300 slots and 4 drives
- Choice influenced also by the work done at CERN (Baud, Collin, Curran): http://cscct.home.cern.ch/cscct/ultrium/index.htm
- LTO: excellent streaming speeds - measured 15 MB/sec native
- LTO: positioning slow (average 100 sec vs 15 sec for 9840), IBM is working on it
- LTO: 1 drive costs 1/4th of 9840, 1/?th of 9940
- Currently on FC: 9840 (bridged), DLT7000 (bridged), 4x LTO (native)
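The trade-off between fast streaming and slow positioning can be made concrete with a small calculation. The sketch below uses only the two LTO figures quoted above (15 MB/sec streaming, about 100 sec average positioning). The model itself is an assumption: one positioning operation per file, no mount or robot time, and constant streaming speed. The function names are illustrative.

```python
def retrieval_time_s(size_mb, position_s, stream_mb_s):
    """Seconds to read one file: one positioning step plus streaming time."""
    return position_s + size_mb / stream_mb_s


def effective_rate_mb_s(size_mb, position_s, stream_mb_s):
    """Average MB/sec seen by the user once positioning is included."""
    return size_mb / retrieval_time_s(size_mb, position_s, stream_mb_s)


# Figures quoted on the slide for LTO.
LTO_POSITION_S = 100.0   # average positioning time, seconds
LTO_STREAM_MB_S = 15.0   # measured native streaming speed, MB/sec

# Simplified model (assumption): one positioning per file, no mount time.
for size_mb in (100, 1500, 10000):
    rate = effective_rate_mb_s(size_mb, LTO_POSITION_S, LTO_STREAM_MB_S)
    print(f"{size_mb:>6} MB file: {rate:.1f} MB/sec effective")
```

Under this model a 1500 MB file spends as long positioning as streaming, so the effective rate is half the native one. Small files are dominated by positioning, which is why the positioning time matters for staging workloads.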

Tape services
- Automated ADSM backup for some 20 service hosts and Windows desktops
- Automated AFS backup
- Tape locking via Tape Dispatcher
- Staging Servers for CASPUR and BABAR (since 1998):
  o fully portable (perl+mysql)
  o redundant data format
  o multitape supported
  o users handle only file names
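To illustrate how a stager can let users handle only file names while files span several tapes, here is a minimal catalog sketch. It is not the CASPUR implementation, which is written in perl with mysql. Python's sqlite3 module stands in for the database, and the table layout, column names and tape labels are all hypothetical. The redundant data format mentioned above is not modelled.

```python
import sqlite3


def open_catalog():
    """Create an in-memory catalog; sqlite3 stands in for the MySQL database."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per tape segment of a file.
    db.execute(
        "CREATE TABLE segments ("
        " file_name TEXT NOT NULL,"
        " seq INTEGER NOT NULL,"
        " tape_label TEXT NOT NULL,"
        " tape_file_no INTEGER NOT NULL,"
        " PRIMARY KEY (file_name, seq))"
    )
    return db


def register(db, file_name, segments):
    """Record the tape segments of one file, in the order they were written."""
    rows = [(file_name, seq, tape, pos) for seq, (tape, pos) in enumerate(segments)]
    db.executemany("INSERT INTO segments VALUES (?, ?, ?, ?)", rows)
    db.commit()


def locate(db, file_name):
    """Return the ordered (tape_label, tape_file_no) list needed to stage a file."""
    cur = db.execute(
        "SELECT tape_label, tape_file_no FROM segments"
        " WHERE file_name = ? ORDER BY seq",
        (file_name,),
    )
    return cur.fetchall()


db = open_catalog()
# A file that spans two tapes (labels and positions are made up).
register(db, "run42.dat", [("TAPE01", 7), ("TAPE02", 1)])
print(locate(db, "run42.dat"))
```

The user supplies only `run42.dat`. The catalog returns the tapes and positions in the order they must be read, and an unknown file name yields an empty list.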

Central services

All our central services are Linux-based:
- Syscontrol, DNS, Web, Mail, License, Print, Remote Access, DB
- Linux system tree - always up-to-date
- CASPUR BigBox CD with OAFS and SSH/OAFS - always fresh

During the last year:

- Moved to uniform hardware: rack-mounted systems (VA Linux)

- System disks of Syscontrol and DB hosts on Mylex DAC960 CTL (RAID-5)

- Mail: implemented commercial HA solution from SteelEye (LifeKeeper):
  o redundant heartbeat (serial and ethernet)
  o RAID-5 spool and sw on low-end Infortrend CTL with 2 host channels
  o ping mail while halting the current host: only 5 packets lost
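The ping result above can be turned into a rough failover time. The sketch assumes probes were sent at the common default of one per second. The slide does not state the interval, so `interval_s` is an assumption and the result is only an estimate.

```python
def outage_seconds(lost_packets, interval_s=1.0):
    """Rough outage estimate: each lost probe accounts for one send interval."""
    return lost_packets * interval_s


# 5 lost packets as reported; 1-second interval is an assumption.
print(outage_seconds(5))
```

With these inputs the mail service was unreachable for roughly five seconds during the switch to the standby host.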

AFS, SSH...

AFS
- OAFS is a marvel (cheap servers possible).
- Free enhanced OAFS client RPMs available at: /afs/caspur.it/project/openafs
- Badly missing: AFS port for COMPAQ Tru64 5.1.
  • Transarc does nothing
  • OAFS port may be done at KTH. Now trying to help them to get the OS source
  • Anybody else interested?
- Maintenance contract: IBM cannot make an offer for more than a year. We receive support free of charge, but hopefully this situation will be resolved soon.

SSH
- 1.2.x dangerous - migrated urgently to openssh 2.3.0p1 (AFS-aware with direct authentication and watcher), on all architectures

External support

Outside CASPUR:
- 7 Clusters (number 8 just ordered) with about 80 nodes
- Some 20 stand-alone machines (we are getting rid of these)
- All kinds of hardware and all flavours of UNIX

Turnkey departmental solution
- OAFS Cell on Linux
  • automated backup required (DLT or AIT autoloaders)
  • redundant disk when possible
  • user and space management tools
  • UNIX and Windows clients, AFS gateway for Mac
  • Server normally stuffed with many other services: web, mail, dns, nis etc.
- Organization of work
  • local trained person per cluster a must
  • no-root-pw a must
  • remote-only support (notification mainly via e-mail)
  • max 20% of total FTE resources dedicated (mainly for initial set-up)

CASPUR and HEP

Day-to-day work for INFN:
- Full-scale AFS system support (maintenance and hotline)
- ASIS mirroring to 18 INFN Sections
- SSH tree maintenance
- Linux tree maintenance incl. bootable CDs at the latest patchlevel

BABAR Cluster at CASPUR
- 5 E450 Sun Servers with 6 TB of disk (Sun, COMPAQ, DotHill)
- Linux/OAFS file server with backup
- 10-host Sun MC Farm (Ultra 5)
- 14-host Intel/Linux MC farm (rack-mounted + 1 TB of RAID IDE)
- multitape stager on E450 - 2 STK 9840 drives
- GRD/Codine on all nodes

Other
- Regular exchanges with CERN
- Virgo (software)

Projects for year 2001

Control and Monitoring
- agent up and running on all Linux hosts
- being ported to other architectures (encryption)
- server integration with Syscontrol DB (event logs and configuration)

Syscontrol DB
- mysql now, migration to InterBase by the end of 2001
- Hosts’ DB and Syslog event collector DB
- Hooks for syscontrol applications

Problem management
- currently studying possible solutions, Razor is one of the options

Console Server
- planned for the second half of 2001
- currently looking at the serial hardware

Security
- emphasis on host-based security
- host security “index” is being developed to integrate with Syscontrol