
Page 1

HEPiX Meeting Summary
Autumn 2000, Jefferson Laboratory, Newport News, Virginia, USA

Page 2

Site Reports - 1

• Jefferson Lab (host): 4-6 GeV continuous-beam electron accelerator; 600 staff, $70M/year budget. STK silo with 8 Redwood drives (to be replaced by 10 9840s soon) plus 10 9840 drives. From 2005 will double the energy and run at 75-100 MB/sec, accumulating 3 PB/year. Also do lattice QCD and propose to buy a 256-node Compaq Alpha cluster this year. Have negotiated a good price for LSF.

• IN2P3: French national and BaBar regional centre. 6 STK silos but limited Redwood capacity. Have reserved a 40 Mbit/sec line to SLAC for BaBar data. Bought a 96-node IBM Netfinity Linux cluster in 1U-high units in a single frame. Working on a 64-bit version of RFIO.

• FNAL: Started certifying Solaris 8 and Linux RH 7. ENSTORE (cf. CASTOR) is in ‘limited’ production use. Working on migrating to Kerberos 5 (MIT) with heavy modifications for AFS. Have bought Linux PCs in a racked solution.

Page 3

Site Reports - 2

• RAL: Have closed the HP farm. Looking for a robotics upgrade – STK, IBM or ADIC (ex Grau-Abba). Problem that BaBar want Linux RH 6.2 while CDF want FNAL Linux.

• BNL: Have a 1200-node home-made QCD-SP supercomputer. Have VA Linux rack-mounted PCs and will use HP OpenView for site security management.

• LAL: Physicists use Windows NT4 on the desktop, and pressure for Linux is decreasing. They find VMware (Linux under NT) useful.

• DESY: Working on a centrally managed and fully transparent user disk cache (ex-Eurostore 2, in my opinion). Use CODINE for batch – recently bought by Sun and made open source – and LSF, but will drop LSF (too expensive). Have installed Kerberos 5 with AFS support (the Swedish version, not MIT) but not yet in production. 6 man-months of work, and the worker is leaving!

Page 4

Site Reports - 3

• INFN: Heavy use of CONDOR – 200 machines for 2 years. Building a national GRID to be synchronised with the CERN GRID. No plans to move central facilities to Linux because of the lack of management tools.

• SACLAY: Running Veritas NetBackup for all machines.

• SLAC: Going to rack-mounted Linux. Largest HPSS site by volume (200 TB cf. 25 at CERN). Will add a second instance of HPSS to support general staging. Plan to add 20 9940 drives to the existing 20 9840s, but currently getting 1/3 write media failures in field tests.

• LBL Parallel Distributed Systems Facility: Supports ATLAS and RHIC (BNL) using an AFS knfs gateway to CERN and BNL. Use HPSS at NERSC. Local farm of rack-mounted Linux PCs.

Page 5

AFS discussion with IBM-Transarc

• IBM Transarc lab supports AFS, DFS and DCE.
• Standard IBM support pricing, using local support (IBM Suisse) as first level. Local support will be trained by IBM Transarc.
• Currently AFS 3.6 (supporting RH Linux 2.2.x kernels) and no plans for 3.7. Admit AFS does not generate a lot of revenue. Interested in Kerberos 5 support but think this should be done by OpenAFS, in which case they may take it back into the product (not convincing!).
• Will release Solaris 8 and W2K clients soon (last Friday!). W2K is only tolerant, not MSI certified. Essentially a WNT port using no special W2K features. Includes some cache corruption bug fixes (hopefully ours!).
• Plan a future NAS/SAN enterprise-wide file system to replace AFS and DCE – 2/3 years away and driven by IBM San Jose.
• AFS development team is the same size but has partly moved to India.
• OpenAFS source tree not maintained by IBM (probably CMU).
• Official end of 3.6 support is end 2002 (changed to ‘after the next major release’!).
• HEPiX concludes AFS is now in maintenance mode at IBM, who want their customers to either go to OpenAFS or the enterprise system in a few years.

Page 6

Cluster Monitoring and Control

• FNAL developing Next Generation Operations (NGOP) for farm management. Has the concept of fast and slow monitoring streams, unlike CERN.
• CERN developing PEM. Trying to follow standards. May collaborate with the Sun Java competence centre.
• SLAC have an enhanced RANGER tool – based on Perl scripts and rule-sets, but with no system overview features.
• IN2P3 have just begun to develop a Global Monitoring system. They think NGOP and PEM are too complicated.
• Conclusion
  – Several projects underway
  – Collaboration of ideas occurred
  – Communication earlier in the process may have resulted in more collaboration
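The fast/slow split mentioned for NGOP can be illustrated with a toy sketch (hypothetical names, not actual NGOP code): urgent events go to a fast stream that is drained often, while bulk metrics go to a slow stream drained rarely.

```python
from collections import deque

# Toy illustration of a fast/slow monitoring split (hypothetical,
# not NGOP code): alarms take a low-latency path, accounting data
# a high-volume path.

class MonitoringStreams:
    def __init__(self):
        self.fast = deque()   # alarms, heartbeats - drained often
        self.slow = deque()   # accounting, trends - drained rarely

    def publish(self, metric, value, urgent=False):
        # Route each measurement to the appropriate stream.
        (self.fast if urgent else self.slow).append((metric, value))

    def drain_fast(self):
        # Return and clear the urgent events for immediate handling.
        events = list(self.fast)
        self.fast.clear()
        return events

streams = MonitoringStreams()
streams.publish("cpu_load", 0.42)                      # slow path
streams.publish("disk_failed", "node17", urgent=True)  # fast path
alarms = streams.drain_fast()
```

A real system would drain each stream on its own timer; the point is only that the two classes of data never contend with each other.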

Page 7

Batch systems

• Jefferson - migrating from LSF to the free Portable Batch System (PBS) from NASA, with good experiences.
• Wisconsin - home of Condor, good overview. A live product we should try. Not a batch system however (harvests desktop spare cycles).
• Platform - a sales talk for LSF! Good directions for farming architecture.
• IN2P3 - continuing to enhance their home-written BQS, including adding a Java GUI for both users and administrators.
• FNAL - have rewritten FBS as FBSNG (Farms Batch System Next Generation), now independent of LSF.
• Conclusions
  – as before, no common approach from HEP, but this gives a choice for the future if we want to stop using LSF.
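The "choice for the future" point is largely about how thin the site-facing submission layer can be. A minimal sketch (illustrative only; it builds command lines rather than submitting, and uses only the common `-q`/`-o` flags shared by LSF's `bsub` and PBS's `qsub`):

```python
# Sketch of a thin submission wrapper over two batch systems,
# illustrating why a site can keep its options open: only the
# command-line construction differs. Flags shown are the common
# ones (-q queue, -o output file); real jobs need site-specific
# resource options on top.

def submit_command(system, queue, outfile, script):
    if system == "lsf":
        return ["bsub", "-q", queue, "-o", outfile, script]
    if system == "pbs":
        return ["qsub", "-q", queue, "-o", outfile, script]
    raise ValueError(f"unknown batch system: {system}")

cmd = submit_command("pbs", "prod", "job.out", "analysis.sh")
```

In practice the wrapper would hand `cmd` to the shell; switching systems then means changing one configuration value rather than every user's job scripts.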

Page 8

Mass Storage

• CERN - wants to migrate off Redwoods to (probably) STK 9940
  – deploying CASTOR phase 1 now for all new experiment data
  – will keep HPSS for now for “user tapes” data
  – will tender at the end of 2001 for the next 4-5 years of robotics/drives
• Jefferson - have been using OSM, but it is now a dead product, so they are developing their own JASMine tape/disk data-mover system. They already have a mature disk pool management system.
• Conclusion
  – Various projects for mass storage interfaces: CASTOR, JASMine, Enstore
  – Continued problems with STK Redwood drives (all sites in discussion)
  – Reported problems with STK 9940 (SLAC; not yet seen at CERN)
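The systems above all keep a bounded disk cache in front of tape. A minimal sketch of the eviction idea behind such disk pool managers (illustrative only, not CASTOR, JASMine or Enstore code; a real stager would copy evicted files back to tape first):

```python
from collections import OrderedDict

# Minimal sketch of a disk cache in front of tape (illustrative,
# not any product's code). Least-recently-used files are evicted
# when a new stage-in would exceed the pool's capacity.

class DiskPool:
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.files = OrderedDict()  # name -> size, kept in LRU order

    def stage_in(self, name, size_gb):
        if name in self.files:
            self.files.move_to_end(name)  # cache hit: refresh LRU rank
            return
        # Evict least-recently-used files until the new file fits.
        while sum(self.files.values()) + size_gb > self.capacity:
            self.files.popitem(last=False)
        self.files[name] = size_gb

pool = DiskPool(capacity_gb=10)
pool.stage_in("run1.dat", 4)
pool.stage_in("run2.dat", 3)
pool.stage_in("run1.dat", 4)   # hit: run1 becomes most recent
pool.stage_in("run3.dat", 5)   # evicts run2, the least recent
```

The operational questions the sites were comparing (pool sizing, garbage-collection policy, tape write-back) all hang off this simple core.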

Page 9

Grid

• Projects at all major labs related to data grids:
  – INFN internal grid to ‘synchronise’ with CERN
  – US GriPhyN internal physics grid
  – RAL developing a Globus infrastructure between it and FNAL for CDF collaboration work (4 UK universities + RAL + FNAL)

• HEP distributed computing model can exploit the concepts of the Grid work

• Technical and management solutions need to be developed

• It’s clear that coordination between sites/labs will be needed - danger of incompatible European and US Grids

Page 10

Linux

• Usage continues to grow. Most sites are using rack-mounted systems, either vendor-integrated or just pizza boxes in standard racks.

• Expertise in management also continues to grow

• Commodity computing analysis farms are in our future

• Human and $ resources are in short supply everywhere

Page 11

Work To Be Done - 1

• Joint OS certification
  – CERN, Fermi and SLAC starting a project to jointly certify Linux and Solaris OSes
  – Aim is to limit duplication of effort
  – Are there others who would like to work on this?
• HEPiX mail list clean-up
  – Lists are moving to listserv.fnal.gov (web interface)
  – Default setup requires list owner permission to subscribe, but this process is automated
  – Anyone on a list can post to it
  – Detailed info on LISTSERV usage is online
• HEPiX web pages need to be updated with new info

Page 12

Work To Be Done - 2

• Large Cluster SIG
  – Alan Silverman organising this special interest group
  – Proposed to meet separately from HEPiX/HEPNT, but would also report status at HEPiX/HEPNT
• Primary goals:
  – Keep sites aware of what relevant work is in progress or even planned
  – Be aware of and promote collaboration

Page 13

Future Meetings

• LAL and CERN have volunteered to host the spring ’01 meeting
  – Since LAL offered first, we will pursue this option to start
  – Plan for a meeting in April (Easter is 4/15)

• Joint meeting with HEPNT will continue

• Volunteers for Fall ’01 North American meeting?