L. Arrabito, D. Bouvet, X. Canehan, P. Girard, Y. Perret, S. Poulat, R. Rumler
SW distribution tests at Lyon
Pierre Girard, Luisa Arrabito, David Bouvet, Yannick Perret, Xavier Canehan, Suzanne Poulat, Rolf Rumler
Initially presented at Jamboree LHCb
March 7th, 2011
Updated on March 25th for Atlas CAF
Content
• AFS Latency story
  – AFS latency problem schedule
  – LHCb SetupProject tests
  – Preliminary conclusions
• xxx-FS client stress tests
  – Test suite results
  – CVMFS test results
• Conclusions
AFS latency problem schedule
Timeline axis: May, June, July, Aug., Sept., Oct., Nov., Dec., Jan. 2011

Dated events
• 17/05: SetupProject.sh timeout (5%)
• 17/06: SetupProject.sh timeout (0.5%)
• 08/07: New LHCb setup timeout problems
• 06/09: CCIN2P3 adding new HW (24 logical cores): 110 WNs
• 03/11: ATLAS timeout problems (50% failures)
• 05/11: Atlas increased its timeout (3600s)
• 26/11: PARTLY SOLVED, LHCb increased its timeout (3600s)
• 07/01: SOLVED, CCIN2P3 reduced the number of job slots on recent HW

AFS Story
• New AFS server version
• RO AFS Volumes for LHCb SW Area
• New AFS client

SL5 Story
• Many WN crashes due to IO problem
• Many WNs freezing (temporarily) after OS patch
• Stable SL5 WNs after new kernel patch

Tests Story
• LHCb test infrastructure setup for different tunings (AFS, SL5, LRMS)
• Tests of different kernel parameters
LHCb job setup tests
Source: L. Arrabito
AFS Latency / LHCb job efficiency
All-T1s CPU efficiency during the same period (%)

| Site | Jan 10 | Feb 10 | Mar 10 | Apr 10 | May 10 | Jun 10 | Jul 10 | Aug 10 | Sep 10 | Oct 10 | Nov 10 | Dec 10 | Jan 11 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC | 39.9 | 46.3 | 67.1 | 80.3 | 84.7 | 87.3 | 88.5 | 88.6 | 79.8 | 80.7 | 88.6 | 88.5 | 90.9 | 86.0 |
| All T1s | 42.7 | 58.4 | 74.6 | 82.4 | 88.2 | 88.4 | 88.3 | 73.5 | 79.4 | 85.1 | 81.7 | 89.8 | 92.8 | 82.5 |

Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html
Visible effect of adding latest (24-cores) WNs?
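One way to look at the question above is to compute the month-to-month change of the efficiency figures from the table. The sketch below hard-codes the monthly values (the "Total" column is left out); the function names are illustrative only. It shows that the largest drop at CC falls in September 2010, the month in which the schedule slide places the arrival of the new 24-core WNs. This is only a correlation and does not by itself establish the cause.

```python
# Monthly CPU efficiency (%) copied from the slide, Jan 10 to Jan 11.
MONTHS = ["Jan 10", "Feb 10", "Mar 10", "Apr 10", "May 10", "Jun 10", "Jul 10",
          "Aug 10", "Sep 10", "Oct 10", "Nov 10", "Dec 10", "Jan 11"]
CC = [39.9, 46.3, 67.1, 80.3, 84.7, 87.3, 88.5, 88.6, 79.8, 80.7, 88.6, 88.5, 90.9]
ALL_T1S = [42.7, 58.4, 74.6, 82.4, 88.2, 88.4, 88.3, 73.5, 79.4, 85.1, 81.7, 89.8, 92.8]


def monthly_changes(values):
    """Return the difference between each month and the previous one."""
    return [round(curr - prev, 1) for prev, curr in zip(values, values[1:])]


def largest_drop(months, values):
    """Return (month, change) for the most negative month-to-month change."""
    changes = monthly_changes(values)
    idx = min(range(len(changes)), key=changes.__getitem__)
    # changes[idx] compares months[idx + 1] with months[idx].
    return months[idx + 1], changes[idx]


if __name__ == "__main__":
    print("CC:     ", largest_drop(MONTHS, CC))
    print("All T1s:", largest_drop(MONTHS, ALL_T1S))
```

The same helper applied to the all-T1s row puts the largest drop one month earlier, in August 2010, so the September dip at CC is not simply mirrored by the other sites.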
Preliminary conclusions
• LHCb/ATLAS environment setup is very (too) FS-intensive
  – Measured by stracing SetupProject.sh: 17 868 open() calls, 110 765 stat() calls
• Investigate a job distribution strategy that avoids too many similar jobs on the same WN
  – Suggested by the “lhcb-alone” and “atlas-excluded” test results
• The AFS latency problem is now an AFS client scalability problem
  – Temporarily solved by decreasing the number of job slots on the most recent machines, but will it be a major concern in the near future?
• Have other sites already experienced the same problem?
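The open()/stat() figures quoted above come from stracing SetupProject.sh. A minimal sketch of how such counts could be obtained from an strace log is shown below. The exact strace invocation used by the authors is not given in the slides, so the log layout handled here (an optional PID and timestamp prefix, then `name(args) = result`) and the function name `count_syscalls` are assumptions.

```python
import re
from collections import Counter

# Syscall name at the start of an strace line, optionally preceded by a PID
# (as written with "strace -f") and/or a timestamp (as written with -t/-tt/-r).
_SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:[\d:.]+\s+)?([a-z_][a-z0-9_]*)\("
)


def count_syscalls(lines):
    """Count system calls by name in an iterable of strace output lines.

    Lines that do not start a call (signals, exit notices, "<... resumed>"
    continuations) are ignored, so an interrupted call is counted once.
    """
    counts = Counter()
    for line in lines:
        match = _SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'stat("/afs/example/lhcb", {st_mode=S_IFDIR|0755, ...}) = 0',
        '1234 open("/afs/example/lhcb/setup.py", O_RDONLY) = 3',
        '[pid  1235] stat("/afs/example/missing", 0x7ffd) = -1 ENOENT',
        "--- SIGCHLD (Child exited) @ 0 (0) ---",
    ]
    print(count_syscalls(sample))
```

The paths in the sample are placeholders. Variants such as `lstat` or `stat64` are counted under their own names; whether the figures in the slides include them is not stated.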
Test suite conditions
• For each test
  – Walk of the same directory tree (DAVINCI)
  – The same actions are performed in the same order, LHCb SetupProject-like: 100 000 stat(), 7 000 open()
    – The first block is read to ensure the open() is effective
• The cache (if any) is pre-loaded by executing the test once beforehand
• Results are averaged over 4 executions
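The conditions above can be sketched as a small Python harness. This is not the authors' test suite: the function names, the 4096-byte block size, and the choice to stat() every entry and open every regular file are assumptions made for illustration (the real suite issued roughly 100 000 stat() and 7 000 open() calls, so it clearly did not open every file it examined).

```python
import os
import time

BLOCK_SIZE = 4096  # assumed size of the "first block"; not stated in the slides


def walk_once(root):
    """Walk `root` in a fixed order, stat() every entry, read the first block of every file.

    Returns (number of successful stat calls, number of files opened).
    Entries that cannot be accessed are skipped.
    """
    n_stat = n_open = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place fixes the traversal order, so every run performs
        # the same actions in the same order.
        dirnames.sort()
        filenames.sort()
        for name in dirnames + filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                n_stat += 1
            except OSError:
                pass
        for name in filenames:
            try:
                with open(os.path.join(dirpath, name), "rb") as handle:
                    handle.read(BLOCK_SIZE)  # force the open() to fetch data
                n_open += 1
            except OSError:
                pass
    return n_stat, n_open


def run_test(root, runs=4):
    """Pre-load the cache with one untimed pass, then average `runs` timed passes (seconds)."""
    walk_once(root)  # warm-up pass, not measured
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        walk_once(root)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

Running several copies of `run_test` concurrently on one WN, against the same tree, would approximate the client-side load that many similar jobs generate.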
FS Test Results
CVMFS test suite conditions
• Dedicated SQUID
  – Used by the tested WN only
  – With pre-loaded LHCb cache
• On the CVMFS client (0.2.53-1), before each test
  – Cache was removed
  – Service was restarted
  – Different cache sizes
    – « ls -lR » on sibling directories to grow the cache
CVMFS Cache Size Tests Results
CVMFS Min/Max results
Comparison with latest CVMFS (0.2.61)
NEW (2011/03/25)
Conclusions
• LHCb/Atlas job setup should be optimized
• Multi-VO sites should try to distribute each VO's jobs fairly over the cluster WNs
  – Restrict the number of similar jobs on a WN
• Issue of shared FS client scalability (most likely)
  – Checked with AFS, NFS4.0, and CVMFS
  – Tests must go on
    – NFS4.1 (pNFS) still to be tested
    – With other HW (for now, only PowerEdge C6100)
    – By virtualizing the WNs (“divide and rule” principle)
      – A first attempt basically split a 24-core WN into 2 x (12-core VM-WN)
      – Must be further investigated
• CVMFS
  – Interesting for VO SW distribution (without installation job)
  – But take care: latency increases with cache size
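The recommendation to restrict the number of similar jobs on a WN can be illustrated with a toy placement rule. This is a hypothetical sketch, not the CCIN2P3 LRMS configuration: `place_job`, `slots_per_node` and `max_same_vo` are invented names, and a real batch system would express the same constraint through its own scheduling policy.

```python
from collections import Counter


def place_job(vo, nodes, slots_per_node, max_same_vo):
    """Pick a WN for a job of virtual organisation `vo`, or return None.

    `nodes` maps a WN name to a Counter of running jobs per VO. A WN is
    eligible if it has a free slot and runs fewer than `max_same_vo` jobs
    of the same VO. The chosen WN's counter is updated in place.
    """
    candidates = [
        name for name, running in nodes.items()
        if sum(running.values()) < slots_per_node and running[vo] < max_same_vo
    ]
    if not candidates:
        return None
    # Prefer the WN with the fewest jobs of this VO, then the least loaded
    # one; the name breaks ties so that the result is deterministic.
    best = min(candidates, key=lambda n: (nodes[n][vo], sum(nodes[n].values()), n))
    nodes[best][vo] += 1
    return best


if __name__ == "__main__":
    cluster = {"wn1": Counter(), "wn2": Counter()}
    for _ in range(5):
        print(place_job("lhcb", cluster, slots_per_node=24, max_same_vo=2))
```

With a cap of 2 similar jobs per WN, the fifth LHCb job is refused on a two-node cluster even though slots remain free. That is the trade-off of such a rule: fewer identical setup phases hitting one FS client at the same time, at the cost of possibly idle slots.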
Questions & Comments
Other T1s CPU Efficiency
From January 2010 to January 2011
[Figure: one CPU efficiency plot per T1: CERN, KIT, PIC, CNAF, NL-T1, RAL]
Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html
Virtualized WN: divide and rule ?