ExperimentSupport
Introduction to HammerCloud for The LHCb Experiment
Dan van der Ster
CERN IT Experiment Support
3 June 2010
Outline
• Introduction to HammerCloud
  – Motivation, History, Use-Cases
• How HammerCloud works
  – Design and Implementation Details
• Interface Tour for Users and Admins
• Possibilities for an LHCb Plugin
HammerCloud Introduction for LHCb – 2
Introduction to HammerCloud
• HammerCloud (HC) is a Distributed Analysis (DA) testing system serving two use-cases:
  – Robot-like Functional Testing: frequent “ping” jobs to all sites to perform basic site validation
  – DA Stress Testing: on-demand large-scale stress tests using real analysis jobs to test one or many sites simultaneously to:
    • Help commission new sites
    • Evaluate changes to site infrastructure
    • Evaluate SW changes
    • Compare site performances
    • …
HammerCloud and Job Robots
• HammerCloud is part of an evolution of job robots:
  – The CMS Job Robot inspired the ATLAS GangaRobot (functional testing)
  – In ~Sept 2008, a form of the ATLAS GangaRobot was used to manually stress test the Italian ATLAS Tier2s:
    • 5 users manually submitting hundreds of instrumented jobs simultaneously (SIMD)
    • Manual results collection and summarization
    • Early results were shown to be very useful:
      – One early test showed a bimodal performance plot that was later traced to a faulty network switch which negatively affected the performance of some WNs. The need for an automated DA stress testing system was clear.
  – HammerCloud was born in November 2008 to deliver on-demand stress tests to ATLAS sites:
    • Since then HC has run >1300 “Tests” using more than 4 million jobs.
    • ATLAS has invested >200k CPU-days in HC tests
  – CMS has also agreed to use HC: in April a prototype was delivered, and now scale tests are about to begin.
HC and ATLAS during STEP’09
[Figure: plot labelled STEP’09]
HammerCloud Use-Cases
• Provides On-Demand and Automated Testing
• HC Operators define test templates: FUNCTIONAL and STRESS
• Functional Tests are automatically scheduled
  – Results are published on the HC website and can be pushed to other systems (e.g. SAM)
• Stress tests are generally scheduled on demand as needed by:
  – Central VO managers
  – Cloud/Regional managers
  – Site managers
• For all tests, a detailed report summarizing the job success rates and performances is produced.
HammerCloud Components
• The HC UI is implemented as a Django web app:
  – View test results
  – View cloud/site evolution
  – DB Admin
• State is maintained in a MySQL DB
• HC Logic (job submission, monitoring, resubmission) is implemented on top of the Ganga Grid Programming Interface (GPI)
HammerCloud Logic
• An HC Test is described by:
  – The analysis code to run (typically a real analysis from the user community)
  – The dataset pattern (which can be resolved to a set of datasets appropriate for the analysis code)
  – The list of sites to be tested, and the target number of jobs to run concurrently per site
  – A start time and an end time
• Test execution proceeds in 4 steps:
  – Generate: the test description is converted to a set of submittable jobs (e.g. Ganga job objects, one for each site under test)
  – Submit: the job objects are submitted
  – Run: jobs are monitored, outputs are recorded to the HC DB, and jobs are resubmitted to achieve the target number of running jobs per site
  – Exit: at the test end time, leftover jobs are killed
• Concurrently, the HC Web shows real-time test results
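The four steps above can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions: `HCTest`, `FakeJob`, and the function names are hypothetical, time is counted in abstract ticks, and every running job is assumed to finish after one tick. None of this is the Ganga GPI or HammerCloud's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class HCTest:
    """Hypothetical container for the four parts of an HC Test description."""
    analysis_code: str
    dataset_pattern: str
    sites: List[str]
    target_jobs_per_site: int
    start_time: int  # abstract clock ticks, not real timestamps
    end_time: int


@dataclass
class FakeJob:
    """Stand-in for a Ganga job object (assumption, not the Ganga API)."""
    site: str
    status: str = "new"  # new -> running -> completed or killed


def generate(test: HCTest) -> List[FakeJob]:
    # Generate: one submittable job per site under test.
    return [FakeJob(site) for site in test.sites]


def submit(jobs: List[FakeJob]) -> None:
    # Submit: hand every new job to the (simulated) grid.
    for job in jobs:
        if job.status == "new":
            job.status = "running"


def run_tick(test: HCTest, jobs: List[FakeJob], results: Dict[str, int]) -> None:
    # Run: record finished jobs (the dict stands in for the HC DB) ...
    for job in jobs:
        if job.status == "running":
            job.status = "completed"  # toy model: jobs finish after one tick
            results[job.site] = results.get(job.site, 0) + 1
    # ... then resubmit until each site is back at its target of running jobs.
    for site in test.sites:
        running = sum(1 for j in jobs if j.site == site and j.status == "running")
        top_up = [FakeJob(site) for _ in range(test.target_jobs_per_site - running)]
        submit(top_up)
        jobs.extend(top_up)


def run_test(test: HCTest) -> Tuple[Dict[str, int], int]:
    jobs = generate(test)
    submit(jobs)
    results: Dict[str, int] = {}
    for _tick in range(test.start_time, test.end_time):
        run_tick(test, jobs, results)
    # Exit: at the end time, kill whatever is still running.
    killed = 0
    for job in jobs:
        if job.status == "running":
            job.status = "killed"
            killed += 1
    return results, killed


demo = HCTest("my_analysis.py", "some.dataset.*", ["SITE_A", "SITE_B"], 3, 0, 4)
completed, killed = run_test(demo)
print(completed, killed)
```

In this toy model the per-site target is restored after every monitoring pass, which is the behaviour the Run step describes; a real system would poll job states asynchronously instead of assuming that jobs finish on every tick.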
An HC-LHCb Plugin
• What customizations would be needed for an HC-LHCb plugin?
• HC is built upon Ganga and exploits its job management features:
  – job repository, job configuration via Python, job submission, job monitoring in background thread(s)
• Given the existing GangaLHCb plugins, modifications to HC itself would be relatively minor, e.g.
  – HC Test Generation:
    • Query a data discovery service to form a job processing random input data
  – HC Test Running:
    • Changes to extract LHCb-specific job metrics from Ganga
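One way to picture the two customization points above is as a pair of hooks that an experiment plugin fills in. The sketch below is an assumption made for illustration: the class names, method names, metric names, and file paths are all hypothetical, and a plain list stands in for the data discovery service. It does not reflect HammerCloud's or GangaLHCb's real interfaces.

```python
import fnmatch
import random
from abc import ABC, abstractmethod


class ExperimentPlugin(ABC):
    """Hypothetical hooks; not HammerCloud's actual plugin API."""

    @abstractmethod
    def pick_input_data(self, dataset_pattern, n_files):
        """Test Generation hook: choose input data for one job."""

    @abstractmethod
    def extract_metrics(self, job_record):
        """Test Running hook: turn a finished job into metrics to store."""


class ToyLHCbPlugin(ExperimentPlugin):
    def __init__(self, catalogue, seed=None):
        # `catalogue` is a plain list standing in for a data discovery service.
        self.catalogue = catalogue
        self.rng = random.Random(seed)

    def pick_input_data(self, dataset_pattern, n_files):
        # Resolve the pattern, then draw random input files from the matches.
        matches = sorted(fnmatch.filter(self.catalogue, dataset_pattern))
        return self.rng.sample(matches, min(n_files, len(matches)))

    def extract_metrics(self, job_record):
        # Metric names are illustrative; a real plugin would read them from Ganga.
        return {
            "events_per_second": job_record["events"] / job_record["wall_s"],
            "cpu_over_wall": job_record["cpu_s"] / job_record["wall_s"],
        }


plugin = ToyLHCbPlugin(["/lhcb/a/1.dst", "/lhcb/a/2.dst", "/lhcb/b/1.dst"], seed=1)
inputs = plugin.pick_input_data("/lhcb/a/*", n_files=1)
metrics = plugin.extract_metrics({"events": 1000, "wall_s": 200.0, "cpu_s": 150.0})
print(inputs, metrics)
```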
Interface Tour
1. The Public User Interface
HC Home
• The HC Homepage lists the running and scheduled tests.
Viewing a Test
• The test overview gives a quick summary of overall job efficiency, CPU/Walltime, and Events/WrapperTime
• It also shows a summary of the jobs running at each site involved in the test.
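As a rough illustration of how the three overview numbers could be derived from per-job records, consider the sketch below. The field names are hypothetical, and the definitions are assumptions: efficiency is taken as completed jobs over all finished jobs, and the two ratios are averaged per job over completed jobs only. The slides do not state how HC actually computes them.

```python
from statistics import mean


def summarize(jobs):
    """Compute overview metrics from a list of job dicts (hypothetical fields)."""
    finished = [j for j in jobs if j["status"] in ("completed", "failed")]
    ok = [j for j in finished if j["status"] == "completed"]
    if not ok:
        # Nothing to average over; report efficiency only.
        return {"efficiency": 0.0, "cpu_over_wall": None, "events_per_wrapper_s": None}
    return {
        # Assumed definition: completed / (completed + failed).
        "efficiency": len(ok) / len(finished),
        # Assumed definition: mean of per-job CPU time over wall-clock time.
        "cpu_over_wall": mean(j["cpu_s"] / j["wall_s"] for j in ok),
        # Assumed definition: mean of per-job events over wrapper time.
        "events_per_wrapper_s": mean(j["events"] / j["wrapper_s"] for j in ok),
    }


sample_jobs = [
    {"status": "completed", "cpu_s": 50.0, "wall_s": 100.0, "events": 1000, "wrapper_s": 100.0},
    {"status": "completed", "cpu_s": 75.0, "wall_s": 100.0, "events": 2000, "wrapper_s": 100.0},
    {"status": "completed", "cpu_s": 25.0, "wall_s": 100.0, "events": 300, "wrapper_s": 100.0},
    {"status": "failed", "cpu_s": 0.0, "wall_s": 10.0, "events": 0, "wrapper_s": 10.0},
    {"status": "running", "cpu_s": 0.0, "wall_s": 0.0, "events": 0, "wrapper_s": 0.0},
]
summary = summarize(sample_jobs)
print(summary)
```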
Viewing a Test: Summary Stats
• The Test Overview page also gives summary statistics by site
• Here you can see some example metrics (for CMS)
Viewing a Test: Per-Site Plots
• View plots of the recorded metrics for each site
Viewing a Test: Metric Comparisons
• View the plots for all sites for a specific metric
• Used to compare site-by-site
Modify a Running Test
• Authorized users can modify the parameters of a test at run time
  – E.g. change the end time, or number of running jobs per site
Clone a Previous Test
• Cloning a previous test is simple
  – Useful to repeat the test or to run an identical test at a different set of sites
Overall HC Plots
• Historical plots show previous test statistics
• Currently they show the number of running jobs per site. Plots showing the evolution of the performance metrics are in development.
HC Robot View
• The “Robot” view is used to show the success rates of functional test jobs over the past 24 hrs. (Similar to SSB)
• Clicking a site takes you to the list of Robot jobs executed at that site
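The per-site success rate over a sliding 24-hour window could be computed along the following lines. This is a minimal sketch: the tuple layout `(site, finished_at, succeeded)` and the function name are assumptions made for the example, not the schema of the HC DB.

```python
from datetime import datetime, timedelta


def success_rates(jobs, now, window=timedelta(hours=24)):
    """Return {site: fraction of successful jobs} for jobs finished within `window`."""
    totals, successes = {}, {}
    for site, finished_at, succeeded in jobs:
        if now - window <= finished_at <= now:
            totals[site] = totals.get(site, 0) + 1
            successes[site] = successes.get(site, 0) + (1 if succeeded else 0)
    return {site: successes[site] / totals[site] for site in totals}


now = datetime(2010, 6, 3, 12, 0)
robot_jobs = [
    ("SITE_A", now - timedelta(hours=1), True),
    ("SITE_A", now - timedelta(hours=2), False),
    ("SITE_A", now - timedelta(hours=30), False),  # outside the 24 h window
    ("SITE_B", now - timedelta(hours=3), True),
]
rates = success_rates(robot_jobs, now)
print(rates)
```

Sites with no jobs in the window are simply absent from the result, which leaves the caller free to display them as "no data" rather than as 0% success.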
Interface Tour
2. Admin Interface
HC Admin: Operator and User Views
• HC Operators can administer all tables in the HC DB via a web interface
• HC Users have more limited access
HC Admin: Tests and Templates
[Screenshots] Above: list of all Test Templates. Below: list of all Tests.
HC Admin: Edit a Test Template
• Test templates are defined via the Admin UI
• All of the parameters of a test are here, plus:
  – An active flag indicating that a template should be auto-scheduled
  – A default lifetime: auto-scheduled test instances of this template will run for this time period
• Normally, functional test templates include the list of sites to be tested, whereas stress test templates do not include a list of sites.
HC Admin: Adding a new Test
• Adding a new test on-demand is simple. Select the test template of interest, a start time, and an end time.
• If needed, Tests can be further customized after the template is copied over.
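The relationship between templates and tests described on the two admin slides above can be sketched as follows. The class and function names are hypothetical and the fields are reduced to the ones the slides mention (active flag, default lifetime, optional site list); this is not the actual HC data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HCTemplate:
    """Hypothetical test template with the fields named on the slides."""
    name: str
    category: str  # "FUNCTIONAL" or "STRESS"
    active: bool  # active templates are auto-scheduled
    default_lifetime: timedelta
    sites: List[str] = field(default_factory=list)  # usually empty for stress templates


@dataclass
class HCTestInstance:
    template: str
    sites: List[str]
    start: datetime
    end: datetime


def auto_schedule(templates, now):
    # Each active template yields a test that runs for its default lifetime.
    return [
        HCTestInstance(t.name, list(t.sites), now, now + t.default_lifetime)
        for t in templates
        if t.active
    ]


def add_test(template, start, end, sites: Optional[List[str]] = None):
    # On-demand: copy the template, then customize (here, only the site list).
    chosen = list(sites) if sites is not None else list(template.sites)
    return HCTestInstance(template.name, chosen, start, end)


now = datetime(2010, 6, 3)
functional = HCTemplate("functional-ping", "FUNCTIONAL", True, timedelta(days=1), ["SITE_A", "SITE_B"])
stress = HCTemplate("stress-analysis", "STRESS", False, timedelta(days=2))

scheduled = auto_schedule([functional, stress], now)
on_demand = add_test(stress, now, now + timedelta(hours=12), sites=["SITE_C"])
print(scheduled, on_demand)
```

Copying the site list into each test instance, rather than sharing the template's list, mirrors the idea that a test can be customized after the template is copied over without altering the template itself.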
Summary
• HammerCloud is a DA functional and stress testing system used widely by ATLAS and coming soon for CMS
• Two basic use-cases:
  – Continuous stream of test jobs to measure site availability
  – Enable central managers to define standardized (stress) tests, and empower site managers to invoke those tests on-demand.
• An HC-LHCb plugin would leverage the existing GangaLHCb work
  – A prototype plugin would not take significant effort