www.egi.eu | EGI-InSPIRE RI-261323
EGI-InSPIRE
CERN and HelixNebula, the Science Cloud
Fernando Barreiro Megino (CERN IT)
Ramón Medrano Llamas (CERN IT)
Dan van der Ster (CERN IT)
Rodney Walker (LMU Munich)
HelixNebula at a glance
The ATLAS Flagship
Can ATLAS jobs run on cloud resources?
• Run Monte Carlo jobs on 3 commercial providers (Atos, CloudSigma, T-Systems):
  o Very simple use case without any data requirements
  o Tens of MB in/out
  o CPU-intensive jobs with durations of ~6-12 hours/job
Configuration
Each provider cloud (Atos, CloudSigma, T-Systems) runs the same layout: one head node plus workers.
• Head node: CernVM 2.5.x with gmond, gmetad, httpd, rrdtools, condor (Condor head), cvmfs
• Workers: CernVM 2.5.x with gmond, condor, cvmfs
• Jobs: dispatched by the PanDA server to the HELIX Atos, HELIX CloudSigma and HELIX T-Systems sites
• Software: distributed via CernVM-FS
• Input and output data: staged from/to the CERN EOS Storage Element

PanDA: ATLAS workload management system
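The data flow in the configuration above (workers pull jobs from the PanDA server, read ATLAS software from CernVM-FS, and stage data against the CERN EOS Storage Element) can be sketched as a minimal worker loop. This is an illustrative sketch only: the field names, release string and helper layout are invented and do not reflect the real PanDA pilot protocol.

```python
import subprocess

# Invented constants -- the real PanDA pilot uses a different protocol.
CVMFS_SW = "/cvmfs/atlas.cern.ch/repo/sw"   # software served via CernVM-FS
EOS_SE = "root://eosatlas.cern.ch/"         # CERN EOS Storage Element

def build_payload_cmd(job):
    """Command line for one MC job; 'release' and 'jobpars' are assumed fields."""
    return [f"{CVMFS_SW}/{job['release']}/athena.py"] + job["jobpars"]

def run_job(job):
    """Run the payload locally, then stage the output file to EOS via xrdcp."""
    subprocess.run(build_payload_cmd(job), check=True)
    subprocess.run(["xrdcp", job["outfile"], EOS_SE + job["outpath"]], check=True)
```

Because the use case has almost no data requirements, the only wide-area traffic in this loop is the small stage-out at the end.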
Cloud deployments
• Each cloud offering was different
• Different concepts of IaaS
  o Persistent VMs: cloning of the full disk to boot a new instance
  o Ephemeral VMs: can be accidentally lost
• Different image formats of CernVM were required
• Proprietary interfaces
• User contextualization was only straightforward in one provider
  o Otherwise we followed the "golden disk" model

Consequences:
• We needed to start from scratch with each provider
• Set-up and configuration were time-consuming and limited the amount of testing
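Where user contextualization was available, deployment reduces to injecting a small boot-time script into a stock CernVM image instead of cloning a pre-configured "golden disk". A minimal sketch of such a generator; the template contents and paths are assumptions, not the actual scripts used in the PoC.

```python
# Sketch of a contextualization payload: turn a generic CernVM instance into
# a Condor worker of a given site head node. Paths and commands are illustrative.
USER_DATA_TEMPLATE = """#!/bin/sh
cvmfs_config setup
echo "CONDOR_HOST = {head_node}" >> /etc/condor/condor_config.local
service condor start
"""

def render_user_data(head_node: str) -> str:
    """Render the boot script passed to the provider as instance user-data."""
    return USER_DATA_TEMPLATE.format(head_node=head_node)
```

With the "golden disk" model the same configuration is instead baked into the image, so every change means rebuilding and re-uploading a full disk image per provider format.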
Results overview
• At the moment we are the only institute to successfully finish the 3 PoCs
• In total we ran ~40,000 CPU days
  o A few hundred cores per provider for several weeks
  o CloudSigma generously allowed us to run on 100 cores for several months beyond the PoC
• Running was smooth; most errors were related to network limitations
  o Bandwidth was generally tight
  o One provider required a VPN to CERN
• Besides Monte Carlo simulation, we executed HammerCloud jobs:
  o Standardized type of jobs
  o Production Functional Test
  o All sites running the same workload
  o Performance measuring and comparison
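The performance-comparison part of this setup boils down to running the same workload everywhere and summarizing per-site job running times. A minimal sketch; only the site names come from this PoC, the job records are invented:

```python
from statistics import mean, stdev

def summarize(job_records):
    """Per-site (mean, stddev) of running times in seconds for identical jobs."""
    by_site = {}
    for site, seconds in job_records:
        by_site.setdefault(site, []).append(seconds)
    return {site: (mean(t), stdev(t)) for site, t in by_site.items()}

# Invented records for illustration:
records = [("HELIX_A", 15900), ("HELIX_A", 16600), ("HELIX_A", 16300),
           ("CERN-PROD", 8100), ("CERN-PROD", 8200), ("CERN-PROD", 8100)]
```

Because every site runs the same payload, differences in these summaries reflect the site (hardware, network, VM placement) rather than the workload.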
Total job completion time
Athena execution time
Stage-in time
Long-tailed network access distribution: VPN latency
Further observations for HELIX_A
• Finished 6 x 1000-job MC tasks over ~2 weeks
• 200 CPUs at HELIX_A were used for production tasks
• We ran 1 identical MC simulation task at CERN-PROD to get reference numbers

                  HELIX_A                       CERN-PROD
Failures          265 failed, 6000 succeeded    36 failed, 1000 succeeded
Running times     16267 s ± 7038 s              8136 s ± 765 s

• Wall-clock performance cannot be compared directly, since the hardware is different
  o HELIX_A: ~1.5 GHz AMD Opteron 6174
  o CERN-PROD: ~2.3 GHz Xeon L5640
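Since wall-clock times cannot be compared directly across different hardware, one crude, first-order normalization is to scale running time by clock frequency. This ignores per-cycle (IPC), memory and I/O differences between the Opteron and the Xeon, so treat it as a rough estimate only; the numbers are the means from the table above.

```python
# First-order comparison: mean running time scaled by clock frequency.
# Ignores IPC, memory and I/O differences, so this is only a rough estimate.
sites = {
    "HELIX_A":   {"mean_s": 16267, "ghz": 1.5},
    "CERN-PROD": {"mean_s": 8136,  "ghz": 2.3},
}

def cycles_per_job(site):
    """Approximate nominal CPU cycles per job, in units of 1e9 (GHz * seconds)."""
    s = sites[site]
    return s["mean_s"] * s["ghz"]
```

Even after this scaling, a HELIX_A job consumes roughly 30% more nominal cycles than at CERN-PROD, so clock speed alone does not explain the gap.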
[Histogram of HELIX_A running times in seconds. Avg: 16267 s, Stdev: 7038 s]
[Histogram of CERN-PROD running times in seconds. Avg: 8136 s, Stdev: 765 s]
Further observations for HELIX_A
• Not all machines perform in the same way!
• But we would be paying the same price for both
Conclusions
• We demonstrated the feasibility of running ATLAS workloads on the cloud
• We did not touch the storage and data management part
  o Each job brought its data over the WAN
  o MC simulation is best suited for this setup
• Cloud computing is still a young technology and the adoption of standards would be welcome
• We do not have cost estimates
  o The best comparison would be CHF/event on the cloud and on the sites
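The CHF/event comparison proposed above is straightforward to compute once a provider price and an event rate are known. All numbers below are invented placeholders, since (as stated) no real cost estimates were made in the PoC.

```python
# Cost per simulated event (all figures used here are made-up placeholders).
def chf_per_event(chf_per_core_hour: float, seconds_per_event: float) -> float:
    """CHF per event for a single-core job at a given per-core-hour price."""
    return chf_per_core_hour * seconds_per_event / 3600.0

# e.g. chf_per_event(0.05, 360)  # ~0.005 CHF/event, with invented inputs
```

The same formula applied to a grid site's amortized cost per core-hour would give the like-for-like number the conclusion asks for.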
Next steps for the HelixNebula partnership
• TechArch Group:
  o Working to federate the providers
  o Common API, single sign-on
  o Image & data marketplace
  o Image Factory (transparent conversion)
  o Federated accounting & billing
  o Cloud Broker service
  o Currently finalizing requirements
  o Two implementation phases:
    o Phase 1: lightweight federation, common API
    o Phase 2: image management, brokerage
• Improve connectivity by joining the cloud providers to the GEANT network
  o Only for non-commercial data
• Collaborate with the EGI Federated Cloud Task Force
  o Seek common ground
• For ATLAS: possibly repeat the PoCs after the Phase 1 implementation