www.egi.eu | EGI-InSPIRE RI-261323
EGI-InSPIRE
CERN and HelixNebula, the Science Cloud
Fernando Barreiro Megino (CERN IT)
Ramón Medrano Llamas (CERN IT)
Dan van der Ster (CERN IT)
Rodney Walker (LMU Munich)
HelixNebula at a glance
The ATLAS Flagship
Can ATLAS jobs run on cloud resources?
• Run Monte Carlo jobs on 3 commercial providers (Atos, CloudSigma, T-Systems):
  o Very simple use case without any data requirements
  o Tens of MB in/out
  o CPU-intensive jobs with durations of ~6-12 hours/job
Configuration
Each provider cloud (Atos, CloudSigma, T-Systems) runs the same layout: one head node plus workers.
• Head node: CernVM 2.5.x with gmond, gmetad, httpd, rrdtools, condor (Condor head), cvmfs
• Workers: CernVM 2.5.x with gmond, condor, cvmfs
• Jobs: dispatched by the PanDA server to the HELIX Atos, HELIX CloudSigma and HELIX T-Systems sites
• Software: distributed via CernVM-FS
• Input and output data: staged from/to the CERN EOS Storage Element

PanDA: ATLAS workload management system
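The data flow in the configuration above (workers pull jobs from the PanDA server, read ATLAS software from CernVM-FS, and stage data against the CERN EOS Storage Element) can be sketched as a minimal worker loop. This is an illustrative sketch only: the field names, release string and helper layout are invented and do not reflect the real PanDA pilot protocol.

```python
import subprocess

# Invented constants -- the real PanDA pilot uses a different protocol.
CVMFS_SW = "/cvmfs/atlas.cern.ch/repo/sw"   # software served via CernVM-FS
EOS_SE = "root://eosatlas.cern.ch/"         # CERN EOS Storage Element

def build_payload_cmd(job):
    """Command line for one MC job; 'release' and 'jobpars' are assumed fields."""
    return [f"{CVMFS_SW}/{job['release']}/athena.py"] + job["jobpars"]

def run_job(job):
    """Run the payload locally, then stage the output file to EOS via xrdcp."""
    subprocess.run(build_payload_cmd(job), check=True)
    subprocess.run(["xrdcp", job["outfile"], EOS_SE + job["outpath"]], check=True)
```

Because the use case has almost no data requirements, the only wide-area traffic in this loop is the small stage-out at the end.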
Cloud deployments
• Each cloud offering was different
• Different concepts of IaaS
  o Persistent VMs: cloning of the full disk to boot a new instance
  o Ephemeral VMs: can be accidentally lost
• Different image formats of CernVM were required
• Proprietary interfaces
• User contextualization was only straightforward in one provider
  o Otherwise we followed the "golden disk" model

Consequences:
• We needed to start from scratch with each provider
• Set-up and configuration were time-consuming and limited the amount of testing
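Where user contextualization was available, deployment reduces to injecting a small boot-time script into a stock CernVM image instead of cloning a pre-configured "golden disk". A minimal sketch of such a generator; the template contents and paths are assumptions, not the actual scripts used in the PoC.

```python
# Sketch of a contextualization payload: turn a generic CernVM instance into
# a Condor worker of a given site head node. Paths and commands are illustrative.
USER_DATA_TEMPLATE = """#!/bin/sh
cvmfs_config setup
echo "CONDOR_HOST = {head_node}" >> /etc/condor/condor_config.local
service condor start
"""

def render_user_data(head_node: str) -> str:
    """Render the boot script passed to the provider as instance user-data."""
    return USER_DATA_TEMPLATE.format(head_node=head_node)
```

With the "golden disk" model the same configuration is instead baked into the image, so every change means rebuilding and re-uploading a full disk image per provider format.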
Results overview
• At the moment we are the only institute to successfully finish the 3 PoCs
• In total we ran ~40,000 CPU days
  o A few hundred cores per provider for several weeks
  o CloudSigma generously allowed us to run on 100 cores for several months beyond the PoC
• Running was smooth; most errors were related to network limitations
  o Bandwidth was generally tight
  o One provider required a VPN to CERN
• Besides Monte Carlo simulation, we executed HammerCloud jobs:
  o Standardized type of jobs
  o Production Functional Test
  o All sites running the same workload
  o Performance measuring and comparison
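The performance-comparison part of this setup boils down to running the same workload everywhere and summarizing per-site job running times. A minimal sketch; only the site names come from this PoC, the job records are invented:

```python
from statistics import mean, stdev

def summarize(job_records):
    """Per-site (mean, stddev) of running times in seconds for identical jobs."""
    by_site = {}
    for site, seconds in job_records:
        by_site.setdefault(site, []).append(seconds)
    return {site: (mean(t), stdev(t)) for site, t in by_site.items()}

# Invented records for illustration:
records = [("HELIX_A", 15900), ("HELIX_A", 16600), ("HELIX_A", 16300),
           ("CERN-PROD", 8100), ("CERN-PROD", 8200), ("CERN-PROD", 8100)]
```

Because every site runs the same payload, differences in these summaries reflect the site (hardware, network, VM placement) rather than the workload.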
Total job completion time
Athena execution time
Stage-in time
Long-tailed network access distribution: VPN latency
Further observations for HELIX_A
• Finished 6 x 1000-job MC tasks over ~2 weeks
• 200 CPUs at HELIX_A were used for production tasks
• We ran 1 identical MC simulation task at CERN-PROD to get reference numbers

                  HELIX_A                       CERN-PROD
Failures          265 failed, 6000 succeeded    36 failed, 1000 succeeded
Running times     16267 s ± 7038 s              8136 s ± 765 s

• Wall-clock performance cannot be compared directly, since the hardware is different
  o HELIX_A: ~1.5 GHz AMD Opteron 6174
  o CERN-PROD: ~2.3 GHz Xeon L5640
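Since wall-clock times cannot be compared directly across different hardware, one crude, first-order normalization is to scale running time by clock frequency. This ignores per-cycle (IPC), memory and I/O differences between the Opteron and the Xeon, so treat it as a rough estimate only; the numbers are the means from the table above.

```python
# First-order comparison: mean running time scaled by clock frequency.
# Ignores IPC, memory and I/O differences, so this is only a rough estimate.
sites = {
    "HELIX_A":   {"mean_s": 16267, "ghz": 1.5},
    "CERN-PROD": {"mean_s": 8136,  "ghz": 2.3},
}

def cycles_per_job(site):
    """Approximate nominal CPU cycles per job, in units of 1e9 (GHz * seconds)."""
    s = sites[site]
    return s["mean_s"] * s["ghz"]
```

Even after this scaling, a HELIX_A job consumes roughly 30% more nominal cycles than at CERN-PROD, so clock speed alone does not explain the gap.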
[Histogram of HELIX_A running times in seconds. Avg: 16267 s, Stdev: 7038 s]
[Histogram of CERN-PROD running times in seconds. Avg: 8136 s, Stdev: 765 s]
Further observations for HELIX_A
• Not all machines perform in the same way!
• But we would be paying the same price for both
Conclusions
• We demonstrated the feasibility of running ATLAS workloads on the cloud
• We did not touch the storage and data management part
  o Each job brought its data over the WAN
  o MC simulation is best suited for this setup
• Cloud computing is still a young technology and the adoption of standards would be welcome
• We do not have cost estimates
  o The best comparison would be CHF/event on the cloud and on the sites
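The CHF/event comparison proposed above is straightforward to compute once a provider price and an event rate are known. All numbers below are invented placeholders, since (as stated) no real cost estimates were made in the PoC.

```python
# Cost per simulated event (all figures used here are made-up placeholders).
def chf_per_event(chf_per_core_hour: float, seconds_per_event: float) -> float:
    """CHF per event for a single-core job at a given per-core-hour price."""
    return chf_per_core_hour * seconds_per_event / 3600.0

# e.g. chf_per_event(0.05, 360)  # ~0.005 CHF/event, with invented inputs
```

The same formula applied to a grid site's amortized cost per core-hour would give the like-for-like number the conclusion asks for.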
Next steps for the HelixNebula partnership
• TechArch Group:
  o Working to federate the providers
  o Common API, single sign-on
  o Image & data marketplace
  o Image Factory (transparent conversion)
  o Federated accounting & billing
  o Cloud Broker service
  o Currently finalizing requirements
  o Two implementation phases:
    o Phase 1: lightweight federation, common API
    o Phase 2: image management, brokerage
• Improve connectivity by joining the cloud providers to the GEANT network
  o Only for non-commercial data
• Collaborate with the EGI Federated Cloud Task Force
  o Seek common ground
• For ATLAS: possibly repeat the PoCs after the Phase 1 implementation