CERN and HelixNebula, the Science Cloud




Page 1: CERN and HelixNebula, the Science Cloud


CERN and HelixNebula, the Science Cloud

Fernando Barreiro Megino (CERN IT)

Ramón Medrano Llamas (CERN IT)

Dan van der Ster (CERN IT)

Rodney Walker (LMU Munich)


Page 2: CERN and HelixNebula, the Science Cloud


HelixNebula at a glance

Page 3: CERN and HelixNebula, the Science Cloud


The ATLAS Flagship

Can ATLAS jobs run on cloud resources?

• Run Monte Carlo jobs on 3 commercial providers (Atos, CloudSigma, T-Systems):
  o Very simple use case without any data requirements
  o 10s of MB in/out
  o CPU-intensive jobs with durations of ~6-12 hours/job


Page 4: CERN and HelixNebula, the Science Cloud


Configuration


[Architecture diagram: each provider (Atos, CloudSigma, T-Systems) hosts a head node and workers. Workers run CernVM 2.5.x with gmond, condor and cvmfs; the head node runs CernVM 2.5.x with gmond, gmetad, httpd, rrdtool, cvmfs and the Condor head. Jobs arrive from the PanDA server, software is delivered via CernVM FS, and input and output data flow to/from the CERN EOS Storage Element. The three sites appear as HELIX Atos, HELIX CloudSigma and HELIX T-Systems.]

PanDA: ATLAS workload management system
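
For illustration only, here is a minimal sketch of how a worker node in this kind of setup might be contextualized; the head-node hostname is a placeholder and this is not the actual PoC script, just the standard cvmfs/Ganglia/Condor steps implied by the diagram:

```python
#!/usr/bin/env python
# Hypothetical worker contextualization sketch (not the actual PoC script):
# mount the ATLAS CernVM-FS repository, start the Ganglia agent (gmond)
# and join the Condor pool managed by the site's head node.
import subprocess

HEAD_NODE = "head.helix.example.org"  # assumed hostname, provider specific

def run(cmd):
    """Run a shell command and raise if it fails."""
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

def contextualize():
    # Experiment software is delivered read-only over HTTP via CernVM FS.
    run("mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch")
    # gmond reports worker metrics to gmetad/rrdtool on the head node.
    run("service gmond start")
    # Point the Condor worker (STARTD) at the head node and start it.
    with open("/etc/condor/condor_config.local", "a") as cfg:
        cfg.write("CONDOR_HOST = %s\n" % HEAD_NODE)
        cfg.write("DAEMON_LIST = MASTER, STARTD\n")
    run("service condor start")

if __name__ == "__main__":
    contextualize()
```

On providers where user contextualization was not available (see the next slide), the equivalent steps would instead be baked into a "golden disk" image that is cloned to boot new instances.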

Page 5: CERN and HelixNebula, the Science Cloud


Cloud deployments

• Each cloud offering was different
• Different concepts of IaaS
  • Persistent VMs: cloning of full disk to boot new instance
  • Ephemeral VMs: can be accidentally lost
• Requirement of different image formats of CernVM
• Proprietary interfaces
• Possibility of user contextualization only straightforward in one provider
  • Otherwise we followed the "golden disk" model

Consequences:
• Needed to start from scratch with each provider
• Set up and configuration was time consuming and limited the amount of testing

Page 6: CERN and HelixNebula, the Science Cloud


Results overview

• At the moment we are the only institute to have successfully finished the 3 PoCs
• In total we ran ~40,000 CPU days (see the rough check after this list)
  • A few hundred cores per provider for several weeks
  • CloudSigma generously allowed us to run on 100 cores for several months beyond the PoC
• Running was smooth; most errors were related to network limitations
  • Bandwidth was generally tight
  • One provider required a VPN to CERN
• Besides Monte Carlo simulation, we executed HammerCloud jobs:
  o Standardized type of jobs
  o Production Functional Test
  o All sites running the same workload
  o Performance measuring and comparison

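A rough back-of-the-envelope check of the quoted total, with the per-provider core count and duration assumed (the slide only says "a few hundred cores" and "several weeks"):

```python
# Hypothetical consistency check of the ~40,000 CPU-day total quoted above.
cores_per_provider = 300   # assumed; "a few hundred cores per provider"
providers = 3
weeks = 6                  # assumed; "several weeks"
cpu_days = cores_per_provider * providers * weeks * 7
print(cpu_days)            # 37800, i.e. of the order of 40,000 CPU days
```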

Page 7: CERN and HelixNebula, the Science Cloud


Total job completion time


Page 8: CERN and HelixNebula, the Science Cloud


Athena execution time


Page 9: CERN and HelixNebula, the Science Cloud


Stage in time


Long-tailed network access distribution: VPN latency

Page 10: CERN and HelixNebula, the Science Cloud


Further observations for HELIX_A


• Finished 6 x 1000-job MC tasks over ~2 weeks
• 200 CPUs at HELIX_A used for production tasks
• We ran 1 identical MC simulation task at CERN to get reference numbers

                 HELIX_A                       CERN
Failures         265 failed, 6000 succeeded    36 failed, 1000 succeeded
Running times    16267 s ± 7038 s              8136 s ± 765 s

• Wall clock performance cannot be compared directly, since the HW is different (a rough numerical illustration follows below)
  • HELIX_A: ~1.5 GHz AMD Opteron 6174
  • CERN-PROD: ~2.3 GHz Xeon L5640

[Histograms of per-job running time in seconds: HELIX_A (avg 16267 s, stdev 7038 s) and CERN-PROD (avg 8136 s, stdev 765 s).]
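
As a rough numerical illustration of the comparison above (clock frequency is only a crude proxy for per-core performance, so this is not a fair benchmark), the failure rates and a naive frequency-scaled expectation work out as follows:

```python
# Back-of-the-envelope numbers derived from the figures on this slide.
helix_failed, helix_ok = 265, 6000
cern_failed, cern_ok = 36, 1000
print("HELIX_A failure rate: %.1f%%" % (100.0 * helix_failed / (helix_failed + helix_ok)))  # ~4.2%
print("CERN failure rate:    %.1f%%" % (100.0 * cern_failed / (cern_failed + cern_ok)))     # ~3.5%

# Naively scale the CERN reference time by the clock-frequency ratio
# (~2.3 GHz Xeon vs ~1.5 GHz Opteron); memory, I/O and architecture differ too.
cern_avg, helix_avg = 8136.0, 16267.0
print("Scaled expectation: %.0f s vs observed avg %.0f s" % (cern_avg * 2.3 / 1.5, helix_avg))  # ~12475 s vs 16267 s
```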

Page 11: CERN and HelixNebula, the Science Cloud


Further observations for HELIX_A


• Not all machines perform in the same way!
• But we would be paying the same price for both

Page 12: CERN and HelixNebula, the Science Cloud


Conclusions

• We demonstrated the feasibility of running ATLAS workloads on the cloud

• We did not touch the storage and data management part
  • Each job brought the data over the WAN
  • MC simulation is best suited for this setup

• Cloud computing is still a young technology and the adoption of standards would be welcome

• We do not have cost estimates
  • Best comparison would be CHF/event on the cloud and on the sites


Page 13: CERN and HelixNebula, the Science Cloud


Next steps for the HelixNebula partnership

• TechArch Group:
  o Working to federate the providers:
    o Common API, single sign-on
    o Image & Data marketplace
    o Image Factory (transparent conversion)
    o Federated accounting & billing
    o Cloud Broker service
  o Currently finalizing requirements
  o Two implementation phases:
    o Phase 1: lightweight federation, common API
    o Phase 2: image management, brokerage
  o Improve connectivity by joining cloud providers to the GEANT network
    o Only for non-commercial data
  o Collaborate with the EGI Federated Cloud Task Force
    o Seek common ground
  o For ATLAS: possibly repeat the PoCs after the Phase 1 implementations
