
Page 1: The HEPCloud Facility

Steven Timm, UK/FNAL Planning Meeting, 09 Oct 2018


Page 2: Disclaimer

• Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof.


Page 3: HEPCloud project team

• Project sponsor: P. Spentzouris
• Sponsor proxies: S. Fuess, B. Holzman
• Project managers: E. Berman, P. Mhashilkar
• Technical Lead: A. Tiradani
• Security: M. Altunay, J. Teheran
• Integration/Testing: S. Timm
• Decision Engine Development Lead: P. Mhashilkar
• Monitoring: K. Retzke


Page 4: HEPCloud Facility

• Started as a pilot project in 2015 to explore the feasibility and capability of HEPCloud
– Now in Phase III, moving into production by the end of 2018
– Seed money provided by industry

• A portal to an ecosystem of diverse computing resources, commercial or academic
– Provides “complete solutions” to users, with agreed-upon levels of service

• The Facility routes to local or remote resources based on workflow requirements, cost, and the efficiency of accessing the various resources

• Manages allocations of users to target compute engines

• MORE THAN CLOUD


Page 5: Classes of External Resource Providers


Grid: trust federation (“things you borrow”)
• Virtual Organizations (VOs) of users trusted by Grid sites
• VOs get allocations ➜ pledges
– Unused allocations: opportunistic resources

Cloud: economic model (“things you rent”)
• Community clouds: similar trust federation to Grids
• Commercial clouds: pay-as-you-go model
– Strongly accounted
– Near-infinite capacity ➜ elasticity
– Spot price market

HPC: grant allocation (“things you are given”)
• Researchers granted access to HPC installations
• Peer review committees award allocations
– Awards model designed for individual PIs rather than large collaborations


Page 6: Drivers of Evolution: Capacity / Cost / Elasticity


[Charts] Price of one core-year on commercial cloud. HEP needs 10-100x today’s capacity; facility size: 15k cores. NOvA jobs in the queue at FNAL: usage is not steady-state.

Page 7: CMS Reaching ~60k slots on AWS with HEPCloud


[Chart] 10% test, then 25%; 60,000 slots across ~10,000 VMs. Each color corresponds to a different region/zone/machine type.

Page 8: Google CMS Demo Fall 2016


Page 9: Doubling all CMS computing!


[Chart] Cores from Google

Page 10: High Performance Computing -- NERSC

• CPU-intensive neutrino oscillation fit for NOvA

• Ran on NERSC one week, presented at conference the next.

• Significant resources available to HEP at US Dept. of Energy supercomputers

• Most of the work so far has been done at NERSC

• Have also run Mu2e and CMS workflows at scale at NERSC


Total facility size: 64.4K

FIFE experiments facility share: 18.6K

NOvA (one of the many experiments supported at Fermilab) using HEPCloud to claim ~1M cores at NERSC to perform a large-scale analysis over a short timeframe

News article: http://news.fnal.gov/2018/07/fermilab-computing-experts-bolster-nova-evidence-1-million-cores-consumed/


Page 11: High Performance Computing -- Leadership Class

• Dept. of Energy leadership-class machines
– Argonne and Oak Ridge are the big ones
– Network topology doesn’t easily support the GlideinWMS model
– Discussions on developing edge services to service jobs
– Applications run to date are mostly MC generators (Pythia)
– Note: the DUNE computing consortium includes people who already have allocations on both, plus several other supercomputing centers in the US and elsewhere

• Through OSG gateways we can also access US NSF centers
– Texas Advanced Computing Center, Pittsburgh Supercomputing Center


Page 12: HEPCloud Architecture


[Architecture diagram] Experiment workflow (provisioning trigger), facility interface with authentication & authorization, Decision Engine, and provisioner (at FNAL: the GlideinWMS Factory) feeding a facility pool of local resources, HPC resources, Grid resources, and commercial clouds. Monitoring takes input from all components.


Page 13: HEPCloud as extension of existing facility

• Builds on the GlideinWMS-based provisioning structure we already have

• Intensity Frontier users, including DUNE, will submit jobs with jobsub as before

• CMS continues to submit jobs through the CMS Global Pool

• Experiments will be gradually onboarded to use the HPC and cloud functionality
– New parameters in their jobs when they do


Page 14: User’s View of the HEPCloud Facility


Page 15: Decision Engine (DE)

• A modular intelligent decision support system (IDSS)
– Makes decisions that aid in the automatic provisioning of resources
– Supports multiple types of resources: cloud providers (Amazon AWS, Google GCE), HPC centers (NERSC), Grid computing federations

• Design drivers
– Framework supports user-supplied plugins: allows the injection of user-supplied code and expert knowledge
– Powerful configuration functionality
– Validity of data used during the decision-making process
– Ability to replay decisions


Page 16: Decision Channel Components

• Source modules are responsible for communicating with an external system (via native APIs to that system) to gather data that acts as input to the system

• Transform modules contain algorithms to convert input data into new data. A transform consumes one or more data products (produced by one or more sources, transforms, or both) within a Decision Channel and produces one or more new data products

• The Logic Engine is a rule-based forward-chaining inference engine that helps in the decision-making process

• Publisher modules are responsible for publishing results to external systems

• The DataSpace acts as a knowledge management system (a minimal sketch of one decision cycle follows below)
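
As a rough illustration of how these components fit together in one decision cycle, here is a minimal Python sketch. The class and function names, the in-memory DataSpace, and the single cost rule are invented simplifications for this page, not the actual HEPCloud Decision Engine API.

```python
# Hypothetical simplification of a single decision-channel cycle:
# sources gather data, transforms derive new data products, the Logic Engine
# applies rules, and publishers push the results out. Every product goes
# through a (here purely in-memory) DataSpace so the cycle could be replayed.

from typing import Any, Callable, Dict, List


class DataSpace:
    """Toy stand-in for the time-ordered data-product store."""

    def __init__(self) -> None:
        self.products: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self.products[key] = value

    def get(self, key: str) -> Any:
        return self.products[key]


def price_source(ds: DataSpace) -> None:
    # A real source would call a native API (e.g. a cloud pricing endpoint).
    ds.put("spot_price_per_core_hour", 0.03)


def demand_source(ds: DataSpace) -> None:
    # A real source might query the batch system for idle jobs.
    ds.put("idle_jobs", 5000)


def cost_transform(ds: DataSpace) -> None:
    # A transform derives a new data product from existing ones.
    ds.put("estimated_cost", ds.get("spot_price_per_core_hour") * ds.get("idle_jobs"))


def logic_engine(ds: DataSpace) -> None:
    # One forward-chaining-style rule: provision only if the cost is acceptable.
    ds.put("provision", ds.get("estimated_cost") < 500.0)


def publisher(ds: DataSpace) -> None:
    # A real publisher would talk to the provisioner (e.g. the GlideinWMS Factory).
    action = "request glideins" if ds.get("provision") else "hold"
    print(f"decision: {action} (estimated cost ${ds.get('estimated_cost'):.2f})")


def run_cycle(steps: List[Callable[[DataSpace], None]]) -> DataSpace:
    ds = DataSpace()
    for step in steps:  # sources -> transforms -> logic engine -> publishers
        step(ds)
    return ds


if __name__ == "__main__":
    run_cycle([price_source, demand_source, cost_transform, logic_engine, publisher])
```

In the real system these steps are configurable plugins (per the design drivers on the previous slide), and the DataSpace is persistent and time-ordered so that decisions can be replayed.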


[Diagram] Decision cycle workflows: Sources → Transforms → Logic Engine → Publishers, each getting data products from and putting them into the DataSpace (time-ordered, user-defined data products; ensures reproducibility). Also shown: anomaly detection, financial managers, regulation and policy managers, operators, resource managers.


Page 17: Job Classification

• The goal is to get users away from specifying the specific sites/facilities where they can run

• Instead, characterize each job with the right parameters, which the Decision Engine can use to figure out whether the job can run on cloud or HPC resources

• Besides the obvious (CPU time, RAM, scratch disk, number of cores), this includes (see the sketch after this list):
– Input and output data size
– Rate of network input and output (output costs money)
– IOPS: how many seeks and how many reads (both cost money)
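
For illustration only, a job characterization of this kind might look like the following Python sketch; the attribute names, units, and the toy eligibility rule are assumptions made for this example, not actual jobsub or Decision Engine parameters.

```python
# Hypothetical example of the parameters a user might attach to a job so that
# the Decision Engine can decide whether it fits cloud or HPC resources.
# The attribute names, units, and thresholds are invented for illustration;
# they are not actual jobsub or Decision Engine keywords.

job_profile = {
    # the "obvious" requirements
    "cpu_time_hours": 8,
    "memory_gb": 2.0,
    "scratch_disk_gb": 10,
    "cores": 1,
    # data movement, which drives cost on commercial clouds
    "input_data_gb": 5,
    "output_data_gb": 1,      # egress (output) is what costs money
    "network_out_mbps": 2,
    # I/O pattern: seeks and reads are billed on some cloud storage back ends
    "iops_estimate": 200,
}


def cloud_eligible(profile: dict) -> bool:
    """Toy rule: small-output, low-I/O jobs are cheap enough to run on the cloud."""
    return profile["output_data_gb"] <= 5 and profile["iops_estimate"] <= 1000


print("cloud candidate" if cloud_eligible(job_profile) else "keep on local/HPC resources")
```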


Page 18: Figures of Merit

• Every resource has been benchmarked and has known performance

• Every resource has a price calculated or assigned
– Some resources (AWS) have spot prices that change over time

• Figures of merit are calculated in ($/perf) where lower is better.

• The figure of merit for each resource rises as its occupancy rises

• Jobs are sent to resources in inverse proportion to their figure of merit (a sketch follows below)
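
A minimal Python sketch of how such a figure of merit and the inverse-proportional split might be computed. The exact formula (in particular the occupancy scaling), the field names, and the numbers are illustrative assumptions, not the production Decision Engine calculation.

```python
# Illustrative sketch only: the precise FOM formula and occupancy scaling used
# by the HEPCloud Decision Engine are not specified here; these are assumptions.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:
    name: str
    price_per_core_hour: float   # $ per core-hour (spot or fixed)
    benchmark_score: float       # relative performance from benchmarking
    occupancy: float             # fraction of slots already in use, 0..1


def figure_of_merit(r: Resource) -> float:
    """Cost per unit performance; lower is better.
    Assumed here to grow as the resource fills up."""
    base = r.price_per_core_hour / r.benchmark_score
    return base / max(1.0 - r.occupancy, 1e-6)  # rises sharply when nearly full


def job_shares(resources: List[Resource]) -> Dict[str, float]:
    """Split new jobs across resources in inverse proportion to their FOM."""
    inverse = {r.name: 1.0 / figure_of_merit(r) for r in resources}
    total = sum(inverse.values())
    return {name: weight / total for name, weight in inverse.items()}


if __name__ == "__main__":
    pool = [
        Resource("FermiGrid", price_per_core_hour=0.01, benchmark_score=10.0, occupancy=0.90),
        Resource("AWS-spot",  price_per_core_hour=0.03, benchmark_score=12.0, occupancy=0.20),
        Resource("NERSC",     price_per_core_hour=0.02, benchmark_score=15.0, occupancy=0.50),
    ]
    for name, share in job_shares(pool).items():
        print(f"{name}: {share:.1%} of new jobs")
```

With these made-up numbers, the nearly full local grid pool gets the smallest share of new jobs despite its low raw price, while the less occupied HPC and cloud resources absorb most of them.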


Page 19: Current Status

• Final security proposal currently pending at the Department of Energy

• Integration testing of the final Decision Engine code in progress

• User training and final preparation for operations over the course of the fall

• Go-live date to be announced soon
