Post on 09-Oct-2020
Steven Timm
UK/FNAL Planning Meeting
09 Oct 2018
The HEPCloud Facility
• Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof.
Disclaimer
10/9/18 Steven Timm | HEPCloud | FNAL -UK Planning meeting2
• Project sponsor: P. Spentzouris
• Sponsor proxies: S. Fuess, B. Holzman
• Project managers: E. Berman, P. Mhashilkar
• Technical Lead: A. Tiradani
• Security: M. Altunay, J. Teheran
• Integration/Testing: S. Timm
• Decision Engine Development Lead: P. Mhashilkar
• Monitoring: K. Retzke
HEPCloud project team
• Started as a pilot project in 2015 to explore feasibility, capability of HEPCloud
– Now in Phase III, moving into production by end of 2018.
– Seed money provided by industry
• A portal to an ecosystem of diverse computing resources (commercial or academic)
– Provides “complete solutions” to users, with agreed-upon levels of service
• The Facility routes to local or remote resources based on workflow requirements, cost, and efficiency of accessing various resources
• Manages allocations of users to target compute engines
• MORE THAN CLOUD
HEPCloud Facility
Classes of External Resource Providers
Grid: Trust Federation (“Things you borrow”)
• Virtual Organizations (VOs) of users trusted by Grid sites
• VOs get allocations ➜ Pledges
– Unused allocations: opportunistic resources

Cloud: Economic Model (“Things you rent”)
▪ Community Clouds: similar trust federation to Grids
▪ Commercial Clouds: Pay-As-You-Go model
๏ Strongly accounted
๏ Near-infinite capacity ➜ Elasticity
๏ Spot price market

HPC: Grant Allocation (“Things you are given”)
▪ Researchers granted access to HPC installations
▪ Peer review committees award Allocations
๏ Awards model designed for individual PIs rather than large collaborations
Drivers of Evolution: Capacity / Cost / Elasticity
[Figure: price of one core-year on Commercial Cloud]
HEP needs: 10-100x today's capacity
[Figure: NOvA jobs in the queue at FNAL; facility size: 15k cores]
Usage is not steady-state
CMS Reaching ~60k slots on AWS with HEPCloud
[Figure: slots on AWS over time. Plot labels: 10% Test, 25%, 60000 slots, 10000 VM. Each color corresponds to a different region/zone/machine type.]
Google CMS Demo Fall 2016
Doubling all CMS computing!
Cores from Google
• CPU-intensive neutrino oscillation fit for NOvA
• Ran on NERSC one week, presented at conference the next.
• Significant resources available to HEP at US Dept. of Energy supercomputers
• Most of work done at NERSC so far
• Have also run MU2E and CMS workflows at scale at NERSC
High Performance Computing -- NERSC
Total facility size: 64.4K
FIFE experiments facility share: 18.6K
NOvA (one of the many experiments supported at Fermilab) using HEPCloud to claim ~1M cores at NERSC to perform a large-scale analysis over a short timeframe
News article: http://news.fnal.gov/2018/07/fermilab-computing-experts-bolster-nova-evidence-1-million-cores-consumed/
• Dept. of Energy Leadership class machines
– Argonne and Oak Ridge are big ones
– Network topology doesn’t easily support GlideinWMS model
– Discussions on developing edge services to service jobs.
– Applications run to date are mostly MC generators (Pythia)
– Note: DUNE computing consortium includes people that already have allocations on both, plus several other supercomputing centers in US and elsewhere.
• Through OSG gateways we can also access US NSF centers
– Texas Advanced Computing Center, Pittsburgh Supercomputing Center.
High Performance Computing -- Leadership Class
HEPCloud Architecture
[Architecture diagram. Components shown:]
• Experiment Workflow (provisioning trigger)
• Facility Interface
• Authentication & Authorization
• Decision Engine
• Provisioner (FNAL: GlideinWMS Factory)
• Facility Pool
• Local Resources, HPC Resources, Grid Resources, Commercial Cloud
• Monitoring (takes input from all components)
• Builds on the GlideinWMS-based provisioning structure we already have.
• Intensity frontier users including DUNE will submit jobs with jobsub as before.
• CMS will continue to submit jobs through the CMS Global Pool.
• Experiments will be gradually onboarded to use the HPC and Cloud functionality
– New parameters in their jobs when they do
HEPCloud as extension of existing facility
User’s View of the HEPCloud Facility
• A modular intelligent decision support system (IDSS)
– Makes decisions that aid in the automatic provisioning of resources
– Supports multiple types of resources
  • Cloud providers (Amazon AWS, Google GCE)
  • HPC centers (NERSC)
  • Grid computing federations
• Design Drivers
– Framework supports user-supplied plugins
  • Allows the injection of user-supplied code and expert knowledge
– Powerful configuration functionality
– Validity of data used during decision making process
– Ability to replay decisions
Decision Engine (DE)
• Source modules are responsible for communicating with an external system (via native APIs to that system) to gather data that acts as input to the system
• Transform modules contain algorithms to convert input data into new data. A transform consumes one or more data products (produced by one or more sources, transforms or both) within a Decision Channel and produces one or more new data products
• Logic Engine is a rule-based forward chaining inference engine that helps in the decision making process
• Publisher modules are responsible for publishing results to external systems
• The DataSpace acts as a Knowledge Management system
Decision Channel Components
[Diagram: Decision Cycle Workflows. Sources ➜ Transforms ➜ Logic Engine ➜ Publishers, all connected to the DataSpace. Arrows to and from the DataSpace are labeled Get, Get, Get/Put, Put.]
DataSpace: time-ordered, user-defined data products (ensures reproducibility)
Other labels in the diagram: anomaly detection, financial managers, regulation and policy managers, operators, resource managers
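The Decision Channel components described above can be sketched as a small Python pipeline. This is only an illustration of the source ➜ transform ➜ logic engine ➜ publisher flow, not the Decision Engine's actual code or API: the `DataSpace` class, `run_channel`, the rule format, and the example data products (`idle_jobs`, `free_slots`, `shortfall`) are all hypothetical names invented for this sketch.

```python
import time


class DataSpace:
    """Time-ordered store of named data products (a stand-in, not the real DataSpace)."""

    def __init__(self):
        self.history = []  # list of (timestamp, name, value), oldest first

    def put(self, name, value):
        self.history.append((time.time(), name, value))

    def get(self, name):
        # Return the most recent data product stored under this name.
        for _, key, value in reversed(self.history):
            if key == name:
                return value
        raise KeyError(name)


def run_channel(space, sources, transforms, rules, publishers):
    """Run one decision cycle and return the set of facts that were derived."""
    # Sources: gather input data from external systems.
    for name, source in sources.items():
        space.put(name, source())
    # Transforms: consume existing data products, produce new ones.
    for name, (inputs, func) in transforms.items():
        space.put(name, func(*[space.get(i) for i in inputs]))
    # Logic engine: forward chaining, firing rules until no new fact appears.
    facts = set()
    changed = True
    while changed:
        changed = False
        for fact, condition in rules.items():
            if fact not in facts and condition(space, facts):
                facts.add(fact)
                changed = True
    space.put("facts", frozenset(facts))
    # Publishers: push results to external systems.
    for publish in publishers:
        publish(space)
    return facts


# Hypothetical example: request more slots when idle jobs exceed free slots.
requests = []


def request_slots(space):
    if "provision" in space.get("facts"):
        requests.append(space.get("shortfall"))


space = DataSpace()
facts = run_channel(
    space,
    sources={"idle_jobs": lambda: 120, "free_slots": lambda: 20},
    transforms={
        "shortfall": (["idle_jobs", "free_slots"], lambda jobs, slots: max(jobs - slots, 0)),
    },
    rules={
        "need_more": lambda sp, known: sp.get("shortfall") > 0,
        "provision": lambda sp, known: "need_more" in known,
    },
    publishers=[request_slots],
)
```

Because every data product is appended to `space.history` with a timestamp, a past cycle could in principle be inspected or replayed from that record, which is the role the slides assign to the DataSpace.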
• Goal is to get users away from specifying the specific sites/facilities where they can run
• And instead have them characterize their job with the right parameters, which the Decision Engine can use to figure out if the job can run on the cloud or HPC resources
• Besides the obvious (CPU time, RAM, scratch disk, number of cores):
• Input and output data size
• Rate of network input and output (output costs money)
• IOPS: how many seeks, how many reads (both cost money)
Job Classification
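One way to picture such a job description is as a small record of the parameters listed above, plus a rough cost estimate built from the items that cost money. This is a sketch under stated assumptions: the field names, the `CloudPrices` values, and the cost formula are hypothetical and are not taken from jobsub, the Decision Engine, or any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Parameters a user might supply instead of naming a site (hypothetical fields)."""
    core_hours: float     # CPU time summed over all cores
    cores: int
    ram_gb: float
    scratch_gb: float
    input_gb: float
    output_gb: float
    net_in_mbps: float    # rate of network input
    net_out_mbps: float   # rate of network output
    seeks: int            # number of seek operations
    reads: int            # number of read operations


@dataclass
class CloudPrices:
    """Placeholder prices in dollars; not real provider rates."""
    per_core_hour: float = 0.05
    per_output_gb: float = 0.09
    per_million_io_ops: float = 0.05


def estimated_cloud_cost(job, prices):
    """Sum the charged items: compute time, network output, and I/O operations."""
    compute = job.core_hours * prices.per_core_hour
    egress = job.output_gb * prices.per_output_gb
    io = (job.seeks + job.reads) / 1_000_000 * prices.per_million_io_ops
    return compute + egress + io


job = JobProfile(
    core_hours=8.0, cores=1, ram_gb=2.0, scratch_gb=10.0,
    input_gb=1.0, output_gb=2.0, net_in_mbps=5.0, net_out_mbps=1.0,
    seeks=1_000_000, reads=2_000_000,
)
cost = estimated_cloud_cost(job, CloudPrices())
```

With the placeholder prices, the estimate is dominated by compute time, but a job with heavy output or many small reads would shift the balance, which is why those parameters matter for placement.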
• Every resource has been benchmarked and has known performance
• Every resource has a price calculated or assigned.
– Some resources (AWS) have spot prices that change in time.
• Figures of merit are calculated in ($/perf) where lower is better.
• Figure of merit goes up for each resource as the occupancy goes up.
• Send jobs to resources inversely proportional to figure of merit.
Figures of Merit
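The bullets above can be illustrated with a short Python sketch. It is not the Decision Engine's actual calculation: the `Resource` fields, the occupancy penalty `1 / (1 - occupancy)`, and the sample prices and benchmark scores are assumptions chosen only to show a $/perf figure of merit that rises with occupancy and a job split in inverse proportion to it.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    price_per_core_hour: float  # dollars, calculated or assigned (made-up values below)
    benchmark_score: float      # relative performance per core (made-up values below)
    occupancy: float            # fraction of the resource already in use, 0..1


def figure_of_merit(r):
    """Cost per unit performance ($/perf); lower is better.

    The 1 / (1 - occupancy) factor is an assumed penalty that makes the
    figure of merit go up as the resource fills up.
    """
    base = r.price_per_core_hour / r.benchmark_score
    free = max(1.0 - r.occupancy, 1e-6)  # avoid division by zero when full
    return base / free


def job_shares(resources):
    """Split jobs across resources in inverse proportion to figure of merit."""
    weights = {r.name: 1.0 / figure_of_merit(r) for r in resources}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


resources = [
    Resource("local", 0.010, 1.0, 0.50),
    Resource("cloud_spot", 0.020, 1.0, 0.00),
    Resource("hpc", 0.005, 0.5, 0.75),
]
shares = job_shares(resources)
```

For a resource with a spot price, `price_per_core_hour` would be refreshed before each calculation, so the shares shift as the market price changes.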
• Final security proposal currently pending at Department of Energy
• Integration testing of final Decision Engine code in progress
• User training and final preparation for operations over the course of the fall.
• Go Live to be announced soon.
Current Status