Take-home messages from Lecture 1
LHC Computing has been well sized to handle the production and analysis needs of the LHC (very high data rates and throughputs)
Based on the hierarchical MONARC model
It has been very successful: WLCG operates smoothly and reliably
Data is transferred and made available to everybody in a very short time; the Higgs boson discovery was announced within a week of the latest data update!
The network has worked well and now allows for computing model changes
Ian.Bird@cern.ch / August 2012
2
Grid computing enables the rapid delivery of physics results
Outlook to the Future
3
4
Computing Model Evolution
Evolution of computing models: from Hierarchy to Mesh
5
Evolution
During its development, the WLCG production grid has oscillated between structure and flexibility, driven by the capabilities of the infrastructure and the needs of the experiments
Examples: ALICE Remote Access, PD2P/Popularity, CMS Full Mesh
6
Structure
Data management in the WLCG has been moving to a less deterministic system as the software improved
It started with deterministic pre-placement of data on disk storage for all samples (ATLAS)
Then came subscriptions driven by physics groups (CMS)
Then dynamic placement of data based on access, replicating only the samples that were actually going to be looked at (ATLAS)
Once I/O is optimized and network links improve, we can send data over the wide area so jobs can run anywhere and access the data (ALICE, ATLAS, CMS)
• Good for opportunistic resources, load balancing, clouds, or any other case where a sample will be accessed only once
Data Management Evolution
Less Deterministic
7
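The dynamic-placement stage described above can be illustrated with a toy decision rule: replicate datasets that are being accessed, clean up copies that are not. This is a minimal sketch; the function name and thresholds are invented for illustration, not actual ATLAS or CMS placement logic.

```python
# Toy sketch of dynamic data placement: replicate only samples that are
# actually being looked at, and clean unused extra copies.
# All names and thresholds are illustrative, not WLCG APIs.

def placement_decision(accesses_last_week, replicas, max_replicas=5):
    """Return 'replicate', 'keep', or 'clean' for one dataset."""
    if accesses_last_week == 0 and replicas > 1:
        return "clean"          # unused extra copies are removed
    if accesses_last_week > 100 and replicas < max_replicas:
        return "replicate"      # popular data gets additional copies
    return "keep"

print(placement_decision(accesses_last_week=250, replicas=2))  # replicate
print(placement_decision(accesses_last_week=0, replicas=3))    # clean
```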
Structure
Scheduling evolution has had similar drivers
We started with a very deterministic system where jobs were sent directly to a specific site
This led to early binding of jobs to resources: requests sat idle in long queues, with no ability to reschedule
All four experiments evolved to use pilots, which make better scheduling decisions based on current information
The pilot system is now evolving further to allow submission to additional resources such as clouds
What began as a deterministic system has evolved toward flexibility in scheduling and resources
Scheduling Evolution
Less Deterministic
More dynamic data placement is needed:
fewer restrictions on where the data comes from,
but data is still pushed to sites
8
Data Access Frequency
Ian Fisk, FNAL/CD
(Figure: ATLAS data access between Tier-1 and Tier-2 sites)
Services like the Data Popularity Service track all file accesses and can show which data is accessed and for how long
Over a year, popular data stays popular for reasonably long periods of time
9
Popularity
CMS Data Popularity Service
ATLAS uses the central queue and popularity metrics to understand how heavily a dataset is used
Additional copies of the data are made, and jobs are re-brokered to use them
Unused copies are cleaned up
10
Dynamic Data Placement
(Figure: PanDA requests driving replication between Tier-1 and Tier-2)
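The popularity-driven placement described on these slides can be sketched as simple bookkeeping: count accesses per dataset per week, and flag the datasets above a threshold as candidates for extra replicas. Illustrative only: the class and method names are made up and are not the APIs of the CMS Data Popularity Service or ATLAS PD2P.

```python
# Hedged sketch of what a popularity service records: accesses per
# dataset per week, so a placement system can ask which datasets are hot.
# Names are illustrative, not real WLCG service APIs.

from collections import defaultdict

class PopularityTracker:
    def __init__(self):
        # dataset -> week -> access count
        self.accesses = defaultdict(lambda: defaultdict(int))

    def record(self, dataset, week):
        self.accesses[dataset][week] += 1

    def hot_datasets(self, week, threshold=3):
        """Datasets accessed at least `threshold` times in `week`."""
        return sorted(d for d, weeks in self.accesses.items()
                      if weeks[week] >= threshold)

pop = PopularityTracker()
for _ in range(5):
    pop.record("/Higgs/2012/AOD", week=30)
pop.record("/MinBias/2011/RAW", week=30)
print(pop.hot_datasets(week=30))  # ['/Higgs/2012/AOD']
```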
With optimized I/O, other methods of managing the data and the storage become available
Sending data directly to applications over the WAN allows users to open any file regardless of their location or the file's source
Sites deploy at least one xrootd server that acts as a proxy/door
12
Wide Area Access
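The proxy/door idea can be sketched as the fallback rule an application might use: prefer a local replica, otherwise open a federation URL and let the xrootd redirector locate a copy somewhere on the WAN. The redirector hostname and file paths below are illustrative assumptions, not guaranteed production endpoints.

```python
# Sketch of federated wide-area access: the application opens one
# redirector URL and the xrootd federation finds a replica.
# Hostname and paths are illustrative, not production values.

LOCAL_REPLICAS = {"/store/data/Run2012B/MuEG.root"}  # what this site has on disk

def access_url(lfn, redirector="cms-xrd-global.cern.ch"):
    """Prefer a local replica; otherwise fall back to the WAN federation."""
    if lfn in LOCAL_REPLICAS:
        return f"file://{lfn}"           # local disk, no WAN traffic
    return f"root://{redirector}/{lfn}"  # the proxy/door resolves a replica

print(access_url("/store/data/Run2012B/MuEG.root"))
print(access_url("/store/mc/Summer12/DYJets.root"))
```

The user-visible point from the slide: the same open call works regardless of where the job runs or where the file actually lives.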
Once we have a combination of dynamic placement, wide-area access to data, and reasonable networking, facilities can be treated as part of one coherent system
This also opens the door to new kinds of resources (opportunistic resources, commercial clouds, data centers, ...)
14
Transparent Access to Data
CERN is deploying a remote computing facility in Budapest
200 Gb/s of networking between the centres at 35 ms ping time
From the experiments' point of view, we cannot really tell where resources are installed
15
Example: Expanding the CERN Tier0
(Figure: CERN to Budapest, 2 × 100 Gb/s links)
Tier 0: Wigner Data Centre, Budapest
• New facility due to be ready at the end of 2012
• 1100 m² (725 m²) in an existing building, but new infrastructure
• 2 independent HV lines
• Full UPS and diesel coverage for all IT load (and cooling)
• Maximum 2.7 MW
These 100 Gb/s links are the first in production for WLCG; other sites will soon follow
We have reduced the differences in site functionality
Then reduced even the perception that two sites are separate
We can begin to think of the facility as one big centre rather than a cluster of centres. This concept can be expanded to many facilities
17
Networks
The WLCG service architecture has been reasonably stable for over a decade
This is beginning to change with new middleware for resource provisioning
A variety of sites are opening their resources to "cloud"-style provisioning
From a site's perspective this is often chosen for cluster-management and flexibility reasons:
everything is virtualized and services are put on top
18
Changing the Services
Grids offer primarily standard services with agreed protocols:
designed to be generic, but each executes a particular task
Clouds offer the ability to build custom services and functions:
more flexible, but also more work for users
19
Clouds vs Grids
CMS and ATLAS are trying to provision resources this way with the High Level Trigger farms
OpenStack is interfaced to the pilot systems. In CMS we got to 6000 running cores, and
the facility looks like just another destination, though no grid CE exists
It will be used for large-scale production running in a few weeks
Several sites have already requested similar connections to local resources
20
Trying this out
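The HLT-farm provisioning described above can be mocked as a cloud API that hands out VM slots until the farm's core count is exhausted, with a pilot booting inside each VM. The `CloudAPI` class is a stand-in for illustration, not the real OpenStack client; the 6000-core figure is the CMS number from the slide.

```python
# Mock of cloud-style provisioning on an HLT farm: the pilot system asks
# the cloud interface for VMs; the farm caps out at its core count.
# CloudAPI is a stand-in, not the real OpenStack (nova) API.

class CloudAPI:
    """Stand-in for a cloud provisioning endpoint."""
    def __init__(self, capacity_cores, cores_per_vm=8):
        self.capacity_cores = capacity_cores
        self.cores_per_vm = cores_per_vm
        self.running = 0  # VMs currently booted, each running a pilot

    def boot_vms(self, n):
        """Boot up to n VMs, limited by remaining core capacity."""
        used = self.running * self.cores_per_vm
        free = (self.capacity_cores - used) // self.cores_per_vm
        booted = min(n, free)
        self.running += booted
        return booted

hlt = CloudAPI(capacity_cores=6000)  # CMS reached ~6000 running cores
booted = hlt.boot_vms(1000)
print(booted, hlt.running * hlt.cores_per_vm)  # 750 6000
```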
We have a grid because we need to collaborate and share resources; thus we will always have a "grid"
Our network of trust is of enormous value to us and to (e-)science in general
We also need distributed data management that supports very high data rates and throughputs; we will continually work on these tools
We are now working on how to integrate cloud infrastructures into WLCG
21
WLCG will remain a Grid
Evolution of the Services and Tools
Computing infrastructure is a necessary piece of the ultimate core mission of HEP experiments, while development effort is steadily decreasing
Common solutions try to take advantage of the similarities in the experiments' activities, to optimize development effort and offer lower long-term maintenance and support costs
Together with the willingness of the experiments to work together
Successful examples exist in Distributed Data Management, Data Analysis, and Monitoring (HammerCloud, Dashboards, Data Popularity, the Common Analysis Framework, ...)
Taking advantage of Long Shutdown 1 (LS1)
Need for Common Solutions
Architecture of the Common Analysis Framework
Evolution of Capacity: CERN & WLCG
25
Modest growth until 2014
Anticipate x2 in 2015
Anticipate x5 after 2018
What we thought was needed at LHC start
What we actually used at LHC start!
Resource Utilization was highest in 2012 for both Tier-1 and Tier-2 sites
(Charts: CMS Tier-1 Pledge Usage and CMS Tier-2 Pledge Usage, monthly from Jan 2012 to Sep 2013, in percent of pledge)
CMS Resource Utilization
Growth curves for resources
(Charts: CMS Tier-1 CPU for Run 2 in kHS06, CMS Tier-1 Disk in PB, and Tier-1 Tape in PB, 2012-2017, comparing the Resource Request to Flat Growth)
Conclusions
28
In the first years of LHC data, WLCG has helped deliver physics rapidly
Data is available everywhere within 48 hours
This is just the start of decades of exploration of new physics, which demands sustainable solutions!
We are entering a phase of consolidation and, at the same time, evolution
LS1 is an opportunity for disruptive changes and for scale testing of new technologies:
wide-area access, dynamic data placement, new analysis tools, clouds
The challenges for computing (scale and complexity) will continue to increase
In the new resource-provisioning model, the pilot infrastructure communicates with the resource-provisioning tools directly,
requesting groups of machines for periods of time
29
Evolving the Infrastructure
(Diagram: Resource Provisioning. Pilots issue resource requests, which go either through a Cloud Interface to VMs with pilots, or through a CE and a batch queue to worker nodes with pilots.)
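The two provisioning paths in the diagram can be sketched as a simple router: a request from the pilot system goes either through a cloud interface that boots VMs with pilots, or through a grid CE and batch queue that start pilots on worker nodes. All names are illustrative, not real middleware APIs.

```python
# Sketch of the slide-29 provisioning diagram: one request, two paths.
# Names are illustrative; real paths go via OpenStack-like clouds or a
# grid CE plus batch system.

def provision(request):
    """Route a resource request to the right provisioning path."""
    n = request["slots"]
    if request["type"] == "cloud":
        # Cloud Interface path: boot VMs, each carrying a pilot
        return [f"VM-{i} (pilot)" for i in range(n)]
    # CE path: submit pilots through the batch queue to worker nodes
    return [f"WN-{i} (pilot)" for i in range(n)]

print(provision({"type": "cloud", "slots": 2}))  # ['VM-0 (pilot)', 'VM-1 (pilot)']
print(provision({"type": "grid", "slots": 2}))   # ['WN-0 (pilot)', 'WN-1 (pilot)']
```

Either way, what the experiment's scheduler sees at the end is the same thing: a pilot reporting for work.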