ScotGrid
EGI CF 2013
Building a grid cluster from the ground up
A Tale of Two Rooms
Introduction
• Scotgrid Glasgow [GridPP]
• One of the largest Tier 2 sites in the UK NGI
• 4136 cores
• 1.3 PB online storage
A year on the grid
• Power & A/C outages from multiple causes on different scales
• Trips/larger substation drops etc.
• Lessons learned - general good practice
• General thoughts on living with a cluster
The case of two machine rooms
• Different ages of rooms - repurposing
• Different cooling solutions
• Advantages and disadvantages
• In principle, with redundant links could have cluster redundancy
• In reality, complexity from bridging cluster with that redundancy - where are the bottlenecks
Site diagram
[Figure: network layout across the Upper and Lower machine rooms. Summit X670V and X460-48t switches connect worker nodes, servers and disk (including 10G WN, 10G Servers and 10G Disk groups) and the WAN. Link labels: 80 Gb/s; 10 Gb/s; 1 Gb/s multiple; 10 Gb/s multiple.]
Power & A/C failures
• Can happen to anyone
• Expect failure (like the grid philosophy)
• UPSes are very useful
• Except when they’re not
• Complexities of multi-room cluster
Best case
• One large data centre with ample power, cooling and network infrastructure
• Lower maintenance overheads
• Higher production uptime
• Failure prediction and multiple redundancy
Reality
• Many clusters grow organically over time, even with careful planning
• Periodic capacity upgrades can lead to infrastructure difficulties
Essential Cluster Infrastructure
• Power
• Cooling
• Network
Power
• Clusters are not desktops - be mindful of total power draw
• Potential for a mix of 3-phase & 13 A ring main
• Most likely to impact overall user environment if changes have to be made (whole building outages)
• Don’t mix phases within rack
• Make clear which phases are where
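The power points above can be turned into a quick check. What follows is a minimal sketch, not the site's actual tooling: the rack names, phase labels (L1 to L3), wattages and the 230 V nominal voltage are all assumptions made for illustration. It totals the rated draw per phase and flags any rack fed from more than one phase.

```python
from collections import defaultdict

# Hypothetical inventory: (rack, phase, rated power draw in watts).
# All names and figures are illustrative, not measurements from the site.
RACKS = [
    ("rack01", "L1", 6000),
    ("rack02", "L2", 5500),
    ("rack03", "L3", 6200),
    ("rack04", "L1", 4800),
]

NOMINAL_VOLTAGE = 230.0  # assumed single-phase nominal voltage, in volts


def load_per_phase(racks):
    """Sum the rated power (W) of every entry on each phase."""
    totals = defaultdict(float)
    for _rack, phase, watts in racks:
        totals[phase] += watts
    return dict(totals)


def current_per_phase(racks, voltage=NOMINAL_VOLTAGE):
    """Approximate current (A) per phase, assuming unity power factor."""
    return {
        phase: watts / voltage
        for phase, watts in load_per_phase(racks).items()
    }


def mixed_phase_racks(racks):
    """Return the racks that appear on more than one phase."""
    phases = defaultdict(set)
    for rack, phase, _watts in racks:
        phases[rack].add(phase)
    return sorted(rack for rack, seen in phases.items() if len(seen) > 1)


if __name__ == "__main__":
    for phase, amps in sorted(current_per_phase(RACKS).items()):
        print(f"{phase}: {amps:.1f} A")
    print("Racks with mixed phases:", mixed_phase_racks(RACKS))
```

Keeping the inventory in one place also serves as the record of which phases are where.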
Cooling
• Mix of techniques (in our case)
• Compressors - gradual degradation
• 4 AHUs: 4 x 2 compressors
• Liquid cooling
• 3 AHUs: effectively 1 active chiller (with failover)
• Over-specification
• Expect maintenance downtime
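Gradual degradation of the kind noted for the compressors is easy to miss from single readings. One possible way to spot it, sketched here under assumptions, is to fit a straight line to recent room temperature samples and alert on the slope. The sample format and the 0.05 degrees C per hour threshold are hypothetical and would need tuning for a real room.

```python
def slope_per_hour(samples):
    """Least-squares slope of temperature (degrees C) against time (hours).

    `samples` is a list of (hours_since_start, temperature_c) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one time point")
    return num / den


def is_degrading(samples, max_slope=0.05):
    """True if the fitted warming rate exceeds max_slope (assumed threshold)."""
    return slope_per_hour(samples) > max_slope


if __name__ == "__main__":
    # Illustrative readings only: a room warming by half a degree per hour.
    readings = [(0, 20.0), (1, 20.5), (2, 21.0), (3, 21.5)]
    print(slope_per_hour(readings), is_degrading(readings))
```

A trend check like this complements a fixed high-temperature alarm rather than replacing it, since a sudden chiller failure needs the alarm and a slow decline needs the trend.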
Network
• An aside (not power or A/C)
• Networking now a first class citizen
• Disparate vendors -> Unified structure
• 160 Gbps backbone
• 80 Gbps redundant ring
Site diagram
[Figure: network layout across the Upper and Lower machine rooms. Summit X670V and X460-48t switches connect worker nodes, servers and disk (including 10G WN, 10G Servers and 10G Disk groups) and the WAN. Link labels: 80 Gb/s; 10 Gb/s; 1 Gb/s multiple; 10 Gb/s multiple.]
Best practices
• Cold starts & boot order
• Auto power on?
• Alerts for sysadmins
• Notifications & communication
• Single points of failure - startup critical path
Startup procedures
• Critical path
• Core infrastructure
• Core services
• (NFS Master Services pool nodes DPM WN)
• More speed less haste
• Automation
• Cluster management
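A startup critical path can be written down as a dependency map and ordered automatically. The sketch below is illustrative only: it assumes the parenthesised list on this slide is a left-to-right boot sequence, and the service names and the `network` entry are hypothetical stand-ins rather than the site's real configuration. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key starts only after everything in its
# set is up. The chain assumes the slide lists services in boot order.
DEPENDS_ON = {
    "network": set(),
    "nfs": {"network"},
    "master": {"nfs"},
    "services": {"master"},
    "pool_nodes": {"services"},
    "dpm": {"pool_nodes"},
    "worker_nodes": {"dpm"},
}


def boot_stages(depends_on):
    """Group services into stages; members of a stage can start in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(boot_stages(DEPENDS_ON), start=1):
        print(f"stage {number}: {', '.join(stage)}")
```

With a strictly linear chain each stage holds one service, which makes every entry a single point of failure on the critical path. Any stage with more than one member shows where a cold start could safely be parallelised.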
A user’s perspective
• Depending on the size of the cluster, power and A/C concerns can have a major impact on users.
• Communication
• Notification
• Posted maintenance windows
• Postmortem
Process flow
• Logging
• Preventative maintenance
• Event flow
• Postmortem
• Process revision
• Escalation
Summary
• Cluster environment is very often externally dictated
• Organic growth
• Can happen to anyone
• Process