resource management: assessing the warm water facilities
TRANSCRIPT
![Page 1: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/1.jpg)
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Resource Management: Assessing the Warm Water Facilities and Systems Supporting ORNL’s Summit
Jim Rogers
Director, Computing and Facilities
National Center for Computational Sciences
Oak Ridge National Laboratory
![Page 2: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/2.jpg)
ORNL Leadership Class Systems 2004 - 2018
2004: Cray X1E Phoenix, 18.5 TF
2005: Cray XT3 Jaguar, 25 TF
2006: Cray XT3 Jaguar, 54 TF
2007: Cray XT4 Jaguar, 62 TF
2008: Cray XT4 Jaguar, 263 TF
2008: Cray XT5 Jaguar, 1 PF
2009: Cray XT5 Jaguar, 2.5 PF
2012: Cray XK7 Titan, 27 PF
From 2004 to 2018, HPC systems relied on chiller-based cooling (5.5°C / 42°F supply), with annualized PUEs of roughly 1.4.
![Page 3: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/3.jpg)
ORNL’s Transition to Warmer Facility Supply Temperatures
2012: Cray XK7 Titan, 27 PF
2018/2019: IBM Summit, 200 PF
2021/2022: HPE/Cray Frontier, ~1.5 EF
Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water
• Below dewpoint
• 100% use of chillers
Summit: A combination of direct on-package cooling and RDHX with 21°C / 70°F supply is >95% room-neutral.
• Above dewpoint
• Contribution by chillers ~20% of the hours of the year
Frontier: Custom packaging is >95% room-neutral with a 32°C / 90°F supply.
• ~100% evaporative cooling, with supplemental HVAC for parasitic loads
![Page 4: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/4.jpg)
Motivation for “Warmer” Cooling Solutions Serving HPC Centers
• Reduced cost, both CAPEX and OPEX
– Reduce or eliminate the need for traditional chillers.
  • No chillers, no ozone-depleting refrigerants (GREEN)
– Oak Ridge calculates an annualized PUE for air-cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid; Oak Ridge, TN)
– HPC power budgets continue to grow. Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.
• Easier, more reliable design
– Design is reduced to pumps, evaporative cooling, heat exchangers.
– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)
![Page 5: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/5.jpg)
OLCF Facilities Supporting Summit
• Titan – 9MW @ heavy load
• Sitting on 250 pounds/ft² raised floor
• Uses 42°F water and special CDUs (XDPs)
• Summit – 256 compute cabinets on-slab
• 100% room-neutral design uses RDHX
• 20MW warm-water cooling plant using centralized CDU/secondary loop
![Page 6: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/6.jpg)
21°C / 20MW / 7700 ton Facility System Design
• Summit
– Demand: 3.3MW (idle) to 11.5MW
• Secondary Loop
– Supply 3300GPM (12,500 liters/min) @ 21°C; Return @ 29-33°C
– CPUs and GPUs use cold plates
– DIMMs and parasitic loads use RDHX
– Storage and Network use RDHX
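The secondary-loop figures above can be sanity-checked with a simple heat balance, Q = ṁ · c_p · ΔT. The sketch below is only an illustration: the water density and specific heat are assumed textbook values, and the function name `loop_heat_kw` is hypothetical, not something taken from the slides.

```python
# Heat carried by a water loop: Q = m_dot * c_p * (T_return - T_supply).
# Both property values are assumptions (typical for water near 21 °C).
WATER_DENSITY_KG_PER_L = 0.998
WATER_CP_KJ_PER_KG_K = 4.18


def loop_heat_kw(flow_l_per_min, supply_c, return_c):
    """Return the heat (kW) removed by a loop at the given flow and temperatures."""
    mass_flow_kg_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return mass_flow_kg_s * WATER_CP_KJ_PER_KG_K * (return_c - supply_c)


if __name__ == "__main__":
    # Slide values: 12,500 L/min supply at 21 °C, return between 29 and 33 °C.
    for return_c in (29.0, 33.0):
        q_mw = loop_heat_kw(12_500, 21.0, return_c) / 1000.0
        print(f"return {return_c:.0f} °C -> {q_mw:.1f} MW")
```

Under these assumptions the loop carries roughly 7 MW at a 29°C return and a little over 10 MW at a 33°C return, which is the same order as the demand range quoted for Summit.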
![Page 7: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/7.jpg)
Medium Water Temperature Cooling Design
Primary Loop uses Evaporative Cooling Towers (~80% of the hours of the year).
When the MTW RETURN is above the 21°C setpoint, use a second set of Trim HX (with 5.5°C on the other side) to drive MTW to the 21°C setpoint.
The trim loop is needed about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit.
A New Challenge – Titan and rotated off-line, so there is no significant load on the Chillers[1-5]. NOAA moved to the warm water loop (~500GPM demand)
Existing chilled water cooling loop
New primary cooling loop
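The 0-100% trim ramp described on this slide can be pictured as a share of the total temperature lift. The following is a minimal sketch, not the plant's actual control logic: it assumes the trim heat exchangers sit downstream of the tower heat exchangers and pick up whatever lift the towers could not, and the names `trim_fraction` and `tower_leaving_c` are hypothetical.

```python
def trim_fraction(return_c, tower_leaving_c, setpoint_c=21.0):
    """Share (0.0 to 1.0) of the loop's heat left for the trim HX.

    return_c:        water temperature coming back from the system
    tower_leaving_c: water temperature after the evaporative-tower HX (assumed sensor)
    setpoint_c:      supply setpoint (21 °C on the slide)
    """
    total_lift = return_c - setpoint_c
    if total_lift <= 0:
        return 0.0  # return water is already at or below the setpoint
    residual = tower_leaving_c - setpoint_c
    # Clamp: towers alone suffice (0.0) or towers contribute nothing (1.0).
    return min(1.0, max(0.0, residual / total_lift))


if __name__ == "__main__":
    print(trim_fraction(30.0, 20.0))  # towers reach the setpoint on their own
    print(trim_fraction(30.0, 24.0))  # towers do part of the work
    print(trim_fraction(30.0, 31.0))  # towers cannot help; trim carries everything
```

A real controller would act on valve position and chilled-water flow rather than a computed fraction, so this is only a way to read the "ramp 0-100%" statement as an energy split.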
![Page 8: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/8.jpg)
![Page 9: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/9.jpg)
![Page 10: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/10.jpg)
Benefits of ORNL’s Warm-Water (21°C) Mechanical Plant
Warm Water
• Total capacity to manage 20MW of IT load
• Warm water allows annualized PUE of 1.1
– For each ~$1M per MW-year of consumption on Summit, there is a corresponding ~$100k per MW-year for waste heat management
• Titan was $6-7M/year plus $1.8-$2M to remove the waste heat
• Summit will have a similar “power bill”, but closer to $500k annually for waste heat.
Integrated System/Facility Operation
• Integration with the PLC allows us to tune water pressure and flow
– Better delta(t); less pumping energy
• Integration with IBM’s OpenBMC provides information necessary to protect ~37K CPUs and GPUs from inadequate flow across the cold plates
• Integration with the scheduler allows us to correlate power and temperature data with individual applications
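The cost figures on this slide follow from the definition of PUE (total facility energy divided by IT energy). As a rough sketch, assuming that all non-IT overhead is waste heat management and that IT and overhead power are billed at the same rate, the arithmetic looks like this. The function names are hypothetical, and the Titan inputs are midpoints of the ranges quoted on the slide.

```python
def overhead_cost_per_mw_year(it_cost_per_mw_year, pue):
    """Overhead cost implied by a PUE for each MW-year of IT load."""
    return it_cost_per_mw_year * (pue - 1.0)


def implied_pue(it_cost, overhead_cost):
    """PUE implied by an IT power bill and a separate overhead bill."""
    return (it_cost + overhead_cost) / it_cost


if __name__ == "__main__":
    # Summit: ~$1M per MW-year of IT consumption at an annualized PUE of 1.1.
    summit = overhead_cost_per_mw_year(1_000_000, 1.1)
    print(f"Summit overhead: ${summit:,.0f} per MW-year")

    # Titan: midpoints of $6-7M/year (IT) and $1.8-2M/year (waste heat).
    print(f"Titan implied PUE: {implied_pue(6.5e6, 1.9e6):.2f}")
```

With these inputs the Summit overhead comes out at about $100k per MW-year, matching the slide, and the Titan numbers imply a PUE of roughly 1.3.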
![Page 11: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/11.jpg)
Summit’s Power Demand, Aug 2018 – Aug 2019
Early Access, Gordon Bell, Acceptance Testing
Transition to Operations
Full Production
![Page 12: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/12.jpg)
Detailed Analysis of Summit’s Annual PUE
[Figure: Cooling Source, August 1 2018 to August 31 2019, with 1-Week PUE. Axes: Date (x), Cooling (kW) from 0 to 14,000, PUE from 1.00 to 1.30. Series: CHW, CTW, 1 Week PUE. Annotations: Seasonal Transition (Summer to Fall); Winter/Spring: 100% Evaporative Cooling; Seasonal Transition (to Summer).]
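One way a 1-week PUE curve like the one on this slide could be produced is as an energy-weighted ratio over a sliding window. The sketch below assumes equally spaced samples of IT power and facility overhead power. The function names and sample values are hypothetical, and the slides do not say how the plotted PUE was actually computed.

```python
def energy_weighted_pue(it_kw, overhead_kw):
    """PUE over a window of equally spaced samples: total energy / IT energy."""
    it_total = sum(it_kw)
    if it_total <= 0:
        raise ValueError("IT load must be positive over the window")
    return (it_total + sum(overhead_kw)) / it_total


def rolling_pue(it_kw, overhead_kw, window):
    """PUE for each full window of `window` consecutive samples."""
    return [
        energy_weighted_pue(it_kw[i - window:i], overhead_kw[i - window:i])
        for i in range(window, len(it_kw) + 1)
    ]


if __name__ == "__main__":
    # Made-up samples purely for illustration (kW).
    it = [8000, 9000, 10000, 7000]
    overhead = [800, 1200, 900, 700]
    for value in rolling_pue(it, overhead, window=2):
        print(f"{value:.3f}")
```

For a true 1-week window, `window` would be the number of samples in seven days at whatever interval the plant data is logged. Weighting by energy rather than averaging instantaneous ratios keeps idle periods from skewing the result.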
![Page 13: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/13.jpg)
Summit’s RM Challenge – Data Volume
TOTAL ~470,000 metrics per second available for real-time analytics
• Data Streams Include:
• IBM OpenBMC framework (99 metrics/node/second x 4608 nodes = ~460,000/sec) - OOB
• IBM LSF jobs data for running applications (~10sec update interval)
• NOAA weather/wet bulb for Oak Ridge area (continuous, external)
• Programmable Logic Controller (PLC) water flow (continuous, protected as part of BAS)
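The OpenBMC stream dominates the total, and its rate follows directly from the node count. A short sketch of that arithmetic, extended to a per-day sample count, is below. The function names are hypothetical and only the two figures from the slide are used as inputs.

```python
NODES = 4608                    # from the slide
METRICS_PER_NODE_PER_SEC = 99   # from the slide
SECONDS_PER_DAY = 86_400


def metrics_per_second(nodes=NODES, per_node=METRICS_PER_NODE_PER_SEC):
    """Aggregate OpenBMC metric rate across all nodes."""
    return nodes * per_node


def metrics_per_day(nodes=NODES, per_node=METRICS_PER_NODE_PER_SEC):
    """Number of samples produced in 24 hours at the aggregate rate."""
    return metrics_per_second(nodes, per_node) * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{metrics_per_second():,} metrics/sec")
    print(f"{metrics_per_day():,} samples/day")
```

The exact product is 456,192 metrics per second, which the slide rounds to ~460,000, or about 39 billion samples per day before the LSF, weather, and PLC streams are added.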
![Page 14: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/14.jpg)
Sample from Grafana Dashboard (live version shown)
![Page 15: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/15.jpg)
MTW Performance
Summit (System) Impact:
• No substantive impact to CPU or GPU operating conditions (junction temperatures)
• One degree increase in room temperature (worst case about 72.5°F)
Year over Year, 2018 to 2019:
• IT Demand (kW-h) is up 19%
• CHW use is 39% LESS than in 2018
• The one-degree adjustment of supply temperature provided a savings of $40,000
2018: Supply 70°F & ~4500GPM
2019: Supply 71°F & ~3300GPM
Energy used by Chilled Water
Total Energy, Mechanical Plant
PUE Improves
Summit IT Load (kW-h) (Cumulative) Increases
Wet Bulb
![Page 16: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/16.jpg)
Summit Power/Jobs Every Second
Click to Play Movie
![Page 17: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/17.jpg)
Summit Temperature/Jobs Every Second
Click to Play Movie
![Page 18: Resource Management: Assessing the Warm Water Facilities](https://reader031.vdocument.in/reader031/viewer/2022020622/61edee4b6560126be428d705/html5/thumbnails/18.jpg)