assessing the facilities and systems supporting hpc at ornl€¦ · summer 2019 - assessing the...

22
ORNL is managed by UT-Battelle, LLC for the US Department of Energy Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities National Center for Computational Sciences Oak Ridge National Laboratory

Upload: others

Post on 27-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL

Jim Rogers

Director, Computing and FacilitiesNational Center for Computational SciencesOak Ridge National Laboratory

Page 2: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

22

ORNL Leadership Class Systems 2004 - 2018

2012Cray XK7

Titan

27PF

18.5TF

25 TF

54 TF

62 TF

263 TF

1 PF

2.5PF

2004Cray X1E Phoenix

2005Cray XT3

Jaguar

2006Cray XT3

Jaguar

2007Cray XT4

Jaguar

2008Cray XT4

Jaguar

2008Cray XT5

Jaguar

2009Cray XT5

Jaguar

From 2004 – 2018, HPC systems relied on chiller-based cooling (5.5°C supply)

with annualized PUEs to ~1.4

Page 3: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

33

ORNL’s Transition to Warmer Facility Supply Temperatures

~1.5EF

200PF

27PF

2012Cray XK7

Titan

2021/2022Frontier

2018/2019IBM

Summit

Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water• Below dewpoint• 100% use of chillers

Summit: A combination of direct on-package cooling and RDHX with 21°C supply is > 95% room-neutral.• Above dewpoint• Contribution by chillers

~20% of the hours of the year

Frontier: Custom mechanical packaging is >95% room-neutral with a 30°C supply.• ~100% Evaporative Cooling,

with supplemental HVAC for parasitic loads

Page 4: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

4

Oak Ridge National Laboratory’s Cray XK7 TitanGTC'15 Session S5566 GPU Errors on HPC Systems4

Operational from November 2012 through August 1 201918,688 compute nodes – 1 AMD Opteron + 1 NVIDIA Kepler/node

27PF (peak); 17.59PF (HPL); 9.5MW (peak)Delivered > 27B compute hours over its life

Titan entered as the #1 supercomputer in the world in November 2012, and was still #12 on the Top500 list at the time of its decommissioning

Page 5: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

5

Perspective on Titan as an Air-cooled Supercomputer

ORNL’s Cray XK7 “Titan”• 200 cabinets, each with a 3,000 CFM fan

• 600,000 CFM (air volume)• Dry air has a density of 13.076 cubic feet per pound• Titan moves 600,000/13.076/60 -> 765 lbs/sec

Airbus A340• At takeoff, each of (4) engines

generates 140kN of force and consumes 1000 lbs/sec of air

• At cruise, the A340’s engines each produce ~29kN and consumes ~200 lbs/sec of air.

Wait. What?

Titan moves as much air as a long-range Airbus A340 at cruising altitude.

Page 6: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

6

Motivation for “Warmer” Cooling Solutions Serving HPC Centers

• Reduced Cost, Both CAPEX and OPEX– Reduce or eliminate the need for traditional

chillers. • No chillers, no ozone-depleting refrigerants (GREEN)

– Oak Ridge calculates an annualized PUE for air--cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid)

Oak Ridge, TN

– HPC power budgets continue to grow – Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.

• Easier, more reliable design– Design is reduced to pumps, evaporative cooling, heat exchangers.– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)

Page 7: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

77

OLCF Facilities Supporting Summit• Titan – 9MW @ heavy

load• Sitting on 250

pounds/ft2 raised floor• Uses 42F water and

special CDUs (XDPs)

• Summit – 256 compute cabinets on-slab

• 100% room-neutral design uses RDHX

• 20MW warm-water cooling plant using centralized CDU/secondary loop

Page 8: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

88

• Summit– Demand: 3-10MW;

• Secondary Loop– Supply 3300GPM (12,500

liters/min) @ 21°C; Return @ 29-33°C

– CPUs and GPUs use cold plates

– DIMMs and parasitic loads use RDHX

– Storage and Network use RDHX

21°C/20MW/7700 ton Facility System Design System

Page 9: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

99

Cooling DesignPrimary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)

When the MTW RETURN is above the 21C setpoint, use a second set of Trim HX (with 5.5C on the other side) to drive MTW to the 21C setpoint.

The need for the trim-loop is about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit

Existing chilled water cooling loop

New primarycooling loop

Page 10: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

1010

Benefits of Warm Water + Operating Dashboards

• Warm Water allows annualized PUE of 1.1– ~$1M cost per MW-year for consumption on Summit;– ~$100k cost per MW-year for waste heat management

• Integration with PLC allows us to tune water flow– Better delta(t); less pumping energy

• Integration with IBM’s OpenBMC allows us to protect these 40k components from inadequate flow across the cold plates

• Integration with the scheduler allows us to correlate power and temperature data with individual applications.

• Additional data streams to be added- most from the Facility PLC

Page 11: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

11

Comparing Energy Performance - Titan to Summit

• Demonstrated Performance on Titan (Oct 2012)

• 17.59 PF 8.9MW peak 8.3MW average

• Demonstrated Performance on Summit (Oct 2018)

• 143.5 PF 11,065kW peak 9,783kW average

8.16x performance increase 17.9% avg. power increase

Titan: ~2.1 GFLOPs/Watt

Summit: 14.66 GFLOPs/Watt

>7x increase in energy efficiency

Page 12: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

1212

Page 13: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

1313

Page 14: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

1414

Page 15: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

15

0

2000

4000

6000

8000

10000

12000

14000

35

40

45

50

55

60

65

70

75

80

85

90

95

100

23:5

1:00

23:5

8:30

00:0

6:00

00:1

3:30

00:2

1:00

00:2

8:30

00:3

6:00

00:4

3:30

00:5

1:00

00:5

8:30

01:0

6:00

01:1

3:30

01:2

1:00

01:2

8:30

01:3

6:00

01:4

3:30

01:5

1:00

01:5

8:30

02:0

6:00

02:1

3:30

02:2

1:00

02:2

8:30

02:3

6:00

02:4

3:30

02:5

1:00

02:5

8:30

03:0

6:00

03:1

3:30

03:2

1:00

03:2

8:30

03:3

6:00

03:4

3:30

03:5

1:00

03:5

8:30

04:0

6:00

04:1

3:30

04:2

1:00

04:2

8:30

04:3

6:00

04:4

3:30

04:5

1:00

04:5

8:30

05:0

6:00

05:1

3:30

05:2

1:00

05:2

8:30

05:3

6:00

kW

Tem

per

atu

re (°

F)

Time

Summit MTW Cooling Loads and TemperaturesHPL Run 5/24/19 Duration: 5:48

MTW kW (cooling) CHW kW (cooling) MTW Return Temp MTW Supply Temp

K100 Avg Space Temp Outdoor Air Wet Bulb Temp IT kW

PUE during HPL Run = 1.081

IT Load follows a traditional HPL profile

Total power load on the energy plant reflects storage and other items A small portion of

the total load required the use of CHW (trim RDHX)

OAWBT remained at/above the supply target, affecting ECT

Supply temperature to Summit stayed rock-solid at 70F.

Page 16: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

16

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

0

2000

4000

6000

8000

10000

12000

123

947

771

595

311

9114

2916

6719

0521

4323

8126

1928

5730

9533

3335

7138

0940

4742

8545

2347

6149

9952

3754

7557

1359

5161

8964

2766

6569

0371

4173

7976

1778

5580

9383

3185

6988

0790

4592

8395

2197

5999

9710

235

1047

310

711

1094

911

187

1142

511

663

1190

112

139

1237

712

615

1285

313

091

1332

913

567

1380

514

043

1428

114

519

1475

714

995

1523

315

471

1570

915

947

1618

516

423

1666

116

899

1713

717

375

1761

317

851

1808

918

327

1856

518

803

1904

119

279

1951

719

755

1999

320

231

2046

920

707

2094

521

183

2142

1

Accu

mul

ated

kW

-hou

rsAx

is Ti

tle

kW

Measurement (1/second)

OLCF-4 Summit HPL Power Measurements10-25-18 - 03:31:48 to 09:32:00 Average power –

9,783 kWMax power –11,065 kW

0

10000

20000

30000

40000

50000

60000

0

2000

4000

6000

8000

10000

12000

121

542

964

385

710

7112

8514

9917

1319

2721

4123

5525

6927

8329

9732

1134

2536

3938

5340

6742

8144

9547

0949

2351

3753

5155

6557

7959

9362

0764

2166

3568

4970

6372

7774

9177

0579

1981

3383

4785

6187

7589

8992

0394

1796

3198

4510

059

1027

310

487

1070

110

915

1112

911

343

1155

711

771

1198

512

199

1241

312

627

1284

113

055

1326

913

483

1369

713

911

1412

514

339

1455

314

767

1498

115

195

1540

915

623

1583

716

051

1626

516

479

1669

316

907

1712

117

335

1754

917

763

1797

718

191

1840

518

619

1883

319

047

1926

119

475

1968

919

903

2011

720

331

2054

520

759

2097

321

187

2140

1

Accu

mul

ated

kW

-hou

rsAx

is Ti

tle

kW

Measurement (1/second)

OLCF-4 Summit HPL Power Measurements10-25-18 - 03:31:48 to 09:32:00

Average power –9,783 kW

Max power –11,065 kW

Total IT Load 58,730 kW-hoursTotal Mech Load 1,449 kW-hours

PUE 1.0246

21,612 seconds;151,284 measurements

Idle: 2.97MW

Page 17: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

17

0

2000

4000

6000

8000

10000

12000

1214

427

640

853

1066

1279

1492

1705

1918

2131

2344

2557

2770

2983

3196

3409

3622

3835

4048

4261

4474

4687

4900

5113

5326

5539

5752

5965

6178

6391

6604

6817

7030

7243

7456

7669

7882

8095

8308

8521

8734

8947

9160

9373

9586

9799

10012

10225

10438

10651

10864

11077

11290

11503

11716

11929

12142

12355

12568

12781

12994

13207

13420

13633

13846

14059

14272

14485

14698

14911

15124

15337

15550

15763

15976

16189

0

2000

4000

6000

8000

10000

120001

215

429

643

857

1071

1285

1499

1713

1927

2141

2355

2569

2783

2997

3211

3425

3639

3853

4067

4281

4495

4709

4923

5137

5351

5565

5779

5993

6207

6421

6635

6849

7063

7277

7491

7705

7919

8133

8347

8561

8775

8989

9203

9417

9631

9845

1005

910

273

1048

710

701

1091

511

129

1134

311

557

1177

111

985

1219

912

413

1262

712

841

1305

513

269

1348

313

697

1391

114

125

1433

914

553

1476

714

981

1519

515

409

1562

315

837

1605

116

265

1647

916

693

1690

717

121

1733

517

549

1776

317

977

1819

118

405

1861

918

833

1904

719

261

1947

519

689

1990

320

117

2033

120

545

2075

920

973

2118

721

401

kW

Measurement (1/second)

OLCF-4 Summit HPL Power Measurements10-25-18 - 03:31:48 to 09:32:00

June 2018122.3PF

Oct 2018143.5PF

Page 18: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

1818

Cooling System Performance - PUE

Page 19: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

191920°C 21°C 22°C

23°C24°C 25°C 26°C 27°C

Page 20: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

2020

26°C

19°C

Page 21: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

21

Caution – Secondary (Closed) Loop Concerns Worsen as Temperatures Rise…

IBM’s Water Quality Requirements• All metals less than or equal to 0.10 ppm• Calcium less than or equal to 1.0 ppm• Magnesium less than or equal to 1.0 ppm• Manganese less than or equal to 0.10 ppm• Phosphorus less than or equal to 0.50 ppm• Silica less than or equal to 1.0 ppm• Sodium less than or equal to 0.10 ppm• Bromide less than or equal to 0.10 ppm• Nitrite less than or equal to 0.50 ppm• Chloride less than or equal to 0.50 ppm• Nitrate less than or equal to 0.50 ppm• Sulfate less than or equal to 0.50 ppm• Conductivity less than or equal to 10.0 μS/cm. • pH 6.5 – 8.0• Turbidity (NTU) less than or equal to 1

Allowable Wetted Materials• Lead-free copper alloys with less than 15% zinc.• Stainless steels• EPDM• Polypropylene

Water Chemistry - Biocides • The choice of biocide depends on whether you are chasing anaerobic

bacteria, aerobic bacteria, fungi, and/or algae

EEHPCWG is looking for common ground among the HPC suppliers for water quality

Fungi might not be detected in the water,

even though it can grow and cause

blockage of cooling channels in cold plates

Accelerated CorrosionSystem BlockagesReduced System Efficiency

Page 22: Assessing the Facilities and Systems Supporting HPC at ORNL€¦ · Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL Jim Rogers Director, Computing and Facilities

2222