Integration, Development and Results of the 500 Teraflop Heterogeneous Cluster (Condor)

DISTRIBUTION A. Approved for public release; distribution unlimited (88ABW-2011-3208, 06 Jun 2011)


Page 1

Integrity Service Excellence

Integration, Development and Results of the 500 Teraflop Heterogeneous Cluster (Condor)

11 September 2012

Mark Barnell

Air Force Research Laboratory

Page 2

Agenda

• Mission

• RI HPC-ARC & HPC Systems

• Condor Cluster

• Success and Results

• Future Work

• Conclusions

Page 3

Exponentially Improving Price-Performance Measured by AFRL-Rome HPCs

[Chart: price-performance on a log scale (ticks from 10 to 1M) versus year (1995 to 2010). Technology categories: Commodity, Embedded, Servers, FPGAs, Gaming, Multicore, GPGPU.]

• INTEL PARAGON (i860): 12 GFLOPS/$M
• SKY (PowerPC): 200 GFLOPS/$M
• Heterogeneous HPC, XEON + FPGA: 81 TOPS/$M
• 53 TFLOP Cell Cluster: 147 TFLOPS/$M
• 500 TFLOP Cell-GPGPU: 250 TFLOPS/$M

Trend annotation: 15 years, 20,000X. Roughly 2^n.
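The "15 years, 20,000X" annotation can be checked against the chart's own labels. The sketch below assumes the endpoints are the Intel Paragon (12 GFLOPS/$M) and the 500 TFLOP Cell-GPGPU system (250 TFLOPS/$M); the slide does not say which two points the annotation spans, and the function names are illustrative.

```python
import math

# Endpoints read from the chart labels, both converted to GFLOPS per $M.
# Choosing these two systems as the endpoints is an assumption.
paragon_gflops_per_musd = 12.0          # Intel Paragon (i860)
condor_gflops_per_musd = 250.0 * 1000   # 500 TFLOP Cell-GPGPU: 250 TFLOPS/$M


def improvement_factor(start, end):
    """Ratio of end to start price-performance."""
    return end / start


def doubling_time_years(factor, years):
    """Years per doubling, assuming steady exponential growth."""
    return years / math.log2(factor)


factor = improvement_factor(paragon_gflops_per_musd, condor_gflops_per_musd)
print(f"improvement: {factor:,.0f}x")
print(f"doublings: {math.log2(factor):.1f}")
print(f"doubling time: {doubling_time_years(factor, 15):.2f} years")
```

Under these assumptions the ratio comes out a little above 20,000 and the doubling time close to one year, which is consistent with the slide's "roughly 2^n" reading for n years.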


Page 5

Mission

• Objective: To support CS&E R&D along with HPC to the Field experiments by providing interactive access to hardware, software and user services with special attention to applications and missions supporting C4ISR.

• Technical Mission: Provide classical and unique, real-time, interactive HPC resources to the AF and DoD R&D community.


Page 7

HPC Facility Resources

• CONDOR CLUSTER: 500 TFLOPS. Funding: $2M HPCMP DHPI. Online: Nov 2010
• Urban Surveillance, Cognitive Computing, Quantum Computing
• Cell BE Cluster: 53 TFLOPS Peak Performance
• EMULAB: Network Emulation Testbed
• HORUS: 22 TFLOPS, TTCP Field Experiments

[Diagram: server nodes 1, 2, 3, ... 84, labeled with Dual 10 GbE, Infiniband 40 Gb/s and 1 GbE links]

Legend: HPC Assets on HPC DREN Network; HPC SDREN Assets

May 2012

Page 8

HPC Facility Resources: GPGPU Clusters

• CONDOR CLUSTER: 500 TFLOPS. Funding: $2M HPCMP DHPI. Online: Nov 2010
• Urban Surveillance, Cognitive Computing, Quantum Computing
• HORUS: 22 TFLOPS, TTCP Field Experiments
• ATI Cluster: 32 TFLOPS, ATI FirePro 8800. Online: Jan 2011

[Diagram: server nodes 1, 2, 3, ... 84, labeled with Dual 10 GbE, Infiniband 40 Gb/s and 1 GbE links]

Legend: HPC GPGPU Assets on DREN Network

• Upgrade all Nvidia GPGPUs to C2050 and C2070 Tesla cards, June 2012
• 30 Kepler cards (~90K) will have a 3x improvement (1.5 Tflop DP) at 220 W
• Condor among the greenest HPC in the world (1.25 Gflop/W DP & SP)
• Redistribute 60 C1060 Tesla cards to other HPC and research sites
  – ASIC, UMASS, & ARSC


Page 10

The Condor Cluster

1716 SONY PlayStation 3s
• STI Cell Broadband Engine
  – PowerPC PPE
  – 6 SPEs
  – 256 MB RAM

84 head nodes
• 6 gateway access points
• 78 compute nodes
  – Intel Xeon X5650 dual-socket hexa-core
  – (2) NVIDIA Tesla GPGPUs: 54 nodes with (108) C2050, 24 nodes with (48) C2070/5
  – 24-48 GB RAM

FY10 DHPI key design considerations: price/performance and performance/Watt
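The node counts on this slide are internally consistent, which a few lines of arithmetic confirm. This is only a bookkeeping sketch: the counts come from the slide, while the `NodeGroup` type and variable names are hypothetical, and the PS3-per-node figure is derived here rather than stated on this slide (it matches the "78 subclusters of 22 PS3s" given on the next one).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """Illustrative grouping of compute nodes by GPU model."""
    name: str
    nodes: int
    gpus_per_node: int


# Counts taken from the slide.
COMPUTE_GROUPS = [
    NodeGroup("C2050", nodes=54, gpus_per_node=2),
    NodeGroup("C2070/5", nodes=24, gpus_per_node=2),
]
GATEWAY_NODES = 6
PS3_COUNT = 1716

compute_nodes = sum(g.nodes for g in COMPUTE_GROUPS)
total_gpus = sum(g.nodes * g.gpus_per_node for g in COMPUTE_GROUPS)
head_nodes = compute_nodes + GATEWAY_NODES
ps3_per_compute_node = PS3_COUNT // compute_nodes  # derived, not on this slide

print(f"compute nodes: {compute_nodes}")
print(f"Tesla GPGPUs: {total_gpus}")
print(f"head nodes: {head_nodes}")
print(f"PS3s per compute node: {ps3_per_compute_node}")
```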

Page 11

Condor Cluster (500 Tflops)

• 263 Tflops from 1,716 PS3s
  – 153 GFLOPS/PS3
  – 78 subclusters of 22 PS3s
• 225 Tflops from server nodes
  – 84 server nodes (Intel Westmere 5650, dual-socket hexa-core, 12 cores)
  – Dual GPGPUs in 78 server nodes
• Firebird Cluster (~32 Tflops)
• Cost: approx. $2M
• Online: November 2010

Sustained throughput, benchmarks/applications YTD: Xeon X5650: 16.8 Tflops; Cell: 171.6 Tflops; C2050: 68.2 Tflops; C2070: 34 Tflops. CONDOR TOTAL: 290.6 Tflops

[Diagram: server nodes 1, 2, 3, ... 84, labeled with Dual 10 GbE, 1 GbE and Infiniband 40 Gb/s links]
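The peak and sustained totals on this slide can be reproduced from the per-component figures. The sketch below uses only numbers from the slide; the sustained-to-peak ratio at the end is a derived quantity that the slide does not report, and the variable names are illustrative.

```python
# Peak figures from the slide.
PS3_COUNT = 78 * 22          # 78 subclusters of 22 PS3s
GFLOPS_PER_PS3 = 153
SERVER_PEAK_TFLOPS = 225

ps3_peak_tflops = PS3_COUNT * GFLOPS_PER_PS3 / 1000
peak_total_tflops = ps3_peak_tflops + SERVER_PEAK_TFLOPS

# Sustained throughput by processor type, as listed on the slide (Tflops).
sustained_tflops = {
    "Xeon X5650": 16.8,
    "Cell": 171.6,
    "C2050": 68.2,
    "C2070": 34.0,
}
sustained_total = sum(sustained_tflops.values())

# Derived here for illustration; not a figure reported on the slide.
sustained_fraction = sustained_total / peak_total_tflops

print(f"PS3 count: {PS3_COUNT}")
print(f"PS3 peak: {ps3_peak_tflops:.1f} Tflops")
print(f"PS3 + server peak: {peak_total_tflops:.1f} Tflops")
print(f"sustained total: {sustained_total:.1f} Tflops")
print(f"sustained / peak: {sustained_fraction:.0%}")
```

The PS3 and server peaks sum to slightly under the 500 Tflops headline, so the title figure appears to be rounded.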

Page 12

Condor Cluster Networks: 10 GbE STAR-Bonded HUB

[Diagram: Racks 1 to 6 and a DELL rack, each with switches; servers labeled CS1 to CS84 and CPS1 to CPS78; BOND links between switches; group sizes marked x13, x14 and x22]

Page 13

Condor Cluster Networks

Infiniband Mesh, Non-Blocking, 20 Gb/s

(5) 12200 & (1) 12300 Qlogic 40 Gb/s Infiniband (36 port) switches

[Diagram: Racks 1 to 6 with 14 servers each, connected through the Infiniband switches]

Page 14

Condor Web Interface


Page 16

Solving Demanding, Real-Time Military Problems

Radar processing for high resolution images

Occluded text recognition

Space object identification

…but beginning to perceive that the handcuffs were not for me and that the military had so far got….

Page 17

RADAR Data Processing for High Resolution Images

Radar processing for high resolution images in real-time

Page 18

Optical Text Recognition Processing Performance

• Computing resources involved in this run
  – 4 Condor servers: 32 Intel Xeon processor cores
  – 88 PlayStation 3s: 616 IBM Cell-BE processor cores
  – 40 Condor servers: 320 Intel Xeon processor cores
  – 880 PS3s: 6160 IBM Cell-BE processor cores (21 pages/sec)
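The two resource sets listed above differ by a constant factor, and the per-device ratios they imply line up with the hardware described earlier (22 PS3s per server, 7 Cell cores per PS3). The sketch below only derives those ratios from the slide's counts; the labels "small" and "large" and the helper name are hypothetical.

```python
from fractions import Fraction

# (servers, xeon_cores, ps3s, cell_cores) for the two resource sets on the slide.
RUNS = {
    "small": (4, 32, 88, 616),
    "large": (40, 320, 880, 6160),
}


def ratios(servers, xeon_cores, ps3s, cell_cores):
    """Per-device ratios implied by one set of resource counts."""
    return {
        "xeon_cores_per_server": Fraction(xeon_cores, servers),
        "ps3s_per_server": Fraction(ps3s, servers),
        "cell_cores_per_ps3": Fraction(cell_cores, ps3s),
    }


small = ratios(*RUNS["small"])
large = ratios(*RUNS["large"])
scale = Fraction(RUNS["large"][0], RUNS["small"][0])

print(f"same ratios in both sets: {small == large}")
print(f"scale factor: {scale}")
for name, value in large.items():
    print(f"{name}: {value}")
```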

Page 19

Space Object Identification

[Figure: low resolution frames; high resolution image]

Combining frames to create high quality images in real-time

Page 20

Matrix Multiply

[Figure: Matrix-matrix multiplication test, C2050 (MAGMA vs CUBLAS). GFLOPS (0 to 700) versus matrix size (0 to 12000). Series: Magma, CUBLAS.]

[Figure: MAGMA-only, one-sided matrix factorization. GFLOPS (0 to 500) versus matrix size (0 to 12000). Series: Intel 5650 12 cores, Nvidia C2050.]
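For readers unfamiliar with how a GFLOPS figure is obtained from a timed matrix multiply, here is a minimal sketch. It assumes the conventional 2n^3 operation count for an n x n multiply; the slides do not state how their numbers were computed, and one-sided factorizations use different operation counts. The pure-Python multiply stands in for MAGMA or CUBLAS only to make the example runnable, so its rate is not comparable to the charts.

```python
import time


def matmul(a, b):
    """Naive dense product of two square matrices given as lists of lists."""
    bt = list(zip(*b))  # columns of b, so the inner sum walks two sequences
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gemm_flops(n):
    """Conventional operation count for an n x n matrix multiply: 2 * n^3."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Rate in GFLOPS for one n x n multiply that took `seconds`."""
    return gemm_flops(n) / seconds / 1e9


n = 64
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

print(f"n={n}: {gflops(n, elapsed):.4f} GFLOPS (pure Python, illustration only)")
```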

Page 21

LAMMPS on GPUs

• Condor/Firebird provides access to next-generation hybrid CPU/GPU architectures
  – Critical for understanding the capability and operation prior to larger deployments
  – Opportunity to study non-traditional applications of HPC, e.g., C4I applications
• CPU/GPU compute nodes provide significant raw computing power
  – OCL N-Body benchmark with 768K particles: sustained performance ~2 TFLOPS using 4 Tesla C2050s or 3 FirePro V8800s
• Production chemistry code (LAMMPS) shows speedup with minimal effort
  – Original CPU code ported to OpenCL with limited source code modifications
  – Exact double-precision algorithm runs on Nvidia and AMD nodes
  – Overall platform capability increased by 2x (2.8x) without any GPU optimization

[Figure: LAMMPS-OCL EAM Benchmark (2) (absolutely no GPU optimizations). Loop time (sec), 0 to 40. Series: Xeon X5660 (cores), FirePro V8800 (GPUs), Tesla C2050 (GPUs).]

[Figure: OpenCL N-Body Benchmark (1). GFLOPS, 0 to 2000, for 1 GPU, 2 GPUs, 3 GPUs and 4 GPUs. Series: Tesla C2050, FirePro V8800.]

(1) MPI-modified BDT N-Body benchmark distributed with COPRTHR 1.1

(2) LAMMPS-OCL is a modified version of the LAMMPS molecular dynamics code ported to OpenCL by Brown Deer Technology
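To give a feel for what ~2 TFLOPS means for a 768K-particle N-body run, the sketch below pairs a small all-pairs acceleration kernel with a time-per-step estimate. Several things here are assumptions, not facts from the slides: that the benchmark is a direct all-pairs computation, that "768K" means 768 * 1024 particles, and that each pairwise interaction is counted as 20 floating-point operations (a common convention for GPU N-body benchmarks). The kernel is plain Python for illustration and is unrelated to the BDT benchmark's code.

```python
FLOPS_PER_INTERACTION = 20  # assumed counting convention, not from the slide


def accelerations(pos, mass, softening=1e-3):
    """All-pairs gravitational accelerations (G = 1) for 3-D points.

    The softening term keeps the self-interaction finite; it contributes
    zero because the displacement is zero.
    """
    acc = []
    for xi, yi, zi in pos:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(pos, mass):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            inv_r3 = (dx * dx + dy * dy + dz * dz + softening ** 2) ** -1.5
            ax += mj * dx * inv_r3
            ay += mj * dy * inv_r3
            az += mj * dz * inv_r3
        acc.append((ax, ay, az))
    return acc


def step_seconds(n_particles, sustained_tflops):
    """Estimated time for one all-pairs step at a given sustained rate."""
    flops = FLOPS_PER_INTERACTION * n_particles ** 2
    return flops / (sustained_tflops * 1e12)


n = 768 * 1024  # assuming "768K" means 768 * 1024
print(f"{n ** 2:.3e} pairwise interactions per step")
print(f"about {step_seconds(n, 2.0):.1f} s per step at 2 TFLOPS")

# Tiny usage example: two unit masses one unit apart attract each other.
print(accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```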


Page 23

Future Work

• Improved OTR applications
  – Multiple languages
• Space Situation Awareness
  – Heterogeneous algorithms
• Persistent Wide Area Surveillance

Page 24

Autonomous Sensing in Persistent Wide-Area Surveillance

• Cross-TD effort
  – Investigate scalable, real-time and autonomous sensing technologies
  – Develop a neuromorphic computing architecture for synthetic aperture radar (SAR) imagery information exploitation
  – Provide critical wide-area persistent surveillance capabilities including motion detection, object recognition, areas-of-interest identification and predictive sensing

Page 25

Conclusions

– Valuable resource to support entire AFRL/RI, AFRL and tri-service RDT&E community.

– Leading large GPGPU development and benchmarking tests.

– This investment is leveraged by many (130+) users

– Technical benefits: faster, higher fidelity problem solution; multiple parallel solutions; heterogeneous application development

Page 26

DISTRIBUTION STATEMENT A – Unclassified, Unlimited Distribution

Questions?