
Page 1: EXASCALE RADIO ASTRONOMY

M Clark

Page 2: Contents

- GPU Computing
- GPUs for Radio Astronomy
- The problem is power
- Astronomy at the Exascale

Page 3: The March of GPUs

Page 4: What is a GPU?

Kepler K20X (2012): 2688 processing cores, 3995 SP Gflops peak

Effective SIMD width of 32 threads (a warp)

Deep memory hierarchy: as we move away from registers, bandwidth decreases and latency increases

Limited on-chip memory:
- 65,536 32-bit registers per SM
- 48 KiB shared memory per SM
- 1.5 MiB L2

Programmed using a thread model
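As an illustration of that thread model (a minimal sketch of my own, not from the slides): each thread computes one output element, threads are grouped into blocks, and the hardware executes them in 32-thread warps.

```cuda
#include <cstdio>

// Minimal CUDA thread-model sketch: one thread per output element.
// Threads are grouped into blocks; the hardware runs them in
// 32-thread warps (the effective SIMD width mentioned above).
__global__ void scale(const float *in, float *out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = alpha * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    int block = 256;                      // threads per block (8 warps)
    int grid  = (n + block - 1) / block;  // enough blocks to cover n
    scale<<<grid, block>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```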

Page 5: Minimum Port, Big Speed-up

[Diagram: application code split across processors; only the critical functions move to the GPU, while the rest of the sequential CPU code is unchanged]
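As a toy illustration of porting only the critical function (my example, not from the slides): the hot loop becomes a kernel, and the surrounding sequential code is untouched.

```cuda
#include <cmath>

// Toy "minimum port": only the hot loop moves to the GPU.
__global__ void hot_loop(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf(in[i]) * 2.0f;   // the critical function
}

void pipeline(const float *h_in, float *h_out, int n)
{
    // ...rest of the sequential CPU code runs unchanged...
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    hot_loop<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    // ...sequential CPU code continues...
}
```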


Page 7: Strong CUDA GPU Roadmap

[Chart: SGEMM / W, normalized, vs. year, 2008-2016]
- Tesla (2008): CUDA
- Fermi (2010): FP64
- Kepler (2012): Dynamic Parallelism
- Maxwell (2014): DX12
- Pascal (2016): Unified Memory, 3D Memory, NVLink

Page 8: Introducing NVLink and Stacked Memory

NVLink:
- GPU high-speed interconnect
- 80-200 GB/s
- Planned support for POWER CPUs

Stacked Memory:
- 4x higher bandwidth (~1 TB/s)
- 3x larger capacity
- 4x more energy efficient per bit

Page 9: NVLink Enables Data Transfer at the Speed of CPU Memory

[Diagram: Tesla GPU with stacked memory (HBM, ~1 Terabyte/s) connected over NVLink at 80 GB/s to a CPU with DDR memory; NVLink is comparable to the CPU's own DDR4 bandwidth of 50-75 GB/s]

Page 10: Radio Telescope Data Flow

[Diagram: N antennas -> digital RF samplers -> Correlator -> Calibration & Imaging. The samplers and correlator run in real time; calibration & imaging runs in real time and post-real-time. Stage costs scale between O(N) and O(N^2), with the correlator combining O(N log N) channelization and O(N^2) cross-multiplication.]

Page 11: Where can GPUs be Applied?

Cross correlation - GPUs are ideal
- Performance similar to CGEMM
- High-performance open-source library: https://github.com/GPU-correlators/xGPU

Calibration and Imaging
- Gridding: coordinate mapping of input data onto a regular grid
- Arithmetic intensity scales with the convolution kernel size
- Compute-bound problem maps well to GPUs
- Dominant time sink in the compute pipeline

Other image processing steps (see the FFT sketch below)
- CUFFT: highly optimized Fast Fourier Transform library
- PFB (polyphase filter bank): computational intensity increases with the number of taps
- Coordinate transformations and resampling
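A minimal sketch of batched imaging-style FFTs with CUFFT (the sizes are illustrative assumptions, not from the slides):

```cuda
#include <cufft.h>

// Minimal CUFFT sketch: batched 1D complex-to-complex transforms,
// as used when channelizing or imaging. Sizes are illustrative.
int main()
{
    const int n = 4096;     // transform length (assumed)
    const int batch = 128;  // number of independent transforms (assumed)

    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);
    // ...fill `data` with samples from an earlier pipeline stage...

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);        // plan once...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // ...execute many times
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```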

Page 12: GPUs in Radio Astronomy

Already an essential tool in radio astronomy:
- ASKAP - Western Australia
- LEDA - United States of America
- LOFAR - Netherlands (+ Europe)
- MWA - Western Australia
- NCRA - India
- PAPER - South Africa

Page 13: Cross Correlation on GPUs

Cross correlation is essentially GEMM; a sketch follows below.
Hierarchical locality.
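To make the GEMM analogy concrete, here is a hedged sketch (my own, not the xGPU implementation): every baseline (i, j) accumulates the outer product of antenna signals over time, exactly like a rank-k update in GEMM.

```cuda
#include <cuComplex.h>

// Hedged X-engine sketch (not the actual xGPU code): for each baseline
// (i, j), integrate x_i(t) * conj(x_j(t)) over time. One thread per
// baseline; a real correlator tiles this hierarchically through shared
// memory and registers, just like an optimized GEMM.
__global__ void correlate(const cuFloatComplex *x,  // [ntime][nant] samples
                          cuFloatComplex *vis,      // [nant][nant] visibilities
                          int nant, int ntime)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // antenna i
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // antenna j
    if (i >= nant || j > i) return;                 // lower triangle only

    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int t = 0; t < ntime; t++)
        acc = cuCaddf(acc, cuCmulf(x[t * nant + i],
                                   cuConjf(x[t * nant + j])));
    vis[i * nant + j] = acc;
}
```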

Page 14: Correlator Efficiency

[Chart: X-engine GFLOPS per watt, log scale 1-64, vs. year, 2008-2016, across Tesla, Fermi, Kepler, Maxwell, and Pascal; annotated sustained throughputs of 0.35 TFLOPS, >1 TFLOPS, and >2.5 TFLOPS for successive generations]

Page 15: Software Correlation Flexibility

Why do software correlation? Software correlators inherently have a great degree of flexibility.

Software correlation can do on-the-fly reconfiguration:
- Subset correlation at increased bandwidth
- Subset correlation at decreased integration time
- Pulsar binning
- Easy classification of data (RFI thresholding)

Software is portable: the correlator has been unchanged since 2010 and is already running on the 2016 architecture.

Page 16: HPC's Biggest Challenge: Power

The power of a 300-petaflop CPU-only supercomputer = the power for the city of San Francisco.

Page 17: GPUs Power World's 10 Greenest Supercomputers

Green500 Rank | MFLOPS/W | Site
 1 | 4,503.17 | GSIC Center, Tokyo Tech
 2 | 3,631.86 | Cambridge University
 3 | 3,517.84 | University of Tsukuba
 4 | 3,185.91 | Swiss National Supercomputing (CSCS)
 5 | 3,130.95 | ROMEO HPC Center
 6 | 3,068.71 | GSIC Center, Tokyo Tech
 7 | 2,702.16 | University of Arizona
 8 | 2,629.10 | Max-Planck
 9 | 2,629.10 | (Financial Institution)
10 | 2,358.69 | CSIRO
37 | 1,959.90 | Intel Endeavor (top Xeon Phi cluster)
49 | 1,247.57 | Météo France (top CPU cluster)

Page 18: The End of Historic Scaling

C. Moore, "Data Processing in ExaScale-Class Computer Systems," Salishan Conference, April 2011.

Page 19: The End of Voltage Scaling

In the good old days, leakage was not important, and voltage scaled with feature size:

    L' = L/2, V' = V/2, E' = CV^2 = E/8, f' = 2f, D' = 1/L^2 = 4D, P' = P

Halve L and get 4x the transistors and 8x the capability for the same power.

The new reality: leakage has limited the threshold voltage, largely ending voltage scaling:

    L' = L/2, V' ≈ V, E' = CV^2 = E/2, f' = 2f, D' = 1/L^2 = 4D, P' = 4P

Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in 1/4 the area.
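Where the power lines come from (standard Dennard-scaling arithmetic, my derivation rather than the slide's): dynamic power is C V^2 f per device, times device count D, with per-device capacitance scaling as C' = C/2.

```latex
% Dynamic power: P = C V^2 f per device, times density D; C' = C/2.
% Good old days (V' = V/2):
P' = \tfrac{C}{2}\left(\tfrac{V}{2}\right)^{2}(2f)(4D) = C V^{2} f D = P
% New reality (V' \approx V):
P' = \tfrac{C}{2}\,V^{2}(2f)(4D) = 4\,C V^{2} f D = 4P
```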

Page 20: Major Software Implications

Need to expose massive concurrency
- An exaflop at O(GHz) clocks means 10^18 / 10^9 = O(billion-way) parallelism!

Need to expose and exploit locality
- Data motion is more expensive than computation: > 100:1 global vs. local energy

Need to start addressing resiliency in the applications

Page 21: How Parallel is Astronomy?

SKA1-LOW specifications:
- 1024 dual-pol stations => 2,096,128 visibilities
- 262,144 frequency channels
- 300 MHz bandwidth

Correlator:
- 5 Pflops of computation
- Data-parallel across visibilities
- Task-parallel across frequency channels
- O(trillion-way) parallelism (see the arithmetic below)
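Where those counts come from (my arithmetic, consistent with the slide's numbers): 1024 dual-pol stations give 2048 signal paths, correlating every distinct pair of paths gives the visibility count, and multiplying by channels gives the available parallelism.

```latex
% 1024 dual-pol stations -> 2048 signal paths; correlate all distinct pairs:
\frac{2048 \times 2047}{2} = 2{,}096{,}128 \ \text{visibilities}
% Independent work items = visibilities x frequency channels:
2{,}096{,}128 \times 262{,}144 \approx 5.5 \times 10^{11}
\ \Rightarrow\ O(\text{trillion-way parallelism})
```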

Page 22: How Parallel is Astronomy?

SKA1-LOW specifications:
- 1024 dual-pol stations => 2,096,128 visibilities
- 262,144 frequency channels
- 300 MHz bandwidth

Gridding (W-projection):
- Kernel size 100x100
- Parallel across kernel size and visibilities (J. Romein)
- O(10 billion-way) parallelism (a sketch follows below)
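A hedged sketch of the parallelization idea (a much-simplified take, not Romein's actual algorithm): assign one thread to each (visibility, kernel pixel) pair and scatter-add into the grid, so the parallelism is n_vis x 100 x 100.

```cuda
#include <cuComplex.h>

// Simplified W-projection gridding sketch (illustrative only; Romein's
// scheme avoids most of these atomics by assigning threads to fixed
// grid positions). Parallelism spans visibilities x kernel pixels.
__global__ void grid_visibilities(
    const cuFloatComplex *vis, const int *u, const int *v,  // per visibility
    const cuFloatComplex *kernel,   // [ksize][ksize] convolution kernel
    cuFloatComplex *grid,           // [gsize][gsize] uv grid
    int nvis, int ksize, int gsize)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int npix = ksize * ksize;
    if (tid >= nvis * npix) return;

    int ivis = tid / npix;          // which visibility
    int ipix = tid % npix;          // which kernel pixel
    int dx = ipix % ksize, dy = ipix / ksize;

    cuFloatComplex w = cuCmulf(vis[ivis], kernel[ipix]);
    int gx = u[ivis] + dx, gy = v[ivis] + dy;
    if (gx < 0 || gy < 0 || gx >= gsize || gy >= gsize) return;

    // scatter-add the kernel-weighted visibility into the uv grid
    atomicAdd(&grid[gy * gsize + gx].x, w.x);
    atomicAdd(&grid[gy * gsize + gx].y, w.y);
}
```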

Page 23: Energy Efficiency Drives Locality

[Diagram: energy cost of operations on a 20 mm, 28 nm IC]
- 64-bit DP floating-point op: 20 pJ
- 256-bit access to an 8 kB SRAM: 50 pJ
- Moving 256 bits across on-chip wires: 26-256 pJ, ~1000 pJ across the full die
- Efficient off-chip link: 500 pJ
- DRAM read/write: 16,000 pJ

Page 24: Energy Efficiency Drives Locality

[Chart: picojoules per operation, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM (stacked), and point-to-point links]

Page 25: Energy Efficiency Drives Locality

This is observable today. We have lots of tunable parameters:
- Register tile size: how much work should each thread do?
- Thread block size: how many threads should work together?
- Input precision: size of the input words

Quick and dirty cross-correlation example (sketched below): going from 4x4 to 8x8 register tiling is 8.5% faster at 5.5% lower power => 14% improvement in GFLOPS / watt.
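A hedged illustration of what register tiling means here (a generic pattern, not the benchmarked kernel): each thread computes a TILE x TILE block of baselines, reusing each loaded sample TILE times from registers instead of re-reading memory.

```cuda
#include <cuComplex.h>

#define TILE 4  // register tile size: each thread owns TILE x TILE outputs

// Generic register-tiled correlation sketch (not the benchmarked kernel).
// Each loaded antenna sample is reused TILE times out of registers, which
// raises arithmetic intensity: larger tiles ran faster at lower power.
__global__ void correlate_tiled(const cuFloatComplex *x,  // [ntime][nant]
                                cuFloatComplex *vis, int nant, int ntime)
{
    int i0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;  // row tile
    int j0 = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;  // col tile
    if (i0 + TILE > nant || j0 + TILE > nant) return;

    cuFloatComplex acc[TILE][TILE] = {};        // accumulators in registers
    for (int t = 0; t < ntime; t++) {
        cuFloatComplex a[TILE], b[TILE];        // per-thread operand tiles
        for (int k = 0; k < TILE; k++) {
            a[k] = x[t * nant + i0 + k];
            b[k] = cuConjf(x[t * nant + j0 + k]);
        }
        for (int i = 0; i < TILE; i++)          // TILE*TILE FMAs per
            for (int j = 0; j < TILE; j++)      // 2*TILE loads
                acc[i][j] = cuCaddf(acc[i][j], cuCmulf(a[i], b[j]));
    }
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            vis[(i0 + i) * nant + (j0 + j)] = acc[i][j];
}
```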

Page 26: SKA1 LOW Sketch

[Diagram: data flow with N = 1024 stations]
RF Samplers (8-bit digitization) --10 Tb/s--> Correlator, O(10) PFLOPS --50 Tb/s--> Calibration & Imaging, O(100) PFLOPS (arithmetic below)
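A consistency check on the input data rate (my arithmetic, assuming Nyquist sampling of the 300 MHz band): 2048 signal paths at 600 Msample/s and 8 bits per sample land right at the quoted 10 Tb/s.

```latex
% 2048 paths, Nyquist-sampled at 2 x 300 MHz = 600 Msample/s, 8 bits each:
2048 \times 600\times10^{6}\,\mathrm{sample/s} \times 8\,\mathrm{bit}
  \approx 9.8\,\mathrm{Tb/s} \approx 10\,\mathrm{Tb/s}
```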

Page 27: Energy Efficiency Drives Locality

[Chart repeated from page 24: picojoules per operation, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM (stacked), and point-to-point links]

Page 28: Do we need Moore's Law?

Moore's law comes from shrinking the process, and Moore's law is slowing down:
- Dennard scaling is dead
- Increasing wafer costs mean it takes longer to move to the next process

Page 29: Improving Energy Efficiency @ Iso-Process

We don't know how to build the perfect processor:
- Huge focus on improved architectural efficiency
- Better understanding of a given process

Compare the Fermi vs. Kepler vs. Maxwell architectures @ 28 nm:
- GF117: 96 cores, peak 192 Gflops
- GK107: 384 cores, peak 770 Gflops
- GM107: 640 cores, peak 1330 Gflops

Use the cross-correlation benchmark; only measure GPU power.

Page 30: Improving Energy Efficiency @ 28 nm

[Chart: cross-correlation GFLOPS per watt, scale 0-25, for Fermi, Kepler, and Maxwell at 28 nm; bars annotated 80%, 55%, and 80% respectively]

Page 31: How Hot is Your Supercomputer?

1. TSUBAME-KFC: Tokyo Tech, oil cooled, 4,503 MFLOPS / watt
2. Wilkes Cluster: U. Cambridge, air cooled, 3,631 MFLOPS / watt

Number 1 is 24% more efficient than number 2.

Page 32: Temperature is Power

Power is dynamic and static:
- Dynamic power is work
- Static power is leakage

The dominant static term comes from sub-threshold leakage, which depends exponentially on temperature through the thermal voltage:

    I_leak ≈ A · (W/L) · e^((V_s − V_th) / (n · v_T))

Voltage terms:
- V_s: gate-to-source voltage
- V_th: switching threshold voltage
- n: transistor sub-threshold swing coefficient

Device specifics:
- A: technology-specific constant
- L, W: device channel length and width

Thermal voltage: v_T = kT/q, with k = 8.62x10^-5 eV/K; about 26 mV at room temperature.

Page 33: Temperature is Power

[Chart: power draw vs. temperature for Tesla K20m (GK110, 28 nm) and GeForce GTX 580 (GF110, 40 nm)]

Page 34: Tuning for Power Efficiency

A given processor does not have a fixed power efficiency. It depends on:
- Clock frequency
- Voltage
- Temperature
- Algorithm

Tune in this multi-dimensional space for power efficiency. E.g., cross-correlation on a Kepler K20: 12.96 -> 18.34 GFLOPS / watt (a measurement sketch follows below).

Bad news: no two processors are exactly alike.
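One way such numbers can be gathered (a hedged sketch using NVML; the talk does not say what tooling was used): sample board power around the kernel under test while sweeping clocks externally, e.g. with `nvidia-smi -ac <memMHz>,<smMHz>`.

```cuda
#include <nvml.h>
#include <cstdio>

// Hedged power-measurement sketch using NVML (link with -lnvml).
// Sample board power while the kernel under test runs; sweep clocks
// externally and pick the point maximizing GFLOPS per watt.
int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int mw;                    // milliwatts
    nvmlDeviceGetPowerUsage(dev, &mw);  // instantaneous board power
    printf("power: %.1f W\n", mw / 1000.0);

    nvmlShutdown();
    return 0;
}
```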

Page 35: Precision is Power

Multiplier power scales approximately with the square of the operand width. Most computation is done in FP32 / FP64, but we should use the minimum precision required by the science.

Maxwell GPUs have 16-bit integer multiply-add at the FP32 rate.

Algorithms should increasingly use hierarchical precision: only invoke high precision when necessary (a sketch follows below).

Signal processing folks have known this for a long time; the lesson is feeding back into the HPC community...
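A hedged sketch of the hierarchical-precision idea (illustrative, not from the talk): keep the hot inner loop in narrow arithmetic (16-bit integer multiplies, which Maxwell runs at the FP32 rate) and promote to wide floating point only once per integration window.

```cuda
#include <cstdint>

// Hierarchical-precision sketch (illustrative): correlate 8-bit samples
// with 16-bit multiplies and 32-bit integer accumulation, promoting to
// FP64 only once per integration window. The narrow math in the hot
// loop is where the power goes; the wide math is rare.
__global__ void correlate_int(const int8_t *a, const int8_t *b,
                              double *acc, int n, int window)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * window >= n) return;

    int32_t sum = 0;                    // cheap 32-bit accumulator
    for (int t = 0; t < window; t++) {  // window sized to avoid overflow
        int16_t pa = a[i * window + t]; // 16-bit multiply-add path
        int16_t pb = b[i * window + t];
        sum += (int32_t)(pa * pb);
    }
    acc[i] += (double)sum;              // rare promotion to FP64
}
```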

Page 36: Conclusions

Astronomy has an insatiable appetite for compute, and many-core processors are a perfect match for the processing pipeline. Power is a problem, but:
- Astronomy has oodles of parallelism
- Key algorithms possess locality
- Precision requirements are well understood
- Scientists and engineers are wedded to the problem

Astronomy is perhaps the ideal application for the exascale.