Exascale Radio Astronomy


Page 1: Exascale  radio astronomy

1

EXASCALE RADIO ASTRONOMY
M Clark

Page 2: Exascale  radio astronomy

2

Contents

GPU Computing
GPUs for Radio Astronomy
The problem is power
Astronomy at the Exascale

Page 3: Exascale  radio astronomy

3

The March of GPUs

Page 4: Exascale  radio astronomy

4

What is a GPU? Kepler K20X (2012)

2688 processing cores
3995 SP Gflops peak
Effective SIMD width of 32 threads (warp)
Deep memory hierarchy: as we move away from registers,
  bandwidth decreases
  latency increases
Limited on-chip memory:
  65,536 32-bit registers per SM
  48 KiB shared memory per SM
  1.5 MiB L2
Programmed using a thread model (see the sketch below)
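To make the thread model concrete, here is a minimal CUDA sketch (not from the talk): a SAXPY kernel in which each thread handles one array element, threads are grouped into blocks, and the hardware issues them in 32-thread warps. All sizes and values are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread processes one element; threads are grouped into blocks,
// and the hardware executes them in 32-thread warps (the SIMD width above).
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *h_x = new float[n], *h_y = new float[n];
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;                           // threads per block
    const int grid  = (n + block - 1) / block;       // enough blocks to cover n
    saxpy<<<grid, block>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);                   // expect 4.0

    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```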

Page 5: Exascale  radio astronomy

5

Minimum Port, Big Speed-up

[Diagram: application code split: only the critical functions are ported to the GPU; the rest of the sequential CPU code stays on the CPU]

Page 6: Exascale  radio astronomy

6

Page 7: Exascale  radio astronomy

7

Strong CUDA GPU Roadmap

[Chart: SGEMM/W (normalized) vs. year, 2008 to 2016: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]

Page 8: Exascale  radio astronomy

8

Introducing NVLINK and Stacked Memory

NVLink
  GPU high-speed interconnect
  80-200 GB/s
  Planned support for POWER CPUs

Stacked Memory
  4x higher bandwidth (~1 TB/s)
  3x larger capacity
  4x more energy efficient per bit

Page 9: Exascale  radio astronomy

9

NVLink Enables Data Transfer At Speed of CPU Memory

[Diagram: Tesla GPU with stacked memory (HBM, ~1 Terabyte/s) and CPU with DDR memory (DDR4, 50-75 GB/s), connected by NVLink at 80 GB/s]

Page 10: Exascale  radio astronomy

10

Radio Telescope Data Flow

[Diagram: N antennas feed RF samplers (digital), then the correlator (real-time), then calibration & imaging (real-time and post-real-time); per-stage costs span O(N), O(N log N), and O(N^2)]

Page 11: Exascale  radio astronomy

11

Where can GPUs be Applied?

Cross correlation: GPUs are ideal
  Performance similar to CGEMM
  High-performance open-source library: https://github.com/GPU-correlators/xGPU

Calibration and imaging
  Gridding: coordinate mapping of input data onto a regular grid
  Arithmetic intensity scales with the convolution kernel size
  Compute-bound problem maps well to GPUs
  Dominant time sink in the compute pipeline

Other image processing steps
  CUFFT: highly optimized Fast Fourier Transform library (see the sketch after this list)
  PFB: computational intensity increases with the number of taps
  Coordinate transformations and resampling
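Since the deck points to CUFFT for the FFT steps, a minimal usage sketch may help. The transform length and batch count below are hypothetical choices, not values from the talk.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Minimal CUFFT sketch: a batched 1-D complex-to-complex transform,
// as used in imaging/channelization steps. Sizes are illustrative only.
int main()
{
    const int NX = 4096;      // transform length (hypothetical)
    const int BATCH = 128;    // number of independent transforms (hypothetical)

    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * NX * BATCH);

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);          // plan once, reuse many times
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```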

Page 12: Exascale  radio astronomy

12

GPUs in Radio Astronomy

Already an essential tool in radio astronomy:
  ASKAP – Western Australia
  LEDA – United States of America
  LOFAR – Netherlands (+ Europe)
  MWA – Western Australia
  NCRA – India
  PAPER – South Africa

[Photos: LOFAR, LEDA, MWA, ASKAP, PAPER]

Page 13: Exascale  radio astronomy

13

Cross Correlation on GPUs

Cross correlation is essentially GEMM (see the sketch below)
Hierarchical locality
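As a hedged illustration of "cross correlation is essentially GEMM": the visibility matrix over one integration is R = X X^H, which is a complex Hermitian rank-k update. The sketch below expresses that structure with cuBLAS (cublasCherk); it is not the xGPU kernel, and the matrix sizes are illustrative placeholders.

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Sketch: form R = X * X^H, where X is NSIG x NSAMP (column-major), one row
// per antenna-polarization input, and R is the NSIG x NSIG visibility matrix
// (lower triangle computed). This only shows the GEMM-like structure.
int main()
{
    const int NSIG  = 2048;   // e.g. 1024 dual-pol stations (illustrative)
    const int NSAMP = 1024;   // time samples per integration (illustrative)

    cuComplex *d_X, *d_R;
    cudaMalloc(&d_X, sizeof(cuComplex) * NSIG * NSAMP);
    cudaMalloc(&d_R, sizeof(cuComplex) * NSIG * NSIG);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C := alpha * A * A^H + beta * C (Hermitian rank-k update)
    cublasCherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                NSIG, NSAMP, &alpha, d_X, NSIG, &beta, d_R, NSIG);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(d_X);
    cudaFree(d_R);
    return 0;
}
```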

Page 14: Exascale  radio astronomy

14

Correlator Efficiency

[Chart: correlator X-engine GFLOPS per watt vs. year, 2008 to 2016, log scale 1 to 64, for Tesla, Fermi, Kepler, Maxwell, and Pascal; annotations: 0.35 TFLOPS sustained, >1 TFLOPS sustained, >2.5 TFLOPS sustained]

Page 15: Exascale  radio astronomy

15

Software Correlation Flexibility

Why do software correlation?
Software correlators inherently have a great degree of flexibility

Software correlation can do on-the-fly reconfiguration:
  Subset correlation at increased bandwidth
  Subset correlation at decreased integration time
  Pulsar binning
  Easy classification of data (RFI threshold)

Software is portable: the correlator has been unchanged since 2010
Already running on the 2016 architecture

Page 16: Exascale  radio astronomy

16

HPC's Biggest Challenge: Power

Power of a 300 Petaflop CPU-only supercomputer = power for the city of San Francisco

Page 17: Exascale  radio astronomy

17

GPUs Power the World's 10 Greenest Supercomputers (Green500)

Rank  MFLOPS/W  Site
1     4,503.17  GSIC Center, Tokyo Tech
2     3,631.86  Cambridge University
3     3,517.84  University of Tsukuba
4     3,185.91  Swiss National Supercomputing (CSCS)
5     3,130.95  ROMEO HPC Center
6     3,068.71  GSIC Center, Tokyo Tech
7     2,702.16  University of Arizona
8     2,629.10  Max-Planck
9     2,629.10  (Financial Institution)
10    2,358.69  CSIRO
37    1,959.90  Intel Endeavor (top Xeon Phi cluster)
49    1,247.57  Météo France (top CPU cluster)

Page 18: Exascale  radio astronomy

18

The End of Historic Scaling

C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Page 19: Exascale  radio astronomy

19

The End of Voltage Scaling

In the Good Old Days
Leakage was not important, and voltage scaled with feature size:
  L' = L/2
  V' = V/2
  E' = C'V'^2 = E/8
  f' = 2f
  D' = 1/L'^2 = 4D
  P' = P
Halve L and get 4x the transistors and 8x the capability for the same power.

The New Reality
Leakage has limited threshold voltage, largely ending voltage scaling:
  L' = L/2
  V' ≈ V
  E' = C'V'^2 = E/2
  f' = 2f
  D' = 1/L'^2 = 4D
  P' = 4P
Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in 1/4 the area.
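One way to see where the factor of 4 comes from, as a sketch using the relations above and treating chip power at fixed area as (energy per op) x (clock) x (device count):

```latex
\[
P \;\propto\; E \, f \, D,
\qquad
P'_{\text{Dennard}} \;\propto\; \tfrac{E}{8}\cdot 2f \cdot 4D \;=\; E f D \;=\; P,
\qquad
P'_{\text{today}} \;\propto\; \tfrac{E}{2}\cdot 2f \cdot 4D \;=\; 4\,E f D \;=\; 4P .
\]
```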

Page 20: Exascale  radio astronomy

20

Major Software Implications

Need to expose massive concurrency
  An exaflop at O(GHz) clocks means O(billion-way) parallelism (10^18 flop/s at ~10^9 Hz is ~10^9 operations per cycle)!

Need to expose and exploit locality
  Data motion is more expensive than computation
  > 100:1 global vs. local energy

Need to start addressing resiliency in the applications

Page 21: Exascale  radio astronomy

21

How Parallel is Astronomy?

SKA1-LOW specifications
  1024 dual-pol stations => 2,096,128 visibilities
  262,144 frequency channels
  300 MHz bandwidth

Correlator
  5 Pflops of computation
  Data-parallel across visibilities
  Task-parallel across frequency channels
  O(trillion-way) parallelism (see the worked count below)
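The trillion-way figure follows from the numbers on the slide; a rough worked count (rounding aside):

```latex
\[
N_{\mathrm{sig}} = 2 \times 1024 = 2048, \qquad
N_{\mathrm{vis}} = \tfrac{1}{2}\, N_{\mathrm{sig}} \left(N_{\mathrm{sig}} - 1\right) = 2{,}096{,}128,
\]
\[
N_{\mathrm{vis}} \times N_{\mathrm{chan}} = 2{,}096{,}128 \times 262{,}144 \;\approx\; 5.5 \times 10^{11}
\]
```

independent visibility-channel products per integration, i.e. O(trillion-way) parallelism.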

Page 22: Exascale  radio astronomy

22

How Parallel is Astronomy?

SKA1-LOW specifications
  1024 dual-pol stations => 2,096,128 visibilities
  262,144 frequency channels
  300 MHz bandwidth

Gridding (W-projection)
  Kernel size 100x100
  Parallel across kernel size and visibilities (J. Romein)
  O(10 billion-way) parallelism (see the sketch below)
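A deliberately naive gridding sketch (not Romein's algorithm) just to expose the parallelism the slide counts: one thread per (visibility, kernel-pixel) pair, with atomics resolving grid collisions. All buffer sizes and contents below are placeholders.

```cuda
#include <cuda_runtime.h>

#define SUPPORT 100   // convolution kernel support (from the slide)

// One thread per (visibility, kernel pixel): nvis * SUPPORT^2 threads in total.
__global__ void grid_visibilities(int nvis,
                                  const int    *u,      // grid x of each visibility
                                  const int    *v,      // grid y of each visibility
                                  const float2 *vis,    // complex visibility values
                                  const float2 *ckern,  // SUPPORT*SUPPORT conv. kernel
                                  float2 *grid, int gridsize)
{
    long tid   = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long total = (long)nvis * SUPPORT * SUPPORT;
    if (tid >= total) return;

    int visidx = tid / (SUPPORT * SUPPORT);   // which visibility
    int pix    = tid % (SUPPORT * SUPPORT);   // which kernel pixel
    int dx = pix % SUPPORT, dy = pix / SUPPORT;

    int gx = u[visidx] + dx, gy = v[visidx] + dy;
    if (gx < 0 || gx >= gridsize || gy < 0 || gy >= gridsize) return;

    float2 w = ckern[pix], s = vis[visidx];
    // complex multiply-accumulate onto the grid; atomics resolve collisions
    atomicAdd(&grid[gy * gridsize + gx].x, w.x * s.x - w.y * s.y);
    atomicAdd(&grid[gy * gridsize + gx].y, w.x * s.y + w.y * s.x);
}

int main()
{
    const int nvis = 1024, gridsize = 2048;   // illustrative sizes
    int *d_u, *d_v; float2 *d_vis, *d_ck, *d_grid;
    cudaMalloc(&d_u, nvis * sizeof(int));
    cudaMalloc(&d_v, nvis * sizeof(int));
    cudaMalloc(&d_vis, nvis * sizeof(float2));
    cudaMalloc(&d_ck, SUPPORT * SUPPORT * sizeof(float2));
    cudaMalloc(&d_grid, (size_t)gridsize * gridsize * sizeof(float2));
    cudaMemset(d_u, 0, nvis * sizeof(int));   // dummy data for the sketch
    cudaMemset(d_v, 0, nvis * sizeof(int));
    cudaMemset(d_vis, 0, nvis * sizeof(float2));
    cudaMemset(d_ck, 0, SUPPORT * SUPPORT * sizeof(float2));
    cudaMemset(d_grid, 0, (size_t)gridsize * gridsize * sizeof(float2));

    long total   = (long)nvis * SUPPORT * SUPPORT;
    int  block   = 256;
    long nblocks = (total + block - 1) / block;
    grid_visibilities<<<(unsigned)nblocks, block>>>(nvis, d_u, d_v, d_vis, d_ck,
                                                    d_grid, gridsize);
    cudaDeviceSynchronize();
    return 0;
}
```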

Page 23: Exascale  radio astronomy

23

Energy Efficiency Drives Locality

[Diagram: energy per operation on a 28 nm IC: 64-bit DP op 20 pJ; 256-bit access to an 8 kB SRAM 50 pJ; moving 256 bits on-chip 26 pJ to 256 pJ, up to 1000 pJ across a 20 mm die; efficient off-chip link 500 pJ; DRAM read/write 16000 pJ]

Page 24: Exascale  radio astronomy

24

Energy Efficiency Drives Locality

[Chart: picojoules per access, log scale 10 to 100000, for FMA, registers, L1, L2, stacked DRAM, and point-to-point links]

Page 25: Exascale  radio astronomy

25

Energy Efficiency Drives Locality

This is observable today. We have lots of tunable parameters:
  Register tile size: how much work should each thread do?
  Thread block size: how many threads should work together?
  Input precision: size of the input words

Quick and dirty cross-correlation example:
  4x4 => 8x8 register tiling
  8.5% faster, 5.5% lower power => 14% improvement in GFLOPS/watt (≈ 1.085 / 0.945)

Page 26: Exascale  radio astronomy

26

SKA1-LOW Sketch

[Diagram: radio telescope data flow for SKA1-LOW, N = 1024: RF samplers with 8-bit digitization, 10 Tb/s and 50 Tb/s data rates between stages, and correlator plus calibration & imaging stages requiring O(10) PFLOPS and O(100) PFLOPS]

Page 27: Exascale  radio astronomy

27

Energy Efficiency Drives Locality

[Chart: picojoules per access, log scale 10 to 100000, for FMA, registers, L1, L2, stacked DRAM, and point-to-point links]

Page 28: Exascale  radio astronomy

28

Do we need Moore's Law?

Moore's law comes from shrinking process
Moore's law is slowing down:
  Dennard scaling is dead
  Increasing wafer costs mean it takes longer to move to the next process

Page 29: Exascale  radio astronomy

29

Improving Energy Efficiency @ Iso-Process

We don't know how to build the perfect processor
  Huge focus on improved architecture efficiency
  Better understanding of a given process

Compare the Fermi vs. Kepler vs. Maxwell architectures @ 28 nm:
  GF117 (Fermi): 96 cores, peak 192 Gflops
  GK107 (Kepler): 384 cores, peak 770 Gflops
  GM107 (Maxwell): 640 cores, peak 1330 Gflops

Use the cross-correlation benchmark
Only measure GPU power

Page 30: Exascale  radio astronomy

30

Improving Energy Efficiency @ 28 nm

[Chart: cross-correlation GFLOPS per watt at 28 nm for Fermi, Kepler, and Maxwell, y-axis 0 to 25; bars labeled 80%, 55%, 80%]

Page 31: Exascale  radio astronomy

31

How Hot is Your Supercomputer?

1. TSUBAME-KFC
   Tokyo Tech, oil cooled
   4,503 MFLOPS/W

2. Wilkes Cluster
   U. Cambridge, air cooled
   3,631 MFLOPS/W

Number 1 is 24% more efficient than number 2

Page 32: Exascale  radio astronomy

32

Temperature is Power

Power is dynamic and static
  Dynamic power is work
  Static power is leakage
  Dominant term comes from sub-threshold leakage

Voltage terms:
  Vs: gate-to-source voltage
  Vth: switching threshold voltage
  n: transistor sub-threshold swing coefficient

Device specifics:
  A: technology-specific constant
  L, W: device channel length and width

Thermal voltage:
  k = 8.62×10^-5 eV/K
  vT ≈ 26 mV at room temperature
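The transcript lost the leakage equation itself, but the symbol list above matches the standard sub-threshold current expression; a sketch of that standard form, written with the slide's symbols:

```latex
\[
I_{\mathrm{leak}} \;\approx\; A \,\frac{W}{L}\, v_T^{2}\,
\exp\!\left(\frac{V_s - V_{th}}{n\, v_T}\right),
\qquad
v_T = \frac{kT}{q} \approx 26\ \mathrm{mV}\ \text{at}\ T \approx 300\ \mathrm{K}.
\]
```

Since vT grows with temperature (and Vth falls), static power rises rapidly as the chip heats up, which is the sense in which temperature is power.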

Page 33: Exascale  radio astronomy

33

Temperature is Power

[Figure: Tesla K20m (GK110, 28 nm) vs. GeForce GTX 580 (GF110, 40 nm)]

Page 34: Exascale  radio astronomy

34

Tuning for Power Efficiency

A given processor does not have a fixed power efficiency; it depends on:
  Clock frequency
  Voltage
  Temperature
  Algorithm

Tune in this multi-dimensional space for power efficiency
  E.g., cross-correlation on a Kepler K20: 12.96 -> 18.34 GFLOPS/watt (see the measurement sketch below)

Bad news: no two processors are exactly alike
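Measuring GFLOPS per watt requires reading board power while the kernel runs; a minimal host-side sketch using standard NVML calls (the tuning sweep itself is omitted, and device index 0 is an assumption):

```cuda
#include <stdio.h>
#include <nvml.h>

// Read GPU board power and temperature so that clock/voltage/algorithm
// variants can be compared in GFLOPS per watt. Link with -lnvidia-ml.
int main()
{
    nvmlDevice_t dev;
    unsigned int power_mw, temp_c;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);               // device 0 assumed

    nvmlDeviceGetPowerUsage(dev, &power_mw);                       // milliwatts
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);  // Celsius

    printf("board power: %.1f W, temperature: %u C\n",
           power_mw / 1000.0, temp_c);

    nvmlShutdown();
    return 0;
}
```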

Page 35: Exascale  radio astronomy

35

Precision is Power

Multiplier power scales approximately with the square of the operand width
Most computation is done in FP32 / FP64
Should use the minimum precision required by the science

Maxwell GPUs have 16-bit integer multiply-add at the FP32 rate
Algorithms should increasingly use hierarchical precision: only invoke high precision when necessary (see the sketch below)

Signal processing folks have known this for a long time
The lesson is feeding back into the HPC community...
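A small sketch of hierarchical precision (an illustration, not the correlator's actual scheme): reduce each block's data in FP32, and accumulate only the handful of per-block partial sums in FP64 at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Per-block FP32 reduction: the bulk of the arithmetic stays in low precision.
__global__ void block_sum_fp32(const float *x, int n, float *block_sums)
{
    extern __shared__ float s[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = s[0];
}

int main()
{
    const int n = 1 << 22, block = 256;
    const int grid = (n + block - 1) / block;

    float *h_x = new float[n];
    for (int i = 0; i < n; i++) h_x[i] = 1.0f / (i + 1);   // toy data

    float *d_x, *d_partial;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_partial, grid * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum_fp32<<<grid, block, block * sizeof(float)>>>(d_x, n, d_partial);

    float *h_partial = new float[grid];
    cudaMemcpy(h_partial, d_partial, grid * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;                    // high precision only at the final step
    for (int i = 0; i < grid; i++) total += h_partial[i];
    printf("sum = %.10f\n", total);

    delete[] h_x; delete[] h_partial;
    cudaFree(d_x); cudaFree(d_partial);
    return 0;
}
```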

Page 36: Exascale  radio astronomy

36

Conclusions

Astronomy has an insatiable appetite for compute
Many-core processors are a perfect match for the processing pipeline
Power is a problem, but:
  Astronomy has oodles of parallelism
  Key algorithms possess locality
  Precision requirements are well understood
  Scientists and engineers are wedded to the problem

Astronomy is perhaps the ideal application for the exascale