![Page 1: Powering Real-time Radio Astronomy Signal Processing with …on-demand.gputechconf.com/gtc/2018/presentation/s8339... · 2018. 4. 26. · Ajith Kumar B., Back-end group co-ordinator,](https://reader033.vdocument.in/reader033/viewer/2022051902/5ff26e3873947c28a5144279/html5/thumbnails/1.jpg)
Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures
Harshavardhan Reddy Suda
NCRA, India
Vinay Deshpande
NVIDIA, India
Bharat Kumar
NVIDIA, India
What signals are we processing?
GMRT
▪ The Giant Meter-wave Radio Telescope (GMRT) is a world-class instrument for studying astrophysical phenomena at low radio frequencies
▪ Located 80 km north of Pune, 160 km east of Mumbai
▪ Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths
▪ The signals being processed are the digitized baseband signals from the 30 dual-polarized antennas of GMRT
GMRT
▪ Supports two modes of operation :
- Interferometry (correlator)
- Array mode (beamformer)
▪ Frequency bands :
- 130 to 260 MHz
- 250 to 500 MHz
- 550 to 900 MHz
- 1050 to 1600 MHz
▪ Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)
▪ Effective collecting area (2-3% of SKA) :
- 30,000 sq m at lower frequencies
- 20,000 sq m at higher frequencies
The Giant Meter-wave Radio Telescope
A Google eye view
GMRT receiver chain
Signal processing in digital back-end
Image courtesy : Ajith Kumar, NCRA
Computation requirements
Antenna signals (M = 64) → Sampler → Fourier Transform → Phase Correction → MAC
▪ Maximum bandwidth : 400 MHz, 16k spectral channels
▪ Fourier Transform, O(N log N) : 3 TFlops
▪ Phase Correction : 0.1 TFlops
▪ MAC, M(M+1)/2 products : 6.6 TFlops
▪ Total : ~ 10 TFlops
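The MAC figure can be roughly reproduced from the numbers on this slide. The sketch below is only an illustration, not the GMRT code: it assumes 8 real operations per complex multiply-accumulate and one complex sample per signal per Hz of bandwidth per second, and the names `baseline_count`, `mac_accumulate` and `mac_tflops` are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Number of signal-pair products, including each signal with itself.
constexpr std::size_t baseline_count(std::size_t m) {
    return m * (m + 1) / 2;
}

// Accumulate one spectral channel of one FFT frame into the running sums.
// `spectra` holds one complex sample per antenna signal; `vis` holds
// baseline_count(M) sums in row-major upper-triangular order.
void mac_accumulate(const std::vector<std::complex<float>>& spectra,
                    std::vector<std::complex<float>>& vis) {
    const std::size_t m = spectra.size();
    assert(vis.size() == baseline_count(m));
    std::size_t k = 0;
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = i; j < m; ++j) {
            vis[k++] += spectra[i] * std::conj(spectra[j]);
        }
    }
}

// Rough MAC load in TFlops. Assumption: 8 real operations per complex
// multiply-accumulate, and bandwidth_hz complex samples per signal per second.
constexpr double mac_tflops(std::size_t m, double bandwidth_hz) {
    return static_cast<double>(baseline_count(m)) * bandwidth_hz * 8.0 / 1e12;
}
```

Under these assumptions, `baseline_count(64)` is 2080 and `mac_tflops(64, 400e6)` comes to about 6.66, which is consistent with the 6.6 TFlops quoted for the MAC stage.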
Design : Time slicing model
A 4-node example
Ant 1, Ant 2, ..., Ant 16 : digitized baseband signals of the antennas
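The diagram itself is not reproduced in this transcript, so the following is a minimal sketch of how a time slicing scheme of this kind could be indexed, not a description of the actual GMRT implementation. It assumes that antennas are split evenly across the capture nodes and that time slices are handed out round-robin, so each node processes all antennas for its own slices. Both rules and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed round-robin rule: time slice t is processed by node t mod N.
std::size_t node_for_slice(std::size_t slice, std::size_t n_nodes) {
    assert(n_nodes > 0);
    return slice % n_nodes;
}

// Assumed capture layout: antennas are split evenly across nodes.
// Returns 1-based antenna numbers (Ant 1, Ant 2, ...) captured by `node`.
std::vector<std::size_t> antennas_captured_by(std::size_t node,
                                              std::size_t n_nodes,
                                              std::size_t n_antennas) {
    assert(n_nodes > 0 && node < n_nodes);
    assert(n_antennas % n_nodes == 0);
    const std::size_t per_node = n_antennas / n_nodes;
    std::vector<std::size_t> ants;
    for (std::size_t a = 0; a < per_node; ++a) {
        ants.push_back(node * per_node + a + 1);
    }
    return ants;
}
```

For the 4-node, 16-antenna example, node 0 would capture Ant 1 to Ant 4 under this layout, and time slice 5 would be processed by node 1.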
Implementation
▪ 16 Dell T630 machines as Compute Nodes
▪ 16 ROACH (FPGA) boards with Atmel/e2v based ADCs developed by CASPER group, Berkeley for digitization and packetization
▪ 32 Tesla K40c GPU cards for processing
▪ 36-port Mellanox InfiniBand switch for data sharing between Compute Nodes and Host Nodes
▪ Software : C/C++ and CUDA C programming with OpenMPI and OpenMP directives
▪ Developed in collaboration with Swinburne University, Australia
Implementation
Image courtesy : Irappa Halagalli, NCRA
Sample result
Image of Coma cluster
Legacy GMRT, 325 MHz : 350 μJy
Upgraded GMRT, 300 – 500 MHz : 28 μJy
Significantly lower noise RMS and better image quality with upgraded GMRT
Dharam Vir Lal and Ishwar Chandra, NCRA
Computation Performance : K40
| Channels | FFT (Gflops) | MAC (Gflops) |
|---|---|---|
| 2048 | 620 | 626 |
| 4096 | 626 | 620 |
| 8192 | 512 | 574 |
| 16384 | 498 | 537 |
No. of antennas : 32 (dual pol)
CUDA 7.5
Motivation for next generation GPUs
▪ Adding more compute intensive applications
- Multi-beamforming
- Processing on each beam (beam steering)
- Gated correlator
- FIR filtering with many taps for narrow-band mode implementation
▪ Working GMRT system and code provides an excellent testing ground
for the features of next generation GPUs
▪ Performance measured and compared on GP100 and V100
Computation performance – K40 vs GP100
CUDA 7.5, ECC off
Performance follows cuFFT benchmarks for K40 and P100
Reference for K40 benchmark : CUDA 6.5 performance report, September 2014
Reference for P100 benchmark : CUDA 8 PERFORMANCE OVERVIEW, November 2016
Computation performance : K40 vs GP100
CUDA 7.5, ECC off
No. of antennas : 32 (dual pol)
Computation performance : K40 vs GP100
CUDA 7.5, ECC off
Peak Global Memory Bandwidth :
K40 – 288 GB / sec
GP100 – 732 GB / sec
Peak Performance :
K40 – 4.3 TFlops
GP100 – 9.3 TFlops
Computation performance as % of Real-time
Bandwidth : 200 MHz
No. of antennas : 32 (dual pol)
Spectral Channels : 16384
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
No. of antennas : 32 (dual pol)
Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
Peak Global Memory Bandwidth :
GP100 – 732 GB / sec
V100 – 900 GB / sec
Peak Performance :
GP100 – 9.3 TFlops
V100 – 14 TFlops
Reasons behind relatively low performance of MAC
▪ Non-contiguous Global Memory access at block level
MAC input data format
▪ Low Arithmetic Intensity
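Why low arithmetic intensity limits a kernel can be illustrated with a simple roofline estimate built from the peak figures quoted in the comparison slides. This is a sketch of the standard roofline model rather than anything taken from the GMRT code. The function names are hypothetical, and the intensity of the real MAC kernel is not given in the slides.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity (flops per byte) at which a GPU moves from being
// memory-bound to compute-bound.
double ridge_point(double peak_gflops, double peak_gb_per_s) {
    assert(peak_gb_per_s > 0.0);
    return peak_gflops / peak_gb_per_s;
}

// Roofline bound: a kernel cannot exceed the compute peak, nor the memory
// bandwidth multiplied by its flops-per-byte intensity.
double attainable_gflops(double intensity, double peak_gflops,
                         double peak_gb_per_s) {
    assert(intensity >= 0.0);
    return std::min(peak_gflops, intensity * peak_gb_per_s);
}
```

With the quoted peaks, `ridge_point(4300, 288)` gives about 14.9 flops per byte for the K40, `ridge_point(9300, 732)` about 12.7 for the GP100, and `ridge_point(14000, 900)` about 15.6 for the V100. A kernel whose intensity is well below these values is bounded by memory bandwidth rather than by peak compute in this model. For example, a hypothetical intensity of 1 flop per byte would cap a K40 at 288 Gflops.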
GPU kernel improvements
▪ MAC :
- Simplified index arithmetic
- Improved the L2 hit ratio : from less than 5% to nearly 86%
- Vectorized loads (float4) – increased ILP
- Exposed more parallelism by increasing the occupancy
- Single precision to half precision floating point – no performance gain
▪ FFT :
- Single precision to half precision floating point
MAC : Performance gain with optimizations on V100
No. of antennas : 32 (dual pol)
V100 on CUDA 9.1 (using PSG cluster)
FFT : Performance gain with half precision on V100
V100 on CUDA 9.1 (using PSG cluster)
FFT : Error analysis with half precision in power spectrum
Spectral Channels : 2048
Batch size : 128
FFT : Error analysis with half precision in phase spectrum
Spectral Channels : 2048
Batch size : 128
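The error plots are not reproduced in this transcript, and the slides do not say how the error was computed. As a rough guide to the rounding error half precision introduces per value, the sketch below rounds a single-precision number to 11 significant bits, the precision of IEEE binary16. It deliberately ignores the limited exponent range of half precision (overflow and subnormals) and assumes the default round-to-nearest mode. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Round x to 11 significant bits (binary16 precision), using the current
// rounding mode. Exponent range limits of half precision are ignored.
float round_to_half_precision(float x) {
    if (x == 0.0f || !std::isfinite(x)) {
        return x;
    }
    int exponent = 0;
    const float mantissa = std::frexp(x, &exponent);  // |mantissa| in [0.5, 1)
    const float scaled = std::ldexp(mantissa, 11);    // |scaled| in [1024, 2048)
    return std::ldexp(std::nearbyint(scaled), exponent - 11);
}

// Relative rounding error for a single non-zero value.
float relative_error(float x) {
    assert(x != 0.0f);
    return std::fabs(round_to_half_precision(x) - x) / std::fabs(x);
}
```

Under these assumptions the relative error of any single value is at most 2^-11, about 0.05%. How such per-value errors build up through an FFT depends on the transform length and input scaling, which this sketch does not model.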
Going forward
▪ Improving MAC using Tensor cores – potential 2x improvement
▪ Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code
▪ Optimized FIR filtering routines in CUDA for narrow-band mode implementation
▪ Implementing multi-beamforming, beam steering and gated correlator
Acknowledgements
▪ Prof. Yashwant Gupta, Centre Director, NCRA
▪ Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA
▪ Sanjay Kudale, GMRT, NCRA
▪ Shelton Gnanaraj, GMRT, NCRA
▪ Andrew Jameson, Swinburne University, Australia
▪ Benjamin Barsdell, Swinburne University, Australia (now at NVIDIA)
▪ CASPER Group, Berkeley
▪ Digital Back-end Group, GMRT, NCRA
▪ Computer Group, GMRT, NCRA
▪ Control Room, GMRT
Thank You