
University of Tsukuba | Center for Computational Sciences

http://www.ccs.tsukuba.ac.jp/

OpenCL-ready High Speed FPGA Networking [1]

• Dev. tool: Intel FPGA SDK for OpenCL
• A Board Support Package (BSP) is a hardware component that supports multiple different boards:
  - which FPGA chip is used on the board
  - what kind of peripherals are supported by the board
• Basically, only minimum interfaces are supported
  - To perform inter-FPGA communication, a network controller must be implemented and integrated into the BSP

Numerical Computation

contact address: [email protected]

Implementation overview

[Figure: Implementation overview of our FPGA board (BittWare A10PL4). Within the BSP on the FPGA, the OpenCL kernel is connected through an interconnect to the PCIe controller and two DDR4 controllers (DDR4 memory), and through additionally implemented Ethernet IP controllers and Ethernet IP cores to the QSFP+ ports; the host PC runs the driver and the host application. Kernels access the QSFP+ ports through the additionally implemented I/O channels with specified access; calibration of the ports is needed.]
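To reach the QSFP+ ports from OpenCL code, the extended BSP exposes the Ethernet controllers as I/O channels. The following is a minimal sketch of how kernels would use them; the interface names eth0_in/eth0_out are hypothetical (the real names come from the board_spec.xml of the modified BSP):

```c
// Minimal sketch of inter-FPGA communication through BSP-level I/O
// channels. The names "eth0_in"/"eth0_out" are hypothetical; the real
// interface names are defined in the extended BSP's board_spec.xml.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel ulong eth_tx __attribute__((io("eth0_out")));  // to a QSFP+ port
channel ulong eth_rx __attribute__((io("eth0_in")));   // from a QSFP+ port

__kernel void sender(__global const ulong *restrict src, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(eth_tx, src[i]);  // stream words onto the wire
}

__kernel void receiver(__global ulong *restrict dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = read_channel_intel(eth_rx);  // stream words off the wire
}
```

Because the channel maps directly to the Ethernet IP in the BSP, a kernel can stream data to a remote FPGA without going through the host CPU or DDR memory.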

Authentic Radiation Transfer (ART) on FPGA [2]

• Accelerated Radiative transfer on grids using Oct-Tree (ARGOT) has been developed in the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT, and it is the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm: the problem space is divided into meshes, and reactions are computed on each mesh (see the sketch below)
• The memory access pattern depends on the ray direction
• Not suitable for SIMD architectures
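The following plain-C fragment is an illustrative sketch (not the actual ART kernel) of why the access pattern defeats SIMD: each ray marches through the mesh along its own direction, so the sequence of cells it touches, and hence the memory stride, differs from ray to ray.

```c
/* Illustrative sketch of ray marching through a uniform mesh.
 * The cells a ray visits depend on its direction, so neighboring
 * rays follow different memory strides. */
typedef struct { float x, y, z; } vec3;

float trace_ray(const float *mesh, int nx, int ny, int nz,
                vec3 pos, vec3 dir, float step) {
    float tau = 0.0f;  /* accumulated reaction (e.g., optical depth) */
    while (pos.x >= 0.0f && pos.x < nx &&
           pos.y >= 0.0f && pos.y < ny &&
           pos.z >= 0.0f && pos.z < nz) {
        int ix = (int)pos.x, iy = (int)pos.y, iz = (int)pos.z;
        tau += mesh[((size_t)iz * ny + iy) * nx + ix] * step;
        pos.x += dir.x * step;   /* advance along the ray direction */
        pos.y += dir.y * step;
        pos.z += dir.z * step;
    }
    return tau;
}
```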

[Figure: Performance in M mesh/s (higher is better) vs. mesh size for CPU(14C), CPU(28C), P100(x1), and FPGA; the values are listed in Table 3.]

Table 2: Resource usage and clock frequency of the implementation.

Size             # of PEs   ALMs (%)       Registers (%)   M20K (%)   MLAB    DSP (%)    Freq. [MHz]
(16, 16, 16)     (2, 2, 2)  132,283 (31%)  267,828 (31%)   739 (27%)  14,310  312 (21%)  193.2
(32, 32, 32)     (2, 2, 2)  169,882 (40%)  344,447 (40%)   796 (29%)  21,100  312 (21%)  173.8
(64, 64, 64)     (2, 2, 2)  169,549 (40%)  344,512 (40%)   796 (29%)  21,250  312 (21%)  167.0
(128, 128, 128)  (2, 2, 2)  169,662 (40%)  344,505 (40%)   796 (29%)  21,250  312 (21%)  170.4

Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/s.

Size           CPU(14C)  CPU(28C)  P100    FPGA
(16,16,16)     112.4     77.2      105.3   1282.8
(32,32,32)     158.9     183.4     490.4   1165.2
(64,64,64)     175.0     227.2     1041.4  1111.0
(128,128,128)  95.4      165.0     1116.1  1133.5

per link) multiple interconnection links (up to 4 channels) on it. Additionally, an HLS programming environment such as OpenCL is provided, and there are several types of research that employ it for FPGA computing. In [3], Kobayashi et al. show the basic feature of utilizing the high-speed interconnect over FPGAs driven by OpenCL kernels. Therefore, although the performance of our implementation is almost the same as that of an NVIDIA P100 GPU, the overall performance with multiple computation nodes whose FPGAs are connected directly can easily overcome the GPU implementation, which requires host CPU control and kernel switching for inter-node communication. Networking overhead on FPGAs is much lower than on GPUs. Improving the current ART method implementation with such an interconnection feature on FPGAs is our next step toward high-performance parallel FPGA computing.

8. CONCLUSION

In this paper, we optimized the ART method used in the ARGOT program, which solves a fundamental calculation of the space radiative transfer phenomenon in the early-stage universe, on an FPGA using the Intel FPGA SDK for OpenCL. We parallelized the algorithm using the SDK's channel extension in an FPGA. We achieved 4.89 times faster performance than the CPU implementation using OpenMP, as well as almost the same performance as the GPU implementation using CUDA.

Although the performance of the FPGA implementation is comparable to that of an NVIDIA P100 GPU, it has room for improvement. The most important optimization is resource optimization: if we can implement a larger number of PEs than the current one, we can improve performance. However, it is difficult for us to reduce the usage of ALMs and registers because we do not describe them directly in OpenCL code. Not only resources but also frequency is important; we suppose an Arria 10 with an OpenCL design can run at 200 MHz or a higher frequency.

We will implement the network functionality in the ART design to parallelize it among multiple FPGAs. We consider networking using FPGAs an important feature for parallel applications using FPGAs. Although GPUs have higher computational performance (FLOPS) and higher memory bandwidth than FPGAs, I/O including networking is a weak point for GPUs because they are connected to NICs through the PCIe bus. In addition to networking, we will try to run our code on the Stratix 10 FPGA, which is the next-generation Intel FPGA. We expect that we can implement more PEs than on the Arria 10 FPGA because it has 2.2 times more ALM blocks and 3.8 times more DSP blocks.

9. REFERENCES

[1] K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.

[2] K. Hill, S. Craciun, A. George, and H. Lam. Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 189–193, July 2015.

[3] R. Kobayashi, Y. Oobata, N. Fujita, Y. Yamaguchi, and T. Boku. OpenCL-ready high speed FPGA network for reconfigurable high performance computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pages 192–201, New York, NY, USA, 2018. ACM.

[4] Y. Luo, X. Wen, K. Yoshii, S. Ogrenci-Memik, G. Memik, H. Finkel, and F. Cappello. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pages 1–4, Sept 2017.

[5] T. Okamoto, K. Yoshikawa, and M. Umemura. ARGOT: accelerated radiative transfer on grids using oct-tree. Monthly Notices of the Royal Astronomical Society, 419(4):2855–2866, 2012.

[6] S. Tanaka, K. Yoshikawa, T. Okamoto, and K. Hasegawa. A new ray-tracing scheme for 3D diffuse radiation transfer on highly parallel architectures. Publications of the Astronomical Society of Japan, 67(4):62, 2015.

[7] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 35:1–35:12, Piscataway, NJ, USA, 2016. IEEE Press.


[Fig. 5: Design outline of ART on FPGA. A Memory Reader and a Memory Writer, each with a Buffer, connect the DDR4 memory to the PE Array (2x2x2) through the channel memory network.]

each other. Each kernel computes the reaction between a mesh and a ray on its own computation space, which is dedicated to that kernel. While computing, a ray traverses multiple compute kernels depending on its location. If a ray goes out of a kernel's space, its data is transferred to a neighbor kernel through a channel.

Figure 5 shows the design outline of our implementation. The "Memory Reader" reads mesh data from DDR4 memory, which is seen as global memory from the OpenCL language. The "Memory Writer" is a counterpart to the reader and updates mesh data with the result of the computation; it performs both read and write memory accesses because it computes the integration of the gas reaction. The "Buffer" is a mesh data buffer to improve memory access performance. The "PE Array" is an array of PEs (Processing Elements); a PE computes the kernel of the ART method, and the array consists of multiple kernels. We show the details of the PE network in the next subsection.

Since our implementation is work in progress, it lacks some features of the CPU implementation. During computation on an FPGA, all mesh data must fit into its internal BRAM (Block Random Access Memory); the FPGA implementation does not support replacing mesh data as the computation progresses. Therefore, the problem size an FPGA can solve is limited by the size of the BRAM. The CPU implementation supports inter-node parallelization using MPI (Message Passing Interface), but the FPGA implementation does not support any networking functionality and uses only one FPGA.
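As a concrete illustration of the "Memory Reader" role described above, the sketch below (with assumed channel and kernel names) streams mesh data from global memory into an on-chip channel consumed by the PE array:

```c
// Sketch of the "Memory Reader" role (names are illustrative):
// stream mesh data from global memory (DDR4) into an on-chip
// channel for the PE array to consume.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float16 mesh_stream __attribute__((depth(64)));

__kernel void memory_reader(__global const float16 *restrict mesh,
                            int n_words) {
    for (int i = 0; i < n_words; i++)
        write_channel_intel(mesh_stream, mesh[i]);  // DDR4 -> on-chip FIFO
}
```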

4.2 Parallelization using Channel in an FPGA

We describe the structure of the "PE Array" shown in Figure 5. A PE Array consists of PEs and BEs (Boundary Elements), as shown in Figure 6, which depicts the PE Array network in the x-y dimensions. We omit the connections for the z dimension to keep the figure simple; the z dimension has a connection similar to that of the x-y dimensions.

[Figure: With channels, a source kernel feeds a destination kernel through an on-chip FIFO channel; without channels, the source kernel must write to off-chip global memory (DDR4) and the destination kernel must read it back.]

• Our implementation uses a channel-based approach
  - One of the extensions to OpenCL for FPGAs by Intel
  - It enables much faster inter-kernel communication
  - No external memory (DDR) access is required
  - Lower resource utilization than DDR access
• The problem space is divided into small blocks, e.g. a (16, 16, 16) mesh → 8 × (8, 8, 8) blocks
• A PE is assigned to each small block (see the sketch below)
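A minimal sketch of the channel extension named in the list above; the kernel and channel names are illustrative, but write_channel_intel/read_channel_intel are the actual Intel FPGA SDK for OpenCL calls:

```c
// Channel extension in a nutshell: the source kernel hands data to the
// destination kernel through an on-chip FIFO, with no round trip to
// external DDR memory.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int fifo __attribute__((depth(16)));  // on-chip FIFO channel

__kernel void source_kernel(__global const int *restrict in, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(fifo, in[i]);   // blocking write, stays on chip
}

__kernel void destination_kernel(__global int *restrict out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = read_channel_intel(fifo);  // blocking read
}
```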

[Figure 6: PE Array network in the x-y dimensions. PEs are surrounded by BEs and connected to their neighbors by 96-bit x2 (read, write) channels, through which ray data flows.]

• PEs are connected to each other by channels
  - PE: Processing Element
  - BE: Boundary Element
• The kernels of PEs and BEs are started automatically by the autorun attribute (see the sketch below)
  - Lower control overhead and resource usage because the number of host-controlled kernels decreases
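A sketch of an autorun PE skeleton under these conventions (channel names and the computation placeholder are illustrative):

```c
// Autorun PE skeleton: the kernel starts as soon as the FPGA is
// configured and loops forever, so the host never enqueues it.
// Autorun kernels take no arguments and talk only through channels.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float ray_in;   // from a neighboring PE/BE (illustrative)
channel float ray_out;  // to the next PE/BE (illustrative)

__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__kernel void pe(void) {
    while (1) {
        float ray = read_channel_intel(ray_in);
        /* ... compute the mesh-ray reaction for this PE's block ... */
        write_channel_intel(ray_out, ray);
    }
}
```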

4.9x faster (vs. CPU); almost equal performance (vs. GPU)

Development of Parallel Sparse Eigensolver Package: z-Pares

Hierarchical Parallel Structure / Numerical Example on the K Computer

The aim of this research project is to develop numerical software for large-scale eigenvalue problems for post-petascale computing environments. An eigensolver based on contour integrals (the SS method) has been proposed by Sakurai and Sugiura [3]. This method has a hierarchical structure and is suitable for massively parallel supercomputers [2]. Moreover, the SS method is applicable to nonlinear eigenvalue problems [1]. The block Krylov method [4] improves the performance of the method. Based on these newly designed algorithms, we have developed the massively parallel software z-Pares, freely available from http://zpares.cs.tsukuba.ac.jp/. A MATLAB version is also available on our webpage. We have also developed the CISS eigensolver in SLEPc.
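For reference, the contour-integral filter at the heart of the SS method can be summarized as follows (a sketch of the standard formulation; the notation matches the T(z_j)X_j = V systems in the figure below):

```latex
% Sketch of the standard SS formulation: filtered moments of the
% resolvent applied to a block of source vectors V are approximated
% by an N-point quadrature on the contour \Gamma.
\[
  S_k \;=\; \frac{1}{2\pi i} \oint_{\Gamma} z^{k}\, T(z)^{-1} V \, dz
  \;\approx\; \sum_{j=1}^{N} w_j\, z_j^{k}\, X_j ,
  \qquad T(z_j)\, X_j = V ,
\]
% where T(z) = zB - A for the generalized linear problem. The dominant
% work is the N independent linear systems T(z_j) X_j = V, one per
% quadrature point -- the top level of the parallel hierarchy below.
```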

Hardware is grouped according to the hierarchical structure of the algorithm.

Application to band calculation with real-space density functional theory (RSDFT) [2].

Band structure of a silicon nanowire of 9,924 atoms (matrix size = 8,719,488, number of cores = 6,144).

*The results are tentative since they were obtained through early access to the K computer.

Reference
[1] J. Asakura, T. Sakurai, H. Tadano, T. Ikegami and K. Kimura, A numerical method for nonlinear eigenvalue problems using contour integrals, JSIAM Letters, 1 (2009) 52-55.
[2] Y. Futamura, T. Sakurai, S. Furuya and J.-I. Iwata, Efficient algorithm for linear systems arising in solutions of eigenproblems and its application to electronic-structure calculations, Proc. 10th International Meeting on High-Performance Computing for Computational Science (VECPAR 2012), 7851 (2013), 226-235.
[3] T. Sakurai and H. Sugiura, A projection method for generalized eigenvalue problems, J. Comput. Appl. Math., 159 (2003) 119-128.
[4] H. Tadano, T. Sakurai and Y. Kuramashi, Block BiCGGR: A new block Krylov subspace method for computing high accuracy solutions, JSIAM Letters, 1 (2009) 44-47.

Acknowledgment
This research was partially supported by the Core Research for Evolutional Science and Technology (CREST) Project of JST and the Grant-in-Aid for Scientific Research of the Ministry of Education, Culture, Sports, Science and Technology, Japan, Grant number: 23105702.

[Figure: Hierarchical parallel structure mapping the algorithm onto the hardware. Top level: contours Γ1, Γ2, Γ3; middle level: quadrature points z1, z2, ..., zN on each contour; bottom level: the linear systems T(z_j)X_j = V solved at each point.]

Automatic Tuning of Parallel 1-D FFT on Cluster of Intel Xeon Phi Processors

Background
The fast Fourier transform (FFT) is widely used in science and engineering. Parallel FFTs on distributed-memory parallel computers require intensive all-to-all communication, which affects their performance. How to overlap the computation and the all-to-all communication is an issue that needs to be addressed for parallel FFTs; the decomposition below shows where the all-to-all comes from.
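```latex
% Textbook decomposition behind distributed 1-D FFTs ("six-step" FFT):
% with N = N_1 N_2, j = j_1 + j_2 N_1, k = k_2 + k_1 N_2, and
% \omega_m = e^{-2\pi i/m},
\[
  X(k_2 + k_1 N_2)
  \;=\; \sum_{j_1=0}^{N_1-1} \omega_{N_1}^{j_1 k_1}\,
        \omega_{N}^{j_1 k_2}
        \sum_{j_2=0}^{N_2-1} x(j_1 + j_2 N_1)\,\omega_{N_2}^{j_2 k_2} .
\]
% Reading right to left: N_1 FFTs of length N_2, a twiddle-factor
% multiplication, a global transpose (the all-to-all), and N_2 FFTs of
% length N_1. The transpose is the step the overlap tries to hide.
```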

Overview
We proposed an automatic tuning (AT) of computation-communication overlap for parallel 1-D FFT. We used a computation-communication overlap method that introduces a communication thread with OpenMP, sketched after this paragraph. An automatic tuning facility for selecting the optimal parameters of the computation-communication overlap, the radices, and the block size was implemented.
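The following C fragment is a hedged sketch of the overlap pattern described above, not FFTE's actual code: thread 0 acts as the communication thread exchanging block b while the other threads compute on block b-1. The buffer slicing is simplified (real code must interleave per-destination slices), and MPI must be initialized with at least MPI_THREAD_FUNNELED.

```c
/* Hedged sketch of computation-communication overlap with a dedicated
 * OpenMP communication thread. The all-to-all is split into ndiv
 * blocks; the pipeline runs ndiv+1 steps so the last block still gets
 * its computation step. */
#include <mpi.h>
#include <omp.h>

void overlapped_alltoall(double *sendbuf, double *recvbuf,
                         int block_count, int ndiv, MPI_Comm comm) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int b = 0; b <= ndiv; b++) {
            if (tid == 0 && b < ndiv) {
                /* communication thread: exchange block b */
                MPI_Alltoall(sendbuf + (size_t)b * block_count, block_count,
                             MPI_DOUBLE,
                             recvbuf + (size_t)b * block_count, block_count,
                             MPI_DOUBLE, comm);
            } else if (tid != 0 && b > 0) {
                /* worker threads: FFT computation on block b-1 */
                /* compute_block(recvbuf, b - 1);  (placeholder) */
            }
            #pragma omp barrier  /* block b complete before it is consumed */
        }
    }
}
```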

[Figure: Performance of parallel 1-D FFTs (Oakforest-PACS, 1024 nodes): GFlops vs. length of transform N (16M to 64G) for FFTE 6.2alpha (no overlap), FFTE 6.2alpha (NDIV=4), FFTE 6.2alpha with AT, and FFTW 3.3.7.]

Performance
To evaluate the parallel 1-D FFT with AT, we compared its performance with those of FFTE 6.2alpha (http://www.ffte.jp/) without overlap, FFTE 6.2alpha with a fixed NDIV=4, and FFTW 3.3.7. The performance results demonstrate that the proposed implementation of a parallel 1-D FFT with automatic tuning is efficient for improving the performance on a cluster of Intel Xeon Phi processors.

Reference
[1] Ryohei Kobayashi, Yuma Oobata, Norihisa Fujita, Yoshiki Yamaguchi, and Taisuke Boku, OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing, HPC Asia 2018, pp. 192-201, January 2018.
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura: Accelerating Space Radiative Transfer on FPGA using OpenCL (accepted), International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018).

Acknowledgment
This research is a part of the project titled "Development of Computing-Communication Unified Supercomputer in Next Generation" under the program "Research and Development for Next-Generation Supercomputing Technology" by MEXT. We thank the Intel University Program for providing us with both hardware and software.



3D Poisson equation solver, 19-point stencil computation: sizes XS, S, and M on 1, 2, and 4 FPGAs (strong scaling).