
University of Tsukuba | Center for Computational Sciences

http://www.ccs.tsukuba.ac.jp/

OpenCL-ready High Speed FPGA Networking [1]

• Dev. tool: Intel FPGA SDK for OpenCL
• A Board Support Package (BSP) is a hardware component that supports multiple different boards:
  - which FPGA chip is used on the board
  - what kind of peripherals are supported by the board
• Basically, only minimum interfaces are supported
  - To perform inter-FPGA communication, a network controller must be implemented and integrated into the BSP

Numerical Computation

contact address: [email protected]

Implementation overview

[Figure: Implementation overview of our FPGA board (BittWare A10PL4). Within the BSP on the FPGA, the OpenCL kernel is connected through an interconnect to the PCIe controller and two DDR4 controllers (DDR4 memory), and through additionally implemented Ethernet IP controllers and Ethernet IP cores to the QSFP+ ports; the host PC runs the driver and the host application. Kernels access the QSFP+ ports through the additionally implemented I/O channels with specified access; calibration of the ports is needed.]
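To reach the QSFP+ ports from OpenCL code, the extended BSP exposes the Ethernet controllers as I/O channels. The following is a minimal sketch of how kernels would use them; the interface names eth0_in/eth0_out are hypothetical (the real names come from the board_spec.xml of the modified BSP):

```c
// Minimal sketch of inter-FPGA communication through BSP-level I/O
// channels. The names "eth0_in"/"eth0_out" are hypothetical; the real
// interface names are defined in the extended BSP's board_spec.xml.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel ulong eth_tx __attribute__((io("eth0_out")));  // to a QSFP+ port
channel ulong eth_rx __attribute__((io("eth0_in")));   // from a QSFP+ port

__kernel void sender(__global const ulong *restrict src, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(eth_tx, src[i]);  // stream words onto the wire
}

__kernel void receiver(__global ulong *restrict dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = read_channel_intel(eth_rx);  // stream words off the wire
}
```

Because the channel maps directly to the Ethernet IP in the BSP, a kernel can stream data to a remote FPGA without going through the host CPU or DDR memory.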

Authentic Radiation Transfer (ART) on FPGA [2]

• Accelerated Radiative transfer on grids using Oct-Tree (ARGOT) has been developed in the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT, and it is the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm: the problem space is divided into meshes, and reactions are computed on each mesh (see the sketch below)
• The memory access pattern depends on the ray direction
• Not suitable for SIMD architectures
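The following plain-C fragment is an illustrative sketch (not the actual ART kernel) of why the access pattern defeats SIMD: each ray marches through the mesh along its own direction, so the sequence of cells it touches, and hence the memory stride, differs from ray to ray.

```c
/* Illustrative sketch of ray marching through a uniform mesh.
 * The cells a ray visits depend on its direction, so neighboring
 * rays follow different memory strides. */
typedef struct { float x, y, z; } vec3;

float trace_ray(const float *mesh, int nx, int ny, int nz,
                vec3 pos, vec3 dir, float step) {
    float tau = 0.0f;  /* accumulated reaction (e.g., optical depth) */
    while (pos.x >= 0.0f && pos.x < nx &&
           pos.y >= 0.0f && pos.y < ny &&
           pos.z >= 0.0f && pos.z < nz) {
        int ix = (int)pos.x, iy = (int)pos.y, iz = (int)pos.z;
        tau += mesh[((size_t)iz * ny + iy) * nx + ix] * step;
        pos.x += dir.x * step;   /* advance along the ray direction */
        pos.y += dir.y * step;
        pos.z += dir.z * step;
    }
    return tau;
}
```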

[Figure: Performance in M mesh/s (higher is better) vs. mesh size for CPU(14C), CPU(28C), P100(x1), and FPGA; the values are listed in Table 3.]

Table 2: Resource usage and clock frequency of the implementation.

Size             # of PEs   ALMs (%)       Registers (%)   M20K (%)   MLAB    DSP (%)    Freq. [MHz]
(16, 16, 16)     (2, 2, 2)  132,283 (31%)  267,828 (31%)   739 (27%)  14,310  312 (21%)  193.2
(32, 32, 32)     (2, 2, 2)  169,882 (40%)  344,447 (40%)   796 (29%)  21,100  312 (21%)  173.8
(64, 64, 64)     (2, 2, 2)  169,549 (40%)  344,512 (40%)   796 (29%)  21,250  312 (21%)  167.0
(128, 128, 128)  (2, 2, 2)  169,662 (40%)  344,505 (40%)   796 (29%)  21,250  312 (21%)  170.4

Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/s.

Size           CPU(14C)  CPU(28C)  P100    FPGA
(16,16,16)     112.4     77.2      105.3   1282.8
(32,32,32)     158.9     183.4     490.4   1165.2
(64,64,64)     175.0     227.2     1041.4  1111.0
(128,128,128)  95.4      165.0     1116.1  1133.5

per link) multiple interconnection links (up to 4 channels) on it. Additionally, an HLS programming environment such as OpenCL is provided, and there are several types of research that employ it for FPGA computing. In [3], Kobayashi et al. show the basic feature of utilizing the high-speed interconnect over FPGAs driven by OpenCL kernels. Therefore, although the performance of our implementation is almost the same as that of an NVIDIA P100 GPU, the overall performance with multiple computation nodes whose FPGAs are connected directly can easily overcome the GPU implementation, which requires host CPU control and kernel switching for inter-node communication. Networking overhead on FPGAs is much lower than on GPUs. Improving the current ART method implementation with such an interconnection feature on FPGAs is our next step toward high-performance parallel FPGA computing.

8. CONCLUSION

In this paper, we optimized the ART method used in the ARGOT program, which solves a fundamental calculation of the space radiative transfer phenomenon in the early-stage universe, on an FPGA using the Intel FPGA SDK for OpenCL. We parallelized the algorithm using the SDK's channel extension in an FPGA. We achieved 4.89 times faster performance than the CPU implementation using OpenMP, as well as almost the same performance as the GPU implementation using CUDA.

Although the performance of the FPGA implementation is comparable to that of an NVIDIA P100 GPU, it has room for improvement. The most important optimization is resource optimization: if we can implement a larger number of PEs than the current one, we can improve performance. However, it is difficult for us to reduce the usage of ALMs and registers because we do not describe them directly in OpenCL code. Not only resources but also frequency is important; we suppose an Arria 10 with an OpenCL design can run at 200 MHz or a higher frequency.

We will implement the network functionality in the ART design to parallelize it among multiple FPGAs. We consider networking using FPGAs an important feature for parallel applications using FPGAs. Although GPUs have higher computational performance (FLOPS) and higher memory bandwidth than FPGAs, I/O including networking is a weak point for GPUs because they are connected to NICs through the PCIe bus. In addition to networking, we will try to run our code on the Stratix 10 FPGA, which is the next-generation Intel FPGA. We expect that we can implement more PEs than on the Arria 10 FPGA because it has 2.2 times more ALM blocks and 3.8 times more DSP blocks.

9. REFERENCES

[1] K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.

[2] K. Hill, S. Craciun, A. George, and H. Lam. Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 189–193, July 2015.

[3] R. Kobayashi, Y. Oobata, N. Fujita, Y. Yamaguchi, and T. Boku. OpenCL-ready high speed FPGA network for reconfigurable high performance computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pages 192–201, New York, NY, USA, 2018. ACM.

[4] Y. Luo, X. Wen, K. Yoshii, S. Ogrenci-Memik, G. Memik, H. Finkel, and F. Cappello. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pages 1–4, Sept 2017.

[5] T. Okamoto, K. Yoshikawa, and M. Umemura. ARGOT: accelerated radiative transfer on grids using oct-tree. Monthly Notices of the Royal Astronomical Society, 419(4):2855–2866, 2012.

[6] S. Tanaka, K. Yoshikawa, T. Okamoto, and K. Hasegawa. A new ray-tracing scheme for 3D diffuse radiation transfer on highly parallel architectures. Publications of the Astronomical Society of Japan, 67(4):62, 2015.

[7] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 35:1–35:12, Piscataway, NJ, USA, 2016. IEEE Press.


[Fig. 5: Design outline of ART on FPGA. A Memory Reader and a Memory Writer, each with a Buffer, connect the DDR4 memory to the PE Array (2x2x2) through the channel memory network.]

each other. Each kernel computes the reaction between a mesh and a ray on its own computation space, which is dedicated to that kernel. While computing, a ray traverses multiple compute kernels depending on its location. If a ray goes out of a kernel's space, its data is transferred to a neighbor kernel through a channel.

Figure 5 shows the design outline of our implementation. The "Memory Reader" reads mesh data from DDR4 memory, which is seen as global memory from the OpenCL language. The "Memory Writer" is a counterpart to the reader and updates mesh data with the result of the computation; it performs both read and write memory accesses because it computes the integration of the gas reaction. The "Buffer" is a mesh data buffer to improve memory access performance. The "PE Array" is an array of PEs (Processing Elements); a PE computes the kernel of the ART method, and the array consists of multiple kernels. We show the details of the PE network in the next subsection.

Since our implementation is work in progress, it lacks some features of the CPU implementation. During computation on an FPGA, all mesh data must fit into its internal BRAM (Block Random Access Memory); the FPGA implementation does not support replacing mesh data as the computation progresses. Therefore, the problem size an FPGA can solve is limited by the size of the BRAM. The CPU implementation supports inter-node parallelization using MPI (Message Passing Interface), but the FPGA implementation does not support any networking functionality and uses only one FPGA.
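As a concrete illustration of the "Memory Reader" role described above, the sketch below (with assumed channel and kernel names) streams mesh data from global memory into an on-chip channel consumed by the PE array:

```c
// Sketch of the "Memory Reader" role (names are illustrative):
// stream mesh data from global memory (DDR4) into an on-chip
// channel for the PE array to consume.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float16 mesh_stream __attribute__((depth(64)));

__kernel void memory_reader(__global const float16 *restrict mesh,
                            int n_words) {
    for (int i = 0; i < n_words; i++)
        write_channel_intel(mesh_stream, mesh[i]);  // DDR4 -> on-chip FIFO
}
```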

4.2 Parallelization using Channel in an FPGA

We describe the structure of the "PE Array" shown in Figure 5. A PE Array consists of PEs and BEs (Boundary Elements), as shown in Figure 6, which depicts the PE Array network in the x-y dimensions. We omit the connections for the z dimension to keep the figure simple; the z dimension has a connection similar to that of the x-y dimensions.

[Figure: With channels, a source kernel feeds a destination kernel through an on-chip FIFO channel; without channels, the source kernel must write to off-chip global memory (DDR4) and the destination kernel must read it back.]

• Our implementation uses a channel-based approach
  - One of the extensions to OpenCL for FPGAs by Intel
  - It enables much faster inter-kernel communication
  - No external memory (DDR) access is required
  - Lower resource utilization than DDR access
• The problem space is divided into small blocks, e.g. a (16, 16, 16) mesh → 8 × (8, 8, 8) blocks
• A PE is assigned to each small block (see the sketch below)
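A minimal sketch of the channel extension named in the list above; the kernel and channel names are illustrative, but write_channel_intel/read_channel_intel are the actual Intel FPGA SDK for OpenCL calls:

```c
// Channel extension in a nutshell: the source kernel hands data to the
// destination kernel through an on-chip FIFO, with no round trip to
// external DDR memory.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int fifo __attribute__((depth(16)));  // on-chip FIFO channel

__kernel void source_kernel(__global const int *restrict in, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(fifo, in[i]);   // blocking write, stays on chip
}

__kernel void destination_kernel(__global int *restrict out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = read_channel_intel(fifo);  // blocking read
}
```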

[Figure 6: PE Array network in the x-y dimensions. PEs are surrounded by BEs and connected to their neighbors by 96-bit x2 (read, write) channels, through which ray data flows.]

• PEs are connected to each other by channels
  - PE: Processing Element
  - BE: Boundary Element
• The kernels of PEs and BEs are started automatically by the autorun attribute (see the sketch below)
  - Lower control overhead and resource usage because the number of host-controlled kernels decreases
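A sketch of an autorun PE skeleton under these conventions (channel names and the computation placeholder are illustrative):

```c
// Autorun PE skeleton: the kernel starts as soon as the FPGA is
// configured and loops forever, so the host never enqueues it.
// Autorun kernels take no arguments and talk only through channels.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float ray_in;   // from a neighboring PE/BE (illustrative)
channel float ray_out;  // to the next PE/BE (illustrative)

__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__kernel void pe(void) {
    while (1) {
        float ray = read_channel_intel(ray_in);
        /* ... compute the mesh-ray reaction for this PE's block ... */
        write_channel_intel(ray_out, ray);
    }
}
```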

4.9x faster (vs. CPU); almost equal performance (vs. GPU)

Development of Parallel Sparse Eigensolver Package: z-Pares

Hierarchical Parallel Structure / Numerical Example on the K Computer

The aim of this research project is to develop numerical software for large-scale eigenvalue problems for post-petascale computing environments. An eigensolver based on contour integrals (the SS method) has been proposed by Sakurai and Sugiura [3]. This method has a hierarchical structure and is suitable for massively parallel supercomputers [2]. Moreover, the SS method is applicable to nonlinear eigenvalue problems [1]. The block Krylov method [4] improves the performance of the method. Based on these newly designed algorithms, we have developed the massively parallel software z-Pares, freely available from http://zpares.cs.tsukuba.ac.jp/. A MATLAB version is also available on our webpage. We have also developed the CISS eigensolver in SLEPc.
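For reference, the contour-integral filter at the heart of the SS method can be summarized as follows (a sketch of the standard formulation; the notation matches the T(z_j)X_j = V systems in the figure below):

```latex
% Sketch of the standard SS formulation: filtered moments of the
% resolvent applied to a block of source vectors V are approximated
% by an N-point quadrature on the contour \Gamma.
\[
  S_k \;=\; \frac{1}{2\pi i} \oint_{\Gamma} z^{k}\, T(z)^{-1} V \, dz
  \;\approx\; \sum_{j=1}^{N} w_j\, z_j^{k}\, X_j ,
  \qquad T(z_j)\, X_j = V ,
\]
% where T(z) = zB - A for the generalized linear problem. The dominant
% work is the N independent linear systems T(z_j) X_j = V, one per
% quadrature point -- the top level of the parallel hierarchy below.
```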

Hardware is grouped according to the hierarchical structure of the algorithm.

Application to band calculation with real-space density functional theory (RSDFT) [2].

Band structure of a silicon nanowire of 9,924 atoms (matrix size = 8,719,488, number of cores = 6,144).

*The results are tentative since they were obtained through early access to the K computer.

Reference
[1] J. Asakura, T. Sakurai, H. Tadano, T. Ikegami and K. Kimura, A numerical method for nonlinear eigenvalue problems using contour integrals, JSIAM Letters, 1 (2009) 52-55.
[2] Y. Futamura, T. Sakurai, S. Furuya and J.-I. Iwata, Efficient algorithm for linear systems arising in solutions of eigenproblems and its application to electronic-structure calculations, Proc. 10th International Meeting on High-Performance Computing for Computational Science (VECPAR 2012), 7851 (2013), 226-235.
[3] T. Sakurai and H. Sugiura, A projection method for generalized eigenvalue problems, J. Comput. Appl. Math., 159 (2003) 119-128.
[4] H. Tadano, T. Sakurai and Y. Kuramashi, Block BiCGGR: A new block Krylov subspace method for computing high accuracy solutions, JSIAM Letters, 1 (2009) 44-47.

Acknowledgment
This research was partially supported by the Core Research for Evolutional Science and Technology (CREST) Project of JST and the Grant-in-Aid for Scientific Research of the Ministry of Education, Culture, Sports, Science and Technology, Japan, Grant number: 23105702.

[Figure: Hierarchical parallel structure mapping the algorithm onto the hardware. Top level: contours Γ1, Γ2, Γ3; middle level: quadrature points z1, z2, ..., zN on each contour; bottom level: the linear systems T(z_j)X_j = V solved at each point.]

Automatic Tuning of Parallel 1-D FFT on Cluster of Intel Xeon Phi Processors

Background
The fast Fourier transform (FFT) is widely used in science and engineering. Parallel FFTs on distributed-memory parallel computers require intensive all-to-all communication, which affects their performance. How to overlap the computation and the all-to-all communication is an issue that needs to be addressed for parallel FFTs; the decomposition below shows where the all-to-all comes from.
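```latex
% Textbook decomposition behind distributed 1-D FFTs ("six-step" FFT):
% with N = N_1 N_2, j = j_1 + j_2 N_1, k = k_2 + k_1 N_2, and
% \omega_m = e^{-2\pi i/m},
\[
  X(k_2 + k_1 N_2)
  \;=\; \sum_{j_1=0}^{N_1-1} \omega_{N_1}^{j_1 k_1}\,
        \omega_{N}^{j_1 k_2}
        \sum_{j_2=0}^{N_2-1} x(j_1 + j_2 N_1)\,\omega_{N_2}^{j_2 k_2} .
\]
% Reading right to left: N_1 FFTs of length N_2, a twiddle-factor
% multiplication, a global transpose (the all-to-all), and N_2 FFTs of
% length N_1. The transpose is the step the overlap tries to hide.
```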

Overview
We proposed an automatic tuning (AT) of computation-communication overlap for parallel 1-D FFT. We used a computation-communication overlap method that introduces a communication thread with OpenMP, sketched after this paragraph. An automatic tuning facility for selecting the optimal parameters of the computation-communication overlap, the radices, and the block size was implemented.
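The following C fragment is a hedged sketch of the overlap pattern described above, not FFTE's actual code: thread 0 acts as the communication thread exchanging block b while the other threads compute on block b-1. The buffer slicing is simplified (real code must interleave per-destination slices), and MPI must be initialized with at least MPI_THREAD_FUNNELED.

```c
/* Hedged sketch of computation-communication overlap with a dedicated
 * OpenMP communication thread. The all-to-all is split into ndiv
 * blocks; the pipeline runs ndiv+1 steps so the last block still gets
 * its computation step. */
#include <mpi.h>
#include <omp.h>

void overlapped_alltoall(double *sendbuf, double *recvbuf,
                         int block_count, int ndiv, MPI_Comm comm) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int b = 0; b <= ndiv; b++) {
            if (tid == 0 && b < ndiv) {
                /* communication thread: exchange block b */
                MPI_Alltoall(sendbuf + (size_t)b * block_count, block_count,
                             MPI_DOUBLE,
                             recvbuf + (size_t)b * block_count, block_count,
                             MPI_DOUBLE, comm);
            } else if (tid != 0 && b > 0) {
                /* worker threads: FFT computation on block b-1 */
                /* compute_block(recvbuf, b - 1);  (placeholder) */
            }
            #pragma omp barrier  /* block b complete before it is consumed */
        }
    }
}
```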

[Figure: Performance of parallel 1-D FFTs (Oakforest-PACS, 1024 nodes): GFlops vs. length of transform N (16M to 64G) for FFTE 6.2alpha (no overlap), FFTE 6.2alpha (NDIV=4), FFTE 6.2alpha with AT, and FFTW 3.3.7.]

Performance
To evaluate the parallel 1-D FFT with AT, we compared its performance with those of FFTE 6.2alpha (http://www.ffte.jp/) without overlap, FFTE 6.2alpha with a fixed NDIV=4, and FFTW 3.3.7. The performance results demonstrate that the proposed implementation of a parallel 1-D FFT with automatic tuning is efficient for improving the performance on a cluster of Intel Xeon Phi processors.

Reference
[1] Ryohei Kobayashi, Yuma Oobata, Norihisa Fujita, Yoshiki Yamaguchi, and Taisuke Boku, OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing, HPC Asia 2018, pp. 192-201, January 2018.
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura: Accelerating Space Radiative Transfer on FPGA using OpenCL (accepted), International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018).

Acknowledgment
This research is a part of the project titled "Development of Computing-Communication Unified Supercomputer in Next Generation" under the program "Research and Development for Next-Generation Supercomputing Technology" by MEXT. We thank the Intel University Program for providing us with both hardware and software.



3D Poisson equation solver, 19-point stencil computation: sizes XS, S, and M on 1, 2, and 4 FPGAs (strong scaling).