

FPGA-based low-energy cluster for acceleration of the document similarity analysis

Michał Karwatowski, Maciej Wielgosz, Paweł Russek, Sebastian Koryciak, Rafał Frączek, Marcin Pietroń, Ernest Jamro, Agnieszka Dąbrowska-Boruch, Kazimierz Wiatr

Introduction

Zynq SoC & Xillybus

The Xilinx Zynq-7000 All Programmable SoC family combines programmable logic fabric and a dual-core ARM Cortex-A9 processor in one chip. The two are interconnected through the industry-standard AXI interface, which provides low-latency, high-bandwidth data transfers. The ZedBoard Development Kit, used as the platform for this project, is based on the Zynq-7020 chip.

The Custom IP Core Factory online tool provides a configurable set of queues, visible as FIFOs to the FPGA and as files to the OS. Xillybus can be implemented over the AXI interface, allowing a maximum transfer speed of 350 MB/s in each direction.
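Because each queue appears to the OS as a file, host code can use ordinary file I/O to reach the FPGA. The following is a minimal sketch, not the authors' driver: the function names are hypothetical, and the device path passed in (for example a hypothetical /dev/xillybus_data_in) depends on how the queues were named in the IP Core Factory.

```python
import os


def write_all(fd: int, payload: bytes) -> int:
    """Write the whole payload to fd, looping because os.write may be partial."""
    view = memoryview(payload)
    sent = 0
    while sent < len(view):
        sent += os.write(fd, view[sent:])
    return sent


def send_to_queue(path: str, payload: bytes) -> int:
    """Open a queue's file (hypothetical path) and stream the payload into it."""
    fd = os.open(path, os.O_WRONLY)
    try:
        return write_all(fd, payload)
    finally:
        os.close(fd)
```

The loop around os.write matters for stream-like files, where a single call is allowed to accept fewer bytes than requested.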

System architecture

A hardware accelerator was designed to enhance the processing capabilities of the chosen platform. The processing system uses Xillybus-based queues to stream data to the FPGA logic fabric. Each queue of the driver application is handled by a separate thread; because the Linux operating system is used, the load is spread evenly across the processor cores.

The application is written so that no synchronization between data-sending threads is needed. Each queue is configured to best fit its purpose: control queues are set as ‘command and status’ with 1 MB/s bandwidth each, and data queues are set as ‘data acquisition’ with 350 MB/s bandwidth each. The sum of all throughputs exceeds the 350 MB/s limit in one direction; collision-free data transfer is ensured by the Xillybus tool, which automatically reduces speeds to the hardware limit.
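The one-thread-per-queue arrangement can be sketched as follows. This is an illustration under assumptions rather than the authors' application: the function names are hypothetical, and any object with a write method stands in for an opened queue file. Each thread owns its sink and its workload, so no locks are needed.

```python
import threading


def stream_chunks(sink, chunks):
    """Write every chunk to this thread's own sink; no state is shared."""
    for chunk in chunks:
        sink.write(chunk)


def run_senders(sinks, workloads):
    """Start one sender thread per (sink, workload) pair and wait for all."""
    threads = [
        threading.Thread(target=stream_chunks, args=(sink, chunks))
        for sink, chunks in zip(sinks, workloads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Keeping the data for each queue disjoint is what removes the need for synchronization; the threads only meet again at join.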

To enable maximum processing performance, data is pre-sorted and concatenated into 32-bit blocks. In the programmable logic, data received from the user FIFO is stored in internal distributed memory, from where it can be read multiple times without being sent again from the processing system; this approach saves costly bandwidth. The compare core calculates the Hamming measure for each set of data from both streams and sends the results to the processing system.
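A software reference model of this step might look like the sketch below. It assumes that the Hamming measure is the number of differing bits between two streams, and that blocks are little-endian and zero-padded; the poster states neither detail, and the function names are hypothetical.

```python
import struct
from typing import List


def pack_words(data: bytes) -> List[int]:
    """Zero-pad data to a multiple of 4 bytes and split it into 32-bit words.

    Little-endian byte order is an assumption made for this sketch.
    """
    padded = data + b"\x00" * (-len(data) % 4)
    return [word for (word,) in struct.iter_unpack("<I", padded)]


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Count differing bits between two equal-length lists of 32-bit words."""
    if len(a) != len(b):
        raise ValueError("streams must have the same number of words")
    return sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a, b))
```

XOR followed by a population count is a common way to compute this measure and maps naturally to logic, which is one plausible reason for choosing it in an FPGA design.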

All hardware modules are supervised by a controller module, which communicates directly with the software control threads. Its functionality is limited to initializing the modules; during processing no actions are required except for debugging purposes.

Results

[Figure: Energy efficiency. Speed per watt [(MB/s)/W], axis range 0 to 80, versus single data packet size [KB], axis range 0 to 256, for three platforms: ARM + FPGA, Intel i5-2430M, ARM.]

For comparison purposes, tests were performed on the Zynq-7020 with and without the hardware accelerator, and on an Intel Core i5-2430M. The figure presents the maximum transfer speed achieved divided by the power consumed during operation, as a function of single data packet size, while processing 128 MB of artificially created vectors. The FPGA-accelerated solution processed 53% more data per watt than the typical notebook processor.
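The metric plotted in the figure, and the relative gain quoted from it, can be written down as a short sketch. The function names are hypothetical and no measured values from the poster are reproduced here.

```python
def speed_per_watt(throughput_mb_s: float, power_w: float) -> float:
    """Return throughput divided by power, in (MB/s)/W."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return throughput_mb_s / power_w


def relative_gain_percent(candidate: float, baseline: float) -> float:
    """Return how much higher candidate is than baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0
```

Read this way, a gain of 53% means the accelerated platform delivers about 1.53 times the baseline's speed per watt.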

Acknowledgements

The presented work was financed through the research program PLGrid Plus, POIG.02.03.00-00-096/10.

Cluster

The presented solution scales well, allowing the construction of a cluster. As a first attempt, a tower cluster was built out of four ZedBoard Development Kits connected via Ethernet.

ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Krakow

With the rapid growth of the Internet and other electronic media, the availability of online text information has increased significantly. As a result, document similarity analysis has become very important: text processing is now one of the key techniques for handling and organizing data in many applications in industry, entertainment and digital libraries, all of which require access to and execution of text-based queries.

Searching a large database requires repeated execution of the same operations on different data, so the operations can easily be parallelized to reduce the processing time. The system's tasks are carried out by cluster nodes. The heart of each node is an FPGA-accelerated ARM processor. The FPGA module described in this work focuses on accelerating the computation of the similarity of two vectors in the vector space model (VSM). The presented system shows that better scalability and lower energy consumption can be achieved using an FPGA-based accelerator.
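Since the same comparison is repeated over independent data, the database can be divided among nodes. The poster does not say how work is distributed, so the following is only an illustrative sketch with a hypothetical function name: it splits a list of document vectors into contiguous, near-equal chunks, one per node.

```python
def split_for_nodes(items, node_count):
    """Split items into node_count contiguous chunks differing in size by at most one."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    base, extra = divmod(len(items), node_count)
    chunks, start = [], 0
    for node in range(node_count):
        # The first `extra` nodes take one additional item each.
        size = base + (1 if node < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

With ten items and four nodes, the chunks hold 3, 3, 2 and 2 items, so no node receives more than one item above any other.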

AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Krakow

Xillybus is an easy-to-use and efficient solution for data transfer between the processor and the FPGA. The Xillinux operating system runs on the processor, allowing the use of all typical Linux functionality.

*reference: Xillybus, IP core product brief, v1.7, March 27, 2014

*reference: L. H. Crockett, R. A. Elliot, M. A. Enderwitz and R. W. Stewart, The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC, First Edition, Strathclyde Academic Media, 2014.

Category     Precision   Recall   F-measure
Business     0.96        0.94     0.95
Culture      0.95        0.98     0.96
Automotive   0.91        0.95     0.93
Science      0.87        0.67     0.75
Sport        0.98        1.0      0.99

The table presents the experimental results of a TF-IDF-based classifier for 4200 texts in five categories. The texts were merged into a single folder, and the system was used to split them back into their original categories.
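The F-measure column is consistent with the harmonic mean of precision and recall (the balanced F1 score). The poster does not state the formula, so this sketch rests on that assumption, and the function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the Business row, f_measure(0.96, 0.94) rounds to 0.95, matching the table. Recomputing the Science row from the rounded inputs gives about 0.757 rather than the listed 0.75, presumably because the published precision and recall were themselves rounded.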