[ieee 2012 fourth international conference on computational intelligence, modelling and simulation...

DAPDNA-2 Performance Enhancement using ALU Partitioning and Saturation Arithmetic for Digital Image/Signal Processing Applications

Muhammad Ghazanfar
Department of Electrical Engineering
Usman Institute of Technology, Hamdard University, Karachi, Pakistan
[email protected]

Bhawani Shankar Chowdhry
Department of Electronic Engineering
Mehran University of Engg. & Technology, Jamshoro, Pakistan
[email protected]

Shiraz Latif
Department of Electrical Engineering
Usman Institute of Technology, Hamdard University, Karachi, Pakistan
[email protected]

Asif Siddiqui
King Fahd University of Petroleum & Minerals, Saudi Arabia
[email protected]

GhulamAli Mallah
Department of Computer Science
Shah Abdul Latif University, Khairpur, Pakistan
[email protected]

Abstract— A reconfigurable processor is a microprocessor with erasable hardware that can rewire itself dynamically. This allows the chip to adapt effectively to the programming tasks demanded by the particular software it is interfacing with at any given time. We propose to modify the ALU architecture of the DAPDNA-2 reconfigurable processor in order to enhance its computational capabilities for Digital Image Processing (DIP) and Digital Signal Processing (DSP) applications. Using the technique of ALU partitioning, the existing 32-bit ALU can operate as 8-bit, 16-bit or 32-bit units, since typical DIP/DSP applications generally use 8- or 16-bit data types. In addition, saturation helps prevent overflows: in graphics applications, any further increment of a register value beyond the maximum is supposed to be discarded, e.g. EA2Ch + D12Ah = FFFFh.

Keywords- DAPDNA, Image Processing, Signal Processing, Microprocessor.

I. INTRODUCTION

The DAPDNA-2 processor was developed by IPFlex using dynamically reconfigurable technology to achieve high performance, greater flexibility and computation power by exploiting parallel data processing. Compared with ASICs in terms of performance, DAPDNA-2 has proved to be a viable replacement for meeting customers' application requirements. It is implemented in Fujitsu 0.13 µm CMOS technology, with a gate count of around 112 million and a clock frequency of 166 MHz. Compared with Intel's Pentium IV (at 3.0 GHz), its performance in running typical applications has been observed to be exceptional.

II. DAPDNA ARCHITECTURE

Figure 1 shows the critical functional elements of DAPDNA (Digital Application Processor based on Distributed Network Architecture).

The DAP is designed to act both as a sequential execution unit and as the controller of the DNA, a data processing engine built with 376 processing elements (PEs) of different types, as shown in figure 1. The DAP also has an instruction cache and a data cache of 8 Kbytes each and five pipeline stages, and it can execute one instruction per cycle. When better performance is required, the DAP brings the DNA into use by switching one of the internal banks of preloaded configuration data onto the DNA.

Fig. 1: DAPDNA-2 Block Diagram

2012 Fourth International Conference on Computational Intelligence, Modelling and Simulation

2166-8531/12 $26.00 © 2012 IEEE

DOI 10.1109/CIMSim.2012.70

This switching is needed so that the structure of the network and the functionality of the PEs can be modified and parallel data processing can be performed on the DNA. This type of dynamic reconfiguration requires only one cycle. The DAP may suspend its operation until the DNA provides the result of the parallel data processing, or it can operate independently of the DNA.

As shown in figure 2, the DNA has four double-buffered data input and output buffers. Additionally, there are four configuration memory banks and six segments of PE matrices. Several types of PEs are provided; they are preloaded with configuration data, which can be switched in a single cycle. If more configuration data is needed, it can be loaded from external memory (DDR SDRAM) by a background process. Once all the activated PEs are configured on the DNA, they execute simultaneously without needing a continuous feed of instructions. An application, typically coded in C, usually contains several loops or other repeated processes, and most of these can be replaced by DNA configuration data. Algorithms with parallel data processing are the special cases in which the execution of these iterative processes can be accelerated, and up to 50 times higher performance can be achieved in comparison with a Pentium IV at 3.0 GHz. Furthermore, the power consumption is only 2 to 3 watts, as described in figure 3. In addition, regardless of the application mapping or fitting, the DAPDNA-2 guarantees performance at 166 MHz.

Every PE has two independent 32-bit input buses and one 32-bit output bus. Using these buses, any PE in a given segment can be connected to another PE in the same segment. The bus structure provides the flexibility to realize such sophisticated connections by allowing any PE to connect to a fixed horizontal bus; the PE can choose one of the 8 x 32-bit horizontal X or Y buses as its output and any one of the 8 x 32-bit vertical buses as its input. Multiplexers between the vertical and horizontal buses then complete the connection.

III. SUGGESTED IMPROVEMENTS

Considering the power of DAPDNA-2, we propose to improve it by employing a partitioned ALU with saturation, so that it can be used as a specialized dynamically reconfigurable processor for Image Processing/Digital Signal Processing applications.

The data types typically used in Image Processing/Digital Signal Processing applications are narrow (4, 8 or 16 bits), so using 32-bit PEs for them wastes resources, and dynamic ALU partitioning can yield enhanced performance. A partitioned ALU is capable of performing arithmetic operations on narrow data elements. For example, a 32-bit ALU can perform a single 32-bit operation, two 16-bit operations or four 8-bit operations, as shown in figure 4. A tri-state buffer stops or allows the carry from one small ALU to the next depending on the partitioning state: the carry is discarded if the ALU is partitioned and passed on otherwise. Saturating arithmetic means that, if an overflow occurs, the result is clamped to the maximum possible value, as shown in figure 5.

• Gives a result that is closer to the correct value
• Used in DSP, Graphic applications
• Requires extra hardware to be added to binary adder
• Pentium MMX instructions have option for saturating arithmetic.
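The partitioning and saturation behaviour described above can be sketched in software. The following Python model is only an illustration, not the hardware design: the function name `partitioned_add` and its parameters are hypothetical, the operands are assumed to be unsigned, and the carry out of each lane is simply dropped to mimic a tri-state buffer that blocks the carry between partitions.

```python
def partitioned_add(a, b, lane_bits=8, saturate=True, word_bits=32):
    """Add two unsigned words lane by lane (hypothetical software model).

    lane_bits selects the partitioning: 32 -> one 32-bit add,
    16 -> two 16-bit adds, 8 -> four 8-bit adds.
    """
    if lane_bits not in (8, 16, 32) or word_bits % lane_bits != 0:
        raise ValueError("unsupported partitioning")
    lane_max = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        x = (a >> shift) & lane_max
        y = (b >> shift) & lane_max
        s = x + y
        if s > lane_max:
            # The carry never crosses into the next lane. With saturation
            # the lane is clamped to its maximum; without it the lane wraps.
            s = lane_max if saturate else s & lane_max
        result |= s << shift
    return result


# The example from the abstract: EA2Ch + D12Ah saturates to FFFFh.
print(hex(partitioned_add(0xEA2C, 0xD12A, lane_bits=16)))
```

With `saturate=False` the same 16-bit lane wraps around to BB56h, which is the overflow that saturation is meant to avoid.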

Fig. 2: PE Matrix

Figure 3: Performance Comparison

Figure 4: Partitioned ALU for 16-bits

Figure 5: Saturation Example

IV. MAPPING OF DIGITAL IMAGE PROCESSING APPLICATIONS ON PROPOSED RECONFIGURABLE ARCHITECTURE

A. Mapping EEMBC RGB to CMYK Benchmark [6]

The algorithm for converting from RGB to CMYK is part of the EEMBC (Embedded Microprocessor Benchmark Consortium) Consumer Benchmark suite. This benchmark checks the target CPU's capability for basic arithmetic and minimum value detection.

The algorithm is as follows. The RED, GREEN and BLUE (RGB) 8-bit pixel color image input is fed to the following equations:

    /* calculation of complementary colors */
    cn = 255 - RED;
    mg = 255 - GREEN;
    ylw = 255 - BLUE;
    /* find the black level BLK */
    BLK = minimum(cn, mg, ylw);
    /* correction in complementary color level using BLK */
    CYAN = cn - BLK;
    MAGENTA = mg - BLK;
    YELLOW = ylw - BLK;

The range of RGB and CMYK values is [0:255]. The above algorithm uses 8-bit values of the RGB components of a pixel and is a very strong candidate for execution on a partitioned ALU. Figure 6 presents the data flow graph (DFG) for mapping on the proposed architecture.
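The conversion steps above translate directly into a few lines of Python. This is a minimal sketch for a single pixel; the function name `rgb_to_cmyk` is hypothetical and the inputs are assumed to be integers in [0:255].

```python
def rgb_to_cmyk(red, green, blue):
    """Convert one 8-bit RGB pixel to (CYAN, MAGENTA, YELLOW, BLK)."""
    # Complementary colors.
    cn = 255 - red
    mg = 255 - green
    ylw = 255 - blue
    # Black level is the minimum of the three complements.
    blk = min(cn, mg, ylw)
    # Correct the complementary color levels using the black level.
    return cn - blk, mg - blk, ylw - blk, blk


print(rgb_to_cmyk(100, 150, 200))
```

All intermediate and final values stay within [0:255], which is why the three subtractions of each stage fit into the 8-bit lanes of a partitioned ALU.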

1) Performance Evaluation

For a data size of 1024, the partitioned ALU will take 1024*6 = 6144 operations (as mentioned in figure 6), whereas for the same data size the non-partitioned ALU, which has no parallel addition/subtraction (i.e. each addition/subtraction is scheduled in a different time period), will take 1024 operations, giving a clear speedup. The graph in figure 7 compares the performance of the partitioned and non-partitioned ALUs.

Figure 6: Sequencing Graph for Mapping EEMBC RGB to CMYK Benchmark

Fig. 8: Partial (to compute Y) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark

Fig. 7: Performance Comparison in terms of no. of operations a PE requires completing a task

Fig. 9: Partial (to compute I) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark

B. Mapping EEMBC RGB to YIQ Conversion Benchmark

RGB to YIQ conversion is used in the NTSC encoder, where the camera inputs in RGB form are converted to luminance (Y) and two chrominance signals (I and Q). The I and Q signals are modulated by a subcarrier and added to the Y signal in the NTSC encoder. This benchmark checks the capacity of the CPU to perform a basic matrix multiply/accumulate calculation. The R, G, B 8-bit input of the pixel color image is processed as:

    Y = 0.299*RED + 0.587*GREEN + 0.114*BLUE
    I = 0.596*RED - 0.275*GREEN - 0.321*BLUE
    Q = 0.212*RED - 0.523*GREEN + 0.311*BLUE

RGB values are in the range [0:255]. The coefficients for conversion are 16 bits wide. The multiply/accumulate results are shifted 16 bits to the right. Before the shift, 1 is added at the bit position immediately to the right of the LSB of the shifted result, so that the result is rounded to the nearest integer.

• Figure 8 shows the mapping for Y of the EEMBC RGB to YIQ Conversion Benchmark.

• Figure 9 shows the mapping for I of the EEMBC RGB to YIQ Conversion Benchmark.

• Figure 10 shows the mapping for Q of the EEMBC RGB to YIQ Conversion Benchmark.
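The fixed-point scheme described for this benchmark (16-bit coefficients, multiply/accumulate, rounding, then a 16-bit right shift) can be sketched as follows. This is an illustration under stated assumptions, not the EEMBC reference code: the coefficients are assumed to be scaled by 2^16 and rounded to integers, the rounding term is taken as 2^15 (a 1 in the bit just below the LSB of the shifted result), and the names `rgb_to_yiq`, `to_fixed` and `mac_round` are hypothetical.

```python
FRACTION_BITS = 16
ROUNDING = 1 << (FRACTION_BITS - 1)  # 1 in the bit just below the final LSB


def to_fixed(coefficient):
    """Scale a real coefficient to a 16-bit fractional integer (assumption)."""
    return round(coefficient * (1 << FRACTION_BITS))


# Rows of the conversion matrix, one per output component.
Y_COEFFS = tuple(to_fixed(c) for c in (0.299, 0.587, 0.114))
I_COEFFS = tuple(to_fixed(c) for c in (0.596, -0.275, -0.321))
Q_COEFFS = tuple(to_fixed(c) for c in (0.212, -0.523, 0.311))


def mac_round(coeffs, pixel):
    """Multiply/accumulate, add the rounding bit, then shift right by 16."""
    acc = sum(c * p for c, p in zip(coeffs, pixel))
    return (acc + ROUNDING) >> FRACTION_BITS


def rgb_to_yiq(red, green, blue):
    """Convert one 8-bit RGB pixel to integer (Y, I, Q)."""
    pixel = (red, green, blue)
    return (mac_round(Y_COEFFS, pixel),
            mac_round(I_COEFFS, pixel),
            mac_round(Q_COEFFS, pixel))


print(rgb_to_yiq(255, 0, 0))
```

Python's `>>` on negative integers is an arithmetic shift, so negative I and Q accumulations are rounded the same way as positive ones.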

1) Performance Evaluation

For a data size of 1024, the partitioned ALU will take 1024*3 = 3072 operations (as mentioned in figures 8, 9 and 10), whereas the non-partitioned ALU will take 1024*5 = 5120 operations (each operation is scheduled in a different time period), giving a clear speedup.
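The operation counts quoted above, and the speedup they imply, can be checked with a few lines of Python. The per-pixel counts (3 and 5) are taken from the text; the helper name `total_operations` is hypothetical, and the ratio is simple arithmetic on the stated totals rather than a measured result.

```python
def total_operations(data_size, ops_per_element):
    """Total operations when every element needs the same number of steps."""
    return data_size * ops_per_element


DATA_SIZE = 1024
partitioned = total_operations(DATA_SIZE, 3)      # 1024*3
non_partitioned = total_operations(DATA_SIZE, 5)  # 1024*5

# Ratio of the two operation counts.
speedup = non_partitioned / partitioned
print(partitioned, non_partitioned, round(speedup, 2))
```

On these figures the partitioned ALU needs about 1.67 times fewer operations for the RGB to YIQ benchmark.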

2) Cost [7]

Now suppose,

Component           Cost
AND2                2CU
AND3                3CU
OR2                 2CU
NOR3                2CU
XOR2                4CU
Tri State Buffer    1CU

Therefore,

Cost of a full adder = 4 + 2 + 4 + 2 + 2 = 14CU

Cost of a single-bit ALU = 14 + 2 + 2 + 3 + 3 + 4 = 28CU

Similarly, the cost of a 32-bit ALU = 28 x 32 = 896CU

Fig. 12: Full Adder

Fig. 11: Partitioned ALU for 8-bits

Fig. 10: Partial (to compute Q) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark

Figure 13: 1-Bit ALU [7]

Fig. 14: 8-Bit ALU [7]

So, from figure 11, the cost of the partitioned ALU = 896 + 3 + cost of steering logic.

** The cost of the steering logic will be much less than that of an ALU.
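The cost figures in this section can be reproduced with a short script. It is a sketch under two assumptions that the text does not state explicitly: the full adder is taken to consist of two XOR2, two AND2 and one OR2 gates (which matches the terms 4 + 2 + 4 + 2 + 2), and the "+ 3" in the partitioned ALU cost is read as three tri-state buffers of 1CU each. The remaining 1-bit ALU terms are copied verbatim from the sum above, and all names are hypothetical.

```python
# Component costs in CU, as listed in the cost table.
COST_CU = {"AND2": 2, "AND3": 3, "OR2": 2, "NOR3": 2, "XOR2": 4, "TRI_STATE": 1}

# Assumed full adder structure: 2 x XOR2, 2 x AND2, 1 x OR2.
FULL_ADDER_CU = 2 * COST_CU["XOR2"] + 2 * COST_CU["AND2"] + COST_CU["OR2"]

# Remaining terms of the 1-bit ALU sum, taken verbatim from the text.
ONE_BIT_ALU_CU = FULL_ADDER_CU + sum([2, 2, 3, 3, 4])

ALU_32_CU = 32 * ONE_BIT_ALU_CU


def partitioned_alu_cost(steering_logic_cu):
    """32-bit ALU plus three tri-state buffers (assumed) plus steering logic."""
    return ALU_32_CU + 3 * COST_CU["TRI_STATE"] + steering_logic_cu


print(FULL_ADDER_CU, ONE_BIT_ALU_CU, ALU_32_CU)
# Example input: steering logic assumed as costly as a whole ALU.
print(partitioned_alu_cost(ALU_32_CU))
```

The steering logic cost is left as a parameter because the text gives no value for it.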

V. CONCLUSION

In this paper we have tried to enhance the performance of DAPDNA-2 by employing ALU partitioning and saturation arithmetic for Digital Image Processing applications. We have verified the performance by mapping benchmarks and found that the suggested improvements are feasible, as one ALU can act as four with some additional logic. Even if we suppose, in the worst case, that the cost of the extra logic is equal to that of an ALU, we still gain the advantage of enhanced performance (i.e. 2 to 1).

ACKNOWLEDGMENT

We would like to thank the management and faculty of Mehran University of Engineering & Technology, King Fahd University of Petroleum & Minerals, Saudi Arabia, and Usman Institute of Technology, Pakistan, for their support.

REFERENCES

[1] Sato, T.; Watanabe, H.; Shiba, K., “Implementation of dynamically reconfigurable processor DAPDNA-2”, IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test, 2005. (VLSI-TSA-DAT)

[2] http://whatis.techtarget.com/definition/0,,sid9_gci760378,00.html

[3] Cowell, M.; Postula, A., “Rachael SPARC: An Open Source 32-bit Microprocessor Core for SoCs”, 9th IEEE EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, 2006.

[4] Gerard J. M. Smit; André B. J. Kokkeler; Pascal T. Wolkotte; Philip K. F. Hölzenspies; Marcel D. van de Burgwal; Paul M. Heysters, “The Chameleon Architecture for Streaming DSP Applications”, EURASIP Journal on Embedded Systems, 2007.

[5] R. Enzler, C. Plessl and M. Platzner, “System-level performance evaluation of reconfigurable processors”, Microprocessors and Microsystems, Volume 29, Issues 2-3, 1 April 2005, Pages 63

[6] http://www.eembc.com

[7] http://www.dgp.toronto.edu/
