[ieee 2012 fourth international conference on computational intelligence, modelling and simulation...
DAPDNA-2 Performance Enhancement using ALU Partitioning and Saturation Arithmetic for Digital Image/Signal Processing Applications
Muhammad Ghazanfar Department of Electrical Engineering
Usman Institute of Technology, Hamdard University Karachi Pakistan
mgukhan@uit.edu
Bhawani Shankar Chowdhry Department of Electronic Engineering
Mehran University of Engg. & Technology Jamshoro Pakistan
bhawani.chowdhry@faculty.muet.edu.pk
Shiraz Latif Department of Electrical Engineering
Usman Institute of Technology, Hamdard University Karachi Pakistan
engr_shiraz@ieee.org
Asif Siddiqui King Fahd University of Petroleum & Minerals
Saudi Arabia asifsid@kfupm.edu.sa
GhulamAli Mallah
Department of Computer Science Shah Abdul Latif University
Khairpur Pakistan ghulam.ali@salu.edu.pk
Abstract- A reconfigurable processor is a microprocessor with erasable hardware that can rewire itself dynamically. This allows the chip to adapt effectively to the programming tasks demanded by the particular software it is interfacing with at any given time. To enhance the computational capabilities of the DAPDNA-2 reconfigurable processor for Digital Image/Signal Processing (DIP/DSP) applications, we propose modifying its ALU architecture so that the existing 32-bit ALU can operate as 8-bit, 16-bit or 32-bit units through ALU Partitioning, since typical DIP/DSP applications generally use 8-bit or 16-bit data types. In addition, Saturation arithmetic prevents overflow: for graphics applications, any increment beyond the maximum register value is discarded, e.g. EA2Ch + D12Ah = FFFFh for 16-bit operands.
Keywords- DAPDNA, Image Processing, Signal Processing, Microprocessor.
I. INTRODUCTION
The DAPDNA-2 processor was developed by IPFlex using dynamically reconfigurable technology to achieve high performance, greater flexibility and computational power by exploiting parallel data processing. In terms of performance, DAPDNA-2 has proved to be a viable replacement for ASICs in meeting customers' application requirements. It has a gate count of around 112 million and a clock frequency of 166 MHz, and is implemented in Fujitsu 0.13 um CMOS technology. Compared with Intel's Pentium IV (at 3.0 GHz), its performance in running typical applications has been observed to be exceptional.
II. DAPDNA ARCHITECTURE
Figure 1 shows the critical functional elements of DAPDNA (Digital Application Processor based on Distributed Network Architecture).
The DAP is designed to act both as a sequential execution unit and as the controller of the DNA, a data processing engine built from 376 PEs of different types, as shown in figure 1. The DAP also has an instruction cache and a data cache of 8 Kbytes each, and five pipeline stages capable of executing one instruction per cycle. When higher performance is required, the DAP utilizes the DNA by switching one of the internal banks of preloaded configuration data onto the DNA.
Fig. 1: DAPDNA-2 Block Diagram
2012 Fourth International Conference on Computational Intelligence, Modelling and Simulation
2166-8531/12 $26.00 © 2012 IEEE
DOI 10.1109/CIMSim.2012.70
This switching modifies the structure of the network and the functionality of the PEs so that parallel data processing can be performed on the DNA. Only one cycle is required for this type of dynamic reconfiguration. The DAP may suspend its operation until the DNA provides the result of the parallel data processing, or it can operate independently of the DNA.
As shown in figure 2, the DNA has four double-buffered data input and output buffers. Additionally, there are four configuration memory banks and six segments of PE matrices. Several types of PEs are provided; they are preloaded with configuration data that can be switched in a single cycle. If more configuration data is needed, it may be loaded from external memory (DDR SDRAM) by a background process. Once all the activated PEs are configured on the DNA, they execute simultaneously without being fed instructions continuously. Application code, generally written in C, typically contains several loops or other repeated processes, and most of these can be replaced by DNA configuration data. Algorithms with parallel data processing are the special cases in which the execution of these iterative processes can be accelerated, achieving up to 50 times the performance of a Pentium IV at 3.0 GHz. Furthermore, power consumption is only 2 to 3 watts, as described in figure 3. In addition, regardless of the application mapping or fitting, the DAPDNA-2 guarantees performance at 166 MHz.
Every PE has two independent 32-bit input buses and one 32-bit output bus. Using these buses, any PE in a given segment can be connected to another PE in the same segment. The bus structure provides the flexibility to realize such sophisticated connections: each PE connects to a fixed horizontal bus, can choose any one of the 8 x 32-bit horizontal X or Y buses as its output, and can choose any one of the 8 x 32-bit vertical buses as its input. Multiplexers between the vertical and horizontal buses then complete the connection.
III. SUGGESTED IMPROVEMENTS
Considering the power of DAPDNA-2, we propose to improve it by employing a Partitioned ALU with Saturation, so that it can serve as a specialized dynamically reconfigurable processor for Image Processing/Digital Signal Processing applications.
The data types typically used in Image Processing/Digital Signal Processing applications are narrow (4, 8 or 16 bits), so using 32-bit PEs for them wastes resources; dynamic ALU partitioning can therefore yield enhanced performance. A partitioned ALU is capable of performing arithmetic operations on narrow data elements. For example, a 32-bit ALU can perform a single 32-bit operation, two 16-bit operations or four 8-bit operations, as shown in figure 4. A Tri State Buffer is employed to block or pass the carry from one small ALU to the next depending on the partitioning state, i.e. the carry is discarded if the ALU is partitioned and passed otherwise. Saturating arithmetic means that, if an overflow occurs, the result is clamped to the maximum possible value, as shown in figure 5.
• Gives a result that is closer to the correct value
• Used in DSP and graphics applications
• Requires extra hardware to be added to the binary adder
• Pentium MMX instructions have an option for saturating arithmetic.
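As an illustration of the partitioning and saturation behavior described above, the following C sketch models a 32-bit ALU that adds its operands as one 32-bit, two 16-bit or four 8-bit unsigned lanes. It is a software model only, not the proposed hardware; the function name padd_sat and the parameter lane_bits are hypothetical and do not come from the DAPDNA-2 tool chain.

```c
#include <stdint.h>

/* Lane-wise unsigned saturating add on a 32-bit word.
 * lane_bits selects the partitioning: 8, 16 or 32.
 * Any other value is treated as 32 (an assumption of this sketch). */
uint32_t padd_sat(uint32_t a, uint32_t b, unsigned lane_bits)
{
    if (lane_bits != 8 && lane_bits != 16)
        lane_bits = 32;

    /* Largest value a single lane can hold. */
    uint64_t lane_max = (lane_bits == 32) ? 0xFFFFFFFFu
                                          : ((1u << lane_bits) - 1u);
    uint32_t result = 0;

    for (unsigned shift = 0; shift < 32; shift += lane_bits) {
        uint64_t x = (a >> shift) & lane_max;
        uint64_t y = (b >> shift) & lane_max;
        uint64_t sum = x + y;      /* the carry out of the lane lives here */

        if (sum > lane_max)        /* overflow: clamp instead of wrapping */
            sum = lane_max;

        /* The carry is never forwarded to the next lane, which models
         * the tri-state buffer blocking it while partitioned. */
        result |= (uint32_t)sum << shift;
    }
    return result;
}
```

With lane_bits = 16, padd_sat(0x0000EA2C, 0x0000D12A, 16) yields 0x0000FFFF, matching the EA2Ch + D12Ah = FFFFh example. With lane_bits = 32 the same operands give 0x0001BB56, because the carry is allowed to pass between the 16-bit halves.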
Fig. 2: PE Matrix
Figure 3: Performance Comparison
Figure 4: Partitioned ALU for 16-bits
Figure 5: Saturation Example
IV. MAPPING OF DIGITAL IMAGE PROCESSING APPLICATIONS ON PROPOSED RECONFIGURABLE ARCHITECTURE
A. Mapping EEMBC RGB to CMYK Benchmark [6]
The algorithm for converting from RGB to CMYK is part of the EEMBC (The Embedded Microprocessor Benchmark Consortium) Consumer Benchmark suite. It is a benchmark that checks the target CPU's capability for basic arithmetic and minimum value detection.
The algorithm is as follows. The RED, GREEN and BLUE (RGB) 8-bit pixel color image input is fed to the following equations:
    /* calculation of complementary colors */
    cn = 255 - RED;
    mg = 255 - GREEN;
    ylw = 255 - BLUE;
    /* find the black level BLK */
    BLK = minimum(cn, mg, ylw);
    /* correction in complementary color level using BLK */
    CYAN = cn - BLK;
    MAGENTA = mg - BLK;
    YELLOW = ylw - BLK;
The range of RGB and CMYK values is [0:255]. The above algorithm uses 8-bit values for the RGB components of a pixel and is therefore a very strong candidate for execution on a partitioned ALU. Figure 6 presents the data flow graph (DFG) for mapping onto the proposed architecture.
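The conversion above can be written as a small C routine. This is a minimal sketch that processes one pixel at a time; the type cmyk_t and the function names are illustrative and are not taken from the EEMBC benchmark source. Storing BLK as the K (black) channel is an assumption based on the stated CMYK value range.

```c
#include <stdint.h>

/* One CMYK pixel; every channel lies in [0, 255]. */
typedef struct {
    uint8_t cyan, magenta, yellow, black;
} cmyk_t;

/* Minimum of three 8-bit values (the "minimum value detection" step). */
static uint8_t min3_u8(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

cmyk_t rgb_to_cmyk(uint8_t red, uint8_t green, uint8_t blue)
{
    /* Complementary colors. */
    uint8_t cn  = (uint8_t)(255 - red);
    uint8_t mg  = (uint8_t)(255 - green);
    uint8_t ylw = (uint8_t)(255 - blue);

    /* Black level. */
    uint8_t blk = min3_u8(cn, mg, ylw);

    /* Correct the complementary colors using the black level.
     * blk is never larger than any of them, so no underflow occurs. */
    cmyk_t out;
    out.cyan    = (uint8_t)(cn - blk);
    out.magenta = (uint8_t)(mg - blk);
    out.yellow  = (uint8_t)(ylw - blk);
    out.black   = blk;   /* assumption: K is the black level BLK */
    return out;
}
```

Every intermediate value fits in 8 bits, which is why the three subtractions of each stage are natural candidates for the 8-bit lanes of a partitioned 32-bit ALU.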
1) Performance Evaluation
For a data size of 1024, the partitioned ALU will take 1024*6 = 6144 operations (as shown in figure 6), whereas the non-partitioned ALU, which has no parallel addition/subtraction (each addition/subtraction is scheduled in a different time period), will take 1024 operations, giving a clear speedup. The graph in figure 7 compares the performance of the partitioned and non-partitioned ALUs.
Figure 6: Sequencing Graph for Mapping EEMBC RGB to CMYK Benchmark
Fig. 8: Partial (to compute Y) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark
Fig. 7: Performance Comparison in terms of no. of operations a PE requires completing a task
Fig. 9: Partial (to compute I) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark
B. Mapping EEMBC RGB to YIQ Conversion Benchmark
RGB to YIQ conversion is used in the NTSC encoder, where the camera inputs in RGB form are converted to luminance (Y) and two chrominance signals (I and Q). The I and Q signals are modulated by a subcarrier and added to the Y signal in the NTSC encoder. This benchmark checks the capacity of the CPU to perform a basic matrix multiply/accumulate calculation. The R, G, B 8-bit input of the pixel color image is processed as:
    Y = 0.299*RED + 0.587*GREEN + 0.114*BLUE
    I = 0.596*RED - 0.275*GREEN - 0.321*BLUE
    Q = 0.212*RED - 0.523*GREEN + 0.311*BLUE
RGB values are in the range [0:255]. The coefficients for conversion are 16 bits. The multiply/accumulate results are shifted 16 bits to the right. Before the shift, 1 is added to the bit position immediately to the right of the LSB of the shifted result, so that the result is rounded to the nearest integer.
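The fixed-point scheme described above (scaled coefficients, rounding bit, 16-bit right shift) can be sketched in C as follows. The coefficients are the values from the equations multiplied by 2^16 and rounded; holding them in 32-bit integers and the names yiq_t, round_shift16 and rgb_to_yiq are assumptions of this sketch, since the text does not give the exact EEMBC coefficient encoding.

```c
#include <stdint.h>

typedef struct {
    int16_t y, i, q;   /* I and Q can be negative */
} yiq_t;

/* Add the rounding bit (2^15) and shift right by 16 bits.
 * Negative values are handled explicitly because right-shifting a
 * negative signed integer is implementation-defined in C. */
static int32_t round_shift16(int32_t acc)
{
    int32_t t = acc + 32768;
    if (t >= 0)
        return t >> 16;
    return -(int32_t)(((uint32_t)(-t) + 65535u) >> 16);
}

yiq_t rgb_to_yiq(uint8_t red, uint8_t green, uint8_t blue)
{
    int32_t r = red, g = green, b = blue;
    yiq_t out;

    /* Coefficients are round(c * 65536), e.g. 0.299 -> 19595. */
    out.y = (int16_t)round_shift16(19595 * r + 38470 * g +  7471 * b);
    out.i = (int16_t)round_shift16(39059 * r - 18022 * g - 21037 * b);
    out.q = (int16_t)round_shift16(13894 * r - 34275 * g + 20382 * b);
    return out;
}
```

For a pure red pixel, rgb_to_yiq(255, 0, 0) gives Y = 76, I = 152 and Q = 54, which agrees with rounding 0.299*255, 0.596*255 and 0.212*255 to the nearest integer.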
• Figure 8 shows the mapping for Y of the EEMBC RGB to YIQ Conversion Benchmark.
• Figure 9 shows the mapping for I of the EEMBC RGB to YIQ Conversion Benchmark.
• Figure 10 shows the mapping for Q of the EEMBC RGB to YIQ Conversion Benchmark.
1) Performance Evaluation
For a data size of 1024, the partitioned ALU will take 1024*3 = 3072 operations (as shown in figures 8, 9 and 10), whereas the non-partitioned ALU will take 1024*5 = 5120 operations (each operation is scheduled in a different time period), a speedup of about 1.67.
2) Cost [7]
Now suppose the following component costs:
    Component           Cost
    AND2                2 CU
    AND3                3 CU
    OR2                 2 CU
    NOR3                2 CU
    XOR2                4 CU
    Tri State Buffer    1 CU
Therefore,
    Cost of a full adder = 4 + 2 + 4 + 2 + 2 = 14 CU
    Cost of a single-bit ALU = 14 + 2 + 2 + 3 + 3 + 4 = 28 CU
Similarly, the cost of a 32-bit ALU = 28 x 32 = 896 CU
Fig. 12: Full Adder
Fig. 11: Partitioned ALU for 8-bits
Fig. 10: Partial (to compute Q) Sequencing Graph for Mapping EEMBC RGB to YIQ Benchmark
Figure 13: 1-Bit ALU [7]
Fig. 14: 8-Bit ALU [7]
So from figure 11, the cost of Partitioned ALU = 896 + 3 + Cost of Steering Logic
** The cost of steering logic will definitely be much less than that of an ALU.
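The cost figures above can be reproduced with a few lines of C. This is only a bookkeeping sketch: the gate mix assumed for the full adder (two XOR2, two AND2, one OR2) and the reading of the "+ 3" term as three tri-state buffers are assumptions inferred from the sums, and all function names are hypothetical.

```c
/* Gate costs in CU, as listed in the cost table. */
enum {
    COST_AND2 = 2, COST_AND3 = 3, COST_OR2 = 2,
    COST_NOR3 = 2, COST_XOR2 = 4, COST_TRISTATE = 1
};

int full_adder_cost(void)
{
    /* Assumed gate mix: 4 + 2 + 4 + 2 + 2 = 14 CU. */
    return 2 * COST_XOR2 + 2 * COST_AND2 + COST_OR2;
}

int one_bit_alu_cost(void)
{
    /* The remaining terms 2 + 2 + 3 + 3 + 4 are used as given in the
     * text, which does not say which gates they correspond to. */
    return full_adder_cost() + 2 + 2 + 3 + 3 + 4;
}

int alu_cost(int bits)
{
    return bits * one_bit_alu_cost();
}

int partitioned_alu_cost(int steering_cost)
{
    /* Assumption: the "+ 3" is one tri-state buffer per 8-bit boundary. */
    return alu_cost(32) + 3 * COST_TRISTATE + steering_cost;
}
```

Calling partitioned_alu_cost(0) gives 899 CU before steering logic. Passing alu_cost(32) as the steering cost gives 1795 CU, roughly the cost of two plain 32-bit ALUs, for a unit that can perform up to four 8-bit operations at once.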
V. CONCLUSION
In this paper we have tried to enhance the performance of DAPDNA-2 by employing ALU Partitioning and Saturation Arithmetic for Digital Image Processing applications. We have verified the performance by mapping benchmarks and found that the suggested improvements are feasible, as one ALU can act as four with some additional logic. Even if we suppose (worst case) that the cost of the extra logic equals that of an ALU, we still gain enhanced performance (i.e. 2 to 1).
ACKNOWLEDGMENT
We would like to thank the management and faculty of Mehran University of Engineering & Technology, King Fahd University of Petroleum & Minerals (Saudi Arabia), and Usman Institute of Technology (Pakistan) for their support.
REFERENCES
[1] Sato, T.; Watanabe, H.; Shiba, K., “Implementation of dynamically reconfigurable processor DAPDNA-2”, IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test (VLSI-TSA-DAT), 2005.
[2] http://whatis.techtarget.com/definition/0,,sid9_gci760378,00.html
[3] Cowell, M.; Postula, A., “Rachael SPARC: An Open Source 32-bit Microprocessor Core for SoCs”, 9th IEEE EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, 2006.
[4] Gerard J. M. Smit; André B. J. Kokkeler; Pascal T. Wolkotte; Philip K. F. Hölzenspies; Marcel D. van de Burgwal; Paul M. Heysters, “The Chameleon Architecture for Streaming DSP Applications”, EURASIP Journal on Embedded Systems, 2007.
[5] R. Enzler, C. Plessl and M. Platzner, “System-level performance evaluation of reconfigurable processors”, Microprocessors and Microsystems, Volume 29, Issues 2-3, 1 April 2005, Pages 63
[6] http://www.eembc.com
[7] http://www.dgp.toronto.edu/