
A Synthesis Algorithm for Customized Heterogeneous Multi-processors

Rahim Soleymanpour
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
[email protected]

Siamak Mohammadi
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
[email protected]

Hamed Rajabi
Shomal University, Amol, Iran
[email protected]

Abstract: Today, applications running on embedded systems demand more computation while still satisfying requirements such as performance and power consumption. Multiprocessor systems on chip (MPSoCs) provide a practical solution because they significantly increase throughput. We can employ the application-specific instruction set processor (ASIP) concept to customize each processor in an MPSoC platform for the application mapped onto it. Here, we propose an algorithm that synthesizes an optimal heterogeneous MPSoC with ASIPs and also identifies the mapping of an application onto the MPSoC platform. The synthesis algorithm and the scheduling are executed simultaneously. Additionally, the synthesis algorithm identifies the number of processors needed to reach maximum performance. Experimental results show an average speedup of 42.71% and an average improvement of 30.77% in Network on Chip (NoC) energy consumption compared to a homogeneous MPSoC.

Keywords: hardware/software codesign, synthesis algorithm, application-specific instruction-set processor (ASIP), heterogeneous MPSoC.

I. Introduction

The ASIP approach provides considerable flexibility for design. In this technique, new instructions are added to the base processor to enhance performance and to reduce power consumption and code size. Algorithms in this field identify custom instructions by searching a huge space for patterns of basic instructions that occur frequently in the application code. Then, from the potential candidates, they select the custom instructions with the greatest merit. Without an automatic tool, this procedure becomes tedious and may not attain the desired performance.

To devise an optimal MPSoC, designers must investigate all factors that affect MPSoC performance [1], such as decomposing an application program into subtasks, mapping and scheduling the tasks onto processor elements (PEs), and interprocessor communication. Consequently, to customize an MPSoC, one must consider these parameters simultaneously; otherwise the result may be suboptimal. Because the design space is vast, a heuristic approach is needed to explore it and reach practical solutions.

We present a synthesis algorithm to construct optimal heterogeneous MPSoCs for applications in embedded systems. This algorithm takes the application code and a directed acyclic task graph (DAG) as inputs and uses an iterative procedure. Selecting custom instructions for a task depends on two things: the potential tasks identified in previous iterations and the data communication between the tasks. Lastly, it selects the configuration that has an appropriate tradeoff between speedup, number of processors, area overhead of the custom instructions, and energy dissipation of the network on chip (NoC). To evaluate our proposed algorithm, we use applications from several domains, such as multimedia, encryption and networking. The experimental results show that the synthesized heterogeneous MPSoC achieves better performance.

In the rest of the paper, we discuss previous work related to the synthesis of heterogeneous MPSoCs in Section II. In Section III, we describe the proposed algorithm and present its pseudo-code. Finally, experimental results and the conclusion are presented in Sections IV and V, respectively.

II. Related Work

The ASIP approach is used to reduce time to market and has lower complexity than application-specific integrated circuits (ASICs). The authors of [2] offer a method to design a processor with a special instruction set. They investigate the behavior of the target applications, analyze various configurations, and finally express the requirements for hardware resources. The algorithms in [3] and [4] describe methods to extract custom instructions for the target program. The platform in [5] automatically searches for templates of custom instructions through the basic blocks of applications. It also shows how the number of register file inputs and outputs of the custom function unit (CFU) affects energy consumption.

Scheduling on platforms with homogeneous processing elements has been widely studied. In [6] and [7], different heuristics have been introduced to reduce the completion time of applications on a homogeneous platform. The authors of [8] categorize some scheduling algorithms for homogeneous MPSoCs and then compare their performance. For heterogeneous processing elements, on the other hand, the DCPD algorithm [9] has been proposed, which duplicates some tasks to improve performance. The HEFT and CPOP algorithms [10] have been proposed for scheduling an application graph on heterogeneous processors.

Issues that may be encountered in MPSoC designs that use extensible and configurable processors are reviewed in [11], which also presents a tool for data-intensive applications. The method described in [12] uses an iterative process that first completes the scheduling in each iteration and then randomly picks a critical path. In the following step, it adds custom instructions to the tasks on the selected critical path, and it continues until the selected path is no longer critical. For task sequences, a dynamic programming model that customizes processors is proposed in [13]. It identifies the appropriate number of partitions for an application and selects suitable custom instructions to improve performance for stream applications such as MP3. The number of partitions of an application is equal to the number of processors.

978-1-4673-2990-3/12/$31.00 2012 IEEE - 151 - ISOCC 2012

III. Proposed Algorithm

Here, we propose an algorithm that executes the repetitive and time-consuming parts of an application on a custom function unit (CFU). The inputs of our proposed algorithm are: 1) a directed acyclic task graph (DAG) represented as {T1, T2, ..., TN}, where N is the number of tasks, and exeTime(TJ) and commCost(TJ,TI) denote the execution time of task TJ and the data communication cost from TJ to TI, respectively; 2) the task code, which is used to profile execution time and to extract custom instructions. This approach improves the completion time and the energy consumption on the NoC.
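The first input can be pictured as a small data structure. The following Python sketch is only an illustration: the class name `TaskGraph`, its fields and its methods are hypothetical and are not taken from the authors' C/C++ implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaskGraph:
    """Directed acyclic task graph (illustrative names, not the authors')."""
    exe_time: Dict[str, float] = field(default_factory=dict)  # exeTime(TJ)
    comm_cost: Dict[Tuple[str, str], float] = field(default_factory=dict)  # commCost(TJ, TI)

    def add_task(self, name: str, exe_time: float) -> None:
        self.exe_time[name] = exe_time

    def add_edge(self, src: str, dst: str, cost: float) -> None:
        if src not in self.exe_time or dst not in self.exe_time:
            raise KeyError("both endpoints must be added as tasks first")
        self.comm_cost[(src, dst)] = cost

    def children(self, name: str) -> List[str]:
        return [d for (s, d) in self.comm_cost if s == name]

    def parents(self, name: str) -> List[str]:
        return [s for (s, d) in self.comm_cost if d == name]

    def topological_order(self) -> List[str]:
        """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
        indeg = {t: len(self.parents(t)) for t in self.exe_time}
        ready = [t for t, d in indeg.items() if d == 0]
        order = []
        while ready:
            t = ready.pop(0)
            order.append(t)
            for c in self.children(t):
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        if len(order) != len(self.exe_time):
            raise ValueError("task graph is not acyclic")
        return order
```

A scheduler can walk `topological_order()` to visit every task after all of its parents, and the cycle check guards the assumption that the input is a DAG.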

In general, the proposed algorithm uses an incremental and iterative process and performs task scheduling in each step of the synthesis algorithm. If a task is critical, the task customization process is executed at the same time as the scheduling. Since a task may migrate to a different processor in the next iteration, the algorithm assigns the selected custom instructions to the task instead of the processor. Such migration may occur because the customization routine reduces the execution time of tasks. Consequently, the task mapping produced by the scheduling may differ from that of previous iterations. The synthesis algorithm is now described in more detail.

Algorithm 1: First of all, tasks are profiled to identify the hot spots and the execution time of the application. For scheduling, we use the dynamic critical path (DCP) algorithm [6], which utilizes the two concepts of absolute earliest start time (AEST) and absolute latest start time (ALST) to identify a task on the critical path. A task is critical when its AEST and ALST are equal. In each step, the algorithm calculates the initial values of AEST and ALST for all tasks. The second loop performs scheduling and selects custom instructions. It determines whether the selected task is critical.
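The critical-task test can be sketched as follows. This is a minimal Python illustration of the AEST and ALST definitions as they are usually stated for DCP scheduling [6], assuming that no task has been mapped yet, so every edge pays its full communication cost. The function name `aest_alst` and the dictionary-based inputs are hypothetical.

```python
def aest_alst(exe_time, comm_cost):
    """Return (aest, alst, critical_tasks) for an unscheduled DAG.

    exe_time:  {task: execution time}
    comm_cost: {(parent, child): communication cost}
    Assumption: no task is mapped yet, so every edge pays its full cost.
    """
    parents = {t: [] for t in exe_time}
    children = {t: [] for t in exe_time}
    for (src, dst) in comm_cost:
        parents[dst].append(src)
        children[src].append(dst)

    # Topological order (Kahn's algorithm).
    indeg = {t: len(parents[t]) for t in exe_time}
    ready = [t for t in exe_time if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)

    # Forward pass: a task can start once data from every parent has arrived.
    aest = {}
    for t in order:
        aest[t] = max(
            (aest[p] + exe_time[p] + comm_cost[(p, t)] for p in parents[t]),
            default=0,
        )

    # Length of the critical path of the whole graph.
    cp_length = max(aest[t] + exe_time[t] for t in exe_time)

    # Backward pass: latest start that does not lengthen the critical path.
    alst = {}
    for t in reversed(order):
        if not children[t]:
            alst[t] = cp_length - exe_time[t]
        else:
            alst[t] = min(alst[c] - comm_cost[(t, c)] for c in children[t]) - exe_time[t]

    # Exact equality follows the text; with float times a tolerance may be needed.
    critical = [t for t in order if aest[t] == alst[t]]
    return aest, alst, critical
```

For a four-task graph with execution times T1=2, T2=3, T3=1, T4=2 and edge costs T1→T2=1, T1→T3=4, T2→T4=1, T3→T4=1, the critical tasks are T1, T3 and T4, while T2 has one time unit of slack (AEST 3, ALST 4).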

In the next step, the synthesis algorithm performs scheduling with new execution time of the tasks. Tasks that were critical in the previous step may not be critical in the current iteration and vice versa. If a task is still critical, the procedure can try to insert more custom instructions.

At the end of each step, the completion time of the application is indicated by the current configuration, and best_merit is calculated for next step. Additionally, traffic and energy dissipation in the network are calculated. We use the following formula presented in [14] to estimate the energy consumption in NoC:

$$E_{bit} = E_{S_{bit}} + E_{B_{bit}} + E_{W_{bit}} \qquad (1)$$

In the above equation, the energy consumption for each bit inside the switch fabric is $E_{bit}$. It consists of the bit energy consumed on the node switches, $E_{S_{bit}}$, on the internal buffers, $E_{B_{bit}}$, and on the interconnect wires, $E_{W_{bit}}$.
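As a rough illustration of how the per-bit energy of Eq. (1) could feed a traffic-based estimate, consider the Python sketch below. The function names and the rule "total energy = inter-processor bits times per-bit energy" are assumptions made for this example; the paper does not spell out how traffic is converted to energy.

```python
def bit_energy(e_switch, e_buffer, e_wire):
    """Eq. (1): per-bit energy as the sum of switch, buffer and wire terms."""
    return e_switch + e_buffer + e_wire


def noc_energy(traffic_bits, mapping, e_bit):
    """Estimate NoC energy for a given task mapping.

    traffic_bits: {(src_task, dst_task): bits transferred}
    mapping:      {task: processor}
    Assumption (not from the paper): each inter-processor bit crosses the
    interconnect once and costs e_bit; same-processor traffic costs nothing.
    """
    total_bits = sum(
        bits
        for (src, dst), bits in traffic_bits.items()
        if mapping[src] != mapping[dst]
    )
    return total_bits * e_bit
```

Under this model, moving two heavily communicating tasks onto the same processor removes their traffic from the sum, which is why the task mapping directly affects the estimated energy.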

As depicted in Algorithm 1, the first loop terminates when there is no further improvement. The termination condition is detected by analyzing the improvement over the last iterations. The process continues as long as the average improvement is above a constraint variable. All points found in all steps are then explored to select the best possible solution according to the constraints.
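One way to realize such a history-based stopping rule is a moving average over a fixed window, sketched below in Python. The class name `ImprovementMonitor`, the window size and the use of relative completion-time reduction as the improvement measure are all assumptions for illustration; the paper does not specify them.

```python
from collections import deque


class ImprovementMonitor:
    """Tracks recent per-iteration improvement and signals when to stop."""

    def __init__(self, window, threshold):
        self.recent = deque(maxlen=window)  # keeps only the last `window` values
        self.threshold = threshold          # the "constraint variable"
        self.previous = None

    def update(self, completion_time):
        """Record a completion time; return True while iterating should continue."""
        if self.previous is not None:
            # Relative reduction in completion time versus the previous iteration.
            self.recent.append((self.previous - completion_time) / self.previous)
        self.previous = completion_time
        if len(self.recent) < self.recent.maxlen:
            return True  # not enough history yet
        return sum(self.recent) / len(self.recent) > self.threshold
```

Averaging over a window rather than testing a single iteration lets the loop survive one unproductive step, which matters here because a customization step can change the mapping without immediately shortening the schedule.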

Algorithm 1: synthesis MultiProcessor (task graph, code)
    consumption_area ← 0; best_merit ← [...]; improvement ← true;
    profiling for all tasks;
    while (consumption_area [...] temp_merit then  // compute best merit CI for next iteration
                swap(paramCI.merit, temp_merit);
            end if
            compute area consumption;
        end while
        update of best_merit for next iteration
        compute traffic and energy consumption in NoC;
        compute improvement;  // based on history
        save intermediate result;
    end while;
    select best solution;

Algorithm 2: The select_processor function described in Algorithm 2 selects the best processor and calls the custom instruction extraction and selection routines. Algorithm 2 is called for every task in all iterations. If the selected task is not critical, the algorithm only attempts to select a processor from all available processors. Otherwise, if the task is critical, it calls the customization routine after detecting the appropriate processor.

The algorithm uses information obtained in the previous iteration to pick the best custom instruction in the whole process. Assume a task is critical; to apply the customization process, some custom instructions are selected for this task. However, tasks that have not yet been scheduled and that will become critical in following steps might offer better custom instructions than those seen in previous steps. This situation must be considered so that efficiency is not lost. For this reason, the algorithm uses a variable (best_merit) to select the best set of custom instructions in each iteration. It picks the custom instructions that have the best merit among all tasks and then updates best_merit for the next iteration. The returned value of the merit variable is used to determine the value of best_merit for the next iteration.

Tasks that exchange a large amount of data are clearly better mapped onto the same processor. Here, we ignore the data communication cost of tasks that execute on the same processor. Algorithm 2 tries to map the selected task onto the processor whose tasks have the most data communication with the selected task. Also, if possible, the critical child of the selected task is mapped onto the same processor. The algorithm uses custom instructions to reduce completion time and also to make this mapping possible. The following example illustrates the situation. Assume there is a task with heavy data communication with other tasks. These tasks should therefore run on the same processor, but there is no sufficient time slot. In this case, we can remedy the situation by reducing the execution time through extraction of custom instructions. We use the equation given in [6] to check whether the processor onto which the task is mapped has a sufficiently long idle time slot.

As shown in Algorithm 2, the processor with the minimum AEST is selected. If there is a sufficient time slot on the desired processor for the critical task, the algorithm temporarily assigns the task to a processor called the real processor. Otherwise, the task is assigned to a so-called temporary processor. In the following step, the algorithm identifies the best processor between the real and temporary processors, if possible.

Finally, the algorithm returns values such as the merit of the last selected custom instruction, area overhead and achieved speedup for selected task. At the end of the algorithm, the task is assigned to the adequate processor.
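The communication-aware part of the processor choice can be illustrated in isolation. The Python sketch below covers only the affinity rule (prefer the processor whose resident tasks exchange the most data with the selected task); it leaves out the time-slot check and the customization routine. The function name `most_communicating_processor` and its parameters are hypothetical.

```python
def most_communicating_processor(task, mapping, comm_cost, processors):
    """Pick the processor whose resident tasks exchange the most data with `task`.

    mapping:   {already-mapped task: processor}
    comm_cost: {(src, dst): communication cost}
    Ties, including the case of no communication at all, fall back to the
    first processor in `processors`.
    """
    volume = {p: 0 for p in processors}
    for (src, dst), cost in comm_cost.items():
        if src == task and dst in mapping:
            volume[mapping[dst]] += cost
        elif dst == task and src in mapping:
            volume[mapping[src]] += cost
    return max(processors, key=lambda p: volume[p])
```

In a full implementation this preference would be weighed against the start time that each candidate processor can offer, since the text selects the processor with the minimum AEST.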

Algorithm 2: select_processor (task, location, best_merit)
    for i = 0 to number of processors in processorList do
        state ← real;
        nameProc ← processorList[i];
        thisAEST ← findSlot(task, nameProc, location);
        if task is critical then
            CCR ← findMaxCCR(task, nameProc, location);
            if thisAEST < 0 then
                diff ← thisAEST;
                temporary exeTime(task) = diff;
                thisAEST ← findSlot(task, nameProc, location);
                state ← temp;
            end if
            update maxCCR;
        end if
        if thisAEST > 0 then
            nc ← select child node of task that has the smallest difference between ALST and AEST;
            temporarily map task on nameProc;
            childAEST ← findSlot(nc, nameProc);
            if childAEST + thisAEST [...]

IV. Experimental Results

Our proposed algorithm has been implemented in about 7500 lines of C/C++. We have constructed a platform to evaluate it. The platform consists of MIPS [15] processors implemented with a commercial design tool and a 90 nm CMOS library. A crossbar network is used for communication between processors. The platform uses the algorithm described in [5] to extract custom instructions automatically. We have also implemented the method proposed in [12], mentioned above, on our platform to compare the performance of both methods. We have used benchmarks [16][17][18] from various domains such as multimedia, networking and encryption.

TABLE 1: The optimal points obtained from the platform

Application   Algorithm        Consumption area for CI (nm²)   Speedup   Network traffic   Processor numbers
MPEG2DEC      Our Algorithm    7.1151                          23.97%    21.76%            4
MPEG2DEC      Algorithm [12]   10.5814                         24.75%    15.03%            5
APP1          Our Algorithm    2.1735                          69.26%    28.13%            3
APP1          Algorithm [12]   1.3931                          69.25%    -10.94%           3
APP2          Our Algorithm    0.6981                          33.78%    0.0%              2
APP2          Algorithm [12]   0.2709                          5.88%     0.0%              2
Networking    Our Algorithm    0.5106                          48.44%    56%               2
Networking    Algorithm [12]   0.6736                          34.29%    0.0%              2

Fig. 1 depicts how adding custom instructions affects the speedup. It shows that adding custom instructions usually improves the performance for both methods, while our algorithm results in better performance. Our algorithm explores more points in the design space and, as a result, is more likely to find better solutions. For example, in Fig. 1 for App2, both algorithms go through the same part of the path, but the method in [12] stops earlier, while our algorithm continues the customization procedure. Sometimes, adding custom instructions does not improve performance. This happens for benchmark App1.

Energy consumption in the NoC is a considerable portion of the total energy. It depends strongly on the network traffic, so the task mapping can drastically alter the energy consumption. Fig. 2 illustrates the effect of the customization procedure on the network traffic. It also shows the ups and downs in network traffic during the customization routine, which come from changes in the task mapping on processors. The X and Y axes show the number of added custom instructions and the traffic between processors, respectively. Since our algorithm intends to reduce the network traffic, it creates an opportunity to decrease energy consumption. However, the customization procedure may have no effect on network traffic. For example, for App2, customization does not change the task mapping on processors. The method in [12], on the other hand, does not consider data communication. Consequently, it loses efficiency in power dissipation in the network on chip.

The optimal points obtained by the implemented algorithms are shown in TABLE 1. Our algorithm becomes stable or terminates within at most 38 iterations. On average, we obtain improvements of 42.71% and 30.77% in speedup and energy dissipation, respectively, compared to 24.75% and 3.99% obtained with [12]. The results demonstrate that considering scheduling and customization together with communication cost results in better efficiency in the synthesis of heterogeneous MPSoCs.

V. Conclusion

After discussing the issues that affect MPSoC performance, we have introduced a synthesis algorithm that performs scheduling and customization simultaneously and attempts to map tasks with heavy communication onto the same processor. It uses an incremental and iterative process to select the best possible custom instructions across all iterations. Experimental results confirm that the proposed algorithm improves efficiency. The obtained results can be used as a golden model for implementations.

References

[1] G. Martin, "Overview of the MPSoC design challenge," in Proceedings of the 43rd annual Design Automation Conference, 2006, pp. 274–279.

[2] A. De Gloria and P. Faraboschi, "An evaluation system for application specific architectures," in Microprogramming and Microarchitecture. Micro 23. Proceedings of the 23rd Annual Workshop and Symposium., Workshop on, 1990, pp. 80–89.

[3] L. Pozzi, K. Atasu, and P. Ienne, "Exact and approximate algorithms for the extension of embedded processor instruction sets," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 25, no. 7, pp. 1209–1229, 2006.

[4] N. Clark, H. Zhong, and S. Mahlke, "Processor acceleration through automated instruction set customization," in Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, 2003, p. 129.

[5] A. Yazdanbakhsh, M. E. Salehi, and S. M. Fakhraie, "Architecture-Aware Graph-Covering Algorithm for Custom Instruction Selection," in Future Information Technology (FutureTech), 2010 5th International Conference on, pp. 1–6.

[6] Y. K. Kwok and I. Ahmad, "Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors," Parallel and Distributed Systems, IEEE Transactions on, vol. 7, no. 5, pp. 506–521, 1996.

[7] M. Y. Wu and D. D. Gajski, "Hypertool: A programming aid for message-passing systems," Parallel and Distributed Systems, IEEE Transactions on, vol. 1, no. 3, pp. 330–343, 1990.

[8] A. Khan, C. L. McCreary, and M. Jones, "A comparison of multiprocessor scheduling heuristics," 1994.

[9] C. H. Liu, C. F. Li, K. C. Lai, and C. C. Wu, "A dynamic critical path duplication task scheduling algorithm for distributed heterogeneous computing systems," in Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, 2006, vol. 1, 8 pp.

[10] H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE Transactions on Parallel and Distributed Systems, pp. 260–274, 2002.

[11] G. Martin, "Multi-Processor SoC-Based Design Methodologies Using Configurable and Extensible Processors," Journal of Signal Processing Systems, vol. 53, no. 1, pp. 113–127, 2008.

[12] F. Sun, S. Ravi, A. Raghunathan, and N. Jha, "A Framework for Extensible Processor Based MPSoC Design," Designing Embedded Processors, pp. 65–95, 2007.

[13] L. Chen, N. Boichat, and T. Mitra, "Customized MPSoC synthesis for task sequence," in Application Specific Processors (SASP), 2011 IEEE 9th Symposium on, pp. 16–21.

[14] T. T. Ye, G. D. Micheli, and L. Benini, "Analysis of power consumption on switch fabrics in network routers," in Proceedings of the 39th annual Design Automation Conference, 2002, pp. 524–529.

[15] "MIPS Technologies Home." [Online]. Available: http://www.mips.com/. [Accessed: 13-Nov-2011].

[16] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "Mediabench: A tool for evaluating and synthesizing multimedia and communications systems," in MICRO, 1997, p. 330.

[17] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on, 2001, pp. 3–14.

[18] R. Ramaswamy and T. Wolf, "PacketBench: A tool for workload characterization of network processing," in Workload Characterization, 2003. WWC-6. 2003 IEEE International Workshop on, 2003, pp. 42–50.
