
IET Circuits, Devices & Systems

Research Article

Inter-coarse-grained reconfigurable architecture reconfiguration technique for efficient pipelining of kernel-stream on coarse-grained reconfigurable architecture-based multi-core architecture

IET Circuits Devices Syst., pp. 1–15
© The Institution of Engineering and Technology 2015

ISSN 1751-858X
Received on 31st January 2015
Revised on 2nd July 2015
Accepted on 13th July 2015
doi: 10.1049/iet-cds.2015.0047
www.ietdl.org

Yoonjin Kim ✉, Hyejin Joo, Sohyun Yoon

Department of Computer Science, Sookmyung Women’s University, Seoul 140-742, South Korea

✉ E-mail: [email protected]

Abstract: Coarse-grained reconfigurable architecture (CGRA)-based multi-core architecture aims at achieving high performance by kernel-level parallelism (KLP). However, the existing CGRA-based multi-core architectures suffer from high energy consumption and performance bottleneck when trying to exploit the KLP because of poor resource utilisation caused by insufficient flexibility. In this study, the authors propose a new ring-based sharing fabric (RSF) to boost their flexibility level for the efficient resource utilisation focusing on the kernel-stream type of the KLP. In addition, they introduce a novel inter-CGRA reconfiguration technique for the efficient pipelining of kernel-stream based on the RSF. Experimental results show that the proposed approaches improve performance by up to 88.8% and reduce energy by up to 48.2% when compared with the conventional CGRA-based multi-core architectures.

1 Introduction

The flexibility of a system is very important to accommodate the short time-to-market requirements for embedded systems. On the other hand, application-specific optimisation of an embedded system becomes inevitable because, to satisfy market demand, designers must meet tighter constraints on cost, performance, and power. To reconcile these incompatible demands, coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable solution for embedded systems [1, 2]. It can boost performance by adopting multiple processing elements (PEs) while it can be reconfigured to adapt to the evolving characteristics of embedded applications such as audio, video, and graphics processing. However, there is a limit when a CGRA is expected to improve the performance of an entire application. This is because a single CGRA is optimised for the parallelised computations of one kernel at a time, whereas the overall speedup of the entire application can be achieved by kernel-level parallelism (KLP), in which several kernels run concurrently. Therefore, such a limitation of single CGRA has resulted in the appearance of CGRA-based multi-core architecture, which allows the multi-CGRA to support diverse KLPs – running separate kernels or inter-dependent kernels (a kernel-stream) in parallel.

However, the existing CGRA-based multi-core architectures suffer from high energy consumption and performance bottleneck when trying to achieve the KLP. This is because the existing multi-CGRA structures are not flexible enough to adaptively support the various cases of the KLP. This means that the resources in the multi-CGRAs cannot be efficiently utilised under a monotonous aggregation of several CGRAs. Therefore, boosting their flexibility level for efficient resource utilisation is a serious concern.

In this paper, we propose a new multi-CGRA fabric with a novel reconfiguration technique focusing on the kernel-stream type of the KLP. The proposed multi-CGRA fabric is a new ring-based sharing fabric (RSF) for raising the flexibility level of CGRA-based multi-core architectures. In addition, the novel technique is an inter-CGRA reconfiguration technique on the RSF for maximising utilisation of the resources in the multi-CGRA. The validation of the proposed approaches is demonstrated through RT-level (RTL) design, synthesis, and simulation of the multi-core architectures with varying numbers of CGRAs. Compared with the general multi-CGRAs, the RSFs show area/delay/power overhead of up to 19%/10%/21.73% with increasing number of CGRAs (4–16). In addition, gate-level simulation of KLP test benches results in performance improvement by up to 88.8% and energy saving by up to 48.2% compared with the conventional architectures.

This paper is organised as follows. After the preliminaries in Section 2, we describe the related works in Section 3. In Section 4, we present the motivation of our approaches. Then we propose the inter-CGRA reconfiguration technique on the RSF in Section 5, and the experimental results are given in Section 6. Finally, we conclude this paper in Section 7.

2 Preliminaries

2.1 CGRA-based multi-core architecture

Typically, a CGRA-based multi-core architecture includes general purpose processors (GPPs), multi-CGRA, and their interface. Fig. 1a shows such an example of the CGRA-based multi-core architecture – it is composed of a GPP, a DMA, four CGRAs, and on-chip communication architecture such as networks-on-chip (NoCs) or an on-chip bus which couples them. The GPP executes control-intensive, irregular code segments, and the multi-CGRA performs data-intensive kernel code segments – in this paper, we make use of the multi-CGRA in Fig. 1a as the base architecture for comparison with the proposed architecture.

Each CGRA consists of PE array (PA), data buffer (DB), configuration memory (CM), and execution controller (EC). The PA has identical PEs containing functional units and a few storage units. The PA has reconfigurable interconnections between PEs for efficient data transfer. The DB provides operand data to the PA through a high-bandwidth data bus. The CM is composed of configuration elements (CEs), and each CE provides a context word to configure each PE. The EC has control data that contains execution cycles, read/write mode, and addresses of the DB and the CE for correct operations of the PA.

Fig. 1 CGRA-based multi-core architecture with symbolic representation of kernel running
a CGRA-based multi-core architecture
b Symbols for kernel
c Symbols for CGRA

2.2 Symbolic representation

In this paper, we raise the problems of resource utilisation in the conventional multi-CGRA and propose new approaches to overcome them. A panoptic illustration of resource utilisation in multi-CGRA is therefore necessary for an intelligible explanation of our approaches. In this section, we define an efficient symbolic notation for showing such a utilisation status, as in Figs. 1b and c. They show the symbolic representation of the resource utilisation with CM/DB usage when kernel Ki runs on a CGRA. The meanings of the symbols for kernel and CGRA are defined in Figs. 1b and c, respectively.


3 Related works

Until now, there have been a few multi-core architecture projects based on CGRAs [1–16] for kernel-level parallelism [6, 7–9]. However, most of them are monotonous aggregations of several CGRAs. For example, Samsung reconfigurable processor (SRP)-based multiprocessors have been presented in [17, 18]. In [17], the multiprocessor consists of 16 reconfigurable processors connected through NoC interconnection with mesh topology for exploiting the data parallelism of volume rendering. However, the experimental result shows only modest performance improvement compared with CPU/GPU implementations, because of the performance limitations of a general NoC-based multi-core architecture. Another SRP-based multi-core architecture is shown in [18] – it is composed of an ARM9 processor, two CGRAs, and an AMBA high-performance bus which couples them. Even though they have demonstrated a software implementation of digital video broadcasting-second generation terrestrial (DVB-T2) on dual CGRAs running at 400 MHz, this work also shows performance limitations because of inefficient resource utilisation in the multi-core architecture. In addition, power/energy evaluation of the multiprocessors is not shown in either case [17, 18].

The Xentium tiles [19] are another example of CGRA-based multi-core architecture. The Xentium is a programmable digital signal processing tile, and the different tile processors are connected to a router on an NoC. It turns out that the hyperspectral image compression algorithm can indeed be efficiently mapped on this multi-tiled architecture, and adding more tiles gives a close to linear speedup. However, it is unclear whether other applications can also be successfully mapped onto an architecture already specialised for the compression algorithm. In addition, adding more tiles increases power consumption as well as speedup, but such a power issue has not been dealt with in [19].

Multi-core architecture with dynamically reconfigurable array processors [20] is more flexible than [17–19, 21] because the shared data-memory banks are connected to all processing cores through cross-bar switches, whereas communication among the CGRAs is restricted to an NoC or on-chip bus in [17–19, 21]. However, the centralised shared data-memory banks may cause a performance bottleneck with high power consumption when the number of cores increases. In addition, there is no quantitative evaluation and analysis of power, area, and timing with increasing number of cores in [20].

Fig. 2 Example #1 – pipelining of the kernel-stream composed of four interdependent kernels on the base multi-CGRA
a Kernel-stream with four interdependent kernels
b Kernel-stream iteratively runs on the base multi-CGRA

4 Motivation

In this section, we present the motivation of our approaches. The main motivation comes from the resource utilisation problems that arise when trying to exploit KLP on multi-CGRA. Even though various cases of the KLP can be considered, we focus on the pipelining of kernel-stream, which is the most complex and ever-changing case.

4.1 Pipelining of kernel-stream

Pipelining of kernel-stream is a type of the KLP in which interdependent kernels (a kernel-stream) iteratively run on the multi-CGRA in a pipelined manner. In this case, each kernel may be mapped onto a CGRA without any resource-utilisation problem if each kernel requires less CM/DB usage than the CM/DB capacity of each CGRA. However, there are more cases causing poor utilisation of resources, as in Example #1 in Fig. 2 – four interdependent kernels (a kernel-stream) iteratively run on the base multi-CGRA. First of all, Example #1 shows a lack of DB resources when mapping kernel KA on CGRA #1 as in Fig. 3a – KA requires 400% DB usage for 40 iterations, whereas the DB capacity of a CGRA is 200%. In this case, the DMA transfer 'to DB1 (200%)' and the CGRA computation 'Pipelining for 21–40 iterations' must be performed sequentially because of the insufficient DB capacity. Such sequential operation causes a performance bottleneck. However, if a DB has sufficient capacity (400%), it allows overlap of the DMA transfer with the CGRA computation without a performance bottleneck, as at the bottom of Fig. 3a.
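The benefit of overlapping DMA transfer with computation can be sketched with a toy cycle-count model (this sketch and its cycle numbers are our own illustration, not taken from the paper):

```python
# Illustrative sketch: compare total cycles when a DMA transfer for the next
# batch must wait for the current computation to finish (insufficient DB
# capacity) versus when it can run concurrently (enough DB capacity for
# double buffering). The helper and cycle counts are hypothetical.

def total_cycles(batches, t_dma, t_comp, double_buffered):
    """Cycles to process `batches` input batches, where each batch needs
    t_dma cycles of DMA transfer into the DB and t_comp cycles of PA
    computation."""
    if double_buffered:
        # Only the first transfer is exposed; afterwards the transfer of
        # batch i+1 overlaps the computation of batch i.
        return t_dma + batches * t_comp + max(0, t_dma - t_comp) * (batches - 1)
    # Transfer and computation strictly alternate.
    return batches * (t_dma + t_comp)

seq = total_cycles(batches=2, t_dma=100, t_comp=400, double_buffered=False)
ovl = total_cycles(batches=2, t_dma=100, t_comp=400, double_buffered=True)
print(seq, ovl)  # 1000 900: the second DMA transfer is hidden
```

With DB capacity for only one batch, transfer and computation serialise; with capacity for two, the second transfer hides behind computation, which is the contrast shown at the bottom of Fig. 3a.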

In addition, pipelining of kernel-stream on the base multi-CGRA causes another case of performance bottleneck and waste of energy. This is because DB read/write operations occur frequently when the result data from the previous CGRA are transferred to the next CGRA as input data through the on-chip communication architecture.


Fig. 3 Inefficient resource utilisation of Example #1

a Lack of resources (DB)
b Pipeline-scheduling on the base multi-CGRA
c Pipeline-scheduling on the multi-CGRA with direct interconnections
d Ideal multi-CGRA for Example #1

Fig. 3b shows these problems in Example #1 in detail. DMA transfer through the on-chip communication architecture entails frequent DB read/write operations, which leads to wasted energy and a performance bottleneck. However, if adjacent PAs are directly connected together as in Fig. 3c, direct data transfer is possible without DB read/write operations passing through the on-chip communication architecture.

To sum up, the kernel-stream (KA–KD) in Example #1 can be successfully mapped on the ideal multi-CGRA as in Fig. 3d, reducing energy and enhancing performance compared with the base multi-CGRA. A point to consider is that the ideal multi-CGRA does not include more CM/DB resources than the base multi-CGRA – as shown at the bottom of Fig. 3d, its total CM capacity (400%) is the same as that of the base multi-CGRA, and its total DB capacity (600%) is less than the capacity (800%) of the base multi-CGRA. Therefore, the ideal multi-CGRA is only a configuration with different numbers of DBs per PA and direct interconnections, while keeping within the bounds of the total capacity.

4.2 Necessity of inter-CGRA reconfiguration

As mentioned in the previous section, the base multi-CGRA suffers from a performance bottleneck and high energy consumption when trying to achieve the kernel-stream type of the KLP. This is because such a monotonous aggregation of several CGRAs is not flexible enough to support efficient resource utilisation. We hypothesise that a multi-CGRA can support component-level (CM, DB, or PA) inter-CGRA reconfiguration, meaning that the use of each component is not limited to a single CGRA. Then the CGRA-based multi-core architecture can be optimised for performance and energy, because such inter-CGRA reconfiguration can efficiently utilise the component-level resources of the multi-CGRA in various cases of kernel-streams. In the next section, we propose a new multi-CGRA fabric and a novel reconfiguration technique that support the inter-CGRA reconfiguration.

5 Inter-CGRA reconfiguration on RSF

In this section, we propose a new RSF to boost the flexibility of CGRA-based multi-core architectures for efficient resource utilisation, focusing on the kernel-stream type of the KLP. In addition, we introduce a novel inter-CGRA reconfiguration technique for the efficient pipelining of kernel-stream based on the RSF.

5.1 Design objectives

We can easily conceive a highly flexible fabric for inter-CGRA reconfiguration, as in Fig. 4. It shows the completely connected fabric (CCF) based on four CGRAs, which seems a good candidate to facilitate inter-CGRA reconfiguration – the CCF can enable any combination of mapping between all of the CMs (or DBs) and all of the PAs. However, such full connectivity causes significant area and power overhead as the number of CGRAs increases – as shown in Fig. 4, the bit-width of the interconnections between the components is not small. On the other hand, too little connectivity may degrade the reconfigurability of inter-CGRA. Therefore, in Sections 5.2 and 5.3, we propose the RSF

Fig. 4 Example of CCF

a CM–PA interconnections
b PA–PA/DB interconnections


and intra/inter-CGRA co-reconfiguration, which come close to both of the following design objectives:

† Design objective #1: The multi-CGRA fabric should show minimal interconnection overhead even though the number of CGRAs increases.
† Design objective #2: The multi-CGRA fabric should be as reconfigurable as the CCF.

5.2 Ring-based sharing fabric

The proposed multi-CGRA fabric based on four CGRAs is shown in Fig. 5a – it is called the RSF. The RSF connects all of the PAs through single-cycle interconnections, and a DB (or a CM) is shared by two adjacent PAs on the RSF. Such connectivity fits in well with design objective #1 because the design overhead is only the interconnections and switching logics between two adjacent PAs. Therefore, the overhead is trivial even when the number of CGRAs increases, as in Fig. 5b – it shows another RSF composed of eight CGRAs. The next section illustrates inter-CGRA reconfiguration on the RSF with the previous examples, and the suitability of the RSF for design objective #2 is evaluated in Section 5.2.2.
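The different wiring growth of the two fabrics can be sketched with a toy link-count model (our own simplification, not the paper's exact wiring: we assume every CM and DB reaches every PA in the CCF, while in the RSF each CM/DB reaches only its two adjacent PAs plus n PA-to-PA ring links):

```python
# Rough interconnect-count model for n CGRAs (n PAs, n CMs, n DBs).
# CCF: any CM or DB can be mapped to any PA -> link count grows as n^2.
# RSF: each CM/DB is shared by its 2 adjacent PAs, plus a ring of n
# PA-to-PA interconnections -> link count grows linearly in n.

def ccf_links(n):
    # n CMs x n PAs plus n DBs x n PAs
    return 2 * n * n

def rsf_links(n):
    # 2 links per CM + 2 links per DB + n ring links between PAs
    return 2 * 2 * n + n

for n in (4, 8, 16):
    print(n, ccf_links(n), rsf_links(n))  # quadratic vs linear growth
```

Under this model the CCF needs 512 links at 16 CGRAs against 80 for the RSF, which is the intuition behind design objective #1.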

5.2.1 Example of inter-CGRA reconfiguration: Figs. 3a–c show that Example #1 causes wasted energy and a performance bottleneck. However, the proposed RSF can be configured by inter-CGRA reconfiguration as in Fig. 5c, which is equivalent to the ideal multi-CGRA of Fig. 3d. Therefore, the pipelining of the kernel-stream can be successfully mapped on the RSF, reducing energy and enhancing performance compared with the base multi-CGRA.

5.2.2 Suitability of RSF for design objective #2: The previous example of inter-CGRA reconfiguration shows very successful mapping cases on the RSF with ideal resource utilisation. This is because the PAs in Example #1 fortunately utilise at most one CM or two DBs – the RSF structurally allows a PA to utilise up to two CMs or two DBs. However, if a PA requires three or more CMs/DBs, the RSF seems far from design objective #2 – the CCF allows any PA to utilise all of the CMs and DBs on the fabric. Therefore, how to alleviate this structural limitation of the RSF is the key to coming close to design objective #2. In the next section, we propose such a key technique on the RSF for supporting efficient resource utilisation.


Fig. 5 Inter-CGRA reconfiguration on RSF

a RSF organisation
b RSF including eight CGRAs
c Inter-CGRA reconfiguration for Example #1

5.3 Intra/inter-CGRA co-reconfiguration

Fig. 6a illustrates an example in which the pipelining of kernel-stream requires three DBs (500%) and iteratively runs on the RSF 50 times – each data-set (100%) includes operand data for ten iterations. In this example, the lack of DB resources may be exposed on the RSF, but we can alleviate the limitation of the RSF by shifting the configuration of the kernel-stream on multiple CGRAs. Figs. 6b and c illustrate how to exploit intra/inter-CGRA co-reconfiguration in order to achieve the shifting configuration. First of all, Fig. 6b shows the initial configuration of the kernel-stream, in which PA1 utilises DB1 (D1 and D2) and DB4 (D3 and D4) for the running of 40 iterations. Then the RSF can be configured as in Fig. 6c, which shows the utilisation of one more DB


Fig. 6 Intra/inter-CGRA co-reconfiguration for running kernel-stream with three DBs

a Pipelining of kernel-stream with three DBs
b Configuration #1
c Configuration #2
d Pipeline-scheduling on the RSF according to two cases of configurations

(DB3) for the remaining ten iterations. The utilisation of DB3 (D5) can be achieved by shifting the configurations of the PAs from 'PA1→PA2→PA3' to 'PA4→PA1→PA2'. Therefore, the RSF operates as if a PA were connected to three DBs. In this case, the intra-CGRA reconfiguration means that PA1 and PA2 are reconfigured twice in order to perform KA/KB and KB/KC. On the other hand, the inter-CGRA reconfiguration enables three CGRAs to be configured with different numbers of CMs/DBs and connected through the direct interconnections. Such co-reconfiguration can start immediately because each CM is shared by two adjacent PAs that are dynamically reconfigurable. This means that the pipelining of the kernel-stream runs continually on the RSF without stall, as shown in Fig. 6d.
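The shifting configuration can be sketched as a rotation of the window of active PAs around the ring: the kernel chain keeps its order, but the window moves one position so that a border PA gains access to a previously unused neighbouring DB. The PA names follow Figs. 6 and 7; the rotation helper itself is our illustration:

```python
# Sketch of "shifting configuration" on a 4-CGRA RSF: a contiguous window
# of PAs runs the kernel chain, and each shift rotates the window one step
# backwards around the ring (hypothetical helper, PA names from Fig. 6).

RING = ["PA1", "PA2", "PA3", "PA4"]

def shift_window(window, ring=RING):
    """Rotate a contiguous PA window one step backwards around the ring."""
    i = ring.index(window[0])
    return [ring[(i - 1) % len(ring)]] + window[:-1]

conf1 = ["PA1", "PA2", "PA3"]   # initial mapping of KA -> KB -> KC
conf2 = shift_window(conf1)
print(conf2)  # ['PA4', 'PA1', 'PA2']
```

Applying the helper a second time yields ['PA3', 'PA4', 'PA1'], which corresponds to the second shift used for the four-DB example of Fig. 7.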

Fig. 7a shows another example in which the pipelining of kernel-stream requires four DBs (700%) and iteratively runs on the RSF 70 times. Almost the same approach applies here as in the previous mapping with three DBs (Fig. 6), but a more elaborate co-reconfiguration is needed because the shifting configuration of the kernel-stream must be performed sequentially twice in order to utilise two more DBs (DB2 and DB3). Therefore, first of all, the CMs should be initialised as in Fig. 7b. Unlike the previous example, CM4 and CM2 include both data-sets CA and CC. In addition, CB is also stored in CM3 as well as CM1, whereas CM3 is not used in the previous RSF. Then Configuration #1 (Fig. 7c) and Configuration #2 (Fig. 7d) work for the running of 40 iterations and 20 iterations in the same manner as the previous case with three DBs. Furthermore, the RSF is finally configured as in Fig. 7e, which is made possible by the initialisation of CMs in Fig. 7b. Configuration #3 shows the utilisation of DB2 (D7) for the remaining ten iterations. The utilisation of DB2 can be achieved by shifting the configurations of the PAs a second time, from 'PA4→PA1→PA2' to 'PA3→PA4→PA1'. Therefore, the RSF operates as if a PA were connected to four DBs. In addition, PA1 is reconfigured three times in order to perform KA/KB/KC, but the capacity of two CMs (CM1 and CM4) is enough to support three different configurations, as if PA1 were connected to three CMs. Finally, Fig. 7f shows the pipeline-scheduling of the kernel-stream with four DBs that runs continually on the RSF without performance degradation. In this way, intra/inter-CGRA co-reconfiguration can be exploited to map this example, with up to three CMs and four DBs, on the RSF.

5.4 Efficient configuration-data assignment

In the previous example (Fig. 7), CM4 and CM2 in the RSF include both CA with 50% and CC with 40% in order to support the series of shifting configurations of the kernel-stream. However, if the CA usage is 70% as in Fig. 8a, CA cannot be stored in a CM with CC because that would exceed the CM capacity (100%), as in Fig. 8b. However, Fig. 8c shows a case of initialisation that does not exceed the CM capacity while still supporting the co-reconfiguration of Fig. 7 with the same performance. Therefore, it is necessary to make the optimal pipeline-scheduling within the CM capacity. For this reason, we propose an algorithm for efficient configuration-data assignment on the RSF, as shown in Fig. 9.

Fig. 7 Intra/inter-CGRA co-reconfiguration for running kernel-stream with four DBs
a Pipelining of kernel-stream with four DBs
b Initialisation
c Configuration #1
d Configuration #2
e Configuration #3
f Pipeline-scheduling on the RSF according to three cases of configurations

The algorithm starts with input parameters from the RSF and the kernel-stream. Then it makes a pipeline-scheduling with the best performance possible. A scheduling case within the CM capacity is saved as a candidate, and the algorithm tries to make another pipeline-scheduling with the same performance. If the CM capacity allows no more scheduling at the current performance level, it selects the scheduling with the minimum CM usage among the candidates. Finally, it automatically assigns configuration-data/operand-data on the CMs/DBs with feasible CM/DB control information. However, if there is no candidate, it looks for another scheduling by adjusting the performance level downward. Such an example is shown in Fig. 10. The kernel-stream in Fig. 10a is identical to the previous example (Fig. 8a) except for a further increased CA (80%). This means that CA cannot be stored in a CM with CB because their combined usage (110%) exceeds the CM capacity, as in Fig. 10b. Meanwhile, Fig. 10c (or Fig. 10d) shows a case of initialisation without exceeding the CM capacity, and it also supports the pipelining of the kernel-stream with four DBs. However, a point to consider is that Fig. 10c (or Fig. 10d) implies lower performance because of the delayed pipeline-scheduling in Fig. 10i. This is because the shifting configuration cannot keep the same stream direction, as shown in Figs. 10e–g – Configuration #3 runs the kernel-stream in the opposite direction compared with Configurations #1 and #2. Finally, if no scheduling is possible despite lowering the performance, a lack of resources (CM) occurs.
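The candidate-selection step described above can be sketched as follows (a minimal simplification of the Fig. 9 algorithm, under our own assumptions: candidates are represented as per-CM usage maps, and the per-CM usages in the example are illustrative):

```python
# Minimal sketch of the selection step: among candidate schedulings at the
# current performance level, keep those whose per-CM usage fits the CM
# capacity and choose the one with minimum total CM usage; if none fits,
# the caller must retry at a lower performance level.

CM_CAPACITY = 100  # a CM holds 100% of one full configuration (Section 5.4)

def pick_scheduling(candidates, capacity=CM_CAPACITY):
    """candidates: list of dicts mapping CM name -> usage in %."""
    feasible = [c for c in candidates
                if all(u <= capacity for u in c.values())]
    if not feasible:
        return None  # no candidate fits: lower the performance level
    return min(feasible, key=lambda c: sum(c.values()))

# Fig. 8-style case: packing CA (70%) and CC (40%) into one CM needs 110%
# and fails, while an alternative placement keeps every CM within capacity.
failed = {"CM2": 110, "CM4": 110, "CM1": 50, "CM3": 50}
ok     = {"CM2": 70,  "CM4": 70,  "CM1": 90, "CM3": 90}
print(pick_scheduling([failed, ok]))  # the placement that fits
```

Returning None here corresponds to the algorithm's fallback of adjusting the performance level downward, and ultimately to the lack-of-resources outcome when no level succeeds.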

6 Experiments and results

6.1 Architecture implementation

To demonstrate the quantitative effectiveness of the proposed approaches, we have implemented three different organisations of multi-CGRA with variation in the number of CGRAs, as shown in Table 1 – 4, 8, 12, and 16 CGRAs that include the identical


Fig. 8 Comparison between two cases of RSF initialisation

a Pipelining of kernel-stream with four DBs and increased CA (70%)
b Failed initialisation
c Successful initialisation

Fig. 9 Algorithm for efficient configuration-data assignment


Fig. 10 Mapping the pipelining of the kernel-stream with four DBs and increased CA (80%) on RSF

a Pipelining of kernel-stream with four DBs and increased CA (80%)
b Failed initialisation
c, d Successful initialisation
e Configuration #1
f Configuration #2
g Configuration #3
h Pipeline-scheduling of the kernel-stream with CA (70%)
i Pipeline-scheduling of the kernel-stream with CA (80%)


Table 1 Multi-CGRA implementation at the RTL with Verilog

Architecture type                     Number of CGRAs
only bus-connected baseline (BASE)    4, 8, 12, 16
completely connected fabric (CCF)     4, 8, 12, 16
RSF                                   4, 8, 12, 16

Table 2 Single CGRA implementation at the RTL with Verilog

Components    Parameters                                    Value
PA            bit-width of registers in a PE                16 bit
              number of registers in a PE                   4
              number of PEs                                 4 × 4 (16)
CM (4 KB)     bit-width of a CE                             32 bit
              number of layers for a CE                     64
              number of CEs                                 4 × 4 (16)
DB (1.5 KB)   number of sets                                2
              number of banks in a set                      3
              bit-width of a bank (dual-port A/B)           32/64 bit
              number of layers for a bank (dual-port A/B)   64/32

CGRAs specified in Table 2. We have designed them at RTL using Verilog HDL and synthesised gate-level circuits using Design Compiler [22] with a 90 nm generic library [22] to analyse hardware cost.

6.2 RTL-synthesis results

In this section, RTL synthesis results are shown with varying numbers of CGRAs to demonstrate the cost-effectiveness of the proposed approach in reducing area, delay, and power consumption compared with the CCF.

Table 3 Area comparison

No. of CGRAs  Arch.  Net interconnect area  Total cell area  Total gate equivalent  Increased, %  Reduced, %
4             BASE   311,216                5,813,381        6,124,597              –             –
              CCF    562,318                6,167,122        6,729,440              10            –
              RSF    491,591                6,094,900        6,586,491              8             2
8             BASE   624,679                11,626,367       12,251,046             –             –
              CCF    1,585,952              13,016,128       14,602,080             19            –
              RSF    1,117,076              12,510,034       13,627,110             11            7
12            BASE   939,966                17,438,521       18,378,487             –             –
              CCF    3,274,543              20,465,091       23,739,634             29            –
              RSF    1,940,685              19,226,564       21,167,249             15            11
16            BASE   1,254,581              23,250,721       24,505,302             –             –
              CCF    5,706,329              28,993,032       34,699,361             42            –
              RSF    2,896,255              26,275,472       29,171,727             19            16

Total: net interconnect area + total cell area. Increased: increase rate of area compared with BASE, [(CCF or RSF)/BASE − 1] × 100. Reduced: reduction rate of area compared with CCF, (1 − RSF/CCF) × 100.

Table 4 Critical path delay comparison

       No.1 = 4               No.1 = 8               No.1 = 12              No.1 = 16
Arch.  D2, ns  Inc3, %  R4, %  D2, ns  Inc3, %  R4, %  D2, ns  Inc3, %  R4, %  D2, ns  Inc3, %  R4, %
BASE   3.4     –        –      3.4     –        –      3.4     –        –      3.4     –        –
CCF    3.88    14.1     –      3.89    14.4     –      3.99    17.4     –      4.26    25.3     –
RSF    3.74    10       3.6    3.74    10       3.9    3.74    10       6.3    3.74    10       12.2

No.1: Number of CGRAs, D2: Critical path delay
Inc3: Increase rate of delay compared with BASE, [(CCF or RSF)/BASE − 1] × 100
R4: Reduction rate of delay compared with CCF, (1 − RSF/CCF) × 100


6.2.1 Area evaluation: Table 3 shows the area cost evaluation for the three cases of multi-CGRA with an increasing number of CGRAs. In the case of four CGRAs, the area costs of the CCF and the RSF have only increased by 10 and 8%, respectively, compared with the BASE. However, as the number of CGRAs continues to increase from 8 to 16, the area of the CCFs significantly increases by 19–42% because of its heavy interconnections and switching logics. On the other hand, the area of the RSFs gradually increases by 11–19% because only the interconnections and switching logics between two adjacent PAs are added as the number of CGRAs increases. Therefore, the proposed RSF is a more area-efficient fabric compared with the CCF – the area reduction rate ranges from 2 to 16%.
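The increase and reduction rates quoted above follow directly from the footnote formulas of Table 3. The small check below is our own restatement of those formulas (not the authors' tooling), applied to the 16-CGRA totals from the table:

```python
# Rate formulas from the Table 3 footnotes (also used in Tables 4 and 5):
#   Increased = ((CCF or RSF)/BASE - 1) * 100
#   Reduced   = (1 - RSF/CCF) * 100

def increase_rate(arch_total, base_total):
    """Increase rate (%) of an architecture relative to BASE."""
    return (arch_total / base_total - 1) * 100

def reduction_rate(rsf_total, ccf_total):
    """Reduction rate (%) of RSF relative to CCF."""
    return (1 - rsf_total / ccf_total) * 100

# Total gate-equivalent areas for 16 CGRAs (Table 3).
base, ccf, rsf = 24_505_302, 34_699_361, 29_171_727

print(round(increase_rate(ccf, base)))   # CCF grows ~42% over BASE
print(round(increase_rate(rsf, base)))   # RSF grows ~19% over BASE
print(round(reduction_rate(rsf, ccf)))   # RSF is ~16% smaller than CCF
```

The rounded results (42, 19, 16%) match the last row of Table 3, confirming the internal consistency of the table.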

6.2.2 Delay evaluation: Table 4 shows the comparison of critical path delay in the three cases of multi-CGRA with a varying number of CGRAs. In the case of the CCF, the critical path delay has considerably increased by 14.1–25.3% compared with the BASE. This is because more complex switching logics enabling the full connectivity are included in the set of critical paths of the CCF when the number of CGRAs increases. However, the RSFs show the same increase rate (10%) of delay regardless of the number of CGRAs because adding CGRAs while keeping the ring shape does not affect the critical path delay. Therefore, the proposed RSF is more efficient than the CCF in terms of the critical path delay – the delay reduction rate ranges from 3.6 to 12.2%.

6.2.3 Power evaluation: We have evaluated the power consumption of the three cases of multi-CGRA with an increasing number of CGRAs as shown in Table 5. First of all, both the CCF and the RSF including four CGRAs show an insignificant increase rate (11 and 5%) of power compared with the BASE. However,




Table 5 Power comparison

No.1  Arch.  Dynamic power, mW         Increased5, %        Reduced6, %
             Net2    Logic3  Total4    Net    Logic  Total
4     BASE   0.5463  0.2263  0.7726    –      –      –       –
      CCF    0.5911  0.2640  0.8551    8.2    16.7   11      –
      RSF    0.5578  0.2543  0.8121    2.1    12.4   5       5
8     BASE   1.0307  0.3377  1.3684    –      –      –       –
      CCF    1.8859  0.7586  2.6445    83     124.6  93      –
      RSF    1.1212  0.4287  1.5499    9      27     13      41
12    BASE   1.5123  0.4595  1.9718    –      –      –       –
      CCF    2.8085  1.0898  3.8983    85.71  137.2  98      –
      RSF    1.6698  0.7305  2.4003    10.4   59     22      38
16    BASE   2.0252  0.6400  2.6652    –      –      –       –
      CCF    3.8230  1.2867  5.1097    88.8   101.1  92      –
      RSF    2.0955  0.7644  2.8599    3.5    19.4   7       44

No.1: Number of CGRAs, Net2: Net switching power, Logic3: Cell internal power, Total4: Net2 + Logic3
Increased5: Increase rate of power compared with BASE, [(CCF or RSF)/BASE − 1] × 100
Reduced6: Reduction rate of power compared with CCF, (1 − RSF/CCF) × 100

Table 6 Test benches of kernel-streams

Kernel                         Kernel-stream
ID  Name            Cycles1    ID    Kernel-Sequence2                   DBs3
a   Complex Multa   6          4K1   5j1b3o                             3
b   convolutiona    15         4K2   5a5f2o2i                           3
c   FIRa            4          4K3   5h3i5n1b                           3
d   real updatea    4          4K4   6o5n1b                             4
e   Hydrob          8          4K5   10c2b5a                            4
f   Inner Productb  2          4K6   5d5j5e5c                           4
g   Tri-Diagonalb   7          8K1   9k3h1k3n2i1e3h                     3
h   First-Diffb     3          8K2   7e4f2c1e2c2d4f1p                   3
i   First-Sumb      5          8K3   2g2a2c2j2d4h2l2j                   3
j   Dequantc        6          8K4   9k3h1k1e2i1p3n                     4
k   Quantc          9          8K5   10d10h5j5a2b5m6o                   4
l   Whetd           7          8K6   2i1k5f2c2o5f2d                     6
m   Primed          6          8K7   4j10f4i5c5d4o2k                    6
n   Minverd         3          8K8   10f1b3o3c5h1k2i1l                  6
o   Cntd            5          8K9   1i1c1d1o2f1h1n                     8
p   Bsort100d       8          8K10  1b1i1o1c1d2f1n1h                   8
—                              12K1  1p1o1c1d1i1h1n1d1c1o1h             3
                               12K2  2o2i2c2d1k1e1p2n2o2c2d2i           3
                               12K3  1k1h1n1i1o1c1d1n1h1o1i             6
                               12K4  1c1d1o1c2f1n1d1h2f1i1c1o           6
                               12K5  1c1d1o1n1d1h1i1c1o1n1i             12
                               12K6  1n1d1h2f1i1c1o1c1d1o1c1n           12
                               16K1  1k1i1c1d1o2f1h1n1i1o1c1d1n1h2f     3
                               16K2  1k1a2i3h2i1k1h2o1d2c5f2c2k1p1i1c   3
                               16K3  1k1h1n1i1o1c1d1n1h2f1i1c1d1o2f     8
                               16K4  1c1d1o1c2f1n1d1h2f1i1c1d1o2f1c1i   8
                               16K5  2f1i1c1d1k1h1n1i1o2f1c1i1o2f1d     16
                               16K6  1c1d2f1h1n2f1c1d2f1h1n1d1c1h1c2f   16

aDSPstone benchmarks [23], bLivermore loop benchmarks [24], cMPEG, dWCET benchmarks [25]
Cycles1: Single execution cycles on a CGRA
Kernel-Sequence2: Each letter is a kernel ID and the number before it is the iteration count of that kernel
DBs3: Number of utilised DBs
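The Kernel-Sequence notation of Table 6 is compact enough that a small parser makes it concrete. The sketch below is ours (not part of the authors' toolchain): it splits a sequence string into (kernel, iterations) pairs and, using the per-kernel cycle counts from Table 6, computes the cycles a stream would need if every invocation ran back-to-back on a single CGRA — an illustrative serial baseline, not the pipelined figure reported in Section 6.3:

```python
import re

# Single-execution cycles per kernel, transcribed from Table 6.
CYCLES = {'a': 6, 'b': 15, 'c': 4, 'd': 4, 'e': 8, 'f': 2, 'g': 7,
          'h': 3, 'i': 5, 'j': 6, 'k': 9, 'l': 7, 'm': 6, 'n': 3,
          'o': 5, 'p': 8}

def parse_stream(seq):
    """Split a Kernel-Sequence string like '5j1b3o' into (kernel, iterations) pairs."""
    return [(k, int(n)) for n, k in re.findall(r'(\d+)([a-p])', seq)]

def serial_cycles(seq):
    """Cycles if every kernel invocation ran back-to-back on one CGRA."""
    return sum(CYCLES[k] * n for k, n in parse_stream(seq))

print(parse_stream('5j1b3o'))   # [('j', 5), ('b', 1), ('o', 3)]  -> stream 4K1
print(serial_cycles('5j1b3o'))  # 5*6 + 1*15 + 3*5 = 60
```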

Table 7 Mapping kernel-streams on RSF including four CGRAs

ID   Configuration-data assignment on CMs   Shifting Sequence1
     CM1  CM2  CM3  CM4
4K1  b    o    j    —                       1

the power of the CCFs with more CGRAs (8–16) seriously increases by 92–98% because of its huge interconnections and switching logics. Meanwhile, the RSFs with more CGRAs (8–16) show an increase rate of power ranging from 7 to 22% because relatively fewer interconnections and switching logics are added as the number of CGRAs increases. Therefore, the proposed RSF is a more power-efficient fabric compared with the CCF – the power reduction rate ranges from 5 to 44%.

4K2  f    o    i    a                       1
4K3  i    n    b    h                       1
4K4  n,b  b    o    o                       1, −1
4K5  b    a,c  b    c,a                     1, 1
4K6  j,c  e,d  c,j  d,e                     1, 1

Shifting Sequence1: ‘1’ means clockwise shifting and ‘−1’ means counterclockwise shifting on RSF. They are placed in execution order
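The shifting semantics in this footnote can be illustrated with a small ring model. The sketch below is our own assumption of the mechanics (which physical direction a positive rotation corresponds to is arbitrary here); it only shows how a shifting sequence permutes the CM contents seen around the ring, using the 4K4 row of Table 7 as input:

```python
from collections import deque

def apply_shifts(cm_contents, shifts):
    """Return the ring's CM contents before and after each shifting step.

    A step of +1 models clockwise shifting, -1 counterclockwise, per the
    Shifting Sequence convention of Tables 7-10 (direction mapping assumed).
    """
    ring = deque(cm_contents)
    views = [list(ring)]
    for s in shifts:
        ring.rotate(s)          # rotate the whole ring by one position
        views.append(list(ring))
    return views

# Table 7, stream 4K4: CMs hold {n,b}, {b}, {o}, {o}; shifts are 1 then -1.
views = apply_shifts([{'n', 'b'}, {'b'}, {'o'}, {'o'}], [1, -1])
for v in views:
    print(v)
```

After a clockwise step followed by a counterclockwise step the ring returns to its initial state, which is why opposite shifts can be paired in the execution order.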

6.3 Performance and energy evaluation

6.3.1 Algorithm implementation: We have implemented the configuration-data assignment algorithm (Fig. 9) in the C language. To demonstrate the effectiveness of the implemented algorithm,


Table 8 Mapping kernel-streams on RSF including eight CGRAs

ID    Configuration-data assignment on CMs                                        Shifting Sequence1
      CM1      CM2      CM3      CM4      CM5      CM6      CM7      CM8
8K1   h        k        n        k        e        h        —        j            1
8K2   f        c        e        c        d        f        p        e’           1
8K3   a        c        j        d        o        l        j        g            1
8K4   h,n      k,p      e,g      g        p,k      n,h      j        j            1, −1
8K5   h,a      j,b      a,m      b,o      m        o,d      h        d,j          1, 1
8K6   k,c      f,o,d    c,f      o,d,i    f,k      d,i,f    k,c      i,f,o        1, 1, 1, 1
8K7   f,c,o    i,d,k    c,o      d,k,j    o,f      k,j,i    f,c      j,i,d        1, 1, 1, 1
8K8   b,c,k    o,h,i    c,k,l    h,i,f    k,l,b    i,f,o    l,b,c    f,o,h        1, 1, 1, −1
8K9   c,o,h    d,f,n,i  o,h,c    f,n,i,d  h,c,o    n,i,d,f  c,o,h    i,d,f,n      1, 1, 1, 1, 1, 1
8K10  i,c,f,n  o,d,n,b  c,f,h,i  d,n,b,o  f,h,i,c  n,b,o,d  h,i,c,f  b,o,d,n      1, 1, 1, −1, −1, −1

Shifting Sequence1: ‘1’ means clockwise shifting and ‘−1’ means counterclockwise shifting on RSF. They are placed in execution order

Table 9 Mapping kernel-streams on RSF including 12 CGRAs

ID    Configuration-data assignment on CMs                                Shifting Sequence1
      CM1      CM2        CM3      CM4        CM5      CM6
      CM7      CM8        CM9      CM10       CM11     CM12
12K1  o        c          d        i          h        n                  1
      d        c          o        h          p        —
12K2  i        c          d        k          e        p                  1
      n        o          c        d          i        o’
12K3  h,i,c    n,o,d      i,c,n    o,d,h      c,n,o    d,h,i              1, 1, 1, 1
      n,o      h,i,k      o,h      i,k,n      h,i      k,n,o
12K4  d,c,n    o,f,d      c,n,h    f,d        n,h,i    d,f,c              1, 1, 1, −1
      h,i,o    f,c        i,o,d    c          o,d,c    c,o,f
12K5  d,n,h,c  o,d,i      n,h,c    d,i,o,c    h,c,n,d  i,o,c              1, 1, 1, 1, 1,
      c,n,d    o,i,c      n,d,h    i,c,o      d,n,h,c  c,o,i              1, 1, 1, 1, 1
12K6  d,f,c,n  h,i,o,d,n  f,c,n    i,o,d,n,h  c,n,f    o,d,n,h,i          1, 1, 1, −1, −1,
      c,o,n,f  d,c,h,i    o,n,d,f  c,n,h,i,o  n,d,f,c  n,h,i,o            −1, −1, −1, −1, 1

Shifting Sequence1: ‘1’ means clockwise shifting and ‘−1’ means counterclockwise shifting on RSF. They are placed in execution order

various test benches of kernel-streams have been evaluated – they are shown in Table 6. The implemented algorithm automatically generates feasible CM/DB control information by assigning configuration-data/operand-data to CMs/DBs, and this information can be directly used for gate-level simulation.

6.3.2 Evaluated kernel-streams and their mapping results: Table 6 shows the various test benches of kernel-streams composed of several interdependent kernels [23–25]. They

Table 10 Mapping kernel-streams on RSF including 16 CGRAs

ID    Configuration-data assignment on CMs                                                    Shifting Sequence1
      CM1      CM2        CM3      CM4        CM5    CM6        CM7      CM8
      CM9      CM10       CM11     CM12       CM13   CM14       CM15     CM16
16K1  i        c          d        o          f      h          n        i                    1
      o        c          d        n          h      f          –        k
16K2  a        i          h        i          k      h          o        d                    1
      c        f          c        k          p      i          c        k’
16K3  h,i,c    n,o,d,h    i,c,n,f  o,d,h,i    c,n,f  d,h,i      n,f,c,o  h,i,d                1, 1, 1, 1, 1, 1
      f,c,o    i,d,f,k    c,o,h    d,f,k,n    o,h,i  f,k,n,o    h,i,c    k,n,o,d
16K4  d,c,n,h  o,f,d      c,n,h,i  f,d,c      n,h,i  d,f,c,o    h,i,d    f,c,o                1, 1, 1, −1, −1, −1
      i,d,f    c,o        d,f,i    o,c        f,i,d  c,o,f      i,d,c,n  o,f
16K5  i,d,h,f  c,o,n,f    d,h,i,f  o,n,c,d,f  h,i,f  n,o,c,d,f  i,f,h    o,c,d,f,n            1, 1, 1, 1, 1, 1, 1,
      f,i,h    c,o,d,f,n  i,f,h    o,d,f,c,n  f,i,h  d,f,c,o,n  i,d,h,f  f,c,o,n              1, 1, 1, 1, 1, 1, 1
16K6  d,h      f,n,c      h,f,d    n,c        f,d,h  c,f,n      d,h      f,n,c                1, 1, −1, −1, −1, −1, −1,
      h,d,f    n,c        d,h,f    c,n        h,f,d  c,n        f,d,h    f,n,c                −1, −1, −1, −1, −1, −1, 1

Shifting Sequence1: ‘1’ means clockwise shifting and ‘−1’ means counterclockwise shifting on RSF. They are placed in execution order


are classified by two criteria – the first criterion is the number of interdependent kernels and the second one is the required number of DBs. The first criterion is for evaluating the pipelining of the kernel-streams on the three cases of multi-CGRA with 4–16 CGRAs. For example, Table 6 shows that kernel-streams 8K1–8K10 include 7–8 kernels and they run on the multi-CGRA whose number of CGRAs is 8. The second criterion subdivides the four cases (4K–16K) of kernel-streams into further cases that require different numbers of DBs. Therefore, we can evaluate how well the implemented algorithm works for kernel-streams that require from three DBs up to the most DBs.

Tables 7–10 show the mapping results of the test benches on the RSFs by the implemented algorithm. The algorithm generates the optimal configuration-data assignment on CMs for each test bench – the kernel IDs shown in each CM mean that the configuration data for the corresponding IDs should be included in that CM to produce a pipeline-schedule with the best possible performance and the minimum CM usage. In addition, the last column (‘Shifting Sequence’) in the tables shows the directions and frequencies of the required shifting configuration.
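One necessary property of any such assignment is easy to state: every kernel appearing in a stream must have its configuration data stored in at least one CM. The snippet below is a hypothetical sanity check of that property (our illustration, not the paper's algorithm), applied to the 4K2 row of Table 7 and its stream "5a5f2o2i":

```python
import re

def kernels_in_stream(seq):
    """Distinct kernel IDs appearing in a Kernel-Sequence string."""
    return set(re.findall(r'[a-p]', seq))

def assignment_covers(cms, seq):
    """True if the union of all CM contents covers the stream's kernels."""
    stored = set().union(*cms)
    return kernels_in_stream(seq) <= stored

# Table 7, 4K2: CM1=f, CM2=o, CM3=i, CM4=a; stream 4K2 uses a, f, o, i.
cms_4k2 = [{'f'}, {'o'}, {'i'}, {'a'}]
print(assignment_covers(cms_4k2, '5a5f2o2i'))  # True
```

Coverage alone does not guarantee a good pipeline-schedule — the placement relative to the shifting sequence also matters — but a mapping that fails this check is infeasible outright.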

6.3.3 Performance and energy evaluation: Table 11 shows the cycle-accurate performance and energy evaluation obtained by running the test benches on the three multi-CGRAs with 4–16 CGRAs – the CM/DB control information generated by the algorithm is directly used for gate-level simulation. The RSF improves performance by up to 88.8 and 11.1% when compared with the BASE and the CCF, except in some cases – 4K3, 4K6, 8K3, 8K5, 8K8, 8K10, 12K2, 12K4, 12K6, and 16K2. Such cases on the RSF are fractionally slower (−0.1 to −6.4%) than the CCF because of the necessarily delayed pipeline-scheduling. In the case of energy evaluation, the RSF reduces energy by up to 87.5 and 48.2% compared with the BASE and the CCF.



Table 11 Performance and energy evaluation

      Performance                               Energy
      Time, ns                  Rate, %         Joule, fJ         Rate, %
ID    BASE    CCF     RSF       RB1    RC2      BASE  CCF  RSF    RB1   RC2
4K1   33,497  7558    7330      78.1   3.0      26    6    6      77.0  7.9
4K2   33,521  7570    7473      77.7   1.3      26    6    6      76.6  6.3
4K3   21,383  4834    4899      77.1   −1.3     17    4    4      75.9  3.8
4K4   45,812  10,298  9986      78.2   3.0      35    9    8      77.1  7.9
4K5   61,360  13,894  13,453    78.1   3.2      47    12   11     77.0  8.0
4K6   56,090  12,715  12,933    76.9   −1.7     43    11   11     75.8  3.4
8K1   19,271  11,382  10,981    43.0   3.5      26    30   17     36.0  43.7
8K2   20,261  11,759  11,527    43.1   2.0      28    31   18     36.0  42.8
8K3   13,692  4081    4301      68.6   −5.4     19    11   7      64.7  38.5
8K4   25,765  15,117  14,586    43.4   3.5      35    40   22     36.4  43.7
8K5   24,004  12,121  12,222    49.1   −0.8     33    32   19     42.8  41.2
8K6   24,494  6177    6021      75.4   2.5      34    16   9      72.4  43.1
8K7   30,083  11,639  11,272    62.5   3.1      41    31   17     57.9  43.5
8K8   29,570  10,025  10,292    65.2   −2.7     41    26   16     60.9  40.1
8K9   29,845  5563    5569      81.3   −0.1     41    15   9      79.0  41.6
8K10  36,054  10,557  10,891    69.8   −3.2     49    28   17     66.0  39.8
12K1  12,308  2793    2693      78.1   3.6      24    11   6      73.4  40.6
12K2  15,167  3607    3837      74.7   −6.4     30    14   9      69.2  34.5
12K3  26,425  5857    5610      78.8   4.2      52    23   13     74.2  41.0
12K4  19,995  4345    4440      77.8   −2.2     39    17   12     70.3  30.8
12K5  39,970  8539    8176      79.5   4.3      79    33   20     75.1  41.0
12K6  40,120  8555    8960      77.7   −4.5     79    33   24     70.0  28.9
16K1  17,394  3502    3112      82.1   11.1     24    9    5      79.9  48.2
16K2  18,588  3838    3912      79.0   −1.9     25    10   6      76.3  40.5
16K3  46,872  8955    7974      83.0   11.0     64    24   12     80.9  48.1
16K4  46,553  6245    5991      87.1   4.1      64    16   9      85.5  44.0
16K5  89,685  12,226  10,921    87.8   10.7     123   32   17     86.3  47.9
16K6  91,783  11,089  10,240    88.8   7.7      126   29   16     87.5  46.1

RB1: Reduction rate (%) of RSF compared with BASE, (1 − RSF/BASE) × 100
RC2: Reduction rate (%) of RSF compared with CCF, (1 − RSF/CCF) × 100
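The RB/RC footnote formulas can be re-derived directly from the tabulated execution times. The quick check below (our code, not the authors') reproduces the headline 88.8% figure from the 16K6 row of Table 11:

```python
# Reduction-rate formulas from the Table 11 footnotes.

def rb(rsf, base):
    """Reduction rate (%) of RSF compared with BASE: (1 - RSF/BASE) * 100."""
    return (1 - rsf / base) * 100

def rc(rsf, ccf):
    """Reduction rate (%) of RSF compared with CCF: (1 - RSF/CCF) * 100."""
    return (1 - rsf / ccf) * 100

# 16K6 execution times in ns (Table 11): BASE, CCF, RSF.
t_base, t_ccf, t_rsf = 91_783, 11_089, 10_240

print(round(rb(t_rsf, t_base), 1))  # 88.8 -> the headline performance gain
print(round(rc(t_rsf, t_ccf), 1))   # 7.7
```

Note that the energy rates in Table 11 cannot be reproduced exactly from the printed fJ values, which are rounded to integers; the RB/RC columns were evidently computed from the unrounded simulation data.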

7 Conclusion

CGRA has emerged as a suitable solution for embedded systems, but there is a limit when a single CGRA is expected to improve the performance of an entire application. This is because a single CGRA is optimised for the parallelised computations in one kernel at a time, whereas the overall speedup of the entire application can be achieved by KLP, in which several kernels run concurrently. Therefore, CGRA-based multi-core architectures have appeared to support diverse KLPs. However, the existing CGRA-based multi-core architectures suffer from high energy consumption and performance bottlenecks because the existing multi-CGRA structures are not flexible enough to adaptively support the various cases of the KLP. This means that the resources in the multi-CGRAs cannot be efficiently utilised under a monotonous aggregation of several CGRAs. To overcome these limitations, we have proposed the new RSF for improving the flexibility level of CGRA-based multi-core architectures, focusing on the kernel-stream type of the KLP. In addition, a novel inter-CGRA reconfiguration technique based on the RSF has been introduced for efficient pipelining of kernel-streams. Experimental results show that the proposed approaches improve performance by up to 88.8% and reduce energy by up to 48.2% when compared with the existing architecture models.

8 Acknowledgment

This research was supported by the Sookmyung Women’s University Research grants 1-1203-0071.

9 References

1 Hartenstein, R.: ‘A decade of reconfigurable computing: a visionary retrospective’. Proc. Design Automation and Test in Europe Conf., Munich, Germany, March 2001, pp. 642–649

2 Kim, Y., Mahapatra, R.N.: ‘Design of low power coarse-grained reconfigurable architecture’ (CRC Press, Taylor and Francis Group, London, 2010, 1st edn.)

3 Lee, D., Jo, M., Han, K., et al.: ‘FloRA: coarse-grained reconfigurable architecture with floating-point operation capability’. Proc. Int. Conf. on Field-Programmable Technology, Sydney, Australia, December 2009, pp. 376–379

4 Lee, G., Choi, K., Dutt, N.D.: ‘Mapping multi-domain applications onto coarse-grained reconfigurable architectures’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2011, 30, (5), pp. 637–650

5 Han, K., Ahn, J., Choi, K.: ‘Power-efficient predication techniques for acceleration of control flow execution on CGRA’, ACM Trans. Archit. Code Optim., 2013, 10, (2), pp. 8:1–8:25

6 Wood, A., Knight, A., Ylvisaker, B., et al.: ‘Multi-kernel floor planning for enhanced CGRAs’. Proc. IEEE Int. Conf. on Field-Programmable Logic and Applications (FPL), Oslo, Norway, August 2012, pp. 157–164

7 Kim, M., Song, J.H., Kim, D.-H., et al.: ‘Hybrid partitioned H.264 full high definition decoder on embedded quad-core’. Proc. IEEE Int. Conf. on Consumer Electronics (ICCE), Las Vegas, NV, USA, January 2012, pp. 279–280

8 Nishihara, K., Hatabu, A., Moriyoshi, T.: ‘Parallelization of H.264 video decoder for embedded multicore processor’. Proc. IEEE Int. Conf. on Multimedia and Expo, Hannover, Germany, April 2008, pp. 329–332

9 Kim, M., Song, J., Kim, D., et al.: ‘H.264 decoder on embedded dual core with dynamically load-balanced functional partitioning’. Proc. IEEE Int. Conf. on Image Processing (ICIP), Hong Kong, China, September 2010, pp. 3749–3752

10 Jafri, S.M.A.H., Piestrak, S.J., Sentieys, O., et al.: ‘Design of a fault-tolerant coarse-grained reconfigurable architecture: a case study’. Proc. 11th Int. Symp. on Quality Electronic Design (ISQED), San Jose, CA, USA, March 2010, pp. 845–852

11 Jafri, S.M.A.H., Piestrak, S.J., Sentieys, O., et al.: ‘Error recovery technique for coarse-grained reconfigurable architectures’. Proc. IEEE Int. Symp. on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Tallinn, Estonia, April 2011, pp. 441–446

12 Alnajjar, D., Ko, Y., Imagawa, T., et al.: ‘A coarse-grained dynamically reconfigurable architecture enabling flexible reliability’. Proc. IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), Stanford, CA, USA, March 2009, pp. 1–5

13 Alnajjar, D., Ko, Y., Imagawa, T., et al.: ‘Coarse-grained dynamically reconfigurable architecture with flexible reliability’. Proc. IEEE Int. Conf. on Field-Programmable Logic and Applications (FPL), Prague, Czech Republic, August 2009, pp. 186–192

14 Alnajjar, D., Konoura, H., Ko, Y., et al.: ‘Implementing flexible reliability in a coarse-grained reconfigurable architecture’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013, 21, (12), pp. 2165–2178

15 Lee, G., Choi, K.: ‘Thermal-aware fault-tolerant system design with coarse-grained reconfigurable array architecture’. Proc. NASA/ESA Conf. on Adaptive Hardware and Systems (AHS), Anaheim, CA, USA, June 2010, pp. 272–279

16 Kang, J., Ko, Y., Lee, J., et al.: ‘Selective validations for efficient protections on coarse-grained reconfigurable architectures’. Proc. IEEE Int. Conf. on Application-specific Systems, Architectures and Processors (ASAP), Washington, DC, USA, June 2013, pp. 95–98

17 Jin, S., Lee, S.-H., Chung, M.-K., et al.: ‘Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor’. Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT), Seoul, South Korea, December 2012, pp. 243–246

18 Basutkar, N., Yang, H., Xue, P., et al.: ‘Software-defined DVB-T2 receiver using coarse-grained reconfigurable array processors’. Proc. IEEE Int. Conf. on Consumer Electronics (ICCE), Las Vegas, NV, USA, January 2013, pp. 580–581

19 Walters, K.H.G., Kokkeler, A.B.J., Gerez, S.H., et al.: ‘Low-complexity hyperspectral image compression on a multi-tiled architecture’. Proc. IEEE NASA/ESA Conf. on Adaptive Hardware and Systems, San Francisco, CA, USA, July 2009, pp. 330–335

20 Han, W., Yi, Y., Muir, M., et al.: ‘Multicore architectures with dynamically reconfigurable array processors for wireless broadband technologies’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2009, 28, (12), pp. 1830–1843

21 Wei, H., Yu, J., Yu, H., et al.: ‘Software pipelining for stream programs on resource constrained multicore architectures’, IEEE Trans. Parallel Distrib. Syst., 2012, 23, (12), pp. 2338–2350

22 ‘Synopsys – User guide’. Available at http://www.synopsys.com, accessed March 2013

23 ‘DSPstone’. Available at http://www.ice.rwth-aachen.de/research/tools-projects, accessed April 2013

24 ‘Livermore loop benchmarks’. Available at http://www.netlib.org/benchmark/livermorec, accessed March 2013

25 ‘WCET benchmarks’. Available at http://www.mrtc.mdh.se/projects/wcet/benchmarks.html, accessed April 2013
