
986 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 9, SEPTEMBER 2006

Architecture and Compiler Optimizations for Data Bandwidth Improvement in Configurable Processors

Jason Cong, Fellow, IEEE, Guoling Han, and Zhiru Zhang, Student Member, IEEE

Abstract—Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck.

In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup.

Index Terms—Data bandwidth optimization, optimizing compilers, reconfigurable architectures, register allocation.

I. INTRODUCTION

Application-specific instruction-set processors (ASIPs) are a promising approach to combining the flexibility offered by a general-purpose processor and the speedup (and power savings) offered by an application-specific hardware accelerator. Generally, an ASIP has the capability to extend the instruction set of the base architecture with a set of application-specific instructions implemented by custom hardware resources. These hardware resources can be either runtime reconfigurable functional units, or presynthesized circuits. The recent emergence of many commercially available embedded processors with both configurability and extensibility (e.g., Altera Nios/Nios II¹, Tensilica Xtensa 6/LX², Xilinx MicroBlaze³, etc.) testifies to the benefit of this approach. As an example, Fig. 1 (taken from Altera’s website¹) shows the block diagram of the instruction logic of the Altera Nios II soft-core

Manuscript received July 3, 2005; revised February 2, 2006. This work was supported in part by MARCO/DARPA Gigascale Silicon Research Center (GSRC), National Science Foundation under Award CCR-0096383, and by grants from Altera Corporation and Xilinx, Inc. under the California MICRO Program.

The authors are with the Computer Science Department, University of California, Los Angeles, CA 90095 USA (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2006.884050

¹Altera Corp., [Online]. Available: http://www.altera.com.
²Tensilica Inc., [Online]. Available: http://www.tensilica.com.
³Xilinx Inc., [Online]. Available: http://www.xilinx.com.

Fig. 1. Custom instruction logic of Nios II.

processor. Nios II contains a RISC core as the base microarchitecture, and the custom logic can extend the functionality of the arithmetic logic unit (ALU) by implementing custom instructions for complex processing tasks as either single-cycle (combinatorial) or multi-cycle (sequential) operations.

The research community has expended a considerable amount of effort in the ASIP area for almost a decade. A broad overview of the ASIP design, its advantages, applications, and fundamental challenges can be found in [14]. A high-performance ASIP architecture is described in [20]. It integrates a fast reconfigurable functional unit into the pipeline of a superscalar processor to implement application-specific operations. The ASIP architecture and compiler co-exploration problem is addressed in [9].

A crucial step to achieving high performance in an ASIP design is to select an optimal custom instruction set. However, for large programs, this is a difficult task to manage by manual designs, and is further complicated by various microarchitectural constraints. This has inspired a large body of recent research to address the automatic instruction set extension problem.

Most of the existing approaches attempt to discover the candidate extended instructions by identifying common patterns in the data flow graph of the given application, and then to select an appropriate subset of the candidate instructions to maximize performance under certain architectural constraints (e.g., the number of input and output operands, area constraints, etc.).

A template generation and covering algorithm is proposed in [3] to automatically identify the custom instructions. The templates are identified based on the execution traces, which may span across multiple basic blocks. A dynamic programming approach is used for covering, which attempts to minimize the

1063-8210/$20.00 © 2006 IEEE


number of used templates. Similarly, [13] generates the candidate templates using a clustering algorithm based on occurrence frequency. Then the directed acyclic graph covering is formulated as the maximum independent set problem to maximize the number of covered nodes using a minimum number of templates. Unfortunately, these works do not consider the architecture constraints during the template generation.

An approach presented in [18] generates and selects the candidate custom instructions from operation patterns in the data flow graph. This method is extended in [19] to handle complex control flows in embedded software programs. A comprehensive priority function is computed to rank and prune candidate instructions.

In [4], a candidate extended instruction is defined to be a convex directed acyclic subgraph subject to certain input and output constraints. A branch and bound algorithm is used to decide whether or not to include a node of the control data flow graph (CDFG) when creating the candidate. The time complexity of this approach grows exponentially as the problem size increases.

The extended instruction set synthesis technique proposed by [6] solves three sub-problems under the microarchitectural constraints: pattern generation, which enumerates all the candidate instructions; pattern selection, which selects a subset of the candidates to form the extended instruction set; and application mapping, which maps the CDFG onto the extended instruction set. In particular, the application mapping problem is transformed into the minimum-area cell-library-based technology mapping problem in the logic synthesis domain, which can be solved exactly through binate covering. Use of this approach on several small data-intensive DSP applications on the soft-core processor Nios shows a 2.75× speedup on average with little resource overhead (2.54%).

Although the existing techniques are efficient in identifying the promising candidate instructions, [12] points out that most of the speedup (about 60%) comes from clusters with more than two input operands. This exceeds the number of READ ports available on the register file of a typical embedded RISC processor core. Strictly following the two-input single-output constraint generally leads to small clusters with limited speedup.

Generation of larger clusters with extra inputs is allowed in [18] by using the custom-defined state registers to store the additional operands. Unfortunately, at least one extra cycle is needed for each additional input to be loaded into a custom-defined state register. The communication overhead incurred because of these data transfers between the core processor and the custom logic can significantly offset the gain from forming a large cluster. An architectural extension, shadow registers with multiple control bits, is introduced in [7] to reduce the overhead. However, inserting multiple control bits into the instruction word may require a dramatic change to the base instruction set architecture, thus limiting the applicability of this approach.

Our contributions in this paper are threefold. First, we present a quantitative analysis of data bandwidth limitation. Second, we propose a new low-cost architectural extension called hash-mapped shadow registers, which significantly reduces the overhead by using a single control bit. Third, we present a simultaneous global shadow register binding with a hash function generation algorithm which fully exploits the benefits of shadow registers across basic block boundaries. The preliminary experimental results of this study are shown in [8].

Fig. 2. Typical extensible processor.

The remainder of this paper is organized as follows. We first present the quantitative analysis on the data bandwidth problem in Section II. Our proposed architectural extension and associated compiler optimization techniques are described in Sections III and IV, respectively. Experimental results are reported in Section V, followed by conclusions in Section VI.

II. ANALYSIS OF BANDWIDTH LIMITATION

A. Motivation

The architectural model targeted in this paper is a classical single-issue, in-order pipelined RISC core processor with a two-READ-port and one-WRITE-port register file. Under this processor model, a custom instruction follows the same instruction format and execution rules, which include: 1) the number of the operands and results of a custom instruction is predetermined by the extensible architecture; 2) the custom instruction cannot execute until all the input operands are ready; and 3) the custom instruction can READ the core register file only during the decode/execute stage, and can commit the result only during the WRITE-back stage. This extensible architecture simplifies the implementation since the base instruction set architecture can remain unchanged.

Fig. 2 shows the block diagram of a typical configurable processor, where a two-operand instruction format is used. During the execution, the custom logic reads two operands from the core register file and writes the result back directly. However, such a scheme would restrict the custom instruction to having only two input operands, thus limiting the complexity of the computations. Generally, when the input operand number constraint of the custom instruction is relaxed, more performance speedup can be obtained by clustering more operations in one custom instruction to exploit more parallelism. According to the study in [12], most of the performance speedup (about 60%) comes from clusters with more than two input operands. Unfortunately, if a custom instruction needs extra operands, the processor core has to explicitly transfer the data from the register file to the local storage of the custom logic through the data bus. Only after that is the custom instruction allowed to execute. The communication through a data bus may take multiple CPU cycles, and can significantly offset the performance gain from using the custom instruction. Therefore, the limited port number in the register file may become a performance bottleneck in the extensible processors.

Fig. 3. Our compilation and simulation flow.

B. Evaluation Framework

In order to better understand the performance bottleneck in ASIP design, we developed the ASIP performance evaluation framework shown in Fig. 3. SimpleScalar [5], which is a cycle-accurate simulation tool set, is used to estimate the performance of the processor and the impact of communication cost. To have a quick evaluation of data bandwidth limitation, our ASIP compilation is applied on the compiled binary code of the benchmarks.

Based on the execution trace generated by SimpleScalar, the CDFG generator produces the control data flow graph. Under the given microarchitectural constraints, the ASIP compilation problem is solved in three steps based on [6]. The first step, pattern generation, enumerates all candidate patterns from a given control data flow graph through the cut enumeration technique. Pattern selection is then performed at the second step. A cost function that considers the occurrence, speedup, and area is calculated to guide the selection. The selection problem is formulated as a 0–1 knapsack problem which is pseudo-polynomial time solvable via dynamic programming. The third step, application mapping, maps the data flow graph into the selected patterns to minimize the total latency. The application mapping problem is shown to be equivalent to the minimum area cell-library-based technology mapping problem in the logic synthesis domain, which can be solved exactly through binate covering. The algorithmic details of the aforementioned three steps can be found in [6]. Using the optimized code as input, SimpleScalar simulates the program execution on the target configurable processor and provides the performance estimation.
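As a rough illustration of the pattern selection step, the following sketch solves a 0–1 knapsack by dynamic programming. The pattern names, the `gain` values (standing in for a cost function over occurrence and speedup), and the integer `area` units are hypothetical; the actual cost function and formulation are those of [6].

```python
def select_patterns(patterns, area_budget):
    """0-1 knapsack by dynamic programming (illustrative sketch).

    patterns: list of (name, gain, area) tuples; area is a positive integer
              in hypothetical units, gain is a hypothetical benefit score.
    Returns (best_total_gain, chosen_names) for total area <= area_budget.
    """
    # best[a] holds the best (gain, chosen names) using total area <= a.
    best = [(0, [])] * (area_budget + 1)
    for name, gain, area in patterns:
        # Walk the budget downward so each pattern is picked at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + [name])
    return best[area_budget]


# Hypothetical candidates: (name, gain, area).
candidates = [("p1", 60, 10), ("p2", 100, 20), ("p3", 120, 30)]
total_gain, chosen = select_patterns(candidates, area_budget=50)
```

The running time is proportional to the number of patterns times the area budget, which is why the problem is only pseudo-polynomial: the cost grows with the numeric value of the budget rather than with its encoding length.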

C. Analysis Results

In this study, we modeled a single issue, in-order RISC configurable processor which is similar to the Altera Nios/Nios II.¹ Table I shows the detailed machine configuration. The instruction set allows two input operands and one output operand. The C examples used in the experiments are from Mediabench [15] and Mibench [10].

TABLE I
DETAILED PROCESSOR CONFIGURATION

Fig. 4. Examples of extended instructions. (a) Three-input. (b) Four-input.

As previously mentioned, our ASIP compilation tool generates extended instructions and maps the program with the extended instruction set. All the extended instructions are generated within basic block boundaries. Memory operations are not allowed in any extended instruction. We assume that the latency of the extended instructions equals the latency of the critical path in the collapsed computation cone. Fig. 4 shows examples of three- and four-operand extended instructions extracted from the benchmarks. If we assume that the execution time of all the operations is one clock cycle, the three-operand extended instruction in the critical path would have a three-cycle latency.
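The latency assumption above amounts to a longest-path computation over the operations collapsed into one extended instruction. The sketch below illustrates this; the operation names and the shape of the cone are hypothetical and are not taken from Fig. 4.

```python
from functools import lru_cache


def cone_latency(preds, latency):
    """Critical-path latency of a collapsed computation cone (sketch).

    preds:   dict mapping each operation to the operations inside the cone
             whose results it consumes (the cone must be acyclic).
    latency: dict mapping each operation to its latency in cycles.
    """
    @lru_cache(maxsize=None)
    def finish_time(op):
        # An operation can start once all of its in-cone producers finish.
        start = max((finish_time(p) for p in preds[op]), default=0)
        return start + latency[op]

    return max(finish_time(op) for op in preds)


# Hypothetical three-input cone: three unit-latency operations in a chain.
chain = {"op1": [], "op2": ["op1"], "op3": ["op2"]}
unit = {op: 1 for op in chain}
chain_latency = cone_latency(chain, unit)
```

Under the unit-latency assumption, a chain of three dependent operations gives a three-cycle latency, while operations that sit side by side in the cone do not add to it.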

Fig. 5(a) shows the ideal speedup for each benchmark under different input size constraints. The speedup is measured by comparing the number of simulated execution cycles of the program on the extended instruction set with the number of cycles of the original code on the basic instruction set. We assume that there is no limit on the number of read ports in the register file so that no move operations are needed.

The results shown in Fig. 5(a) indicate that we can achieve 10% speedup on average under the 2-operand constraint. Under 3- and 4-operand constraints, 15% and 18% speedup can be achieved, respectively. It also shows that for these examples, the designs under 3- and 4-operand constraints can achieve, respectively, about 50% and 80% more speedup over 2-operand constraints.

Fig. 5. Performance speedup under different input operand constraints. (a) Ideal speedup. (b) Ideal speedup versus actual speedup.

However, the processor can only provide two simultaneous accesses from the register file. Move operations have to be inserted before the execution of three- or four-operand extended instructions. In our experiment, we assume that the move operation needs only one clock cycle. Fig. 5(b) shows the ideal speedup versus the actual speedup when the communication overhead is considered. We can observe considerable performance degradations for all the benchmarks due to the extra move instructions. Even with the optimistic communication cost model, we can only achieve 9% and 12% average speedup for 3- and 4-operand constraints, respectively. It is clear that the communication overhead will seriously offset the speedup achieved by the extended instructions. If we define the relative_speedup_drop as

$$\text{relative\_speedup\_drop} = \frac{\text{Speedup}_{\text{ideal}} - \text{Speedup}_{\text{actual}}}{\text{Speedup}_{\text{ideal}}}$$

where $\text{Speedup}_{\text{ideal}}$ denotes the ideal speedup without consideration of move operation overhead, and $\text{Speedup}_{\text{actual}}$ represents the actual speedup if the communication cost is included, then on average, the relative speedup drop is 41% and 32% under the 3- and 4-operand constraints, respectively. Therefore, the limited data bandwidth seriously degrades the performance improvement for configurable processors.
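As a quick sanity check of this metric, the sketch below assumes the drop is the gap between ideal and actual speedup normalized by the ideal speedup, and treats each speedup as a fractional improvement (for example, 0.15 for 15%). It plugs in the average speedups quoted above; the function name is chosen here for illustration.

```python
def relative_speedup_drop(ideal, actual):
    """Fraction of the ideal speedup lost to communication overhead."""
    if ideal <= 0:
        raise ValueError("ideal speedup must be positive")
    return (ideal - actual) / ideal


# Average speedups quoted in the text, as fractional improvements.
drop_3 = relative_speedup_drop(0.15, 0.09)  # 3-operand constraint
drop_4 = relative_speedup_drop(0.18, 0.12)  # 4-operand constraint
print(f"{drop_3:.0%} {drop_4:.0%}")
```

The averaged inputs give roughly 40% and 33%, close to the reported 41% and 32%. The small difference is presumably because the reported figures average the per-benchmark drops rather than applying the formula to the averaged speedups.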

III. ARCHITECTURE EXTENSION FOR DATA BANDWIDTH IMPROVEMENT

A. Existing Solutions

Several architectural approaches can be adopted to tackle the speedup degradation caused by the port number limitation. We shall discuss three existing approaches.

1) Multiport Register File: A straightforward method to increase data bandwidth for an instruction is the use of a multiport register file to introduce extra READ ports. This allows the custom instruction to increase simultaneous accesses to the core register file. No communication latency will be introduced if the operand number is no more than the number of the register file read ports.

However, since the base instruction set is untouched, the extra READ ports will be wasted when executing the basic instructions. In addition, adding ports to the register file will have a dramatic impact on the energy and die area of the processor. As pointed out in [17], the area and power consumption of a register file grows more than quadratically with its port number. Moreover, since the register file is controlled solely by the core processor, a custom instruction needs one extra address encoded in its instruction word in order to access an additional READ port of the register file. This may not be feasible because of the limited instruction word length.

Fig. 6. Shadow registers with multiple control bits.

2) Register File Replication: Register file replication is another technique used to increase data bandwidth. By creating a complete physical copy (or partial copy) of the core register file, the custom instructions can fetch the encoded operands from the original register file and the extra operands from the replicated register file. Using this approach, Chimaera [11] is capable of performing computations that use up to nine input registers. The reconfigurable logic is given direct READ access to a subset of the registers in the processor by creating an extra register file which contains copies of those registers’ values, thus allowing complicated computations to be performed.

Since the basic instructions cannot use the replicated register file, the complete register duplication approach will introduce considerable resource waste in terms of area and power. The partial duplication approach enforces a one-to-one correspondence between a subset of registers in the core register file and those in the replicated register file, and the computation results are always copied to the replicated registers. This leaves very limited opportunities for compiler optimization to further improve performance.

3) Shadow Registers With Multiple Control Bits: Shadow registers have been recently proposed [7] as an architectural extension to overcome the aforementioned limitations and difficulties. Fig. 6 shows the block diagram of this architecture, in which the core register file is augmented by an extra set of shadow registers that are conditionally written by the processor in the WRITE-back stage and used only by the custom logic. Any instruction (basic or extended) can either skip or copy the computation result into one of the shadow registers in the WRITE-back stage. The copy/skip option and the address of the target shadow register need to be encoded as additional control bits in the instruction format. Table II shows one possible encoding for the extension with three shadow registers, in which two bits are needed.

TABLE II
INSTRUCTION ENCODING EXAMPLE FOR THREE SHADOW REGISTERS

Since the data is copied to the shadow registers transparently during the WRITE-back stage, the communication overhead between the processor core and the custom logic can be greatly reduced. Moreover, under Table II’s encoding scheme, each shadow register can be the physical copy of any register in the core register file, which creates a great amount of freedom and opportunity for the compiler optimization. Experiments in [7] have shown that a small number (typically less than four) of shadow registers is sufficient to compensate for a major portion of the performance degradation caused by the port number limitation.

However, additional bits ($\lceil \log_2(K+1) \rceil$ bits for $K$ shadow registers) are required in the instruction format. Since the unused opcodes are usually limited in the common instruction format, it is not trivial to encode two or three extra bits without increasing the word length. In fact, we need to represent four distinct versions of one instruction to accommodate two extra control bits and eight versions for three bits—something that is not practical in general.
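The encoding cost can be illustrated with a small sketch. It assumes that the control field must encode one "skip" code plus one code per shadow register, which matches the Table II example (three shadow registers, two bits); the function names are chosen here for illustration.

```python
def control_bits(num_shadow_regs):
    """Control bits needed for 'skip' plus one code per shadow register."""
    codes = num_shadow_regs + 1        # +1 for the skip option (assumption)
    return (codes - 1).bit_length()    # integer form of ceil(log2(codes))


def instruction_versions(bits):
    """Each extra control bit doubles the number of versions of an opcode."""
    return 2 ** bits


for k in (1, 3, 7):
    bits = control_bits(k)
    print(k, "shadow registers:", bits, "bits,", instruction_versions(bits), "versions")
```

With three shadow registers this gives two bits and four versions of each instruction, and with seven it gives three bits and eight versions, in line with the counts stated above.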

B. Proposed Approach—Shadow Registers With Single Control Bit

Note that the shadow registers essentially act as a tiny cache located in the custom logic to temporarily store the recent computation results from the core processor. In this sense, the multiple-bit controlled shadow register set resembles a fully associative cache since each core register can be mapped to any entry of the shadow register set. This full associativity requires the presence of multiple control bits in the instruction word to identify the target shadow register. In this subsection, we propose to use a hash-mapped shadow register scheme to significantly reduce control bit overhead.

1) Hash-Mapped Shadow Register Scheme: Fig. 7 shows the block diagram of our proposed enhancement. In this scheme, every instruction can also choose to either skip or copy the result into the shadow registers during the WRITE-back stage. The key difference is that only the copy/skip option needs to be encoded as an additional control bit in the instruction format.

The actual shadow register that will be written is determined by a prespecified many-to-one hash function. Namely, the execution result to register $R_i$ in the core register file will be conditionally copied to register $SR_j$ in the shadow register set, where $j = \mathit{hash}(i)$. For example, one simple hashing scheme can be direct mapping, where $j = i \bmod K$, assuming that we have $K$ shadow registers. In general, this particular hash function can be performed by the hashing unit which resides in the custom logic using a fast and cheap lookup table, and reconfigured for different applications or different domains of applications.

Fig. 7. Shadow registers with single control bit.
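A minimal behavioral sketch of this scheme is given below, assuming a lookup table that maps each core register index to a shadow register index and defaults to direct mapping (core index modulo the number of shadow registers). The class name, method names, and register counts are hypothetical and only model the conditional copy in the write-back stage.

```python
class HashMappedShadowRegs:
    """Behavioral model of hash-mapped shadow registers (illustrative only)."""

    def __init__(self, num_core_regs, num_shadow_regs, table=None):
        # Lookup table: core register index -> shadow register index.
        # Default is direct mapping; a reconfigured table may be passed in.
        if table is None:
            table = [i % num_shadow_regs for i in range(num_core_regs)]
        self.table = list(table)
        self.shadow = [None] * num_shadow_regs

    def write_back(self, dest_reg, value, copy_bit):
        """Write-back stage: the core register write is not modeled here;
        the shadow copy happens only when the single control bit is set."""
        if copy_bit:
            self.shadow[self.table[dest_reg]] = value

    def read(self, shadow_index):
        """Read port used by the custom logic."""
        return self.shadow[shadow_index]


regs = HashMappedShadowRegs(num_core_regs=32, num_shadow_regs=4)
regs.write_back(dest_reg=5, value=42, copy_bit=True)   # copied to shadow 5 % 4 = 1
regs.write_back(dest_reg=9, value=7, copy_bit=False)   # skip: shadow 1 is unchanged
kept = regs.read(1)
regs.write_back(dest_reg=9, value=7, copy_bit=True)    # 9 % 4 = 1: overwrites shadow 1
replaced = regs.read(1)
```

The last two writes show the many-to-one conflict: core registers 5 and 9 share one shadow register under direct mapping, so the compiler has to decide which result is worth copying.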

2) Advantages and Limitations: Since the shadow registers will be mainly used for storing the source operands of the custom instructions, the required number of shadow registers is generally much less than that of the core register file. Except for a small amount of extra interconnect and control logic introduced by the conditional copy path, the base datapath will remain the same. Therefore, the implementation tends to be very cost efficient when compared to the prior approaches of using the multiport register file and register file replication.

Note that the target shadow register address is determined by the hashing unit instead of being explicitly encoded in the instruction word. This allows more shadow registers without the penalty of increasing the number of control bits. More importantly, we believe that it is much easier to add (or encode) one single additional control bit without increasing the length of the instruction word. For example, in Nios II’s¹ R-type instructions there are several reserved bits that can be potentially used for advanced features. For the I-type and J-type instructions, there are 42 encodings specified for the 6-bit OP code field. This leaves more than 20 encodings unused, which are sufficient to accommodate the second versions of those particular instructions (e.g., arithmetic and load instructions) that can forward their results to the shadow registers. Clearly, our proposed single-bit controlled shadow register approach provides a much more viable solution compared to the original multi-control-bit shadow register approach reviewed in Section III-A.

One potential limitation of the hash-mapped shadow registers approach is that a predetermined many-to-one correspondence is enforced between the core registers and the shadow registers, which may restrict the opportunities for compiler optimization to further improve performance. However, we believe that this limitation is minor because more shadow registers can be allocated to mitigate the problem. Essentially, we can trade off the associativity of the shadow register set for the number of entries.

Shadow registers will be allocated in the compilation process, and are thus visible to software. The shadow registers and the hash mapping in the hash unit become a part of an application context. Therefore, the operating system needs to save (restore) the shadow registers’ values and the hash mapping into (from) memory when a context switch occurs. As shown in the experimental results in Section V, a small number of shadow registers (usually less than eight) is sufficient to compensate for the performance drop. For this reason, the associated overhead should be insignificant.

IV. SIMULTANEOUS GLOBAL SHADOW REGISTER BINDING WITH HASH FUNCTION GENERATION

Shadow registers should also be considered during the compilation process to fully exploit the benefits of our proposed architectural extension. The compiler has to first guarantee the integrity and the correctness of the program, e.g., it has to ensure that an active shadow register would not be overwritten during the execution of a custom instruction. Moreover, optimization techniques are needed to intelligently assign or bind the variables to the appropriate shadow registers. In [7], a shadow register binding problem was introduced as a special case of the register allocation problem to maximize use of the shadow registers. However, the proposed algorithm limits itself to a data flow graph only (i.e., within the basic block boundary).

In this section, we first present a global solution that performs shadow register binding across the whole control data flow graph using the predetermined hash function described below in Section IV-A. We then show that the hash function generation can be carried out simultaneously with the shadow register binding in Section IV-B.

A. Global Shadow Register Binding Under Prescribed Hash Function

1) Preliminaries: Compiler optimization algorithms are usually performed on the CDFG derived from a program. At the top level of a CDFG, the control flow graph consists of a set of basic block nodes and control edges. Each basic block is a data flow graph (DFG) in which nodes represent basic computations (or instruction instances) and edges represent data dependencies. Throughout this section, we assume that given the profiling information, the ASIP compiler has generated extended instructions and mapped the application so that every node in the mapped CDFG corresponds to an instruction in the extended instruction set (including basic instructions and custom instructions). We also assume that each instruction produces at most one result, and the instruction scheduling and register allocation for the core registers have already been performed. Thus, we know the shadow register to which an operand can be mapped with the given hash function.

Our task is to appropriately assign the source operands of the custom instructions to the available shadow registers so that the potential communication overhead (in terms of the number of move operations) can be minimized. Specifically, the problem is solved in three steps. 1) We first perform a depth-first search to produce a linear order of all the instructions. 2) Then we derive the usage intervals for each shadow register candidate (i.e., data use). 3) Based on the usage intervals, we construct a compatibility graph and bind each shadow register independently.

2) Linear Ordering: We adapt a linear scan register allocation technique to compute a global linear order of the instructions and derive the usage intervals for the shadow register candidates in the CDFG. Linear scan register allocation was proposed by Poletto and Sarkar in [16]. It is a very simple and efficient algorithm that runs up to seven times faster than a fast graph coloring allocator and produces relatively high quality code.

Fig. 8. CDFG example with linear order.

We employ the depth-first method introduced in [1] to order the basic blocks. The final linear ordering is the reverse of the postorder traversal of the graph. Fig. 8 shows a CDFG example annotated with the linear numbering for each basic block. The complete sequence of basic blocks visited as we traverse the CDFG is 1, 2, 3, 4, 6, 7, 6, 4, 5, 4, 3, 2, 1. In this list, we mark the last occurrence of each number (shown in brackets) to get 1, 2, 3, 4, 6, [7], [6], 4, [5], [4], [3], [2], [1]. The marked entries give the postorder 7, 6, 5, 4, 3, 2, 1, and its reverse, 1, 2, 3, 4, 5, 6, 7, is the final ordering of the basic blocks. Additionally, the instructions inside one basic block are ordered according to the instruction scheduling. By combining these two numberings, we obtain a global sequence number (SN) for each instruction in the CDFG. In fact, the correctness of the algorithm does not depend on the ordering method. However, the ordering may influence the quality of register binding.
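The depth-first ordering can be sketched as follows. Since Fig. 8 is not reproduced here, the successor lists below are a hypothetical control flow graph chosen only so that the traversal visits the blocks in the sequence quoted above; the actual edges of Fig. 8 may differ.

```python
def reverse_postorder(successors, entry):
    """Order basic blocks by the reverse of a depth-first postorder."""
    visited = set()
    postorder = []

    def dfs(block):
        visited.add(block)
        for succ in successors.get(block, []):
            if succ not in visited:
                dfs(succ)
        # A block is appended when the traversal leaves it for the last time.
        postorder.append(block)

    dfs(entry)
    return postorder[::-1]


# Hypothetical CFG: block 4 branches to 6 and 5, and block 7 loops back to 4.
cfg = {1: [2], 2: [3], 3: [4], 4: [6, 5], 5: [], 6: [7], 7: [4]}
block_order = reverse_postorder(cfg, entry=1)
```

For this graph the postorder is 7, 6, 5, 4, 3, 2, 1, so the returned order is 1 through 7. A global sequence number for each instruction can then be formed from the block's position in this order and the instruction's position in the block's schedule.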

3) Usage Interval Generation: As mentioned earlier, if the operand number of a custom instruction exceeds the available READ port count, extra data transfer (or move, for short) operations are needed to copy operands from the register file to the local storage in the custom logic. In our proposed architecture, if an operand is already in the shadow register, one move operation can be saved.

In addition, it is not necessarily optimal to keep a variable in the shadow register for its every use, as pointed out in [7]. Fig. 9 shows the data flow graph of the basic block 5 in Fig. 8. Suppose the register file has only two READ ports and all the extended instructions have three input operands; then four move operations will be required without the shadow register. Assuming that there is one shadow register available, only two moves can be saved through the shadow register if we keep the variables in the shadow register for their entire lifetime. Interestingly, one more move can be saved if one instruction commits its result to the shadow register and overwrites the result of another. Therefore, to achieve the maximum number of move operation saves, we should allow a value to be replaced in the middle of its lifetime. This motivates us to define the usage intervals with respect to each data use of a variable instead of the variable itself.

Fig. 9. Instruction sequence and its data flow graph.

To derive the usage interval of a data use of a variable $v$, we need to consider the following three cases. 1) If variable $v$ is defined (or assigned) within the same basic block as the subject use, then the usage interval is $[s, e]$, where the SN of the definition instruction is $s$ and the SN of the use instruction is $e$. 2) If all the definitions of variable $v$ are located in the preceding basic blocks (the precedence relationship is defined in terms of the linear order), then the usage interval is $[s_{\min}, e]$, where $s_{\min}$ denotes the SN of the definition instruction with the smallest numbering in the linear order. 3) If variable $v$ has one definition in a succeeding basic block, it implies that $v$ is assigned and used in a loop. In this case, we need to traverse the loop once, extending the usage interval to cover $[l_{\min}, l_{\max}]$, where $l_{\min}$ and $l_{\max}$ denote the smallest and largest SNs of any instructions in this loop.

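For the example in Fig. 8, the use of a variable at basic block 7 has two definitions in basic blocks 3 and 6, respectively. Both definitions precede the use in the linear order, so by case 2 the usage interval for this use starts at the SN of the definition in basic block 3 and ends at the SN of the use in basic block 7. We should note that the usage interval is a conservative approximation. Some subranges within the interval, where the variable is not live, may also be included and are under-utilized. However, the correctness of the program is not affected.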
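The three cases can be sketched as a small function over sequence numbers. The argument names and the example SNs are hypothetical, and the loop case is interpreted conservatively here as extending the interval to cover both the loop span and the earliest definition; the exact interval used in the paper may be tighter.

```python
def usage_interval(use_sn, def_sns, local_def_sn=None, loop_span=None):
    """Usage interval (start_sn, end_sn) of one data use (illustrative sketch).

    use_sn:       SN of the instruction that uses the variable.
    def_sns:      SNs of all definitions of the variable that reach this use.
    local_def_sn: SN of the definition in the same basic block, if any.
    loop_span:    (smallest SN, largest SN) of the enclosing loop, needed
                  only when some definition follows the use.
    """
    # Case 1: the variable is defined in the same basic block as the use.
    if local_def_sn is not None:
        return (local_def_sn, use_sn)
    # Case 2: every definition precedes the use in the linear order.
    if all(d < use_sn for d in def_sns):
        return (min(def_sns), use_sn)
    # Case 3: a definition follows the use, so the value flows around a loop.
    if loop_span is None:
        raise ValueError("loop_span is required when a definition follows the use")
    loop_lo, loop_hi = loop_span
    return (min(loop_lo, min(def_sns)), max(loop_hi, use_sn))


same_block = usage_interval(53, [51], local_def_sn=51)
earlier_blocks = usage_interval(78, [31, 62])
in_loop = usage_interval(50, [60], loop_span=(40, 70))
```

In the second call the two definitions mimic the situation described above, where definitions in two earlier blocks reach one use and the interval starts at the earlier of the two.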

4) Binding for One Shadow Register: The binding problem for one shadow register can be formulated as follows.

One-Shadow-Register Binding Problem: Given a shadow register, and an interval set in which each interval could be hash-mapped to the register, bind the corresponding variables to the register so that the maximum number of move operations can be saved.

To accurately calculate the move reduction, a weighted compatibility graph can be constructed by creating a vertex for each interval in and creating an edge from (with interval ) to (with interval ) if and only if . Each vertex is assigned a weight which denotes the number of moves saved if the corresponding variable is held in the shadow register until the end time.

Fact 1: The usage intervals of the output edges from the same instruction are not compatible with each other.

This is straightforward because their usage intervals must overlap at this instruction.

Fact 2: If a data use at instruction can retrieve the value from the shadow register, the value is also available for other uses along all the reverse paths from the instruction to the definitions.

Based on this fact, the weight of an interval can be calculated as the sum of move reductions for each use along the reverse paths. In particular, these uses are covered by the interval. In Fig. 8, suppose that instructions 53 and 78 execute 10 and 20 times, respectively; then the weight of interval for use is 30.

Lemma 1: The weighted compatibility graph is acyclic.

Lemma 2: The one-shadow-register binding problem is equivalent to finding a maximum weighted chain in the compatibility graph.

The basic idea of the proof is as follows. The nodes on the chain are compatible with each other, so their corresponding variables can be allocated to the same shadow register. The weight on a node indicates the number of saves for storing the value in the shadow register until the end time. Fact 1 implies that a variable could be bound to the shadow register at most once. So, the maximum weighted chain corresponds to a register binding with maximum move saves.

Since the interval graphs can be constructed in , and the maximum weighted chain can be solved in , we can directly derive the following theorem.

Theorem: The one-shadow-register binding problem can be solved optimally in time .
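A maximum weighted chain in an acyclic compatibility graph can be found with a simple dynamic program, sketched below. The names (`max_weight_chain`, the `(start, end, weight)` tuple layout) are hypothetical, and the compatibility test used here, namely that one interval must end strictly before the next one starts, is an assumption made for illustration. Weights are assumed to be nonnegative.

```python
from typing import List, Tuple

# (start SN, end SN, weight), where weight is the number of moves saved
Interval = Tuple[int, int, int]


def max_weight_chain(intervals: List[Interval]) -> Tuple[int, List[int]]:
    """Return (total weight, indices of the chosen intervals in chain order).

    Assumed compatibility rule: interval a may precede interval b on the
    chain only if a ends strictly before b starts.
    """
    if not intervals:
        return 0, []

    # Sorting by end SN is a topological order of the compatibility graph:
    # any predecessor of an interval must end before that interval ends.
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][1])
    best = {}  # best chain weight ending at interval i
    prev = {}  # predecessor of i on that best chain, or None

    for pos, i in enumerate(order):
        start, _end, weight = intervals[i]
        best[i], prev[i] = weight, None
        for j in order[:pos]:
            if intervals[j][1] < start and best[j] + weight > best[i]:
                best[i], prev[i] = best[j] + weight, j

    # Walk back from the best chain tail to recover the binding.
    tail = max(order, key=lambda i: best[i])
    chain = []
    node = tail
    while node is not None:
        chain.append(node)
        node = prev[node]
    chain.reverse()
    return best[tail], chain
```

This sketch compares every pair of intervals, so it runs in time quadratic in the number of intervals. With the toy input `[(1, 4, 30), (2, 6, 50), (5, 9, 25), (7, 10, 10)]`, the second and fourth intervals form the heaviest chain, with total weight 60.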

5) Extension to Shadow Registers: Recall that the hashing unit determines the shadow register to which a variable can be mapped. Since the hash function is a many-to-one function, each variable can only be mapped to one shadow register. This implies an interesting property wherein each shadow register can be allocated independently. Specifically, for each individual shadow register, we can group the corresponding variables (or usage intervals) together and perform the one-shadow-register binding algorithm on this group without interfering with the binding of other hash-mapped shadow registers. Therefore, the algorithm can be easily extended to handle shadow registers by iteratively solving the one-shadow-register binding problem while still guaranteeing the optimality.

B. Hash Function Generation With Shadow Register Binding

The choice of the hash function implemented in the hashing unit also affects the solution quality. If we revisit the example shown in Fig. 9 under the assumption that two shadow registers are available, only two moves can be saved when the hash function accidentally maps core registers and to the same shadow register. On the other hand, four moves will be saved if and are mapped to different shadow registers. This clearly suggests that hash function generation and shadow register binding are interdependent, and both should be considered at the same time. In this section, we present an effective algorithm that constructs the hash function simultaneously with the shadow register binding.

1) Multiway Set Partitioning (MSP) Formulation: We formalize our hash function generation problem as follows.

Hash Function Generation With Shadow Register: Given a set of core registers , a set of shadow registers , and an objective function representing the number of saved move operations, find a many-to-one function so that is maximized.

In particular, our goal is to generate the best possible hash function to facilitate the global shadow register binding algorithm so that the maximum number of moves can be saved. Therefore, we evaluate the objective function by solving the -shadow-register binding problem described in the preceding section. To be more concrete, for each available shadow register , we first find all the usage intervals whose core register indices are hash-mapped to . After the usage interval set is found, we construct the interval graph for and solve the one-shadow-register binding problem on . We denote the result by , which is the total weight on the maximum weighted chain in (i.e., the number of moves that can be saved for ). Once we independently bind all the shadow registers, the final objective function can be computed by .
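The evaluation just described (group the usage intervals by the shadow register they hash to, solve each group independently, and sum the results) might look as follows. This is a sketch under assumptions: `saved_moves`, `chain_weight`, and the `(core register, start, end, weight)` tuple layout are hypothetical names, and two intervals are treated as compatible only if the first ends strictly before the second starts.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Interval = Tuple[int, int, int]           # (start SN, end SN, weight)
Use = Tuple[Hashable, int, int, int]      # (core register, start, end, weight)


def chain_weight(intervals: List[Interval]) -> int:
    """Weight of the maximum weighted chain of pairwise compatible intervals."""
    ivs = sorted(intervals, key=lambda iv: iv[1])  # order by end SN
    best: List[int] = []
    for i, (start, _end, weight) in enumerate(ivs):
        # Best chain that can be placed in front of interval i, if any.
        before = max((best[j] for j in range(i) if ivs[j][1] < start), default=0)
        best.append(before + weight)
    return max(best, default=0)


def saved_moves(hash_fn: Dict[Hashable, int], uses: List[Use]) -> int:
    """Objective value of a candidate hash function.

    hash_fn maps each core register to one shadow register index, so every
    shadow register can be bound independently of the others.
    """
    groups: Dict[int, List[Interval]] = defaultdict(list)
    for reg, start, end, weight in uses:
        groups[hash_fn[reg]].append((start, end, weight))
    return sum(chain_weight(group) for group in groups.values())
```

As a toy illustration of the interdependence discussed above, take two overlapping intervals of weight 2 on registers `"r1"` and `"r2"`. Hashing both to shadow register 0 saves 2 moves, while hashing them to different shadow registers saves 4.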

Since we are searching for a many-to-one function, any register in can be mapped to one and only one shadow register in SR. Therefore, the desired hash function essentially partitions the given set of core registers into subsets and labels each subset with a shadow register index. The goal is to maximize the performance gain by saving as many moves as possible. With the above observation, we notice that the hash function generation problem can be polynomially reduced to the well-known multi-way set partitioning (MSP) problem, described as follows.

MSP Problem: Given a set of modules, an integer value , and an objective function , partition into disjoint clusters so that every belongs to exactly one cluster in and is maximized.

Specifically, we define our partitioning objective function to be , where denotes a weighting function on cluster .

A polynomial reduction from HSG to MSP can be obtained by constructing the module set in the MSP problem from the set in the HSG problem and setting the function to be the partitioning objectives in MSP. Then, we can show that an optimal partition uniquely determines an optimal hash function. For example, if is an optimal two-way MSP solution on , where and , then it is straightforward to conclude that the corresponding hash function of HSG (with four core registers and two shadow registers) is defined by and .

Although the problem of computing an optimal multi-way partitioning is NP-hard in general, because it is so critical to many application areas [especially in the VLSI computer-aided design (CAD) domain], a number of efficient heuristics have been proposed. However, most of the existing techniques focus on the min-cut graph (or hypergraph) partitioning problem, which is not directly applicable to our MSP problem with specialized objectives. In this paper, we adapt a two-phase approach proposed in [2] to the HSG problem. First, we apply an efficient heuristic to generate a linear permutation sequence on the core registers set . Second, we optimally solve a 1-D -way partitioning problem through dynamic programming on the given linear sequence. Once the partitioning problem is resolved, the hash function can be directly constructed together with the global shadow register binding solution.

C. Core Register Ordering

In the core register ordering phase, our goal is to find a linear permutation on using a fast heuristic which consists of the following three major steps.

Fig. 10. Permutation generation.

(1) We first compute the total weight of those usage intervals associated with each individual core register.

(2) Then, we sort the registers in nonincreasing order according to those weights.

(3) Given the sorted list , we begin with the first, highest weighted registers, i.e., , and evenly distribute them into the positions 0, . Then, we distribute the second highest weighted registers to the positions 1, , and repeat until all core registers are positioned. To be more precise, we construct a linear permutation of so that the register in with index is moved to position ( ) , where . Fig. 10 illustrates the permutation generation process. Suppose we have six registers in a sorted order (i.e., 6) and 2, we then map registers , , , , , and to positions 0, 3, 1, 4, 2, 5, respectively.
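The three ordering steps can be sketched as below. The function name `order_core_registers` and the register labels are hypothetical. The sketch interleaves the sorted registers by taking every k-th one in turn, which reproduces the six-register, two-way example above; how the heuristic behaves when the number of registers is not a multiple of k is an assumption of this sketch.

```python
from typing import Dict, Hashable, List


def order_core_registers(weights: Dict[Hashable, float], k: int) -> List[Hashable]:
    """Return the core registers in permuted order (list index = position).

    weights -- total usage-interval weight of each core register (step 1)
    k       -- number of clusters, i.e., shadow registers
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Step 2: sort by nonincreasing weight (sorted() is stable, so ties keep
    # their original relative order).
    ranked = sorted(weights, key=lambda reg: -weights[reg])

    # Step 3: registers whose sorted index has the same remainder modulo k
    # are placed next to each other, so registers with similar high weights
    # end up roughly len(ranked) / k positions apart.
    spread = sorted(range(len(ranked)), key=lambda i: (i % k, i // k))
    return [ranked[i] for i in spread]
```

With six registers sorted as `r0` to `r5` and `k = 2`, the resulting sequence is `r0, r2, r4, r1, r3, r5`, so `r0` to `r5` land at positions 0, 3, 1, 4, 2, 5.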

We believe the above heuristic suits our purposes because it encapsulates the following intuitions. 1) If two core registers both have large weight (i.e., they are both associated with a large set of intervals), then there is a high possibility that their usage intervals frequently overlap and incur many hash conflicts due to the potential shadow register contentions. Hence, these two registers would intuitively repel each other during the partitioning, and they should be spaced out as much as possible in the permutation sequence. 2) Otherwise, two core registers would attract each other, and they should be close together in the permutation sequence.

Since we space out every two registers with similar high weights by distance, it is unlikely that they would be placed in the same cluster in the 1-D -way partitioning described below.

1) Optimal 1-D -Way Partitioning (1D-KP): Given a linear core register permutation, we solve a 1D-KP problem to determine the appropriate cluster for each core register.

1D-KP Problem: Given a permutation : , find a -way set partitioning in which each cluster can be uniquely denoted by , where represents the th register in the permuted linear sequence of .

Note that we require each cluster of the partitioning to be a contiguous slice of the permutation sequence. Hence, each cluster can be written as , which denotes that the cluster begins with the th register and ends with the th register according to the given permuted order.

For such an objective , the principle of optimality holds for the 1D-KP problem; i.e., an optimal solution to any instance is made up of optimal solutions to its sub-instances. Therefore, the 1D-KP problem can be optimally solved by dynamic programming. In particular, the recurrence function can be expressed as , where and . The 1D-KP problem has complexity, where is the time complexity of computing the cluster weighting. As discussed earlier in Section IV-A, is , where represents the size of the interval graph. In fact, both (usually no greater than 32) and (typically no greater than 8) are relatively small constants. This helps to significantly reduce the time complexity and makes the algorithm much more scalable.

Fig. 11. Hash-mapped shadow register implementation block diagram.

Since the shadow register binding problems are implicitly solved during the cluster weighting evaluations, we can directly retrieve the global shadow register binding solution when the final partition (or hash function) is determined. Therefore, we solve the hash function generation problem simultaneously with the shadow register binding problem.
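A dynamic program for splitting a linear sequence into contiguous clusters can be sketched as follows. It is an illustration, not the paper's exact recurrence: `partition_1d` and the callback `weight(i, j)` are hypothetical names, the objective is assumed to be the sum of the cluster weights, and every cluster is assumed to be nonempty. In the setting above, `weight(i, j)` would be the result of the one-shadow-register binding on the registers at positions `i` through `j`.

```python
from typing import Callable, List, Tuple


def partition_1d(
    n: int, k: int, weight: Callable[[int, int], float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split positions 0..n-1 into k contiguous, nonempty clusters.

    weight(i, j) is the value of the cluster covering positions i..j
    (inclusive). Returns the maximum total weight and the clusters as
    (first position, last position) pairs.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")

    neg_inf = float("-inf")
    # best[c][j]: best total weight using c clusters for the first j positions
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # last cluster covers positions i..j-1
                if best[c - 1][i] == neg_inf:
                    continue
                candidate = best[c - 1][i] + weight(i, j - 1)
                if candidate > best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i

    # Recover the cluster boundaries from the recorded cut points.
    clusters = []
    j = n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append((i, j - 1))
        j = i
    clusters.reverse()
    return best[k][n], clusters
```

The sketch calls `weight` on the order of k times n squared times, so caching the cluster weights would help when each evaluation is expensive. The clusters it returns map directly to a hash function: every register in the c-th cluster is assigned shadow register index c.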

V. EXPERIMENTAL RESULTS

A. Hash-Mapped Shadow Register Implementation

To evaluate the actual hardware cost of our proposed architectural extension, we have designed and implemented the hash-mapped shadow register scheme on the Altera Stratix II field-programmable gate arrays (FPGAs).

Fig. 11 shows the block diagram of our implementation. In this design, we implement an eight-entry hash-mapped shadow register. Since modern FPGAs (e.g., the Altera Stratix/Stratix II, the Xilinx Virtex-II Pro/Virtex-4, etc.) are rich in distributed on-chip memories, we group the shadow registers as a register file and map it onto a small-sized on-chip RAM block.4 In Fig. 11, three reconfigurable 5-LUTs (five-input look-up tables) are used to map the core register to the shadow register. The destination address from the core processor is sent to the 5-LUTs. Based on the hash function, the target shadow register index is generated and fed into the WRITE address port of the shadow register file. Note that the additional control bit (0: skip, 1: copy) is extracted in the decoding stage and connected to the WRITE-enable (WE) signal so that the data is selectively written to the shadow register file. In addition, this implementation provides two extra data READ ports (DATAOUT1 and DATAOUT2) to the custom logic, which can fetch data from the two data ports by sending the addresses to the and , respectively. Therefore, the custom instructions will allow up to four input operands.

4 In fact, the Altera Nios II also takes this approach to implement its core register file.
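The write-back behavior described above can be modeled in software as a rough behavioral sketch. It is not the hardware design itself: the class `HashMappedShadowRegs`, its method names, and the use of a plain lookup list to stand in for the contents of the 5-LUTs are all assumptions made for illustration.

```python
from typing import List, Tuple


class HashMappedShadowRegs:
    """Behavioral model of a core register file plus hash-mapped shadow registers."""

    def __init__(self, hash_table: List[int], num_shadow: int = 8) -> None:
        # hash_table[core index] = shadow index; models the LUT contents.
        if any(not 0 <= s < num_shadow for s in hash_table):
            raise ValueError("hash table entry out of shadow register range")
        self.hash_table = list(hash_table)
        self.core = [0] * len(hash_table)
        self.shadow = [0] * num_shadow

    def write_back(self, dest: int, value: int, copy_bit: int) -> None:
        """Write-back stage: always update the core register; copy the result
        to the hash-mapped shadow register only when the control bit is 1."""
        self.core[dest] = value
        if copy_bit:  # control bit acts as the WRITE-enable of the shadow file
            self.shadow[self.hash_table[dest]] = value

    def read_shadow(self, addr1: int, addr2: int) -> Tuple[int, int]:
        """Two extra READ ports offered to the custom logic."""
        return self.shadow[addr1], self.shadow[addr2]
```

For example, with 32 core registers and an illustrative hash of core index modulo 8, writing 111 to core register 9 with the control bit set places 111 in shadow register 1. A later write to core register 17 with the control bit cleared leaves that shadow entry untouched, whereas setting the bit would overwrite it.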

TABLE III. AREA AND TIMING RESULTS OF THE HASHING UNIT

TABLE IV. RESOURCE AND PERFORMANCE OVERHEAD OF HASH-MAPPED SHADOW REGISTERS

We use the Altera Quartus II and SOPC builder1 to synthesize and map our design onto Stratix II FPGAs. Table III shows the area and timing results of the hashing units for the shadow register files with four, six, and eight entries, respectively. Since our hashing unit is a purely combinational circuit, it does not contain any sequential elements such as flip-flops and memories, and it consumes a very small number of adaptive lookup-tables (ALUTs).5 For shadow registers, only ALUTs are required. In the meantime, since the hash unit only has one level of logic, the combinational delay is also small.

We further synthesize the hash-mapped shadow register architecture together with the Altera Nios II soft core processor.1 The final resource and performance overheads of hash-mapped shadow registers are listed in Table IV. In this experiment, we use the fast implementation of Nios II (Nios II/f) and we set the target frequency to be 150 MHz. Compared to the standard Nios II/f implementation, our hash-mapped shadow register architecture introduces fairly small overhead ( 1% ) in both resource usage and final clock frequency.

B. Simulation Results

We implemented our algorithms in a C++/Unix environment. A new step called shadow register binding is performed after application mapping in our compilation flow. The mapped applications with shadow register binding are fed into SimpleScalar, which is configured with the same parameters as shown in Table I, to measure the performance improvement.

By introducing the shadow registers, the number of move operations will be effectively reduced. In our experiment, we apply the simultaneous shadow register binding with hash function generation to allocate the shadow registers. As mentioned earlier, the predetermined many-to-one correspondence enforced by the hash function may cause more conflicts and restrict the opportunities for compiler optimization. The limitation can be compensated for by introducing more hash-mapped shadow registers. Fig. 12 shows the speedup with different shadow register architectures. For the shadow registers with multiple control bits (Sreg) as proposed in [7], three registers are used due to the control bit cost. With three hash-mapped shadow registers (Hsreg) using one control bit, the benchmarks achieve comparable performance thanks to our global register allocation algorithm. By providing more hash-mapped shadow registers, yet still using one control bit, we can further improve the performance. The results shown in Fig. 12 indicate that we can almost reach ideal speedup (within 1.8%) by providing five and eight hash-mapped shadow registers for the three- and four-operand constraints, respectively. However, three shadow registers with multiple control bits [7] still leave a 14.5% performance gap on average for the two constraints.

5 ALUT is the basic reconfigurable logic element of the Altera Stratix II. Each ALUT allows up to five inputs.

Fig. 12. Speedup under different shadow register architectures.

Fig. 13. Speedup under different number of shadow registers and duplication registers.

We further compare our proposed hash-mapped shadow registers with the partial register replication (Rreg) method used in [11]. Fig. 13 shows the speedup under different operand constraints and different numbers of registers. With the same number of registers, the shadow register architecture consistently outperforms partial register replication, which leaves 49% and 46% performance gaps for the three- and four-operand constraints, respectively, even with eight replicated registers. Note that the Rreg architecture also requires an address translator to map the core register index to the replication register index. For an eight-entry partial register replication, this would introduce a very similar area and delay overhead on FPGA platforms.


To examine the impact of the hash function, we also compare our generated hash function with a simple mod function (i.e., mod ). In our experiment, we find that the simple mod function achieves comparable results for most cases. However, it demands more shadow registers to achieve the same speedup. For instance, the simple mod function needs four and three more shadow registers, respectively, to achieve the same speedup for adpcmc and blowfishe with the three-operand constraint.

VI. CONCLUSIONS AND FUTURE WORK

The data bandwidth problem is seriously limiting the performance of ASIPs. In this paper, we provide a quantitative analysis of the data bandwidth problem. A new low-cost architectural extension, which employs the hash-mapped shadow registers, is proposed to directly address the data bandwidth limitation. We also present a simultaneous global register binding with a hash function generation algorithm to fully exploit the benefits of hash-mapped shadow registers across the basic block boundaries. The application of our approach results in a very promising performance improvement.

One limitation of our current single-bit controlled shadow register architecture is that the original instruction set still needs to be changed to encode the extra control bit. In our future work, we shall investigate new techniques to further reduce this overhead. One possible solution is to use a shadow register scheme with unconditional result forwarding, i.e., every instruction will always copy its result to both the core register file and the corresponding shadow register during the WRITE-back stage. Therefore, the additional control bit can be saved. In this zero-bit shadow register architecture, unconditional WRITEs may overwrite values which can be potentially used by the following custom instructions, and thus degrade the system performance. However, by sacrificing the flexibility of gating the shadow registers, we do not need to change the instruction set at all. Moreover, we can still apply our proposed shadow register binding and hash function generation algorithms to this zero-bit shadow register architecture. According to our preliminary experiment results, the zero-bit shadow register scheme only degrades the speedup by 9% and 12% under the three- and four-input constraints, respectively, when compared to the single-bit scheme.

REFERENCES

[1] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques and Tools. Reading, MA: Addison-Wesley, 1986.

[2] C. J. Alpert and A. B. Kahng, “Multi-way partitioning via spacefilling curves and dynamic programming,” in Proc. 31st Des. Autom. Conf., 1994, pp. 652–657.

[3] M. Arnold and H. Corporaal, “Designing domain-specific processors,” in Proc. Int. Conf. Hardw. Softw. Codes., 2001, pp. 61–66.

[4] K. Atasu, L. Pozzi, and P. Ienne, “Automatic application-specific instruction-set extensions under microarchitectural constraints,” in Proc. 40th Des. Autom. Conf., 2003, pp. 256–261.

[5] D. Burger, T. Austin, and S. Bennett, “Evaluating future microprocessors: the simplescalar toolset,” Univ. Wisconsin, Madison, Tech. Rep. CS-TR96-1308, 1996.

[6] J. Cong, Y. Fan, G. Han, and Z. Zhang, “Application-specific instruction generation for configurable processor architectures,” in Proc. ACM Int. Symp. Field-Program. Gate Arrays, 2004, pp. 183–189.

[7] J. Cong et al., “Instruction set extension for configurable processors with shadow registers,” in Proc. ACM Int. Symp. Field-Program. Gate Arrays, 2005, pp. 99–106.

[8] J. Cong, G. Han, and Z. Zhang, “Architecture and compilation for data bandwidth improvement in configurable embedded processors,” in Proc. Int. Conf. Comput.-Aided Des., 2005, pp. 263–270.

[9] D. Fischer, J. Teich, M. Thies, and R. Weper, “Efficient architecture/compiler co-exploration for asips,” in Proc. Int. Conf. Compilers, Arch., Synth. Embedded Syst., 2002, pp. 27–34.

[10] M. R. Guthaus et al., “MiBench: A free, commercially representative embedded benchmark suite,” in Proc. IEEE 4th Workshop Workload Characterization, 2001, pp. 83–94.

[11] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, “The chimaera reconfigurable functional unit,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 2, pp. 206–217, Feb. 2004.

[12] P. Ienne, L. Pozzi, and M. Vuletic, “On the limits of processor specialisation by mapping dataflow sections on ad-hoc functional units,” Comput. Sci. Dept., Swiss Federal Inst. Technol. Lausanne, Lausanne, Switzerland, Tech. Rep. 01/376, 2001.

[13] R. Kastner, A. Kaplan, S. Ogrenci Memik, and E. Bozorgzadeh, “Instruction generation for hybrid reconfigurable systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 7, pp. 605–627, Oct. 2002.

[14] K. Keutzer, S. Malik, and A. R. Newton, “From ASIC to ASIP: The next design discontinuity,” in Proc. Int. Conf. Comput. Des., 2002, pp. 84–90.

[15] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “Mediabench: A tool for evaluating multimedia and communications systems,” in Proc. 30th Int. Symp. Microarch., 1997, pp. 330–335.

[16] M. Poletto and V. Sarkar, “Linear scan register allocation,” ACM Trans. Program. Lang. Syst., vol. 21, no. 5, pp. 895–913, Sep. 1999.

[17] S. Rixner et al., “Register organization for media processing,” in Proc. 6th Int. Symp. High-Performance Comput. Arch., 2000, pp. 375–386.

[18] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “Synthesis of custom processors based on extensible platforms,” in Proc. Int. Conf. Comput.-Aided Des., 2002, pp. 256–261.

[19] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “A scalable application-specific processor synthesis methodology,” in Proc. Int. Conf. Comput.-Aided Des., 2003, pp. 283–290.

[20] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, “CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit,” in Proc. 27th Ann. Int. Symp. Comput. Arch., 2000, pp. 225–235.

Jason Cong (F’00) received the B.S. degree in computer science from Peking University, Beijing, China, in 1985, and the M.S. and Ph.D. degrees in computer science from the University of Illinois at Urbana-Champaign, Urbana-Champaign, in 1987 and 1990, respectively.

Currently, he is a Professor and the Chairman of the Computer Science Department, University of California, Los Angeles (UCLA). He is also a co-director of the VLSI computer-aided drafting (CAD) Laboratory. His research interests include computer-aided design of VLSI circuits and systems, design and synthesis of system-on-a-chip, programmable systems, novel computer architectures, nano-systems, and highly scalable algorithms. He has published over 230 research papers and led over 30 research projects supported by DARPA, NSF, SRC, and a number of industrial sponsors in these areas. He served on the technical program committees and executive committees of many conferences, such as ASPDAC, DAC, FPGA, ICCAD, ISCAS, ISPD, and ISLPED, and several editorial boards, including the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS and the ACM Transactions on Design Automation of Electronic Systems. He served on the ACM SIGDA Advisory Board, the Board of Governors of the IEEE Circuits and Systems Society, and the Technical Advisory Board of a number of EDA and silicon IP companies, including Atrenta, eASIC, Get2Chip, Magma Design Automation, and Ultima Interconnect Technologies. He was the founder and President of Aplus Design Technologies, Inc., until it was acquired by Magma Design Automation in 2003. Currently, he serves as the Chief Technology Advisor of Magma and AutoESL Design Technologies, Inc. Additionally, he has been a guest professor at Peking University since 2000.

Dr. Cong has received a number of awards and recognitions, including the Best Graduate Award from Peking University in 1985, the Ross J. Martin Award for Excellence in Research from the University of Illinois at Urbana-Champaign in 1989, the NSF Young Investigator Award in 1993, the Northrop Outstanding Junior Faculty Research Award from UCLA in 1993, the ACM/SIGDA Meritorious Service Award in 1998, and the SRC Technical Excellence Award in 2000. He also received three best paper awards including the 1995 IEEE Transactions on CAD Best Paper Award, the 2005 International Symposium on Physical Design Best Paper Award, and the 2005 ACM Transaction on Design Automation of Electronic Systems Best Paper Award.

Guoling Han received the B.S. and M.S. degrees in computer science from Peking University, Beijing, China, in 1999 and 2002, respectively. He is currently pursuing the Ph.D. degree in computer science at the University of California, Los Angeles.

His current research interests include platform-based system-level synthesis and compilation techniques for reconfigurable systems.

Zhiru Zhang (S’02) received the B.S. degree in computer science from Peking University, Beijing, China, in 2001, and the M.S. degree in computer science from the University of California, Los Angeles (UCLA), in 2003. He is currently pursuing the Ph.D. degree in computer science at UCLA.

His current research interests include platform-based hardware/software co-design for embedded systems, communication-centric behavioral synthesis, and compilation techniques for reconfigurable systems.
