
Efficient Compilation for Queue Size Constrained Queue Processors

Arquimedes Canedo a,∗, Ben A. Abderazek b, Masahiro Sowa c

a IBM, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan
b University of Aizu, Aizu-Wakamatsu, Fukushima-ken 965-8580, Japan
c University of Electro-Communications, Graduate School of Information Systems, Chofugaoka 1-5-1, Chofu-shi 182-8585, Japan

∗ Corresponding author. Email address: [email protected] (Arquimedes Canedo). URL: http://www.sowa.is.uec.ac.jp

Abstract

Queue computers use a FIFO data structure for data processing. The essential characteristics of a queue-based architecture satisfy the demands of embedded systems well: a compact instruction set, simple hardware logic, high parallelism, and low power consumption. The size of the queue is an important concern in the design of a realizable embedded queue processor. We introduce the relationship between parallelism, the length of data dependency edges in data flow graphs, and the queue utilization requirements. This paper presents a technique developed to make the compiler aware of the size of the queue register file and thus optimize programs to effectively utilize the available hardware. The compiler examines the data flow graph of a program and partitions it into clusters whenever it exceeds the queue limits of the target architecture. The presented algorithm deals with the two factors that affect the utilization of the queue, namely parallelism and the length of variables' reaching definitions. We analyze how the quality of the generated code is affected for SPEC CINT95 benchmark programs and different queue size configurations. Our results show that, for reasonable queue sizes, the compiler generates code that is comparable to the code generated for infinite resources in terms of instruction count, static execution time, and instruction level parallelism.

Key words: Queue Register File, Queue Processor, Constrained, Optimization, Compiler

1 Introduction

Queue-based computers are a novel and viable alternative for high-performance embedded systems and general purpose processors. Several queue computation models have already been proposed [1,2,43,38,42,33,46,35]. Queue computers use high speed registers organized as a first-in first-out (FIFO) queue to perform operations. Data is written, or enqueued, at the tail of the queue, and elements are read, or dequeued, at the head of the queue. Two hardware pointers are kept by the processor to track the positions of the head and tail; these pointers are referred to as QH and QT. Conventional compilation techniques for random access register machines cannot be used to generate queue programs, since the FIFO queue demands a strict ordering of operands and operations [10]. A level-order traversal of the data flow graph gives the order in which data should be enqueued, dequeued, and processed [35].

The queue computation model features unique characteristics that make it very attractive for addressing the problems of current computer design: simple hardware structures, low power consumption, and exploitation of instruction level parallelism. The instructions of a queue processor consist only of an opcode, as operand reads and writes are done implicitly through the queue head and queue tail pointers. Queue machines are very similar to stack machines [24,31], with the notable exception that the queue computation model does not suffer from the performance bottleneck created at the top of the stack [44]. Queue machines facilitate parallel execution of programs, as one end of the queue is used exclusively for reading and the other end exclusively for writing. Furthermore, queue programs are generated from a level-order scheduling that exposes all parallelism available in the data flow graph. Having groups of data independent instructions permits the hardware to use a smaller instruction window than conventional RISC machines and potentially consume less power [6]. Another advantage of queue machines over conventional architectures is the absence of the false dependencies that, in register machines, are introduced by the tight coupling between the instruction set and the architected registers. Hence, queue processors do not need power hungry structures such as register renaming [43,2,1]. Furthermore, the implicit operand access allows instructions to have a small encoding, simplifying fetching and decoding, reducing the memory bandwidth, and reducing power consumption [7,40,15,16,23].

At any point in the execution of a program, the queue length is the number of elements between QH and QT. Every statement in a program may have different queue length requirements, and the hardware should provide enough words in the FIFO queue to hold the values and evaluate the expression. We developed the Queue Compiler Infrastructure [10] as part of the design space exploration tool-chain for the QueueCore processor [1]. The original queue compiler targets an abstract queue machine with unlimited resources, including an infinite queue register file. Under this assumption, we measured the queue length requirements of the SPEC CINT95 applications. Figure 1 shows that 95% of the statements in the programs require fewer than 32 queue words for their evaluation, while the remaining 5% demand a queue size between 32 and 363 words. In our previous work [9] we gained insight into how queue length is mainly affected by two program characteristics: parallelism and soft edges. Soft edges represent the lifetime, in queue words, of the reaching definitions of variables; graphically, a soft edge is an edge that spans more than one level in the data flow graph. Table 1 shows the maximum queue requirements for the peak parallelism and maximum def-use length in SPEC CINT95 programs compiled for an infinite queue. The table demonstrates that a reasonable and realizable amount of queue is needed in queue processors to execute the programs without performance penalty. However, assistance from the compiler is required to schedule the programs in such a way that parallelism and soft edges comply with the queue register file size of a realistic queue processor.

Fig. 1. Queue size requirements. The graph quantifies the amount of queue required to execute statements in the SPEC CINT95 benchmarks. A point (x, y) denotes that y% of the statements in the program require x, or fewer, queue words to evaluate the expression.

This paper presents an optimizing compiler that partitions the data flow graphs of programs into clusters of constant parallelism and limited soft edge length that can be executed on a queue processor with a limited queue register file. The compiler is also responsible for generating clusters that obey

the semantics of the queue computation model. The proposed algorithm was implemented in the queue compiler infrastructure [10,8] and affects compile time by a negligible amount. The goal of this paper is to estimate how the characteristics of the output code are affected when the available queue is constrained. We estimate how the critical path, available parallelism, and program length of the SPEC CINT95 benchmarks are affected for different size configurations of the queue register file. The contributions of this work are:

• This is the first study, to the best of our knowledge, that estimates the performance of a queue processor with a limited number of queue words.
• The development of an efficient compiler algorithm that partitions the data flow graph into clusters that demand no more queue than what is available in the underlying architecture. This is achieved by limiting the parallelism and the length of reaching definitions in the data flow graph.

Section 2 gives a summary of related work. Section 3 introduces the queue computation model, a producer order queue processor, and describes the queue compiler infrastructure. Section 4 presents the algorithm used to partition the data flow graph of programs into clusters of fixed queue utilization. An analysis of the experimental results is given in Section 5. Section 6 concludes this paper.

Table 1
Characteristics of programs that affect the queue length in queue-based computers

Benchmark     Peak Parallelism  Max. def-use
099.go        20                19
124.m88ksim   29                19
126.gcc       35                56
129.compress  9                 10
130.li        17                18
132.ijpeg     26                24
134.perl      15                15
147.vortex    49                14

2 Related Work

The benefits and simplicity of 0-operand machines have been considered since the late 1950s. Several computers implement a LIFO stack in hardware to perform computations [5,17,24], and several computer languages, compilers, interpreters, and virtual machines have been inspired by this computation model [36,29,27,22]. Practice has shown that the performance of the stack model is limited by the bottleneck created at the top of the stack, which is the only place to read and write operands [44,39]. In contrast to stack computing, the queue computing model offers a parallel model with the same instruction format and characteristics. Our previous work [9] showed that non-optimized queue code has about 13% more parallelism than optimized register code for an 8-way issue universal register machine. Surprisingly, only a handful of queue processors have been proposed in the literature.

In [35], Preiss et al. proposed the first queue computer design together with the theory behind compiling expressions for the queue computation model. They demonstrated that a level-order scheduling of an expression's parse tree generates the sequence of instructions for correct evaluation. A level-order scheduling of directed acyclic graphs (DAGs) still delivers the correct sequence of instructions but requires additional hardware support. This hardware modification is called an Indexed Queue Machine. The basic idea is to specify, for each instruction, the location with respect to the head of the queue (an index) where the result of the instruction will be used; an instruction may include several indexes if it has multiple parent nodes. The instruction format of the Indexed Queue Machine resembles a data flow computer [11]. All these ideas concerned an abstract machine until the hardware mechanism of a superscalar queue machine was proposed by Okamoto [33]. This superscalar queue machine realized the abstract design as a hardware implementation capable of executing instructions in parallel. The problem of generating code for a single queue has been demonstrated to be NP-complete [18,19]. In [38], Schmit et al. proposed a heuristic algorithm to cover any DAG in one queue by adding special instructions to the data flow graph. Their experimental results report a large number of additional instructions, making this technique insufficient for achieving small code size. Despite the large number of extra instructions, the resulting size of their tested programs is smaller than a RISC design.

The concept of an operand queue has been used as supporting hardware for the efficient execution of loops in two register-based processors. The WM architecture [46] is a register machine that reserves one of its registers for accessing the queue and demonstrates high stream processing capabilities. The compiler for the WM machine [4] was developed to support streaming as an extension of the access/execute computation model. In [42,14], the use of Register Queues (RQs) effectively reduces the register pressure on software pipelined loops. The compiler techniques developed for these processors rely on conventional register transfer intermediate languages, treating the queues as a special purpose set of registers.

In our previous work [43,2,1], we investigated and designed a producer order parallel queue processor (QueueCore), which is capable of executing any data flow graph. Our model relaxes the strict dequeueing rule by allowing operands to be read from a location other than the head of the queue. This location is specified as an offset in the instruction. The fundamental difference from the Indexed Queue Machine is that our design specifies an offset reference in the instruction for reading operands instead of specifying an index for writing operands. In the QueueCore's instruction set, the writing location at the rear of the queue remains fixed for all instructions. To realize QueueCore as an actual processor, we must explore how the size of the queue register file affects performance. None of the previous work on queue computers has considered a constrained queue register file. We know from more than fifty years of experience with register machines that the size of the register file directly affects the overall performance of a computer system. Many works have proposed optimizing the register file to improve execution time [47,28,13], parallelism [45,34,30], power consumption [3,37,25,41], and hardware complexity [21,20], among other goals [26,48].

3 Overview of Queue Computing

The Queue Computation Model (QCM) refers to the evaluation of expressions using a first-in first-out queue. Read and write operations are done through the head (QH) and tail (QT) of the queue. For every instruction, the hardware calculates the correct location of its operands, thus making correct execution possible. Queue length is the number of elements between QH and QT at a given state of execution. Queue length is tightly related to the way queue programs are generated. To evaluate any expression, the directed acyclic graph (DAG) of the expression should be scheduled in a level-order manner [35]. A level-order scheduling visits all nodes in a DAG from the deepest level towards the root, as shown in Figure 2(a). All nodes belonging to the same level Li are data independent and can be executed in parallel. The queue length at any point in a program's execution can be measured by the number of elements in level Li. Figure 2(c) shows the position of QH and QT for each level in the expression, together with the queue length requirements: for L3 the requirement is three queue words, for L2 it is two, for L1 it is one, and for L0 it is zero. The arithmetic instructions encode two operands that specify the locations, relative to QH, from which the operands should be read. This architectural modification is necessary to allow the execution of any data flow graph; the details are discussed elsewhere [43,2]. This kind of queue processor is called a producer order queue processor.

Fig. 2. Queue program characteristics. (a) Level-order traversal of the DAG for x = (a + b) / (b - c): level L3 holds a, b, c; L2 holds + and -; L1 holds /; L0 holds the store to x. (b) Producer order queue instructions: ld a; ld b; ld c; add 0, 1; sub -1, 0; div 0, 1; st x. (c) Queue contents and queue length at each execution level (three words at L3, two at L2, one at L1).
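
To make the level-order scheduling concrete, the following Python sketch levelizes a small expression DAG and emits its instructions deepest level first. It is a minimal illustration of the queue computation model under our own naming; the producer order offset references shown in Figure 2(b) are omitted for brevity.

    from collections import defaultdict

    # DAG for x = (a + b) / (b - c); operand b is shared by + and -.
    children = {
        "a": [], "b": [], "c": [],
        "+": ["a", "b"], "-": ["b", "c"],
        "/": ["+", "-"],
        "st x": ["/"],
    }

    def levelize(root):
        # Place each node at the deepest level from which it is reachable;
        # the root is L0 and the leaf operands end up at the deepest level.
        level = {}
        def visit(n, depth):
            if level.get(n, -1) < depth:
                level[n] = depth
                for c in children[n]:
                    visit(c, depth + 1)
        visit(root, 0)
        return level

    def schedule(root):
        # Level-order scheduling: all nodes of the deepest level first,
        # then the next level up, until the root is reached.
        by_level = defaultdict(list)
        for n, depth in levelize(root).items():
            by_level[depth].append(n)
        program = []
        for depth in sorted(by_level, reverse=True):
            program += [f"ld {n}" if not children[n] else n for n in by_level[depth]]
        return program

    print(schedule("st x"))  # ['ld a', 'ld b', 'ld c', '+', '-', '/', 'st x']

Each level contributes only data independent instructions, which is what lets a parallel queue processor issue them together.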

3.1 Target Architecture: QueueCore Processor

The QueueCore processor [1] implements a producer order instruction set architecture. Each instruction can encode a maximum of two operands that specify the locations in the queue register file from which to read the operands. For each instruction, the processor determines the physical location of the operands by adding the offset reference in the instruction to the current position of the QH pointer. A special unit called the Queue Computation Unit is in charge of finding the physical location of source operands and their destination within the queue register file, thus allowing parallel execution of instructions. Every QueueCore instruction is 16 bits wide. For cases where there are insufficient bits to express large constants, memory offsets, or offset references, a covop instruction is inserted. This special instruction extends the operand field of the following instruction by concatenating its own operand to it. The queue register file size of the QueueCore processor is set to 256 words.
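
The effect of covop can be pictured at the bit level. The field width below is an illustrative assumption, not the actual QueueCore encoding:

    OPERAND_BITS = 8  # assumed width of an instruction's operand field

    def extend_operand(covop_operand: int, insn_operand: int) -> int:
        # The covop carries the high-order bits; the following instruction
        # supplies the low-order bits. Concatenation yields the full value.
        return (covop_operand << OPERAND_BITS) | (insn_operand & ((1 << OPERAND_BITS) - 1))

    # A constant 0x1234 too large for one operand field is emitted as a
    # covop holding 0x12 followed by an instruction whose operand is 0x34.
    assert extend_operand(0x12, 0x34) == 0x1234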

3.2 Compiling for QueueCore

In our previous work [8], we developed the compiler infrastructure for the QueueCore processor [1]. The compiler translates C programs into QueueCore assembly. An important task of the queue compiler is to schedule the program in a level-order manner and correctly compute the offset instruction references [10]. Figure 3 shows the block diagram of the queue compiler, including the proposed clusterization pass. The front-end of the compiler is based on GCC 4.0.2; it parses C files into abstract syntax trees (ASTs) using a high level intermediate representation called GIMPLE [32]. The GIMPLE representation is a three-address code suitable for code generation for register machines, but not for queue machines. As the level-order scheduling traverses the full DAG of an expression, we reconstruct GIMPLE trees into trees of arbitrary depth and width called QTrees. During the expansion of GIMPLE into QTrees, we also translate high-level constructs such as aggregate types into their low level representation in QueueCore instructions. After having fully expanded the DAGs into QTrees, we build the core data structure, the leveled DAG (LDAG) [18]. LDAGs allow the code generation algorithm of the queue compiler to determine the offset references for every instruction. The offset calculation phase consists of determining, for every instruction, the current location of QH and measuring the distance to the instruction's operands. The measured distance is the offset reference value, and it is annotated in the LDAG. The LDAGs are then level-order scheduled, and a linear low-level intermediate representation (QIR) is emitted. The last phase of the compiler is the generation of assembly code from the QIR representation.

Under the queue computation model, the queue utilization requirements are given by the number of nodes at every level of computation. In the following section, we propose an algorithm that partitions expressions into clusters to reduce queue length in such a way that the rules of queue computing are preserved. We chose LDAGs as the input of the algorithm, since at that point all dependency edges between operations and operands are determined.
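
As a rough illustration of the QTree-to-LDAG step, the sketch below hashes structurally identical subtrees so that a shared operand appears once per level, and records the deepest level at which each value occurs. The data structures are simplified stand-ins, not the compiler's actual implementation:

    from collections import defaultdict

    class QTree:
        # A QTree node: an operator (or leaf operand) with child subtrees.
        def __init__(self, op, *kids):
            self.op, self.kids = op, kids

    def build_ldag(root):
        level = {}  # subtree key -> deepest level of occurrence
        name = {}   # subtree key -> printable operator name

        def visit(t, depth):
            key = (t.op,) + tuple(visit(k, depth + 1) for k in t.kids)
            name[key] = t.op
            level[key] = max(level.get(key, 0), depth)
            return key

        visit(root, 0)
        by_level = defaultdict(list)
        for key, depth in level.items():
            by_level[depth].append(name[key])
        return dict(by_level)

    # x = (a + b) / (b - c): the shared leaf b is merged into one node.
    t = QTree("/", QTree("+", QTree("a"), QTree("b")),
                   QTree("-", QTree("b"), QTree("c")))
    print(build_ldag(t))  # {2: ['a', 'b', 'c'], 1: ['+', '-'], 0: ['/']}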

Fig. 3. Queue Compiler Infrastructure. Block diagram showing the phases (front-end, 1-offset code generation, offset calculation, instruction scheduling, the added clusterize pass, and assembly generation) and the intermediate representations (ASTs, QTrees, leveled DAGs, QIR) used along the compilation from a C source file to QueueCore assembly.

4 Algorithm for Queue Register File Constrained Compilation

Queue length refers to the number of elements stored between QH and QT at some computation point. We have shown how queue length can be determined by counting the number of elements in a computation level. Nevertheless, DAGs often present cases where this assumption is not enough to estimate the queue length requirements of an expression. Consider the DAG shown in Figure 4 for the multiply-accumulate operation commonly used in signal processing, "y[n] = y[n] + x[i] * b[i]". Notice that some edges span more than one level (soft edges). For example, the edge with its source at node "sst" and its sink at the "+" node of L3 spans three levels. Soft edges increase the queue length requirements, since the sink node must be kept in the queue until the source node is executed. For the given example, the maximum queue length requirement of the DAG is five queue words and is imposed by the longest soft edge. The algorithm must deal with two different conditions that directly affect the queue length requirements of an expression: the width of computation levels and the length of soft edges. The first can be solved by splitting the level into manageable sized blocks; the second can be solved by re-scheduling the child's subtree. The order in which these actions are performed affects the quality of the output DAGs and, therefore, the quality of the generated code.

If the levels are first split and then the subtrees are re-scheduled, the second action affects the length of the levels in the final DAG, and the first transformation must be performed one more time to guarantee that all levels comply with the target queue length. If the order of the actions is inverted, then the DAG is expanded into a tree and all subexpressions have to be recomputed, since all subtrees are completely expanded; this hurts performance and code size due to the redundant instructions. We propose an algorithm that deals with these problems in a unified manner. Our integrated solution reduces subexpression re-scheduling and minimizes the insertion of spill code.

Fig. 4. Queue length is determined by the width of levels and the length of soft edges. (The LDAG of y[n] = y[n] + x[i] * b[i] spans levels L0 through L6.)
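
A simplified estimate of an LDAG's queue length requirement counts, for each level, the values produced at that level plus the values from deeper levels that are still alive because a soft edge crosses it. The sketch below is our own approximation of this rule, not the compiler's exact bookkeeping:

    def queue_requirement(level, edges):
        # level: node -> level number (root = 0, leaves deepest).
        # edges: (consumer, producer) pairs with level[p] > level[c]; the
        # produced value occupies a queue word at every level in between.
        deepest = max(level.values())
        need = 0
        for l in range(deepest, -1, -1):  # levels execute deepest-first
            width = sum(1 for d in level.values() if d == l)
            crossing = sum(1 for c, p in edges if level[p] > l > level[c])
            need = max(need, width + crossing)
        return need

    # A value produced at level 3 and consumed at level 0 (a soft edge
    # spanning three levels) keeps one extra word alive at levels 2 and 1.
    level = {"p": 3, "q": 3, "r": 2, "s": 1, "root": 0}
    edges = [("root", "p"), ("s", "q"), ("r", "q"), ("s", "r")]
    print(queue_requirement(level, edges))  # 3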

4.1 Data Flow Graph Clusterization

The main task of the clusterization algorithm is to reduce a DAG's queue length requirements by splitting it into clusters of a specified size. The algorithm must partition the DAG in such a way that every cluster is semantically correct in terms of the queue computing model. Partitioning involves the addition of extra code to communicate the intermediate values computed in different clusters; our algorithm uses memory for this communication. The input of the algorithm is an LDAG data structure. For the queue compiler, a cluster is defined as an LDAG with spill code that communicates intermediate values to other clusters through memory. Keeping clusters as LDAGs allows the implementation to use the same infrastructure, and the later phases of the queue compiler remain free of modification.

The algorithm is divided into two phases: labeling and spill insertion. The labeling phase groups the DAG subtrees into clusters in order to preserve the rules of queue computing. For any given DAG or subtree W rooted at node R, the width of W is verified to be smaller than the threshold. The threshold is the size of the queue register file for which the compiler should generate constrained code. If the condition is true, then all nodes in W are labeled with a unique identifier called the cluster ID. In cases where the width of the DAG exceeds the threshold, the DAG is recursively partitioned in a post-order manner, i.e. starting from the left child and then the right child of R. The labeling algorithm is listed in Algorithm 1. To measure the width of a subtree W, the DAG is traversed as a tree and the level with the most elements determines the width of W; a sketch of this measurement follows below. Notice that when the DAG rooted at "sst" in Figure 4 is traversed as a tree, the maximum width is encountered in level L5, with six elements corresponding to the nodes n, size, x, ∗, b, ∗. For simplicity of explanation, assume that the threshold equals 2. Since SubTree_Width(sst) > Threshold, the partitioning algorithm recurses on the left hand side node "+" at L3. The width of the "+" subtree is 2, equal to the threshold. Thus, all nodes belonging to the subtree rooted at the "+" node at L3 are marked with cluster ID = 1 by line 14 of Algorithm 1. The algorithm continues with the rest of the DAG until all nodes have been traversed and assigned to a cluster. The output of the labeling phase is a labeled DAG, as shown in Figure 5. Four clusters are shown in the figure: the first cluster is rooted at the "+" node in L3, the second cluster at the "lds" node in L2, the third cluster at the "lds" node in L3, and the fourth cluster at the "sst" node in L0.
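
The width measurement mentioned above can be sketched compactly: the DAG is expanded as a tree (shared nodes are revisited) and the visits are counted per level. The names here are illustrative:

    from collections import defaultdict

    def subtree_width(root, children):
        # Traverse the DAG rooted at `root` as a tree, counting every visit
        # at each depth; the most populated level is the subtree's width.
        count = defaultdict(int)
        def visit(n, depth):
            count[depth] += 1
            for c in children.get(n, []):
                visit(c, depth + 1)
        visit(root, 0)
        return max(count.values())

For the DAG of Figure 4, this tree expansion of "sst" counts shared subtrees once per use, which is how level L5 reaches a width of six.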

Fig. 5. Output of the labeling phase of the clusterization algorithm.

The spill insertion phase is the second and final phase of the algorithm. The annotated LDAG from the previous phase is processed, and a list of N clusters (a cluster set) is generated as the output. The input annotated DAG is traversed in a post-order manner. For every node visited in the traversal, a set of actions is performed to: (1) assign the node to the corresponding cluster, (2) insert reload operations to retrieve temporaries computed in a different cluster, and (3) insert operations to spill temporaries used by different clusters.

Assigning nodes to the corresponding cluster involves the creation of LDAG data structures, node information, and data dependency edges. Using the queue compiler's LDAG infrastructure [8] allows the clusterization algorithm to be implemented in a clean and simple manner. In terms of memory complexity, only the addition of a list of length N is required in the compiler to generate the clusters, where N is the number of clusters discovered by the labeling phase.

Algorithm 1 labelize(LDAG W)
Require: Threshold
1: root ← W's root node
2: if SubTree_Width(root) > Threshold then
3:   lhs ← labelize(root.lhs)
4:   if SubTree_Width(root.rhs) > Threshold then
5:     rhs ← labelize(root.rhs)
6:     root ← Assign_ID_to_node(rhs.id)
7:     return root
8:   else
9:     root.rhs ← Assign_ID_to_subtree(root.rhs)
10:    root ← Assign_ID_to_node(root.rhs)
11:    return root
12:  end if
13: else
14:  root ← Assign_ID_to_subtree(root)
15:  return root
16: end if

Spill code is inserted in two situations: to deal with intermediate results used by different clusters, and to solve the problem of soft edges that span more than one level and demand more queue than what was specified by the threshold (whether within the same cluster or across different clusters). Only subexpressions are spilled to memory and reloaded. Variables and constants that are used by multiple nodes are only reloaded, since a spill/reload pair would require an extra instruction and extra memory space for temporaries. For every node, the algorithm detects that an operation u needs an operand v to be reloaded whenever the cluster identifiers of the node and the operand differ, ID(u) ≠ ID(v), or a soft edge longer than the threshold with its source at node u exists. After the reload detection, node u is analyzed for spilling as follows: if node u is a subexpression and has more than one parent node, then a spill operation is inserted.
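
The two decisions can be phrased as small predicates. The sketch below paraphrases the rules above in executable form; the Node fields are hypothetical stand-ins for the compiler's internal data:

    from dataclasses import dataclass, field

    THRESHOLD = 2  # available queue words, a compile-time option

    @dataclass
    class Node:
        cluster_id: int
        is_subexpression: bool = False
        parents: list = field(default_factory=list)

    def needs_reload(u: Node, v: Node, soft_edge_len: int = 0) -> bool:
        # Operand v is reloaded when it lives in another cluster, or when
        # the soft edge from u to v is longer than the available queue.
        return u.cluster_id != v.cluster_id or soft_edge_len > THRESHOLD

    def needs_spill(u: Node) -> bool:
        # Only subexpressions with multiple consumers are spilled; variables
        # and constants are simply reloaded at each use.
        return u.is_subexpression and len(u.parents) > 1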

Figure 6 shows the clusters generated for the example in Figure 5. Four clusters are generated after the spill code is inserted. The gray nodes in the figure represent the nodes that are spilled to memory. Notice that the node "size" has two parents in Figure 5, but no temporary is generated, as it is not a subexpression but a constant known at compilation time. The rectangle nodes represent the reload operations needed to retrieve the computed subexpressions from other clusters, as well as variables and constants. All four clusters in the figure comply with the requirement of not exceeding a queue utilization of two. For this example, the penalty of compiling for a queue register file size of two words is the insertion of ten extra instructions: four spills and six reloads.

Fig. 6. Output of the clusterization algorithm. Spill nodes are marked with gray circles and reload operations are represented by rectangles.

Algorithm 2 lists the actions performed to generate a set of clusters over the annotated LDAG. As clusters have the same shape as LDAGs, we can use the queue compiler infrastructure to generate code directly from the cluster set. Each cluster is treated as an LDAG, and the code generator [8] calculates the offset references for all instructions, including spill code. Moreover, with the described clusterization algorithm the queue compiler internals remain untouched, and the compilation flow remains the same as in the original compiler.

Clusters are connected to each other by data dependency edges. The order in which the clusters are scheduled is very important to preserve the correctness of the program. We build a cluster dependence graph (CDG) to facilitate code generation. The CDG for the example given above is shown in Figure 7. Cluster 1 must be scheduled for execution first, followed by clusters 2 and 3, and finally cluster 4. Some clusters are independent from each other, like clusters 2 and 3, and can be scheduled in any order. In this paper, we schedule the clusters in the same order as they are discovered by the labeling algorithm; however, we note that this may present an opportunity for further optimization.

Fig. 7. Cluster Dependence Graph (CDG). Cluster 1 precedes clusters 2 and 3, which both precede cluster 4.
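
Any topological order of the CDG is a legal cluster schedule. A sketch over the example's diamond-shaped CDG, with the graph encoding ours:

    from graphlib import TopologicalSorter

    # Map each cluster to the set of clusters whose results it consumes.
    predecessors = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}}

    schedule = list(TopologicalSorter(predecessors).static_order())
    print(schedule)  # e.g. [1, 2, 3, 4]; 2 and 3 may appear in either order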

Algorithm 2 clusterize(node u, LDAG W)
Require: An empty cluster set C of N elements
1: /* Traverse as a DAG */
2: if AlreadyVisited(u) then
3:   return NIL
4: end if
5: /* Action 1: add to the corresponding cluster */
6: ClusterSet_Add(C, ID(u), u)
7: /* Action 2: generate reloads */
8: for all children v of u do
9:   if ID(v) ≠ ID(u) then
10:    GenReload(C, ID(u), v)
11:  else if isSoftEdge(u, v) AND EdgeLength(u, v) > Threshold then
12:    GenReload(C, ID(u), v)
13:  else
14:    /* Post-order traversal */
15:    clusterize(v, W)
16:  end if
17: end for
18: /* Action 3: generate spills */
19: if Parents(u) > 1 AND isSubexpression(u) then
20:  GenSpill(C, ID(u), u)
21: end if
22: /* Mark visited and return */
23: MarkVisited(u)
24: return u

5 Results

The primary concern of this study is to analyze how the quality of the generated programs is affected when a program is constrained to a limited number of queue words. We concentrate on three aspects of the output programs: (1) instruction count, (2) critical path, and (3) instruction level parallelism. Instruction count is the number of generated instructions, including spill code and reloads. The critical path refers to the height of the program's data flow graph, given by the number of queue computation levels; this metric provides a compile-time estimation of the execution time of the program on a parallel queue processor. The instruction level parallelism in a queue system is estimated as the average number of instructions per computation level of the data flow graph of a program.
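
Given a program's computation levels, the three metrics reduce to simple counts. A minimal sketch, with the level representation ours:

    def code_quality_metrics(levels):
        # levels: one list of instructions per computation level.
        icount = sum(len(level) for level in levels)   # instruction count
        critical_path = len(levels)                    # static time estimate
        ilp = icount / critical_path                   # avg. instructions/level
        return icount, critical_path, ilp

    # The Figure 2 program: 7 instructions over 4 levels give an ILP of 1.75.
    levels = [["ld a", "ld b", "ld c"], ["add", "sub"], ["div"], ["st x"]]
    print(code_quality_metrics(levels))  # (7, 4, 1.75)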

The methodology used to perform the experiments is as follows. We successfully implemented the presented algorithm in the queue compiler infrastructure [8]. The threshold input value for the algorithm is given as a compiler option. For all experiments, the compiler was configured only with the presented clusterization algorithm; no other optimizations are currently available in the queue compiler. We compiled the SPEC CINT95 benchmark programs [12] with threshold values of 2, 4, 8, 16, 32, and infinity.

Table 2 quantifies the compilation time cost of the presented algorithm. The second column (LOC) shows the lines of C code of the input programs. The third column (Baseline) shows the compilation time with constrained compilation disabled. The rightmost column (Constrained) shows the compilation time taken by the queue compiler with constrained compilation enabled and a threshold of 2. This threshold value is the worst-case configuration for the algorithm, as the available queue is only two words. The table demonstrates that the penalty of this optimization affects the complexity of the queue compiler negligibly. The compilation time is the real time on a dual 3.2 GHz Xeon computer running GNU/Linux 2.6.20. The compiler was bootstrapped with debugging facilities and without optimizations.

Table 2
Estimation of constrained compilation complexity, measured as compile time for the SPEC CINT95 benchmark programs with the threshold set to two.

Benchmark     LOC     Baseline  Constrained
099.go        28547   9.34s     9.35s
124.m88ksim   17939   9.58s     9.69s
126.gcc       193752  42.67s    43.39s
129.compress  1420    0.37s     0.38s
130.li        6916    3.20s     3.27s
132.ijpeg     27852   8.88s     9.10s
134.perl      23678   6.92s     7.26s
147.vortex    52633   18.73s    19.02s

5.1 Qualitative Analysis of the Output Code

5.1.1 Instruction Count

The most evident effect of the clusterization algorithm on the output code is in the instruction count. Spill code is inserted whenever the width of a level or a soft edge exceeds the threshold value. Figure 8 shows the normalized instruction count of the benchmark programs for different queue lengths. The baseline is the programs compiled for infinite resources (INFTY), where clusterization is not present. We selected the various queue lengths for the following reasons. The most restrictive configuration for a queue processor is a queue length of 2; this configuration estimates the worst-case conditions for compilation and may strongly affect the quality of the programs. The other three chosen queue lengths (threshold = 4, 8, 16) are values above the average available parallelism in non-optimized SPEC CINT95 programs. The relationship between queue length and available parallelism is that N parallel instructions consume a maximum of 2N queue words. However, peak parallelism and some soft edges lie beyond these values, and our algorithm found opportunities for clusterization. The last queue length is set to infinity to compare the constrained configurations against ideal hardware.

As we expected, the most restrictive queue length configuration of 2 incurred the most substantial insertion of spill code, ranging from 3% to 11% more instructions. The clusterization algorithm works on the premise that the width of the data flow graph, or degree of parallelism, must be partitioned whenever its queue requirements violate the available queue. Therefore, compilation for a queue length of two words forces a large number of partitions of the original data flow graph, thus inserting a substantial amount of spill code. For queue lengths of 4 and 8, the increase in the number of instructions is about 2% and 1%, respectively. Compilation for a queue length of 4 words exceeds the average queue requirement of the SPEC CINT95 programs, which is about 3.5 queue words per level. When compiling for a queue length of 16, the insertion of spill code is insignificant for most of the programs; the rare cases that demand more than 16 queue words are bursts of peak parallelism and long soft edges.

Fig. 8. Normalized instruction count measurement for queue lengths threshold = 2, 4, 8, 16, INFTY, for each SPEC CINT95 benchmark and the average (AVG).

5.1.2 Spill Code Distribution

We separated the inserted spill code into three components, as shown in Figure 9: parallelism, soft edges, and reloads. Parallelism accounts for all spill instructions inserted to constrain the width of the data flow graph to the available queue. Soft edges represent all spilled temporaries generated to constrain the soft edges that exceed the available queue. Reloads are the instructions that read the spilled temporaries and the uses of shared constants and variables from other clusters. The figure quantifies the contribution of each component of spill code to the total number of extra instructions for the 124.m88ksim benchmark. Considering only the two components that contribute spill code (parallelism and soft edges) and ignoring the reload instructions, notice that for a queue size of 2 the parallelism component dominates with 89% of the extra code; the other 11% comes from the soft edge component. As explained above, the parallelism component contributes most of the code, since a large number of partitions must be made for this configuration and these benchmarks. When a larger queue size configuration is imposed, exceeding the average queue utilization of the compiled program, the distribution changes: on average, the parallelism component contributes 40% and the soft edge component the remaining 60% of the spill code.

Fig. 9. Spill code distribution of the 124.m88ksim benchmark.

5.1.3 Critical Path

We define the critical path of a program as the number of computation levels in its data flow graph. These levels represent the true data dependencies of the program's data flow graph and the limits of a queue processor. Assuming that all instructions in every level are executed in parallel by the queue processor, the execution time is bounded by the number of levels in the program. We use the critical path to estimate the static execution time of the compiled programs. Since partitioning the data flow graph into clusters increases the number of levels, we were interested in determining how the static execution time is affected when compiling for a constrained queue register file. Figure 10 shows the experimental results for different queue sizes. For queue sizes of 4, 8, 16 the performance degradation of the static execution time is less than 1%