
Research on Programming for a Multi-core Environment — web.cs.wpi.edu/~lauer/Concurrency/Sketch of Graphical D…

A Proposed Graphical Dataflow Programming Model for Achieving Fine-grain Parallelism in Multi-core Systems

Hugh C. Lauer
Department of Computer Science
Worcester Polytechnic Institute

Worcester, Massachusetts

Abstract

A graphical dataflow programming language is outlined. Programs in this language are represented by arcs and nodes, the arcs representing data of specific types and the nodes representing terminals or operations. Inspired by LabView, the language is intended to support automatic tools for parallelizing computations at the granularity of function calls and basic blocks. An execution model based on that of Mesa is used to avoid the overhead of threads and to accommodate very high degrees of parallelism in application space, without invoking the operating system kernel.

Introduction

This document is a preliminary outline of a graphical dataflow programming language and environment for implementing fine-grain parallelism suitable for multi-core processors. The purpose is to explore ways to represent programs and software systems so that inherent parallelism can be automatically exposed and distributed across the many processor cores on a modern processor chip. Using a graphical dataflow model, we hope to achieve degrees of parallelism that are much larger and granularities much finer than are typically achievable using traditional thread models.

Dataflow methods have been widely used in studying, analyzing, and implementing parallelism. Optimizing compilers, for example, convert instruction sequences into internal dataflow graphs in order to construct fast, efficient code sequences corresponding to the original source code provided by the programmer [reference?]. Modern multiple-issue processors such as the Intel Pentium and IBM PowerPC construct internal dataflow models of the executing instructions in order to permit many machine-language instructions to be in progress at once while preserving the interdependence of input and output values [reference]. A number of explicit dataflow programming languages have been proposed, including CODE2 at the University of Texas at Austin [reference]. LabView™ is an existing, object-oriented, graphical dataflow programming language that is in widespread use outside of computer science, primarily in the fields of instrumentation and control, with a large and dedicated user community [reference].1

The graphical dataflow programming model and language of this document is heavily inspired by LabView, but semantically different in significant ways. In particular, two of the goals of this language are (a) to be able to add automatic tools for parallelizing computations and (b) to develop an execution model that supports parallel computations at the level of the function and/or basic block. As our proposed language is explained in this document, we use a number of figures, diagrams, and icons, all of which were created in LabView, even if their semantics may be different from LabView semantics. Of course, in an actual implementation of the language, the graphical appearance of every construct would be an attribute that could be changed to avoid conflicts with the look and feel of LabView itself.

This document is organized as follows. First a simple example is shown, along with how it might be automatically parallelized. Following that, the elements of the programming language will be presented, including

o its data types and data flow connections,
o its operators, functions, and methods,
o classes and references to class objects,
o the proposed execution model, and
o a development environment.

This is the first presentation of this proposed language, so there are a lot of loose ends that still require brainstorming. However, it is hoped that the essential elements and approach of the language form a cohesive overall model.

1 I became a LabView user several years ago as part of a three-person startup in the field of surgical navigation. While the startup is nearly defunct, LabView has become my favorite programming environment. Although National Instruments, the vendor of LabView, actively promotes it as a tool for programming multi-core systems, the language and system are both proprietary and therefore not open to academic research and experimentation in the underlying issues of fine-grain parallelism.


Simple Example

Figure 1

Figure 1 above is the block diagram of a simple program written in LabView. Unlike flow charts and other pictorial representations of programs expressed in text, the graph in the diagram is the actual program itself. This example is a function to compute the mean and standard deviation of an array of numbers. There are three terminal nodes in this function, one on the left labeled Input and two on the right labeled Mean and Standard Deviation, respectively. Input is a one-dimensional array of double precision floating point numbers, denoted by the style and color of the box, while Mean and Standard Deviation are scalar double precision floating point results. The little arrows denote the direction of the flow of data. Another terminal node on the left is a double precision floating point constant with the value of zero.2

Lines on the block diagram — called "wires" in LabView — denote the flow of data, and nodes represent operations on the data. The large rectangular box represents a for-loop. In this case, the number of iterations is determined by the number of elements in the Input array, and it is available to subsequent operations via the blue wire emerging from the symbol in the upper left corner. The little up- and down-arrows on the right and left sides of the loop block denote values that recirculate from one iteration of the loop to the next, with the initial value on the left being derived from the wire that connects to it and with the last value emerging on the right when the loop exits. The little notation where the Input array enters the loop block on the left means that the iterations of the loop pick off the elements of the array, one at a time.3 It can readily be seen that this function calculates the mean and standard deviation σ of the Input array using the following formulas:–

μ = (1/N) Σ xᵢ        σ = sqrt( (1/N) Σ xᵢ² − μ² )

2 Each LabView function is defined in its own file with the .vi extension (denoting Virtual Instrument). A function has two parts — the block diagram, an example of which is shown above, and a front panel for depicting the terminal inputs and outputs in a user-friendly way. Elements of the front panel include switches, knobs, buttons, sliders, meters, digital displays, text inputs, oscilloscope displays, etc., each corresponding to one terminal on the block diagram. The look and feel of LabView front panels is central to National Instruments' business and an important component of their intellectual property. In this project, we have no need to replicate the functionality or interface of these front panels, so we will not mention them again in this document.
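Rendered in a conventional textual language, the loop of Figure 1 amounts to the following (a minimal Python sketch, not part of the proposal; the function name is illustrative, and σ uses the population form, dividing by N):

```python
import math

def mean_and_stddev(xs):
    # The two recirculating values of the LabView loop: the running sum
    # and the running sum of squares.
    total, total_sq = 0.0, 0.0
    for x in xs:                 # loop count = number of elements in Input
        total += x
        total_sq += x * x
    n = len(xs)
    mean = total / n
    sigma = math.sqrt(total_sq / n - mean * mean)
    return mean, sigma
```

Each iteration depends on the previous one through the partial sums, which is exactly why this form is not automatically parallelizable.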

Parallelizability

It should be immediately apparent that the loop of this function is not automatically parallelizable, for the simple reason that the partial sums recirculate through the iterations. The time it takes to execute the loop is therefore O(N) – i.e., proportional to the length of the Input array. However, it could be parallelizable if the input array were partitioned into, say, K subarrays, each of which calculates a partial summation of the items and the squares, followed by a summation step that combines the partial sums. This might be represented in LabView by the following block diagram:–

Figure 2

This has two nested loops, the inner one identical to the previous example. The outer loop has no recirculating values, and therefore its iterations are data-independent. This offers the opportunity of automatic parallelization. An additional input terminal K defines the degree of parallelization. The one-dimensional input array is reshaped (by the Reshape operation) into a two-dimensional array of K rows, each of which has ⌈N/K⌉ elements and where the last row is padded with zeros if necessary. The right side of the outer loop has two indexing

3 A corresponding notation on the right side of a loop can be used to create an array of values from the outputs of the individual iterations. LabView programmers adapt very quickly to the compact notation of this graphical representation of indexing array elements and recirculating data among iterations of loops.
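The partition-and-combine scheme of Figure 2 might be sketched textually as follows (an illustrative Python sketch, not the proposal's runtime; the thread pool stands in for the K concurrent iterations of the outer loop, and its shutdown provides the implicit barrier):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def partial_sums(row):
    # The inner loop of Figure 2: partial sum and partial sum of squares.
    s = sq = 0.0
    for x in row:
        s += x
        sq += x * x
    return s, sq

def parallel_mean_and_stddev(xs, k):
    n = len(xs)
    row_len = -(-n // k)                       # ceil(N / K)
    padded = list(xs) + [0.0] * (k * row_len - n)   # last row padded with zeros
    rows = [padded[i * row_len:(i + 1) * row_len] for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(partial_sums, rows))  # implicit barrier on exit
    total = sum(s for s, _ in results)
    total_sq = sum(sq for _, sq in results)
    mean = total / n                           # divide by the ORIGINAL length
    return mean, math.sqrt(total_sq / n - mean * mean)
```

The zero padding is harmless here because zero contributes nothing to either partial sum.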


markers that convert the K partial sums into arrays of K elements. The outputs of the outer loop are connected to two array summation operators that add the results, and finally these are divided by the original number of elements in the Input array to obtain the mean and standard deviation. (These summation operators are highly optimized versions of the original loop of Figure 1 that add the elements of the array together. Each has an execution time of at most K for an array of K elements, and the two can be executed concurrently because they are data-independent.)

The underlying system is now free to dispatch the iterations of the outer loop to K processor cores so that they are executed concurrently. The system must implicitly provide barrier synchronization as part of the accumulation of the results.

It is now possible to calculate the speedup according to Amdahl's law. Assume that squaring a number and adding it to a sum takes q times as long as simply adding it.4 Then the time to complete the loop in Figure 1 is qN, whereas the time to complete the nested, parallelized loops in Figure 2 is qN/K + K. I.e., the parallelized loop speeds up by a factor of K, but the final summations cost another K operations. The speedup according to Amdahl is therefore

S = qN / (qN/K + K)

Note that S increases with K (but not linearly) until K reaches a value of about √(qN), and then S starts to decrease again.
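The trade-off stated in the text (a K-fold speedup of the loop against another K operations for the final summations) can be checked numerically; the values of q and N below are arbitrary examples:

```python
# Sketch: T1 = q*N for the sequential loop of Figure 1,
# TK = q*N/K + K for the parallelized loops of Figure 2,
# so the speedup is S(K) = T1 / TK, maximized near K = sqrt(q*N).
def speedup(q, n, k):
    return (q * n) / (q * n / k + k)

q, n = 3, 1200                        # example values only
best_k = max(range(1, n + 1), key=lambda k: speedup(q, n, k))
```

With these values, best_k comes out to 60, matching sqrt(q*N) = sqrt(3600).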

Comments

Several comments come to mind:–

First, notice in both figures that the loop iteration counter is not used, that there is no explicit array subscripting, and that the number of iterations is set automatically. This is an example of the compactness of notation of the graphical dataflow approach. All of these need to be declared and defined explicitly in a traditional textual language, and their parallelization must be called out explicitly in a system like OpenMP.

Second, in this example, the programmer is aware of the need or opportunity to parallelize the computation, but s/he is not encumbered with the programming complexity of synchronization primitives and spawning multiple threads. Moreover, if the underlying

4 In a modern pipelined processor, a floating-point multiply-and-add instruction may take 2-4 times as long to complete as a simple addition.


system supports it, parallelization can be done at the granularity of the inner loop (i.e., a basic block) rather than at the granularity of a heavyweight construct such as a thread.

Third, Amdahl's Law can be automatically calculated. One would expect that a programming system supporting automatic parallelization would also display the Amdahl speedup information as an attribute of each loop and/or function.

Finally, it is not very far-fetched to suppose that a graph processing algorithm could automatically derive Figure 2 from Figure 1 using patterns of common program structures.

Data Types and Arcs on Program Graphs

The data types of the proposed graphical programming language are mostly similar to those of Java. The primitive types are

o Integers
o Floating point numbers
o Complex numbers
o Booleans
o Characters
o Enumerations

They all have the obvious computational characteristics. In addition, each data type has some graphical properties associated with it, including the visual appearance of terminal nodes of that type and the visual appearance of arcs or lines representing the flow of data from point to point in the graph.

Arrays

Arrays are collections of objects all of the same type. An array may have one, two, or more dimensions. In LabView and almost all other languages, the number of dimensions of each array is defined statically in the program; this seems to be a reasonable restriction and so will be adopted here, also. A reshape operation is provided to convert an array of one dimensionality to an array of another dimensionality.

The types of individual elements may be any of the primitive types, but they may not be other arrays. As in LabView, however, the elements of an array may be objects of a class (see below), and such objects may contain arrays within them.

A runtime property of each array is the number of elements in each dimension of that array. Moreover, the number of elements in any dimension of an array is not fixed but may be increased or decreased computationally.5

As in most programming languages, arrays may be indexed to retrieve individual elements or subarrays. However, explicit indexing is not always necessary because of the implicit indexing of loops.6

Structures, classes, and objects

For data aggregation, our graphical programming environment will use Java-like classes. A class defines an aggregate data type comprising one or more fields, each of which may be public or private. A class may optionally define one or more methods — i.e., functions that take class objects as parameters. Operators are provided to access the fields of the class, but only the methods of the class may access the private fields.

A field of a class may be any type, including an array, another class, or even another instance of the same class. Semantically, when a class includes an object of another class as one of its fields, that object is logically copied into the body of the including object. Therefore, when a class includes an instance of the same class, that object is logically copied into the including class, creating a recursion. In order to break this recursion, NULL must be a legal value of any class object.

As in Java, classes may be subclasses of other classes. The subclass inherits all of the fields and methods from its parent class, and it may override some or all of those methods. It is the intention to allow multiple inheritance — that is, to allow a class to be a subclass of more than one other class, in which case it inherits the characteristics of all of them.7
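A minimal illustration of why NULL must be a legal class value (a Python analogue with hypothetical names; None plays the role of NULL):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListNode:
    # A class containing an instance of its own class: the logical inward
    # copying would recurse forever unless NULL (None here) is a legal value.
    value: float
    next: Optional["ListNode"] = None

chain = ListNode(1.0, ListNode(2.0))   # the recursion ends at None
```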

In many cases, objects of a class are transmitted over arcs of the dataflow graph by value. That is, the arc sends a copy of the object from its source to its destination. This works well when a class is used as a struct for simply bundling a collection of data objects and arrays for convenience in handling. The various tools and methods for automating parallelism and for synchronizing the flow of data will work

5 This obviously means that arrays are stored on the heap, as in Java, rather than on the stack, as in C/C++.

6 In my own experience with LabView, I find that I use explicit indexing for fewer than half my arrays.

7 LabView came to object-oriented programming and classes late in its evolution. Originally, LabView supported only clusters, which were its version of C structs. Later, these were expanded to classes, but only with single inheritance. Ironically, the first time that I found myself using inheritance for a serious application in LabView, I found immediately the need to inherit from two different classes. As a result, I ended up with a very hokey data structure that multiple inheritance would have made much nicer.


just as well for class objects transmitted by value and for separate data objects transmitted by value.

However, there are many situations in which a set of objects must be shared among multiple computations, including concurrent computations. An example is the classic database, in which the computations are individual transactions over a collection of shared data. Another example is the cognitive model that a game character or robot builds up about its environment. In these cases, it is not sufficient to simply transmit copies of objects around the dataflow graph; instead, it is necessary to transmit references to individual objects.

Unfortunately, once we start transmitting references, most of the tools and techniques for automatic parallelization in the dataflow model stop working. This becomes a fundamental obstacle to our goal of fine-grain parallelism and will be discussed in more detail below in the section entitled Classes and References.

Arcs and Terminals

In this graphical programming language, the lines or arcs of the graph represent directional flows of data values from nodes to other nodes. Specifically, a copy of a data value is transmitted from the source node of the arc to the destination node of the arc. This copy may be a single primitive value, an entire array, or a member of a class of objects. The data type of each arc is specified at the time the program is created and does not change.

In this document, we use the term terminal to denote the nodes in the graph to which arcs connect. The input, output, and constant terminals of functions have special icons — for example, the icons in Figure 1 labeled Input, Mean, and Standard Deviation, and also the icon representing the constant zero. The symbols at the edges of composite operators such as for-loops and if-else blocks are the terminals of those blocks. The connector points of primitive operations are also terminals.

A single arc may have only one source, so that its data value is unambiguous at any time during the execution of the program. However, it may have multiple destinations, so that copies of the data go to two or more different nodes. Two examples may be seen in Figure 1.

One is the output of the index node, which goes both to the operation of the calculation of the mean and also to the operation of the calculation of the standard deviation.

The other is the output of the upper operation on the right, which goes both to the terminal node Mean and also to the operation below it.


While logically this means multiple copies of the same data, the compiler can optimize this to avoid unnecessary copying.

Directionality is built into every arc, so from the point of view of graph theory, it makes no difference whether the data flow is left-to-right, right-to-left, top-to-bottom, or bottom-to-top. For readability, most LabView programs are organized from left to right, and this is reinforced by the positions of the connections on nodes representing functions and operations. Inputs to a function tend to be on the left side of its graphical icon, and outputs tend to be on the right side. The author prefers this style over the top-to-bottom style of CODE2.

It is possible to create cycles in the graph representing a function. That is, the output of a node can feed back to become an input (directly or indirectly) of that same node. Basically, this means that the output value from a particular execution of that node becomes an input value for a later execution of the same node. The most common case in LabView is the pair of recirculation terminals, which recirculate values from one iteration of a loop to the next (or a subsequent) iteration. However, LabView also permits more general cycles, but it inserts a sort of holding register into each cyclic path to cause a non-zero delay of the cycled value. This issue needs more thought in our system, where the emphasis is on parallel execution.

Operators

This programming language provides a selection of basic operators, composite operators including loops and case or if-else blocks,8 and function invocations. Each operator is a construct or icon on the dataflow diagram. It has input and output terminals to which arcs are connected. The underlying execution rule is that an operator will "fire" or execute when the data values on its inputs are available. Otherwise it waits. When it does fire, the operator "consumes" the values from its input terminals, performs the operation on those values, and transmits the results on its output terminals. If the different inputs arrive at different times (as a result, for example, of being generated by concurrent activities with different timing), the operation does not happen until the last value arrives, implicitly requiring a barrier synchronization.9

8 LabView offers several other composite operators including one that forces otherwise parallelizable code to execute sequentially. These can be added later if needed.

9 LabView is a bit ambiguous about this synchronization. Before the days of concurrency, the LabView compiler simply generated code so that an operation fired when the compiler knew that both values were available. There are some pathological circumstances, however, where an operation does not “consume” its input but rather fires repeatedly, using the same value over and over again.
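The firing rule described above can be modeled in a few lines (an illustrative Python sketch, not the proposed execution model; arrival of the last input acts as the implicit barrier):

```python
class OpNode:
    """A node fires only when every input terminal holds a value."""
    def __init__(self, n_inputs, op):
        self.inputs = [None] * n_inputs
        self.op = op

    def deliver(self, port, value):
        self.inputs[port] = value
        if all(v is not None for v in self.inputs):
            # Fire: consume the inputs and produce the result.
            args, self.inputs = self.inputs, [None] * len(self.inputs)
            return self.op(*args)
        return None   # otherwise, wait for the remaining inputs

add = OpNode(2, lambda a, b: a + b)
first = add.deliver(0, 3)    # only one input present: no firing
second = add.deliver(1, 4)   # last input arrives: the node fires
```

Using None as the "empty terminal" marker is a simplification of the sketch, not a statement about the language's data values.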


It is possible for one part of a dataflow graph to execute faster than another. The result could be a data "overrun" on a particular arc. That is, an operator or function might generate a new data value before the previous one was consumed by the next operator along the path. If allowed, this could cause all sorts of mysterious behavior. There are three possible ways to address this, namely by blocking, queuing, or discarding.

 Blocking means that an operator cannot complete its operation until all previous values on the arcs emanating from its output terminals have been cleared by subsequent operators. This can, in certain circumstances, be a serious impediment to parallelism.

 Queuing means that an arc itself has a limited amount of buffer memory to absorb some number of values before falling back to blocking. This could smooth out the data flow.

 Discarding means that the overrunning data is thrown away. The sending operator continues to execute, but the results are not used. This is probably the most difficult.

LabView uses blocking by default, but it includes special constructs to support queuing and discarding. However, these constructs are fairly heavyweight.10
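The three policies might be modeled as follows (an illustrative Python sketch with a hypothetical Arc API; blocking appears as a refused send that the producer would have to retry):

```python
from collections import deque

class Arc:
    """A directed arc with a bounded buffer and an overrun policy."""
    def __init__(self, capacity, policy="queue"):
        self.buf = deque()
        self.capacity = capacity
        self.policy = policy   # "queue" (falls back to blocking) or "discard"

    def send(self, value):
        if len(self.buf) < self.capacity:
            self.buf.append(value)   # room available: queue the value
            return True
        if self.policy == "discard":
            return True              # overrunning value is thrown away
        return False                 # buffer full: the producer must block and retry

    def receive(self):
        return self.buf.popleft() if self.buf else None
```

With capacity 1 and the queuing policy, this degenerates to blocking, mirroring the text's observation that queuing falls back to blocking when the buffer fills.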

Primitive operators

The basic operators of this programming language are those that are found in most conventional languages — e.g., arithmetic operators, comparisons, array subscripting, field selection from objects, etc. The difference is that they are expressed in dataflow format, such as the operators in Figure 1 and Figure 2. For example, the add operator has two inputs on the left and one output on the right. When a data value is available on each input, the operation node "fires" — that is, it adds the two values together and places the result on the arc emanating from its (only) output terminal.

Two other basic operators are the extraction and insertion of fields into class objects. The extraction operator simply copies its field to its output terminal. The insertion operator constructs a new class object from an existing one (including the NULL object) by replacing one or more of its fields. Although logically the insertion operator makes a copy of the object into which it is inserting something, the compiler is free to optimize away any unnecessary copying. The same is true for arrays:– a subscripting operator simply copies a value (or subarray) to its output terminal. When inserting an element at a subscripted location of an array, the insertion operator logically makes a copy of the array with the subscripted location changed to the new value. The compiler, of course, may optimize this.

It is expected that there would be a large palette of operators corresponding to the large number of built-in functions in textual languages or the palette of operators in LabView. In addition to defining its own operation, each operator has several graphical properties, including its color and appearance and the pattern of its input and output connectors. Some operators may have more than one output. For example, the LabView integer division operator produces both a quotient and a remainder, so that it can be used in lieu of the '/' and '%' operators of a textual programming language.

Operators can also be polymorphic, especially with respect to arrays and structs. For example, when presented with two array inputs of the same dimensionality, the add operator produces an array output comprising the element-by-element sum of the two inputs. Likewise, when presented with an array and a scalar, the divide operator generates an array output in which each element is the corresponding element of the input divided by the scalar.

Occasionally, the programmer may wish to drop into C for a particular set of expressions. LabView provides a "formula node" for this purpose, and our language should do likewise. This is a sort of inline function that compiles directly to a basic block. It is invoked when all of its inputs are available, and it generates one or more outputs.11

10 It is not clear whether streams need to be included as separate aggregators of data types. A stream is a sequence of values, all of the same type, that flow through an arc. The allocation of context blocks in the execution model (see below) would be different if the called function were expecting a stream of data rather than a single value. However, it seems simpler to not have streams but only to recognize that it is possible that a new value can be generated on a wire before the previous one is consumed, and that this has to be handled correctly.
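The polymorphic behavior described for addition can be sketched as follows (illustrative Python; nested lists stand in for arrays):

```python
def poly_add(a, b):
    # A polymorphic add: arrays of equal shape combine element-by-element,
    # and a scalar is broadcast against every element of an array.
    a_is_arr, b_is_arr = isinstance(a, list), isinstance(b, list)
    if a_is_arr and b_is_arr:
        return [poly_add(x, y) for x, y in zip(a, b)]   # element-by-element
    if a_is_arr:
        return [poly_add(x, b) for x in a]              # array + scalar
    if b_is_arr:
        return [poly_add(a, y) for y in b]              # scalar + array
    return a + b                                        # scalar + scalar
```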

For- and While-loops

Since most of the focus of data parallelism, especially in scientific computing and number crunching, is on parallelization of loops, the loop construct is one of the centerpieces of this graphical programming language. In general, a loop is a two-dimensional graphical object that surrounds a block of graphically specified code — i.e., a set of arcs, operators, function calls, other loops, etc. It denotes the repeated execution of that block according to some control.

We have already met the for-loop in Figure 1 and Figure 2. This specifies the number of iterations, either explicitly or implicitly. The general format of a for-loop is shown in Figure 3:–

11 In my 2½ years of programming in LabView, I only ever dropped into a formula node twice. Both cases had to do with the construction of graphics transformation matrices, where it was easier just to specify the elements of the matrices rather than create a tangle of wires and sin and cos operators.


Figure 3

The for-loop includes:–

 An optional label that serves as a comment to help readability.

 A count that specifies the integer number of iterations to execute the body of the loop. The count terminal can be either a destination of an arc, the source of an arc, neither, or both. The count can also be derived implicitly from an indexing node (see below). If more than one source of the count is supplied, the smallest value prevails.

 An iteration counter — i.e., an integer that indicates which iteration is currently executing. This is only a source of data values; it cannot be wired as a destination.

 An optional termination control that allows the loop to be terminated early, before all iterations have completed. This is the destination of a Boolean, and it can be set so that termination occurs either when the Boolean is true or when it is false.

We need to review the semantics of termination of a loop that is parallelized. One possibility is that concurrently executing iterations with indexes larger than the terminating one are suppressed, whereas concurrent executions of lower indexes are continued.

A while-loop is a loop without an explicit count. The general format is shown in Figure 4.


Figure 4

For readability, the while-loop is graphically different in appearance from the for-loop. Obviously, graphical properties can be associated with each loop construct to control its appearance, color, shape, etc.

A while-loop contains

 An iteration counter identical to that of the for-loop.

 A termination control to indicate when to terminate the loop. This is identical to the same control in a for-loop, but it is mandatory.

Parallelizing while-loops is much less obvious than parallelizing for-loops and therefore needs some thought.

Data flows into and out of loops via terminals at their edges. A terminal is either an input or an output terminal, but not both. Although either kind of terminal can be located anywhere on the perimeter of a loop, input terminals are generally located on the left side and output terminals are generally located on the right side. This makes for more readable code.

There are three kinds of loop terminals in LabView, and a fourth is needed for this language:–

– A simple data terminal:– When this is connected as an input, it transmits a copy of the data to each iteration without modification, indexing, or other manipulation. When it is connected as an output, it transmits the value connected to it in the last or final iteration. When a loop is parallelized, the data of an input terminal is simply broadcast to all concurrent iterations. An output terminal, however, is more complicated, because the value of the “last” iteration has to be selected and used as the output of the entire loop.12

12 The use of a simple data terminal as an output terminal may, in fact, defeat attempts to automatically parallelize a loop, because it suggests a (possibly) hidden flow of information from one iteration to the next in order to obtain the final value. Perhaps it should be disallowed.



– An indexed terminal:– When this is used as an input terminal, an n-dimensional array must be connected to it (n > 0). If the number of elements in the first dimension is k, then the loop count is set to k, and one element or subarray is presented to each iteration. That is, if the input array has one dimension, then one scalar value is presented to each iteration. If the input array has two or more dimensions, then an array with one fewer dimension is presented to each iteration. When the indexed terminal is used as an output terminal of a loop, it accumulates the values of all of the iterations into an array of one additional dimension. That is, if the body of the loop produces a scalar value, then the indexed terminal creates a one-dimensional array of values in which the number of values is the same as the number of iterations. If the body of the loop produces an n-dimensional array, then the indexed terminal accumulates an (n+1)-dimensional array.
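In conventional code, the behavior of indexed terminals on a one-dimensional input corresponds to a parallel map, which may make the intended parallelization easier to see. The Java sketch below is illustrative only; the indexedLoop name is invented here:

```java
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

// Illustrative sketch: an indexed input terminal fixes the loop count from
// the array's first dimension and feeds one element to each iteration; an
// indexed output terminal accumulates one result per iteration into a new
// array. For a one-dimensional input this is exactly a parallel map.
class IndexedTerminals {
    static int[] indexedLoop(int[] input, IntUnaryOperator body) {
        int count = input.length;            // loop count k = size of first dimension
        int[] output = new int[count];       // indexed output: one slot per iteration
        IntStream.range(0, count).parallel()
                 .forEach(i -> output[i] = body.applyAsInt(input[i]));
        return output;
    }
}
```

Because each iteration touches only its own element and result slot, the iterations are independent and can all run concurrently.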

– Recirculating terminal pairs:– The down-arrow appears on the right edge of the loop and captures a value (of any data type) to recirculate to the next iteration. When the loop terminates, the output of this terminal is the value produced by the last iteration (equivalent to a simple data terminal). The up-arrow appears on the left edge of the loop and captures the recirculated value so that it can be used by the next or a subsequent iteration. These up-arrows can be stacked, as in the for-loop of Figure 3. In this case, the top arrow is the source for the recirculated value from the immediately preceding iteration, the second is the recirculated value from the second preceding iteration, etc. When used as an input terminal to the loop, each up-arrow provides the initial recirculation value to the first iteration (or the first n iterations, if stacked n high).
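As a concrete illustration, a loop with up-arrows stacked two high and a body that adds the two recirculated values computes Fibonacci numbers. The Java rendering below is a hypothetical textual equivalent of such a loop, not part of the proposed language:

```java
// Hypothetical sketch: stacked recirculating terminals give each iteration
// access to the values produced one and two iterations earlier. A body that
// adds the two recirculated values computes Fibonacci numbers; the initial
// values wired to the up-arrows seed the first iterations.
class Recirculation {
    static long recirculatingLoop(int count, long init1, long init2) {
        long prev1 = init1;   // top up-arrow: value from the preceding iteration
        long prev2 = init2;   // second up-arrow: value from two iterations back
        long result = init1;
        for (int i = 0; i < count; i++) {
            result = prev1 + prev2;   // loop body
            prev2 = prev1;            // down-arrow captures this iteration's value
            prev1 = result;
        }
        return result;        // simple-terminal semantics: value of the last iteration
    }
}
```

The recirculation wires are precisely the loop-carried dependencies that would prevent automatic parallelization of such a loop.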

– The fourth kind of loop terminal, not present in LabView, would be a streaming terminal. When connected as an output terminal, this generates a new value from each iteration. Unlike an indexing node, the separate values are not accumulated in an array to be transmitted once along the output arc. Instead, each output value is transmitted separately while the loop continues to execute. This seems more satisfactory than the simple terminal on output.13, 14

Note that a loop may contain an input or output terminal of its containing program or function. The semantics probably should be that on each iteration of the loop, a new value is read or written. Thus, the loop becomes a consumer or producer of a stream of data, and the iterations block until data arrives on inputs or until previous output data is consumed. This provides a way to implement pipelined parallelism among parts of a large program. However, it prevents parallelizing the loops in the manner shown in the Simple Example above.15

13 The semantics of a streaming terminal on input are unclear and need more thought. Perhaps it should be disallowed.

14 Parallelizing a loop with a streaming output terminal is challenging. The streams would have to be merged, but it is unclear whether they would need to be merged in stream order or not.
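A rough Java analogue of a streaming output terminal feeding a downstream consumer loop is a producer/consumer pair joined by a bounded buffer standing in for the arc. All names here are hypothetical, including the end-of-stream convention:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of streaming-terminal semantics: the producer loop
// emits each iteration's value as soon as it is ready; the consumer loop
// blocks until a value arrives. The bounded queue models the arc, so the
// producer also blocks when earlier output has not yet been consumed.
class StreamingTerminals {
    static long pipelineSum(int n) {
        BlockingQueue<Integer> arc = new ArrayBlockingQueue<>(4);  // small buffer gives backpressure
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) arc.put(i);  // one streamed value per iteration
                arc.put(-1);                              // end-of-stream marker (assumed convention)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            long sum = 0;
            for (int v; (v = arc.take()) != -1; ) sum += v;  // consumer iteration blocks on input
            producer.join();
            return sum;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Both loops run concurrently, which is the pipelined parallelism described above; neither loop's iterations can themselves be parallelized in the manner of the Simple Example.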

If-else and Case blocks

An if-else block comprises two graphic blocks of code with identical terminals. One of these terminals is distinguished as the control terminal of the block and indicated by the symbol. This accepts a Boolean input which, if true, invokes one of the two blocks and, if false, invokes the other. An example of an if-else block is shown at the right of Figure 5, which is a modification of Figure 1 so that it checks for an empty array and avoids a divide by zero.

Figure 5

Only the true case is shown in this figure. In this example, the false case is hidden behind the true case and looks like Figure 6:–

Figure 6

15 This is specifically different from the semantics of LabView. There, a loop will continue to read the same value on each iteration. A value may change as a result of a concurrent activity, and the result is a race condition between the change and the reading of it. On output, writing to a terminal within a loop results in a new value on the output arc but does not generate a stream of data, only a constantly changing value. Although LabView documentation speaks of supporting pipelined parallelism, I have found it very difficult to achieve in practice with LabView’s semantics.



Note that the terminals of the false block are identical in type and position to those of the true block. In this example, the false case connects the sums that are input on the left to the mean and standard deviation on the right; these are known to be zero. In order to preserve the semantics of the dataflow graph, it is necessary that both sides of an if-else block produce values on all output terminals, even if they are only default values.

A case block is like an if-else block but with more possible alternatives. The control terminal accepts numerical or enumerated values and includes a case for some or all of them. It may also include a default case. The control terminal may also accept a class object or a reference (see below). If the class object or reference is NULL, one branch is taken, and if not NULL, the other branch is taken.

Note that, as with loops, if an input or output terminal of the containing program or function is embedded in one case block but not the other(s), it is only read or written if that particular case is activated.

The graphical characteristics of if-else and case blocks should be reviewed. LabView’s practice of stacking them one on top of another works fine for that language, but it might be more useful to allow all blocks to be displayed at the same time, under control of the programmer.
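The requirement that both sides of an if-else block drive every output terminal corresponds, in textual form, to an if-else in which the false branch supplies default values, much as the empty-array guard of Figure 5 does. The Java sketch below uses invented names and is not the actual code behind the figure:

```java
// Illustrative sketch of if-else block semantics: both branches must drive
// every output terminal, so the empty-array case supplies defaults (zero)
// rather than leaving the mean and standard deviation outputs undefined.
class IfElseBlock {
    // Returns {mean, stddev}; the names and layout are hypothetical.
    static double[] stats(double[] data) {
        if (data.length > 0) {                     // "true" case of the block
            double sum = 0, sumSq = 0;
            for (double x : data) { sum += x; sumSq += x * x; }
            double mean = sum / data.length;
            double var = sumSq / data.length - mean * mean;
            return new double[] { mean, Math.sqrt(Math.max(var, 0)) };
        } else {                                   // "false" case: default outputs
            return new double[] { 0.0, 0.0 };      // both outputs are still produced
        }
    }
}
```

Producing defaults on every output keeps the downstream dataflow graph well defined: nodes waiting on the mean and standard deviation arcs always receive a value, whichever branch ran.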

Functions and Methods

A function (or a method of a class) in this language contains three parts:–

– A header defining the types and names of the input and output terminals and any local and/or static variables;

– A block diagram showing the body of the function, expressed as a data flow graph; and

– An interface showing the iconic representation of the function and its connectors — i.e., the input and output terminals and their positions on the arcs.

Examples of block diagrams of functions have already been shown in the figures above. It is proposed that a header be a formatted text window with Java-like declarations. There should be three panes, one for input terminals (i.e., the parameters of the function), one for output terminals (i.e., the results of the function), and one for local variables.16,17 In keeping with LabView practice, the header and block diagram of a function should be stored together in a single file.

The interface defines what the function looks like when placed on the block diagram of another function. An example is shown in Figure 7:–

Figure 7

This is a verbose description of the interface to a built-in signal processing filter in LabView. It shows the iconic appearance of the function and each of its input and output terminals (the types of which are inferred from their colors and styles). Also shown are a descriptive comment and a hyperlink to a more complete description in a help file.

Figure 8 shows a fragment of code from a surgical navigation system that calls the IIR filter of Figure 7 in two places. The only thing that appears in the body of the calling function is the icon itself and the arcs leading to and from each separate invocation of the function.

16 LabView has a much more elaborate graphical facility for representing terminals. This is called the front panel and is designed to look a lot like the control panel of an instrument, oscilloscope, or device.

17 In my experience, I have rarely found it necessary to include local variables in my functions. Almost all data that I use internal to a function is captured in the “wires” or arcs of the block diagram, and in the recirculated data within loops.



Figure 8

Interfaces need to be stored in palettes — i.e., the graphical equivalents of .h files in textual languages — so that other functions can include them and invoke their functions. When a programmer wishes to call a function from another function, s/he simply copies its icon from the appropriate palette into the block diagram of the calling function and then connects the arcs representing the inputs and outputs of that function. An interface editor must be provided so that editing the header and updating the palette can be done smoothly.

Like other operations in a dataflow block diagram, a function is executed when all of its inputs are available. This implies an inherent barrier synchronization regarding the arrival of inputs to a function, so that it only executes when they have all arrived. When the function completes, it puts its results onto the arcs connected to its output terminals. This implies some sort of pipeline synchronization to prevent data overrun.

Calls to functions are inherently parallelizable if there is no directed path connecting the output of one function to the input of another. So far as the caller is concerned, there are no synchronization issues between two functions other than the implicit synchronization in the dataflow graph itself.

It is the intention to support recursion. That is, any function should be able to call itself either directly or indirectly (via another function call). This is clearly different from LabView, where recursion was added only recently and then only for methods of a class. The LabView limitation is probably due to the execution model of the original implementation. We expect that our execution model, which is specifically designed to facilitate concurrent execution, will also be rich enough to support full recursion.
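The implicit barrier on input arrival can be modeled in Java with futures: a node's body runs only once every input arc carries a value. This sketch is illustrative and assumes nothing about the eventual implementation:

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of the implicit input barrier: a node (here, an
// addition) is dispatched only when both of its input arcs carry values.
// CompletableFuture stands in for an arc; thenCombine is the join.
class DataflowCall {
    static int evaluate() {
        CompletableFuture<Integer> arcA =
            CompletableFuture.supplyAsync(() -> 6 * 7);    // one upstream function
        CompletableFuture<Integer> arcB =
            CompletableFuture.supplyAsync(() -> 10 - 9);   // an independent upstream function
        // The downstream "function" runs only after both inputs have arrived;
        // the two independent suppliers above may execute concurrently.
        return arcA.thenCombine(arcB, Integer::sum).join();
    }
}
```

The two suppliers have no directed path between them, so they are exactly the inherently parallelizable calls described above; the join expresses the barrier.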

External functions

As a matter of practicality, it is necessary to link to functions and packages developed in other languages such as C, C++, and possibly Java. This could be done by providing an external function icon that derives its arguments and results from a traditional .h file or Java interface. There would be no attempt to parallelize the external function itself, but the call to it could be parallelized with other calls in the dataflow graph. This area obviously needs a lot more thought and detail.

Classes and References

There are many situations in which different parts of a program — especially concurrently executing parts of that program — need to operate on the same object, not just on copies of that object. The object itself must retain state from one method to the next, and it must be possible to reason about that state from the pre- and post-conditions associated with the methods of the class of that object. This means two things:–

– There must be a primitive data type called reference that can be copied through a program graph and that denotes a particular object. Whereas Java generally blurs the distinction between objects and references, our graphical dataflow environment must highlight it.

– The methods of objects pointed to by reference must be protected by a synchronization mechanism similar to Java synchronized classes. This is necessary because in a fine-grained parallel environment, concurrent execution is the rule rather than the exception, and it can happen “under the covers,” even when the programmer does not explicitly fork processes or spawn threads.

Java synchronized classes are examples of monitors as introduced by Hoare [reference] and developed in the Mesa programming language at Xerox PARC [Lampson and Redell]. In implementation, monitors permit only one method to be executing at a time; concurrent attempts to invoke others are blocked until the executing method either finishes or waits on a specialized data object called a condition. By this mechanism, it is possible to establish monitor invariants and to reason about the state of monitors as seen from outside by concurrently executing programs. The result is that an intractable analysis of a concurrent system becomes tractable again and can be analyzed using first-order logic.

Within a monitor, of course, a lot of parallelism is possible, but that must be protected by monitor-like mechanisms on the component fields of the monitor object. It also requires that the programmer be conscious of this parallelism to the extent of well-designed condition variables and carefully partitioning a monitor object into smaller, parallelizable parts.
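For readers unfamiliar with the construct, a minimal Java monitor of the kind being described looks like this single-slot buffer, where wait and notifyAll play the role of the condition primitive. The example itself is illustrative, not drawn from Mesa:

```java
// A minimal Java monitor, the construct the proposed synchronized classes
// would mirror: at most one method body runs at a time, and the monitor
// invariant (slot == null means "empty") holds whenever no method is active.
class SlotMonitor {
    private Integer slot = null;   // monitor state

    synchronized void put(int v) throws InterruptedException {
        while (slot != null) wait();   // block until the "empty" condition holds
        slot = v;
        notifyAll();                   // wake any consumer waiting on "full"
    }

    synchronized int take() throws InterruptedException {
        while (slot == null) wait();   // block until the "full" condition holds
        int v = slot;
        slot = null;
        notifyAll();                   // wake any producer waiting on "empty"
        return v;
    }
}
```

The while-loops around wait reflect standard monitor discipline: a woken thread must recheck its condition before relying on it.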



Therefore, our graphical dataflow language will include an attribute or property of a class called synchronized, a primitive data type called reference that is associated with a synchronized class, and another primitive data type called condition that can only be used as a field within a synchronized class. A primitive operation called new can be used to create objects and return references, and primitive operations called wait and notify permit communication among concurrent computations within the synchronized object.

While it would still be possible to parallelize a loop in which each iteration had a copy of the same reference, there would be an inherent serialization at runtime of access to the methods of the referenced object. It should be a goal for the dataflow programming environment to at least flag such situations in order to call the programmer’s attention to them. However, it is not clear whether much can be done to avoid the serialization.

It seems that this concept of synchronized class would support several important kinds of applications, including:–

o Databases supporting multiple concurrent transactions18

o Cognitive processing in games and robotics where a character or robot builds up a model of its surroundings and then responds to various stimuli to interact with those surroundings

o Filters of streams of data that retain state of the most recent samples so that they can, for example, be averaged.19

This issue needs a lot more study.

Execution Model

Fine-grain parallelism means concurrency of execution at the granularity of individual function calls, iterations of loops, and even smaller fragments of code that are independent of each other. Any of these must be able to be dispatched with much less overhead than is associated with traditional threads (either kernel-supported threads as in Windows and Linux or any of the popular user-space thread packages). In addition, concurrent execution automatically implies out-of-order completion of concurrent activities; as a result, a traditional execution stack (i.e., a linear section of virtual memory) is not appropriate. Finally, all of the various implicit barrier synchronizations and blocking or queuing of data values on arcs require very low-level, fast operations that should avoid invoking the operating system.

18 This is task parallelism.

19 Filtering is widely used in LabView, but it is implemented by a specialized mechanism called “reentrant” execution. The two alternatives to that implementation in this language are (a) reading and writing values from function terminals contained within loops and (b) keeping the retained state in objects of a class.



An appropriate mechanism to meet these needs is the (obscure) execution model developed over 30 years ago for the Mesa language [Lampson, Mitchell, Satterthwaite, 1974; Lampson 1982]. In this model, there is no stack. Instead, memory for the arguments, results, and automatic variables of each function is allocated from a heap. When a function is called, a new context block is allocated from the heap and is linked back to its caller. The context block contains everything that is needed to execute the function:– a reference to a class object, local variables, arguments, program counter, etc. Context blocks would also be allocated for parallel executions of iterations of loops and of other code fragments (which we call basic blocks).

A given function call may have “calls” to multiple context blocks outstanding at the same time. The calling function (represented by its own context block) remains blocked until all of the outstanding calls are completed — that is, each of the called context blocks has returned. Likewise, a call to a specific function cannot proceed until all of its inputs are available, but these inputs may come from concurrent activities.

A context block is allocated when the calling function or basic block is entered the first time. So long as it is not a recursive call, the context block may be reused rather than freed upon completion and reallocated on subsequent calls. This needs more thought.20

Note that in the dataflow model, a context block whose output serves as the input to another block can safely exit as soon as that data value has been delivered to its destination. There is no last-in, first-out relationship to calls among context blocks. Moreover, the relationship between context blocks is symmetrical, and the compiled code to “call a block” is very similar to the compiled code to “return from a block” and the compiled code to “forward a value” to another context block.

Consider, for example, Figure 8. The outputs of the (unlabelled) function invocation at the top of the figure are connected to both instances of the IIR filter of Figure 7. In this execution model, the unlabelled function delivers its results directly to the context blocks of the two instances of the IIR filter. In particular, it does not return them to the caller, as in traditional execution models, so that the caller can then pass them to the filter functions.

To make this all work, at least two things are needed:–

– A pool of threads, assigned at least one per processor core, to execute the dataflow graph; and

– A “database” of context blocks representing outstanding invocations of functions.

20 It is one way to implement the equivalent of LabView’s reentrant property.



Any pool thread may execute the code associated with any context block that is “ready” — i.e., that has all of its inputs available, etc. When a thread is executing the context block, it finds the program counter, register values, and pointers to class objects and global environments all in that block. It continues to execute code until it becomes blocked as a result of, for example, calling another function, attempting to send a data value to an arc that already has data values still outstanding, etc. When a context becomes blocked, the thread saves its registers and program counter in the context block, leaves it where it can be found by the functions providing its values, and selects another context block from the database to execute. Later, when the context block is ready to execute again, any available pool thread can select it from the database and resume it. Moreover, this can all be accomplished without invoking the operating system kernel, allocating new pool threads, etc.

Blocking is managed by lightweight barrier synchronizations which are implemented by busy waits and/or atomic instructions common to modern processors. It seems that these barriers can have a common implementation for “calling,” “returning,” and “transferring data values” among context blocks, similar to the symmetry of transfers of control.

As in Mesa, allocation of new context blocks from a heap can be made as efficient as ordinary function calls in stack-based execution models. This is accomplished by maintaining a slab allocator similar to what is provided in the Linux kernel, so that blocks of the same size are taken from the same pool, usually by removing an item from a linked list. Blocks that can be recognized as being finished can be returned to the pool as efficiently as returning from a function in a stack-based execution model. However, some blocks may have outstanding references and therefore need to be garbage collected in the background.

External functions in C, C++, or Java would be an exception to this execution model because their compilers assume a traditional execution stack. One possible way to implement them is for the executing thread to drop out of the dataflow execution model and simply use its own stack for executing the external function. Once the external function returns, it could return to its role of executing any ready context.
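A very small user-space sketch of this dispatch scheme follows: a fixed pool of worker threads executes whichever context blocks have all of their inputs available. Every name here is hypothetical, and a real implementation would use the lightweight barriers described above rather than a blocking queue:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Speculative sketch of the execution model: a context block becomes
// "ready" when its last outstanding input arrives, and any pool thread may
// then execute it, entirely in user space.
class ContextScheduler {
    static class ContextBlock {
        final AtomicInteger pendingInputs;      // implicit barrier on input arrival
        final Runnable body;                    // stands in for the block's code + state
        ContextBlock(int inputs, Runnable body) {
            this.pendingInputs = new AtomicInteger(inputs);
            this.body = body;
        }
    }

    private final BlockingQueue<ContextBlock> ready = new LinkedBlockingQueue<>();
    private final ExecutorService pool;         // roughly one thread per core

    ContextScheduler(int threads) {
        pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++)
            pool.submit(() -> {                 // each pool thread runs any ready block
                try {
                    while (true) ready.take().body.run();
                } catch (InterruptedException e) { /* shutdown */ }
            });
    }

    // Called when a data value is forwarded to `dest`; the block is
    // enqueued once its last outstanding input has arrived.
    void deliver(ContextBlock dest) {
        if (dest.pendingInputs.decrementAndGet() == 0) ready.add(dest);
    }

    void shutdown() { pool.shutdownNow(); }
}
```

Note how "forwarding a value" and "becoming ready" are one operation, echoing the symmetry between calling, returning, and transferring values among context blocks.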

Compilation and Development Environment

For a development environment, a plug-in to Eclipse seems like the most expedient choice. The main or principal window would be the dataflow graph for a function. In this window, the programmer would draw the graph using the mouse, a touch screen, or a tablet. As in LabView, s/he would add icons representing primitive operations, function invocations, and composite programming constructs such as loops. S/he would draw arcs from one terminal to the next and would connect arcs together where they are needed. The arcs, icons, and programming constructs would be converted in real time into a directed graph data structure.

Syntax and semantic checking would be done interactively, so that the programmer is continuously aware, for example, of type mismatches of arcs and terminals, direction of data flow, etc. A syntactically correct function would be compiled and linked automatically whenever there is time during the drawing or manipulating of the graph, so that no separate compilation step would be required.

The packaging of functions in files is yet to be determined. Since the dataflow graph of each function would occupy its own window, it would make sense that each function occupy its own file. The file would contain, at a minimum, the dataflow graph of the function, the header of the function (which would display in a separate text window), and any attributes of the function needed to support program development. Presumably, the object code of the compiled function would be in a separate .o file, similar to those of other languages.21

The creation of interface palettes also needs to be supported, as do classes and data types that are shared among functions.

Conclusion

This has been an attempt to outline and document a graphical dataflow programming language capable of supporting automated tools for parallelizing computations and of supporting concurrent execution at the granularity of functions and basic blocks. Programs in this language are not “written” but rather are “drawn” in graphical windows. Each directed graph represents a function in this language. The arcs of the graph represent the flow of data values from nodes to other nodes. Terminal nodes in the graph represent the inputs and outputs of the function and the constant values used by the function. Interior nodes represent primitive operations on the data values, composite operators such as loops and case or if-else blocks, and invocations of other functions.

Any function is inherently parallelizable from its directed graph. That is, when data flows in parallel paths of the graph, the operators and functions along those paths can be executed concurrently according

21 The LabView model of each function occupying its own file with the .vi extension is simple and attractive. In the LabView case, the file contains the graphical representation of the block diagram, the “front panel” of the function, and the compiled code.



to the execution model. Where paths join and where pipelining along a path is possible, synchronization is automatically provided.

The internal representation of a dataflow graph needs to be such that tools and algorithms can be developed to operate upon it. These tools would recognize common patterns and map essentially serial programs into parallel programs — for example, by parallelizing the iterations of loops. They could also provide feedback to the programmer as to the structure of a program, the degree of parallelization that is possible, and the Amdahl speedup factor.

A research project to develop such tools and algorithms is described in a companion document.

References

[To be provided.]
