
International Journal of Parallel Programming, Vol. 30, No. 4, August 2002

0885-7458/02/0800-0291/0 © 2002 Plenum Publishing Corporation

Reducing and Vectorizing Procedures for Telescoping Languages

Arun Chauhan and Ken Kennedy

Department of Computer Science, Rice University, Houston, Texas 77005. E-mail: {achauhan,ken}@rice.edu

Received March 15, 2002; revised April 15, 2002

At Rice University, we have undertaken a project to construct a framework for generating high-level problem-solving languages that can achieve high performance on a variety of platforms. The underlying strategy, called telescoping languages, builds problem-solving systems from domain-specific libraries and scripting languages by extensively preanalyzing libraries to produce a precompiler that optimizes library calls within the scripts as if they were primitives in the underlying language. Our study of applications written in Matlab by the signal processing group at Rice University has identified a collection of old and new optimizations that show promise for this particular domain. Two promising new optimizations are procedure vectorization and procedure strength reduction. By transforming these programs, at source level, according to the strategies described in this paper, we were able to achieve speedups ranging from a factor of 1.1 to 1.6 over the entire applications, with speedups for individual functions as high as 3.3.

KEY WORDS: Telescoping languages; Matlab; specialization; vectorization; reduction in strength.

1. INTRODUCTION

As high-performance computers have become more complex, application development in traditional programming languages has become more difficult because of the need to structure programs to achieve high performance on specific architectures. As a result, scientific application development in such languages is becoming the exclusive domain of experts. This has the unfortunate side effect of limiting the rate of scientific progress. One strategy for overcoming this problem is to make it possible for end users to develop applications in very high level domain-specific programming systems. The popularity of Matlab is evidence of the promise of this approach. However, such systems typically do not achieve performance levels that are sufficient to support many application domains, so many programs written in Matlab and other such languages are rewritten in traditional languages by professional programmers for better performance.

(Matlab is a registered trademark of MathWorks Inc.)

At Rice University, we have undertaken a project to construct a framework for generating high-level problem solving languages that can achieve high performance on a variety of platforms. The underlying strategy in this framework is to use advanced compiler technology to construct such systems from domain-specific libraries that are specifically programmed to be invoked from flexible scripting languages. This is similar to the way such systems are constructed today. However, because existing scripting languages typically treat libraries as black boxes, they fail to achieve acceptable performance levels for compute-intensive applications. Previous research has shown that this problem can be ameliorated by translating scripts to a conventional programming language and using whole-program analysis and optimization to improve performance.

This approach suffers from the disadvantage that script compilation times can be long and there is no provision to take advantage of the specific knowledge of the library developer on how to improve code involving multiple calls to the library. To overcome these disadvantages, we propose a new approach called telescoping languages, (1, 2) in which the libraries that provide component operations accessible from scripts are extensively analyzed and optimized in advance for use in a smart script compiler. By using the knowledge gained from this analysis and preliminary compilation and from annotations provided by the library designer, the compiler will be able to process scripts containing references to the library components as rapidly and effectively as if they were primitive operations in the base language. The telescoping languages approach is diagrammed in Fig. 1.

In this strategy, the translator generator (language building compiler), which could run for many hours, would analyze and optimize the domain library under a variety of assumptions about the calling program, many of which are generated from advice and sample calling sequences provided by


[Figure: the domain library feeds the language-building compiler, which produces an enhanced language compiler; a script passes through the script translator and the enhanced language compiler to produce an optimized object program.]

Fig. 1. Telescoping languages.

the library developer. The result would be an "enhanced language compiler" which would optimize and specialize calls to the domain library to produce code that would feed into the vendor optimizing compiler.

The basic idea of this approach is to invest substantive time in library analysis and optimization, on the assumption that the domain libraries will be recompiled far less often than the scripts that invoke them. By exhaustively exploring the implementation space for library modules in advance, it will be possible to automatically build a powerful framework for transforming and optimizing uses of library primitives. The preliminary analysis and optimization pass may take hours, but it would be worth the cost if the libraries were reused in many different scripts.

A critical technical challenge is how to use computation power to support speculative optimizations of library routines that will yield high enough performance benefits to make the scripting language efficient enough for developing real applications. To address this challenge we will need to develop new optimizations that can be systematically applied to the procedures of a domain library. With this in mind, we have entered into a collaboration with signal-processing researchers at Rice who are engaged in the development of a number of different applications in Matlab. These users are unwilling to give up Matlab for a lower-level programming language, but they need much higher performance and efficiency levels in the final applications, particularly if these applications are to run efficiently on embedded computers in hand-held wireless devices.

Our study of their applications has led to the development of two new optimizations: procedure vectorization and procedure strength reduction. Our preliminary investigations indicate that these optimizations can dramatically improve the performance of Matlab applications. Specifically, our experiments show that procedure strength reduction alone yields performance improvements of 10–40% on whole applications, with much higher local gains.


The remainder of this paper is organized as follows. In the next section, we survey some of the conventional optimizations that can have high benefit on library routines written in a language like Matlab and describe how they might be useful in the telescoping languages framework. We then introduce procedure strength reduction and procedure vectorization. In Sections 3 and 4, we discuss these two optimizations in detail. Section 5 presents the results of our preliminary experiments with these optimizations. Related work is surveyed in Section 6. Finally, we discuss the ongoing development and future research in Section 7 and conclude in Section 8.

2. STRATEGIES FOR ANALYSIS AND OPTIMIZATION

The telescoping languages framework requires that decisions about which optimizations are to be applied to libraries must be made far in advance of compilation of a program that invokes these libraries. Fortunately, because it can expend enormous computing power on the library processing phase, the language generation system can overcome this problem by generating many different specialized variants of the same routine, each optimized for different potential contexts. Then, at application compile time, the right specialized version of a library component can be rapidly selected based on compile-time approximations of the types and properties of the parameters passed to that library procedure.

For this strategy to work, we need to incorporate three types of compilation technologies:

• Library analysis strategies are used to construct jump functions that can quickly summarize the side-effects to parameters passed from the script to the library, making it possible to instantly propagate parameter properties (e.g., type and value) through procedure calls at script compilation time. (3) In addition, the analysis phase identifies promising optimization points in the library and reasons backward from those points to the preconditions at the call that are needed to make the optimizations feasible.

• Recognition and exploitation of identities is a critical component that makes it possible to translate sequences of calls to library routines into equivalent sequences that are more efficient. In addition, annotations provided by the library designer can help the language generation system identify opportunities for specialized optimizations, such as reduction in strength.


• Procedure specialization is used to produce variants of the library routines optimized in advance for different potential calling contexts. At script compilation time, the compiler first uses a property-propagation procedure, which invokes jump functions at library invocation points, and produces as precise an estimate as possible of the runtime properties of each parameter at each library call in the program being compiled. Then the compiler selects the best variant for the called procedure using a fast matching procedure similar to unification (from theorem proving). (4)

In our exploration of Matlab programs for signal processing, we have discovered several important opportunities for using each of these. Because jump functions have been dealt with extensively in the literature, we will concentrate on exploitation of identities and code specialization here.

2.1. Recognition and Exploitation of Identities

In the telescoping languages framework, library designers will be able to provide annotations that help the language generation system produce fast, effective optimizing compilers for the target language. These annotations can be used for two purposes: to help the language generation system focus on producing the right specialized variants, and as a basis for a high-level optimization system that improves performance by replacing sequences of library calls with equivalent, but more efficient, sequences. To support the first purpose, the annotations will need to provide hints as to how a given library routine is likely to be used. These hints might indicate which parameters are likely to be constant from call to call, and which procedures might be called in a loop with the loop index passed as one parameter (this will become important later). Many of these annotations can be derived from a well-designed collection of sample calling sequences.

To illustrate how the compiler can make use of library annotations, consider Fig. 2, which shows a code fragment from a real Digital Signal Processing (DSP) application. The expression pi*np/Num_osc is a common argument to sin and cos and does not change between the calls. Computing trigonometric functions can be expensive. If the compiler knows that sin and cos are both applied to the same argument, it can replace the two calls with a single call to a routine that computes both the sine and the cosine of a single argument in time only slightly greater than the time to compute the sine alone. Such a relationship among trigonometric functions may be common knowledge, but a similar relationship between library routines can only be known through user annotations.


function z = jakes_mp1(blength, speed, bnumber, N_Paths)
....
for k = 1:N_Paths
    ....
    xc = sqrt(2)*cos(omega*t_step*j') ...
         + 2*sum(cos(pi*np/Num_osc).*cos(omega*cos(2*pi*np/N)*t_step.*jp));
    xs = 2*sum(sin(pi*np/Num_osc).*cos(omega*cos(2*pi*np/N)*t_step.*jp));
    ....
end

Fig. 2. Code fragment from a DSP application showing opportunity for replacing an expensive library call by a simpler expression.

Another case where user annotations can be valuable is in identifying sequences of library routines that tend to be called in certain ways and, possibly, an equivalent less expensive call that can replace such a sequence in certain situations. Figure 3 shows an example of such a case that occurs in another DSP code. This example shows that the function calls change_form_inv and change_form always occur in pairs. Moreover, the input to the latter is always the direct output of change_form_inv,

function [s,r,j_hist] = min_sr1(xt,h,m,alpha)
....
while ~ok
    ....
    invsr = change_form_inv(sr0,h,m,low_rp);
    big_f = change_form(xt-invsr,h,m);
    ....
    while iter_s < 3*m
        ....
        invdr0 = change_form_inv(sr0,h,m,low_rp);
        sssdr = change_form(invdr0,h,m);
        ....
    end
    ....
    invsr = change_form_inv(sr0,h,m,low_rp);
    big_f = change_form(xt-invsr,h,m);
    ....
    while iter_r < n1*n2
        ....
        invdr0 = change_form_inv(sr0,h,m,low_rp);
        sssdr = change_form(invdr0,h,m);
        ....
    end
    ....
end

Fig. 3. Code fragment from a DSP application showing a pattern of library call sequences.


or a simple modification of it. This can lead to opportunities for optimization by combining the two functions. At the very least, it will eliminate a copy operation and a function call overhead, which can be significant in Matlab.
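The fusion of such a call pair can be sketched as follows. Since the real transforms in the DSP code are not published, the Python sketch below uses invented stand-in computations purely to show the shape of the transformation: the fused variant inlines the composition, so the intermediate result never materializes as a separately named value and one call overhead disappears.

```python
def change_form_inv(sr0, scale):
    # Hypothetical stand-in for the library's inverse transform.
    return [scale * v for v in sr0]

def change_form(x, scale):
    # Hypothetical stand-in for the forward transform.
    return [v / scale + 1.0 for v in x]

def change_form_of_inv(sr0, scale):
    # Fused variant: the composition is computed in one pass, with no
    # intermediate vector and one function call instead of two.
    return [(scale * v) / scale + 1.0 for v in sr0]

fused = change_form_of_inv([1.0, 2.0, 3.0], 2.0)
separate = change_form(change_form_inv([1.0, 2.0, 3.0], 2.0), 2.0)
```

In an interpreter like Matlab's, where each call and each temporary carries measurable overhead, this kind of pair-wise fusion is worthwhile even when no algebraic simplification is possible.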

2.2. Procedure Specialization

One way to think of specialization is as a kind of procedure cloning, which is a well known compiler technique. (5) In Matlab programs for DSP there are many specializations that will yield significant benefits. Before coming to those, however, we will survey some of the high-payoff optimizations for these programs.

2.2.1. Useful Optimizations for Matlab

This section describes some source level transformation techniques that are highly relevant from the perspective of compiling high-level languages like Matlab.

Vectorization is a technique that has traditionally been used in compiling for vector machines. (6, 7) It turns out that it is also a very effective source transformation technique to improve Matlab programs. The reason for this is that loops in Matlab have very high overheads, and library procedure calls inside vectorizable loops can often be replaced by their vector counterparts. Consider Fig. 4, which shows an example. This is the procedure

function z = jakes_mp1(blength, speed, bnumber, N_Paths)
....
for k = 1:N_Paths
    ....
    xc = sqrt(2)*cos(omega*t_step*j') ...
         + 2*sum(cos(pi*np/Num_osc).*cos(omega*cos(2*pi*np/N)*t_step.*jp));
    xs = 2*sum(sin(pi*np/Num_osc).*cos(omega*cos(2*pi*np/N)*t_step.*jp));

    % for j = 1 : Num
    %   xc(j) = sqrt(2) * cos(omega * t_step * j);
    %   xs(j) = 0;
    %   for n = 1 : Num_osc
    %     cosine = cos(omega * cos(2 * pi * n / N) * t_step * j);
    %     xc(j) = xc(j) + 2 * cos(pi * n / Num_osc) * cosine;
    %     xs(j) = xs(j) + 2 * sin(pi * n / Num_osc) * cosine;
    %   end
    % end
    ....
end

Fig. 4. Code fragment from a DSP application showing opportunity for vectorization.


jakes_mp1 that was also shown in Fig. 2. The complicated looking expressions that compute xc and xs are, in fact, equivalent to the commented loop nest that occurs just below the statements. This translation from a nested loop to vector statements was done by hand by the end users, but it is something that a compiler can easily handle with current vectorization technology. In this case, the vectorization reduces the total running time of the procedure from about 50 sec. to about 1.5 sec. on a 336 MHz SPARC processor, a speedup of more than 33.
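The equivalence between the loop nest and the vector statements can be checked directly. The following Python/NumPy sketch reproduces both forms of the xc/xs computation with invented values for omega, t_step, N, Num, and Num_osc (the application's actual values are not shown in the paper); broadcasting an oscillator-index column against a sample-index row plays the role of the Matlab vector expressions.

```python
import numpy as np

# Illustrative parameter values (not from the original application).
omega, t_step, N, Num, Num_osc = 0.3, 0.01, 8, 5, 4

# Loop version, mirroring the commented-out Matlab loop nest.
xc_loop = np.zeros(Num)
xs_loop = np.zeros(Num)
for j in range(1, Num + 1):
    xc_loop[j-1] = np.sqrt(2) * np.cos(omega * t_step * j)
    for n in range(1, Num_osc + 1):
        cosine = np.cos(omega * np.cos(2*np.pi*n/N) * t_step * j)
        xc_loop[j-1] += 2 * np.cos(np.pi*n/Num_osc) * cosine
        xs_loop[j-1] += 2 * np.sin(np.pi*n/Num_osc) * cosine

# Vectorized version: the oscillator indices (a column) broadcast
# against the sample indices (a row), removing both loops.
np_idx = np.arange(1, Num_osc + 1)   # n values
jp = np.arange(1, Num + 1)           # j values
cosine = np.cos(omega * np.cos(2*np.pi*np_idx[:, None]/N)
                * t_step * jp[None, :])
xc_vec = (np.sqrt(2) * np.cos(omega * t_step * jp)
          + 2 * np.sum(np.cos(np.pi*np_idx/Num_osc)[:, None] * cosine,
                       axis=0))
xs_vec = 2 * np.sum(np.sin(np.pi*np_idx/Num_osc)[:, None] * cosine,
                    axis=0)
```

The two forms compute identical values; the payoff in Matlab comes from moving the inner loops out of the interpreter and into the vector library.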

The same example also demonstrates the opportunities for common subexpression elimination. (8) The common subexpressions that occur in Matlab programs are often vector expressions; therefore, eliminating them can lead to significant performance gains.

Dynamic array re-shaping can be expected to occur frequently in programs written in scripting languages like Matlab. This is confirmed by our study of several DSP simulation applications. Array re-shaping can occur either implicitly, when a column or row is added to a matrix, or explicitly, through the reshape library call. A classic technique to optimize array re-shaping is beating and dragging along. (9) This technique defers data movement as much as possible and recomputes array indices, whenever possible, in terms of the original shape. Another, source level, technique for Matlab is to compute the largest possible size for a dynamic array and allocate all of it in the beginning through the zeros call. This approach has been reported to result in significant performance improvement in some cases. (10)

2.2.2. New Specializing Transformations

Based on our experience, we have identified two types of specialization transformations that offer high promise for improvements in Matlab programs for DSP. Our study revealed that frequently a procedure is called within a loop in which the only thing that varies from call to call is the loop index (or a simple function thereof), which is passed as a parameter. This presents two opportunities for optimization.

First, if the loop containing the call to the library can be distributed around that call so that the call is the only thing in the resulting loop, it may be feasible to interchange the loop and the call and then vectorize statements in the procedure body with respect to the loop. (11) To use this strategy, which we call procedure vectorization, in the telescoping languages framework, we would need to generate a variant of the procedure in which the loop is already interchanged to the inside and vectorized. (This variant would have extra parameters for the loop bounds.) In addition, the procedure would need to have jump functions that can be used to determine


whether loop distribution around the call is possible. However, given the value of vectorization in Matlab, we expect that this would be a high-payoff optimization on many applications. Section 4 discusses this strategy in more detail.

Second, we could break the procedure into two component procedures: one that computes and saves the values of variables that do not change when only the single parameter (a function of the loop index) is changing from call to call, and another that computes the next value on each call from the previous one. Once these two procedures are available, the first can be called outside the loop and the second, which is presumably much more efficient, inside the loop. This optimization, which we call procedure strength reduction, is discussed in Section 3. For this transformation to be applied in telescoping languages, we need some mechanism for determining which parameters might be the only ones to vary during a loop. Otherwise, the number of variants needed will be large. Programmer annotations or sample calling sequences can be very helpful here.

3. PROCEDURE STRENGTH REDUCTION

Our approach of procedure strength reduction is reminiscent of operator strength reduction, (12) hence the name. In our case, however, the strength reduction is applied to library procedures, which are the operations.

It is often the case that library routines are called in a loop with a number of arguments remaining loop invariant. The computations that depend only on these loop invariant arguments can be abstracted away into an initialization part that can be moved outside the loop. The part that is called inside the loop depends on the loop index and performs the incremental computation. Figure 5 illustrates this in a simple abstract case. Function f is called inside a for loop in which the arguments c1, c2, and c3 remain invariant. Thus, f can be split into an initialization function fµ that computes the invariant part and f∆ that computes the iteration dependent part.

for i = 1:N
    x = f(c1, c2, i, c3);
end

fµ(c1, c2, c3);
for i = 1:N
    x = f∆(i);
end

Fig. 5. Example of procedure strength reduction. Arguments ci are invariant in the for loop.
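The transformation of Fig. 5 can be sketched in executable form. The Python below is purely illustrative: the function f and its cost model are invented, and the saved invariant state is kept in a module-level dictionary standing in for whatever persistence mechanism the split procedures would use.

```python
def f(c1, c2, i, c3):
    # Invented example: an "expensive" part depending only on the
    # invariant arguments, plus a cheap part depending on i.
    base = sum(k * c1 + c2 for k in range(1000)) * c3
    return base + i

_state = {}

def f_mu(c1, c2, c3):
    # Initialization part: computed once, outside the loop.
    _state['base'] = sum(k * c1 + c2 for k in range(1000)) * c3

def f_delta(i):
    # Iteration part: only the cheap, index-dependent work remains.
    return _state['base'] + i

# Original loop: the invariant work is redone on every call.
orig = [f(2, 3, i, 5) for i in range(1, 6)]

# Strength-reduced loop: invariant work hoisted into f_mu.
f_mu(2, 3, 5)
reduced = [f_delta(i) for i in range(1, 6)]
```

The two loops produce identical results, but the reduced version performs the invariant computation once instead of N times.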


Figure 6 illustrates this idea with a concrete example from a real DSP application. The upper part of Fig. 6 shows the original code and the lower part shows the code after applying procedure strength reduction. In this case, procedure strength reduction can be applied to all three procedures that are called inside the loops. Only the arguments ii and snr vary across loop iterations. As a result, computations performed on all the remaining arguments can be moved out of the procedures.

% Initialization
....
for ii = 1:200
    chan = jakes_mp1(16500,160,ii,num_paths);
    ....
    for snr = 2:2:4
        ....
        [s,x,ci,h,L,a,y,n0] = ...
            newcodesig(NO,l,num_paths,M,snr,chan,sig_pow_paths);
        ....
        [o1,d1,d2,d3,mf,m] = codesdhd(y,a,h,NO,Tm,Bd,M,B,n0);
        ....
    end
end
....

% Initialization
....
jakes_mp1_init(16500,160,num_paths);
....
[h, L] = newcodesig_init_1(NO,l,num_paths,M,sig_pow_paths);
m = codesdhd_init(a,h,NO,Tm,Bd,M);
for ii = 1:200
    chan = jakes_mp1_iter(ii);
    ....
    a = newcodesig_init_2(chan);
    ....
    for snr = 2:2:4
        ....
        [s,x,ci,y,n0] = newcodesig_iter(snr);
        [o1,d1,d2,d3,mf] = codesdhd_iter(y);
        ....
    end
    ....
end
....

Fig. 6. Applying procedure strength reduction to procedures called in ctss.


Notice that in the case of the procedure newcodesig not all the arguments are invariant inside the second level of the enclosing loop nest. Therefore, the initialization part for newcodesig cannot be moved completely out of the loop nest, resulting in two initialization components: newcodesig_init_1 and newcodesig_init_2.

In principle, for maximal benefit, a procedure should be split into multiple components so that all the invariant computation is moved outside of the loops. Thus, in the telescoping languages model, newcodesig would be a library routine that would have been specialized for a context in which only its fifth argument varies across invocations. Such a specialization of procedures is clearly context dependent. As mentioned before, example calling sequences and annotations by the library writer would be used to guide the specialization.

Figure 7 shows a general case of multi-level splitting of procedures. In this case α0...αk are argument sub-sequences, where the sub-sequence αi is invariant at loop level i but not at level i−1. Indiscriminate specialization can lead to a combinatorial explosion of clones. The extent of reduction in strength must be weighed against the extra overheads of calling the initialization (µ) procedures, the extra space needed to store library clones, and the script compilation time. Appendix A discusses some of the trade-offs involved.

for i1 = 1:N1
    for i2 = 1:N2
        ....
        for ik = 1:Nk
            x = f(α0, α1, α2, ..., αk);
        end
        ....
    end
end

fµ0(α0);
for i1 = 1:N1
    fµ1(α1);
    for i2 = 1:N2
        ....
        fµk−1(αk−1);
        for ik = 1:Nk
            x = fµk(αk);
        end
        ....
    end
end

Fig. 7. Example of a general case of procedure strength reduction. Argument sub-sequence αi represents the arguments that are invariant at level i but not at i−1.
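A two-level instance of Fig. 7's splitting can be sketched as follows. The Python is illustrative: the computations inside f and its split components are invented, and a dictionary stands in for the state the µ procedures would save. Here α0 is invariant everywhere, while α1 becomes invariant once the outer index i1 is fixed.

```python
_env = {}

def f(alpha0, alpha1, i2):
    # Invented monolithic procedure: mixes work at three levels.
    return (alpha0 * 100) + (alpha1 * 10) + i2

def f_mu0(alpha0):
    # Hoisted above both loops: depends only on alpha0.
    _env['a0'] = alpha0 * 100

def f_mu1(alpha1):
    # Hoisted above the inner loop: alpha1 is fixed per outer iteration.
    _env['a1'] = alpha1 * 10

def f_inner(i2):
    # Only the innermost, index-dependent work remains.
    return _env['a0'] + _env['a1'] + i2

orig, reduced = [], []
for i1 in range(3):
    for i2 in range(4):
        orig.append(f(7, i1, i2))

f_mu0(7)
for i1 in range(3):
    f_mu1(i1)
    for i2 in range(4):
        reduced.append(f_inner(i2))
```

Each level of splitting adds a clone and a call site, which is why the benefit must be weighed against the clone explosion discussed above.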


Depending on the arguments that are invariant, one or more of the return values of a procedure may also be invariant. This knowledge is needed to propagate the invariance property. The decision about whether to reduce a procedure in strength may therefore affect the decision for other procedures whose arguments depend on the first procedure.

4. PROCEDURE VECTORIZATION

Vectorization of statements inside loops turns out to be a big win in Matlab programs. We can extend this idea to procedure calls, where a call inside a loop (or a loop nest) can be replaced by a single call to the vectorized version of the procedure. In the context of telescoping languages this can be done by generating an appropriate variant of the procedure.

Consider again the DSP program ctss in Fig. 6. It turns out that the loop enclosing the call to jakes_mp1 can be distributed around it, thus giving rise to an opportunity to vectorize the procedure. If jakes_mp1 were to be vectorized, the call to it inside the loop could be moved out as shown in Fig. 8. This involves adding one more dimension to the return value chan.
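The shape of this transformation can be sketched independently of jakes_mp1's actual computation, which is not reproduced here. In the Python sketch below (invented body, illustrative names), the scalar routine is called once per loop iteration, while the vectorized variant takes the whole index range and returns a result with one extra leading dimension, exactly the extra dimension added to chan.

```python
import math

def jakes_scalar(ii, num_paths):
    # Invented per-iteration body standing in for jakes_mp1.
    return [math.cos(0.1 * ii * p) for p in range(num_paths)]

def jakes_vectorized(indices, num_paths):
    # Vectorized variant: one call covers the whole index range and
    # returns one row per loop index (the extra dimension on chan).
    return [[math.cos(0.1 * ii * p) for p in range(num_paths)]
            for ii in indices]

# Original form: the procedure is called inside the loop.
chan_loop = [jakes_scalar(ii, 4) for ii in range(1, 6)]

# After procedure vectorization: a single call outside the loop.
chan_vec = jakes_vectorized(range(1, 6), 4)
```

Each iteration of the original loop then indexes into the precomputed result, just as the transformed ctss code indexes chan(ii,:,:).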

To effectively apply this optimization in the telescoping languages setting, it must be possible to distribute loops around the call to the candidate for procedure vectorization. This requires an accurate representation of the load-store side effects to array parameters of the procedure, which would be encapsulated in specialized jump functions that produce an

% Initialization
....
chan = jakes_mp1_vectorized(16500,160,[1:200],num_paths);
for ii = 1:200
    ....
    for snr = 2:2:4
        ....
        [s,x,ci,h,L,a,y,n0] = ...
            newcodesig(NO,l,num_paths,M,snr,chan(ii,:,:), ...
                       sig_pow_paths);
        ....
        [o1,d1,d2,d3,mf,m] = codesdhd(y,a,h,NO,Tm,Bd,M,B,n0);
        ....
    end
end
....

Fig. 8. Applying procedure vectorization to jakes_mp1.


approximation to the patterns accessed. An example of such a representation is a regular section descriptor (RSD). (13, 14) Methods for computing these summaries are well known.

In practice, the benefit of vectorization will need to be balanced against the cost of a larger memory footprint as well as the costs of specialization, as indicated in the previous section.

5. EXPERIMENTAL EVALUATION

We studied three different DSP applications to evaluate the idea of procedure strength reduction. Procedure vectorization was applicable in one of the cases. All these applications are part of real simulation experiments being done by the wireless group in the Electrical and Computer Engineering Department at Rice University.

In order to evaluate our idea, we transformed the applications by hand, carefully applying only the transformations that a practical compiler can be expected to perform and those that are relevant to procedure strength reduction. The transformations included common subexpression elimination, constant propagation, loop distribution, and procedure strength reduction. The transformations were carried out at the source-to-source level, and both the original and the transformed programs were run under the standard Matlab interpreter, unless noted otherwise. The applications ranged between about 200 and 800 lines in size, and all took several hours to days to complete their runs. Some of the timing results were obtained by running the applications for fewer iterations than usual in order to keep the run time reasonable.

All timing results are from experiments conducted on a 336 MHz SPARC-based machine under Matlab 5.3.

5.1. ctss

ctss is a program that simulates a digital communication system where the data is encoded by a convolutional code and is then transmitted over a wireless channel using a new modulation scheme that exploits the characteristics of the channel to improve performance. The top level program consists of a nested loop and procedure calls inside those loops. While these procedures are written in Matlab and are part of the program, some of them are effectively used as domain-specific library routines, since they are shared by multiple programs.

Figure 9 shows the improvements resulting from our transformations applied by hand. The first part of Fig. 9 shows the performance improvements achieved in various top-level procedures relative to the original


Fig. 9. Performance improvements in ctss after applying procedure strength reduction. The graphs show only the top level procedures. Almost all performance gains for the program come out of codesdhd.

running time. Notice that the procedure jakes_mp1 achieved more than threefold speedup. These results do not include the dramatic performance improvement that results from vectorizing the loop inside the procedure jakes_mp1, since that vectorization had already been performed by the user.

The whole program achieved only a little less than 40% speed improvement, since most of the time in the application was spent in the procedure codesdhd, as shown in the second part of Fig. 9. After applying reduction in strength to this procedure, its execution time fell from 23,000 sec. to 14,640 sec., and accounted for almost the entire performance gain for the whole application.

In all cases the initialization parts of the procedures were called much less frequently than the iterative parts, making the time spent in the initialization parts comparatively insignificant.

Figure 10 shows the timing results for jakes_mp1 upon applying procedure vectorization. This procedure is fully vectorizable with respect to


Fig. 10. Performance improvement in jakes_mp1 after applying procedure vectorization.

the loop index of the loop inside which it is called in the main program. The chart shows a performance improvement of 33% over the original code for a 100-iteration loop. This performance gain was almost unchanged down to a one-iteration loop.

Vectorization resulted in a smaller improvement in performance compared to strength reduction for two possible reasons. First, vectorization of the procedure necessitated an extra copying stage inside the procedure. Second, vectorization resulted in much higher memory usage, giving rise to potential performance bottlenecks due to the memory hierarchy. For very large numbers of loop iterations, the effects of the memory hierarchy can cause the vectorized loop to perform even worse than the original code. However, these effects can be mitigated by strip-mining the loop.
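Strip-mining the vectorized call can be sketched as follows. The Python below is illustrative (the worker body and strip width are invented): the index range is processed in fixed-size strips, so only one strip's worth of the vectorized variant's intermediate storage is live at a time, bounding the memory footprint while preserving the result.

```python
def process_vectorized(indices):
    # Stand-in for the vectorized procedure body: one call
    # handles a whole batch of loop indices.
    return [i * i for i in indices]

def process_strip_mined(n, strip=32):
    # Strip-mined driver: the full range 0..n-1 is split into
    # strips of at most `strip` indices, each vectorized separately.
    out = []
    for lo in range(0, n, strip):
        out.extend(process_vectorized(range(lo, min(lo + strip, n))))
    return out
```

The strip width becomes a tuning knob: large enough to amortize call overhead, small enough that each strip's working set fits in cache.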

5.2. sML_chan_est

This application computes the delay, phase, and amplitude of all the echoes produced by the transmissions from a group of cell phones in a real-world scenario. The application is written under the Simulink environment provided with Matlab (Simulink is a registered trademark of MathWorks Inc.). We studied a particular procedure in the

Reducing and Vectorizing Procedures for Telescoping Languages 305

Fig. 11. Performance improvement in sML_chan_est codeafter applying procedure strength reduction. This procedure iscalled inside a complicated Simulink application and is the mosttime consuming component.

application called sML_chan_est that is the one where the applicationspends most of its time.

Figure 11 shows the result of applying strength reduction to this procedure. The chart shows the time taken by the initialization call and the iterative call. The initialization call has a loop that resizes an array in each iteration. If the entire array is preallocated (using zeros) then the time spent in initialization drops from 12 sec. (init call) to 6.2 sec. (init with preallocation). This illustrates the value of the techniques needed to handle array reshaping and resizing.
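The cost being removed here can be illustrated with a small sketch, in Python with hypothetical functions (Matlab's zeros-based preallocation is analogous): growing an array by copying on every iteration is quadratic in the array length, while preallocating makes the loop linear.

```python
def grow_incrementally(n):
    # Mirrors the Matlab anti-pattern of resizing a(i) on every iteration:
    # each step builds a fresh, larger array (O(n^2) copying overall).
    a = []
    for i in range(n):
        a = a + [i * i]   # full copy on every iteration
    return a

def grow_preallocated(n):
    # Mirrors a = zeros(1, n) followed by in-place assignment (O(n) overall).
    a = [0] * n
    for i in range(n):
        a[i] = i * i
    return a
```

Both produce identical results; only the allocation behavior differs, which is why the transformation is safe for the compiler to apply.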

5.3. outage_lb_fad

This application computes a lower bound on the probability of outage for a queue transmitting in a time-varying wireless channel under average power and delay constraints. It is a relatively small application that spends almost all of its time in a single procedure call. However, complex and deeply nested loops make the application run time very long (several hours). (See Fig. 12.)

A simple application of procedure strength reduction brings the runtime of this application down by 13.5%. This indicates the power of our approach in benefiting even relatively tightly written code.
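Procedure strength reduction as applied in these experiments can be sketched generically; the Python below uses a made-up computation standing in for the DSP code, with c as the loop-invariant argument and x as the varying one.

```python
import math

def f(c, x):
    # Original procedure: the table `w` depends only on the loop-invariant
    # argument c, yet is recomputed on every call inside the loop.
    w = [math.sin(2 * math.pi * k * c) for k in range(8)]
    return sum(wk * x for wk in w)

def f_init(c):
    # Strength-reduced initialization part: hoists the invariant work.
    return [math.sin(2 * math.pi * k * c) for k in range(8)]

def f_iter(w, x):
    # Strength-reduced iterative part: only the work depending on x.
    return sum(wk * x for wk in w)
```

Calling f_init once before the loop and f_iter inside it yields the same results as calling f on every iteration, while the invariant table is built only once.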


Fig. 12. Performance improvement in outage_lb_fad after applying procedure strength reduction. This is a stand-alone simulation application that spends almost all of its time in one procedure. The improvement shown is for the entire application, all of which comes from optimizing the single procedure.

5.4. Effect of Compilation

Matlab has a compiler called mcc that translates Matlab into C or C++. Studying the code emitted by the compiler indicates that it performs a fairly straightforward translation from Matlab to library function calls. Indeed, the output resembles a parse tree of the original program. Therefore, the benefit of the compiler can be expected to be greatest when interpretive overheads are high. This would be the case when the application spends considerable time in deeply nested loops.

We studied the effects of compilation by mcc on procedure strength reduction for outage_lb_fad. The procedure ser_test_fad, which accounts for almost the entire running time of the application, has a 5-level-deep loop. Figure 13 shows the results. While the performance improvement due to procedure strength reduction falls slightly for the compiled code, it is still significant at more than 11%. Combining procedure strength reduction and compilation results in a performance improvement of 23.4%.

The ‘‘stand-alone’’ column indicates runtimes for the code that was compiled to run as a stand-alone application. It is not clear why the stand-alone version of the code is slower than the compiled version that runs under the Matlab environment. It could be an artifact of dynamically loadable libraries.


Fig. 13. Effect of compilation on ser_test_fad, which is called from the application outage_lb_fad and where the application spends almost all of its running time.

It should be observed that this application has very high interpretive overheads due to deep loop nesting and, therefore, represents a best-case scenario for compilation. Other applications showed no significant performance gains due to compilation. However, a sophisticated technique could utilize program properties to generate more optimized code. We believe that such a technique could speed up almost all Matlab programs. Section 7 discusses our approach to compiling Matlab within the telescoping languages framework.

6. RELATED WORK

As mentioned before, the Matlab package comes with a Matlab compiler, called mcc, by The MathWorks, Inc. (15) The mcc compiler works by translating Matlab into C or C++ and then using a user-specified C/C++ compiler to generate object code. The translated code does not indicate that the compiler performs any of the advanced interprocedural analysis proposed for telescoping languages.

Type inferencing for Matlab has been addressed by the FALCON project at the University of Illinois. (16, 17) Their approach is to translate Matlab programs into FORTRAN 90. The biggest drawback of their approach is that they handle all function calls through inlining. This clearly leads to a potential blowup in compilation time if library routines have to be optimized. Moreover, recursive functions cannot be handled with this approach, and some newer language features of Matlab have not been addressed.

Building on the FALCON system and DeRose's work, the University of Illinois and Cornell University are developing a Just-In-Time (JIT) compiler for Matlab under their joint project called MAJIC. (18) In this context, some recent work by Menon and Pingali has explored source-level transformations for Matlab. (10, 19) Their approach uses formal specifications of the relationships between various operations and an axiom-based approach to determine an optimal evaluation order. They evaluated their approach on matrix multiplication.

Currying in functional programming languages is a concept related to our idea of procedure strength reduction. (20) In the functional programming world, partial evaluation has been well studied. (21) However, many ideas from functional languages do not carry over well into languages like Matlab, where functions (procedures) can have side-effects.

A technique called ‘‘automatic differentiation’’ of programs has been used successfully to replace high-cost procedure calls inside loops with incremental computation. (22) This technique is mainly employed in numerical codes to replace the original computation with a less expensive, but often approximate, computation. The code to compute the increments is generated automatically from the original code through automatic differentiation. (23) While our focus is on moving strictly invariant code outside procedures, user-directed automatic differentiation could complement our approach in specific contexts.

The APL programming language is the original matrix-manipulation language. Several techniques developed for APL can be useful in Matlab compilation, for example, the techniques to handle array reshaping. (9)

Designing annotation languages for optimizing libraries is a subject of ongoing research. (24)

There are many projects under way to translate Matlab into lower-level languages like C, C++, or FORTRAN. Some of these target parallel machines using standard message-passing libraries. These include the Otter system at Oregon State University, the CONLAB compiler from the University of Umeå in Sweden, and MENHIR from Irisa in France. (25–27) The MATCH project at Northwestern University attempts to compile Matlab directly to special-purpose hardware. (28)

7. AN IMPLEMENTATION STRATEGY AND FUTURE WORK

This article has evaluated procedure strength reduction and procedure vectorization, along with other transformations, at the Matlab source level.


As outlined in Fig. 1, the telescoping languages strategy envisions translating high-level languages into intermediate forms that are compiled into highly optimized object code. This is the path we are taking in implementing a prototype telescoping-languages compiler for Matlab.

Translating to an intermediate language, like C or FORTRAN, enables us to leverage the highly tuned vendor compilers that are widely available and that do a very good job of implementing traditional low-level optimizations. Additionally, using a popular low-level language as the intermediate representation opens the door to using back-end libraries written in these languages that have been extensively tuned for specific architectures, e.g., ATLAS. (29) It also provides a route to making the telescoping languages framework relatively independent of the scripting language. This is achieved by having a lightweight front-end translate scripts into the intermediate language without attempting any high-level transformations, and then carrying out all the transformations within this intermediate form.4 This also allows library routines to be written directly in the intermediate language. Finally, the Matlab mcc compiler, mentioned earlier, translates into exactly such a form, which can be used as a convenient front-end to the telescoping languages system.

4 Thanks to Dr. John Mellor-Crummey for providing this insight.

Most high-level scripting languages, including Matlab, are typeless. Type inferencing is needed not only to translate into a typed intermediate language but also to enable several optimizations based on types. For example, library procedures can be specialized based on the types of their arguments and return values. As indicated earlier, specialization of libraries in the telescoping languages framework is enabled by an annotation language. The language would allow expressing high-level properties of library procedures and their relationships within specific contexts. Library writers would use the annotations to guide the compiler in generating specialized versions of library procedures. Type inferencing as well as the design of an annotation language are work in progress.
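As one illustration of type-based specialization (a Python sketch with invented names; in the actual system, specialization would be driven by annotations and type inference), a generic routine that dispatches on its argument type at every call can be split into specialized variants that skip the dispatch entirely:

```python
def saxpy_generic(a, x, y):
    # Generic version: must test at run time whether x is a vector or a
    # scalar, since a typeless language gives no static guarantee.
    if isinstance(x, list):
        return [a * xi + yi for xi, yi in zip(x, y)]
    return a * x + y

# Specialized variants a compiler might emit once argument types are
# known, removing the per-call type dispatch.
def saxpy_scalar(a, x, y):
    return a * x + y

def saxpy_vector(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Once type inference proves which variant applies at a call site, the script compiler can bind the call directly to the specialized version.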

The problem of utilizing the knowledge encoded in library annotations is a challenging one. The solution space includes database-style query optimization based on first-order logic, rewriting systems, and self-learning techniques based on artificial intelligence. Exploring some of these techniques is part of our future work.

8. CONCLUSIONS

We described a compilation framework for scripting languages that we call telescoping languages. Figure 14 depicts this framework graphically.


Fig. 14. Compilation framework.

Our work focuses only on the library and script compilation components. Type inferencing is needed since Matlab programs do not have variable declarations, and a smart runtime system that includes dynamic or Just-in-Time (JIT) compilation can be useful for Grid-computing scenarios. (30)

Many existing compilation techniques are useful for compiling telescoping languages, especially vectorization, common subexpression elimination (CSE), and invariant code motion. We introduced two new techniques, procedure strength reduction and procedure vectorization. Both techniques work within the telescoping languages model by providing the benefits of interprocedural analysis without incurring extra costs at script compilation time or requiring library sources. We evaluated procedure strength reduction by studying three real DSP applications. Applying procedure strength reduction leads to up to 40% overall application performance improvement and up to 300% procedure-level performance improvement through source-level transformations. For a procedure in one of the applications, applying procedure vectorization leads to a 33% performance improvement.

The idea of procedure strength reduction can be extended to include operator strength reduction. Thus, not only can the expressions that depend only on the invariant arguments be extracted out of a procedure, the remaining expressions can also be reduced further. In many cases there may not be an easy way of reducing these remaining expressions; in those cases, automatic differentiation can be helpful.
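Classic operator strength reduction, which this extension builds on, can be sketched as follows (a Python sketch with illustrative names): a multiplication involving the loop index is replaced by repeated addition to an accumulator.

```python
def scaled_indices_original(n, c):
    # Each iteration performs a multiplication: i * c.
    return [i * c for i in range(n)]

def scaled_indices_reduced(n, c):
    # Operator strength reduction: the per-iteration multiplication is
    # replaced by an addition to a running accumulator.
    out = []
    acc = 0
    for _ in range(n):
        out.append(acc)
        acc += c
    return out
```

The same principle, applied at the granularity of whole procedures, is what procedure strength reduction does with the invariant portions of a call.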

APPENDIX A

A.1. Procedure Strength Reduction for Nested Loops and Its Relative Gain

Suppose that we apply reduction in strength to a procedure that is called inside a loop nest k levels deep. We associate execution times with each reduction as shown in Fig. 15.


Original call inside a k-level loop nest:

    for i1 = 1:N1
      for i2 = 1:N2
        ...
          for ik = 1:Nk
            x = f(α0, α1, α2, ..., αk);
          end
        ...
      end
    end

After applying procedure strength reduction:

    f_μ0(α0);                  % time T_μ0
    for i1 = 1:N1
      f_μ1(α1);                % time T_μ1
      for i2 = 1:N2
        ...
          f_μ(k-1)(α(k-1));    % time T_μ(k-1)
          for ik = 1:Nk
            x = f_μk(αk);      % time T_μk
          end
        ...
      end
    end

Fig. 15. Applying procedure strength reduction in a general case, along with the execution times of the various components.

The original running time, T_h, is given by:

    T_h = \Bigl( \prod_{i=1}^{k} N_i \Bigr) \times T

where T is the original running time of one call to f. The new execution time, T_n, for the translated code is given by

    T_n = T_{\mu_0} + N_1 \times T_{\mu_1} + (N_1 N_2) \times T_{\mu_2} + \cdots + \Bigl( \prod_{i=1}^{k-1} N_i \Bigr) \times T_{\mu_{k-1}} + \Bigl( \prod_{i=1}^{k} N_i \Bigr) \times T_{\mu_k}

Thus, the difference in running time, T_D, is

    T_D = T_h - T_n

and the relative improvement in speed, T_D / T_h, is given by

    \frac{T_D}{T_h} = 1 - \frac{T_n}{T_h}

or

    \frac{T_D}{T_h} = 1 - \frac{1}{T} \left[ T_{\mu_k} + \frac{T_{\mu_{k-1}}}{N_k} + \frac{T_{\mu_{k-2}}}{N_{k-1} N_k} + \cdots + \frac{T_{\mu_1}}{\prod_{i=2}^{k} N_i} + \frac{T_{\mu_0}}{\prod_{i=1}^{k} N_i} \right]


This clearly provides an upper bound on the amount of performance improvement that can be achieved with this method, which is 1 - T_{\mu_k}/T.

In addition, this equation provides another useful insight. It is usually the case that the sum of the running times of all the f_{\mu_i} equals the original running time T of f, i.e., T = \sum_{i=0}^{k} T_{\mu_i}. Clearly, to obtain maximum performance improvement the series in the brackets must be minimized. It is minimized if we can make all of T_{\mu_1} \cdots T_{\mu_{k-1}} zero while maximizing T_{\mu_0}, given the constraint that all T_{\mu_i} sum to T. This corresponds to the intuition that computation should be moved out of the entire loop nest, if possible. However, notice that except for the first term, all other terms in the brackets have an iteration range in the denominator. Thus, for any reasonably large loop the contribution from those terms is insignificant. For example, if N_k is 100 the effect of all terms after the first is on the order of only 1%. This leads to the conclusion that, except when the innermost loop is very short (in which case the compiler should consider loop unrolling), splitting the procedure f more than once may not provide significant benefits. From the compiler's perspective, it need not spend much time attempting to reduce procedures multiple times for a multi-level loop, since the marginal benefits after the first split are minimal.

ACKNOWLEDGMENTS

We thank Dr. John Mellor-Crummey, Dr. Rob Fowler, and Dr. Bradley Broom for several insightful discussions. The DSP code was made available by the Center for Multimedia Communications and its director, Prof. Behnaam Aazhang. Many others helped in understanding various aspects of the codes, including Dr. Srikrishna Bhashyam, Dr. Vidya Venkataraman, Dr. Hyeokho Choi, Shriram Sarvotham, Dr. Dinesh Rajan, and Bryan Jones.

The work by Chauhan was supported by the Texas Advanced Technology Program Grant 003604-0061-2001. Other funds for the project came from the National Science Foundation Grant EIA-9975020 and the Los Alamos Computer Science Institute Grant 03891-001-99-46.

REFERENCES

1. Ken Kennedy, Telescoping Languages: A Compiler Strategy for Implementation of High-level Domain-specific Programming Systems, Proc. Int'l Parallel and Distributed Proc. Symp. (May 2000).

2. Ken Kennedy, Bradley Broom, Keith Cooper, Jack Dongarra, Rob Fowler, Dennis Gannon, Lennart Johnson, John Mellor-Crummey, and Linda Torczon, Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries, J. Parallel and Distributed Computing, 61(12):1803–1826 (December 2001).


3. Dan Grove and Linda Torczon, Interprocedural Constant Propagation: A Study of Jump Function Implementations, Proc. ACM SIGPLAN Conf. Progr. Lang. Design and Implementation, pp. 90–99 (June 1993).

4. Mike S. Paterson and Mark N. Wegman, Linear Unification, J. Comput. Syst. Sci., 16(2):158–167 (1978).

5. Keith D. Cooper, Mary W. Hall, and Ken Kennedy, Procedure Cloning, Proc. IEEE Int'l Conf. Computer Lang. (April 1992).

6. John R. Allen and Ken Kennedy, Automatic Translation of Fortran Programs to Vector Form, ACM Trans. Progr. Lang. Syst., 9(4):491–542 (October 1987).

7. Randy Allen and Ken Kennedy, Advanced Compilation for Vector and Parallel Computers, Morgan Kaufmann Publishers, San Mateo, California (2001).

8. Preston Briggs and Keith D. Cooper, Effective Partial Redundancy Elimination, Proc. ACM SIGPLAN Conf. Progr. Lang. Design and Implementation, pp. 159–170 (June 1994).

9. Philip S. Abrams, An APL Machine, Ph.D. thesis, Stanford Linear Accelerator Center, Stanford University (1970).

10. Vijay Menon and Keshav Pingali, A Case for Source Level Transformations in Matlab, Proc. ACM SIGPLAN/USENIX Conf. Domain Specific Lang. (1999).

11. Mary Hall, Ken Kennedy, and Kathryn McKinley, Interprocedural Transformations for Parallel Code Generation, Proc. Supercomputing, pp. 424–434 (November 1991).

12. Frances E. Allen, John Cocke, and Ken Kennedy, Reduction of Operator Strength, in S. Muchnick and N. Jones, eds., Program Flow Analysis: Theory and Applications, Prentice–Hall, pp. 79–101 (1981).

13. David Callahan and Ken Kennedy, Analysis of Interprocedural Side-effects in a Parallel Programming Environment, J. Parallel Distribut. Comput., 5:517–550 (1988).

14. Paul Havlak and Ken Kennedy, An Implementation of Interprocedural Bounded Regular Section Analysis, IEEE Trans. Parallel Distribut. Syst., 2(3):350–360 (July 1991).

15. http://www.mathworks.com/. Mathworks, Inc.

16. Luiz Antônio DeRose, Compiler Techniques for Matlab Programs, Ph.D. thesis, University of Illinois at Urbana-Champaign (1996).

17. Luiz DeRose and David Padua, Techniques for the Translation of MATLAB Programs into Fortran 90, ACM Trans. Progr. Lang. Syst., 21(2):286–323 (March 1999).

18. http://polaris.cs.uiuc.edu/majic/. MAJIC Project.

19. Vijay Menon and Keshav Pingali, High-level Semantic Optimization of Numerical Code, Proc. ACM-SIGARCH Int'l Conf. Supercomputing (1999).

20. Haskell B. Curry and Robert Feys, Combinatory Logic, Vol. 65, Studies in Logic and the Foundations of Mathematics, North-Holland Pub. Co., Amsterdam (1958).

21. ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation.

22. Andreas Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM, Philadelphia, Pennsylvania (2000).

23. http://www.cs.rice.edu/~adifor/. ADIFOR Project: Automatic Differentiation of Fortran.

24. Samuel Z. Guyer and Calvin Lin, An Annotation Language for Optimizing Software Libraries, Proc. ACM SIGPLAN/USENIX Conf. Domain Specific Languages (1999).

25. Michael J. Quinn, Alexey Malishevsky, and Nagajagadeswar Seelam, Otter: Bridging the Gap Between MATLAB and ScaLAPACK, Proc. IEEE Int'l Symp. High Performance Distributed Computing (August 1998).

26. Peter Drakenberg, Peter Jacobson, and Bo Kågström, A CONLAB Compiler for a Distributed Memory Multicomputer, SIAM Conf. Parallel Processing for Scientific Computing, 2:814–821 (1993).


27. Stéphane Chauveau and Francois Bodin, MENHIR: An Environment for High Performance MATLAB, Proc. Workshop Languages, Compilers, and Run-time Systems for Scalable Computers (May 1998).

28. Anshuman Nayak, Malay Haldar, Abhay Kanhere, Pramod Joisha, Nagraj Shenoy, Alok Choudhary, and Prith Banerjee, A Library-based Compiler to Execute MATLAB Programs on a Heterogeneous Platform, Proc. Conf. Parallel and Distributed Computing Systems (August 2000).

29. R. Clint Whaley and Jack J. Dongarra, Automatically Tuned Linear Algebra Software, Proc. SC: High Performance Networking and Computing Conf. (November 1998).

30. http://www.hipersoft.rice.edu/grads/. GrADS: Grid Application Development Software Project.
