

    1996 International Conference on Parallel Processing

A THREE-PARAMETER FAST GIVENS QR ALGORITHM FOR SUPERSCALAR PROCESSORS

James J. Carrig Jr. and Gerard G. L. Meyer
Department of Electrical and Computer Engineering
The Johns Hopkins University
Baltimore, MD 21218
jcarrig@ece.jhu.edu,

Abstract - We present a three-parameter Fast Givens QR algorithm that exploits parallelism to improve performance on superscalar processors. We provide a selection of parameter values for which the new algorithm reduces to the standard algorithm, but show that non-standard values minimize the number of cache misses, memory references, and pipeline stalls. Using a tractable model of a superscalar machine architecture, we derive rules for estimating the optimal combination of parameter values. Applying these rules, we observe a speedup over the standard algorithm of 2.4 on the Intel Pentium Pro system, 2.0 on a single thin POWER2 processor of the IBM SP2, 1.6 on a single wide POWER2 processor of the IBM SP2, and 4.2 on a single R8000 processor of the SGI POWER Challenge XL.

INTRODUCTION

Various forms of QR decomposition arise in signal processing applications that solve the full rank least squares problem, perform Kalman filtering, find the eigenvalues of a matrix, or compute the singular value decomposition of a matrix. This paper modifies the standard fast Givens QR algorithm by introducing a cache parameter, a register parameter, and a pipeline parameter that exploit parallelism to improve performance on superscalar processors. The proposed algorithm is numerically identical to the standard Fast Givens algorithm; the reduction in execution time results from streamlining memory and pipeline performance.

The QR decomposition problem is defined as: given A ∈ R^{m×n}, m ≥ n, rank(A) = n, compute an orthogonal Q ∈ R^{m×m} and an upper triangular R ∈ R^{m×n} that satisfy

    A = QR,    (1)
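The definition can be sanity-checked with a general-purpose library routine (NumPy here, purely for illustration; this is not the fast Givens algorithm the paper develops):

```python
import numpy as np

# Hypothetical full-rank example with m = 3 >= n = 2.
A = np.array([[4.0, 1.0],
              [2.0, 3.0],
              [1.0, 2.0]])
Q, R = np.linalg.qr(A, mode="complete")  # Q: 3x3 orthogonal, R: 3x2

assert np.allclose(A, Q @ R)             # A = QR
assert np.allclose(Q.T @ Q, np.eye(3))   # Q is orthogonal
assert np.allclose(np.tril(R, -1), 0.0)  # R_{i,j} = 0 for all i > j
```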

This work was supported in part by a grant of time from the DoD High Performance Computing Modernization Program using the U.S. Army Research Laboratory's SGI POWER Challenge XL system and the Maui High Performance Computing Center's IBM SP2 system. Access to these DoD HPC resources was made possible through sponsorship from the U.S. Army Research Laboratory DoD Major Shared Resource Center, Aberdeen Proving Ground, MD. This work was also supported by the loan of a Pentium Pro system from Intel.

gmeyer@ece.jhu.edu

with individual elements R_{i,j} of R equal to zero for all i > j. We base our development on the fast Givens family of algorithms because this family requires the smallest number of floating point operations when solving the full rank least squares problem [1] and there exists a parallel fast Givens algorithm with low communication requirements [2]. Furthermore, the three parameters that we propose may be embedded into the parallel version.

This paper presents a three-parameter fast Givens algorithm that efficiently solves the QR decomposition problem on machines with cache memories. On these machines, memory operations have become just as critical to the execution time as floating point operations. The algorithm designer must therefore attempt to minimize both memory and floating point operations.

An established approach for designing matrix algorithms with fewer memory references is to develop factorizations that exploit block matrix multiply primitives commonly implemented as the Level 3 Basic Linear Algebra Subprograms (BLAS) [3], [4]. Both the LAPACK [5] and ScaLAPACK [6] library routines use this approach. These libraries provide routines for computing the Householder QR based on Schreiber and Van Loan's [7] storage-efficient WY representation of Householder transformations [8]. The main drawback of introducing block operations, however, is that memory overhead is reduced at the cost of additional floating point operations. Thus the optimum block size is a trade-off between the number of required floating point operations and the efficiency of the memory access. This trade-off has been studied by Gallivan et al. in the context of LU decomposition [9], [10].

We use an alternative approach that does not increase, or even alter, the floating point operations. Through design parameters σ, λ, and ω we control the zeroing order and the update order. The resulting parameterized algorithm is numerically identical to the standard algorithm; only the memory overhead is modified. We then apply a tractable superscalar model to study the memory overhead as a function of the design parameters, and use the results of this study to develop rules for estimating the optimal combination of parameter values.

0190-3918/96 $5.00 © 1996 IEEE

    Authorized licensed use limited to: Jet Propulsion Laboratory. Downloaded on September 17, 2009 at 17:55 from IEEE Xplore. Restrictions apply.


From the user's point of view, the end result is a faster, numerically identical QR algorithm. We base our algorithm development upon the architecture shown in Figure 2.

THE STANDARD FAST GIVENS QR ALGORITHM

The fast Givens algorithm computes scaled versions of the matrices Q and R that satisfy Eq. (1). If matrices A ∈ R^{m×n} and B ∈ R^{m×b} are input to the algorithm as an augmented matrix A = [A, B] ∈ R^{m×(n+b)}, then the algorithm computes the non-zero elements of a positive diagonal matrix D ∈ R^{m×m} while transforming A into D^{1/2} Q^T A = D^{1/2} [R, Q^T B]. In many applications Q is not needed and the scale factor D^{1/2} cancels in subsequent operations.

Figure 2: The superscalar machine model that is the basis for our algorithm development.

(A, d) = Fast-Givens-QR(A):
Matrices: A ∈ R^{m×(n+b)}, d ∈ R^m
1: d = 1
2: For j = 1 to n
3:   For i = m to j + 1 by -1
4:     (A, d, t, α1, β1) = rotate-1(A, d, i, j)
5:     (A) = update-1-1-type-t(A, i, j, α1, β1)
6:   End For
7: End For

Figure 1: The standard Fast Givens QR algorithm.

The standard fast Givens algorithm, taken from Golub [1], is reproduced in Figure 1 using a stylized subset of the Matlab programming language. In line 1 the algorithm initializes each element of the vector d to 1. Lines 2-7 update d while transforming A into D^{1/2} Q^T A. Zeros are introduced into the A matrix from top to bottom and left to right. The zero in row i and column j is obtained by combining rows i and i - 1 through either a type 0 or a type 1 fast Givens rotation. The subroutines rotate-1 and update-1-1-type-t (t ∈ {0, 1}) accomplish the needed transformation. Note that these subroutines are special cases of the rotate-λ and update-λω-type-t subroutines provided in the appendix.
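For concreteness, the standard algorithm of Figure 1 can be sketched in NumPy. This is a minimal illustration, not the paper's code: the rotate-1 and update-1-1-type-t steps are inlined, and the rotation formulas follow the fast Givens construction in [1].

```python
import numpy as np

def fast_givens_qr(A):
    """Standard fast Givens QR (Figure 1), sketched.

    Returns T = M A and the scale vector d, where M M^T = diag(d);
    R = diag(d)**(-1/2) @ T is then upper triangular.
    """
    A = np.asarray(A, dtype=float).copy()
    m, n = A.shape
    d = np.ones(m)                         # line 1: d = 1
    for j in range(n):                     # line 2: For j = 1 to n
        for i in range(m - 1, j, -1):      # line 3: For i = m to j+1 by -1
            x1, x2 = A[i - 1, j], A[i, j]
            if x2 == 0.0:
                continue                   # already zero; nothing to rotate
            # "rotate-1": choose a type-1 or type-0 fast Givens rotation
            a1 = -x1 / x2
            b1 = -a1 * d[i] / d[i - 1]
            gamma = -a1 * b1
            if gamma <= 1.0:               # one rotation type
                d[i - 1], d[i] = (1.0 + gamma) * d[i], (1.0 + gamma) * d[i - 1]
                top = b1 * A[i - 1, j:] + A[i, j:]
                bot = A[i - 1, j:] + a1 * A[i, j:]
            else:                          # the other rotation type
                a0, b0 = 1.0 / b1, 1.0 / a1
                gamma = 1.0 / gamma
                d[i - 1] *= 1.0 + gamma
                d[i] *= 1.0 + gamma
                top = A[i - 1, j:] + a0 * A[i, j:]
                bot = b0 * A[i - 1, j:] + A[i, j:]
            # "update": rows i-1 and i, columns j .. n+b
            A[i - 1, j:], A[i, j:] = top, bot
    return A, d

A = np.array([[4.0, 1.0],
              [2.0, 3.0],
              [1.0, 2.0]])
T, d = fast_givens_qr(A)
R = T / np.sqrt(d)[:, None]   # R = D^(-1/2) T is upper triangular
```

Because the rotations only combine adjacent rows, R^T R equals A^T A up to rounding, which gives a convenient correctness check without fixing signs.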

SUPERSCALAR ARCHITECTURE

Whether we solve the QR decomposition on a desktop workstation or on a massively parallel supercomputer, we most likely are using superscalar processors. A superscalar processor is a processor that is capable of concurrent execution of scalar instructions [11]. Ideally, the processor achieves perfect concurrence between memory and floating point instructions while issuing instructions at a high clock rate. The processor supports this clock rate through the use of several registers, large caches, and arithmetic pipelines.

The Processor consists of a Program Control and Integer Unit (PCIU), a Floating Point (FP) Unit, and a collection of ρ Registers. These components are connected to a Memory consisting of the Program and Integer Memory, a Cache of θ >> ρ bytes, and a FP Memory of more than θ bytes. The PCIU controls the program flow. Whenever possible, it issues instructions to the FP Unit and FP Registers to achieve concurrence between memory and arithmetic operations. When there is perfect concurrence, there is a possibility that the memory operations consume 100% of the execution time. For this reason, we direct our efforts towards minimizing the data transferred over the Registers-to-Cache and the Cache-to-FP-Memory interconnections, highlighted in Figure 2. We include the Program and Integer Memory for completeness, but only floating point data and calculations are considered in this analysis.

Every variable in the algorithm corresponds to a unique word of data. All data is initially stored in the FP Memory, but the data must be read into the cache and the registers before calculations may proceed within the FP Unit. All data in the registers is immediately available for use by the FP Unit. Any register may update its contents by loading any variable from the cache. Similarly, any register may store its contents to any variable in the cache. Thus, the contents of the cache are modified when a register writes to the cache, and when the cache reads from the FP Memory.

A variable can only be read by the processor, that is, a variable can only be loaded into a register, if the variable is present in the cache. Since the cache is initially empty, the first data item that the processor attempts to read results in a cache miss. As a result, the processor waits while the missing variable is read from the FP Memory into the cache. Variables in the cache are ordered by the time that they were most recently part of a computation. When the cache is full and a new variable is needed by the processor, the least recently used variable is expelled from the cache so that the desired variable may be read in its place.

We wish to emphasize an important operational difference between the collection of registers and the cache. With registers we assume the freedom to load and store whichever variables we choose, whereas the cache is confined to removing old data according to the least recently used (LRU) policy.
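This cache model is easy to simulate. The sketch below (a hypothetical LRUCache helper, not part of the paper) counts misses when two columns of a 10-row matrix are zeroed bottom to top with a cache that holds only 4 rows; every row must be re-fetched for the second column:

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of the paper's fully associative LRU cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def touch(self, var):
        if var in self.lines:
            self.lines.move_to_end(var)         # now most recently used
        else:
            self.misses += 1                    # fetched from FP Memory
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # expel LRU variable
            self.lines[var] = True

# Zeroing a column touches every pair of adjacent rows; with a 4-row
# cache, the low rows are expelled before the next column needs them.
cache = LRUCache(capacity=4)
for col in range(2):
    for row in range(9, 0, -1):   # bottom-to-top, like the standard ordering
        cache.touch(row - 1)
        cache.touch(row)
print(cache.misses)
```

With capacity 4 every one of the 10 rows misses in both columns (20 misses); with capacity 10 the second column would hit entirely (10 misses).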


Finally, we assume that the computation is pipelined. This means that while a new calculation may be started every cycle, the results of the calculation are not available for N_s cycles. A pipeline stall occurs if a computation cannot be initiated until the result of a previous computation becomes available.

A THREE-PARAMETER FAST GIVENS QR ALGORITHM

The standard fast Givens QR algorithm is efficient for hand calculations and for computer systems where serial floating point operations dominate the execution time, but it is not efficient for most superscalar processors.

(A, d) = Three-Parameter-Fast-Givens-QR(A, σ, λ, ω):
Matrices: A ∈ R^{m×(n+b)}, d ∈ R^m
1:  d = 1
2:  For r = m to 2 by -σ
3:    For j = 1 to min(m - r + σ, n)
4:      l = min(r + j - 1, m)
5:      For i = min(r + j - 1, m) to max(r + j - σ, j + 1) + λ - 1 by -λ
6:        (A, d, t, α1, β1, ..., αλ, βλ) = rotate-λ(A, d, i, j)
7:        (A) = update-λω-type-t(A, i, j, α1, β1, ..., αλ, βλ)
8:        l = i - λ        % l preserves i
9:      End For
10:     For i = l to max(r + j - σ, j + 1) by -1
11:       (A, d, t, α1, β1) = rotate-1(A, d, i, j)
12:       (A) = update-1-1-type-t(A, i, j, α1, β1)
13:     End For
14:   End For
15: End For

Figure 3: The three-parameter Fast Givens QR algorithm.

We introduce three parameters to improve performance on superscalar processors: σ alters the temporal locality of the program to minimize cache misses, λ uses additional registers to reduce the intermediate memory operations, and ω orders independent calculations to ensure fine grain parallelism. Figure 3 presents the three-parameter fast Givens algorithm. The standard algorithm can be recovered by setting σ = m - 1 and λ = 1. When λ = 1 all values of ω yield identical algorithms. Figure 11 illustrates this case.

The Cache Parameter σ

The cache parameter σ is targeted to improve performance on problems where the cache is large enough to hold between 4 and m - 1 rows of the A matrix. Smaller problems are already cache efficient because every element is read into the cache exactly once regardless of the algorithm used. Extremely large problems call for a modification of this technique.

There exists considerable freedom in the zeroing order of the fast Givens QR algorithm. We exploit this freedom in order to find a zero ordering that is more cache efficient than the standard ordering. The standard ordering zeroes columns from bottom to top, starting with the leftmost column and working towards the right. Placing a zero in row i requires rows i and i - 1 to be read and modified. Thus, the entire matrix has been read by the time the first column of zeros is complete. Suppose that the cache is not large enough to hold the entire matrix; then the rows near the bottom of the matrix, which were least recently used, will have been expelled from the cache. When the next column of zeros is begun, the rows needed are those which have been expelled.

Figure 4a illustrates the order in which the standard fast Givens algorithm introduces zeros when m = 10, n = 4, and b = 1. Notice, for example, that the tenth zero introduced is in row 10, column 2. The computations corresponding to this zero alter only rows 9 and 10 and do not depend upon rows 1 through 8. Since the computations associated with the zeros labeled 3, 4, 5, 6, 7, 8, and 9 do not depend upon or alter rows 9 and 10, these computations are completely independent of the computations for zero number 10. Therefore, zero number 10 could be introduced before or between zeros number 3, 4, 5, 6, 7, 8, or 9 without altering the numerical properties of the algorithm. Designers of high level parallel algorithms have exploited and generalized this observation [12], [13], [14], [15], [16], [2]. If the following two conditions are satisfied, a zero may be added in row i and column j and the resulting sequence yields an algorithm which is numerically identical to the standard fast Givens algorithm:

Condition 1: A_{l,j} = 0 for l = i + 1, ..., m.
Condition 2: A_{l,k} = 0 for l = i - 1, ..., m and k = 1, ..., j - 1.

These conditions allow for the standard zeroing order and also a diagonal sequence, shown in Figure 4b, that is often the basis of parallel algorithms.
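The two conditions can be checked mechanically. A small sketch (hypothetical helper; 1-based indices, as in the paper):

```python
def may_zero(zeroed, i, j, m):
    """May a zero be introduced at (i, j) without changing the result?

    `zeroed` is the set of (row, col) positions already zeroed.
    Condition 1: column j is zero below row i.
    Condition 2: columns 1..j-1 are zero in rows i-1..m.
    """
    cond1 = all((l, j) in zeroed for l in range(i + 1, m + 1))
    cond2 = all((l, k) in zeroed
                for l in range(i - 1, m + 1)
                for k in range(1, j))
    return cond1 and cond2

# The standard ordering satisfies the conditions at every step.
m, n = 6, 3
zeroed = set()
for j in range(1, n + 1):
    for i in range(m, j, -1):
        assert may_zero(zeroed, i, j, m)
        zeroed.add((i, j))
```

Starting column 2 before column 1 is finished, by contrast, violates Condition 2.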

The three-parameter fast Givens algorithm introduces zeros into the A matrix in diagonal bands of height σ. To be precise, we define the bands Ω_k, k ∈ {1, 2, ..., ⌈(m-1)/σ⌉}, to be the sets of elements:

    Ω_k = { A_{r+j-l, j} : r = m - (k - 1)σ, r + j - l ≤ m, l = 1, ..., min(σ, r - 1), j = 1, ..., n }.    (2)
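The resulting zeroing order can be generated directly from the loop bounds of Figure 3 (with λ = 1); the sketch below is illustrative, not the paper's code:

```python
def banded_zero_order(m, n, sigma):
    """Zero-introduction order for cache parameter sigma.

    1-based indices, as in the paper; one pass of the outer loop per band.
    """
    order = []
    r = m
    while r >= 2:
        for j in range(1, min(m - r + sigma, n) + 1):
            top = min(r + j - 1, m)                  # bottom of the group
            floor = max(r + j - sigma, j + 1)        # top of the group
            for row in range(top, floor - 1, -1):
                order.append((row, j))               # zero A[row, j]
        r -= sigma
    return order

# sigma = m - 1 recovers the standard column-by-column ordering;
# sigma = 1 recovers the diagonal ordering.
m, n = 10, 4
standard = [(i, j) for j in range(1, n + 1) for i in range(m, j, -1)]
assert banded_zero_order(m, n, m - 1) == standard
assert banded_zero_order(m, n, 1)[:3] == [(10, 1), (9, 1), (10, 2)]
```

For any σ, the same set of zeros is introduced; only the order changes, which is exactly why the algorithm stays numerically identical.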

Figure 4c highlights the bands corresponding to the case σ = 3. In words, a band consists of groups of up to σ consecutive elements per column for each of the n columns. The group in column j + 1 is offset from the group in column j by one row in the positive direction. The bands themselves stack vertically atop one another and are ordered from bottom to top. Within each band, we zero from bottom to top and from left to right. That is, after zeroing σ elements in a single column, the algorithm advances the row index by σ (if possible) and the column index by 1. When σ = m - 1, we recover the standard ordering. When σ = 1, we recover the diagonal ordering. Note that the introduction of the parameter σ requires a calling structure involving a triply nested loop. The additional loop variable r (line 2, Figure 3) marks the row number of the first zero introduced into each band, and is identical to the value of r in Eq. (2). Observe that for any value of σ, the three-parameter algorithm satisfies Conditions 1 and 2. Condition 1 is satisfied because we zero the bands from bottom to top and zero from bottom to top within each band. Condition 2 is satisfied because we zero from left to right and we offset each group of σ rows by 1.

Figure 4: The order in which zeros are introduced as a function of σ, when m = 10, n = 4, and b = 1. (a) σ = m - 1; (b) σ = 1; (c) σ = 3.

The Register Parameter λ

Let us now consider the source of intermediate memory operations in the update computations. It makes no difference in our analysis whether a type 0 or a type 1 update is calculated, as the same elements are accessed.

Among other things, introducing a zero in row i and column j modifies the matrix elements A_{i-1,k} for k = j + 1, ..., n + b. We later modify these same elements for the zero in row i - 1. Therefore the storing and reloading of each intermediate value of A_{i-1,k} is overhead. If the zeros in rows i and i - 1 were introduced together, this overhead could be eliminated. If λ zeros were introduced together, the overhead of storing and reloading λ - 1 elements per column could be eliminated.

The actual benefit of increasing λ is limited by the number of registers available. Suppose we completely update each column based on λ zeros before moving on to update the next column. This requires 2(λ - 1) more registers than the standard approach to hold the additional coefficients {α2, β2, ..., αλ, βλ}.

Generalizing the standard algorithm to support the update of λ zeros at a time requires a set of α and β coefficients to correspond to each added zero. Also, the type of rotation to apply for each zero must be recorded. Since each of the λ zeros may be either type 0 or type 1, there are 2^λ possible sequences of rotation types. We keep track of each individual rotation by allowing t to take on values in the set {0, 1, ..., 2^λ - 1}. If t is written using binary notation, the most significant bit specifies the type of the first zero, while the least significant bit specifies the type of the λth zero. This may be implemented efficiently by using 2^λ subroutines, one corresponding to each possible value of t. Figure 12 presents the general form of the subroutines, and Figures 5 and 11 provide specific examples. Note that the γ_{i,j}'s hold intermediate quantities which are intended to be stored in registers. Programming these quantities as scalars aids the compiler in recognizing this benefit.
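The binary encoding of t can be illustrated with a short sketch (hypothetical helper):

```python
def rotation_types(t, lam):
    """Decode t in {0, ..., 2**lam - 1} into lam rotation types.

    The most significant bit gives the type of the first zero,
    the least significant bit the type of the lam-th zero.
    """
    return [(t >> (lam - 1 - k)) & 1 for k in range(lam)]

# With lam = 3 and t = 3 (binary 011), the first zero is type 0
# and the remaining two are type 1, matching Figure 5.
assert rotation_types(3, 3) == [0, 1, 1]
```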

Figure 5: When λ = 3, ω = 1, and t = 3, update-λω-type-t reduces to update-3-1-type-3.

The Pipeline Parameter ω

The pipeline parameter, ω, further reorders the computation to avoid pipeline stalls. This is sometimes necessary to fully realize the improvements identified in the previous section.

To avoid pipeline stalls, we need every set of N_s consecutive computations to be independent. This pipeline restriction may not be satisfied when λ > 1. Observe the subroutine update-3-1-type-3 in Figure 5. The calculations in lines 4 and 5 cannot begin until the calculation in line 2 is complete. Similarly, the calculations in lines 6 and 7 cannot begin until the calculation in line 4 is complete. If N_s is relatively large, several cycles will elapse between the initiation of the calculation in line 2 and the calculations in lines 4 and 5. Similarly, more cycles elapse between the initiation of the calculation in line 4 and the calculations in lines 6 and 7.


The key to avoiding these pipeline stalls is to recognize that the computations for different columns (i.e., different values of k) are still independent of one another. We introduce the parameter ω that defines the number of columns that we completely update for λ zeros before beginning the update on the next ω columns. That is, we apply α1 and β1 to ω columns, and then apply α2 and β2 to the same ω columns, etc. This update pattern is illustrated in Figure 6. ω must be chosen to be large enough not to stall the pipeline, and small enough that the intermediate quantities may be held in registers. Recall from the previous section that we can only reduce the number of stores and reloads if we have enough additional registers. This restriction is now more severe, as we must hold A_{i-1,k+k2}, k2 = 0, ..., ω - 1, in registers to achieve the same benefits.

Figure 6: The parameters λ and ω control the order in which the updates are completed.
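The update order controlled by λ and ω can be sketched as a schedule generator (illustrative only; the real subroutines perform the arithmetic in this order):

```python
def update_schedule(lam, omega, n_cols):
    """Order in which the lam rotations are applied across columns.

    Columns are processed in blocks of omega (0-based here); within a
    block, rotation z is applied to all omega columns before rotation
    z + 1 begins, so consecutive operations touch distinct columns.
    """
    ops = []
    for k in range(0, n_cols, omega):               # next omega columns
        block = range(k, min(k + omega, n_cols))
        for z in range(1, lam + 1):                 # apply (alpha_z, beta_z)
            for col in block:                       # ... to the whole block
                ops.append((z, col))
    return ops

# lam = 2, omega = 2, four columns: each pair of consecutive operations
# within a block is independent, so a 2-cycle pipeline does not stall.
print(update_schedule(2, 2, 4))
```

Within a block, any ω consecutive operations use the same (α, β) pair on distinct columns, which is the independence the pipeline needs.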

The Optimal Parameter Combination

Based on the superscalar model and the intuitive analysis of the preceding sections, the optimal value of σ depends on the problem dimensions and the size of the cache. Also, σ should be a multiple of λ so that the efficient updating routine update-λω-type-t (λ > 1) is called for most updates. In contrast, the optimal values of λ and ω depend only on the machine and not the problem dimensions.

In [17] we model the effect of the parameter σ when λ is equal to 1, and derive an estimate σ̂ of the optimal value of σ over the set {1, 2, ...} (Eq. (3)). This estimate characterizes the machine by θ and selects σ̂ to match θ and the problem size. A physical interpretation of this rule is that the execution time is minimized by choosing the largest value of σ such that σ + 2 rows of n + b - 1 elements fit into the cache. Note that when σ zeros are introduced into one column of the A matrix, σ + 1 rows are read and modified. When the first zero is introduced into the next column, an additional row is processed. The remaining σ - 1 zeros in that column use σ rows that were recently read into the cache. In order to ensure that these

rows are available, the cache must be large enough to store σ + 2 rows.

[Table I: The performance of our processors versus design parameters λ and ω, when m = 3,000, n = 200, b = 0, and σ = σ_λ. Standard algorithm: 9.23 s on the Pentium Pro and 5.17 s on the thin POWER2. Entries marked * were not tested due to compiler limitations.]

We now modify Eq. (3) so that the chosen value of σ will be a multiple of λ. Empirical evidence presented in [17] shows that underestimating the optimal value of σ is generally preferable to overestimating its value. Therefore, we choose a value of σ that is the largest multiple of λ that is less than or equal to σ̂. We then take the maximum of this value with λ to provide a legal value of σ to allow for the case σ̂ < λ, that is:

    σ = max(λ, λ⌊σ̂/λ⌋).    (4)

Equation (4) provides a means for choosing σ once the machine has been characterized by θ and λ has been selected.
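The selection rule described here (the largest multiple of λ not exceeding σ̂, floored at λ) can be sketched as:

```python
import math

def choose_sigma(sigma_hat, lam):
    """Largest multiple of lam that is <= sigma_hat, but never below lam."""
    return max(lam, lam * math.floor(sigma_hat / lam))

# Largest multiple of 4 below 17 is 16; when sigma_hat < lam,
# the rule falls back to lam itself.
assert choose_sigma(17, 4) == 16
assert choose_sigma(2, 4) == 4
```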

Allowing θ to equal the true size of the cache works well when most of the cache is reserved for program data and blocks in the cache are either directly mapped to memory or the machine implements a true LRU block replacement policy. In other situations, especially when a random line replacement policy is used, parts of the cache may be replaced before the entire cache has been filled by the


three-parameter algorithm. These situations suggest that one should choose an effective value of θ that is smaller than the true size. The effective size may be determined by measuring the execution time as a function of σ for a representative problem, and then selecting a value of θ so that the estimate given by Eq. (4) matches the optimal measured value of σ.

Table II: Each machine is characterized by three integers: the effective size of the cache (θ), λ̄, and ω̄, which are determined from Table I.

The optimal values of λ and ω are constrained by the number of available registers to be one of a few small integers. Since the optimal combination of these parameters is designed to be independent of the problem dimensions, we may determine the best match for each machine by measuring the performance of any typical problem for all feasible combinations of λ and ω. Ideally, these parameters could be measured independently of any cache effects by choosing a problem which is small enough to fit in the cache. Unfortunately, a problem that satisfies this constraint may be so small that we cannot accurately measure its execution time. For these reasons, we applied the rule given by Eq. (4) to a problem with dimensions m = 3,000, n = 200, and b = 0 to determine the optimal combination of λ and ω for each target machine. Results are provided in Table I.

In summary, we characterize each machine by the three integers θ, λ̄, and ω̄, provided in Table II. Based upon these integers, we choose λ = λ̄, ω = ω̄, and σ = σ_λ.

RESULTS AND CONCLUSIONS

Figures 7-9 compare the execution time for four algorithms on a 133 MHz Intel Pentium Pro system, a 66 MHz thin POWER2 processor of the IBM SP2, a 66 MHz wide POWER2 processor of the IBM SP2, and a 75 MHz R8000 processor of the SGI POWER Challenge XL. Note that the standard algorithm and diagonal algorithm are special cases of the three-parameter algorithm. When λ = 1, the subroutine update-λω-type-t simplifies and the parameter ω falls out. That is, the algorithm is the same for all values of ω, and so no value of ω is specified for these algorithms.

The three-parameter algorithm is always faster than the standard and diagonal algorithms. This is not surprising, as the three-parameter algorithm optimizes over a design space that includes these algorithms, and these algorithms have been shown to correspond to a near worst case selection of parameter values.

All three problems use b = 0 to allow comparison with the established LAPACK Householder routine DGEQRF (with a block-size of 64) that uses the WY representation of Householder vectors to achieve efficient memory operations. Although the standard fast Givens and the standard Householder algorithms have roughly the same number of arithmetic operations, the memory-efficient Householder routine used by LAPACK trades improved memory performance for an increased number of floating point operations. The three-parameter fast Givens algorithm executes faster because it does not trade numerical operations for improved memory efficiency.


Figure 7: The three-parameter algorithm outperforms the other established algorithms on all four processors, when m = 400, n = 300, and b = 0.

Proper parameter selection yields a speedup over the

standard algorithm of up to 2.4 on the Pentium Pro, 2.0 on the POWER2 thin node, 1.6 on the POWER2 wide node, and 4.2 on the R8000. The greatest benefit is obtained on the R8000, because it achieves the most concurrence between memory and arithmetic operations. Streamlining the memory bottleneck allowed a performance of 70% of the advertised 300 MFLOPS theoretical peak.

We conclude by observing that the proposed algorithm exploits fine grain parallelism, but does nothing to prohibit coarse grain parallelism. Therefore, these parameters may be incorporated into a coarse grain parallel fast Givens algorithm.

REFERENCES

[1] G.H. Golub and C.F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, (1989), 642 pp.



Figure 8: The three-parameter algorithm outperforms the other established algorithms on all four processors, when m = 1,000, n = 1,000, and b = 0.

[2] G.G.L. Meyer and M. Pascale, A family of parallel QR factorization algorithms, in High Performance Computing Symposium 95, (July, 1995), pp. 95-106.

[3] J.J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, (March, 1990), pp. 1-17.

[4] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Software, (March, 1988), pp. 1-17.

[5] J.J. Dongarra and D.W. Walker, Software libraries for linear algebra computations on high performance computers, SIAM Rev., (June, 1995), pp. 151-180.

[6] J. Choi, J.J. Dongarra, and D.W. Walker, The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form, Numer. Algorithms, (October, 1995), pp. 379-399.

[7] R. Schreiber and C. Van Loan, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Statist. Comput., (January, 1989), pp. 53-57.

[8] C. Bischof and C. Van Loan, The WY representation for products of Householder matrices, SIAM J. Sci. Statist. Comput., (January, 1987), pp. s2-s13.

[9] K. Gallivan, W. Jalby, and U. Meier, The use of BLAS3 in linear algebra on a parallel processor with


a hierarchical memory, SIAM J. Sci. Statist. Comput., (November, 1987), pp. 1079-1083.

Figure 9: The three-parameter algorithm outperforms the other established algorithms on all four processors, when m = 2,500, n = 1,250, and b = 0.

[10] K. Gallivan, W. Jalby, U. Meier, and A.H. Sameh, Impact of hierarchical memory systems on linear algebra algorithm design, Internat. J. Supercomputer Appl., (Spring, 1988), pp. 12-48.

[11] M. Johnson, Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, (1991), 288 pp.

[12] A. Sameh and D. Kuck, On stable parallel linear system solvers, J. Assoc. Comput. Mach., (January, 1978), pp. 81-91.

[13] R.E. Lord, J.S. Kowalik, and S.P. Kumar, Solving linear algebraic equations on an MIMD computer, J. Assoc. Comput. Mach., (January, 1983), pp. 103-117.

[14] J.J. Modi and M.R.B. Clarke, An alternative Givens ordering, Numer. Math., 43, (1984), pp. 83-90.

[15] M. Cosnard, J.M. Muller, and Y. Robert, Parallel QR decomposition of a rectangular matrix, Numer. Math., 48, (1986), pp. 239-249.

[16] M. Cosnard and E.M. Daoudi, Optimal algorithms for parallel Givens factorizations on a coarse-grained PRAM, J. Assoc. Comput. Mach., (March, 1994), pp. 399-421.

[17] J.J. Carrig Jr. and G.G.L. Meyer, A banded fast Givens QR algorithm for efficient cache utilization, Technical Report 96-04, Electrical and Computer Engineering, Johns Hopkins University, (March, 1996), 31 pp.


APPENDIX

(A, d, t, α1, β1, ..., αλ, βλ) = rotate-λ(A, d, i, j):
Matrices: A ∈ R^{m×(n+b)}, d ∈ R^m
1: t = 0
2: For i2 = 0 to λ - 1
3-14: ...
15: End For

Figure 10: The rotate-λ subroutine, where λ is an arbitrary positive integer.

Figure 11: When λ = 1, the update-λω-type-t subroutine, given in Figure 12, reduces to the update-1-1-type-t subroutine. The lines labeled (0) are used to form the update-1-1-type-0 subroutine; the lines labeled (1) are used to form the update-1-1-type-1 subroutine.

(A) = update-λω-type-t(A, i, j, α1, β1, ..., αλ, βλ):
Matrices: A ∈ R^{m×(n+b)}
% This routine shows the general structure of an update of
% type t ∈ {0, ..., 2^λ - 1}
1:  l = j + 1                            % l preserves the final k for line 20
2:  For k = j + 1 to n + b - ω + 1 by ω
    % Do first update
3:    For k2 = 0 to ω - 1
4(0):   γ_{1,k2} = β1 A_{i-1,k+k2} + A_{i,k+k2}
5(0):   A_{i,k+k2} = A_{i-1,k+k2} + α1 A_{i,k+k2}
4(1):   γ_{1,k2} = A_{i-1,k+k2} + β1 A_{i,k+k2}
5(1):   A_{i,k+k2} = α1 A_{i-1,k+k2} + A_{i,k+k2}
6:    End For
    % Do each intermediate update
7:    For i2 = 2 to λ - 1
8:      For k2 = 0 to ω - 1
9(0):     γ_{i2,k2} = β_{i2} A_{i-i2,k+k2} + γ_{i2-1,k2}
10(0):    A_{i-i2+1,k+k2} = A_{i-i2,k+k2} + α_{i2} γ_{i2-1,k2}
9(1):     γ_{i2,k2} = A_{i-i2,k+k2} + β_{i2} γ_{i2-1,k2}
10(1):    A_{i-i2+1,k+k2} = α_{i2} A_{i-i2,k+k2} + γ_{i2-1,k2}
11:     End For
12:   End For
    % Do λth update
13:   For k2 = 0 to ω - 1
14(0):  A_{i-λ+1,k+k2} = A_{i-λ,k+k2} + α_λ γ_{λ-1,k2}
15(0):  A_{i-λ,k+k2} = β_λ A_{i-λ,k+k2} + γ_{λ-1,k2}
14(1):  A_{i-λ+1,k+k2} = α_λ A_{i-λ,k+k2} + γ_{λ-1,k2}
15(1):  A_{i-λ,k+k2} = A_{i-λ,k+k2} + β_λ γ_{λ-1,k2}
16:   End For
17:   l = k + ω
18: End For
19-20: ... (the same updates applied to the remaining columns k = l, ..., n + b)

Figure 12: The update-λω-type-t subroutine, where λ and ω are arbitrary integers and t ∈ {0, 1, ..., 2^λ - 1} is used to code λ individual updates. The lines labeled (0) are used when the individual transformation is of type 0, whereas the lines labeled (1) are used when the individual transformation is of type 1. For an efficient implementation, every value of t should be hard-coded to form its own subroutine with all loops of length λ and ω completely unrolled.
