


Tiling, Block Data Layout, and Memory Hierarchy Performance

Neungsoo Park, Member, IEEE, Bo Hong, and Viktor K. Prasanna, Fellow, IEEE

Abstract—Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques (copying, padding, etc.). The total miss cost is reduced considerably. Experiments on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.

Index Terms—Block data layout, tiling, TLB misses, cache misses, memory hierarchy.


1 INTRODUCTION

The increasing gap between memory latency and processor speed is a critical bottleneck in achieving high performance. The gap is typically bridged by a multilevel memory hierarchy that can hide memory latency. This memory hierarchy consists of multilevel caches, which are typically on and off-chip caches. To improve the effective memory hierarchy performance, various hardware solutions have been proposed [3], [7], [9], [10], [20]. Recent processors such as Intel Merced [25] provide increased programmer control over data placement and movement in a cache-based memory hierarchy, in addition to providing some memory streaming hardware support for media applications. To exploit these features, it is important to understand the effectiveness of control and data transformations.

Along with hardware solutions, compiler optimization techniques have received considerable attention [14], [15], [23]. As the memory hierarchy gets deeper, it is critical to efficiently manage the data. To improve data access performance, one of the well-known optimization techniques is tiling. Tiling transforms loop nests so that temporal locality can be better exploited for a given cache size. However, tiling focuses only on the reduction of capacity cache misses by decreasing the working set size. Cache in most state-of-the-art machines is either direct-mapped or small set-associative. Thus, it suffers from considerable conflict misses, thereby degrading the overall performance [6], [12]. To reduce conflict misses, copying [12], [26] and padding [16], [21] techniques with tiling have been proposed. However, most of these approaches target mainly the cache performance, paying less attention to the Translation Look-Aside Buffer (TLB) performance. As problem sizes become larger, TLB performance becomes more significant. If TLB thrashing occurs, the overall performance will be drastically degraded [24]. Hence, both TLB and cache must be considered in optimizing application performance.

Most previous optimizations, including tiling, concentrate on single-level cache [6], [12], [16], [21], [26]. Multilevel memory hierarchy has been considered by a few researchers. For improving multilevel memory hierarchy performance, a new compiler technique is proposed in [28] that transforms loop nests into recursive form. However, only multilevel caches were considered [22], [28], with no emphasis on TLB. It was proposed in [13] that cache and TLB performance be considered in concert to select the tile size. In this analysis, TLB and cache were assumed to be fully set-associative. However, the cache is direct or small set-associative in most of the state-of-the-art platforms.

Some recent work [4], [11], [12], [18], [19], [26] proposed changing the data layout to match the data access pattern, to reduce cache misses. It was proposed in [11] that both data and loop transformation be applied to loop nests for optimizing cache locality. In [4], conventional (row or column-major) layout is changed to a recursive data layout, referred to as Morton layout, which matches the access pattern of recursive algorithms. This data layout was shown to improve the memory hierarchy performance. This was confirmed through experiments; we are not aware of any formal analysis.


• N. Park is with Samsung Electronics Co., Seoul, Korea. E-mail: [email protected].
• B. Hong and V.K. Prasanna are with the Department of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089-2562. E-mail: {bohong, prasanna}@halcyon.usc.edu.

Manuscript received 27 Nov. 2001; accepted 26 Nov. 2002. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 115439.



The ATLAS project [27] automatically tunes several linear algebra implementations. It uses block data layout with tiling to exploit temporal and spatial locality. Input data, originally in column-major layout, is remapped into block data layout before the computation begins. The combination of block data layout and tiling has shown high performance on various platforms. However, the selection of the optimal block size is done empirically at compile time by running several tests with different block sizes.

In this paper, we study block data layout as a data transformation to improve memory hierarchy performance. In block data layout, a matrix is partitioned into submatrices called blocks. Data elements within one such block are mapped onto contiguous memory. These blocks are arranged in row-major order. First, we analyze the intrinsic TLB performance of block data layout. We then analyze the TLB and cache performance using tiling and block data layout. Based on the analysis, we propose a block size selection algorithm. Morton data layout is also discussed as a variant of block data layout. The contributions of this paper are as follows:

• We present a lower bound analysis of TLB performance. Further, we show that block data layout intrinsically has better TLB performance than row-major layout (Section 2). As an abstraction of matrix operations, the cost of accessing all rows and all columns is analyzed. Compared with row-major layout, we show that the number of TLB misses is improved by $O(\sqrt{P_v})$, where $P_v$ is the page size.

• We present TLB and cache performance analysis when tiling is used with block data layout (Sections 3.1 and 3.2). In tiled matrix multiplication, block data layout improves the number of TLB misses by a factor of $B$, where $B$ is the block size. Cache performance analysis is also presented. We validate our analysis through simulations using SimpleScalar [2].

• On the basis of our cache and TLB analysis, we propose a block size selection algorithm (Section 3.3). The best block sizes found by ATLAS fall in the range given by our algorithm.

• We validate our analysis through simulations and measurements using matrix multiply, LU decomposition, and Cholesky factorization (Section 4).

• We compare the performance of block data layout and Morton data layout. Block size selection for Morton data layout is limited. This limitation causes the performance of Morton data layout to be worse than that of block data layout. Experimental results on UltraSparc II and Pentium III show that matrix multiplication and LU decomposition executions using block data layout were up to 15.8 percent faster than those obtained using Morton data layout.

The rest of this paper is organized as follows: Section 2 describes block data layout and gives an analysis of its TLB performance. Section 3 discusses the TLB and cache performance when tiling and block data layout are used in concert. A block size selection algorithm is described based on this analysis. Section 4 shows simulation and experimental results. Concluding remarks are presented in Section 5.

2 BLOCK DATA LAYOUT AND TLB PERFORMANCE

In this paper, we assume the architecture parameters to be fixed (e.g., cache size, cache line size, page size, TLB entry capacity, etc.). The following notations are used in this paper. $S_{tlb}$ denotes the number of TLB entries. $P_v$ denotes the virtual page size. It is assumed that the TLB is fully set-associative with a Least-Recently-Used (LRU) replacement policy. The block size is $B \times B$, where it is assumed that $B^2 = kP_v$. $S_{c_i}$ is the size of the $i$th level cache. Its line size is denoted as $L_{c_i}$. Cache is assumed to be direct-mapped.

In Section 2, we analyze the TLB performance of blockdata layout. We show that block data layout has betterintrinsic TLB performance than conventional data layouts.

2.1 Block Data Layout

To support multidimensional array representations, most programming languages provide a mapping function, which converts an array index to a linear memory address. In current programming languages, the default data layout is row-major or column-major, denoted as canonical layouts [4]. Both row-major and column-major layouts have similar drawbacks. For example, consider a large matrix stored in row-major layout. Due to large stride, column accesses can cause cache conflicts. Further, if every row in a matrix is larger than the size of a page, column accesses can cause TLB thrashing, resulting in drastic performance degradation.

In block data layout, a large matrix is partitioned into submatrices. Each submatrix is a $B \times B$ matrix and all elements in the submatrix are mapped onto contiguous memory locations. The blocks are arranged in row-major order. Another data layout of recent interest is Morton data layout [3]. Morton data layout divides the original matrix into four quadrants and lays out these submatrices contiguously in the memory. Each of these submatrices is further recursively divided and laid out in the same way. At the end of the recursion, the elements of the submatrix are stored contiguously. This is similar to the arrangement of elements of a block in block data layout. Morton data layout can thus be considered as a variant of block data layout. They only differ in the order of blocks. Fig. 1 shows block data layout and Morton data layout with block size 2 × 2. Due to the similarity, the following TLB analysis holds true for Morton data layout also.
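To make these mappings concrete, the following C sketch (our illustration, not code from the paper) computes the memory offset of element $(i, j)$ under row-major layout, block data layout, and a Z-order variant of Morton layout; the function names, the power-of-two restriction on the number of blocks for the Morton case, and the particular quadrant ordering are our assumptions.

    #include <stddef.h>

    /* Row-major (canonical) layout: element (i, j) of an N x N matrix. */
    size_t rm_offset(size_t i, size_t j, size_t N) {
        return i * N + j;
    }

    /* Block data layout with B x B blocks: blocks are stored in row-major
     * order and the elements inside each block are contiguous (also
     * row-major). Assumes N is a multiple of B. */
    size_t bdl_offset(size_t i, size_t j, size_t N, size_t B) {
        size_t bi = i / B, bj = j / B;          /* block coordinates     */
        size_t oi = i % B, oj = j % B;          /* offsets within block  */
        return (bi * (N / B) + bj) * B * B      /* start of the block    */
             + oi * B + oj;                     /* position inside block */
    }

    /* Morton data layout: same B x B blocks, but the block coordinates are
     * bit-interleaved (Z-order), which is what the recursive quadrant
     * subdivision produces when N/B is a power of two. */
    size_t morton_offset(size_t i, size_t j, size_t N, size_t B) {
        size_t bi = i / B, bj = j / B, z = 0;
        (void)N;                                /* N only constrains bi, bj   */
        for (unsigned b = 0; b < 16; b++) {     /* up to 2^16 blocks per side */
            z |= (size_t)((bj >> b) & 1u) << (2 * b);
            z |= (size_t)((bi >> b) & 1u) << (2 * b + 1);
        }
        return z * B * B + (i % B) * B + (j % B);
    }

The two nonlinear layouts differ only in how a block index is computed from the block coordinates; the layout of elements inside a block is identical.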

2.2 TLB Performance of Block Data Layout

In this section, we present a lower bound on the TLB misses for any data layout. We discuss the intrinsic TLB performance of block data layout using a generic access pattern. We give an analysis of the TLB performance of block data layout and show improved performance compared with conventional layouts. Throughout this section, we consider an $N \times N$ array.

2.2.1 A Lower Bound on TLB Misses

In general, most matrix operations consist of row and column accesses, or permutations of row and column accesses. In this section, we consider an access pattern where an array is accessed first along all rows and then along all columns. The lower bound analysis of TLB misses incurred in accessing the data array along all the rows and all the columns is as follows.

Theorem 1. For accessing an array along all the rows and then along all the columns, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$.

Proof. Consider an arbitrary mapping of array elements to pages. Let $A_k = \{i \mid \text{at least one element of row } i \text{ is in page } k\}$. Similarly, let $B_k = \{j \mid \text{at least one element of column } j \text{ is in page } k\}$. Let $a_k = |A_k|$ and $b_k = |B_k|$. Note that $a_k \cdot b_k \ge P_v$. Using the mathematical identity that the arithmetic mean is greater than or equal to the geometric mean ($a_k + b_k \ge 2\sqrt{P_v}$), we have:

$$\sum_{k=1}^{N^2/P_v} (a_k + b_k) \ge 2\frac{N^2}{P_v}\sqrt{P_v}.$$

Let $x_i$ ($y_j$) denote the number of pages over which the elements in row $i$ (column $j$) are scattered. The number of TLB misses in accessing all rows consecutively and then all columns consecutively is given by

$$T_{miss} \ge \sum_{i=1}^{N}\bigl(x_i - O(S_{tlb})\bigr) + \sum_{j=1}^{N}\bigl(y_j - O(S_{tlb})\bigr).$$

$O(S_{tlb})$ is the number of page entries required for accessing row $i$ (column $j$) that are already present in the TLB. Page $k$ is accessed $a_k$ times by row accesses; thus, $\sum_{i=1}^{N} x_i = \sum_{k=1}^{N^2/P_v} a_k$. Similarly, $\sum_{j=1}^{N} y_j = \sum_{k=1}^{N^2/P_v} b_k$. Therefore, the total number of TLB misses is given by

$$T_{miss} \ge \sum_{k=1}^{N^2/P_v} (a_k + b_k) - 2N \cdot O(S_{tlb}) \ge \frac{2N^2}{\sqrt{P_v}} - 2N \cdot O(S_{tlb}).$$

As the problem size ($N$) increases, the number of pages accessed along one row (column) becomes larger than the size of the TLB ($S_{tlb}$). Thus, the number of TLB entries that are reused between two consecutive row (column) accesses is reduced. Therefore, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$. □

We obtained a lower bound on TLB misses for any layout when data are accessed along all rows and then along all columns. This lower bound on TLB misses also holds when data is accessed along an arbitrary permutation of all rows and columns.

Corollary 1. For accessing an array along an arbitrary permutation of row and column accesses, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$.

2.2.2 TLB Performance

In this section, we consider the same access pattern as discussed in Section 2.2.1. Consider a given $N \times N$ array stored in a canonical layout. Without loss of generality, the canonical layout is assumed to be row-major layout. During the first pass (row accesses), the memory pages are accessed consecutively. Therefore, the number of TLB misses caused by row accesses is equal to $\frac{N^2}{P_v}$. During the second pass (column accesses), the elements along a column are assigned to $N$ different pages. Hence, a column access causes $N$ TLB misses. Since $N \ge S_{tlb}$, all $N$ column accesses result in $N^2$ TLB misses. The total number of TLB misses caused by all row accesses and all column accesses is thus $\frac{N^2}{P_v} + N^2$. Therefore, in canonical layout, TLB misses drastically increase due to column accesses.

Compared with canonical layout, block data layout has better TLB performance. The following theorem shows that block data layout minimizes the number of TLB misses.

Theorem 2. For accessing an array along all the rows and then along all the columns, block data layout with block size $\sqrt{P_v} \times \sqrt{P_v}$ minimizes the number of TLB misses.

To prove this theorem, we consider two cases: $B^2/P_v \le 1$ and $B^2/P_v \ge 1$. For each case, we estimate the number of TLB misses by comparing the TLB size, the number of page entries in a row, and the number of pages in a block. The optimal block size is then derived from these estimations. A detailed proof of Theorem 2 is presented in [17]. In general, the number of TLB misses for a $B \times B$ block data layout is $k\frac{N^2}{B} + \frac{N^2}{B}$. It is reduced by a factor of $\frac{(P_v+1)B}{P_v(k+1)}$ $\left(\approx \frac{B}{k+1}\right)$ when compared with canonical layout. When $B = \sqrt{P_v}$ ($k = 1$), this number approaches the lower bound shown in Theorem 1.

Theorem 2 holds true even when data in block data layout is accessed along an arbitrary permutation of all rows and columns.


Fig. 1. Various data layouts: block size 2 × 2 for (b) and (c). (a) Row-major layout, (b) block data layout, and (c) Morton data layout.


Corollary 2. For accessing an array along an arbitrary permutation of rows and columns, block data layout with block size $\sqrt{P_v} \times \sqrt{P_v}$ minimizes the number of TLB misses.

Similar to Theorem 2 and Corollary 2, the number of TLB misses is minimized when blocks are stored in Morton data layout and elements are accessed along rows and columns.

Corollary 3. For accessing an $N \times N$ array along all the rows and then along all the columns (or along an arbitrary permutation of rows and columns), Morton data layout with block size $\sqrt{P_v} \times \sqrt{P_v}$ minimizes the number of TLB misses.

To verify our analysis, simulations were performed using the SimpleScalar simulator [2]. It is assumed that the page size is 8 KByte and the data TLB is fully set-associative with 64 entries (similar to the data TLB in UltraSparc II). Double-precision data points are assumed. The block size is set to 32. Table 1 compares TLB misses using block data layout with those using canonical layout. Table 1a shows the TLB misses for the "first all rows and then all columns" access. For small problem sizes, TLB misses with block data layout are considerably less than those with canonical layout. This is due to the fact that TLB entries used in a column (row) access are almost fully reused in the next column (row) access. For a problem size of 1,024 × 1,024, a 504.37-times improvement in the number of TLB misses is obtained with block data layout. This number is much less than the lower bound obtained from Theorem 1. This is because the TLB entries are reused for this problem size. For larger problem sizes, the TLB entries cannot be reused. The total number of TLB misses approaches the lower bound. For these large problem sizes, TLB misses with block data layout are up to 16 times less compared with canonical layout.

To verify Corollaries 1 and 2, two sets of access patterns were simulated: an arbitrary permutation of all rows and columns, and an arbitrary permutation of all rows followed by an arbitrary permutation of all columns. With these access patterns, TLB entries referenced during one row (column) access are not reused when accessing the next row (column). The number of TLB misses with block data layout approaches the lower bound on TLB misses. The results are shown in Tables 1b and 1c. Morton data layout shows a performance similar to block data layout.

Even though block data layout has better TLB performance compared with canonical layouts for a generic access pattern, it alone does not reduce cache misses. The data access pattern of tiling matches well with block data layout. In the following section, we discuss the performance improvement of TLB and caches when block data layout is used in conjunction with tiling.

3 TILING AND BLOCK DATA LAYOUT

Tiling is a well-known optimization technique that improves cache performance. Tiling transforms the loop nest so that temporal locality can be better exploited for a given cache size. Consider an $N \times N$ matrix multiplication represented as $Z = XY$. The working set size for the usual 3-loop computation is $N^2 + 2N$. For large problems, the working set size is larger than the cache size, resulting in severe cache thrashing. To reduce cache capacity misses, tiling transforms the matrix multiplication into a 5-loop tiled matrix multiplication (TMM), as shown in Fig. 2a. The working set size for this tiled computation is $B^2 + 2B$. To efficiently utilize block data layout, we consider a 6-loop TMM, as shown in Fig. 2b, instead of a 5-loop TMM.
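For reference, the following C sketch gives the shape of the two loop nests. It is our reconstruction of Fig. 2 (the loop orders follow the $(kk, jj)$ / $(ii, kk, jj)$ outer-loop and $(i, k, j)$ inner-loop structure described in Section 3.1; the function names and the assumption that $N$ is a multiple of $B$ are ours).

    #include <stddef.h>

    /* (a) 5-loop tiled matrix multiplication, Z = X * Y, on row-major
     * (canonical) layout with tile size B. */
    void tmm5(const double *x, const double *y, double *z, int N, int B) {
        for (int kk = 0; kk < N; kk += B)
          for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
              for (int k = kk; k < kk + B; k++)
                for (int j = jj; j < jj + B; j++)
                  z[(size_t)i * N + j] += x[(size_t)i * N + k] * y[(size_t)k * N + j];
    }

    /* (b) 6-loop tiled matrix multiplication on block data layout: each
     * B x B block occupies a contiguous B*B run of memory, and the blocks
     * are arranged in row-major order. */
    void tmm6(const double *x, const double *y, double *z, int N, int B) {
        int nb = N / B;                               /* blocks per dimension */
        for (int ii = 0; ii < nb; ii++)
          for (int kk = 0; kk < nb; kk++)
            for (int jj = 0; jj < nb; jj++) {
              const double *xb = x + ((size_t)ii * nb + kk) * B * B;
              const double *yb = y + ((size_t)kk * nb + jj) * B * B;
              double       *zb = z + ((size_t)ii * nb + jj) * B * B;
              for (int i = 0; i < B; i++)             /* one block-block product */
                for (int k = 0; k < B; k++)
                  for (int j = 0; j < B; j++)
                    zb[i * B + j] += xb[i * B + k] * yb[k * B + j];
            }
    }

In the 6-loop version, the three innermost loops touch only three $B \times B$ blocks, and block data layout additionally makes each of those blocks contiguous in memory.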

3.1 TLB Performance

In this section, we show the TLB performance improvement of block data layout with tiling. To illustrate the effect of block data layout on tiling, we consider a generic access pattern abstracted from tiled matrix operations. The access pattern is shown in Fig. 3, where the tile size is equal to $B$.

TABLE 1. Comparison of TLB Misses. (a) Along all rows and then all columns, (b) arbitrary permutation of row and column accesses, and (c) arbitrary permutation of all rows followed by arbitrary permutation of all columns accesses.

With canonical layout, TLB misses will not occur when accessing consecutive tiles in the same row, if $B \le S_{tlb}$. Hence, the tiled accesses along the rows generate $\frac{N^2}{P_v}$ TLB misses. This is the minimum number of TLB misses incurred in accessing all the elements in a matrix. However, tiled accesses along columns cause considerable TLB misses. $B$ page table entries are necessary for accessing each tile. For all tiled column accesses, the total number of TLB misses is $T_{col} = B \times \frac{N}{B} \times \frac{N}{B} = \frac{N^2}{B}$. It is reduced by a factor of $B$ compared with the number of TLB misses for all column accesses without tiling (see Section 2.2).

The total number of TLB misses is further reduced when block data layout is used in concert with tiling, as shown in Theorem 3. Throughout this paper, the block size of block data layout is assumed to be the same as the tile size, so that the tiled access pattern matches block data layout. In block data layout, a two-dimensional block is mapped onto one-dimensional contiguous memory locations. A block extends over several pages, as shown in Fig. 4 for an example of block size $B^2 = 1.7P_v$. To analyze TLB misses for column accesses using block data layout, the average number of pages in a block is required.

Lemma 1. Consider an array stored in block data layout with block size $B \times B$, where $B^2 = kP_v$. The average number of pages per block is given by $k + 1$.

Proof. For block size $kP_v$, assume that $k = n + f$, where $n$ is a nonnegative integer and $0 \le f < 1$. The probability that a block extends over $n + 1$ contiguous pages is $1 - f$. The probability that a block extends over $n + 2$ contiguous pages is $f$. Therefore, the average number of pages per block in block data layout is given by $(1 - f)(n + 1) + f(n + 2) = k + 1$. □

Theorem 3. Assume that an $N \times N$ array is stored using block data layout. For tiled row and column accesses, the total number of TLB misses is $\left(2 + \frac{1}{k}\right)\frac{N^2}{P_v}$.

Proof. Blocks in block data layout are arranged in row-major order. So, a page overlaps between two consecutive blocks that are in the same row. The page is continuously accessed. The number of TLB misses caused by all tiled row accesses is thus $\frac{N^2}{P_v}$, which is the minimum number of TLB misses. However, no page overlaps between two consecutive blocks in the same column. Therefore, each block along the same column goes through $(k + 1)$ different pages according to Lemma 1. The number of TLB misses caused by all tiled column accesses is thus $T_{col} = (k + 1) \times \frac{N}{B} \times \frac{N}{B} = (k + 1)\frac{N^2}{kP_v}$. Therefore, the total number of TLB misses caused by all row and all column accesses is $T_{miss} = \left(2 + \frac{1}{k}\right)\frac{N^2}{P_v}$. □

For tiled access, the number of TLB misses using canonical layout is $\frac{N^2}{P_v} + \frac{N^2}{B}$, where $B = \sqrt{kP_v}$. Using Theorem 3, compared with canonical layout, block data layout reduces the number of TLB misses by a factor of $\frac{\sqrt{kP_v} + \sqrt{k}}{2k + 1} = \frac{B + \sqrt{k}}{2k + 1}$.
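As a numerical illustration (our arithmetic, using the parameters of the simulation that follows: $P_v = 1024$ double-precision elements per 8 KByte page and $B = 32$, so $k = 1$), for $N = 2048$ the tiled row and column accesses incur

$$\text{canonical layout: } \frac{N^2}{P_v} + \frac{N^2}{B} = 4096 + 131{,}072 = 135{,}168, \qquad \text{block data layout: } \left(2 + \frac{1}{k}\right)\frac{N^2}{P_v} = 3 \cdot 4096 = 12{,}288$$

TLB misses, a reduction of about 91 percent, which matches the figure reported for Table 2 below.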


Fig. 2. Tiled matrix multiplication. (a) 5-loop tiled matrix multiplication and (b) 6-loop tiled matrix multiplication.

Fig. 4. Block extending over page boundaries. (a) Over two pages and (b) over three pages.


To verify our analysis, simulations for tiled row and column accesses were performed using the SimpleScalar simulator. The simulation parameters are equal to those in Section 2. A 32 × 32 block size was considered. The block size is the same as the page size. Table 2 shows TLB misses for three different cases. For problem sizes of 2,048 × 2,048 and 4,096 × 4,096, the numbers of TLB misses conform to our analysis in Theorem 3. The number of TLB misses with block data layout is 91 percent less than that with canonical layout. For a problem size of 1,024 × 1,024, the number of TLB misses with block data layout is 2,081, which is very close to the minimum number of TLB misses (2,048). This is a special case in which each block starts on a new page.

A similar analytical result can be derived for real applications. Consider the 5-loop TMM with canonical layout in Fig. 2a. Array $Y$ is accessed in a tiled row pattern. On the other hand, arrays $X$ and $Z$ are accessed in a tiled column pattern. A tile of each array is used in the inner loops $(i, k, j)$. The number of TLB misses for each array is equal to the average number of pages per tile, multiplied by the number of tiles accessed in the outer loops $(kk, jj)$. The average number of pages per tile is $B + \frac{B^2}{P_v}$. Therefore, the total number of TLB misses is given by $2N^3\left(\frac{1}{B^2} + \frac{1}{BP_v}\right) + N^2\left(\frac{1}{B} + \frac{1}{P_v}\right)$.

Consider the 6-loop TMM on block data layout as shown in Fig. 2b. A $B \times B$ tile of each array is accessed in the inner loops $(i, k, j)$ with block layout. The number of TLB misses for each array is equal to the average number of pages per block multiplied by the number of blocks accessed in the outer loops $(ii, kk, jj)$. According to Lemma 1, the average number of pages per block is $\frac{B^2}{P_v} + 1$ $(= k + 1)$. Therefore, the total number of TLB misses ($TM$) is

$$TM = \left(\frac{B^2}{P_v} + 1\right)\left\{2\left(\frac{N}{B}\right)^3 + \left(\frac{N}{B}\right)^2\right\} = 2N^3\left(\frac{1}{BP_v} + \frac{1}{B^3}\right) + N^2\left(\frac{1}{P_v} + \frac{1}{B^2}\right). \tag{1}$$

Compared with the 5-loop TMM with canonical layout, TLB misses decrease by a factor of $O(B)$ using the 6-loop TMM. Note that the 6-loop TMM uses block data layout.
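For instance (our arithmetic from the two expressions above, for the simulated configuration $N = 1024$, $B = 32$, $P_v = 1024$), the 5-loop TMM on canonical layout incurs $2{,}097{,}152 + 65{,}536 + 32{,}768 + 1{,}024 = 2{,}196{,}480$ TLB misses, whereas (1) gives $65{,}536 + 65{,}536 + 1{,}024 + 1{,}024 = 133{,}120$ for the 6-loop TMM on block data layout, a reduction by a factor of about 16.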

To verify our TLB miss estimation, simulations of the 6-loop TMM were performed. The problem size was fixed at 1,024 × 1,024. Simulation parameters were the same as those in Section 2. Fig. 5 compares our estimations (given by (1)) with the simulation results. Fig. 6 shows that block data layout reduced TLB misses considerably compared with tiling alone.

3.2 Cache Performance

For a given cache size, tiling transforms the loop nest so that the temporal locality can be better exploited. This reduces the capacity misses. However, since most of the state-of-the-art architectures have direct-mapped or small set-associative caches, tiling can suffer from considerable conflict misses that degrade the overall performance. Fig. 7a shows cache conflict misses. These conflict misses are determined by cache parameters, such as cache size, cache line size, and set-associativity, and runtime parameters, such as array size and block size. Performance of tiled computations is thus sensitive to these runtime parameters.

If the data layout is reorganized from a canonical layout to a block layout (assuming the tile size is the same as the block size) before tiled computations start, the entire data that is accessed during a tiled computation will be localized in a block. As shown in Fig. 7b, a self-interference miss does not occur if the block is smaller than the cache, since all elements in a block can be stored in contiguous memory locations.

TABLE 2. TLB Misses for All Tiled Row Accesses Followed by All Tiled Column Accesses.

Fig. 5. Comparison of TLB misses from simulation and estimation.

Fig. 6. Comparison of TLB misses using tiling + BDL and tiling only.

In general, cache miss analysis for a direct-mapped cache with canonical layout is complicated because the self-interference misses cannot be quantified easily. Cache performance analysis of tiled algorithms was discussed in [12]. The cache performance of tiling with copying optimization was also presented. We observe that the behavior of cache misses for tiled access patterns on block layout is similar to that of tiling with copying optimization on canonical layout. We have derived the total number of cache misses for the 6-loop TMM (which uses block data layout). A detailed proof can be found in [17]. For the $i$th level cache with line size $L_{c_i}$ and cache size $S_{c_i}$, the total number of cache misses ($CM_i$) is:

$$CM_i \approx \begin{cases} \dfrac{N^3}{L_{c_i}}\left\{\dfrac{1}{B}\left(2 + \dfrac{3L_{c_i} + 2L_{c_i}^2}{S_{c_i}}\right) + \dfrac{1}{N} + \dfrac{4B + 6L_{c_i}}{S_{c_i}}\right\} & \text{for } B < \sqrt{S_{c_i}}, \\[2ex] \dfrac{N^3}{L_{c_i}}\left\{\dfrac{4B}{S_{c_i}} + \dfrac{2}{B} - \dfrac{2S_{c_i}}{B^2} + 2 - \dfrac{1}{N} + \dfrac{6L_{c_i}}{S_{c_i}}\right\} & \text{for } \sqrt{S_{c_i}} \le B < \sqrt{2S_{c_i}}, \\[2ex] \dfrac{N^3}{L_{c_i}}\left\{1 + \dfrac{2}{B} + \left(1 + \dfrac{L_{c_i}}{B}\right)\dfrac{B + 2L_{c_i}}{S_{c_i}}\right\} & \text{for } \sqrt{2S_{c_i}} \le B. \end{cases} \tag{2}$$

To verify the cache miss estimations, we conducted simulations using SimpleScalar for the 6-loop TMM with block data layout. The problem size was fixed at 1,024 × 1,024. A 16 KByte direct-mapped cache was assumed (similar to the L1 data cache in UltraSparc II). Fig. 8 compares our estimated values (given by (2)) with the simulation results.

3.3 Block Size Selection

To test the effect of block size, experiments were performed on several platforms. Fig. 9 shows the execution time of a 6-loop TMM of size 1,024 × 1,024 on UltraSparc II (400 MHz) as a function of block size. It can be observed that block size selection is significant for achieving high performance.

With canonical data layout, the tiling technique is sensitive to problem and tile sizes. Several GCD-based tile size selection algorithms [6], [8], [12] were proposed to optimize tiled computation. However, their performance is still sensitive to the problem size. In [12], TLB and cache performance were considered in concert. This approach showed better performance than algorithms that separately consider cache or TLB. However, all these approaches are based on canonical data layout. On the other hand, ATLAS [27] utilizes block data layout. However, the best block size is determined empirically at compile time by evaluating the actual performance of the code with a wide range of block sizes.

In a multilevel memory hierarchy system, it is difficult to predict the execution time ($T_{exe}$) of a program. But $T_{exe}$ is proportional to the total miss cost of TLB and cache. In order to minimize $T_{exe}$, we evaluate and minimize the total miss cost for both the TLB and $l$ levels of cache. We have:

$$MC = TM \cdot M_{tlb} + \sum_{i=1}^{l} CM_i \cdot H_{i+1}, \tag{3}$$

where $MC$ denotes the total miss cost, $CM_i$ is the number of misses in the $i$th level cache, $TM$ is the number of TLB misses, $H_i$ is the cost of a hit in the $i$th level cache, and $M_{tlb}$ is the penalty of a TLB miss. The $(l+1)$th level cache is the main memory. It is assumed that all data reside in the main memory ($CM_{l+1} = 0$). Using the derivative of $MC$ with respect to the block size, we can find the optimal block size that minimizes the overall miss cost.

For a simple 2-level memory hierarchy that consists of only one level of cache and a TLB, the total miss cost (denoted as $MC_{tc1}$) in (3) reduces to:

$$MC_{tc1} = TM \cdot M_{tlb} + CM \cdot H_2,$$

where $H_2$ is the access cost of main memory. In the above estimation, $TM$ and $CM$ are substituted with (1) and (2), respectively. Using the derivative of $MC$, the optimal block size ($B_{tc1}$) that minimizes the total miss cost caused by L1 cache and TLB misses is given as

$$B_{tc1} \approx \sqrt{\frac{\left(\frac{2L_{c_1}M_{tlb}}{P_v} + \left[2 + \frac{3L_{c_1} + 2L_{c_1}^2}{S_{c_1}}\right]H_2\right)S_{c_1}}{4H_2}}. \tag{4}$$

Fig. 7. Example of conflict misses. (a) Canonical layout and (b) block data layout.

Fig. 8. Comparison of cache misses from simulation and estimation for 6-loop TMM.

Fig. 9. Execution time of TMM of size 1,024 × 1,024.

We now extend this analysis to determine a range for the optimal block size in a multilevel memory hierarchy that consists of a TLB and two levels of cache. The miss cost is classified into two groups: miss cost caused by TLB and L1 cache misses, and miss cost caused by L2 misses. Figs. 10a and 10b show the miss cost estimated through (1) and (2). Fig. 10a shows the separated TLB, L1, and L2 miss costs, using UltraSparc II parameters. Fig. 10b shows the variance of the estimated total miss cost as the ratio between the L1 cache miss penalty ($H_2$) and the L2 cache miss penalty ($H_3$) varies. Using (4), we discuss the total miss cost for three ranges of block size:

Lemma 2. For $B < B_{tc1}$, $MC(B) > MC(B_{tc1})$.

Proof. Using the derivatives of the TLB and cache miss counts ((1) and (2)), it can be easily verified that $\frac{dMC_{tc1}}{dB} < 0$ and $\frac{dCM_2}{dB} < 0$ for $B < B_{tc1}$. This is shown in Fig. 10a. For $B < B_{tc1}$, the TLB, L1, and L2 miss costs increase as the block size decreases, thereby increasing the total miss cost. Therefore, the optimal block size cannot be in the range $B < B_{tc1}$. □

Lemma 3. For $B > \sqrt{S_{c_1}}$, $MC(B) > MC(\sqrt{S_{c_1}})$.

Proof. In the range $B > \sqrt{S_{c_1}}$, the TLB miss cost is optimized by tiling and block data layout. However, the change in TLB miss cost is negligible as the block size increases. Since the block size is larger than the L1 cache size, self-interferences occur in this range. The number of L1 cache misses drastically increases, as shown in Fig. 10a. For $\sqrt{S_{c_1}} \le B < \sqrt{2S_{c_1}}$, the ratio of the derivatives of (2) for L1 and L2 misses is as follows:

$$\left|\frac{H_2\frac{dCM_1}{dB}}{H_3\frac{dCM_2}{dB}}\right| = \frac{H_2}{H_3}\left|\frac{\frac{N^3}{L_{c_1}}\left[\frac{4}{S_{c_1}} + \frac{4S_{c_1}}{B^3} - \frac{2}{B^2}\right]}{\frac{N^3}{L_{c_2}}\left[\frac{4}{S_{c_2}} - \left(2 + \frac{3L_{c_2} + 2L_{c_2}^2}{S_{c_2}}\right)\frac{1}{B^2}\right]}\right|.$$

Let $B^2 = \alpha S_{c_1}$ ($1 \le \alpha < 2$). Note that $L_{c_2} \ll S_{c_2}$.

$$\left|\frac{H_2\frac{dCM_1}{dB}}{H_3\frac{dCM_2}{dB}}\right| \ge \frac{H_2}{H_3} \cdot \frac{L_{c_2}}{L_{c_1}} \cdot \frac{S_{c_2}}{S_{c_2} - 2\alpha S_{c_1}} \cdot \left(2\alpha - 1 + \frac{2}{\sqrt{\alpha}}\sqrt{S_{c_1}}\right) > \frac{H_2}{H_3} \cdot \frac{L_{c_2}}{L_{c_1}} \cdot \frac{S_{c_2}}{S_{c_2} - 4S_{c_1}} \cdot \left(3 + \sqrt{2}\sqrt{S_{c_1}}\right).$$

In a general memory hierarchy system, $\frac{S_{c_2}}{S_{c_2} - 4S_{c_1}} \approx 1$ since $S_{c_1} \ll S_{c_2}$. Also, $\frac{L_{c_2}}{L_{c_1}} \ge 1$ and $\sqrt{2S_{c_1}} > \frac{H_3}{H_2}$. Therefore,

$$\left|\frac{H_2\frac{dCM_1}{dB}}{H_3\frac{dCM_2}{dB}}\right| > 1.$$

Thus, although the number of L2 cache misses decreases ($\frac{dCM_2}{dB} < 0$), the total miss cost increases for $\sqrt{S_{c_1}} \le B < \sqrt{2S_{c_1}}$ because the increase in L1 cache miss cost is larger than the decrease in L2 cache miss cost. For $B \ge \sqrt{2S_{c_1}}$, there is no reuse in the L1 cache. Thus, the L1 cache miss cost saturates. Fig. 10b shows the change of the total miss cost as the ratio $\frac{H_3}{H_2}$ varies. Even though the L2 miss penalty is 40 times that of the L1 miss penalty, $TM(B) > TM(\sqrt{S_{c_1}})$ for $B \ge \sqrt{2S_{c_1}}$, because the L1 self-interference miss cost is dominantly large for $B \ge \sqrt{2S_{c_1}}$. Therefore, the optimal block size cannot be in the range $B > \sqrt{S_{c_1}}$. □

Theorem 4. The optimal block size $B_{opt}$ satisfies $B_{tc1} \le B_{opt} < \sqrt{S_{c_1}}$.

Proof. This follows from Lemmas 2 and 3. Therefore, an optimal block size that minimizes the total miss cost is located in

$$B_{tc1} \le B_{opt} < \sqrt{S_{c_1}}. \tag{5}$$

We select a block size that is a multiple of $L_{c_1}$ (the L1 cache line size) in this range. □

Fig. 10. Miss cost estimation for 6-loop TMM (UltraSparc II parameters). (a) Miss cost of TLB, L1, and L2 cache ($B_{tc1}$ is obtained using (4)). (b) Total miss cost with various L2 miss penalties.

To verify our approach, we conducted simulations using UltraSparc II parameters (Table 3). Fig. 11 shows the simulation results of the 6-loop TMM using block data layout. As discussed, the number of TLB and L2 misses decreased as the block size increases. Also, the minimum number of L1 misses was obtained for $B = 36$ and then drastically increased for $B > 45$. Fig. 12 shows the total miss cost. For UltraSparc II, $B_{tc1} = 32.2$, $\sqrt{S_{c_1}} = 45.3$, and $L_{c_1} = 4$. Theorem 4 suggests that the range for the optimal block size is 36-44. Simulation results show that the optimal block size for this architecture was 44.
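The range test of Theorem 4 is easy to mechanize. The C sketch below is our illustration of that procedure (not the authors' code): it evaluates (4) and then enumerates the multiples of $L_{c_1}$ inside $[B_{tc1}, \sqrt{S_{c_1}})$. The UltraSparc II-like parameter values (8 KByte pages, 16 KByte direct-mapped L1 with 4-element lines, and the penalties $M_{tlb} = 30$ and $H_2 = 6$ cycles quoted in Section 4.1) are taken from the text; with them the program prints $B_{tc1} \approx 32.5$ and the candidates 36, 40, 44, close to the 32.2 quoted above (small differences come from the assumed penalty values).

    #include <math.h>
    #include <stdio.h>

    /* All sizes are in double-precision elements, not bytes. */
    struct mem_params {
        double Pv;    /* elements per page                           */
        double Sc1;   /* L1 cache capacity in elements               */
        double Lc1;   /* L1 line size in elements                    */
        double Mtlb;  /* TLB miss penalty (cycles)                   */
        double H2;    /* L1 miss penalty, i.e., L2 hit cost (cycles) */
    };

    /* B_tc1 from (4): block size minimizing the TLB + L1 miss cost. */
    double btc1(struct mem_params p) {
        double bracket = 2.0 * p.Lc1 * p.Mtlb / p.Pv
                       + (2.0 + (3.0 * p.Lc1 + 2.0 * p.Lc1 * p.Lc1) / p.Sc1) * p.H2;
        return sqrt(bracket * p.Sc1 / (4.0 * p.H2));
    }

    int main(void) {
        /* UltraSparc II-like values: Pv = 1024, Sc1 = 2048, Lc1 = 4,
         * assumed penalties Mtlb = 30 and H2 = 6 cycles. */
        struct mem_params p = { 1024.0, 2048.0, 4.0, 30.0, 6.0 };
        double lo = btc1(p), hi = sqrt(p.Sc1);
        printf("B_tc1 = %.1f, sqrt(Sc1) = %.1f\n", lo, hi);
        /* Candidate block sizes: multiples of Lc1 in [B_tc1, sqrt(Sc1)). */
        for (int b = (int)ceil(lo / p.Lc1) * (int)p.Lc1; b < hi; b += (int)p.Lc1)
            printf("candidate block size: %d\n", b);
        return 0;
    }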

We also tested ATLAS on UltraSparc II. Through a wide search ranging from 16 to 44, ATLAS found 36 and 40 as the optimal block sizes. These block sizes lie in the range given by (5). We further tested the 6-loop TMM with respect to different problem and block sizes. For each problem size, we performed experiments by testing block sizes ranging from 8 to 80. In these tests, we found that the optimal block size for each problem size was in the range given by (5), as shown in Fig. 13. These experiments confirm that our approach proposes a reasonably good range for block size selection.

4 EXPERIMENTAL RESULTS

To verify our analysis, we performed simulations and experiments on the following applications: tiled matrix multiplication (TMM), LU decomposition, and Cholesky factorization (CF). The performance of tiling with block data layout (tiling + BDL) is compared with other optimization techniques: tiling with copying (tiling + copying) and tiling with padding (tiling + padding). For tiling + BDL, the tile size (of the tiling technique) is chosen to be the same as the block size of the block data layout. Input and output are in canonical layout. All the cost in performing data layout transformations (from canonical layout to block data layout and vice versa) is included in the reported results. As stated in [12], we observed that the copying technique cannot be applied efficiently to the LU and CF applications, since the copying overhead offsets the performance improvement. Hence, we do not consider tiling + copying for these applications. In all our simulations and experiments, the data elements are double-precision.

Fig. 11. Simulation results of 6-loop TMM. (a) TLB miss, (b) L1 miss, and (c) L2 miss.

Fig. 12. Total miss cost for 6-loop TMM.

Fig. 13. Optimal block sizes for 6-loop TMM.

TABLE 3. Features of Various Experimental Platforms.

4.1 Simulations

To show the performance improvement of TLB and caches using tiling + BDL, simulations were performed using the SimpleScalar simulator [2]. The problem size was 1,024 × 1,024. Two sets of architecture parameters were used: UltraSparc II and Pentium III. The parameters are listed in Table 3.

Fig. 14 compares the TMM simulations of different techniques, based on UltraSparc II parameters. Tiling + BDL reduces L1 and L2 cache misses as well as TLB misses. Block size 32 leads to increased L1 and L2 cache misses for block data layout because of the cache conflicts between different blocks. Tiling + BDL reduced 91-96 percent of TLB misses, as shown in Fig. 14a. This confirms our analysis presented in Section 3.1. Fig. 15 shows the total miss cost (calculated from (3)) for TMM using block size 40 × 40. L1, L2, and TLB miss penalties were assumed to be 6, 24, and 30 cycles, respectively. This figure shows that tiling + BDL results in the smallest total miss cost and that the TLB miss cost with tiling + BDL is negligible compared with the L1 and L2 miss costs. Though tiling + BDL has more L2 cache misses than tiling + padding, its total miss cost is smaller. Fig. 12 shows the effect of block size on the total miss cost for TMM using tiling + BDL. As discussed in Section 3.3, the best block size (44) is in the range 36-44 suggested by our approach.

Fig. 16 presents simulation results for LU using Intel Pentium III parameters. Similar to TMM, the number of TLB misses for tiling + BDL was almost negligible compared with that for tiling + padding, as shown in Fig. 16a. For both techniques, L1 and L2 cache misses were reduced considerably because of 4-way set-associativity. For tiling + padding, when the block size was larger than the L1 cache size, the padding algorithm in [16] suggested a pad size of 0. There is essentially no padding effect, thereby drastically increasing L1 and L2 cache misses. Fig. 17 shows the block size effect on total miss cost using tiling + padding and tiling + BDL. Tiling + padding reduced L1 and L2 cache miss costs considerably. However, TLB miss costs were still significantly high, affecting the overall performance. As discussed in Section 3.3, the suggested range for the optimal block size is 32-44. Simulations validate that the optimal block size achieving the smallest miss cost lies in the range selected using our approach.

4.2 Execution on Real Platforms

To verify our block size selection and the performance improvements using block data layout, we performed experiments on several platforms as tabulated in Table 3. The gcc compiler was used in these experiments. The compiler optimization flags were set to "-fomit-frame-pointer -O3 -funroll-loops." Execution time was the user processor time measured by the sys-call clock(). All the data reported here is the average of 10 executions. The problem sizes ranged from 1,000 × 1,000 to 1,600 × 1,600.

Fig. 18 shows the comparison of execution time of tiling + BDL with other techniques. Tiling + TSS (tile size selection algorithm [6]), shown in this figure, selects the block size based on a GCD computation. Tiling solves the cache capacity miss problem, but it cannot avoid conflict misses. Conflict misses are strongly related to the problem size and block size. This makes tiling sensitive to problem size. As discussed in Section 3.2, block data layout greatly reduces conflict misses, resulting in smoother performance compared with the other techniques.


Fig. 14. Simulation results for TMM using UltraSparc II parameters. (a) TLB misses, (b) L1 misses, and (c) L2 misses.

Fig. 15. Total miss cost for TMM using UltraSparc II parameters.


The effect of block size on tiling + BDL is shown in Figs. 19, 20, and 21. Various problem sizes were tested and the results on all these problems showed trends similar to those in Figs. 19, 20, and 21. As an illustration, the results for a problem size of 1,024 × 1,024 are shown. As shown in Figs. 19, 20, and 21, the optimal block sizes for Pentium III, UltraSparc II, Sun UltraSparc III, and Alpha 21264 are 40, 44, 76, and 76, respectively. All these numbers are in the range given by our block size selection algorithm. For example, the range for the best block size on Alpha 21264 is 64-78. This confirms that our block size selection algorithm proposes a reasonable range. As discussed earlier, block sizes 32 and 64 should be avoided (for use with block data layout) because the performance degrades due to conflict misses between blocks.

Figs. 22, 23, and 24 show the execution time comparison of tiling + BDL with tiling + copying and tiling + padding. In these figures, the block size for tiling + BDL was given by our algorithm discussed in Section 3.3. The tile size for the copying technique was given by the approach in [12]. The pad size was selected by the algorithm discussed in [16]. The tiling + BDL technique is faster than the other optimization techniques for almost all problem sizes and on all the platforms.

4.3 Block Data Layout and Morton Data Layout

Recently, nonlinear data layouts have been considered to improve memory hierarchy performance. One such layout is the Morton data layout (MDL), as defined in Section 2.1. Similar to block data layout, elements within each block are mapped onto contiguous memory locations. However, Morton data layout uses a different order to map blocks, as shown in Fig. 1. This order matches the access pattern of recursive algorithms. In this section, we compare the performance of recursive algorithms using MDL (recursive + MDL) with iterative tiled algorithms using BDL (iterative + BDL), for matrix multiplication and LU decomposition. We show that the performance of recursive + MDL is comparable with that of iterative + BDL if the block size of MDL lies in the optimal block size range for BDL as given by our algorithm ((5) in Section 3.3). However, if the block size of MDL is outside this range, recursive + MDL is slower than iterative + BDL.

Fig. 16. Simulation results for LU using Pentium III parameters. (a) TLB misses, (b) L1 cache misses, and (c) L2 cache misses.

Fig. 17. Effect of block size on LU decomposition using Pentium III parameters. (a) Tiling + Padding and (b) Tiling + BDL.

Fig. 18. Execution time comparison of various techniques for TMM on Pentium III.

Similar to block data layout, the block size for Morton layout also plays an important role in the performance. However, due to recursion, the choice of block sizes is limited. For an $N \times N$ matrix, if the depth of recursion is $d$, the block size of MDL is given by $B_{MDL} = \frac{N}{2^d}$. Such a block size can lie outside the optimal range given by our approach. Our experimental results show that this degrades the overall performance.

Fig. 19. Effect of block size on TMM. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

Fig. 20. Effect of block size on LU decomposition. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

Fig. 21. Effect of block size on Cholesky factorization. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

Fig. 22. Execution time of TMM. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

Fig. 23. Execution time of LU decomposition. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

Experiments using TMM and LU were performed on UltraSparc II and Pentium III. Table 4 shows the execution time comparison of MM using iterative + BDL with recursive + MDL. For iterative + BDL, we selected the block size according to the algorithm discussed in Section 3.3. For recursive + MDL, we tested various recursion depths (resulting in various basic block sizes) and used the best for comparison. For problem sizes 1,280 × 1,280 and 1,408 × 1,408, the optimal block sizes for recursive + MDL were 40 and 44, respectively, which were in the range given by our algorithm, 36-44. Both layouts showed competitive performance for these cases. For problem size 1,600 × 1,600, recursive + MDL was up to 15.8 percent slower than iterative + BDL. Among the possible choices of 25, 50, and 100, the performance of recursive + MDL was optimized at block size 25, where $25 = \frac{1600}{2^6}$. This is because it is outside the optimal range specified by our algorithm.

Table 5 shows the execution time comparison of tiled LU decomposition using BDL and recursive LU decomposition [28] using MDL. These results confirm our analysis.

5 CONCLUDING REMARKS

This paper studied a critical problem in understanding the performance of algorithms on state-of-the-art machines that employ a multilevel memory hierarchy. We showed that, using block data layout, TLB misses as well as cache misses are reduced considerably. Further, we proposed a tight range for block size using our performance analysis. Our analysis matches closely with simulation-based as well as experimental results.


Fig. 24. Execution time of Cholesky factorization. (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, and (d) Pentium III.

TABLE 4. Comparison of Execution Time of TMM on Various Platforms (All Times Are in Seconds). (a) Pentium III and (b) UltraSparc II.

TABLE 5. Comparison of Execution Time of LU Decomposition on Various Platforms (All Times Are in Seconds). (a) Pentium III and (b) UltraSparc II.


This work is part of the Algorithms for Data IntensiVe Applications on Intelligent and Smart MemORies (ADVISOR) Project at USC [1]. In this project, we focus on developing algorithmic design techniques for mapping applications to architectures. Through this, we understand and create a framework for application developers to exploit features of advanced architectures to achieve high performance.

ACKNOWLEDGMENTS

The authors would like to thank Shriram Bhargava Gundala for careful reading of drafts of this work. They would also like to thank Sriram Vajapeyam and Cauligi S. Raghavendra for their inputs on a preliminary version of this work. This work was supported by the DARPA Data Intensive Systems Program under contract F33615-99-1-1483, monitored by Wright Patterson Air Force Base, and by the US National Science Foundation under grant No. 99000613 and, in part, by an equipment grant from Intel Corporation.

REFERENCES

[1] ADVISOR Project, http://advisor.usc.edu, 2002.
[2] D. Burger and T.M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report 1342, Computer Science Dept., Univ. of Wisconsin-Madison, June 1997.
[3] J.B. Carter, W.C. Hsieh, L.B. Stoller, M.R. Swanson, L. Zhang, E.L. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M.A. Parker, L. Schaelicke, and T. Tateyama, "Impulse: Building a Smarter Memory Controller," Proc. Fifth Int'l Symp. High Performance Computer Architecture, pp. 70-79, Jan. 1999.
[4] S. Chatterjee, V.V. Jain, A.R. Lebeck, S. Mundhra, and M. Thottethodi, "Nonlinear Array Layouts for Hierarchical Memory Systems," Proc. 13th ACM Int'l Conf. Supercomputing, June 1999.
[5] M. Cierniak and W. Li, "Unifying Data and Control Transformations for Distributed Shared-Memory Machines," Proc. ACM SIGPLAN '95 Conf. Programming Language Design and Implementation, pp. 205-217, June 1995.
[6] S. Coleman and K.S. McKinley, "Tile Size Selection Using Cache Organization and Data Layout," Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation, June 1995.
[7] R. Espasa, J. Corbal, and M. Valero, "Command Vector Memory Systems: High Performance at Low Cost," Technical Report UPC-DAC-1998-8, Universitat Politècnica de Catalunya, 1998.
[8] K. Esseghir, "Improving Data Locality for Caches," Master's thesis, Dept. of Computer Science, Rice Univ., Sept. 1993.
[9] A. Gonzalez, C. Aliagas, and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," Proc. Int'l Conf. Supercomputing, pp. 338-347, July 1995.
[10] T.L. Johnson, M.C. Merten, and W.W. Hwu, "Run-Time Spatial Locality Detection and Optimization," Proc. 30th Int'l Symp. Microarchitecture, Dec. 1997.
[11] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, "Improving Locality Using Loop and Data Transformations in an Integrated Framework," Proc. 31st IEEE/ACM Int'l Symp. Microarchitecture, Nov. 1998.
[12] M. Lam, E. Rothberg, and M.E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1991.
[13] N. Mitchell, K. Hogstedt, L. Carter, and J. Ferrante, "Quantifying the Multi-Level Nature of Tiling Interactions," Int'l J. Parallel Programming, 1998.
[14] D. Padua, "The Fortran I Compiler," IEEE Computing in Science and Eng., Jan./Feb. 2000.
[15] D.A. Padua, "Outline of a Roadmap for Compiler Technology," IEEE Computing in Science and Eng., Fall 1996.
[16] P.R. Panda, H. Nakamura, N. Dutt, and A. Nicolau, "Augmenting Loop Tiling with Data Alignment for Improved Cache Performance," IEEE Trans. Computers, vol. 48, no. 2, Feb. 1999.
[17] N. Park, B. Hong, and V.K. Prasanna, "Memory Hierarchy Performance of Tiling and Block Data Layout," Technical Report USC-CENG 02-15, Dept. of Electrical Eng., Univ. of Southern California, Jan. 2003.
[18] N. Park, D. Kang, K. Bondalapati, and V.K. Prasanna, "Dynamic Data Layouts for Cache-Conscious Factorization of DFT," Proc. Int'l Parallel and Distributed Processing Symp. 2000, Apr. 2000.
[19] N. Park and V.K. Prasanna, "Cache Conscious Walsh-Hadamard Transform," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, May 2001.
[20] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A Case for Intelligent DRAM: IRAM," IEEE Micro, Apr. 1997.
[21] G. Rivera and C.-W. Tseng, "Data Transformations for Eliminating Conflict Misses," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1998.
[22] G. Rivera and C.-W. Tseng, "Locality Optimizations for Multi-Level Caches," Proc. IEEE Supercomputing, Nov. 1999.

[23] V. Sarkar and G.R. Gao, “Optimization of Array Accesses byCollective Loop Transformations,” Proc. Int’l Conference of Super-computing, June 1991.

[24] A. Saulsbury, F. Dahgren, and P. Stenstrom, “Receny-Based TLBPreloading,” Proc. 27th Ann. Int’l Symp. Computer Architecture, June2000.

[25] H. Sharangpani, “Intel Itanium Processor Microarchitecture Over-view,” Microprocessor Forum, Oct. 1999.

[26] O. Temam, E.D. Granston, and W. Jalby, “To Copy or Not toCopy: A Comile-Time Technique for Assessing When DataCopying Should be Used to Eliminate Cache Conflicts,” Proc.IEEE Supercomputing, Nov. 1993.

[27] R.C. Whaley and J. Dongarra, “Automatically Tuned LinearAlgebra Software (ATLAS),” Proc. Supercomputing, Nov. 1998.

[28] Q. Yi, V. Adve, and K. Kennedy, “Transforming Loops toRecursion for Multi-Level Memory Hierarchies,” Proc. ACMSIGPLAN 2000 Conf. Programming Language Design and Implemen-tation, June 2000.

Neungsoo Park received the BS and MS degrees in electrical engineering from Yonsei University, Seoul, Korea. He received the PhD degree in electrical engineering from the University of Southern California. He is a senior engineer at Samsung Electronics Corporation. His research interests include parallel computation, computer architecture, high-performance computing for signal processing, VLSI computations, and multimedia systems. He is a member of the IEEE and the IEEE Computer Society.

Bo Hong received the MS degree at Tsinghua University, Beijing, in July 2000. He is currently a PhD student in the Electrical Engineering Department at the University of Southern California. His research interests include computer architecture, high-performance computing, and parallel processing on heterogeneous computing architectures.



Viktor K. Prasanna received the BS degree in electronics engineering from Bangalore University and the MS degree from the School of Automation, Indian Institute of Science. He received the PhD degree in computer science from the Pennsylvania State University in 1983. Currently, he is a professor in the Department of Electrical Engineering as well as in the Department of Computer Science at the University of Southern California, Los Angeles. He is also an associate member of the Center for Applied Mathematical Sciences (CAMS) at the University of Southern California. He served as the division director for the Computer Engineering Division during 1994-1998. His research interests include parallel and distributed systems, embedded systems, configurable architectures, and high-performance computing. Dr. Prasanna has published extensively and consulted for industries in the above areas. He has served on the organizing committees of several international meetings in VLSI computations, parallel computation, and high-performance computing. He is the steering cochair of the International Parallel and Distributed Processing Symposium (merged IEEE International Parallel Processing Symposium (IPPS) and the Symposium on Parallel and Distributed Processing (SPDP)) and is the steering chair of the International Conference on High Performance Computing (HiPC). He serves on the editorial boards of the Journal of Parallel and Distributed Computing and the Proceedings of the IEEE. He is the editor-in-chief of the IEEE Transactions on Computers. He was the founding chair of the IEEE Computer Society Technical Committee on Parallel Processing. He is a fellow of the IEEE and the IEEE Computer Society.

