CAKE: Matrix Multiplication Using Constant-Bandwidth Blocks

H. T. Kung, Harvard University, Cambridge, MA, [email protected]
Vikas Natesh, Harvard University, Cambridge, MA, [email protected]
Andrew Sabot, Harvard University, Cambridge, MA, [email protected]
ABSTRACT
We offer a novel approach to matrix-matrix multiplication computation on computing platforms with memory hierarchies. Constant-bandwidth (CB) blocks improve computation throughput for architectures limited by external memory bandwidth. Configuring the shape and size of CB blocks operating from within any memory hierarchy level (e.g., internal SRAM), we achieve high throughput while holding external bandwidth (e.g., with DRAM) constant. We explain how, surprisingly, CB blocks can maintain constant external bandwidth as computation throughput increases. Analogous to partitioning a cake into pieces, we dub our CB-partitioned system CAKE.

We show CAKE outperforms state-of-the-art libraries in computation time on real-world systems where external bandwidth represents a bottleneck, demonstrating CAKE's ability to address the memory wall. CAKE achieves superior performance by directly using theoretically optimal CB-partitioned blocks in tiling and scheduling, obviating the need for extensive design search.
KEYWORDS
memory management, computation for deep neural network (DNN), parallel processing, parallel architectures, optimal block scheduling, arithmetic intensity, matrix multiplication, memory wall
ACM Reference Format:
H. T. Kung, Vikas Natesh, and Andrew Sabot. 2021. CAKE: Matrix Multiplication Using Constant-Bandwidth Blocks. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), November 14-19, 2021, St. Louis, MO, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3458817.3476166
1 INTRODUCTION
Matrix-matrix multiplication (MM) underlies many computational workloads in scientific computing and machine learning. For example, most computations in the forward pass of a convolutional neural network consist of one matrix multiplication per convolutional layer between the inputs to and the weights of a layer (see, e.g., [26]). Computation throughput for MM depends on processing power, local memory bandwidth (e.g., between internal SRAM and processing cores), local memory size, and DRAM bandwidth. However, performance gains from increasing processing power are limited by the other three factors. For example, DRAM bandwidth may become the limiting factor as more processing power is added.

Current approaches to MM (e.g., Goto's algorithm [12]) are designed for systems where memory and compute bandwidth are presumably balanced for target workloads. However, there is a
need for a new approach that can adapt to architectures with differing characteristics. These architectures may arise as a result of emerging technologies such as special-purpose accelerators [10, 23], low-power systems [20, 24], 3D DRAM die stacking [6, 14, 18, 25], and high-capacity non-volatile memory (NVM) [27]. The memory wall [30] remains a fundamental problem faced by all computing systems: shrinking technology processes do not directly address this issue. Finding a computation schedule which, for example, can maximize data reuse in the local memory to alleviate this disparity between memory and compute can be challenging. Often, the computation schedule is found through a grid search of the parameter space, which becomes computationally intractable for large systems.

This paper proposes the CAKE system that utilizes constant-bandwidth (CB) blocks in computation partitioning and block scheduling. CAKE offers a theory for optimal partitioning and substantially reduces the search space for an optimal schedule. A CB block is a block of computation with the property that, when computing the block from within a local memory, the required external bandwidth is constant. We can design a CB block capable of achieving a target computation throughput by controlling its shape and size (Section 3). With sufficient local memory resources, the CAKE algorithm can improve MM computation throughput without having to increase external DRAM bandwidth.

In CAKE, we partition the MM computation space, a 3D volume of multiply-accumulate (MAC) operations, into a grid of 3D CB blocks. The blocks are scheduled and then sequentially executed on a computing platform comprising multiple computing cores (see Figure 1). This scheme is analogous to how a host can cut and serve a "cake" to their guests. To consume the entire cake as quickly as possible, each guest must continuously eat: the rate of serving pieces must match the guests' rate of consumption.

The intuition behind CAKE is that, by adjusting the shape (i.e., aspect ratios) and size of CB blocks, we can control the ratio of computations to external memory accesses, i.e., arithmetic intensity (see Figure 4). As a result, we can use CB blocks to increase the use of available computing power without requiring a comparable increase in IO bandwidth to external DRAM. We analytically determine the shape and size of a CB block from available DRAM bandwidth and computing resources (Section 3). Under the CB framework, we can precisely characterize the required size and bandwidth of local memory for achieving a target computation throughput with a given external memory bandwidth.

We evaluate CAKE's performance in computation throughput and external memory bandwidth. Experiments are conducted on two desktop CPUs and a low-power embedded CPU. On the desktop CPUs, we found CAKE can outperform state-of-the-art GEMM libraries in computation throughput (Sections 5.2.5 and 5.2.6). For the low-power system, CAKE achieves substantially higher throughput than vendor libraries (Section 5.2.4).
Table 1: Terminology used throughout the paper
Term          Description
MM            matrix-matrix multiplication
CB            constant-bandwidth block
Tile          small matrix processed by a core (Section 3) or SIMD registers (Section 4.1)
Local Memory  cache on CPUs
Internal BW   bandwidth between last level cache and CPU cores
External BW   bandwidth between local memory and external memory (DRAM)
Figure 1: Computing architecture with many cores, multiple levels of local memory, and external memory.
Table 1 lists terminologies used in this paper. The main contributions of this paper are:
• MM block scheduling with surface sharing (Section 2).
• CAKE constant-bandwidth blocks: blocks analytically shaped and sized to reduce external memory access requirements (Section 3).
• Comparison of CAKE and Goto's algorithms (Section 4).
• CAKE library: a drop-in replacement for MM calls used by existing frameworks that does not require manual tuning.
• Demonstration of CAKE library performance gains on 3 different CPU architectures (Section 5).
2 BLOCK MM COMPUTATION
In this section, we introduce a block framework for matrix multiplication (MM) computation and how we may partition the computation space into blocks. Algorithm 1 defines MM, where reduction occurs on dimension K.

Consider an MM between matrices A and B, where A is size M × K and B is size K × N. The MM can be computed via a set of vector operations using one of the two strategies. That is, we can obtain the MM result C through M · N inner products between M row vectors (size 1 × K) of A and N column vectors (size K × 1) of B, or a summation of K outer products between column vectors (size M × 1) of A and the corresponding row vectors (size 1 × N) of B.

Outer products, unlike inner products, yield partial result matrices which will be summed together to produce C, allowing reuse and in-place accumulation. In the A × B example (Figure 2), there are K outer products between vectors of size M × 1 and 1 × N, which each produce partial result matrices of size M × N. Partial results are accumulated across the K-dimension (reduction dimension). Note that we may store the intermediate partial result matrix locally to be accumulated in place with forthcoming partial result matrices. Thus the locally stored intermediate results are reused K times. We use outer-product MM throughout this paper to leverage this reuse.
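To make the two strategies concrete, the following short Python sketch (ours, with illustrative sizes; not part of the original paper) computes C = A × B both ways and checks that the K outer products, accumulated in place, match the inner-product result:

```python
import numpy as np

M, K, N = 4, 3, 5
A = np.random.rand(M, K)
B = np.random.rand(K, N)

# Inner-product formulation: M * N dot products, one per output element.
C_inner = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        C_inner[i, j] = A[i, :] @ B[:, j]

# Outer-product formulation: K rank-1 partial results accumulated in place.
C_outer = np.zeros((M, N))
for k in range(K):
    C_outer += np.outer(A[:, k], B[k, :])  # stored partial result is reused K times

assert np.allclose(C_inner, C_outer)
```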
Figure 2: (a) C = A × B matrix multiplication. (b) Computation space represented as an M × K × N 3D volume of multiply-accumulate (MAC) operations as defined by Algorithm 1. (c) Computation space as an accumulation of outer products.
Algorithm 1: Matrix multiplication algorithm
  for i = 1 to M do
    for j = 1 to N do
      for k = 1 to K do
        C[i][j] ← C[i][j] + A[i][k] · B[k][j];
The computation space for C = A × B is a 3D volume of M · N · K multiply-accumulate (MAC) operations, depicted in Figure 2b. The volume shown in Figure 2b is determined by 3 IO surfaces: in this paper, we consider matrix A as being on the "left" side, matrix B as being on the "top", and matrix C as being the "back wall".

2.1 Block Framework for MM Computation
If a computing platform has sufficient local memory (internal memory) to hold the 3 IO surfaces, all MACs in the 3D computation space may be completed without any additional IO operations. For large MMs exceeding local memory capacity, the computation space must be partitioned and computed in smaller blocks, whose results are later combined.

In CAKE, we use uniform blocks of the same shape and size. As shown in Figure 3, we partition the M × N × K computation space into m × n × k blocks. Section 3 describes how block dimensions are calculated based on available resources, such as total available computing power and external DRAM bandwidth. Computation of a block may be viewed as a sum of k outer products, for multiplying the two corresponding sub-matrices within the larger MM
computation space. The computation yields the back wall of the block. Block computations result in m × n sub-matrices which, when accumulated together, yield a portion of the final matrix C. Matrix addition is commutative, so the computation order of blocks in the computation space does not matter for correctness.

All cores in the processing grid (Figure 3b) work in parallel, on input tiles, at the rate of one tile result per unit time for each core, to compute a block by performing k outer products. Each core works through the N-dimension of the block computation space, producing a row of partial results by computing tile-wise multiplications between a single A tile and n B tiles. It is also possible to compute blocks in the M or K-dimension, but we focus our presentation on the N-dimension. Each column of cores in the grid (e.g., cores 1, 5, 9, 13 in Figure 3b) computes an outer product between a sub-column of A and a sub-row of B to produce a partial result sub-matrix. The resulting sub-matrices are accumulated, yielding a partial sub-matrix which will be further accumulated with results from other blocks.

Intra-block data movement is reduced by each core sequentially reusing one A tile with many B tiles. The B tiles are broadcast to cores in the same column to maximize intra-block reuse. Partial results are summed along the K dimension (towards the back of the computation space), maximizing reuse via in-place accumulation.

As noted earlier, a block is defined by three IO surfaces: an input surface A of size m × k, an input surface B of size k × n, and a result surface C of size m × n. Surfaces A and B correspond to the aforementioned sub-matrices within the larger MM computation space. Depending on the location of the block within the computation space, result surface C will consist of either partial or completed reduction results.

For each block, the total external memory IO and the required local memory size are both equal to the sum of the three IO surfaces. External memory bandwidth requirements may be determined from the computation time of the block. When the block is shaped properly (see Figure 4), the IO time for the three surfaces will match the computation time of the block, allowing IO to overlap with computation and fully utilize the available processing power.

IO requirements are further reduced when sequentially computed blocks share an IO surface (i.e., the blocks are adjacent within the computation space). The shared surface can be kept local: the following adjacent block can reuse the surface without needing to fetch it from external memory. Furthermore, if the surface is a partial result surface, the previous block does not need to write the results back to external memory before the next computation. IO surfaces can be shared in the M, N, or K-dimension, and IO cost is minimized when the largest IO surface is reused most frequently.
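As a rough illustration of this accounting (our own sketch, not taken from the paper), the helper below counts the external IO of a single m × k × n block, crediting an input surface shared with the previously computed adjacent block and a partial result surface that stays resident in local memory:

```python
def block_external_io(m, k, n, reuse_A=False, reuse_B=False, c_kind="partial_resident"):
    """External IO (in elements) for one m x k x n block.

    reuse_A / reuse_B: the A (m x k) or B (k x n) surface is shared with the
        previous adjacent block and need not be fetched again.
    c_kind: 'partial_resident' -> partial results stay in local memory (no IO),
            'partial_spilled'  -> written back and fetched again later (2 * m * n),
            'complete'         -> written back once (m * n).
    """
    io = 0
    if not reuse_A:
        io += m * k
    if not reuse_B:
        io += k * n
    io += {"partial_resident": 0, "partial_spilled": 2 * m * n, "complete": m * n}[c_kind]
    return io

# Adjacent blocks along the K-dimension keep the partial C surface resident,
# so only the new A and B surfaces are fetched.
print(block_external_io(64, 16, 64, c_kind="partial_resident"))  # 64*16 + 16*64
print(block_external_io(64, 16, 64, c_kind="partial_spilled"))   # worst case
```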
2.2 Scheduling Blocks for MM Computation
To minimize IO, blocks in the MM computation space are scheduled so adjacent blocks are computed in sequence. Per the cake analogy, scheduling is analogous to cutting a cake into large chunks (blocks) that are then divided into smaller pieces for each guest (core). The partitioning (Figure 3b) is then used to generate a sequence of blocks, which are sequentially executed on the grid of cores.

Recall the 3 types of IO: A, B, and results C (where results are either partial or complete). The IO for a partial result is twice that of a completed result, A, or B because it must be written back to external memory and, later, fetched again for further use.
Figure 3: (a) Block defined by m × k and k × n input surfaces and an m × n output surface. (b) Grid of 16 processing cores. (c) Block-partitioned computation space for an MM between M × K and K × N matrices. (d) Rotated view of a slice of the computation space. The numbers represent the order of execution for blocks in a K-first schedule.
A schedule that minimizes external IO must therefore completely reuse all partial surfaces, and then reuse A and B. When a reduction run is complete, there is a choice of which surface to reuse next: A (in the N dimension) or B (in the M dimension). When B is reused first, the M-dimension is completed after the K-dimension, and vice versa if A is reused first. We refer to this schedule as a K-first (reduction-first) schedule.

The choice of whether to first reuse A (in the N-dimension) or B (in the M-dimension) after completing a reduction run (e.g., blocks 1, 2 and 3 in Figure 3d) depends on the shape of the computation space. When the B surface is larger (n > m), reusing the larger B surface before the A surface minimizes IO. This is done by computing the M-dimension before the N-dimension.

To allow inter-block reuse between different dimensions (e.g., between K and M), the computation space traversal must "turn" every time it completes a dimension. If instead the loops always started at the 0 index of a dimension, no A or B surfaces would be reused, leading to O(MN + N) missed IO surface reuses. The traversal direction therefore flips after each dimension to allow for IO surface reuse, as shown in Algorithm 2. Note the pseudocode assumes N ≥ M; when M > N, the outer two loops are switched because the A surfaces should be reused before the B surfaces. The algorithm defines the K-first computation order of blocks, which sweeps the computation space by first traversing the K-dimension to maximize partial result reuse, then the M-dimension to reuse A, and lastly the N-dimension to reuse B.
3 CONSTANT BANDWIDTH BLOCK SHAPE AND SIZE
A constant bandwidth (CB) block is a block (described in Section 2.1) with dimensions (m, n, k) shaped and sized according to external bandwidth (as seen in Figure 4). CB block shaping provides control over arithmetic intensity (AI), allowing us to match external IO time with computation time. AI is defined as the ratio of computation volume to data transferred, which is equivalent to the ratio of computation throughput (CT) to external memory bandwidth (BW): AI = CT/BW.
Algorithm 2: K-first block partitioning algorithm
  // Get the number of blocks in each dimension
  Mb = M/m;  Nb = N/n;  Kb = K/k;
  for nidx = 0, 1, . . . to Nb - 1 do
    // Flip direction of M traversal for A reuse
    if nidx mod 2 == 0 then
      mstart = 0; mend = Mb;
    else
      mstart = Mb; mend = 0;
    for midx = mstart to mend do
      // Flip direction of K traversal for B reuse
      if nidx mod 2 == 0 then
        if midx mod 2 == 0 then kstart = 0; kend = Kb;
        else kstart = Kb; kend = 0;
      else
        if midx mod 2 == 0 then kstart = Kb; kend = 0;
        else kstart = 0; kend = Kb;
      for kidx = kstart to kend do
        // Compute inner block multiplication
        C[midx · m][nidx · n] += A[midx · m][kidx · k] × B[kidx · k][nidx · n];
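The same traversal can be rendered as runnable Python (a sketch of ours, assuming N ≥ M as above), which makes it easy to check that consecutive blocks are always adjacent and therefore always share a surface:

```python
def k_first_schedule(Mb, Nb, Kb):
    """Yield block indices (m, n, k) in the K-first order of Algorithm 2,
    flipping the traversal direction after each completed dimension."""
    schedule = []
    for n in range(Nb):
        m_order = range(Mb) if n % 2 == 0 else range(Mb - 1, -1, -1)
        for m in m_order:
            k_order = range(Kb) if (n + m) % 2 == 0 else range(Kb - 1, -1, -1)
            for k in k_order:
                schedule.append((m, n, k))
    return schedule

# Every pair of consecutive blocks differs by 1 in exactly one index,
# so an A, B, or partial-C surface is always carried over.
blocks = k_first_schedule(3, 4, 2)
for prev, cur in zip(blocks, blocks[1:]):
    assert sum(abs(a - b) for a, b in zip(prev, cur)) == 1
```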
Figure 4: Changing the shape and size of a block can keep a block's external bandwidth (BW) constant when increasing the computation throughput (CT) by increasing core count. The block gets taller and wider, matching IO and computation time T as volume (V) changes. Note that arithmetic intensity is V/IO. Since V/IO = (V/T)/(IO/T) = CT/BW, and the CB blocks in (a), (b) and (c) have equal BW and increasing volume, they have increasing arithmetic intensity. Importantly, computation throughput is V/T. Thus, the CB blocks in (a), (b) and (c) have increasing computation throughput of k³/k = k², 4k³/(2k) = 2k², and σ²k³/(σk) = σk², respectively.
Therefore, we can, for example, increase CT or decrease the required BW by using CB blocks to control AI. We can also increase the size of the CB block to leverage any additional BW and decrease internal memory size requirements.
Consider a computing architecture with a number of cores, each performing one tile multiplication per unit time. As seen in Section 2.1, each core handles one tile from A, so the number of tiles in the A surface (size m × k) of a CB block is equal to the number of cores. The size of k determines how many cores contribute their partial results for accumulation. In this analysis, we assume k is a fixed optimal value calculated from the available external bandwidth (Section 3.2). Given k, we can compute m so that m · k is the number of available cores (Figures 3a and 3b). We reflect an increase in the number of cores used in m (e.g., when trying to increase CT by doubling the number of cores used, m would increase by a factor of 2). To reduce the number of variables in our analysis, we set m to a multiple of k, based on the number of available cores (i.e., m = σk), though in general m does not need to be a multiple of k.

Thus, the CB block shape is defined by m = σk and n = ασk, where α ≥ 1 and σ are unitless constants calculated from the available external memory bandwidth (see Section 3.2). When external memory bandwidth is low, raising α increases the block computation time, thereby decreasing the CB block's external bandwidth requirement (BW). When there is sufficient external bandwidth, α is set to 1.

To compute a CB block in the N-dimension, each core is first loaded with one A tile. B tiles are then streamed to each core from local memory (e.g., L3 cache). The CB block is shaped to have exactly one A tile per core, keeping A tiles stationary, to reduce local congestion. Results are accumulated between cores and cycled back to local memory (e.g., L3 cache) for reuse, and moved to external DRAM when complete. The computation time T for a CB block is T = ασk unit times because each core is assigned to compute n tile multiplications.

Alternatively, we can compute a CB block in the M or K-dimension, resulting in a CB block computation time of m or k unit times, respectively. Computing CB blocks in alternative directions may be advantageous on certain architectures. For example, computing CB blocks in the K-dimension is preferable when doing in-place accumulation. In a future paper we will show how the same shaping methodology applies when computing CB blocks in the M or K-dimension. In this analysis, we do not factor in accumulation time, as we assume accumulation can be overlapped with multiplication.
3.1 Internal Memory Size Requirement
A CB block consists of 3 IO surfaces that must be stored locally. The IO for each surface is equal to its size. For A, IO_A = σk · k = σk². For B, IO_B = k · ασk = ασk². For the result surface C, IO_C = σk · ασk = ασ²k². The internal memory size requirement is simply the combined size of the surfaces:

    MEM_internal = IO_A + IO_B + IO_partial = σk² + ασk² + ασ²k²    (1)

To increase the target processing power p-fold, internal memory size must increase by a factor of p² (due to the third term, ασ²k²).
3.2 External Bandwidth Analysis
We may compute the minimum external bandwidth required for a CB block based on the external IO for its A and B surfaces, using the following equation, where T is the block computation time:

    BW_min = (IO_A + IO_B) / T = (σk² + ασk²) / (ασk) = ((α + 1)/α) · k tiles/cycle    (2)

Increasing α allows us to compensate for low external bandwidth but increases both computation time and required local memory size. Partial result IO is not considered since results are held locally.

We define external bandwidth as BW_ext = R · k tiles/cycle, where R > 1 is a constant capturing the difference between available external bandwidth and the minimum bandwidth defined previously. We satisfy the minimum external bandwidth requirement when BW_ext ≥ BW_min, or α ≥ 1/(R − 1).
Now consider the case when more processing power is available (e.g., when the number of cores increases from 16 to 32, as in Figure 4). We choose to increase the m and n-dimensions by a factor of p = 2 because IO and computation time will increase by the same factor. BW_min does not depend on σ, which increases both computation and IO time equally. As a result, we can increase the number of utilized cores (σk²) while keeping the same external bandwidth requirement (see Figure 4).
3.3 Internal Bandwidth Requirements
Recall that the local memory holds 3 IO surfaces: two input surfaces and one result surface. During a CB block computation, each input surface is read once and loaded onto the cores. The partial result surface is accessed twice: once for reading and once for storing new partial results. We see the internal bandwidth must be at least:

    (IO_A + IO_B + 2 · IO_partial) / T = ((α + 1)/α) · k + 2σk tiles/cycle    (3)

Thus, as the number of cores (σk²) increases, internal bandwidth must increase proportionally (due to the second term, 2σk) to match external IO time with computation time.
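Putting Sections 3.1-3.3 together, the sketch below (ours; quantities are in tiles and tile-multiplications per unit time, and the numbers are only illustrative) derives a CB block shape from the core count, the accumulation depth k, and the bandwidth ratio R, and reports the resources it implies:

```python
def cb_block(num_cores, k, R):
    """Shape a CB block: cores = m * k = sigma * k^2, m = sigma * k, n = alpha * sigma * k,
    with alpha chosen so that BW_ext = R * k satisfies BW_ext >= BW_min (Section 3.2)."""
    sigma = num_cores / (k * k)
    alpha = max(1.0, 1.0 / (R - 1))
    m, n = sigma * k, alpha * sigma * k
    T = alpha * sigma * k                              # computation time
    io_A, io_B, io_C = sigma * k**2, alpha * sigma * k**2, alpha * sigma**2 * k**2
    return {
        "m": m, "n": n, "T": T,
        "mem_internal": io_A + io_B + io_C,            # Eq. (1)
        "bw_ext_min": (io_A + io_B) / T,               # Eq. (2): ((alpha+1)/alpha) * k
        "bw_int_min": (io_A + io_B + 2 * io_C) / T,    # Eq. (3)
    }

# Doubling the cores (16 -> 32) with k and R fixed doubles m and n, leaves the
# external bandwidth requirement unchanged, and grows the dominant memory term
# as p^2 and the dominant internal-bandwidth term as p.
print(cb_block(num_cores=16, k=2, R=2.0))
print(cb_block(num_cores=32, k=2, R=2.0))
```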
4 ANALYSIS OF CAKE AND GOTO ON CPUS
Modern CPUs contain a multilevel memory hierarchy consisting of external memory, a shared cache for all cores, and local caches on each core (Figure 1). We compare CAKE and Goto's algorithm [13] (hereafter referred to as GOTO) by adapting our computation throughput and memory bandwidth analysis from Section 3 to CPUs with a multilevel memory hierarchy. In our analysis, we assume the memory hierarchy comprises a local L1 and L2 cache per core, a shared L3 cache for all cores, and external DRAM.

During an MM computation, there are various ways of utilizing the different levels of memory. Algorithms that reuse data to different degrees will differ in system resource requirements, including external memory bandwidth, internal memory bandwidth, and the size of local memories (caches). To increase computation throughput by utilizing additional cores, algorithms must mitigate the constraints imposed by cache size bottlenecks. In the previous analysis (Section 3), we expressed the number of cores as mk. If we allow k > 1 on the CPU, we are limited to using a multiple of k cores. For instance, if k = 2, we can only use mk = 2, 4, 6, . . . cores. When k = 1, we can use any number of cores between 1 and m, to clearly demonstrate scaling up to m cores. For this reason, throughout Section 4, we use k = 1.

In CAKE, when using additional cores, we increase both m and n by the same factor to maintain the same external bandwidth requirement. In contrast, GOTO only increases m when using additional cores.

We calculate the maximum possible computation throughput allowed by available cores given a fixed external DRAM bandwidth for both CAKE and GOTO. We show that, unlike GOTO, by using internal memory resources, CAKE need not increase external bandwidth to increase computation throughput when using an increased number of cores.
4.1 Bandwidth Analysis for GOTO
GOTO is the current state-of-the-art algorithm for MM on CPUs, and is widely adopted by many libraries including Intel Math Kernel Library (MKL) [7], ARM Performance Libraries [4], and OpenBLAS [31]. While GOTO is able to achieve high performance when fully utilizing available DRAM bandwidth, its bandwidth requirement quickly becomes a limiting factor when trying to use additional cores.
We describe GOTO on a CPU with p cores, as depicted in Figure 5. In the previous analysis (Section 3), we expressed the number of cores as p = mk. With k = 1 for the CPU, we see that m is the number of cores. Each of the p cores works on an independent portion of the MM computation (i.e., there is no accumulation between cores).

Consider an MM computation C = A × B where A is M × K, B is K × N, and C is M × N. The GOTO algorithm uses an outer product formulation on sub-matrices or tiles, with dimensions defined in terms of parameters mc, kc, and nc, which are chosen based on cache and register characteristics (GOTO [13]). More precisely, C is partitioned into column sub-matrices of size M′ × nc, for some M′ ≤ M (Figure 5a). Via outer-product MM, these sub-matrices of C are computed by multiplying A's column sub-matrices (size M′ × kc) with B's row sub-matrices (size kc × nc) and accumulating across the K-dimension. In GOTO, a square mc × kc sub-matrix (i.e., mc = kc) of A resides in the L2 cache of each core while the kc × nc sub-matrix of B resides in the L3 cache (Figure 5b). The values of mc, kc, and nc are chosen such that each sub-matrix fits into its respective local memory level, i.e., mc · kc ≤ size_L2 and kc · nc ≤ size_L3. Note also that kc only refers to the K-dimension extent of A's sub-matrix in the L2 cache and bears no relation to k from the previous analysis (Section 3).

At the level of SIMD registers, mr × nr partial result tiles are processed, where the width of the available registers determines mr and nr (Figure 5e). In tiled MM, mr × kc tiles of A are multiplied with kc × nr tiles of B while mr × nr partial result tiles of C are streamed to DRAM. For accumulation with subsequent partial results of C, previously computed partial results of C are streamed from DRAM back to the CPU cores. This DRAM streaming can dominate IO.
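For reference, a single-threaded Python sketch of Goto-style cache blocking (our simplified rendering; the real implementations pack panels, parallelize over cores, and use hand-written micro-kernels) has roughly this loop structure, with NumPy standing in for the mr × nr micro-kernel:

```python
import numpy as np

def goto_mm(A, B, mc=64, kc=64, nc=256, mr=4, nr=8):
    """Goto-style blocked MM: a kc x nc panel of B targets the L3 cache, an
    mc x kc block of A targets the L2 cache, and mr x nr tiles of C are
    updated by the innermost micro-kernel."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for jc in range(0, N, nc):                     # panel of B / C columns
        for pc in range(0, K, kc):                 # slice of the reduction dimension
            for ic in range(0, M, mc):             # block of A rows
                for jr in range(jc, min(jc + nc, N), nr):
                    for ir in range(ic, min(ic + mc, M), mr):
                        C[ir:ir + mr, jr:jr + nr] += (
                            A[ir:ir + mr, pc:pc + kc] @ B[pc:pc + kc, jr:jr + nr]
                        )
    return C

A, B = np.random.rand(200, 150), np.random.rand(150, 300)
assert np.allclose(goto_mm(A, B), A @ B)
```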
From GOTO's description, we derive the required external DRAM bandwidth when using additional cores. Assuming a single core can multiply an mr × kc tile with a kc × nr tile and produce an mr × nr result tile in a single unit time, we obtain mr · kc · nr MAC operations per unit time. The total number of MAC operations needed to compute an mc × nc result sub-matrix is mc · kc · nc. We can compute p mc × nc result sub-matrices in parallel, since each of the p cores has its own L2 cache to support the local SIMD computation. Thus, the time to compute p mc × nc result sub-matrices is:

    T = (mc · kc · nc) / (mr · kc · nr) = (mc · nc) / (mr · nr)

To compute p mc × nc result sub-matrices of C, we must bring in p mc × kc sub-matrices of A and one kc × nc sub-matrix of B from DRAM into local memory, while also streaming p mc × nc partial result sub-matrices back to DRAM:

    IO = IO_A + IO_B + IO_C = (p · mc · kc) + (kc · nc) + (p · mc · nc)

The external (DRAM) bandwidth required, assuming mc = kc as noted earlier, is then:

    BW_ext_GOTO = IO / T = (1 + p + (kc/nc) · p) · mr · nr
Figure 5: GOTO data reuse in the CPU memory hierarchy. (a) mc × nc sub-matrices of C are computed by accumulating outer products between column sub-matrices of A and row sub-matrices of B. (b) Each mc × kc sub-matrix from A is reused in the L2 cache of a core while data from B is reused in the L3 cache, and results for C are written back to external memory. (c, d) Computation of an mc × nc sub-matrix of C on a core. (e) Tile-level MM on CPU SIMD registers.
Figure 6: CAKE data reuse in the CPU memory hierarchy. (a) Schedule of CB blocks. (b) Each mc × kc sub-matrix from A is reused in the L2 cache of a core while data from B and partial results for C are reused in the L3 cache. (c, d) Computation of a single mc × αpmc sub-matrix of C on a core, similar to Figures 5c and 5d. (e) Tile-level MM, performed identically to Figure 5e.
When the number of cores in use increases by some factor, the required external bandwidth for GOTO increases by at least the same factor. As we will present, CAKE can mitigate this issue.
4.2 Bandwidth Analysis for CAKE
In contrast to GOTO, CAKE's bandwidth requirements do not grow in proportion to the number of cores used. We analyze how shaping and sizing the CB block affects external bandwidth requirements. Figure 6 shows CAKE's block shaping on our CPU with the multilevel memory hierarchy. The computation space of C = A × B is partitioned into a 3D grid of CB blocks, as described in Section 2.2. Box (a) shows the schedule of entire CB blocks within the input matrices.

Box (b) shows the different components of the CB block within the memory hierarchy. Consider a CPU with p cores. To compute a single CB block on the CPU, the IO components are first loaded into the L3 cache. The A component is partitioned into p square mc × kc sub-matrices (i.e., mc = kc), and each sub-matrix is loaded into the L2 cache of one of the cores. The B component, of size kc × αpmc, is then broadcast from the L3 cache to each core, while the partial result component, of size p · mc × α · p · mc, is reused in the L3 cache. Note that shaping the CB block on the CPU as pmc × k · kc × αpmc with k = 1 is analogous to the σk × k × ασk shaping of the CB block in Section 3. Here, α ≥ 1 is a unitless constant that is set based on the available DRAM bandwidth. Partial results are returned to DRAM only after being reused for all accumulation in the K-dimension. Since we reuse partial results in the L3 cache until they are complete, the total IO for a CB block is the sum of the A and B components' IO:

    IO = IO_A + IO_B = p · mc · kc + α · p · mc · kc
Computation time for CAKE is derived analogously to GOTO (Section 4.1). Box (e) in Figure 6 shows MM at the tile level on a CPU. We assume a single core can multiply an mr × kc tile with a kc × nr tile to produce an mr × nr result tile per cycle. This amounts to mr · kc · nr multiply-accumulate (MAC) operations per cycle. The total number of MAC operations needed to compute an mc × α · p · mc result sub-matrix is mc · kc · α · p · mc. We can compute p of these mc × α · p · mc result sub-matrices in parallel, so the CB block compute time is:

    T = (mc · kc · αpmc) / (mr · kc · nr) = (α · p · mc²) / (mr · nr)

The external memory bandwidth required to maximize computation throughput when increasing the number of cores is:

    BW_ext_CAKE = IO / T = (p · mc · kc · (α + 1) · mr · nr) / (α · p · mc²) = ((α + 1)/α) · mr · nr    (4)
We see CAKE can, unlike GOTO, balance computation with IO while requiring only constant external bandwidth.
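To see the contrast numerically, the small sketch below (ours; the tile and panel sizes are illustrative) evaluates the GOTO expression from Section 4.1 against Equation 4 as cores are added:

```python
def bw_goto_ext(p, kc, nc, mr, nr):
    # GOTO external bandwidth (Section 4.1), with mc = kc
    return (1 + p + (kc / nc) * p) * mr * nr

def bw_cake_ext(alpha, mr, nr):
    # CAKE external bandwidth, Eq. (4): independent of the core count p
    return ((alpha + 1) / alpha) * mr * nr

for p in (1, 4, 16, 64):
    print(p, bw_goto_ext(p, kc=192, nc=3072, mr=6, nr=16),
             bw_cake_ext(alpha=1, mr=6, nr=16))
```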
As described in Section 3, when growing the number of cores by a factor of p, CAKE requires that local memory size increase by a factor of p² and internal bandwidth by a factor of p:

    MEM_local_CAKE = IO_A + IO_B + IO_C = p · mc · kc · (α + 1) + α · p² · mc²    (5)

    BW_int_CAKE = (IO_A + IO_B + 2 · IO_C) / T = (2p + 1/α + 1) · mr · nr    (6)
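A corresponding sketch for Equations 5 and 6 (again with illustrative parameters) shows the p² growth in local memory and the roughly linear growth in internal bandwidth that CAKE trades for its constant external bandwidth:

```python
def cake_local_mem(p, mc, kc, alpha):
    # Eq. (5): A and B surfaces plus the resident partial-result surface
    return p * mc * kc * (alpha + 1) + alpha * p**2 * mc**2

def cake_internal_bw(p, alpha, mr, nr):
    # Eq. (6)
    return (2 * p + 1 / alpha + 1) * mr * nr

for p in (1, 2, 4, 8, 16):
    print(p, cake_local_mem(p, mc=192, kc=192, alpha=1),
             cake_internal_bw(p, alpha=1, mr=6, nr=16))
```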
4.3 Sizing CAKE CB Blocks to Minimize Cache Evictions
The CAKE CB block shape allows us to keep bandwidth requirements constant, but CB block size is still dependent on available local memory. Due to the least-recently-used (LRU) eviction scheme of our L2 and L3 caches, we are unable to use the entirety of each
cache for matrix operands. CAKE CB blocks must be sized to minimize superfluous memory accesses resulting from cache evictions. GOTO uses a similar strategy for sizing its blocks to minimize translation lookaside buffer (TLB) misses [13].
We want to reuse a CB block surface (either the input surfaces A and B, or the partial result surface C) in local memory. Assume a current working CB block i reuses partial results C[i] in local memory. If the CPU cache has size S and uses an LRU eviction protocol, we must size our CB block such that:

    C + 2(A + B) ≤ S

The factor of 2 allows entries for A[i] and B[i] to coexist with entries for C[i], A[i + 1], and B[i + 1]. Suppose instead we are currently working on CB block i and have sized the block such that A + B + C = S. When the next CB components A[i + 1] and B[i + 1] are first addressed, their data should replace the cache entries that were used for A[i] and B[i]. However, upon completion of CB block i, some entries related to C[i] will be least recently used and will be replaced by A[i + 1] and B[i + 1] from block i + 1. Furthermore, the factor of 2 ensures that, when A[i + 2] and B[i + 2] of block i + 2 are first addressed, the entries for A[i] and B[i] are guaranteed to be LRU and will be replaced.
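A sketch of this sizing rule (ours, counting matrix elements rather than bytes): given a cache of S elements, the core count p, and α, the largest admissible mc is the largest value satisfying C + 2(A + B) ≤ S.

```python
def max_mc(S, p, alpha=1.0):
    """Largest mc such that an (alpha * p * mc)-wide CB block with square
    mc x kc (kc = mc) A sub-matrices satisfies C + 2(A + B) <= S."""
    mc = 1
    while True:
        A = p * mc * mc                    # p square mc x kc sub-matrices of A
        B = mc * alpha * p * mc            # kc x (alpha * p * mc) surface of B
        C = alpha * (p * mc) ** 2          # resident partial-result surface
        if C + 2 * (A + B) > S:
            return mc - 1
        mc += 1

# e.g., a 20 MiB last-level cache of 4-byte floats with p = 10 cores, alpha = 1
print(max_mc(S=20 * 2**20 // 4, p=10))     # close to the mc = kc = 192 used in the example below
```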
4.4 Comparing CAKE and GOTO
CAKE and GOTO employ similar techniques for reusing data within a memory hierarchy, as seen in Figures 5 and 6. Both use outer products to compute blocks of C by multiplying column sub-matrices of A with row sub-matrices of B and summing each partial product in the K dimension. Furthermore, both partition the column sub-matrix of A into square mc × kc sub-matrices and reuse these square sub-matrices in the L2 cache of each core. However, by using CB block shaping and sizing, CAKE is able to account for available external bandwidth constraints.

To accommodate additional cores, both CAKE and GOTO increase the size of the sub-matrices reused in local memory. Figures 5b and 6b show that both CAKE and GOTO will increase the size of the column sub-matrix of A in the M dimension (via pmc) by utilizing the L2 cache from each additional core to operate on several new mc × kc sub-matrices. In GOTO, the B sub-matrix occupies most of the L3 cache, as the value of nc is chosen based on the L3 cache size. GOTO assumes that computation power and memory bandwidth are balanced, allowing wide column sub-matrices of C (with a large nc value) to be computed while buffering partial results in main memory, i.e., GOTO attempts to overlap partial result movement with the computation of the next outer product. Since the M-dimension increases proportionally with the number of cores, the partial result sub-matrix IO to main memory also increases proportionally. GOTO relies on increasing DRAM bandwidth to balance the computation time with such increasing IO requirements.

In contrast, CAKE controls the size of the B and C sub-matrices in the L3 cache by shaping the CB block in the N dimension (αpmc) according to the available DRAM bandwidth (via α) and the number of cores (via p, see Figure 6a). Unlike GOTO, CAKE eliminates the movement of partial results to external memory entirely through in-place accumulation in local memory.

Partial results of C in CAKE occupy the greatest proportion of the L3 cache relative to the sub-matrices of A and B.
Table 2: CPUs Used in CAKE Evaluation
CPU                 L1      L2       L3      DRAM    Cores  DRAM Bandwidth
Intel i9-10900K     32 KiB  256 KiB  20 MiB  32 GB   10     40 GB/s
AMD Ryzen 9 5950X   32 KiB  512 KiB  64 MiB  128 GB  16     47 GB/s
ARM v8 Cortex A53   16 KiB  512 KiB  N/A     1 GB    4      2 GB/s
For example, in our experiments with the Intel i9-10900K CPU, suppose p = 10 cores, α = 1, and sub-matrix dimensions mc = kc = 192 are chosen such that the B and C CB block surfaces fill up the L3 cache. Then, the C and B surfaces will take up 91% and 9% of the L3 cache, respectively. In comparison, GOTO uses all of the L3 cache for B.
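These proportions can be sanity-checked directly (a back-of-the-envelope calculation of ours, counting elements):

```python
p, alpha, mc = 10, 1, 192
kc = mc
B_surface = kc * (alpha * p * mc)            # 192 x 1920 elements of B in L3
C_surface = (p * mc) * (alpha * p * mc)      # 1920 x 1920 partial results in L3
total = B_surface + C_surface
print(round(C_surface / total, 2), round(B_surface / total, 2))  # 0.91 and 0.09
```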
Both CAKE and GOTO will see reduced performance as the number of cores grows faster relative to the available internal memory size. Past a certain point, GOTO cannot leverage additional cores as DRAM bandwidth becomes the limiting factor. Analogously for CAKE, limited internal bandwidth and local memory will eventually impede performance as the number of cores increases. However, CAKE can further increase throughput by utilizing additional local memory or DRAM bandwidth, whereas GOTO can only do so by increasing DRAM bandwidth usage.
5 PERFORMANCE EVALUATION OF CAKE ON CPUS
Current state-of-the-art MM algorithms are based on GOTO and bounded by external memory bandwidth. We show CAKE improves MM performance on multiple CPU architectures, achieving maximum computation throughput without increasing external bandwidth by leveraging increased internal memory bandwidth and size.

We evaluate CAKE on CPUs with different system characteristics and constraints. We measure CAKE's performance using computation throughput and DRAM bandwidth usage as metrics.
5.1 CPUs Evaluated
We evaluate CAKE on an ARM v8 Cortex A53, an Intel i9-10900K, and an AMD Ryzen 9 5950X. Each CPU has unique architecture and system characteristics, shown in Table 2, allowing us to evaluate CAKE under different constraints in external bandwidth, local memory size, and internal bandwidth. The ARM v8 Cortex A53 is an embedded low-power CPU with severely limited DRAM bandwidth, local memory size, and internal bandwidth. The Intel i9-10900K is a 10-core desktop CPU with high DRAM bandwidth and large local memory but constrained internal memory bandwidth. The AMD Ryzen 9 5950X is a 16-core desktop CPU with high DRAM bandwidth, large local memory, and high internal memory bandwidth. In each section, we compare CAKE to the state-of-the-art GEMM library for its respective CPU. We then explain CAKE's performance in the context of each CPU's characteristics and constraints.

5.2 CAKE Implementation and CPU Experiments
We implemented CAKE in C++ for use in our evaluations. Our implementation uses the BLIS kernel library [28] to execute MM at the tile level on CPU SIMD registers. BLIS was chosen for its portability across CPU architectures, enabling CAKE to act as a drop-in GEMM library on multiple platforms.
In Section 5.2.2, we profile CAKE on the Intel i9 and ARM CPUs in terms of how often the CPUs are stalled on memory requests from different memory levels. Then, in Section 5.2.3 we show CAKE's performance relative to Intel's Math Kernel Library (MKL) and ARM Performance Libraries (ARMPL) for various matrix sizes on the Intel and ARM CPUs. Matrix size and shape were systematically varied in three dimensions to identify regions where CAKE outperforms the competing GEMM library. In Sections 5.2.4, 5.2.5, and 5.2.6, we vary the number of cores and measure computation throughput (in GFLOP/s) and DRAM bandwidth usage (in GB/s) when performing a large square MM computation. Matrix sizes were chosen to fit into the system's available DRAM. For example, we use relatively small 3000 × 3000 matrices for the ARM system due to its limited main memory. CAKE's theoretically optimal DRAM bandwidth usage (calculated in Section 4.2) is shown as a dashed curve. Internal bandwidths between the last level cache (LLC) and CPU cores were measured using the parallel memory bandwidth benchmark tool (pmbw) [5]. Local memory refers to the LLC shared among all cores, and may be the L2 or L3 cache depending on the architecture (i.e., L2 for ARM and L3 for Intel). We also show performance extrapolations for CAKE and the state-of-the-art GEMM libraries when using additional CPU cores, assuming fixed DRAM bandwidth. The extrapolation assumes local memory size increases quadratically and internal bandwidth increases linearly with the number of cores. We use the last two data points in each plot to initialize the extrapolation line.
5.2.1 Packing Overhead and Considerations. GEMM libraries, such as Intel MKL, contain distinct packing implementations optimized for many matrix shapes [16] to copy input matrices into a contiguous buffer. We implemented only one packing method since our development effort focused on the MM kernel over packing. By packing matrices into contiguous buffers, GEMM libraries, including CAKE, are able to minimize cache evictions. As an added benefit, packing also prevents cache self-interference, as articulated in [22]. In all experiments, we include packing overhead when measuring throughput and DRAM bandwidth during MM.

In a typical system, cache eviction schemes are not modifiable by users. If the eviction scheme is not properly accounted for, additional runtime due to unnecessary evictions and cache misses will dominate GEMM running time. When M, N, and K are large, time spent packing is small relative to time spent in the MM kernel. However, we note that, with CAKE, packing overhead for skewed matrix shapes (i.e., where one of M, N, and K is much smaller than the other two) may constitute a significant fraction of total computation time.
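As an illustration of what packing does (our own simplified sketch, not the CAKE library's packing routine), the helper below copies an mc × kc block of A into a contiguous buffer arranged as mr-row panels, the order in which a micro-kernel would stream through it:

```python
import numpy as np

def pack_A_block(A, ic, pc, mc, kc, mr):
    """Copy the mc x kc block of A starting at (ic, pc) into a contiguous,
    zero-padded buffer of mr-row panels (panel-major order)."""
    block = A[ic:ic + mc, pc:pc + kc]
    rows, cols = block.shape
    padded_rows = ((rows + mr - 1) // mr) * mr
    packed = np.zeros((padded_rows, cols))
    packed[:rows, :] = block
    return packed.reshape(padded_rows // mr, mr, cols)  # panels of shape mr x cols

A = np.arange(36.0).reshape(6, 6)
print(pack_A_block(A, ic=0, pc=0, mc=6, kc=6, mr=2).shape)  # (3, 2, 6)
```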
5.2.2 Reducing Main Memory Stalls With CAKE. We profile the time the CPU spends serving requests from different memory levels when running a fixed-size MM using the CAKE, MKL, and ARMPL libraries. On the Intel i9-10900K, we profile memory request stalls for an MM between two 10000 × 10000 matrices. A stall occurs when CPU cores are waiting for data from some memory level and are unable to be productive. We use Intel's VTune Profiler [8] to measure the number of cycles where the CPU is stalled on the L1, L2, and L3 caches as well as main memory. On the ARM v8 Cortex A53, we measure the number of cache hits and DRAM memory requests using the Linux Perf Tool [2] for an MM between two 3000 × 3000 matrices.
Figure 7: (a) Number of clock ticks spent stalled on requests at different memory levels for CAKE and MKL for 10000 × 10000 matrices using all 10 cores of the Intel i9-10900K CPU. Stalled time is defined as the time that the operands are not in the CPU registers but instead are present in a particular memory level (L1, L2, L3, DRAM). For this large problem size, CAKE's performance is ≈ 10% lower than MKL's (see Figure 8 for problem sizes where CAKE outperforms MKL). However, CAKE spends less absolute time stalled on main memory and more time stalled on local memory than MKL. (b) Cache hits and DRAM memory accesses when performing MM between two 3000 × 3000 matrices on the ARM v8 Cortex. CAKE shifts memory demand to internal memory (L1, L2, L3) whereas both MKL and ARMPL rely on main memory transfers.
As depicted in Figure 7a, with CAKE, the Intel CPU is most often stalled on memory requests to a local memory level. In contrast, with MKL, the CPU stalls most often on requests to main memory. We see similar results for the ARM CPU in Figure 7b. CAKE serves more memory requests from the L1 cache than ARMPL, which performs ≈ 2.5× more DRAM requests. This reflects CAKE's use of local memory to buffer partial results. If DRAM bandwidth remains fixed, as the number of cores grows, main memory stalls will significantly impact performance due to high memory access latencies.
5.2.3 CAKE's Performance for Different Matrix Shapes. As depicted in Figure 8, we use all 10 cores of the Intel CPU and vary N and K from 1 to 8000. Each plot shows a different M : N aspect ratio, and shaded regions represent input matrix dimensions where CAKE outperforms MKL by at least some factor.

As the matrix size decreases in any dimension, CAKE's throughput relative to MKL increases. For a general MM between square N × N matrices, the arithmetic intensity is O(N). Hence, as the problem size N decreases, the arithmetic intensity also decreases and the MM becomes more memory-bound. CAKE increases arithmetic intensity, and thus MM throughput, by analytically blocking the computation to minimize external IO.
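For instance, counting elements rather than bytes, a square MM performs N³ MACs while its three IO surfaces total 3N² elements, so AI ≈ N³ / (3N²) = N/3, which shrinks linearly as N shrinks.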
Figure 9 shows CAKE's speedup in computation throughput for square matrices of varying size compared to MKL and ARMPL on the Intel i9-10900K and ARM v8 Cortex CPUs, respectively. Speedup in throughput for a fixed-size MM using p cores is defined as t_p / t_1, where t_1 and t_p are the throughputs for the MM using a single core and p cores, respectively. This speedup metric allows us to evenly
compare CAKE's performance across different platforms (i.e., Intel and ARM) for the same problem sizes. For absolute performance results, please refer to Sections 5.2.4, 5.2.5, and 5.2.6.

In Figure 9a, CAKE's improvement in speedup over MKL is more pronounced for smaller matrices. When matrices grow to a certain size, achievable throughput reaches a steady state, allowing MKL to approach CAKE's performance. In contrast, Figure 9b shows that, on the ARM CPU, CAKE has higher speedup than ARMPL for all problem sizes. The low available DRAM bandwidth impedes ARMPL's performance when it attempts to use more cores, meaning CAKE outperforms ARMPL for all problem sizes. Kinks in the speedup curves for MKL and CAKE in Figure 9a around 6 cores may be attributable to non-deterministic latency introduced by bus contention or L3 cache coherency delays.
Figure 8: Relative improvement in throughput for CAKE, compared to MKL, over various matrix dimensions on an Intel i9-10900K. Each contour boundary indicates dimensions for which CAKE outperforms MKL by a certain factor. Darker regions indicate greater throughput for CAKE relative to MKL.
5.2.4 Achieving Maximum Computation Throughput Allowed Under External Bandwidth Constraints. We evaluate the CAKE algorithm on an ARM v8 Cortex A53, which has low DRAM bandwidth, local memory size, and internal bandwidth. We multiply two 3000 × 3000 matrices using CAKE and ARMPL and collect performance data using the Linux perf tool [2]. We use perf to record DRAM accesses by monitoring the ARM PMU event counter for L2 cache refills from DRAM. Figures 11a and 11b show that, as the number of cores increases, ARMPL must increase DRAM usage to increase computation. As a result, ARMPL does not achieve the maximum computation throughput, as there is little additional DRAM bandwidth available (Figure 11b).
Figure 9: Speedup in computation throughput (speedup compared to a single core) for square matrices on the Intel i9-10900K and ARM v8 Cortex CPUs. In (a), CAKE's speedup improvement over MKL is more pronounced for small matrices. MKL's performance approaches CAKE's as matrix size increases and achievable throughput reaches a steady state. In (b), CAKE outperforms ARMPL on square MM consistently for all sizes. Unlike CAKE, ARMPL cannot scale performance with the number of cores due to limited DRAM BW on the ARM CPU.
In contrast, CAKE can adjust the CB block shape to achieve the maximum computation throughput for a given number of cores without increasing DRAM bandwidth.

We use pmbw to measure the ARM CPU's internal bandwidth between the L2 cache and cores, as shown in Figure 11c. Internal bandwidth on the ARM CPU does not increase proportionally with the number of cores beyond 2 cores. Consequently, we see the DRAM bandwidth required for CAKE increases slightly above the theoretical optimum for 3 and 4 cores in Figure 11a.

In Figure 11b, insufficient internal bandwidth causes CAKE's throughput to grow slightly less than proportionally to the number of cores. Our extrapolation in Figure 11c suggests that, if internal bandwidth continued to increase proportionally to the number of cores, CAKE throughput would also continue to increase.
Figure 10: CAKE on an Intel i9-10900K, performing MM between two 23040 × 23040 matrices. (a) Compared to MKL (blue), CAKE (red) does not need to increase DRAM bandwidth to utilize more cores. The slight increase in CAKE's DRAM bandwidth above the optimal for 9 and 10 cores is caused by insufficient internal bandwidth between the L3 cache and CPU cores. (b) CAKE (red) achieves within 3% of MKL's (blue) computation throughput without needing to increase DRAM bandwidth. (c) Using pmbw, we measure internal bandwidth between the L3 cache and cores. On this CPU, internal bandwidth does not increase proportionally with the number of cores past 6 cores. Consequently, we see in (a) that CAKE's required DRAM bandwidth increases slightly above optimal past 6 cores, and in (b) that insufficient internal bandwidth causes CAKE throughput to grow slightly less than proportionally to core count. The dotted extrapolation lines assume internal memory bandwidth increases proportionally for each additional core, local memory size increases quadratically, and DRAM bandwidth is fixed.
Figure 11: CAKE on an ARM v8 Cortex A53, performing MM between 3000 × 3000 matrices. CAKE adjusts the CB block to achieve the maximum computation throughput without increasing DRAM bandwidth. (a) Unlike ARMPL (magenta), CAKE (red) does not need to increase DRAM bandwidth to utilize more cores. (b) CAKE achieves higher computation throughput than ARMPL, which is limited by DRAM bandwidth. (c) Using pmbw, we find internal bandwidth between the L2 cache and cores does not increase with the number of cores for the ARM CPU. Consequently, we see in (a) that CAKE's required DRAM bandwidth increases slightly above optimal for 3 and 4 cores, and in (b) that insufficient internal bandwidth causes CAKE's throughput to grow slightly less than proportionally to the number of cores. The dotted extrapolation lines assume internal memory bandwidth increases proportionally for each additional core, local memory size increases quadratically, and DRAM bandwidth is fixed.
5.2.5 Achieving Maximum Computation Throughput Allowed Under Internal Memory Bandwidth Constraints. We compare CAKE to MKL on an Intel i9-10900K for an MM between two 23040 × 23040 matrices. We use Intel VTune Profiler to record computation throughput and DRAM bandwidth usage. CAKE achieves within 3% of MKL's computation throughput, but requires substantially less DRAM bandwidth (Figures 10a, 10b).

We measure internal bandwidth between the L3 cache and processor cores, again using pmbw. As shown in Figure 10c, the Intel CPU is constrained by internal bandwidth, which fails to grow proportionally with the number of cores beyond 6 cores. Consequently, CAKE requires slightly more than the theoretically optimal DRAM bandwidth for 9 and 10 cores, as seen in Figure 10a. In Figure 10b, we also see that insufficient internal bandwidth causes CAKE's throughput to grow less than proportionally to the number of cores past 8 cores. Figure 10b includes extrapolations of expected performance for both CAKE and MKL, under the assumption that internal memory bandwidth continues to increase proportionally with the number of cores while DRAM bandwidth remains fixed. With sufficient local memory, CAKE will achieve the maximum possible computation throughput for a given number of cores. While MKL's throughput may continue to increase beyond 10 cores, it relies on increased DRAM bandwidth to increase throughput, which will plateau when all available DRAM bandwidth is consumed.
Figure 12: CAKE on an AMD Ryzen 9 5950X performing MM between two 23040 × 23040 matrices. Relative to the other CPUs evaluated, the 5950X is least constrained by system resources. (a) Unlike OpenBLAS (black), CAKE (red) does not need to increase DRAM bandwidth to utilize more cores. On this CPU, DRAM bandwidth is estimated from the number of L1 cache-line refills from DRAM. (b) CAKE matches OpenBLAS's peak performance, but with a smaller DRAM bandwidth requirement, as seen previously. (c) Using pmbw, we measured internal bandwidth between the L3 cache and cores. The 5950X has high internal bandwidth between the L3 cache and cores that grows roughly linearly with the number of cores. Consequently, we see in (a) that CAKE's required DRAM bandwidth stays constant past 9 cores, and in (b) that sufficient internal bandwidth enables CAKE to achieve high computation throughput. The dotted extrapolation lines assume internal memory bandwidth increases proportionally for each additional core, local memory size increases quadratically, and DRAM bandwidth is fixed.
Although 40 GB/s of DRAM bandwidth is available on this system, CAKE only utilizes an average of 4.5 GB/s, instead relying on local memory bandwidth to leverage more cores. Relying on local memory is preferable because, while DRAM bandwidth is high, latency and power consumption may be problematic [29].
5.2.6 Achieving Maximum Computation Throughput Allowed Under Available Processing Power (Without Other System Resource Constraints). We compare CAKE to OpenBLAS when computing an MM between two large 23040 × 23040 matrices on an AMD Ryzen 9 5950X CPU. OpenBLAS was chosen for the AMD CPU because it is an optimized library using GotoBLAS and outperforms MKL on non-Intel hardware [9]. We use AMD uProf [15] to measure DRAM bandwidth and computation throughput. However, due to the lack of an available DRAM access counter on this CPU at the time of writing, DRAM accesses during MM are estimated using the PMU counter for L1 data cache refills. Only a portion of total DRAM bandwidth usage is measured, so we omit comparison to the theoretically optimal DRAM bandwidth curve in Figure 12a.

Figure 12c shows internal bandwidth between the L3 cache and processor cores increases roughly proportionally, by about 50 GB/s per core. CAKE takes advantage of the increasing internal bandwidth to achieve peak computation throughput without increasing DRAM bandwidth usage beyond 9 cores (Figures 12a and 12b). Since this CPU is not constrained by internal bandwidth or DRAM bandwidth, both CAKE and OpenBLAS can increase computation throughput when adding more cores, but OpenBLAS uses more DRAM bandwidth than CAKE.

We include extrapolations of expected performance for CAKE and OpenBLAS in Figures 12b and 12c, again assuming internal memory bandwidth continues to increase proportionally with the number of cores while DRAM bandwidth remains fixed.
6 EXTENDING CAKE BEYOND CPUS ANDCAKE ARCHITECTURE SIMULATOR
6.1 Beyond CPUsCAKE trades off increased internal memory usage for reducedexternal bandwidth requirements, which can benefit any computingplatform with a memory hierarchy with multiple levels (e.g., L1, L2,DRAM).
While this paper focuses on CPUs, CAKE is not limited to theseplatforms. For example, the CAKE methodology can apply to GPUsor other heterogeneous systems comprised of general-purpose com-puting engines and special-purpose accelerators.
Indeed, CUTLASS, a GPU MM library, similarly utilizes block shaping and sizing, but exposes them only as "compile-time constants that the programmer specifies" [1]. CAKE's CB blocks can eliminate the need to manually search for optimal block designs. For similar reasons, CAKE can also help reduce searches for optimal multi-tenant schedules [11].
6.2 CAKE Architecture Simulator
We developed a SystemC architecture simulator [3] using MatchLib [19] connections to validate the correctness of the CB block design and execution schedule. The simulator models timings between external memory, local memory, and cores under various system characteristics (e.g., low external memory bandwidth). This flexibility helps verify the correctness of the CAKE algorithm for corner cases that are difficult to analyze. Results from the simulator were used to develop the libraries evaluated in Section 5 and will ease the development of future extensions of CAKE (as noted in Section 6.1) to various computing platforms (including GPUs, Maestro [21], TPU [17], etc.).
To reduce module complexity and simplify programming, standardized packets are used for all communication between simulated hardware modules. Packets originate from external memory and contain headers to control routing (i.e., source routing) as well as fields containing the packet's tile index into the computation space and CB block. Packet-based scheduling allows us to easily modify the architecture and computation schedule. For example, to double the number of cores, we simply instantiate new modules to represent the added cores and adjust the packet headers accordingly.
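As an illustration of the packet contents described above, a minimal sketch follows; the field names and types are assumptions for exposition, and the actual SystemC/MatchLib module and message definitions may differ.

# Hedged sketch of the simulator's packet layout: a source-routing header
# plus indices locating the payload within the CB block and computation space.
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    route: List[int]      # source route: ordered hop list from external memory to a core
    cb_block_id: int      # which constant-bandwidth block this data belongs to
    tile_index: int       # tile's index into the computation space within the CB block
    payload: bytes = b""  # operand or partial-result data carried by the packet

# Doubling the core count only needs new core modules plus updated routes,
# since the schedule itself travels in the packet headers.
pkt = Packet(route=[0, 3, 7], cb_block_id=2, tile_index=14, payload=b"\x00" * 64)
print(pkt)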
7 CONCLUSION
When performing a block of computations in a matrix multiplication (MM) space from within an internal memory, CAKE uses a constant-bandwidth (CB) block to simultaneously satisfy two objectives. First, CAKE adapts the computation's arithmetic intensity to balance computation and IO times by controlling block aspect ratios. Second, CAKE adjusts the computation volume by controlling CB block sizes to fully use all available computing power. With CAKE, we can trade off increased internal memory usage for reduced external bandwidth requirements. From a technology viewpoint, relying on local memory is generally preferable since DRAM has relatively high latency and power consumption. Furthermore, CAKE explicitly specifies the required bandwidth and size of the internal memory.
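For intuition on how block aspect ratio governs arithmetic intensity, the sketch below uses the standard operation and data-movement counts for a dense m × k by k × n block multiply; it is a generic illustration of the mechanism, not CAKE's exact CB-block accounting.

# Hedged sketch: textbook arithmetic-intensity estimate for a dense block
# multiply, used only to show how block shape shifts the compute-to-IO balance.

def block_arithmetic_intensity(m: int, k: int, n: int) -> float:
    """FLOPs per matrix element moved for one m x k by k x n block multiply."""
    flops = 2 * m * k * n                    # multiply-add pairs
    elements_moved = m * k + k * n + m * n   # A tile + B tile + C tile
    return flops / elements_moved

# Two blocks of equal volume: the square block reuses each operand more,
# so it needs less external traffic per unit of computation.
print(block_arithmetic_intensity(512, 512, 512))   # ~341 FLOPs per element
print(block_arithmetic_intensity(64, 512, 4096))   # ~112 FLOPs per element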
Supported by analysis and measurements on real-world systems, CAKE can substantially improve computational throughput for MM on CPUs as a drop-in replacement for MM calls. In both theory and practice, CAKE advances the state of the art in MM computation.
ACKNOWLEDGMENTS
This work was supported in part by the Air Force Research Laboratory under award number FA8750-18-1-0112, a gift from MediaTek USA, and a Joint Development Project with TSMC. We thank Wei-Te Mark Ting for his comments and drawings. We also thank the anonymous reviewers whose comments helped improve and clarify this manuscript.
REFERENCES
[1] 2020. CUTLASS: Fast Linear Algebra in CUDA C++. https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/.
[2] 2021. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page.
[3] Accellera Systems Initiative Inc. 2016. SystemC Synthesis Subset Standard v1.4.7. Accellera Systems Initiative Inc. https://www.accellera.org/downloads/standards/systemc
[4] Arm Limited. 2021. Arm Performance Libraries Reference Guide. Arm Limited. https://developer.arm.com/documentation/101004/latest/
[5] Timo Bingmann. 2013. Parallel Memory Bandwidth Benchmark/Measurement. https://panthema.net/2013/pmbw/.
[6] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb. 2006. Die Stacking (3D) Microarchitecture. In 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06). 469–479. https://doi.org/10.1109/MICRO.2006.18
[7] Intel Corporation. 2021. Intel oneAPI Math Kernel Library. https://software.intel.com/content/www/us/en/develop/articles/mkl-reference-manual.html.
[8] Intel Corporation. 2021. Intel VTune Profiler. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html.
[9] Agner Fog. 2020. Intel's "cripple AMD" function. https://www.agner.org/optimize/blog/read.php?i=49.
[10] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). 751–764.
[11] S. Ghodrati, B. H. Ahn, J. Kyung Kim, S. Kinzer, B. R. Yatham, N. Alla, H. Sharma, M. Alian, E. Ebrahimi, N. S. Kim, C. Young, and H. Esmaeilzadeh. 2020. Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 681–697. https://doi.org/10.1109/MICRO50266.2020.00062
[12] Kazushige Goto and Robert A. van de Geijn. 2002. On Reducing TLB Misses in Matrix Multiplication. Technical Report TR-02-55 (2002).
[13] Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Trans. Math. Softw., Article 12 (2008), 25 pages.
[14] J. H. Huang. 2013. Keynote. (2013). NVIDIA GPU Technology Conference.
[15] Advanced Micro Devices Inc. 2021. AMD μProf. https://developer.amd.com/amd-uprof/.
[16] Intel Corporation. 2016. Intel oneAPI Deep Neural Network Library. Intel Corporation. https://oneapi-src.github.io/oneDNN/
[17] Norman P. Jouppi, Cliff Young, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. CoRR abs/1704.04760 (2017). arXiv:1704.04760 http://arxiv.org/abs/1704.04760
[18] H. Jun, J. Cho, K. Lee, H. Son, K. Kim, H. Jin, and K. Kim. 2017. HBM (High Bandwidth Memory) DRAM Technology and Architecture. In 2017 IEEE International Memory Workshop (IMW). 1–4. https://doi.org/10.1109/IMW.2017.7939084
[19] Brucek Khailany, Evgeni Khmer, et al. 2018. A Modular Digital VLSI Flow for High-Productivity SoC Design. In Proceedings of the 55th Annual Design Automation Conference (San Francisco, California) (DAC '18). Association for Computing Machinery, New York, NY, USA, Article 72, 6 pages. https://doi.org/10.1145/3195970.3199846
[20] Ashish Kumar, Saurabh Goyal, and Manik Varma. 2017. Resource-Efficient Machine Learning in 2 KB RAM for the Internet of Things. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML'17). JMLR.org, 1935–1944.
[21] H. T. Kung, B. McDanel, S. Q. Zhang, X. Dong, and C. C. Chen. 2019. Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays. In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Vol. 2160-052X. 42–50.
[22] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. (1991), 63–74. https://doi.org/10.1145/106972.106981
[23] Z. Li, Z. Wang, L. Xu, Q. Dong, B. Liu, C. I. Su, W. T. Chu, G. Tsou, Y. D. Chih, T. Y. J. Chang, D. Sylvester, H. S. Kim, and D. Blaauw. 2021. RRAM-DNN: An RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator. IEEE Journal of Solid-State Circuits 56, 4 (2021), 1105–1115.
[24] Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. 2020. MCUNet: Tiny Deep Learning on IoT Devices. arXiv:2007.10319 [cs.CV]
[25] Jason Lowe-Power, Mark D. Hill, and David A. Wood. 2016. When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads. arXiv:1608.07485 [cs.AR]
[26] Bradley McDanel, Sai Qian Zhang, H. T. Kung, and Xin Dong. 2019. Full-stack optimization for accelerating CNNs using powers-of-two weights with FPGA validation. In Proceedings of the ACM International Conference on Supercomputing. 449–460.
[27] Dennis Rich, Andrew Bartolo, Carlo Gilardo, Binh Le, Haitong Li, Rebecca Park, Robert Radway, Mohamed Sabry, H.-S. Philip Wong, and Subhasish Mitra. 2020. Heterogeneous 3D Nano-systems: The N3XT Approach? 127–151. https://doi.org/10.1007/978-3-030-18338-7_9
[28] Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Trans. Math. Software 41, 3 (June 2015), 14:1–14:33. http://doi.acm.org/10.1145/2764454
[29] T. Vogelsang. 2010. Understanding the Energy Consumption of Dynamic Random Access Memories. In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. 363–374. https://doi.org/10.1109/MICRO.2010.42
[30] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the Memory Wall: Implications of the Obvious. SIGARCH Comput. Archit. News 23, 1 (March 1995), 20–24. https://doi.org/10.1145/216585.216588
[31] Zhang Xianyi, Wang Qian, and Zaheer Chothia. 2012. OpenBLAS. URL: http://xianyi.github.io/OpenBLAS (2012), 88.