Chapel: HPCC Benchmarks
Brad Chamberlain, Cray Inc.
CSEP 524, May 20, 2010
HPC Challenge (HPCC), Class 2: "most productive"

• Judged on: 50% performance, 50% elegance
• Four recommended benchmarks: STREAM, RA, FFT, HPL
• Use of library routines: discouraged

(There is also Class 1, "best performance"; Cray won 3 of 4 this year.)

Why you might care:
• many (correctly) downplay the Top500 as ignoring important things
• HPCC takes a step in the right direction and subsumes the Top500

Historically, the judges have "split the baby" for Class 2:
• 2005: tie: Cray (MTA-2) and IBM (UPC)
• 2006: overall: MIT (Cilk); performance: IBM (UPC); elegance: Mathworks (Matlab); honorable mention: Chapel and X10
• 2007: research: IBM (X10); industry: Int. Supercomp. (Python/Star-P)
• 2008: performance: IBM (UPC/X10); productive: Cray (Chapel), IBM (UPC/X10), Mathworks (Matlab)
• 2009: performance: IBM (UPC+X10); elegance: Cray (Chapel)
HPC Challenge: Chapel Entries (2008-2009)

| Benchmark | 2008 | 2009 | Improvement |
| --- | --- | --- | --- |
| Global STREAM | 1.73 TB/s (512 nodes) | 10.8 TB/s (2048 nodes) | 6.2x |
| EP STREAM | 1.59 TB/s (256 nodes) | 12.2 TB/s (2048 nodes) | 7.7x |
| Global RA | 0.00112 GUPS (64 nodes) | 0.122 GUPS (2048 nodes) | 109x |
| Global FFT | single-threaded, single-node | multi-threaded, multi-node | multi-node parallel |
| Global HPL | single-threaded, single-node | multi-threaded, single-node | single-node parallel |

All timings on ORNL Cray XT4:
• 4 cores/node
• 8 GB/node
• no use of library routines
HPCC STREAM and RA

STREAM Triad:
• compute a distributed scaled-vector addition, a = b + α·c, where a, b, c are vectors
• embarrassingly parallel
• stresses local memory bandwidth

Random Access (RA):
• make random xor-updates to a distributed table of integers
• stresses fine-grained communication and updates (in its purest form)
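Both kernels are small enough to state in a few lines. Here is an illustrative Python sketch of the two computations (not the HPCC reference code; the function names are mine):

```python
def stream_triad(b, c, alpha):
    """STREAM Triad: a(i) = b(i) + alpha * c(i), elementwise."""
    return [bi + alpha * ci for bi, ci in zip(b, c)]

def random_access(table, updates):
    """RA: xor each random value into the table entry its low bits select."""
    m = len(table)  # table size; a power of two in HPCC
    for r in updates:
        table[r % m] ^= r
    return table
```

In the real benchmarks the vectors and the table are distributed across locales; STREAM touches only local memory, while RA's updates land on arbitrary locales.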
Introduction to STREAM Triad
Given: m-element vectors A, B, C
Compute: for all i in 1..m, A(i) = B(i) + α·C(i)

[Figure: vectors A = B + α·C computed elementwise; in the parallel view the vectors are split into chunks and each chunk is computed independently.]
STREAM Triad in Chapel

```chapel
const ProblemSpace: domain(1, int(64)) = [1..m];

var A, B, C: [ProblemSpace] real;

forall (a, b, c) in (A, B, C) do
  a = b + alpha * c;
```
STREAM Triad in Chapel (distributed)

```chapel
const BlockDist = new Block1D(bbox=[1..m], tasksPerLocale=…);

const ProblemSpace: domain(1, int(64)) dmapped BlockDist = [1..m];

var A, B, C: [ProblemSpace] real;

forall (a, b, c) in (A, B, C) do
  a = b + alpha * c;
```
EP-STREAM in Chapel

Chapel's multiresolution design also permits users to code in an SPMD style like the MPI version:

```chapel
var localGBs: [LocaleSpace] real;

coforall loc in Locales do on loc {
  const myProblemSpace: domain(1, int(64)) =
    BlockPartition(ProblemSpace, here.id, numLocales);

  var myA, myB, myC: [myProblemSpace] real(64);

  const startTime = getCurrentTime();

  local {
    for (a, b, c) in (myA, myB, myC) do
      a = b + alpha * c;
  }

  const execTime = getCurrentTime() - startTime;

  localGBs(here.id) = timeToGBs(execTime);
}

const avgGBs = (+ reduce localGBs) / numLocales;
```
Experimental Platform

| machine characteristic | platform 1 | platform 2 |
| --- | --- | --- |
| model | Cray XT4 | Cray CX1 |
| location | ORNL | Cray Inc. |
| # compute nodes/locales | 7,832 | 8 |
| processor | 2.1 GHz AMD Opteron | 3 GHz Intel Xeon |
| # cores per locale | 4 | 2 × 4 |
| total usable RAM per locale (as reported by /proc/meminfo) | 7.68 GB | 15.67 GB |
| STREAM Triad problem size per locale | 85,985,408 | 175,355,520 |
| STREAM Triad memory per locale | 1.92 GB | 3.92 GB |
| STREAM Triad percent of available memory | 25.0% | 25.0% |
| RA problem size per locale | 2^28 | 2^29 |
| RA updates per locale | 2^19 | 2^24 |
| RA memory per locale | 2.0 GB | 4.0 GB |
| RA percent of available memory | 26.0% | 25.5% |
STREAM Performance: Chapel vs. MPI (2008)

[Figure: Performance of HPCC STREAM Triad on the Cray XT4. x-axis: number of locales (1 to 2048); y-axis: GB/s (0 to 14000). Series: 2008 Chapel Global with TPL=1..4 and MPI EP with PPN=1..4.]
STREAM Performance: Chapel vs. MPI (2009)

[Figure: Performance of HPCC STREAM Triad on the Cray XT4. x-axis: number of locales (local, 1 to 2048); y-axis: GB/s (0 to 14000). Series: 2008 Chapel Global TPL=1..4, MPI EP PPN=1..4, 2009 Chapel Global TPL=1..4, and Chapel EP TPL=4.]
Introduction to Random Access (RA)
Given: m-element table T (where m = 2^n and initially T(i) = i)
Compute: N_U random updates to the table using bitwise xor

Pictorially: for each random value r, xor r into T(r mod m) (e.g., xor the value 21 into T(21 mod m)); repeat N_U times. In the parallel version, many updates are issued to the distributed table at once.
Random Numbers
Not actually generated using lotto ping-pong balls! Instead, implement a pseudo-random stream:
• the kth random value can be generated at some cost
• given the kth random value, the (k+1)-st can be generated much more cheaply
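HPCC prescribes a particular 64-bit generator; the property described above can be illustrated with a generic LCG instead. A sketch (the constants are Knuth's MMIX LCG, an assumption, not HPCC's):

```python
M = 2 ** 64
A, C = 6364136223846793005, 1442695040888963407  # Knuth's MMIX LCG constants

def nth(seed, k):
    """kth value of the stream in O(log k) steps: compose the affine map
    f(x) = (A*x + C) % M with itself k times by repeated squaring."""
    a, c = 1, 0      # identity map
    sa, sc = A, C    # f itself
    while k:
        if k & 1:
            a, c = sa * a % M, (sa * c + sc) % M
        sa, sc = sa * sa % M, (sa * sc + sc) % M
        k >>= 1
    return (a * seed + c) % M

def next_val(x):
    """(k+1)-st value from the kth: one multiply-add."""
    return (A * x + C) % M
```

This is exactly the property the benchmark needs: each task jumps to its starting point in the stream once, then generates successive values cheaply.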
Conflicts
When a conflict occurs, an update may be lost; a certain number of these are permitted.
Batching
To amortize communication overheads at lower node counts, up to 1024 updates may be precomputed per process before making any of them.
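One way to picture batching is to bucket a precomputed batch of updates by owning locale, so each destination receives one message per batch instead of one per update. A Python sketch with made-up sizes (not the MPI reference code):

```python
from collections import defaultdict

TABLE_SIZE = 1024   # 2**10 entries, block-distributed (hypothetical)
NUM_LOCALES = 4

def bucketize(updates):
    """Group a batch of precomputed updates by the locale that owns
    the target table entry (assuming an even block distribution)."""
    per_locale = TABLE_SIZE // NUM_LOCALES
    buckets = defaultdict(list)
    for r in updates:
        buckets[(r % TABLE_SIZE) // per_locale].append(r)
    return buckets
```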
RA Declarations in Chapel

```chapel
const TableDist = new Block1D(bbox=[0..m-1], tasksPerLocale=…),
      UpdateDist = new Block1D(bbox=[0..N_U-1], tasksPerLocale=…);

const TableSpace: domain(1, uint(64)) dmapped TableDist = [0..m-1],
      Updates: domain(1, uint(64)) dmapped UpdateDist = [0..N_U-1];

var T: [TableSpace] uint(64);
```
RA Computation in Chapel

```chapel
forall (_, r) in (Updates, RAStream()) do
  on T(r & indexMask) do
    T(r & indexMask) ^= r;
```

RAStream() yields the pseudo-random values r0, r1, r2, …; each update executes on the locale that owns T(r & indexMask).
RA Performance: Chapel (2009)

[Figure: Performance of HPCC Random Access on the Cray XT4. x-axis: number of locales (1 to 2048); y-axis: GUP/s (0 to 0.14). Series: Chapel with TPL=1, 2, 4, 8.]
RA Efficiency: Chapel vs. MPI (2009)

[Figure: Efficiency of HPCC Random Access on 32+ locales (Cray XT4). x-axis: number of locales (32 to 2048); y-axis: % efficiency (of scaled Chapel TPL=4 local GUP/s), 0% to 7%. Series: Chapel TPL=1/2/4/8, MPI PPN=4, MPI No Buckets PPN=4, MPI+OpenMP TPN=4.]
HPL Notes
Block-Cyclic Distribution

[Figure: a 2D index space starting at (1,1), distributed BlockCyclic(start=(1,1), blksize=4) across six locales L0..L5 arranged 2 × 3; 4 × 4 blocks are dealt out round-robin in both dimensions.]
Block-Cyclic Distribution Notes:
• at its extremes, Block-Cyclic is:
  • the same as Cyclic (when blkSize == 1)
  • similar to Block: the same when things divide evenly; slightly different when they don't (the last locale will own more or less than blkSize)

Benefits relative to Block and Cyclic:
• if work isn't well load-balanced across a domain (and is spatially based), likely to result in better balance across locales than Block
• provides nicer locality than Cyclic (locales own blocks rather than singletons)

Also:
• a good match for algorithms that are block-structured in nature, like HPL; typically the distribution's blkSize will be set to the algorithm's block size
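The index-to-locale mapping can be sketched in a few lines of Python (illustrative only; Chapel's BlockCyclic handles the general case):

```python
def block_cyclic_owner(i, j, blk=4, grid=(2, 3), start=(1, 1)):
    """Owner of index (i, j) when blk x blk blocks are dealt
    round-robin over a 2 x 3 grid of locales L0..L5."""
    bi = (i - start[0]) // blk % grid[0]
    bj = (j - start[1]) // blk % grid[1]
    return bi * grid[1] + bj   # locale number, 0..5
```

With blk == 1 this degenerates to Cyclic; larger blocks give each locale contiguous chunks, which is the locality benefit noted above.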
HPL Overview
Category: dense linear algebra
Computation:
• compute the L-U factorization of a matrix A, where L is a lower-triangular matrix, U is an upper-triangular matrix, and LU = A
• in order to solve Ax = b: solving Ax = b is easier using these triangular matrices

[Figure: Ax = b, with A factored as A = LU, so LUx = b can be solved with two triangular solves.]
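Why triangular systems are easier: each unknown can be eliminated in turn by substitution. A minimal Python sketch of the two triangular solves (pivoting not shown):

```python
def forward_sub(L, b):
    """Solve L*y = b for lower-triangular L by forward substitution."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def backward_sub(U, y):
    """Solve U*x = y for upper-triangular U by backward substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x
```

Given A = LU, solving Ax = b becomes the pair Ly = b followed by Ux = y.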
HPL Overview (continued)
Approach: block-based recursive algorithm
Details:
• pivot (swap rows of the matrix and vectors) to maintain numerical stability
• store b adjacent to A for convenience and ease of pivoting
• reuse A's storage to represent L and U

[Figure: Ab stores the matrix A with the vector b as an extra column; after factorization the same storage holds L below the diagonal and U on and above it.]
HPL Configs

```chapel
// matrix size and blocksize
config const n = computeProblemSize(numMatrices, elemType, rank=2,
                                    memFraction=2, retType=indexType),
             blkSize = 5;

// error tolerance for verification
config const epsilon = 2.0e-15;

// standard random initialization stuff
config const useRandomSeed = true,
             seed = if useRandomSeed then SeedGenerator.currentTime else 31415;

// standard knobs for controlling printing
config const printParams = true,
             printArrays = false,
             printStats = true;
```
HPL Distributions and Domains

```chapel
const BlkCycDst = new dmap(new BlockCyclic(start=(1,1), blkSize=blkSize));

const MatVectSpace: domain(2, indexType) dmapped BlkCycDst = [1..n, 1..n+1],
      MatrixSpace = MatVectSpace[.., ..n];

var Ab : [MatVectSpace] elemType,  // the matrix A and vector b
    piv: [1..n] indexType,         // a vector of pivot values
    x  : [1..n] elemType;          // the solution vector, x

var A => Ab[MatrixSpace],  // an alias for the matrix part of Ab
    b => Ab[.., n+1];      // an alias for the last column of Ab
```

[The original slides step through these declarations, highlighting MatVectSpace, MatrixSpace, Ab, A, b, piv, and x in turn.]
HPL Callgraph
main()
• initAB()
• LUFactorize()
  • panelSolve()
  • updateBlockRow()
  • schurComplement()
    • dgemm()
• backwardSub()
• verifyResults()
main()

```chapel
initAB(Ab);

const startTime = getCurrentTime();

LUFactorize(n, Ab, piv);
x = backwardSub(n, A, b);

const execTime = getCurrentTime() - startTime;

const validAnswer = verifyResults(Ab, MatrixSpace, x);
printResults(validAnswer, execTime);
```
LUFactorize
• the main loop marches down the block diagonal
• each iteration views the matrix as four areas around the current diagonal block: tl (top-left), tr (top-right), bl (bottom-left), br (bottom-right); l denotes tl and bl together
• as the computation proceeds, these four areas shrink, so Block-Cyclic is more appropriate than Block

[Figure: Ab partitioned into tl, tr, bl, and br around the current diagonal block, with the partition moving down and to the right each iteration.]
LUFactorize

```chapel
def LUFactorize(n: indexType,
                Ab: [1..n, 1..n+1] elemType,
                piv: [1..n] indexType) {
  const AbD = Ab.domain;  // alias Ab.domain to save typing
  piv = 1..n;

  for blk in 1..n by blkSize {
    const tl = AbD[blk..#blkSize, blk..#blkSize],
          tr = AbD[blk..#blkSize, blk+blkSize..],
          bl = AbD[blk+blkSize.., blk..#blkSize],
          br = AbD[blk+blkSize.., blk+blkSize..],
          l  = AbD[blk.., blk..#blkSize];

    panelSolve(Ab, l, piv);
    if (tr.numIndices > 0) then
      updateBlockRow(Ab, tl, tr);
    if (br.numIndices > 0) then
      schurComplement(Ab, blk);
  }
}
```
What does each kernel use?
• panelSolve(): the panel l, plus all of Ab (rows are swapped across the whole matrix when pivoting)
• updateBlockRow(): tl and tr
• schurComplement(): bl, tr, and br
panelSolve

```chapel
panelSolve(Ab, l, piv);

def panelSolve(Ab: [] ?t,
               panel: domain(2, indexType),
               piv: [] indexType) {
  const pnlRows = panel.dim(1),
        pnlCols = panel.dim(2);

  assert(piv.domain.dim(1) == Ab.domain.dim(1));

  if (pnlCols.length == 0) then return;

  for k in pnlCols {
    // iterate through the columns of the panel
    ...
  }
}
```
panelSolve iterates over the columns of the panel serially. For each column k:
1. find the value with the largest magnitude in the column (the pivot value)
2. swap that row with the top row of the column, across the whole Ab matrix
3. scale the rest of that column by the pivot value
4. update the remaining panel entries
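The four steps translate directly into a serial Python sketch (illustrative; the Chapel version on the next slide restricts the search and update to the distributed panel):

```python
def panel_solve(Ab, cols, piv):
    """Partial-pivoting panel factorization over the given columns."""
    n = len(Ab)
    for k in cols:
        # 1. find the row (>= k) with the largest magnitude in column k
        pivot_row = max(range(k, n), key=lambda i: abs(Ab[i][k]))
        pivot = Ab[pivot_row][k]
        if pivot == 0:
            raise ValueError("matrix cannot be factorized")
        # 2. swap that row with row k, across the whole matrix
        piv[k], piv[pivot_row] = piv[pivot_row], piv[k]
        Ab[k], Ab[pivot_row] = Ab[pivot_row], Ab[k]
        # 3. scale the rest of the column by the pivot value
        for i in range(k + 1, n):
            Ab[i][k] /= pivot
        # 4. update the trailing entries
        for i in range(k + 1, n):
            for j in range(k + 1, len(Ab[0])):
                Ab[i][j] -= Ab[i][k] * Ab[k][j]
    return Ab
```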
panelSolve (loop body)

```chapel
var col = panel[k.., k..k];
if col.dim(1).length == 0 then
  return;

const (_, (pivotRow, _)) = maxloc reduce (abs(Ab(col)), col),
      pivot = Ab[pivotRow, k];

piv[k] <=> piv[pivotRow];
Ab[k, ..] <=> Ab[pivotRow, ..];

if (pivot == 0) then
  halt("Matrix can not be factorized");

if k+1 <= pnlRows.high then
  Ab(col)[k+1.., k..k] /= pivot;

if k+1 <= pnlRows.high && k+1 <= pnlCols.high then
  forall (i,j) in panel[k+1.., k+1..] do
    Ab[i,j] -= Ab[i,k] * Ab[k,j];
```
updateBlockRow
• iterate over the rows of tr, serially
• accumulate into each value the product of its predecessors from tl and previous rows

[Figure: blocks tl and tr; entry (i, j) of tr accumulates the products Ab[i, k] * Ab[k, j] for the rows k above row i.]
```chapel
if (tr.numIndices > 0) then
  updateBlockRow(Ab, tl, tr);

def updateBlockRow(Ab: [] ?t, tl: domain(2), tr: domain(2)) {
  const tlRows = tl.dim(1),
        tlCols = tl.dim(2),
        trRows = tr.dim(1),
        trCols = tr.dim(2);

  assert(tlCols == trRows);

  for i in trRows do
    forall j in trCols do
      for k in tlRows.low..i-1 do
        Ab[i, j] -= Ab[i, k] * Ab[k, j];
}
```
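In effect, updateBlockRow applies the unit-lower-triangular part of tl to the block row tr. The same accumulation in plain Python (a sketch, not the distributed version):

```python
def update_block_row(Ab, tl_rows, tr_cols):
    """For each entry of the block row, subtract the products of its
    predecessors from tl and the rows above it."""
    for i in tl_rows:
        for j in tr_cols:
            for k in range(tl_rows[0], i):
                Ab[i][j] -= Ab[i][k] * Ab[k][j]
```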
updateBlockRow w/ distribution

[Figure: Ab distributed Block-Cyclic over locales L0..L5. The tl block is needed wherever tr lives, so TL is replicated: logically a single block, physically a copy on each locale that owns part of tr.]
schurComplement
• accumulate into each block in br the product of its corresponding blocks from bl and tr

[Figure: blocks bl, tr, and br; each br block accumulates a bl-block times tr-block product.]
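The Schur complement update in Python terms: walk br in blk × blk blocks, and let each block accumulate the negated product of its corresponding bl and tr blocks (a sketch with plain lists, mirroring the dgemm call below):

```python
def schur_complement(br, bl, tr, blk):
    """Blockwise update: br -= bl @ tr, processed in blk x blk tiles
    of br (each tile is an independent dgemm in the real code)."""
    n_r, n_c = len(br), len(br[0])
    for r0 in range(0, n_r, blk):
        for c0 in range(0, n_c, blk):
            for i in range(r0, min(r0 + blk, n_r)):
                for j in range(c0, min(c0 + blk, n_c)):
                    for k in range(len(tr)):
                        br[i][j] -= bl[i][k] * tr[k][j]
```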
schurComplement w/ distribution

[Figure: Ab distributed over locales L0..L5; the bl column and the tr row are replicated. Logically there is one copy of each; physically, a copy lands on every locale that needs it.]
schurComplement

```chapel
if (br.numIndices > 0) then
  schurComplement(Ab, blk);

def schurComplement(Ab: [1..n, 1..n+1] elemType, ptOp: indexType) {
  const AbD = Ab.domain;
  const ptSol = ptOp + blkSize;

  const replAD: domain(2) = AbD[ptSol.., ptOp..#blkSize],
        replBD: domain(2) = AbD[ptOp..#blkSize, ptSol..];

  const replA: [replAD] elemType = Ab[ptSol.., ptOp..#blkSize],
        replB: [replBD] elemType = Ab[ptOp..#blkSize, ptSol..];

  forall (row, col) in AbD[ptSol.., ptSol..] by (blkSize, blkSize) {
    local {
      const aBlkD = replAD[row..#blkSize, ptOp..#blkSize],
            bBlkD = replBD[ptOp..#blkSize, col..#blkSize],
            cBlkD = AbD[row..#blkSize, col..#blkSize];

      dgemm(aBlkD.dim(1).length, aBlkD.dim(2).length, bBlkD.dim(2).length,
            replA(aBlkD), replB(bBlkD), Ab(cBlkD));
    }
  }
}
```
dgemm

```chapel
def dgemm(p: indexType,  // number of rows in A
          q: indexType,  // number of cols in A, number of rows in B
          r: indexType,  // number of cols in B
          A: [1..p, 1..q] ?t,
          B: [1..q, 1..r] t,
          C: [1..p, 1..r] t) {
  for i in 1..p do
    for j in 1..r do
      for k in 1..q do
        C[i,j] -= A[i, k] * B[k, j];
}
```