Libraries and Their Performance
Frank V. Hale
Thomas M. DeBoni
NERSC User Services Group
Part I: Single Node Performance Measurement
• Use of hpmcount for measurement of total code performance
• Use of HPM Toolkit for measurement of code section performance
• Vector operations generally give better performance than scalar (indexed) operations
• Shared-memory, SMP parallelism can be very effective and easy to use
Demonstration Problem
• Compute an estimate of pi using random points in the unit square (the ratio of points falling inside the inscribed circle to the total number of points approaches pi/4)
• Use an input file with a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1; unformatted, 8-byte floating point numbers (1 gigabyte of data)
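The slides assume this 1 GB input file already exists. As a minimal sketch (not part of the original presentation; the program name, the use of the random_number intrinsic, and the record layout of one 8-byte value per unformatted record are assumptions chosen to match what the scalar read loop on the next slide expects), such a file could be generated like this:

      program genuniform
      ! Hypothetical generator for a runiform1.dat-style file:
      ! one 8-byte uniform random number per unformatted record.
      implicit none
      integer :: i
      integer, parameter :: npts = 134217728
      real(kind=8) :: r
      open(10,file="runiform1.dat",status="replace",form="unformatted")
      do i=1,npts
        call random_number(r)
        write(10)r
      enddo
      close(10)
      end program genuniform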
A first Fortran code
% cat estpi1.f
implicit none
integer i,points,circle
real*8 x,y
read(*,*)points
open(10,file="runiform1.dat",status="old",form="unformatted")
circle = 0
c repeat for each (x,y) data point: read and compute
do i=1,points
read(10)x
read(10)y
if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
enddo
write(*,*)"Estimated pi using ",points," points as ", . ((4.*circle)/points)
end
Compile and Run with hpmcount
% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit
Performance of first code
Points        Pi         Wall Clock (sec.)   Mflip/s
10            3.56000    0.055               0.007
100           3.36000    0.030               0.033
1,000         3.196000   0.038               0.189
10,000        3.15000    0.120               0.587
100,000       3.14700    0.936               0.748
1,000,000     3.14099    8.979               0.780
10,000,000    3.14199    89.194              0.785
Performance of first code
[Figure: wall clock (sec., log scale from 0.01 to 100) versus number of points (log scale from 10 to 10^7)]
Some Observations
• Performance is not very good at all: less than 1 Mflip/s (the peak is 1,500 Mflip/s per processor)
• Scalar approach to computation
• Scalar I/O mixed with scalar computation
Suggestions:
• Separate I/O from computation
• Use vector operations on dynamically allocated vector data structures
A second code, Fortran 90

% cat estpi2.f
implicit none
integer :: i, points, circle
integer, allocatable, dimension(:) :: ones
real(kind=8), allocatable, dimension(:) :: x,y
c dynamically allocated vector data structures
read(*,*)points
allocate (x(points))
allocate (y(points))
allocate (ones(points))
ones = 1
open(10,file="runiform1.dat",status="old",form="unformatted")
do i=1,points
read(10)x(i)
read(10)y(i)
enddo
circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
write(*,*)"Estimated pi using ",points," points as ", &
((4.*circle)/points)
end
Performance of second code
Points        Pi        Wall Clock (sec.)   Mflip/s
10            3.56000   0.090               0.004
100           3.36000   0.030               0.034
1,000         3.19000   0.039               0.197
10,000        3.15000   0.120               0.612
100,000       3.14700   0.967               0.755
1,000,000     3.14099   9.152               0.798
10,000,000    3.14199   91.170              0.801
Performance of second code
[Figure: wall clock (sec., log scale from 0.01 to 100) versus number of points (log scale from 10 to 10^7)]
Observations on Second Code
• Operations on whole vectors should be faster, but no real improvement in total code performance was observed.
• Suspect that most time is being spent on I/O.
• I/O is now separate from computation, so the code is easy to instrument in sections
Instrument code sections with HPM Toolkit
Four sections to be separately measured:
• Data structure initialization
• Read data
• Estimate pi
• Write output
Calls to f_hpmstart and f_hpmstop around each section.
Instrumented Code (1 of 2)
% cat estpi3.f
implicit none
integer :: i, points, circle
integer, allocatable, dimension(:) :: ones
real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
call f_hpminit(0,"Instrumented code")
call f_hpmstart(1,"Initialize data structures")
read(*,*)points
allocate (x(points))
allocate (y(points))
allocate (ones(points))
ones = 1
call f_hpmstop(1)
Instrumented Code (2 of 2)

call f_hpmstart(2,"Read data")
open(10,file="runiform1.dat",status="old",form="unformatted")
do i=1,points
read(10)x(i)
read(10)y(i)
enddo
call f_hpmstop(2)
call f_hpmstart(3,"Estimate pi")
circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
call f_hpmstop(3)
call f_hpmstart(4,"Write output")
write(*,*)"Estimated pi using ",points," points as ", &
((4.*circle)/points)
call f_hpmstop(4)
call f_hpmterminate(0)
end
Notes on Instrumented Code
• Entire executable code enclosed between f_hpminit and f_hpmterminate
• Code sections enclosed between f_hpmstart and f_hpmstop
• Descriptive text labels appear in output file(s)
Compile and Run with HPM Toolkit

% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit
Notes on Use of HPM Toolkit
• Must load module hpmtoolkit
• Need to include the header file f_hpm.h in the Fortran code, and give preprocessor directions to the compiler with -qsuffix
• Performance output is in a file named like perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id
• Message from the sample executable: libHPM output in perfhpm0000.21410
Comparison of Code Sections
Section             Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs   0.248               0.27     0.000
Read Data           89.933              99.02    0.000
Estimate Pi         0.641               0.71     114.327
Write Output        0.001               0.00     0.381
Total               90.823              100.00   0.801

(10,000,000 points)
Observations on Sections
• Optimization of the estimation of pi has little effect, because the code spends 99% of the time reading the data
• Can the I/O be optimized?
Reworking the I/O
• Whole array I/O versus scalar I/O
• The scalar I/O file (one number per record) is twice as big (8 bytes for the number, 8 bytes for the end-of-record marker)
• The whole array I/O file has only one end-of-record marker
• Only one call to the Fortran read routine is needed for whole array I/O: read(10)xy
• Some fancy array footwork is needed to sort out x(1), y(1), x(2), y(2), ... x(n), y(n) from the xy array:
  x = xy(1::2)
  y = xy(2::2)
Revised Data Structures and I/O

% cat estpi4.f
implicit none
integer :: i, points, circle
integer, allocatable, dimension(:) :: ones
real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
call f_hpminit(0,"Instrumented code")
call f_hpmstart(1,"Initialize data structures")
read(*,*)points
allocate (x(points))
allocate (y(points))
allocate (xy(2*points))
allocate (ones(points))
ones = 1
call f_hpmstop(1)
call f_hpmstart(2,"Read data")
open(10,file="runiform.dat",status="old",form="unformatted")
read(10)xy
x = xy(1::2)
y = xy(2::2)
call f_hpmstop(2)
Vector I/O Code Sections
Section             Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs   0.252               6.00     0.000
Read Data           3.162               75.34    0.000
Estimate Pi         0.771               18.37    94.053
Write Output        0.001               0.02     0.393
Total               4.197               100.00   15.4

(10,000,000 points)
Observations on New Sections
• The time spent reading the data was reduced from 89.9 seconds (scalar reads) to 3.16 seconds (whole-array read), a 96% reduction in I/O time.
• There was no performance penalty for the additional data structure complexity.
• I/O design can have very significant performance impacts!
• Total code performance measured with hpmcount is now 15.4 Mflip/s, roughly 20 times the 0.801 Mflip/s of the scalar I/O code.
Automatic Shared-Memory (SMP) Parallelization
• IBM Fortran provides a –qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node.
• The default number of threads is 16; the thread count is controlled by the OMP_NUM_THREADS environment variable (see the sketch below)
• Allows use of the SMP version of the ESSL library,
-lesslsmp
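As a small illustration (not from the original slides), the thread count can be set in the job script before the run; the value 4 here is only an example, and estpi5 is the SMP executable built on the next slide:

setenv OMP_NUM_THREADS 4
./estpi5 <estpi3.dat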
Compiler Options
• The source code is the same as the previous, vector operation example, estpi4.f
• Compiler options -qsmp and -lesslsmp enable automatic shared-memory parallelism (SMP)
• Compiler command line:
  xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f
SMP Code Sections
Section             Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs   0.534               10.87    0.000
Read Data           4.311               87.78    0.000
Estimate Pi         0.064               1.30     1,100 (up from 94)
Write Output        0.002               0.04     0.117
Total               4.911               100.00   15.4

(10,000,000 points)
Observations on SMP Code
• The computational section is now showing 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s for a 16-processor node.
• Computational section is now 12 times faster, with no changes to source code
• Recommendation: always use thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise.
• There are no explicit parallelism directives in the source code; all threading is within the library.
Too Many Threads Can Spoil Performance
• Each node has 16 processors, and usually having more threads than processors will not improve performance
[Figure: computation rate (Mflip/s, 0 to 1,200) versus number of threads (0 to 28)]
Sidebar: Cost of Misaligned Common Block
• User code with Fortran77 style common blocks may receive an innocuous warning:
1514-008 (W) Variable … is misaligned. This may affect the efficiency of the code.
• How much can this affect the efficiency of the code?
• Test: put arrays x and y in misaligned common, with a 1-byte character in front of them
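A minimal sketch of what such a test declaration might look like (an illustration only; the modified source is not shown on the original slides):

      ! Hypothetical misaligned common block: the leading 1-byte character
      ! pushes the real*8 arrays off their natural 8-byte alignment.
      character*1 pad
      real*8 x(10000000), y(10000000)
      common /cdata/ pad, x, y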
Potential Cost of Misaligned Common Blocks
• 10,000,000 points used for computing pi
• Properly aligned, dynamically allocated x and y used 0.064 seconds at 1,100 Mflip/s
• Misaligned, statically allocated x and y in a common block used 0.834 seconds at 88.4 Mflip/s
• The common block misalignment slowed the computation by a factor of about 12
Part I Conclusion
• hpmcount can be used to measure the performance of the total code
• HPM Toolkit can be used to measure the performance of discrete code sections
• Optimization effort must be focused effectively
• Fortran90 vector operations are generally faster than Fortran77 scalar operations
• Use of automatic SMP parallelization may provide an easy performance boost
• I/O may be the largest factor in “whole code” performance
• Misaligned common blocks can be very expensive
Part II: Comparing Libraries
• In the rich user environment on seaborg, there are many alternative ways to do the same computation
• The HPM Toolkit provides the tools to compare alternative approaches to the same computation
Dot Product Functions
• User coded scalar computation
• User coded vector computation
• Single processor ESSL ddot
• Multi-threaded SMP ESSL ddot
• Single processor IMSL ddot
• Single processor NAG f06eaf
• Multi-threaded SMP NAG f06eaf
Sample Problem
• Test the Cauchy-Schwarz inequality for N vectors of length N:
  (X•Y)^2 <= (X•X)(Y•Y)
• Generate 2N random numbers (array x2)
• Use the first N for X; (X•X) is computed once
• Vary the vector Y: for i=1,n, y = 2.0*x2(i:n+(i-1))
  (the first Y is 2X, the second Y is 2*x2(2:N+1), and so on)
• Compute (2*N)+1 dot products of length N
Instrumented Code Section for Dot Products
call f_hpmstart(1,"Dot products")
xx = ddot(n,x,1,x,1)
do i=1,n
y = 2.0*x2(i:n+(i-1))
yy = ddot(n,y,1,y,1)
xy = ddot(n,x,1,y,1)
diffs(i) = (xx*yy)-(xy*xy)
enddo
call f_hpmstop(1)
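The slides do not show how the stored differences are checked afterwards; a minimal sketch of such a verification step (an assumption, not part of the original code) might be:

! Hypothetical check: the Cauchy-Schwarz inequality held for every Y
! if each stored difference (X.X)(Y.Y) - (X.Y)**2 is non-negative.
if (minval(diffs) .ge. 0.d0) then
  write(*,*)"Cauchy-Schwarz inequality held for all ",n," trial vectors"
else
  write(*,*)"Inequality violated; most negative difference = ",minval(diffs)
endif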
Two User Coded Functions
real*8 function myddot(n,x,y)
integer :: i,n
real*8 :: x(n),y(n),dp
dp = 0.
do i=1,n
dp = dp + x(i)*y(i)   ! User scalar loop
enddo
myddot = dp
return
end

real*8 function myddot(n,x,y)
integer :: i,n
real*8 :: x(n),y(n)
myddot = sum(x*y)   ! User vector computation
return
end
Compile and Run User Functions
module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT-qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat
Compile and Run ESSL Versions
setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f
-lessl"
$FC -o libs1 libs1.f
./libs1 <libs.dat
setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp
-lesslsmp"
$FC -o libs1smp libs1.f
./libs1smp <libs.dat
Compile and Run IMSL Version
module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl
Compile and Run NAG Versions
module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag
module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3
-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6
-qsmp=omp -qnosave "
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64
First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   246                203       1.72
User Vector   249                201       1.74
ESSL          145                346       1.01
ESSL-SMP      408                123       2.85    (slowest)
IMSL          143                351       1.00    (fastest)
NAG           250                200       1.75
NAG-SMP       180                278       1.26
Comments on First Comparisons
• The best results, by just a little, were obtained using the IMSL library, with ESSL a close second
• Third best was the NAG-SMP routine, with benefits from multi-threaded computation
• The user coded routines and NAG were about 75% slower than the ESSL and IMSL routines. In general, library routines are highly optimized and better than user coded routines.
• The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (default is 16).
ESSL-SMP Performance vs. Number of Threads
• All for N=100,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS
[Figure: ddot performance (Mflip/s, 0 to 1,200) versus number of threads (0 to 20)]
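A scan like the one in the figure can be produced with a simple loop in the job script; this fragment is an illustration only (the particular thread counts are assumptions), reusing the libs1smp executable and libs.dat input file from the earlier slides:

foreach t (1 2 4 8 12 16)
  setenv OMP_NUM_THREADS $t
  echo "OMP_NUM_THREADS = $t"
  ./libs1smp <libs.dat
end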
Revised First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   246                203       4.9
User Vector   249                201       5.0
ESSL          145                346       2.9
ESSL-SMP      50                 1000      1.0     (fastest; 4 threads)
IMSL          143                351       2.9
NAG           250                200       5.0     (slowest)
NAG-SMP       180                278       3.6

Tuning the number of threads is very, very important for SMP codes!
Scaling up the Problem
• The first comparisons were for N=100,000 computing 200,001 dot products of vectors of length 100,000
• Second comparison for N=200,000 computes 400,001 dot products of vectors of length 200,000
• Increase computational complexity by a factor of 4.
Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   1090               183       2.17
User Vector   1180               169       2.35    (slowest)
ESSL          739                271       1.47
ESSL-SMP      503                398       1.00    (fastest)
IMSL          725                276       1.44
NAG           1120               179       2.23
NAG-SMP       864                231       1.72
Comments on Second Comparisons (N=200,000)
• Now the best results are from the ESSL-SMP library, with the default 16 threads
• The next best group is ESSL, IMSL and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine.
• The worst results were seen from NAG (single thread) and the user code routines.
What is the impact of the number of threads on the ESSL-SMP library performance? It is already the best.
ESSL-SMP Performance vs. Number of Threads
• All for N=200,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS
[Figure: ddot performance (Mflip/s, 0 to 1,600) versus number of threads (0 to 20)]
Revised Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   1090               183       7.5
User Vector   1180               169       8.1     (slowest)
ESSL          739                271       5.1
ESSL-SMP      146                1370      1.0     (fastest; 6 threads)
IMSL          725                276       5.0
NAG           1120               179       7.7
NAG-SMP       864                231       5.9
Scaling with Problem Size? (N1=100,000; N2=200,000; complexity ratio approx. 4)

Version       N2/N1 Wall Clock   N2/N1 Mflip/s
User Scalar   4.45               0.90
User Vector   4.75               0.84
ESSL          5.10               0.78
ESSL-SMP      2.92               1.37    (4 threads for N1; 6 threads for N2)
IMSL          5.07               0.79
NAG           4.48               0.90
NAG-SMP       4.80               0.83
Comments on Scaling Problem Size
• The ESSL-SMP performance, when tuned for the optimal number of threads, increased by almost 40% with the increased problem size.
• The untuned ESSL-SMP performance increased by a factor of 3.2 with the increased problem size.
• The user codes, ESSL, IMSL, NAG and NAG-SMP routines all showed 10%-22% decreases in performance with the larger problem size.
• It is not possible to determine, a priori, how the performance of different, functionally equivalent routines will scale with problem size.
Matrix Multiplication
• User coded scalar computation
• Fortran intrinsic matmul
• Single processor ESSL dgemm
• Multi-threaded SMP ESSL dgemm
• Single processor IMSL dmrrrr (32-bit)
• Single processor NAG f01ckf
• Multi-threaded SMP NAG f01ckf
Sample Problem
• Multiply two dense N-by-N matrices, A and B
• A(i,j) = i + j
• B(i,j) = j – i
• Output C(N,N) to verify result
Kernel of user matrix multiply
do i=1,n
do j=1,n
a(i,j) = real(i+j)
b(i,j) = real(j-i)
enddo
enddo
call f_hpmstart(1,"Matrix multiply")
do j=1,n
do k=1,n
do i=1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
call f_hpmstop(1)
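For comparison, a minimal sketch of how the library versions perform the same product through the standard BLAS-style dgemm interface (an illustration assuming a, b, and c are n-by-n real*8 arrays; the actual library drivers are not shown on the slides):

call f_hpmstart(1,"Matrix multiply")
! C = 1.0*A*B + 0.0*C through the BLAS/ESSL calling sequence
call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
call f_hpmstop(1)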
Comparison of Matrix Multiply (N1=5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   1,490              168       106     (slowest)
Intrinsic     1,477              169       106     (slowest)
ESSL          195                1,280     13.9
ESSL-SMP      14                 17,800    1.0     (fastest)
IMSL          194                1,290     13.8
NAG           195                1,280     13.9
NAG-SMP       14                 17,800    1.0     (fastest)
Observations on Matrix Multiply
• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance
• All the single processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor
• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries
Comparison of Matrix Multiply (N2=10,000)

Version     Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP    101                19,800    1.01
NAG-SMP     100                19,900    1.00

• Scaling with problem size (complexity increase approx. 8 times):

Version     Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP    7.2                  1.10
NAG-SMP     7.1                  1.12

Both ESSL-SMP and NAG-SMP showed 10% performance gains with the larger problem size.
Observations on Scaling
• Scaling of problem size was only done for the SMP libraries, to fit into reasonable times.
• Doubling N results in 8 times increase of computational complexity for dense matrix multiplication
• Performance actually increased for both routines for larger problem size.
ESSL-SMP Performance vs. Number of Threads
• All for N=10,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS
[Figure: dgemm performance (Mflip/s, 0 to 20,000) versus number of threads (0 to 36)]
Part II Conclusion
• The NERSC user environment provides a rich variety of mathematical libraries
• Performance can vary widely for the same computation, sometimes even for the same function name, from library to library; performance also varies with problem size and, for the SMP libraries, the number of threads
• It is not possible to know, a priori, which library will provide the best performance for a given function and problem size
• The HPM Toolkit provides a way to compare library routine performance and make informed choices
Part III: Moving to Multi-node Parallelism
• The examples so far have all been of single processor or multi-processor, shared-memory (SMP style) parallelism on a single 16 processor node
• The poe+ command is the multi-node equivalent of hpmcount, and poe+ can be used with MPI codes or multi-node, distributed memory parallel libraries such as PESSL and ScaLAPACK.
• poe+ is a perl script developed by David Skinner of the NERSC User Services Group which aggregates the hpmcount results for each distributed-memory process
Kernel of PESSL/ScaLAPACK matrix multiply
! Call PESSL library routine
call f_hpminit((me+1),"Instrumented code")
call f_hpmstart((me+1),"Matrix multiply")
call pdgemm('T','T',n,n,n,1.0d0, myA,1,1,ides_a, &
myB,1,1,ides_b,0.d0, &
myC,1,1,ides_c )
call f_hpmstop(me+1)
call f_hpmterminate(me+1)
Comments on PESSL/ScaLAPACK Code
• Although the kernel on the previous slide looks like a simple progression from the ESSL version, in practice there is a lot of work involved for new users in understanding PESSL/ScaLAPACK
• There are a number of data structure complexities which do not exist for the single-node libraries
• The “complete” matrix does not exist on any processor, but is block-cyclic distributed among processors
• There are additional parameters describing the processor geometry and the data distribution.
• New users should study the ScaLAPACK tutorial on the Web at http://www.netlib.org/scalapack/tutorial/
Prolog for PESSL/ScaLAPACK matrix multiply
! Initialize blacs processor grid
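! (prow and pcol, the processor grid dimensions, are assumed to be set
!  elsewhere before the blacs_gridinit call, e.g. 4 and 4 for the 16-task
!  runs described later; that setup is not shown on the slide)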
call blacs_pinfo (me,procs)
call blacs_get (0, 0, icontxt)
call blacs_gridinit(icontxt, 'R', prow, pcol)
call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)
More Prolog for PESSL/ScaLAPACK

! Construct local arrays
myArows = numroc(n, nb, myrow, 0, prow)
myAcols = numroc(n, nb, mycol, 0, pcol)
! Initialize local arrays
allocate(myA(myArows,myAcols))
allocate(myB(myArows,myAcols))
allocate(myC(myArows,myAcols))
do i=1,n
call g2l(i,n,prow,nb,iproc,myi)
if (myrow==iproc) then
do j=1,n
call g2l(j,n,pcol,nb,jproc,myj)
if (mycol==jproc) then
myA(myi,myj) = real(i+j)
myB(myi,myj) = real(i-j)
myC(myi,myj) = 0.d0
endif
enddo
endif
enddo
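The helper g2l (global-to-local index mapping) is called above but is not shown on the slides; a minimal sketch of what such a routine could look like, assuming the standard block-cyclic distribution with block size nb over np processes starting at process 0:

! Hypothetical global-to-local mapping, not part of the original slides:
! for 1-based global index i, return the owning process coordinate p and
! the 1-based local index il. (n is kept to match the call above but is
! not needed for the mapping itself.)
subroutine g2l(i, n, np, nb, p, il)
implicit none
integer :: i, n, np, nb, p, il
integer :: i0, blk
i0  = i - 1
blk = i0 / nb                    ! global block number
p   = mod(blk, np)               ! owning process row or column
il  = (blk / np) * nb + mod(i0, nb) + 1
return
end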
Still More Prolog for PESSL/ScaLAPACK
! Prepare array descriptors for PESSL (ScaLAPACK style)
ides_a(1) = 1 ! descriptor type
ides_a(2) = icontxt ! blacs context
ides_a(3) = n ! global number of rows
ides_a(4) = n ! global number of columns
ides_a(5) = nb ! row block size
ides_a(6) = nb ! column block size
ides_a(7) = 0 ! initial process row
ides_a(8) = 0 ! initial process column
ides_a(9) = myArows ! leading dimension of local array
do i=1,9
ides_b(i) = ides_a(i)
ides_c(i) = ides_a(i)
enddo
Compile Uninstrumented Codes and Run with poe+
setenv FC "mpxlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -bmaxdata:0x80000000 -bmaxstack:0x80000000 "
$FC -o ABCp -lblacs -lpessl ABCp.f
module load scalapack
$FC -o ABCs -qfree $PBLAS $BLACS $SCALAPACK -lessl ABCp.f

poe+ ./ABCp   ! PESSL version
poe+ ./ABCs   ! ScaLAPACK version
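The batch setup for these multi-node runs is not shown on the slides; as an illustrative sketch only (the keyword values below are assumptions patterned on the serial job scripts earlier in the talk), a four-node, 64-task run might be requested like this:

#@ class = debug
#@ job_type = parallel
#@ node = 4
#@ tasks_per_node = 16
#@ wall_clock_limit = 00:29:00
#@ output = jobABCp.out
#@ error = jobABCp.out
#@ queue
poe+ ./ABCp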
Four Runs for PESSL and ScaLAPACK Codes
• N=5000, 16 processors (one node) in 4x4 processor array
• N=10,000, 16 processors (one node) in 4x4 processor array
• N=5000, 64 processors (four nodes) in 8x8 processor array
• N=10000, 64 processors (four nodes) in 8x8 processor array
• Compare “whole code” performance using poe+ with “whole code” results for single-node ESSL-SMP routine using hpmcount.
• poe+ returns average wall clock time across all processes, and aggregate Mflip/s of all processes
Comparison of PESSL/ScaLAPACK dgemm (n=5000, 16 processors, "whole code" performance)

Section     Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL       28.3                8,850     1.30
ScaLAPACK   30.4                8,240     1.40

ESSL-SMP achieved 47% of the theoretical peak performance for one node; PESSL achieved 37%, and ScaLAPACK achieved 34%.
Comparison of PESSL/ScaLAPACK dgemm (n=10000, 16 processors, "whole code")

Section     Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL       141                 14,230    1.20
ScaLAPACK   160                 12,500    1.30

ESSL-SMP achieved 70% of the theoretical peak performance for one node; PESSL achieved 59%, and ScaLAPACK achieved 52%.
Comparison of PESSL/ScaLAPACK dgemm (n=5000, 64 processors, "whole code")

Section     Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL       15.3                16,400    0.70
ScaLAPACK   14.2                17,600    0.65

PESSL achieved 17% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 18%.
Comparison of PESSL/ScaLAPACK dgemm (n=10000, 64 processors, "whole code")

Section     Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL       51.5                38,900    0.43
ScaLAPACK   58.3                34,400    0.49

PESSL achieved 41% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 36%.
Comments on PESSL and ScaLAPACK Codes
• For problem sizes that fit within one node, the shared-memory, SMP libraries may give better performance than the distributed-memory, parallel libraries because of differences in data communication costs
• As the number of nodes and processors is increased, wall-clock time for distributed-memory libraries may drop below shared-memory SMP libraries for the same problem size, but per-processor efficiency may also drop.
• For problems which cannot fit in a node, the distributed-memory parallel libraries provide the best solution
Comments on using HPM Toolkit with PESSL and ScaLAPACK Codes
• HPM Toolkit generates two output files per task (one for statistics, one for visualization).
• Performance statistics for each task are found in files with names perfhpmNNNN.PPPPP where NNNN is the task id (or processor number), and PPPPP is the AIX process id
• Performance variations between processors and nodes can be observed.
PESSL dgemm results for Small Instrumented Section
• For N=5,000, 16 processors (one node), PESSL pdgemm:
  - average time of 16.9 seconds
  - aggregate 14,800 Mflip/s
  - 62% of the theoretical peak performance for a node
• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  - average time of 40.1 seconds
  - aggregate 50,000 Mflip/s
  - 52% of the theoretical peak performance for four nodes
Variability in PESSL dgemm Small Instrumented Section
• For N=5,000, 16 processors (one node), PESSL pdgemm:
  - wall clock for each processor varies from 16.4 to 17.4 sec
  - Mflip/s for each processor varies from 850 to 1,000
• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  - wall clock for each processor varies from 39.25 to 40.75 sec
  - Mflip/s for each processor varies from 730 to 830
PESSL dgemm Task Variation (n=5000, 16 processors)

[Figure: per-task Mflip/s (approx. 840 to 1,020) versus per-task wall clock (approx. 16.2 to 17.6 seconds)]
PESSL dgemm Task Variation (n=10000, 64 processors)

[Figure: per-task Mflip/s (approx. 720 to 840) versus per-task wall clock (approx. 39 to 41 seconds)]
Part III Conclusion
• NERSC provides a variety of distributed-memory, multi-node mathematical libraries (PESSL, ScaLAPACK and NAG Parallel).
• Performance of these libraries can be measured using “whole code” approaches with poe+, similar to hpmcount for single node codes
• The HPM Toolkit can be used to instrument small sections of code for more detailed analysis, including variation between tasks; but a number of output files are produced and must be analyzed by the user.
References
• Information on hpmcount and poe+ for whole code performance measurement is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/
• Detailed information about the HPM Toolkit for measuring performance of discrete code sections is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_2_4_2.html
• The list of mathematical libraries available on seaborg can be found on the NERSC Website at http://hpcf.nersc.gov/software/ibm/#mathlibs