characterization and modeling of finite difference stencil
TRANSCRIPT
3/5/2012
Software & Services Group, Energy Engineering Team 1
> Optimization notice
Characterization and modeling of finite
difference stencil performance
Philippe Thierry, Leo Borges and Jamel Tayeb
Intel Corp
Perf projection from future kernels
Hypothesis:
Numerical schemes won't change much;
only the bytes/flops ratio and the memory needs will differ
[Figure: two timelines, Application and Computing System, each running through Present, Mid Term, Long Term, Future; side label: Tuning work]
How far is this goal …
[Figure: model levels: Simulator level, Core level, Socket level, Node level, Cluster level; the steps between levels are labeled Large upscaling, Small upscaling, Small upscaling, Large upscaling]
A global model should include information from simulator to cluster levels
Data sources: Application traces, Hardware counters, Communication & topology
• Impact of micro-architecture (IPC for example)
• Impact of BW, misses, SIMD, Affinity
• Impact of Interconnect BW and latency
T_total = f (all this information)
Multi level performance analysis
(in the case of time domain seismic imaging)
[Figure: nested components of the application: 3D FD kernel, Halo exchange, Boundary conditions, i/o scheme]
Analysis levels, from innermost to outermost:
• 3DFD kernel only
• The 3DFD kernel + Halo exchange
• The 3DFD kernel + Halo + BCs
• The 3DFD kernel + Halo + BCs + the io = the whole RTM application
Topics to consider:
• Core level optimization
• Affinity, NUMA effect
• Load balancing, domain decomposition
• Cluster and application architecture and topology
FWI ~ several RTM iterations
What results to look for
Gflops/s: Are we far from the peak of the algorithm and of the machine?
(Mpts/s) x (Flops/pt) -> (Flops/s)
GBytes/s: Are we bandwidth limited or latency limited?
Flops/Joule: Can we see the impact of any given implementation?
Total nb Ops per point per iteration 292
Total nb of point 2229969960.00000
Total (sec) = 9.598
Total/ite (sec) = 0.479
Speed 1: 232.313 M points / sec
Speed 2: 6.968 E-002 Points / cycle
Speed 3: 67.835 Gflops / sec
=> 42.390 % of peak, 1 threads, 6 MPIs
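The conversion chain (Mpts/s) x (Flops/pt) -> (Flops/s) can be checked against the log above with a small sketch. The function names are illustrative, and the peak used for the percentage (6 cores at 8 SP Flops/cycle and 3.33 GHz) is an assumption about how the tool normalizes its result, so the percentage is only expected to land close to the reported one.

```python
def gflops_from_points(mpts_per_sec: float, flops_per_point: float) -> float:
    """Convert a throughput in Mpoints/s into Gflops/s."""
    return mpts_per_sec * flops_per_point / 1000.0


def points_per_cycle(mpts_per_sec: float, freq_ghz: float) -> float:
    """Grid points updated per clock cycle at the given frequency."""
    return (mpts_per_sec * 1e6) / (freq_ghz * 1e9)


def percent_of_peak(gflops: float, peak_gflops: float) -> float:
    """Achieved Gflops/s as a percentage of a reference peak."""
    return 100.0 * gflops / peak_gflops


# Values copied from the log above.
speed_mpts = 232.313      # "Speed 1", M points / sec
ops_per_point = 292       # "Total nb Ops per point per iteration"

gflops = gflops_from_points(speed_mpts, ops_per_point)

# Assumption: 6 MPI ranks on 6 cores, 8 SP Flops/cycle, nominal 3.33 GHz.
assumed_peak = 8 * 6 * 3.33
pct = percent_of_peak(gflops, assumed_peak)

print(f"{gflops:.3f} Gflops/s")
print(f"{points_per_cycle(speed_mpts, 3.33):.4f} points/cycle")
print(f"{pct:.1f} % of the assumed peak")
```

The Gflops/s value reproduces "Speed 3" in the log; the points/cycle and percentage differ slightly in the last digits, presumably because the tool uses the measured rather than the nominal clock.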
By hand: get the max theoretical performance
With hardware counters: get approximate values for the computational kernel
• Flops / Int
• Read/write DRAM
Reminder about the peak Flops
In terms of « Flops » : Two 128 bits SIMD per cycle
4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle
2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle
Theoretical peak for a 2-sockets X5680 (6 cores @ 3.33GHz)
8 [Flops/cycle ]*12*3.33 [Gcycles/sec] = 319.68 [Gflops/sec] SP
4 [Flops/cycle ]*12*3.33 [Gcycles/sec] = 159.84 [Gflops/sec] DP
6 operations / cycle:
• 3 memory ops
• 3 computational operations
Nehalem Micro-u (OOO execution engine)
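The peak figures above follow from a single product; a minimal sketch, where the function name is illustrative and the inputs are the ones quoted on the slide (2 sockets x 6 cores, 3.33 GHz):

```python
def peak_gflops(flops_per_cycle: int, n_cores: int, freq_ghz: float) -> float:
    """Theoretical peak in Gflops/s: Flops/cycle x cores x Gcycles/s."""
    return flops_per_cycle * n_cores * freq_ghz


N_CORES = 2 * 6     # 2-socket X5680, 6 cores per socket
FREQ_GHZ = 3.33

sp_peak = peak_gflops(8, N_CORES, FREQ_GHZ)   # 4 MUL + 4 ADD, 32-bit
dp_peak = peak_gflops(4, N_CORES, FREQ_GHZ)   # 2 MUL + 2 ADD, 64-bit

print(f"SP peak: {sp_peak:.2f} Gflops/s")
print(f"DP peak: {dp_peak:.2f} Gflops/s")
```

Note that the 8 Flops/cycle figure assumes a perfectly balanced mix of multiplies and adds, which is why the kernel's own ADD/MUL ratio matters for the achievable peak.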
Max theoretical Flops/s by hand (isotropic kernel)
K : half stencil length
Total number of flops per point: nF = 7K + 5,
made of ADD: nA = 6K + 2 and MUL: nM = K + 3
-> max(nA, nM) is the limiting factor
Achievable peak (with infinite bandwidth) is the ratio of
the time for nF flops done at 8 Flops/cycle (2 SSE ports)
to the time for max(nA, nM) flops done at 4 Flops/cycle (1 SSE port):
%Peak(K) = 100.0 * nF / (2 * max(nA, nM))
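Spelled out for the isotropic kernel, the ratio works out as follows. The symbols t_peak and t_port are introduced here only for the derivation and do not appear on the slide; the two numerical evaluations are worked examples, not measured values.

```latex
\begin{aligned}
t_{\text{peak}} &= \frac{n_F}{8}, \qquad
t_{\text{port}} = \frac{\max(n_A, n_M)}{4}
\quad \text{(cycles per grid point)} \\
\%\mathrm{Peak}(K) &= 100 \cdot \frac{t_{\text{peak}}}{t_{\text{port}}}
 = 100 \cdot \frac{n_F}{2 \max(n_A, n_M)} \\
&= 100 \cdot \frac{7K + 5}{2\,(6K + 2)}
\quad \text{since } 6K + 2 > K + 3 \text{ for } K \ge 1 \\
K = 4 &: \quad 100 \cdot \frac{33}{52} \approx 63.5\,\% \\
K \to \infty &: \quad 100 \cdot \frac{7}{12} \approx 58.3\,\%
\end{aligned}
```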
Flops/s Wall Kernel specifications: FP operations
| Kernel | Nb of tables | Implementation | ADD | MUL | Total flops |
|---|---|---|---|---|---|
| Iso | 3 | | 6*K + 2 | K + 3 | 7*K + 5 |
| Iso, rho | 6 | | 36*K - 95 | (39/2)*K - 46 | (111/2)*K - 141 |
| VTI | 7 | Duveneck et al., 2011 | 6*K + 7 | 3*K + 8 | 9*K + 15 |
| VTI, rho | 10 | Duveneck et al., 2011 | 36*K - 90 | (39/2)*K - 42 | (111/2)*K - 132 |
| TTI | 24 | Fletcher et al., 2009 | 30*K + 38 | 18*K + 62 | 48*K + 100 |
| Visco | n/a | Komatisch et al., 2009 | n/a | n/a | n/a |
[Figure: Max % of SP peak as a function of kernel length; x-axis: half stencil length (K), 2 to 32; y-axis: % of peak Flops/s, 50 to 100; curves: iso, iso+rho, vti, vti+rho, tti]
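The ADD and MUL counts listed for each kernel can be turned into the "max % of SP peak" bound with a short sketch. The dictionary keys and function names are illustrative; the coefficients are copied from the table, and the formula is the one given for the isotropic kernel, assumed here to apply unchanged to the other kernels.

```python
from fractions import Fraction

# (slope, offset) of the ADD and MUL counts as linear functions of K.
KERNELS = {
    "iso":     ((6, 2),    (1, 3)),
    "iso_rho": ((36, -95), (Fraction(39, 2), -46)),
    "vti":     ((6, 7),    (3, 8)),
    "vti_rho": ((36, -90), (Fraction(39, 2), -42)),
    "tti":     ((30, 38),  (18, 62)),
}


def op_counts(kernel: str, k: int):
    """Return (n_add, n_mul) per grid point for half stencil length k."""
    (a_slope, a_off), (m_slope, m_off) = KERNELS[kernel]
    return a_slope * k + a_off, m_slope * k + m_off


def percent_of_peak(kernel: str, k: int) -> float:
    """Upper bound on % of peak when one SSE port limits throughput."""
    n_add, n_mul = op_counts(kernel, k)
    return float(100 * (n_add + n_mul) / (2 * max(n_add, n_mul)))


# The "rho" formulas go negative for very small K, so start at K = 4.
for name in KERNELS:
    row = ", ".join(f"K={k}: {percent_of_peak(name, k):5.1f}%" for k in (4, 8, 12))
    print(f"{name:8s} {row}")
```

As a cross-check, the TTI kernel at K = 4 gives 48*4 + 100 = 292 flops per point, the same count that appears in the run log earlier in the deck.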
Memory demand and BW wall
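The « ideal, bandwidth limited » nb of grid points per second, for a memory bandwidth B in GB/s and the isotropic kernel (4 accesses of 4 bytes per point), will be S_BW = B / (4 * 4) [Gpts/sec]
Ideal case
• 1 store « v »
• 3 loads « u, v, w »
Worst case
• 6*K + 3 loads
• 1 store

| Kernel | LOAD | STORE | Total (4-byte words) |
|---|---|---|---|
| Iso | 3 | 1 | 4 |
| Iso, rho | 6 | 4 | 10 |
| VTI | 7 | 2 | 9 |
| VTI, rho | 7 | 5 | 12 |
| TTI | 14 | 12 | 26 |
| Visco.E | n/a | n/a | n/a |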
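The bandwidth wall generalizes from the isotropic case to the other kernels by replacing the 4 accesses with each kernel's total word count. A minimal sketch, assuming single precision (4 bytes per word) and a hypothetical sustained bandwidth; the names and the 32 GB/s value are illustrative only, and the worst-case count is assumed to refer to the isotropic kernel.

```python
BYTES_PER_WORD = 4  # single precision

# (loads, stores) per grid point in the ideal case, from the table above.
IDEAL_ACCESSES = {
    "iso":     (3, 1),
    "iso_rho": (6, 4),
    "vti":     (7, 2),
    "vti_rho": (7, 5),
    "tti":     (14, 12),
}


def bytes_per_point(kernel: str) -> int:
    """Ideal DRAM traffic per grid point, in bytes."""
    loads, stores = IDEAL_ACCESSES[kernel]
    return (loads + stores) * BYTES_PER_WORD


def bw_limited_gpts(kernel: str, bandwidth_gbs: float) -> float:
    """S_BW: bandwidth-limited throughput in Gpts/s for B in GB/s."""
    return bandwidth_gbs / bytes_per_point(kernel)


def worst_case_words_iso(k: int) -> int:
    """Worst case (no cache reuse): 6*K + 3 loads and 1 store."""
    return (6 * k + 3) + 1


B = 32.0  # hypothetical sustained bandwidth in GB/s
for name in IDEAL_ACCESSES:
    print(f"{name:8s} {bw_limited_gpts(name, B):.3f} Gpts/s")
```

Comparing this S_BW with the flop-limited throughput (peak Flops/s divided by flops per point) indicates which of the two walls a kernel is expected to hit first.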
Real numbers when we can’t count anymore
Flops/s: collect FP related counters
(FP_COMP_OPS_EXE.x87
+ FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION
+ FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION
+ FP_COMP_OPS_EXE.SSE_FP_SCALAR) * 1E-09 / Elapsed_time
GB/s: collect uncore read and write events
(UNC_IMC_WRITES for each socket + UNC_IMC_READS for each socket) * 1E-09 * Cache line size / Elapsed_time
(Elapsed_time = CPU_CLK_UNHALTED / Processor Frequency / Nb of Cores)
+ Watts and energy as a function of time
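The two counter formulas can be sketched as plain arithmetic on already-collected event counts. How the counts are collected is outside the scope of this sketch; the function names and all sample counts are hypothetical, and the 64-byte cache line is an assumption. Because each counted event may represent more than one floating-point operation (packed SIMD, for instance), the result is an approximation, as the slide says.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def elapsed_seconds(cpu_clk_unhalted: float, freq_hz: float, n_cores: int) -> float:
    """Elapsed time = CPU_CLK_UNHALTED / processor frequency / nb of cores."""
    return cpu_clk_unhalted / freq_hz / n_cores


def fp_rate_giga(fp_event_counts: dict, elapsed_s: float) -> float:
    """Sum of the FP_COMP_OPS_EXE.* counts, scaled to 1e9 events per second."""
    return sum(fp_event_counts.values()) * 1e-9 / elapsed_s


def dram_gbytes_per_sec(imc_reads, imc_writes, elapsed_s: float) -> float:
    """Per-socket uncore read and write counts turned into GB/s."""
    lines = sum(imc_reads) + sum(imc_writes)
    return lines * 1e-9 * CACHE_LINE_BYTES / elapsed_s


# Hypothetical counts for a 10 second run on 12 cores at 3.33 GHz.
t = elapsed_seconds(cpu_clk_unhalted=3.33e9 * 12 * 10, freq_hz=3.33e9, n_cores=12)
fp_counts = {
    "FP_COMP_OPS_EXE.x87": 0,
    "FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION": 0,
    "FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION": 4.9e11,
    "FP_COMP_OPS_EXE.SSE_FP_SCALAR": 1.0e10,
}
rate = fp_rate_giga(fp_counts, t)
bw = dram_gbytes_per_sec(imc_reads=[2e9, 2e9], imc_writes=[1e9, 1e9], elapsed_s=t)
print(f"elapsed {t:.2f} s, {rate:.1f} G FP events/s, {bw:.1f} GB/s")
```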
Intel® Energy Checker SDK
Measure energy consumed for a workload
[Figure: components shown: Application, Energy Checker API, Energy Checker, TSRV, ESRV, Consumed Energy Counter(s), Temperature, RH, etc. Counter(s)]
• Instrument code
• Counters exposed through simple EC API
• API delivered as source code
• Collect data and compute Energy Efficiency (EE)
• EE = “Work” done / Energy Consumed
• Run time ability to report
• Analyze system productivity and the application's energy profile
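The energy-efficiency ratio EE = "Work" done / Energy Consumed can be illustrated without the SDK itself. The sketch below does not use the Energy Checker API; it assumes power samples are already available as (time, watts) pairs, integrates them with the trapezoidal rule, and divides a hypothetical amount of work by the result. All names and sample values are illustrative.

```python
def energy_joules(times_s, power_w):
    """Integrate power samples over time (trapezoidal rule) to get Joules."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_efficiency(work_done: float, energy_j: float) -> float:
    """EE = work done / energy consumed, e.g. Gflop per Joule."""
    return work_done / energy_j


# Hypothetical run: constant 300 W for 4 seconds, 240 Gflop of work.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [300.0, 300.0, 300.0, 300.0, 300.0]

joules = energy_joules(times, power)
ee = energy_efficiency(240.0, joules)
print(f"{joules:.0f} J, {ee:.2f} Gflop/Joule")
```

Any unit of "work" can be used in the numerator (grid points, iterations, flops), as long as the same unit is used when comparing implementations.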
3d-tti-fd: Observed versus calculated BW
[Figure: bandwidth (GB/s, 0 to 45) versus nb of mpi processes (0 to 14); curves: tti_no_cb 24th, tti_cb 24th, tti 24th Calc, Stream triad]
(weak scalability => same amount of work per process)
LLC misses(K) ~= (Observed memory demand - Calculated memory demand) / 4
Mean % of total LLC misses vs Total DRAM reads
[Figure: % of LLC misses (0 to 100) versus Total DRAM reads / 100 Inst. Ret. (0 to 6), 12 mpi processes; series: Iso CB, Iso no-CB, TTI CB, TTI no-CB, Visco; points labeled 8th, 16th, 24th; regions labeled Mem bound and CPU bound]
DRAM BW vs Total DRAM Reads + Writes
[Figure: memory BW (GB/s, 0 to 35) versus Total DRAM reads + writes / 100 Inst. Ret. (0 to 6), 12 mpi processes; series: Iso CB, Iso no-CB, TTI CB, TTI no-CB, Visco; points labeled 8th, 16th, 24th; regions labeled Mem bound and CPU bound]
B/F graph
[Figure: GBytes (0 to 100) versus GFlops (100 to 600); iso-lines at 0.5 Gpts/sec and 1 Gpts/sec; regions labeled Mem. bound and CPU bound; series: 8th TTI CB, 8th TTI no-CB, Stream / Linpack]
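One way to read a bytes-per-flop (B/F) graph is to compare a kernel's own B/F ratio with the machine balance, taken here as a sustained bandwidth divided by a sustained flop rate. The sketch below is an interpretation under stated assumptions, not the exact construction of the slide: the TTI counts (26 words and 48*K + 100 flops per point) come from the tables earlier in the deck, while the 40 GB/s bandwidth and the use of the theoretical SP peak in place of a Linpack measurement are hypothetical.

```python
def kernel_bytes_per_flop(words_per_point: int, flops_per_point: float,
                          bytes_per_word: int = 4) -> float:
    """Ideal DRAM bytes moved per floating-point operation."""
    return words_per_point * bytes_per_word / flops_per_point


def machine_bytes_per_flop(sustained_gbs: float, sustained_gflops: float) -> float:
    """Machine balance: bytes the system can deliver per flop it can execute."""
    return sustained_gbs / sustained_gflops


def classify(kernel_bf: float, machine_bf: float) -> str:
    """A kernel needing more bytes per flop than the machine offers is memory bound."""
    return "memory bound" if kernel_bf > machine_bf else "CPU bound"


def tti_flops(k: int) -> int:
    """Total flops per point of the TTI kernel for half stencil length k."""
    return 48 * k + 100


# Hypothetical machine: 40 GB/s sustained, 319.68 Gflops/s (SP peak).
balance = machine_bytes_per_flop(40.0, 319.68)

for k in (4, 8, 12, 16):
    bf = kernel_bytes_per_flop(26, tti_flops(k))
    print(f"TTI K={k:2d}: {bf:.3f} B/F -> {classify(bf, balance)}")
```

With ideal byte counts the flops per point grow with K while the bytes per point stay fixed, so longer stencils drift toward the CPU bound side; measured traffic can be higher than ideal and shift a kernel back toward the memory bound side.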
Impact on power consumption = f(elapsed time)
[Figure: power consumption (Watts, 0 to 350) versus wall clock time (sec, 0 to 500); series: 4th Visco, 8th TTI CB, 8th TTI no-CB, 16th TTI CB, 16th TTI no-CB]
Perf per Watt
Higher is better
[Figure: bar chart, scale 0 to 0.9; series: Mpts/sec, Gflops / Joule]
Visco Elastic Kernel: DRAM demand and power consumption
[Figure: time series versus wall clock time (sec, 0 to 250): Power (Watts, 0 to 300), Read DRAM BW and Write DRAM BW (GB/s, 0 to 35), G.Inst /sec (0 to 6)]
Conclusions
• Use the right metrics for performance measurements
• Know the max theoretical performance of your implementation
• Don’t forget system configuration and its impact on measurements
• Simple projections are useful
• The final goal of perf modelling must be clearly defined
- Short term optimization with current kernel and hardware
- Short / mid term extrapolation for future hardware
- Long term extrapolation with future kernels and future hardware
Acknowledgements
Thanks to Intel for permission to publish the results.
Thanks to Hugh Caffey for fruitful discussions.
“Prediction is very difficult, especially about the future”
by Niels Bohr, Physicist, 1885-1962
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20101101
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010. Intel Corporation.