TRANSCRIPT
Vectorization
Kirill Rogozhin
Intel
Copyright © 2015 Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Motivation
The “Free Lunch” Is Over, Really
Processor clock rate growth halted around 2005.
Source: © 2014, James Reinders, Intel, used with permission
Moore’s Law Is STILL Going Strong
Hardware performance potential continues to grow.
“We think we can continue Moore's Law for at least another 10 years."
Intel Senior Fellow Mark Bohr, 2015
[Chart: Processor scaling trends, 1980–2010, relative scaling on a log axis: Transistors, Clock, Power, Performance, and Performance/W.]
Intel® Xeon® processor generations:

                  64-bit  5100   5500   5600   Sandy      Ivy        Haswell  Future
                                                Bridge EP  Bridge EP  EP       Xeon
Core(s)           1       2      4      6      8          12         18       >18
Threads           2       2      8      12     16         24         36       >36
SIMD width (bits) 128     128    128    128    256        256        256      512

Intel® Xeon Phi™:

                  Knights Corner (coprocessor)  Knights Landing¹ (processor & coprocessor)
Core(s)           61                            70+
Threads           244                           280+
SIMD width (bits) 512                           512

*Product specification for launched and shipped products available on ark.intel.com.
1. Not launched or in planning.

More cores. More threads. Wider vectors.
High performance software must exploit both:
• Threading parallelism
• Vector data parallelism
Untapped Potential Can Be Huge!
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Configurations for Binomial Options SP at the end of this presentation.
The Difference Is Growing With Each New Generation of Hardware
Mandelbrot: ~2000x Speedup on Xeon Phi™ --Isn’t it Cool?
#pragma omp declare simd uniform(max_iter), simdlen(32)
uint32_t mandel(fcomplex c, uint32_t max_iter)
{
    uint32_t count = 1;
    fcomplex z = c;
    while ((cabsf(z) < 2.0f) && (count < max_iter)) {
        z = z * z + c;
        count++;
    }
    return count;
}

#pragma omp parallel for schedule(guided)
for (int32_t y = 0; y < ImageHeight; ++y) {
    float c_im = max_imag - y * imag_factor;
    #pragma omp simd safelen(32)
    for (int32_t x = 0; x < ImageWidth; ++x) {
        fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
        count[y][x] = mandel(in_vals_tmp, max_iter);
    }
}

Intel Xeon Phi™ system, Linux64, 61 cores running 244 threads at 1 GHz, 32 KB L1, 512 KB L2 per core. Intel C/C++ Compiler internal build.
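The kernel above leans on Intel-specific pieces (the fcomplex type and the 1.0iF imaginary constant). For readers who want to experiment, here is a portable scalar sketch of the same iteration using plain floats; mandel_scalar is an illustrative name, not the slide's function:

```c
#include <stdint.h>

/* Scalar Mandelbrot iteration count: z = z*z + c, stopping when
   |z| >= 2 or max_iter is reached. Uses |z|^2 < 4 to avoid a sqrt. */
static uint32_t mandel_scalar(float c_re, float c_im, uint32_t max_iter)
{
    float z_re = c_re, z_im = c_im;   /* z starts at c, as on the slide */
    uint32_t count = 1;
    while (z_re * z_re + z_im * z_im < 4.0f && count < max_iter) {
        float t = z_re * z_re - z_im * z_im + c_re; /* Re(z*z + c) */
        z_im = 2.0f * z_re * z_im + c_im;           /* Im(z*z + c) */
        z_re = t;
        count++;
    }
    return count;
}
```

A point inside the set (c = 0) runs to max_iter; a point far outside (c = 2 + 2i) escapes on the first test.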
Untapped Potential Can Be Huge!
Many codes are still here
Don’t use a single vector lane/thread!
Un-vectorized and un-threaded software will underperform.
Permission to Design for All Lanes
Threading and vectorization are both needed to fully utilize modern hardware.
Vector SIMD parallelism, vectorization.
Vector Processing
[Diagram: VL lanes, each computing Ci = Ai + Bi in parallel.]
Cumulative (approx.) # of vector instructions:

  SSE   SSE2  SSE3  SSSE3  SSE4  SSE4.2  AVX   AVX2  AVX-512
  62    294   318   378    471   485     1109  1557  ~3800

  Nehalem (2008), Sandy Bridge (2011), Haswell (2013), Next Xeon/KNL (2015+)

How can customers use these new instructions?
Why SIMD vector parallelism?
Vectorization of Code

for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];

Scalar: one add per iteration, a[i] + b[i] -> c[i].
Vectorized (8-wide): a[i..i+7] + b[i..i+7] -> c[i..i+7] in a single instruction.
Vector data operations: data operations done in parallel

void v_add(float *c,
           float *a,
           float *b)
{
    for (int i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
}
Scalar processing: one element per loop iteration.

Loop:
1. LOAD  a[i] -> Ra
2. LOAD  b[i] -> Rb
3. ADD   Ra, Rb -> Rc
4. STORE Rc -> c[i]
5. ADD   i + 1 -> i

Vector processing: VL elements per loop iteration (here VL = 4).

Loop:
1. LOADv4  a[i:i+3] -> Rva
2. LOADv4  b[i:i+3] -> Rvb
3. ADDv4   Rva, Rvb -> Rvc
4. STOREv4 Rvc -> c[i:i+3]
5. ADD     i + 4 -> i

We call this “vectorization”.
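Both instruction streams come from the same C source; whether the compiler may emit the vector form often hinges on proving the arrays don't overlap. A minimal sketch of v_add written to help the auto-vectorizer; the restrict qualifiers and the explicit length parameter n are illustrative additions, not on the slide:

```c
#include <stddef.h>

/* Element-wise add. `restrict` promises the compiler that c, a, and b
   do not alias, so it is free to emit the 4-wide vector loop above. */
void v_add(float *restrict c, const float *restrict a,
           const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Built at -O2 or higher, icc, gcc, and clang typically report this loop as vectorized.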
Intel® SSE and AVX-128 Data Types

SSE:  4x floats
SSE2: 2x doubles
      16x bytes
      8x 16-bit shorts
      4x 32-bit integers
      2x 64-bit integers
      1x 128-bit(!) integer
AVX-256 Data Types

Intel® AVX:  8x floats
             4x doubles
Intel® AVX2: 32x bytes
             16x 16-bit shorts
             8x 32-bit integers
             4x 64-bit integers
             2x 128-bit(!) integers
AVX-512 Data Types

AVX-512: 16x floats
         8x doubles
         16x 32-bit integers
         ...
16x DP speed-up over scalar, 8x DP speed-up over SSE, with Intel® Advanced Vector Extensions 512 (AVX-512)

Higher performance for the most demanding computational tasks:
- Significant leap to 512-bit SIMD support for processors
- Intel® Compilers and Intel® Math Kernel Library include AVX-512 support
- Strong compatibility with AVX
- Added EVEX prefix enables additional functionality
- Appears first in the Intel® Xeon Phi™ coprocessor code-named Knights Landing
Leadership Today and Tomorrow: Multicore + Vector

[Quadrant diagram: serial vs. threaded on one axis, scalar vs. vector on the other; peak performance requires the threaded + vector quadrant.]

Intel® Xeon® processors:
- Parallel, fast serial
- Most commonly used parallel processor*

Next-generation Intel® Xeon Phi™ (Knights Landing):
- Many core, with support for 512-bit vectors
- Higher memory bandwidth
- Common SW programming
- Targeted and optimized for highly-vectorizable, parallel apps

Single source code, common optimization.

*Based on highest volume CPU in the IDC HPC Qview Q1’13
Knights Landing Architectural Summary

Diagram is for conceptual purposes only and only illustrates a CPU and memory; it is not to scale.

- Up to 72 cores in a 2D mesh architecture; tiles of 2 cores share a 1 MB L2 hub
- 2x 512-bit VPUs (Vector Processing Units) per core
- Over 3 TF DP peak
- Full Xeon ISA compatibility through AVX-512
- ~3x single-thread performance compared to Knights Corner
- Up to 16 GB high-bandwidth on-package memory (MCDRAM), exposed as a NUMA node, ~500 GB/s sustained BW
- 6 channels DDR4, up to 384 GB
- PCIe Gen3 x36; DMI to Wellsburg PCH (common with Grantley)
- 2 ports Storm Lake integrated fabric (HFI), on-package, 50 GB/s bi-directional, micro-coax cable (IFP)
- Core based on the Intel® Atom™ Silvermont processor with many HPC enhancements: deep out-of-order buffers, gather/scatter in hardware, improved branch prediction, 4 threads/core, high cache bandwidth, and more
(Re-cap) Parallel programming for multicore and manycore processors
[Diagram: three nested parallelism layers, A, B, and C.]
How could we program these parallel machines?

A “Three Layer Cake” (layers A, B, C) abstracts common hybrid parallelism programming approaches.
How could we program these parallel machines? Implementing the cake:

A – MPI, tbb::flow, PGAS
B – OpenMP 4.x, Cilk Plus, TBB
C – OpenMP 4.x, Cilk Plus

Programming models and software tools: Cluster Edition, Professional Edition.
How could we program these parallel machines?

• Different methods exist
• OpenMP 4.x:
  • Industry standard
  • C/C++ and Fortran
  • Supported by Intel Compiler (14, 15, 16), GCC 4.9+, …
  • Covers both levels (B and C) of microprocessor parallelism
2-level parallelism decomposition with OpenMP 4.x: image processing example

#pragma omp parallel for          // level B: threads
for (int y = 0; y < ImageHeight; ++y) {
    #pragma omp simd              // level C: SIMD lanes
    for (int x = 0; x < ImageWidth; ++x) {
        count[y][x] = mandel(in_vals[y][x]);
    }
}
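A self-contained version of this two-level pattern can be compiled and run as-is; the trivial pixel function below is a stand-in for the slide's mandel, and the names W, H, pixel, and render are illustrative. Without -fopenmp the pragmas are simply ignored and the results are identical:

```c
#include <stdint.h>

enum { W = 64, H = 48 };

static uint32_t count[H][W];

/* Stand-in for mandel(): any pure per-element function fits here. */
static inline uint32_t pixel(uint32_t v) { return 2u * v + 1u; }

void render(void)
{
    #pragma omp parallel for      /* level B: one row per thread  */
    for (int y = 0; y < H; ++y) {
        #pragma omp simd          /* level C: SIMD lanes across x */
        for (int x = 0; x < W; ++x)
            count[y][x] = pixel((uint32_t)(y * W + x));
    }
}
```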
2-level parallelism decomposition with OpenMP 4.x: fluid dynamics example

#pragma omp parallel for          // level B: threads
for (int i = 0; i < X_Dim; ++i) {
    #pragma omp simd              // level C: SIMD lanes
    for (int m = 0; m < n_velocities; ++m) {
        next_i = f(i, velocities(m));
        X[i] = next_i;
    }
}
Programming for vector SIMD parallelism
Vector Processing
[Diagram: VL lanes, each computing Ci = Ai + Bi in parallel.]
Many Ways to Vectorize (from ease of use to programmer control)

Implicit:
- Use performance libraries (MKL, IPP)
- Compiler: auto-vectorization (no change of code)
- Compiler: auto-vectorization hints (#pragma vector, …)

Explicit (user-mandated) vector programming:
- OpenMP 4.x, Intel Cilk Plus
- Cilk Plus array notation (CEAN): a[:] = b[:] + c[:]

Instruction-aware:
- SIMD intrinsic classes (e.g. F32vec, F64vec, …)
- Vector intrinsics (e.g. _mm_fmadd_pd(…), _mm_add_ps(…), …)
- Assembler code (e.g. [v]addps, [v]addss, …)
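The "vector intrinsics" rung can be illustrated with a small SSE sketch (this assumes an x86 target; SSE is baseline on x86-64). The programmer picks the instructions explicitly while the compiler still handles register allocation; add4 is an illustrative name:

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Add 4 floats at a time: the hand-written counterpart of the
   compiler-generated ADDv4 loop. Unaligned loads/stores for safety. */
void add4(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);          /* load 4 lanes from a     */
    __m128 vb = _mm_loadu_ps(b);          /* load 4 lanes from b     */
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* one instruction, 4 adds */
}
```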
Explicit Vector Programming with OpenMP 4.x

Input: C/C++/Fortran source code (the vector part of the OpenMP* 4.0 extension)
  -> Vectorizer
  -> Optimization and code generation
  -> Intel® SSE / Intel® AVX / Intel® MIC

Map vector parallelism to the vector ISA; the vectorizer makes retargeting easy!
Compiling for Intel® AVX

> icc -O2 -xcore-avx2 src.cpp -o test.exe
  Intel® AVX2; Haswell CPU

> icc -O2 -xcore-avx2 -axCOMMON-AVX512 src.cpp -o test.exe
  Default is AVX2; if AVX-512 is available, that “code path” is used.

Math libraries may target SSE/AVX2/AVX-512 automatically at runtime.
Pragma SIMD Example

Ignore data dependencies, indirectly mitigate control flow dependence, and assert alignment:

void vec1(float *a, float *b, int off, int len)
{
    #pragma omp simd safelen(32) aligned(a:64, b:64)
    for (int i = 0; i < len; i++)
    {
        a[i] = (a[i] > 1.0) ?
               a[i] * b[i] :
               a[i + off] * b[i];
    }
}
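The aligned(a:64, b:64) clause is a promise from the programmer, not a request: callers must actually supply 64-byte-aligned buffers, and since vec1 reads a[i + off], the buffer behind a must hold at least len + off elements when off is positive. A hedged sketch of an allocation helper using C11 aligned_alloc (make_aligned is illustrative, not part of the slide):

```c
#include <stdlib.h>

/* Allocate n floats on a 64-byte boundary. C11 aligned_alloc requires
   the size to be a multiple of the alignment, hence the rounding up. */
float *make_aligned(size_t n)
{
    size_t bytes = (n * sizeof(float) + 63u) / 64u * 64u;
    return aligned_alloc(64, bytes);
}
```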
SIMD-enabled functions

Write a function for one element and add a pragma as follows:

#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;
}

Call the scalar version:

e = foo(a, b, c, d);

Call the vector version via a SIMD loop:

#pragma omp simd
for (i = 0; i < n; i++) {
    A[i] = foo(B[i], C[i], D[i], E[i]);
}

Or via array notation:

A[:] = foo(B[:], C[:], D[:], E[:]);
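Put together, the declare-simd function and its SIMD call site compile as ordinary C as well; apply is an illustrative wrapper name, and without OpenMP support the pragmas are ignored while the scalar results stay the same:

```c
/* The SIMD-enabled function from the slide: the pragma asks the
   compiler to also emit a vector variant of foo. */
#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;   /* one element's worth of work */
}

/* Vector call site: each SIMD lane performs one logical call to foo. */
void apply(float *A, const float *B, const float *C,
           const float *D, const float *E, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        A[i] = foo(B[i], C[i], D[i], E[i]);
}
```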