TRANSCRIPT
Vectorization
Kirill Rogozhin
Intel
Copyright © 2015 Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Motivation
The “Free Lunch” Is Over, Really
Processor clock rate growth halted around 2005.
Source: © 2014, James Reinders, Intel, used with permission
Moore’s Law Is STILL Going Strong
Hardware performance potential continues to grow.
“We think we can continue Moore's Law for at least another 10 years."
Intel Senior Fellow Mark Bohr, 2015
[Chart: Processor scaling trends, 1980–2010, relative scaling on a log axis: Transistors, Clock, Power, Performance, and Performance/W.]
Intel® Xeon® processor generations:

                  64-bit  5100   5500   5600   Sandy      Ivy        Haswell  Future
                                                Bridge EP  Bridge EP  EP       Xeon
Core(s)           1       2      4      6      8          12         18       >18
Threads           2       2      8      12     16         24         36       >36
SIMD width (bits) 128     128    128    128    256        256        256      512

Intel® Xeon Phi™:

                  Knights Corner (coprocessor)  Knights Landing¹ (processor & coprocessor)
Core(s)           61                            70+
Threads           244                           280+
SIMD width (bits) 512                           512

*Product specification for launched and shipped products available on ark.intel.com.
1. Not launched or in planning.

More cores. More threads. Wider vectors.
High performance software must exploit both:
• Threading parallelism
• Vector data parallelism
Untapped Potential Can Be Huge!
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Configurations for Binomial Options SP at the end of this presentation.
The Difference Is Growing With Each New Generation of Hardware
Mandelbrot: ~2000x Speedup on Xeon Phi™ --Isn’t it Cool?
#pragma omp declare simd uniform(max_iter), simdlen(32)
uint32_t mandel(fcomplex c, uint32_t max_iter)
{
    uint32_t count = 1;
    fcomplex z = c;
    while ((cabsf(z) < 2.0f) && (count < max_iter)) {
        z = z * z + c;
        count++;
    }
    return count;
}

#pragma omp parallel for schedule(guided)
for (int32_t y = 0; y < ImageHeight; ++y) {
    float c_im = max_imag - y * imag_factor;
    #pragma omp simd safelen(32)
    for (int32_t x = 0; x < ImageWidth; ++x) {
        fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
        count[y][x] = mandel(in_vals_tmp, max_iter);
    }
}

Intel Xeon Phi™ system, Linux64, 61 cores running 244 threads at 1 GHz, 32 KB L1, 512 KB L2 per core. Intel C/C++ Compiler internal build.
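The kernel above leans on Intel-specific pieces (the fcomplex type and the 1.0iF imaginary constant). For readers who want to experiment, here is a portable scalar sketch of the same iteration using plain floats; mandel_scalar is an illustrative name, not the slide's function:

```c
#include <stdint.h>

/* Scalar Mandelbrot iteration count: z = z*z + c, stopping when
   |z| >= 2 or max_iter is reached. Uses |z|^2 < 4 to avoid a sqrt. */
static uint32_t mandel_scalar(float c_re, float c_im, uint32_t max_iter)
{
    float z_re = c_re, z_im = c_im;   /* z starts at c, as on the slide */
    uint32_t count = 1;
    while (z_re * z_re + z_im * z_im < 4.0f && count < max_iter) {
        float t = z_re * z_re - z_im * z_im + c_re; /* Re(z*z + c) */
        z_im = 2.0f * z_re * z_im + c_im;           /* Im(z*z + c) */
        z_re = t;
        count++;
    }
    return count;
}
```

A point inside the set (c = 0) runs to max_iter; a point far outside (c = 2 + 2i) escapes on the first test.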
Untapped Potential Can Be Huge!
Many codes are still here
Don’t use a single vector lane/thread!
Un-vectorized and un-threaded software will underperform.
Permission to Design for All Lanes
Threading and vectorization are both needed to fully utilize modern hardware.
Vector SIMD parallelism, vectorization.
Vector Processing
[Diagram: VL lanes, each computing Ci = Ai + Bi in parallel.]
Cumulative (approx.) # of vector instructions:

  SSE   SSE2  SSE3  SSSE3  SSE4  SSE4.2  AVX   AVX2  AVX-512
  62    294   318   378    471   485     1109  1557  ~3800

  Nehalem (2008), Sandy Bridge (2011), Haswell (2013), Next Xeon/KNL (2015+)

How can customers use these new instructions?
Why SIMD vector parallelism?
Vectorization of Code

for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];

Scalar: one add per iteration, a[i] + b[i] -> c[i].
Vectorized (8-wide): a[i..i+7] + b[i..i+7] -> c[i..i+7] in a single instruction.
Vector data operations: data operations done in parallel

void v_add(float *c,
           float *a,
           float *b)
{
    for (int i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
}
Scalar processing: one element per loop iteration.

Loop:
1. LOAD  a[i] -> Ra
2. LOAD  b[i] -> Rb
3. ADD   Ra, Rb -> Rc
4. STORE Rc -> c[i]
5. ADD   i + 1 -> i

Vector processing: VL elements per loop iteration (here VL = 4).

Loop:
1. LOADv4  a[i:i+3] -> Rva
2. LOADv4  b[i:i+3] -> Rvb
3. ADDv4   Rva, Rvb -> Rvc
4. STOREv4 Rvc -> c[i:i+3]
5. ADD     i + 4 -> i

We call this “vectorization”.
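Both instruction streams come from the same C source; whether the compiler may emit the vector form often hinges on proving the arrays don't overlap. A minimal sketch of v_add written to help the auto-vectorizer; the restrict qualifiers and the explicit length parameter n are illustrative additions, not on the slide:

```c
#include <stddef.h>

/* Element-wise add. `restrict` promises the compiler that c, a, and b
   do not alias, so it is free to emit the 4-wide vector loop above. */
void v_add(float *restrict c, const float *restrict a,
           const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Built at -O2 or higher, icc, gcc, and clang typically report this loop as vectorized.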
Intel® SSE and AVX-128 Data Types

SSE:  4x floats
SSE2: 2x doubles
      16x bytes
      8x 16-bit shorts
      4x 32-bit integers
      2x 64-bit integers
      1x 128-bit(!) integer
AVX-256 Data Types

Intel® AVX:  8x floats
             4x doubles
Intel® AVX2: 32x bytes
             16x 16-bit shorts
             8x 32-bit integers
             4x 64-bit integers
             2x 128-bit(!) integers
AVX-512 Data Types

AVX-512: 16x floats
         8x doubles
         16x 32-bit integers
         ...
16x DP speed-up over scalar, 8x DP speed-up over SSE, with Intel® Advanced Vector Extensions 512 (AVX-512)

Higher performance for the most demanding computational tasks:
- Significant leap to 512-bit SIMD support for processors
- Intel® Compilers and Intel® Math Kernel Library include AVX-512 support
- Strong compatibility with AVX
- Added EVEX prefix enables additional functionality
- Appears first in the Intel® Xeon Phi™ coprocessor code-named Knights Landing
Leadership Today and Tomorrow: Multicore + Vector

[Quadrant diagram: serial vs. threaded on one axis, scalar vs. vector on the other; peak performance requires the threaded + vector quadrant.]

Intel® Xeon® processors:
- Parallel, fast serial
- Most commonly used parallel processor*

Next-generation Intel® Xeon Phi™ (Knights Landing):
- Many core, with support for 512-bit vectors
- Higher memory bandwidth
- Common SW programming
- Targeted and optimized for highly-vectorizable, parallel apps

Single source code, common optimization.

*Based on highest volume CPU in the IDC HPC Qview Q1’13
Knights Landing Architectural Summary

Diagram is for conceptual purposes only and only illustrates a CPU and memory; it is not to scale.

- Up to 72 cores in a 2D mesh architecture; tiles of 2 cores share a 1 MB L2 hub
- 2x 512-bit VPUs (Vector Processing Units) per core
- Over 3 TF DP peak
- Full Xeon ISA compatibility through AVX-512
- ~3x single-thread performance compared to Knights Corner
- Up to 16 GB high-bandwidth on-package memory (MCDRAM), exposed as a NUMA node, ~500 GB/s sustained BW
- 6 channels DDR4, up to 384 GB
- PCIe Gen3 x36; DMI to Wellsburg PCH (common with Grantley)
- 2 ports Storm Lake integrated fabric (HFI), on-package, 50 GB/s bi-directional, micro-coax cable (IFP)
- Core based on the Intel® Atom™ Silvermont processor with many HPC enhancements: deep out-of-order buffers, gather/scatter in hardware, improved branch prediction, 4 threads/core, high cache bandwidth, and more
(Re-cap) Parallel programming for multicore and manycore processors
[Diagram: three nested parallelism layers, A, B, and C.]
How could we program these parallel machines?

A “Three Layer Cake” (layers A, B, C) abstracts common hybrid parallelism programming approaches.
How could we program these parallel machines? Implementing the cake:

A – MPI, tbb::flow, PGAS
B – OpenMP 4.x, Cilk Plus, TBB
C – OpenMP 4.x, Cilk Plus

Programming models and software tools: Cluster Edition, Professional Edition.
How could we program these parallel machines?

• Different methods exist
• OpenMP 4.x:
  • Industry standard
  • C/C++ and Fortran
  • Supported by Intel Compiler (14, 15, 16), GCC 4.9+, …
  • Covers both levels (B and C) of microprocessor parallelism
2-level parallelism decomposition with OpenMP 4.x: image processing example

#pragma omp parallel for          // level B: threads
for (int y = 0; y < ImageHeight; ++y) {
    #pragma omp simd              // level C: SIMD lanes
    for (int x = 0; x < ImageWidth; ++x) {
        count[y][x] = mandel(in_vals[y][x]);
    }
}
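A self-contained version of this two-level pattern can be compiled and run as-is; the trivial pixel function below is a stand-in for the slide's mandel, and the names W, H, pixel, and render are illustrative. Without -fopenmp the pragmas are simply ignored and the results are identical:

```c
#include <stdint.h>

enum { W = 64, H = 48 };

static uint32_t count[H][W];

/* Stand-in for mandel(): any pure per-element function fits here. */
static inline uint32_t pixel(uint32_t v) { return 2u * v + 1u; }

void render(void)
{
    #pragma omp parallel for      /* level B: one row per thread  */
    for (int y = 0; y < H; ++y) {
        #pragma omp simd          /* level C: SIMD lanes across x */
        for (int x = 0; x < W; ++x)
            count[y][x] = pixel((uint32_t)(y * W + x));
    }
}
```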
2-level parallelism decomposition with OpenMP 4.x: fluid dynamics example

#pragma omp parallel for          // level B: threads
for (int i = 0; i < X_Dim; ++i) {
    #pragma omp simd              // level C: SIMD lanes
    for (int m = 0; m < n_velocities; ++m) {
        next_i = f(i, velocities(m));
        X[i] = next_i;
    }
}
Programming for vector SIMD parallelism
Vector Processing
[Diagram: VL lanes, each computing Ci = Ai + Bi in parallel.]
Many Ways to Vectorize (from ease of use to programmer control)

Implicit:
- Use performance libraries (MKL, IPP)
- Compiler: auto-vectorization (no change of code)
- Compiler: auto-vectorization hints (#pragma vector, …)

Explicit (user-mandated) vector programming:
- OpenMP 4.x, Intel Cilk Plus
- Cilk Plus array notation (CEAN): a[:] = b[:] + c[:]

Instruction-aware:
- SIMD intrinsic classes (e.g. F32vec, F64vec, …)
- Vector intrinsics (e.g. _mm_fmadd_pd(…), _mm_add_ps(…), …)
- Assembler code (e.g. [v]addps, [v]addss, …)
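The "vector intrinsics" rung can be illustrated with a small SSE sketch (this assumes an x86 target; SSE is baseline on x86-64). The programmer picks the instructions explicitly while the compiler still handles register allocation; add4 is an illustrative name:

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Add 4 floats at a time: the hand-written counterpart of the
   compiler-generated ADDv4 loop. Unaligned loads/stores for safety. */
void add4(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);          /* load 4 lanes from a     */
    __m128 vb = _mm_loadu_ps(b);          /* load 4 lanes from b     */
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* one instruction, 4 adds */
}
```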
Explicit Vector Programming with OpenMP 4.x

Input: C/C++/Fortran source code (the vector part of the OpenMP* 4.0 extension)
  -> Vectorizer
  -> Optimization and code generation
  -> Intel® SSE / Intel® AVX / Intel® MIC

Map vector parallelism to the vector ISA; the vectorizer makes retargeting easy!
Compiling for Intel® AVX

> icc -O2 -xcore-avx2 src.cpp -o test.exe
  Intel® AVX2; Haswell CPU

> icc -O2 -xcore-avx2 -axCOMMON-AVX512 src.cpp -o test.exe
  Default is AVX2; if AVX-512 is available, that “code path” is used.

Math libraries may target SSE/AVX2/AVX-512 automatically at runtime.
Pragma SIMD Example

Ignore data dependencies, indirectly mitigate control flow dependence, and assert alignment:

void vec1(float *a, float *b, int off, int len)
{
    #pragma omp simd safelen(32) aligned(a:64, b:64)
    for (int i = 0; i < len; i++)
    {
        a[i] = (a[i] > 1.0) ?
               a[i] * b[i] :
               a[i + off] * b[i];
    }
}
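The aligned(a:64, b:64) clause is a promise from the programmer, not a request: callers must actually supply 64-byte-aligned buffers, and since vec1 reads a[i + off], the buffer behind a must hold at least len + off elements when off is positive. A hedged sketch of an allocation helper using C11 aligned_alloc (make_aligned is illustrative, not part of the slide):

```c
#include <stdlib.h>

/* Allocate n floats on a 64-byte boundary. C11 aligned_alloc requires
   the size to be a multiple of the alignment, hence the rounding up. */
float *make_aligned(size_t n)
{
    size_t bytes = (n * sizeof(float) + 63u) / 64u * 64u;
    return aligned_alloc(64, bytes);
}
```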
SIMD-enabled functions

Write a function for one element and add a pragma as follows:

#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;
}

Call the scalar version:

e = foo(a, b, c, d);

Call the vector version via a SIMD loop:

#pragma omp simd
for (i = 0; i < n; i++) {
    A[i] = foo(B[i], C[i], D[i], E[i]);
}

Or via array notation:

A[:] = foo(B[:], C[:], D[:], E[:]);
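Put together, the declare-simd function and its SIMD call site compile as ordinary C as well; apply is an illustrative wrapper name, and without OpenMP support the pragmas are ignored while the scalar results stay the same:

```c
/* The SIMD-enabled function from the slide: the pragma asks the
   compiler to also emit a vector variant of foo. */
#pragma omp declare simd
float foo(float a, float b, float c, float d)
{
    return a * b + c * d;   /* one element's worth of work */
}

/* Vector call site: each SIMD lane performs one logical call to foo. */
void apply(float *A, const float *B, const float *C,
           const float *D, const float *E, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        A[i] = foo(B[i], C[i], D[i], E[i]);
}
```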