programming, intel performance, & efficiency architecture · 2018-01-29 · optimization notice...

51
Intel Architecture 1 Rama Malladi Intel, Bangalore Programming, Performance, & Efficiency

Upload: vukhue

Post on 26-Jun-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

IntelArchitecture

1

Rama MalladiIntel, Bangalore

Programming,Performance,& Efficiency

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

2

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Intel Technical Computing & HPC Solutions Portfolio

Intel®ClusterReady

Network& Fabric

Software& Services

ComputeProcessing

Systems& Boards

I/O & Storage

2015

SiPh

Intel®Omni-PathArchitecture

Intel®True ScaleFabric

NVM

Intel®EthernetNetworking

3

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice4

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice5

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• The problem statement!

• Intel® Xeon® Processor Architecture Basics

• Intel® Xeon Phi™ (Knights Landing)

• Software Tools for Developers

• Summary

6

Agenda

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Problem:Economical operation frequency of (CMOS) transistors is limited. No free lunch anymore!

Solution:More transistors allow more gates/logic on the same die space and power envelop, improving parallelism:

Thread level parallelism (TLP):Multi- and many-core

Data level parallelism (DLP):Wider vectors (SIMD)

Instruction level parallelism (ILP):Microarchitecture improvements, e.g.threading, superscalarity, …

7

Parallel & Vector Execution

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

8

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

9

recap

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

How to determine if we got the best / peak performance?

• Run GEMM?

• LINPACK?

• STREAM bandwidth tests?

• Latency benchmarks?

• …

• Get theoretical peak possible for the code?

10

Roofline Analysis?

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

11

Roofline Analysis“paper-pen” exercise

float *A, *B, *C, d;

for(i=0; i<n; i++)

{

A[i] = B[i] + d * C[i];

}

The above code on Intel Xeon Phi is bound by:

• Compute?

• Bandwidth?

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• The problem statement!

• Intel® Xeon® Processor Architecture Basics

• Intel® Xeon Phi™ (Knights Landing)

• Software Tools for Developers

• Summary

12

Agenda

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• 1978: 8086 and 8088 processors• 1982: Intel® 286 processor • 1985: Intel386™ processor• 1989: Intel486™ processor• 1993: Intel® Pentium®• 1995-1999: P6 family:

Intel Pentium Pro processor Intel Pentium II [Xeon] processor Intel Pentium III [Xeon] processor Intel Celeron® processor

• 2000-2006: Intel® Pentium® 4 processor• 2005: Intel® Pentium® processor Extreme Edition• 2001-2007: Intel® Xeon® processor• 2003-2006: Intel® Pentium® M processor• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors• 2006/2007: Intel® Core™2 processor• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family• 2010: Intel® Core™ processor family• 2011: Second generation Intel® Core™ processor family• 2012: Third generation Intel® Core™ processor family• 2013: Fourth generation Intel® Core™ processor family• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture

13

HistoryIntel® 64 and IA-32 Architecture

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Computation of instructions requires several stages:

1. Fetch:Read instruction (bytes) from memory

2. Decode:Translate instruction (bytes) to microarchitecture

3. Execute:Perform the operation with a functional unit

4. Memory:Load (read) or store (write) data, if required

5. Commit:Retire instruction and update micro-architectural state

14

Processor Architecture BasicsPipeline

Front End Back End

Fetch Decode Execute CommitLoad or

Store

Instructions

MemoryRegisters

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Impact of pipeline stalls can be reduced by:• Branch prediction• Superscalarity + multiple issue fetch & decode• Out of Order execution• Cache• Non-temporal stores• Prefetching• Line fill buffers• Load/Store buffers• Alignment• Simultaneous Multithreading (SMT)

Characteristics of the architecture that might require user action!

15

Processor Architecture BasicsReducing Impact of Pipeline Stalls

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Cache-hierarchy:• For data and instructions• Usually inclusive caches• Races for resources• Can improve access speed• Cache-misses & cache-hits

Cache-line:• Always full 64 byte block• Minimal granularity of every load/store• Modifications invalidate entire cache-line (dirty bit)

16

Processor Architecture BasicsCache-Hierarchy & Cache-Line

L2 Cache

L3 Cache

I DL1 Caches

Memory

Fe

tch

Lo

ad

or

Sto

re

De

cod

e

Ex

ecu

te

Co

mm

it

0x12340x5678

Instructions

Instructions

Instructions

Instructions

1: mov 0x1234, %rbx

2: mov 0x5678, %rcx

3: mov %rcx, 0x1234

4: …

5: …

6: …Instructions

65432

1

0x1234

0x1234

0x5678

0x5678

0x12340x5678

0x1234

0x1234

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-miss

Cache-hit

65432

1

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

From Intel® 64 and IA-32 Architectures Optimization Reference Manual:

17

Processor Architecture BasicsExample: 4th Generation Intel® Core™

Fetch(Pre-Decode)I I-Queue

Decode & Dispatch

BPU

BTB

ROB, L/S

Exe[Int][FP]

DL2 Cache

Exe[Int]

Exe[Int][FP]

Exe[Int][FP] S

tore

Sto

re A

dd

ress

Lo

ad

&S

tore

Ad

dre

ss

Lo

ad

&S

tore

Ad

dre

s

LFB

SchedulePort 0 Port 1 Port 5 Port 6 Port 4 Port 2 Port 3 Port 7

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• Core:Processor core’s logic: Execution units Core caches (L1/L2) Buffers & registers …

• Uncore:All outside a processor core: Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI) L3 cache shared by all cores Type of memory Power management and clocking Optionally: Integrated graphics

Only uncore is differentiation within same processor family!18

Processor Architecture BasicsCore vs. Uncore

Processor

Co

re 1

Co

re2

Co

re n

QPIMCClock

&Power

L3 Cache

Gra

ph

ics

Core

Uncore

DDR3or

DDR4

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• The problem statement!

• Intel® Xeon® Processor Architecture Basics

• Intel® Xeon Phi™ (Knights Landing)

• Software Tools for Developers

• Summary

19

Agenda

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice20

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

21

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice22

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice23

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice24

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice25

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice26

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice27

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice28

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice29

Random Access Latency

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice30

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

AVX-512 Designed for HPC

Quadwordinteger

arithmetic

Including gather/scatter with D/Qword

indices

Math support

IEEE division and square root

DP transcendental

primitives

New transcendental

support instructions

New permutation primitives

Two source shuffles

Compress & Expand

Bit manipulation

Vector rotate

Universal ternary logical operation

New mask instructions

• Promotions of many AVX and AVX2 instructions to AVX-512

− 32-bit and 64-bit floating-point instructions from AVX

− Scalar and 512-bit

− 32-bit and 64-bit integer instructions from AVX2

• Many new instructions to speedup HPC workloads

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice32

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice33

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Gather & Scatter

VMOVDQU64 zmm1, Q[rsi]

VMOVDQU64 zmm2, R[rsi]

VGATHERQQ zmm0 {k2}, [rax+zmm1*8]

VSCATTERQQ [rax+zmm2*8] {k3}, zmm0

D/Q/SP/DP element typesD/Q indicesInstruction can partially execute

k-reg Mask used as completion mask

for(j=0, i=0; i<N; i++)

{

B[R[i]] = A[Q[i]];

}

Q RA B

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

• The problem statement!

• Intel® Xeon® Processor Architecture Basics

• Intel® Xeon Phi™ (Knights Landing)

• Software Tools for Developers

• Summary

35

Agenda

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Petascale Today

Intel® Cluster Studio XE

36

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Intel® Cluster Studio XE 2018

37

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

38

Leading Supporter of

Industry StandardsC++, C, Fortran, OpenMP,

MPI, TBB

More Efficient

Software Development

Programmer Productivity

Preservation of Investment

Intel

Common model method

Scaling Consistency Multicore to Many-core

same programming models, programming languages, tools

Intel® Parallel Studio XE

Tools to CreateHighly optimizing Compilers,

High Performance Libraries,Designed for Performance and Analysis

Intel® Parallel Studio XE

Tools for AnalysisThreading Analysis & Debug

MPI Analysis & Debug

Vector Analysis & Debug

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

39

Compiler Support

Intel Compilers

GNU Compilers

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

40

Intel MKL library

Intel Math Kernel Library

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

41

Small Code Example – Writing Code

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

42

MPI Library…

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

43

Scaling tests of NPB – Pure MPI

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

44

Scaling tests of NPB – OpenMP

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

45

Scaling of HYDRO Code

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Scaling of GROMACS

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

47

Compiler, Runtime Flags: GROMACS

Compiler Flags

Runtime Settings

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

48

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

49

Over 40 applications optimized for Intel® Xeon Phi™ processor family are available, with up to 6.48X (1.99X average) performance improvement1

Intel® Xeon Phi™ processor 7250 relative performance normalized to baseline (1) of a 2 socket Intel® Xeon® processor E5-2697 v4)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components,

software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the

performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance

UPDATED

No

rma

lize

d R

esu

lts

11

.56

1.6

01

.70 2.1

83

.89

0.9

61

.11

1.2

01

.35

1.3

81

.56

2.9

43

.34

1.2

21

.23

1.3

21

.36

1.4

31

.75

1.8

3 2.3

62

.50

2.6

7

1.1

61

.23

1.4

11

.41

1.5

31

.70

1.7

11

.77

2.0

02

.10

1.2

81

.37

1.4

51

.49

1.5

01

.59

1.6

91

.71

1.8

01

.81

2.0

02

.13

2.3

42

.37

2.4

52

.50

2.8

0

1.1

71

.38

1.5

81

.58

1.8

2 2.5

4

1.2

31

.35

1.3

81

.50

1.6

31

.67

1.7

11

.72

1.7

73

.65

1.2

71

.31

3.3

6 4.0

34

.25

4.6

56

.48

0

1

2

3

4

5

6

2s

Xe

on

® E

5-2

69

7 v

4

Lin

pa

ck

DG

EM

M

SG

EM

M

HP

CG

ST

RE

AM

Tri

ad

Tin

ity

- S

NA

P

Tri

nit

y -

GT

C

Tri

nit

y -

min

iDF

T

Tri

nit

y -

MIL

C

Tri

nit

y -

AM

G

Tri

nit

y -

UM

T

Tri

nit

y -

min

iFE

Tri

nit

y -

min

iGh

ost

GR

OM

AC

S

NA

MD

ap

oa

1

RE

LIO

N

NA

MD

stm

v

Co

ars

e-G

rain

Wa

ter

Sim

ula

tio

n…

Am

be

r N

ucl

eo

som

e

Am

be

r P

oli

o V

iru

s

RO

ME

/SM

L

Am

be

r 2

w4

9

Am

be

r R

ub

isco

IFS

PA

PS

14

MP

AS

-O

MA

SN

UM

Wa

ve

PO

P

HO

MM

E

DM

I H

IRO

MB

-BO

OS

WR

F C

ON

US

12

KM

GN

AQ

PM

S

NIM

NE

MO

SP

EC

FE

M_

3D

GL

OB

E 6

no

de

s -

55

K…

SP

EC

FE

M_

3D

GL

OB

E 2

5 n

od

es

-…

Se

isS

ol

- M

7.2

19

92

La

nd

ers

Se

isS

ol

- M

ou

nt

Me

rap

i LT

S M

R2

Op

en

LB

Se

isS

ol

- M

ou

nt

Me

rap

i GT

S

MIL

C

iso

3D

FD

PE

TS

c

So

ft S

ph

ere

VL

PL

-S

Qp

hiX

Wil

son

DS

LA

SH

Clo

ve

rLe

af

3D

Clo

ve

rLe

af

2D

Qp

hiX

CG

YA

SK

- is

o3

DF

D

YA

SK

- A

WP

-OD

C

Qu

an

tum

ES

PR

ES

SO

Be

rke

ley

GW

PW

MA

T -

Ga

As1

60

PW

MA

T -

Ga

As6

4

VA

SP

CP

2K

GE

Ta

com

a

HiF

UN

Op

en

FO

AM

Mo

torB

ike

4M

Ce

lls

Op

en

LB

Op

en

FO

AM

Mo

torB

ike

20

M C

ell

s

Op

en

FO

AM

Dri

vAe

r ca

r 1

0M

Ce

lls

:…

Op

en

FO

AM

Mo

torB

ike

11

M C

ell

s

Op

en

FO

AM

Dri

vAe

r ca

r 1

0M

Ce

lls

:…

NA

SA

Ov

erf

low

TA

CC

LB

3D

Bin

om

ial

Op

tio

ns

Bla

ckS

cho

les

SP

Bla

ckS

cho

les

DP

BA

W A

me

rica

n O

pti

on

s…

Mo

nte

Ca

rlo

Eu

rop

ea

n O

pti

on

s S

P

Mo

nte

Ca

rlo

Eu

rop

ea

n O

pti

on

s D

P

BA

W A

me

rica

n O

pti

on

s…

Benchmarks Life Science

Weather and Climate Physics / Geophysics / Energy

Manufacturing

Financial Services

Material Science

Trinity Benchmarks

1 – As demonstrated by respective proof points in this presentation

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

50

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice

Thank you!

51