programming, intel performance, & efficiency architecture · 2018-01-29 · optimization notice...
TRANSCRIPT
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
2
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Intel Technical Computing & HPC Solutions Portfolio
Intel®ClusterReady
Network& Fabric
Software& Services
ComputeProcessing
Systems& Boards
I/O & Storage
2015
SiPh
Intel®Omni-PathArchitecture
Intel®True ScaleFabric
NVM
Intel®EthernetNetworking
3
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice4
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice5
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• The problem statement!
• Intel® Xeon® Processor Architecture Basics
• Intel® Xeon Phi™ (Knights Landing)
• Software Tools for Developers
• Summary
6
Agenda
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Problem:Economical operation frequency of (CMOS) transistors is limited. No free lunch anymore!
Solution:More transistors allow more gates/logic on the same die space and power envelop, improving parallelism:
Thread level parallelism (TLP):Multi- and many-core
Data level parallelism (DLP):Wider vectors (SIMD)
Instruction level parallelism (ILP):Microarchitecture improvements, e.g.threading, superscalarity, …
7
Parallel & Vector Execution
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
8
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
9
recap
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
How to determine if we got the best / peak performance?
• Run GEMM?
• LINPACK?
• STREAM bandwidth tests?
• Latency benchmarks?
• …
• Get theoretical peak possible for the code?
10
Roofline Analysis?
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
11
Roofline Analysis“paper-pen” exercise
float *A, *B, *C, d;
for(i=0; i<n; i++)
{
A[i] = B[i] + d * C[i];
}
The above code on Intel Xeon Phi is bound by:
• Compute?
• Bandwidth?
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• The problem statement!
• Intel® Xeon® Processor Architecture Basics
• Intel® Xeon Phi™ (Knights Landing)
• Software Tools for Developers
• Summary
12
Agenda
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• 1978: 8086 and 8088 processors• 1982: Intel® 286 processor • 1985: Intel386™ processor• 1989: Intel486™ processor• 1993: Intel® Pentium®• 1995-1999: P6 family:
Intel Pentium Pro processor Intel Pentium II [Xeon] processor Intel Pentium III [Xeon] processor Intel Celeron® processor
• 2000-2006: Intel® Pentium® 4 processor• 2005: Intel® Pentium® processor Extreme Edition• 2001-2007: Intel® Xeon® processor• 2003-2006: Intel® Pentium® M processor• 2006/2007: Intel® Core™ Duo and Intel® Core™ Solo processors• 2006/2007: Intel® Core™2 processor• 2008: Intel® Atom™ processor and Intel® Core™ i7 processor family• 2010: Intel® Core™ processor family• 2011: Second generation Intel® Core™ processor family• 2012: Third generation Intel® Core™ processor family• 2013: Fourth generation Intel® Core™ processor family• 2013: Intel® Atom™ processor family based on Silvermont microarchitecture
13
HistoryIntel® 64 and IA-32 Architecture
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Computation of instructions requires several stages:
1. Fetch:Read instruction (bytes) from memory
2. Decode:Translate instruction (bytes) to microarchitecture
3. Execute:Perform the operation with a functional unit
4. Memory:Load (read) or store (write) data, if required
5. Commit:Retire instruction and update micro-architectural state
14
Processor Architecture BasicsPipeline
Front End Back End
Fetch Decode Execute CommitLoad or
Store
Instructions
MemoryRegisters
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Impact of pipeline stalls can be reduced by:• Branch prediction• Superscalarity + multiple issue fetch & decode• Out of Order execution• Cache• Non-temporal stores• Prefetching• Line fill buffers• Load/Store buffers• Alignment• Simultaneous Multithreading (SMT)
Characteristics of the architecture that might require user action!
15
Processor Architecture BasicsReducing Impact of Pipeline Stalls
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Cache-hierarchy:• For data and instructions• Usually inclusive caches• Races for resources• Can improve access speed• Cache-misses & cache-hits
Cache-line:• Always full 64 byte block• Minimal granularity of every load/store• Modifications invalidate entire cache-line (dirty bit)
16
Processor Architecture BasicsCache-Hierarchy & Cache-Line
L2 Cache
L3 Cache
I DL1 Caches
Memory
Fe
tch
Lo
ad
or
Sto
re
De
cod
e
Ex
ecu
te
Co
mm
it
0x12340x5678
Instructions
…
Instructions
Instructions
Instructions
1: mov 0x1234, %rbx
2: mov 0x5678, %rcx
3: mov %rcx, 0x1234
4: …
5: …
6: …Instructions
65432
1
0x1234
0x1234
0x5678
0x5678
0x12340x5678
0x1234
0x1234
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-miss
Cache-hit
65432
1
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
From Intel® 64 and IA-32 Architectures Optimization Reference Manual:
17
Processor Architecture BasicsExample: 4th Generation Intel® Core™
Fetch(Pre-Decode)I I-Queue
Decode & Dispatch
BPU
BTB
ROB, L/S
Exe[Int][FP]
DL2 Cache
Exe[Int]
Exe[Int][FP]
Exe[Int][FP] S
tore
Sto
re A
dd
ress
Lo
ad
&S
tore
Ad
dre
ss
Lo
ad
&S
tore
Ad
dre
s
LFB
SchedulePort 0 Port 1 Port 5 Port 6 Port 4 Port 2 Port 3 Port 7
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• Core:Processor core’s logic: Execution units Core caches (L1/L2) Buffers & registers …
• Uncore:All outside a processor core: Memory controller/channels (MC) and Intel® QuickPath Interconnect (QPI) L3 cache shared by all cores Type of memory Power management and clocking Optionally: Integrated graphics
Only uncore is differentiation within same processor family!18
Processor Architecture BasicsCore vs. Uncore
Processor
Co
re 1
Co
re2
Co
re n
QPIMCClock
&Power
L3 Cache
Gra
ph
ics
Core
Uncore
DDR3or
DDR4
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• The problem statement!
• Intel® Xeon® Processor Architecture Basics
• Intel® Xeon Phi™ (Knights Landing)
• Software Tools for Developers
• Summary
19
Agenda
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice20
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
21
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice22
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice23
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice24
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice25
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice26
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice27
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice28
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice29
Random Access Latency
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice30
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
AVX-512 Designed for HPC
Quadwordinteger
arithmetic
Including gather/scatter with D/Qword
indices
Math support
IEEE division and square root
DP transcendental
primitives
New transcendental
support instructions
New permutation primitives
Two source shuffles
Compress & Expand
Bit manipulation
Vector rotate
Universal ternary logical operation
New mask instructions
• Promotions of many AVX and AVX2 instructions to AVX-512
− 32-bit and 64-bit floating-point instructions from AVX
− Scalar and 512-bit
− 32-bit and 64-bit integer instructions from AVX2
• Many new instructions to speedup HPC workloads
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice32
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice33
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Gather & Scatter
VMOVDQU64 zmm1, Q[rsi]
VMOVDQU64 zmm2, R[rsi]
VGATHERQQ zmm0 {k2}, [rax+zmm1*8]
VSCATTERQQ [rax+zmm2*8] {k3}, zmm0
D/Q/SP/DP element typesD/Q indicesInstruction can partially execute
k-reg Mask used as completion mask
for(j=0, i=0; i<N; i++)
{
B[R[i]] = A[Q[i]];
}
Q RA B
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
• The problem statement!
• Intel® Xeon® Processor Architecture Basics
• Intel® Xeon Phi™ (Knights Landing)
• Software Tools for Developers
• Summary
35
Agenda
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Petascale Today
Intel® Cluster Studio XE
36
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Intel® Cluster Studio XE 2018
37
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
38
Leading Supporter of
Industry StandardsC++, C, Fortran, OpenMP,
MPI, TBB
More Efficient
Software Development
Programmer Productivity
Preservation of Investment
Intel
Common model method
Scaling Consistency Multicore to Many-core
same programming models, programming languages, tools
Intel® Parallel Studio XE
Tools to CreateHighly optimizing Compilers,
High Performance Libraries,Designed for Performance and Analysis
Intel® Parallel Studio XE
Tools for AnalysisThreading Analysis & Debug
MPI Analysis & Debug
Vector Analysis & Debug
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
39
Compiler Support
Intel Compilers
GNU Compilers
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
40
Intel MKL library
Intel Math Kernel Library
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
41
Small Code Example – Writing Code
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
42
MPI Library…
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
43
Scaling tests of NPB – Pure MPI
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
44
Scaling tests of NPB – OpenMP
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
45
Scaling of HYDRO Code
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
Scaling of GROMACS
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
47
Compiler, Runtime Flags: GROMACS
Compiler Flags
Runtime Settings
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
48
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
49
Over 40 applications optimized for Intel® Xeon Phi™ processor family are available, with up to 6.48X (1.99X average) performance improvement1
Intel® Xeon Phi™ processor 7250 relative performance normalized to baseline (1) of a 2 socket Intel® Xeon® processor E5-2697 v4)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components,
software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the
performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance
UPDATED
No
rma
lize
d R
esu
lts
11
.56
1.6
01
.70 2.1
83
.89
0.9
61
.11
1.2
01
.35
1.3
81
.56
2.9
43
.34
1.2
21
.23
1.3
21
.36
1.4
31
.75
1.8
3 2.3
62
.50
2.6
7
1.1
61
.23
1.4
11
.41
1.5
31
.70
1.7
11
.77
2.0
02
.10
1.2
81
.37
1.4
51
.49
1.5
01
.59
1.6
91
.71
1.8
01
.81
2.0
02
.13
2.3
42
.37
2.4
52
.50
2.8
0
1.1
71
.38
1.5
81
.58
1.8
2 2.5
4
1.2
31
.35
1.3
81
.50
1.6
31
.67
1.7
11
.72
1.7
73
.65
1.2
71
.31
3.3
6 4.0
34
.25
4.6
56
.48
0
1
2
3
4
5
6
2s
Xe
on
® E
5-2
69
7 v
4
Lin
pa
ck
DG
EM
M
SG
EM
M
HP
CG
ST
RE
AM
Tri
ad
Tin
ity
- S
NA
P
Tri
nit
y -
GT
C
Tri
nit
y -
min
iDF
T
Tri
nit
y -
MIL
C
Tri
nit
y -
AM
G
Tri
nit
y -
UM
T
Tri
nit
y -
min
iFE
Tri
nit
y -
min
iGh
ost
GR
OM
AC
S
NA
MD
ap
oa
1
RE
LIO
N
NA
MD
stm
v
Co
ars
e-G
rain
Wa
ter
Sim
ula
tio
n…
Am
be
r N
ucl
eo
som
e
Am
be
r P
oli
o V
iru
s
RO
ME
/SM
L
Am
be
r 2
w4
9
Am
be
r R
ub
isco
IFS
PA
PS
14
MP
AS
-O
MA
SN
UM
Wa
ve
PO
P
HO
MM
E
DM
I H
IRO
MB
-BO
OS
WR
F C
ON
US
12
KM
GN
AQ
PM
S
NIM
NE
MO
SP
EC
FE
M_
3D
GL
OB
E 6
no
de
s -
55
K…
SP
EC
FE
M_
3D
GL
OB
E 2
5 n
od
es
-…
Se
isS
ol
- M
7.2
19
92
La
nd
ers
…
Se
isS
ol
- M
ou
nt
Me
rap
i LT
S M
R2
Op
en
LB
Se
isS
ol
- M
ou
nt
Me
rap
i GT
S
MIL
C
iso
3D
FD
PE
TS
c
So
ft S
ph
ere
VL
PL
-S
Qp
hiX
Wil
son
DS
LA
SH
Clo
ve
rLe
af
3D
Clo
ve
rLe
af
2D
Qp
hiX
CG
YA
SK
- is
o3
DF
D
YA
SK
- A
WP
-OD
C
Qu
an
tum
ES
PR
ES
SO
Be
rke
ley
GW
PW
MA
T -
Ga
As1
60
PW
MA
T -
Ga
As6
4
VA
SP
CP
2K
GE
Ta
com
a
HiF
UN
Op
en
FO
AM
Mo
torB
ike
4M
Ce
lls
Op
en
LB
Op
en
FO
AM
Mo
torB
ike
20
M C
ell
s
Op
en
FO
AM
Dri
vAe
r ca
r 1
0M
Ce
lls
:…
Op
en
FO
AM
Mo
torB
ike
11
M C
ell
s
Op
en
FO
AM
Dri
vAe
r ca
r 1
0M
Ce
lls
:…
NA
SA
Ov
erf
low
TA
CC
LB
3D
Bin
om
ial
Op
tio
ns
Bla
ckS
cho
les
SP
Bla
ckS
cho
les
DP
BA
W A
me
rica
n O
pti
on
s…
Mo
nte
Ca
rlo
Eu
rop
ea
n O
pti
on
s S
P
Mo
nte
Ca
rlo
Eu
rop
ea
n O
pti
on
s D
P
BA
W A
me
rica
n O
pti
on
s…
Benchmarks Life Science
Weather and Climate Physics / Geophysics / Energy
Manufacturing
Financial Services
Material Science
Trinity Benchmarks
1 – As demonstrated by respective proof points in this presentation
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.Optimization Notice
50