the vectorization of the terso multi-body potentialhpac.cs.umu.se/ipcc/sc16_talk.pdfthe...
TRANSCRIPT
The Vectorization of the Tersoff Multi-Body PotentialAn Exercise in Performance Portability
Markus Hohnerbach Ahmed E. Ismail Paolo Bientinesi
SC’16
1 / 31
2 / 31
2 / 31
3 / 31
LAMMPS
CPU, Xeon Phi GPU
DefaultUSER-OMP
USER-INTEL
KOKKOS GPU
Tersoff(Default)
Tersoff(Ref)
Tersoff(Ref)
Tersoff(Ref)
4 / 31
LAMMPS
CPU, Xeon Phi GPU
DefaultUSER-OMP
USER-INTEL
KOKKOS GPU
Tersoff(Default)
Tersoff(Ref)
Tersoff(Ref)
Tersoff (Opt)
Vectorization Layer
4 / 31
5 / 31
V =∑i
∑j :??
V (i , j)
6 / 31
V =∑i
∑j
V (i , j)
7 / 31
V =∑i
∑j∈Ni
V (i , j)
j ∈ Ni ⇔: rij < r cutoffij ; V (i , j) = 0 ∀j /∈ Ni
8 / 31
V =∑i
∑j∈Si
V (i , j)
j ∈ Si ⇔: rij < r cutoffij + r skin,Ni ⊂ Si
9 / 31
V two−body =∑i
∑j∈Si
V (i , j)
VTersoff =∑i
∑j∈Si
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
V (i , j , ζ) = fC (rij)[fR(rij) + (1 + βνζν)−
12ν fA(rij)
]. . .
10 / 31
V two−body =∑i
∑j∈Si
V (i , j)
VTersoff =∑i
∑j∈Si
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
V (i , j , ζ) = fC (rij)[fR(rij) + (1 + βνζν)−
12ν fA(rij)
]. . .
10 / 31
V two−body =∑i
∑j∈Si
V (i , j)
VTersoff =∑i
∑j∈Si
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
V (i , j , ζ) = fC (rij)[fR(rij) + (1 + βνζν)−
12ν fA(rij)
]. . .
10 / 31
Fi = −∇xiV
11 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
Compute potential
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
Compute potential
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
Compute potential
Forces directly due to V
12 / 31
VTersoff =∑i
∑j∈Sj
V (i , j ,∑
k∈Si\{j}
ζ(i , j , k))
Fi = −∇xiV
for i dofor j ∈ Si do
ζij ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);for k ∈ Si \ {j} do
Fi ← Fi−δζ · ∂xi ζ(i , j , k);Fj ← Fj−δζ · ∂xj ζ(i , j , k);
Fk ← Fk−δζ · ∂xk ζ(i , j , k);
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
Compute potential
Forces directly due to V
Chain Rule for ζ
12 / 31
WM SB HW HW2 BW0
2
4
6
ns/
day
CPU: Single Node Execution (512 000 atoms)
Ref-1N Opt-M-1N
Name Processor Cores Vector ISAWM Intel Xeon X5675 2 × 6 SSE4.2SB Intel Xeon E5-2450 2 × 8 AVXHW Intel Xeon E5-2680v3 2 × 12 AVX2HW2 Intel Xeon E5-2697v3 2 × 14 AVX2BW Intel Xeon E5-2697v4 2 × 18 AVX2 13 / 31
WM SB HW HW2 BW0
2
4
6
3.18x5.00x
3.15x
2.69x
2.95x
ns/
day
CPU: Single Node Execution (512 000 atoms)
Ref-1N Opt-M-1N
Name Processor Cores Vector ISAWM Intel Xeon X5675 2 × 6 SSE4.2SB Intel Xeon E5-2450 2 × 8 AVXHW Intel Xeon E5-2680v3 2 × 12 AVX2HW2 Intel Xeon E5-2697v3 2 × 14 AVX2BW Intel Xeon E5-2697v4 2 × 18 AVX2 13 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
Forces directly due to V
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
Forces directly due to V
Chain Rule for ζ
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
Forces directly due to V
Chain Rule for ζ
Updates for Fi and Fj
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
Forces directly due to V
Chain Rule for ζ
Updates for Fi and Fj
Updates for Fk
14 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
Loop over all atoms
Loop over atoms “closeby” to i
Compute ζ
And derivates of ζ
Compute potential
Forces directly due to V
Chain Rule for ζ
Updates for Fi and Fj
Updates for Fk
14 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option A
#pragma omp parallel for
for (int i = ...
#pragma omp simd
for (int j = ...
...
15 / 31
Option B
#pragma omp parallel for simd collapse(2)
for (int i = ...
for (int j = ...
...
16 / 31
Option B
#pragma omp parallel for simd collapse(2)
for (int i = ...
for (int j = ...
...
16 / 31
Option B
#pragma omp parallel for simd collapse(2)
for (int i = ...
for (int j = ...
...
16 / 31
Option B
#pragma omp parallel for simd collapse(2)
for (int i = ...
for (int j = ...
...
16 / 31
Option B
#pragma omp parallel for simd collapse(2)
for (int i = ...
for (int j = ...
...
16 / 31
Option C
#pragma omp parallel for simd
for (int i = ...
for (int j = ...
...
17 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k ;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂kζ + ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
18 / 31
for i dofor j ∈ Si do
ζij ← 0; ∂kζ ← 0 ∀k ;∂iζ ← 0; ∂jζ ← 0;for k ∈ Si \ {j} do
ζij ← ζij + ζ(i , j , k);∂iζ ← ∂iζ + ∂xi ζ(i , j , k);∂jζ ← ∂jζ + ∂xj ζ(i , j , k);
∂kζ ← ∂kζ + ∂xk ζ(i , j , k);
V ← V + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);
δζ ← ∂ζV (i , j , ζij);Fi ← Fi − δζ · ∂iζ;Fj ← Fj − δζ · ∂jζ;for k ∈ Si \ {j} do
Fk ← Fk − δζ · ∂kζ;
18 / 31
0 5 10 15#Lane
←Ti
me
0 5 10 15#Lane
←Ti
me
Not ready to compute Ready to computeComputing Lane done All lanes done
19 / 31
0 5 10 15#Lane
←Ti
me
0 5 10 15#Lane
←Ti
me
Not ready to compute Ready to computeComputing Lane done All lanes done
20 / 31
Implementation
21 / 31
I Vector-Wide conditionals: Do all elements in a vector satisfy a condition?
int o = 1;
for (int i = 0; i < VL; i++) if (! a[i]) o = 0;
I Reductions: Sum up all elements in a vector.
double o = 0;
for (int i = 0; i < VL; i++) o += a[i];
I Conflict write handling: Accumulate values from vector into memory with indices fromvector. for (int i = 0; i < VL; i++) mem[b[i]] += a[i];
I Adjacent gather optimizations: Write data from indices i1, i2, . . . into vector a, anddata from i1 + 1, i2 + 1, . . . into b.
for (i = 0; i < VL; i++) { j = idx[i];
a[i] = mem[j]; b[i] = mem[j+1]; }
22 / 31
I Vector-Wide conditionals: Do all elements in a vector satisfy a condition?
int o = 1;
for (int i = 0; i < VL; i++) if (! a[i]) o = 0;
I Reductions: Sum up all elements in a vector.
double o = 0;
for (int i = 0; i < VL; i++) o += a[i];
I Conflict write handling: Accumulate values from vector into memory with indices fromvector. for (int i = 0; i < VL; i++) mem[b[i]] += a[i];
I Adjacent gather optimizations: Write data from indices i1, i2, . . . into vector a, anddata from i1 + 1, i2 + 1, . . . into b.
for (i = 0; i < VL; i++) { j = idx[i];
a[i] = mem[j]; b[i] = mem[j+1]; }
22 / 31
I Vector-Wide conditionals: Do all elements in a vector satisfy a condition?
int o = 1;
for (int i = 0; i < VL; i++) if (! a[i]) o = 0;
I Reductions: Sum up all elements in a vector.
double o = 0;
for (int i = 0; i < VL; i++) o += a[i];
I Conflict write handling: Accumulate values from vector into memory with indices fromvector. for (int i = 0; i < VL; i++) mem[b[i]] += a[i];
I Adjacent gather optimizations: Write data from indices i1, i2, . . . into vector a, anddata from i1 + 1, i2 + 1, . . . into b.
for (i = 0; i < VL; i++) { j = idx[i];
a[i] = mem[j]; b[i] = mem[j+1]; }
22 / 31
I Vector-Wide conditionals: Do all elements in a vector satisfy a condition?
int o = 1;
for (int i = 0; i < VL; i++) if (! a[i]) o = 0;
I Reductions: Sum up all elements in a vector.
double o = 0;
for (int i = 0; i < VL; i++) o += a[i];
I Conflict write handling: Accumulate values from vector into memory with indices fromvector. for (int i = 0; i < VL; i++) mem[b[i]] += a[i];
I Adjacent gather optimizations: Write data from indices i1, i2, . . . into vector a, anddata from i1 + 1, i2 + 1, . . . into b.
for (i = 0; i < VL; i++) { j = idx[i];
a[i] = mem[j]; b[i] = mem[j+1]; }
22 / 31
I Vector-Wide conditionals: Do all elements in a vector satisfy a condition?
int o = 1;
for (int i = 0; i < VL; i++) if (! a[i]) o = 0;
I Reductions: Sum up all elements in a vector.
double o = 0;
for (int i = 0; i < VL; i++) o += a[i];
I Conflict write handling: Accumulate values from vector into memory with indices fromvector. for (int i = 0; i < VL; i++) mem[b[i]] += a[i];
I Adjacent gather optimizations: Write data from indices i1, i2, . . . into vector a, anddata from i1 + 1, i2 + 1, . . . into b.
for (i = 0; i < VL; i++) { j = idx[i];
a[i] = mem[j]; b[i] = mem[j+1]; }
22 / 31
for (; kk < numneigh_i; kk++) {
int k = firstneigh[kk + cnumneigh_i] & NEIGHMASK;
fvec vx_k(x[k].x);
fvec vy_k(x[k].y);
fvec vz_k(x[k].z);
int w_k = x[k].w;
fvec vdx_ik = vx_k - vx_i;
fvec vdy_ik = vy_k - vy_i;
fvec vdz_ik = vz_k - vz_i;
fvec vrsq = vdx_ik * vdx_ik + vdy_ik * vdy_ik
+ vdz_ik * vdz_ik;
...
if (! v::mask_testz(veff_mask)) {
fvec vzeta_contrib = ...;
vzeta = v::acc_mask_add(vzeta, veff_mask,
vzeta, vzeta_contrib);
}
}23 / 31
Results
24 / 31
ARM WM SB HW0
1
2
3
ns/
day
CPU: Single-Threaded Execution (32 000 atoms)
Ref-1T Opt-D-1T Opt-S-1T Opt-M-1T
Name Processor Cores Vector ISA
ARM ARM Cortex-A15 2 × 41 NEONWM Intel Xeon X5675 2 × 6 SSE4.2SB Intel Xeon E5-2450 2 × 8 AVXHW Intel Xeon E5-2680v3 2 × 12 AVX2
25 / 31
WM SB HW HW2 BW0
2
4
6
3.18x5.00x
3.15x
2.69x
2.95x
ns/
day
CPU: Single Node Execution (512 000 atoms)
Ref-1N Opt-M-1N
Name Processor Cores Vector ISAWM Intel Xeon X5675 2 × 6 SSE4.2SB Intel Xeon E5-2450 2 × 8 AVXHW Intel Xeon E5-2680v3 2 × 12 AVX2HW2 Intel Xeon E5-2697v3 2 × 14 AVX2BW Intel Xeon E5-2697v4 2 × 18 AVX2 26 / 31
K20x K400
0.5
1
1.5
ns/
day
GPU (256 000 atoms)
Ref-GPU-D Ref-GPU-S Ref-GPU-MRef-KK-D Opt-KK-D
Name CPU Cores ISA AcceleratorK20X Intel Xeon E5-2650 2 × 8 AVX Nvidia Tesla K20xK40 Intel Xeon E5-2650 2 × 8 AVX Nvidia Tesla K40
27 / 31
KNC KNL0
2.5
5
7.5
4.71x
5.94x
ns/
day
Native Execution on Xeon Phi Systems (512 000 atoms)
Ref Opt-M
Name CPU Cores ISA Accelerator Cores ISAIV+2KNC Intel Xeon E5-2650v2 2 × 8 AVX Intel Xeon Phi 5110P 2 × 60 IMCIKNL – – – Intel Xeon Phi 7250 68 AVX-512
28 / 31
1 2 4 80
1.5
3
4.5
#Nodes
ns/
day
SuperMIC: Strong Scalability (2 million atoms)
Ref (IV) Opt-D (IV) Opt-D (IV+2KNC)
Name CPU Cores ISA Accelerator Cores ISAIV+2KNC Intel Xeon E5-2650v2 2 × 8 AVX Intel Xeon Phi 5110P 2 × 60 IMCI
29 / 31
Conclusion
I Optimized Tersoff on a range of architectures
I Handling short loops (neighbor list)
I Identified necessary primitives
I Implemented using C++ abstraction
I Experimented on many architectures
I Achieved performance
I Next up: REBO/AIREBO
30 / 31
Done. Time for your questions...
Markus Hohnerbach Ahmed E. Ismail Paolo Bientinesi
http://github.com/hpac/lammps-tersoff-vector
31 / 31