Numerical reproducibility for exascale HPC
Chemseddine CHOHRA
Université de Perpignan Via Domitia (UPVD)
16 June 2014
Chemseddine CHOHRA (Université de Perpignan Via Domitia (UPVD))Numerical reproducibility for exascale HPC 16 Juin 2014 1 / 67
Introduction and problematic
Limited machine precision.
  Floating-point numbers are used as approximations.
  x → X = fl(x) if x ∉ F, or X = x if x ∈ F.
  In general, X + Y ≠ X ⊕ Y = fl(X + Y).
Non-associativity of addition.
  In general, A ⊕ (B ⊕ C) ≠ (A ⊕ B) ⊕ C.
  For instance, with M = 2^53: (-M ⊕ M) ⊕ 1 ≠ -M ⊕ (M ⊕ 1).
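The M = 2^53 example can be checked directly. This is a minimal sketch, assuming Python floats are IEEE 754 binary64 (true on common platforms); the variable names are illustrative only.

```python
# Non-associativity of floating-point addition in binary64.
M = 2.0 ** 53  # from 2^53 on, consecutive doubles are 2 apart

left = (-M + M) + 1.0   # (-M ⊕ M) ⊕ 1: the cancellation happens first
right = -M + (M + 1.0)  # -M ⊕ (M ⊕ 1): M + 1 is a tie and rounds back to M

print(left, right)      # prints "1.0 0.0"
```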
Figure 1.1 : No reproducibility of summation
Non-reproducibility of summation on parallel systems.
  Problem for debugging.
  Problem for validating results.
Guarantee reproducibility for BLAS.
  Level 1: max, min, scal, axpy, norm, asum, dot.
  dot can be transformed into a summation problem.
Outline
1 Introduction and problematic
2 Solution
3 Optimization
4 Parallelism
5 Conclusion and Work in progress
Solution
Subsections: AccSum, FastAccSum, iFastSum, HybridSum, OnlineExact, Compare algorithms.
Ensure reproducibility.
  1 Static scheduling and deterministic reduction.
  2 Demmel and Nguyen's solutions (2013).
Use an exact summation algorithm (always reproducible).
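To illustrate why an exact sum is always reproducible, here is a minimal sketch, assuming binary64 floats. The helper names `naive_sum` and `chunked_sum` are hypothetical; the chunk count stands in for the number of threads, and `math.fsum` (a correctly rounded sum from the Python standard library) stands in for an exact summation algorithm.

```python
import math

def naive_sum(xs):
    """Left-to-right recursive summation, one rounding per addition."""
    total = 0.0
    for x in xs:
        total += x
    return total

def chunked_sum(xs, chunks):
    """Sum contiguous chunks separately, then reduce the partial sums in order."""
    size = (len(xs) + chunks - 1) // chunks
    partials = [naive_sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    return naive_sum(partials)

values = [1e16, 1.0, -1e16, 1.0]  # the exact sum is 2
for chunks in (1, 2):
    print(chunks, chunked_sum(values, chunks))  # the result depends on the chunking
print(math.fsum(values))                        # correctly rounded, independent of order
```

With a fixed chunking the result is deterministic but still depends on the number of chunks; the correctly rounded sum does not.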
Exact summation
How to calculate a RTN (rounded to nearest) sum faster?
How to calculate it at all has been known since the 1970s.
Several algorithms have been proposed.
  FastSum (2006).
  AccSum (2008).
  FastAccSum (2008).
  iFastSum (2009).
  HybridSum (2009).
  OnlineExact (2010).
Error-free transformation of two floats
Algorithms to compute rounding errors.
TwoSum: requires 6 flops.
FastTwoSum: requires 3 flops but ordered summands (|A| ≥ |B|).
TwoSum(A, B) = (S, E) such that A ⊕ B = S and A + B = S + E.
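The two transformations can be sketched as follows. This is an illustrative version assuming binary64 arithmetic in round-to-nearest; the function names are chosen here, and `Fraction` is used only to check the identity A + B = S + E exactly.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: 6 flops, no assumption on the order of a and b."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def fast_two_sum(a, b):
    """Dekker's FastTwoSum: 3 flops, assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e

s, e = two_sum(0.1, 0.2)
# s is the rounded sum, e the rounding error; s + e equals a + b exactly.
print(Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2))
```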
TwoSum and FastTwoSum
Exact summation
Given a vector of n floating-point numbers, we present some exact summation algorithms.
Solution AccSum
AccSum
Iterative algorithm.
Based on an error-free vector transformation.
Adapts automatically to the condition number of the sum.
Extracts and sums the high-order parts in each iteration.
Requires 4n flops per iteration.
ExtractScalar
Figure 2.1 : ExtractScalar
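Since the figure is not reproduced here, the extraction step can be sketched in code. This is a minimal illustration, assuming binary64 round-to-nearest, sigma a power of two, and |p| well below sigma; the function name and the sample values are chosen for this sketch.

```python
from fractions import Fraction

def extract_scalar(sigma, p):
    """Split p into a high part q aligned on sigma's grid and a low remainder r."""
    q = (sigma + p) - sigma  # the bits of p below sigma's last bit are rounded away
    r = p - q                # what was rounded away, recovered exactly
    return q, r

sigma = 2.0 ** 10
p = 0.1
q, r = extract_scalar(sigma, p)
print(q, r)
print(Fraction(q) + Fraction(r) == Fraction(p))  # the split is error-free
```

Because every high part q lies on the same coarse grid, many of them can be added without rounding error, which is what AccSum relies on in each iteration.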
n       cond     Iterations
10^3    10^8     2
10^3    10^24    3
10^6    10^24    4
Table 2.1: Number of iterations of the AccSum algorithm
Solution FastAccSum
FastAccSum
Improvement of AccSum.
FastAccSum requires only 3n flops per iteration, so it is theoretically 25% faster.
FastAccSum VS AccSum
Table 2.2 : Ratio of computing times AccSum / FastAccSum
Solution iFastSum
iFastSum
Pure distillation algorithm.
Deletes zeros at the end of each iteration to reduce the size of the vector.
Solution HybridSum
HybridSum
Splits the summands so that standard floating-point numbers can be used as high-precision accumulators.
Accumulates the summands with the same exponent in an appropriate accumulator.
Uses iFastSum to sum the intermediate accumulators.
Solution OnlineExact
OnlineExact
Uses the same idea as HybridSum, but with two floating-point numbers per accumulator instead of splitting the summands.
Solution Compare algorithms
Hardware
Two sockets.
Xeon E5 (2.2 GHz, 8 cores).
Cache:
  L1: 32 KB.
  L2: 256 KB.
  L3: 20 MB, shared.
Maximum memory bandwidth: 51.2 GB/s.
Turbo Boost and multithreading are turned off.
Compiler
ICC -O3 -axCORE-AVX-I -fp-model double -fp-model strict -funroll-all-loops
-axCORE-AVX-I: selects the instruction set.
-fp-model double: rounds intermediate results to 53-bit precision.
-fp-model strict: disables contractions.
-funroll-all-loops: unrolls loops.
Compare algorithms for cond = 10^8
Compare algorithms for cond = 10^32
Compare algorithms for different condition numbers
Runtime of AccSum for entries with different condition numbers
Runtime of FastAccSum for entries with different condition numbers
Runtime of HybridSum for entries with different condition numbers
Runtime of OnlineExact for entries with different condition numbers
Optimization
Subsections: HybridSum, OnlineExact.
Optimization HybridSum
HybridSum
ALGORITHM HybridSum.
INPUT: A, an array of floating-point summands.
OUTPUT: S, the correctly rounded sum of A.
BEGIN
  1 Declare an intermediate array C.
  2 FOREACH element a of A DO
      1 split(a, ah, al).
      2 i = exponent(ah).
      3 C[i] += ah.
      4 i = exponent(al).
      5 C[i] += al.
    END FOREACH.
  3 RETURN iFastSum(C).
END.
Optimized version, obtained in three successive steps: the loop handles 8 elements of A at a time, data for the next iterations is prefetched, and the exponent of al is derived from that of ah (i - 27) instead of being extracted again.
ALGORITHM HybridSum.
INPUT: A, an array of floating-point summands.
OUTPUT: S, the correctly rounded sum of A.
BEGIN
  1 Declare an intermediate array C.
  2 FOREACH 8 elements of A as a DO
      1 prefetch data for the next loops.
      2 split(a, ah, al).
      3 i = exponent(ah).
      4 C[i] += ah.
      5 i = i - 27.
      6 C[i] += al.
    END FOREACH.
  3 RETURN iFastSum(C).
END.
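The HybridSum idea can be sketched in a few lines. This is an illustration under stated assumptions, not the optimized implementation from the slides: it uses the Veltkamp split with the constant 2^27 + 1 for binary64, a dictionary indexed by `math.frexp` exponents in place of the array C, extracts the exponent of each half separately (as in the unoptimized version), and uses `math.fsum` as a stand-in for iFastSum. Overflow in the split for very large inputs is ignored.

```python
import math

SPLITTER = 2.0 ** 27 + 1.0  # Veltkamp splitting constant for binary64

def split(a):
    """Veltkamp split: a == hi + lo exactly, each half has at most 26 significant bits."""
    c = SPLITTER * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def hybrid_sum(values):
    acc = {}  # exponent -> ordinary float used as an exact accumulator
    for a in values:
        for part in split(a):
            if part != 0.0:
                e = math.frexp(part)[1]
                # Halves with the same exponent add without rounding error
                # as long as not too many of them share one accumulator.
                acc[e] = acc.get(e, 0.0) + part
    # Stand-in for iFastSum: correctly rounded sum of the accumulators.
    return math.fsum(acc.values())

print(hybrid_sum([1e16, 1.0, -1e16, 1.0]))
```

The short halves are what make a plain double behave like a wide accumulator: roughly 27 spare bits remain for carries before any rounding can occur.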
Progress in optimization of HybridSum
Optimization OnlineExact
OnlineExact
ALGORITHM OnlineExact.
INPUT: A, an array of floating-point summands.
OUTPUT: S, the correctly rounded sum of A.
BEGIN
  1 Declare two intermediate arrays C1, C2.
  2 FOREACH element a of A DO
      1 i = exponent(a).
      2 (C1[i], a) = 2Sum(C1[i], a).
      3 C2[i] += a.
    END FOREACH.
  3 RETURN iFastSum(C1 ∪ C2).
END.
Optimized version, obtained in three successive steps: the loop handles 8 elements of A at a time, data for the next iterations is prefetched, and the two arrays C1 and C2 are interleaved into a single array C (C[2*i] and C[2*i+1]).
ALGORITHM OnlineExact.
INPUT: A, an array of floating-point summands.
OUTPUT: S, the correctly rounded sum of A.
BEGIN
  1 Declare an intermediate array C.
  2 FOREACH 8 elements of A as a DO
      1 prefetch data for the next loops.
      2 i = exponent(a).
      3 (C[2*i], a) = 2Sum(C[2*i], a).
      4 C[2*i+1] += a.
    END FOREACH.
  3 RETURN iFastSum(C).
END.
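A comparable sketch for OnlineExact, under the same assumptions as before: binary64 round-to-nearest, a dictionary of [sum, error] pairs in place of the interleaved array C, `math.frexp` for the exponent, and `math.fsum` as a stand-in for iFastSum. It is not the unrolled and prefetched implementation measured in the slides.

```python
import math

def two_sum(a, b):
    """Error-free transformation: returns (s, e) with s = fl(a + b) and a + b = s + e."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e

def online_exact(values):
    acc = {}  # exponent -> [running sum, accumulated rounding errors]
    for a in values:
        if a == 0.0:
            continue
        i = math.frexp(a)[1]
        pair = acc.setdefault(i, [0.0, 0.0])
        pair[0], err = two_sum(pair[0], a)  # first accumulator takes the rounded sum
        pair[1] += err                      # second accumulator collects the errors
    # Stand-in for iFastSum over all accumulators.
    return math.fsum(x for pair in acc.values() for x in pair)

print(online_exact([1e16, 1.0, -1e16, 1.0]))
```

Keeping the two accumulators of one exponent next to each other mirrors the C[2*i], C[2*i+1] layout, which presumably helps cache locality.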
Progress in optimization of OnlineExact
Parallelism
Subsections: Implementation and tests.
Architecture
OpenMP
latency
bandwidth
MPI
Hybrid
Parallelism Implementation and tests
Parallel HybridSum
Parallel OnlineExact
HybridSum
Figure 4.1 : Scaling of HybridSum
OnlineExact
Figure 4.2 : Scaling of OnlineExact
Scaling
Algorithm              1 core      2 cores     4 cores    8 cores    16 cores
HybridSum (cycles)     249855884   141459028   72300156   37207844   19066140
OnlineExact (cycles)   259167668   156856764   91386036   46004832   23156420
Hybrid / seq           1           0.5661      0.2893     0.1489     0.0763
Online / seq           1           0.6052      0.3526     0.1775     0.0893
Online / Hybrid        1.0372      1.1088      1.2639     1.2364     1.2145
Table 4.1: HybridSum vs OnlineExact
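The ratio rows follow directly from the cycle counts. As a small worked example, the sketch below recomputes speedup (1-core cycles divided by p-core cycles) and parallel efficiency (speedup divided by p) from the measured values in Table 4.1; the helper name `speedups` is chosen for this illustration.

```python
cores = [1, 2, 4, 8, 16]
cycles = {
    "HybridSum": [249855884, 141459028, 72300156, 37207844, 19066140],
    "OnlineExact": [259167668, 156856764, 91386036, 46004832, 23156420],
}

def speedups(name):
    """Speedup over the 1-core run for each core count."""
    base = cycles[name][0]
    return [base / t for t in cycles[name]]

for name in cycles:
    for p, s in zip(cores, speedups(name)):
        print(f"{name:12s} {p:2d} cores: speedup {s:5.2f}, efficiency {s / p:4.2f}")
```

On 16 cores this gives an efficiency of roughly 0.82 for HybridSum and 0.70 for OnlineExact.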
HybridSum weak scaling
Data size = 2^20 * number of cores
OnlineExact weak scaling
Data size = 2^20 * number of cores
Compare to other algorithms
Optimized sum.
  dasum: optimized by Intel in the MKL library.
Reproducible solutions.
  ReprodSum: guarantees reproducibility of results (based on "AccSum").
  FastReprodSum: faster than ReprodSum but requires a directed rounding mode (based on "FastAccSum").
ReprodSum
Figure 4.3 : How does it work ?
Sequential results
1 FastReprodSum is 2.2 times slower than dasum.
2 ReprodSum is 2.8 times slower than dasum.
3 Hybrid and Online are 4 times slower than dasum.
4 Hybrid and Online are 2 times slower than FastReprodSum.
5 Hybrid and Online are 1.5 times slower than ReprodSum.
4 cores parallel results
1 The same ratios except for OnlineExact.
2 Poor scaling of OnlineExact.
16 cores parallel results
1 HybridSum is as fast as ReprodSum.
2 This is due to the memory bandwidth limit.
Scaling
Algorithm       1 core   2 cores   4 cores   8 cores   16 cores
HybridSum       1        1.72      3.46      6.72      13.11
OnlineExact     1        1.66      2.84      5.63      11.20
FastReprodSum   1        1.92      3.46      6.51      8.68
ReprodSum       1        1.95      2.84      5.63      9.43
dasum           1        1.89      3.47      5.21      7.80
Table 4.2: Scaling of the algorithms relative to the number of cores
Scaling of ReprodSum
Scaling of FastReprodSum
Scaling of dasum
Conclusion and Work in progress
Conclusion
More precision requires more computing time.
The fastest algorithms are neither precise nor reproducible.
We are trying to develop a reproducible BLAS that guarantees the best precision/performance ratio.
Work in progress
Tests on machines with more sockets and cores.
Generalize to dot.
Auto tuning.