
Page 1:


CIS 602-01: Computational Reproducibility

Numerical Reproducibility

Dr. David Koop

Page 2:

Tools
• We have seen specific tools that address particular topics:
  - Versioning and Sharing: Git, GitHub
  - Data Availability and Citation: DOIs, Dryad, DataONE, figshare
  - Virtual Machines and the Cloud: Xen, Parallels, EC2
  - Containers: Docker
  - Scientific Workflows: Pegasus, Kepler, VisTrails, Taverna
  - Provenance: PDIFF, Analogies, VisComplete
• More at ReproMatch
• Putting them together?
  - Packaging: CDE, ReproZip
  - Versioning: Sumatra, noWorkflow


Page 3:

Sumatra Reproducibility Issues
• Complexity:
  - dependence on small details
  - small changes have big effects
• Entropy:
  - computing environment
  - library versions change over time
• Human memory limitations:
  - forgetting
  - implicit knowledge not passed on


[Davison, 2012]

Page 4:

Sumatra Reproducibility Solutions
• Complexity: use/teach good software-engineering practices
  - loose coupling
  - testing
• Entropy: plan for reproducibility from the start
  - run in different environments
  - write tests
  - record dependencies
• Human memory limitations:
  - record everything


[Davison, 2012]

Page 5:

Running Sumatra

$ python main.py default.param

$ smt run --executable=python --main=main.py default.param

$ smt configure --executable=python --main=main.py

$ smt run default.param

• Configure sets default executable for iterative development
• Detect every change & prompt user (or record diff automatically)

$ smt run default.param

Code has changed, please commit your changes.

$ smt configure --on-changed=store-diff

$ smt run default.param


[Davison, 2012]

Page 6:

[Flow diagram: create new record; check "has the code changed?" and apply the code-change policy (yes → store diff, or raise an exception on error; no → continue); find dependencies; get platform information; run simulation/analysis; record time taken; find new files; add tags; save record. Caption: Steps in capturing experiments in Sumatra. A sketch of this loop follows.]
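A minimal Python sketch of such a capture loop, under stated assumptions: this is not Sumatra's actual code, and the record layout, the git-based diff check, and the directory-listing file detection are all illustrative only.

import os
import platform
import subprocess
import sys
import time

def capture_run(get_diff, cmd, on_changed="error", tags=()):
    """Sketch of a Sumatra-style capture loop.

    get_diff: callable returning the working-copy diff ('' if clean).
    cmd: command to run, e.g. ["python", "main.py", "default.param"].
    """
    record = {"command": cmd, "tags": list(tags)}

    # has the code changed? apply the code-change policy
    diff = get_diff()
    if diff:
        if on_changed == "store-diff":
            record["diff"] = diff  # store the diff with the record
        else:
            raise RuntimeError("Code has changed, please commit your changes.")

    # get platform information (a real tool also records library dependencies)
    record["platform"] = platform.platform()

    # run the simulation/analysis and record the time taken
    before = set(os.listdir("."))
    start = time.time()
    subprocess.run(cmd, check=True)
    record["duration_s"] = time.time() - start

    # find new files created by the run, then save the record
    record["new_files"] = sorted(set(os.listdir(".")) - before)
    return record

# Example usage (assumes the current directory is a git working copy):
if __name__ == "__main__":
    git_diff = lambda: subprocess.run(["git", "diff"],
                                      capture_output=True, text=True).stdout
    print(capture_run(git_diff, [sys.executable, "-c", "print('run')"],
                      on_changed="store-diff", tags=["demo"]))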


[Davison, 2012]

Page 7:

Record annotations and tags
• Important to include meaningful comments and tags for recall later

$ smt run --label=haggling --reason="determine whether the gourd is worth 3 or 4 shekels" romans.param

$ smt comment "apparently, it is worth NaN shekels."

$ smt tag "Figure 6"


[Davison, 2012]

Page 8:

Repeating past experiment

$ smt repeat haggling

The new record exactly matches the original.


[Davison, 2012]

Page 9:

Reviewing Provenance: Browser Interface


[Davison, 2012]

Page 10:

Reviewing Provenance


[Davison, 2012]

Page 11:

Sumatra Components
• Specific plugins for dependency finding, version control


[Davison, 2012]

Sumatra components

Page 12:

noWorkflow
• Goal: provenance analysis for scripts
• Python-specific tool, but the approach is general
• Like Sumatra, tracks changes in files and scripts
• Like CDE/ReproZip, intercepts file access & environment info (see the toy sketch below)
• Also does profiling and tracing to track runtime information
• python my_script.py → now run my_script.py
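A toy sketch of the interception idea, wrapping Python's built-in open to log file accesses. noWorkflow's real mechanism is more sophisticated; this only illustrates the general technique.

import builtins

_real_open = builtins.open
accessed = []  # provenance log of (path, mode) pairs

def logging_open(file, mode="r", *args, **kwargs):
    accessed.append((str(file), mode))  # record the access, then delegate
    return _real_open(file, mode, *args, **kwargs)

builtins.open = logging_open  # intercept subsequent open() calls
try:
    with open("demo.txt", "w") as f:
        f.write("hello")
    with open("demo.txt") as f:
        f.read()
finally:
    builtins.open = _real_open  # always restore the original

print(accessed)  # [('demo.txt', 'w'), ('demo.txt', 'r')]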


Page 13:

noWorkflow


[Murta et al., 2014]

Page 14:

noWorkflow Provenance Analysis


[Murta et al., 2014]

Page 15:

Assignment 2
• http://www.cis.umassd.edu/~dkoop/cis602/assignment2.html
• Keep your project on GitHub, keep images on Docker Hub
• Put a link to your Docker Hub images in your GitHub README.md
• Questions?
  - Running inside of Docker versus on your local machine
  - Mac OS X stdout: start VisTrails with VisTrails.command


Page 16:

Projects
• Report
  - Reproducibility:
    • Summary of the paper's main contributions
    • Reproducibility overview (what is available? does it work?)
    • Experiments: what experiments are being reproduced and what is the ground truth
    • Results: what was reproducible, what wasn't, and why
  - Research:
    • Like a standard research paper (see Lecture 2)


Page 17:

Projects
• Presentation (8-10 minutes):
  - Reproducibility:
    • Short summary of the paper
    • Summary of what is available, what works, etc.
    • Best: show reproducibility in action
    • Good: show side-by-side results
  - Research (focus on reproducibility aspects!):
    • Short motivation and introduction
    • Reproducibility requirements
    • Discuss results
    • Best: show the tool/reproducibility in action


Page 18:

Reproducibility Levels
• Data Availability
• Script / Workflow
• Lower-level Code
• Environment
• Actual Hardware


Page 19:

What happens at the hardware level that may not be reproducible?


Page 20:

Hardware and Reproducibility


[Pentium Chip with FDIV Bug, Photo by Konstantin Lanzet, CC BY-SA 3.0]

Page 21:


To simulate the deformation of a plane sheet to a real car body part with such a deep drawing process, my colleagues and I at GNS use the Indeed program (www.gns-mbh.com/?id=indeed), a software system specially designed to simulate sheet metal forming based on the finite element method (FEM). As an example, consider the part depicted in Figure 2, which is supposed to be manufactured from a 0.6-millimeter-thick sheet of steel. The part is about 350-mm long and 400-mm wide. The finite elements used for the simulation have an edge length of 4 mm.

At GNS, we repeatedly performed the simulation with different degrees of parallelism. Specifically, we varied the number of processors between one and four, and executed the simulation with four processors twice. All other input data were kept constant. We then observed the program's behavior and compared the output data. In this case, we chose the output data to be a variable that's particularly important in practice—namely, the local change of the sheet thickness. This variable is a crucial criterion in the process of deciding whether the manufactured part will in practice be suitable for its intended application. The user will typically have two threshold values for the sheet thickness. If the computed result is below the smaller of these values, the part will be considered insufficient and the deformation process must be modified to change this behavior. If the computed values are above the larger of the two values, the workpiece will be accepted, and if the computation produces results between the two given values then additional investigations are required to determine how to proceed.

Even though the algorithms used don't contain any stochastic elements, the results always differed from each other in all respects. In particular, it's not uncommon for results to move from one side of one threshold value to the other as the simulation is repeated. Thus, the way in which the engineers must proceed in developing their tools and the details of the tool control can also vary, depending on how the algorithms are executed on the given hardware.

To illustrate the variation of the program's behavior, let's look at two specific aspects. The first is the way in which its automatic time-step control works. The program decomposes the period of time required for the process in reality into individual time steps, and the simulation is performed in these discrete time steps. The sizes of the first two time steps are determined by the user (we always choose the same values for all our experiments); the following step lengths are automatically chosen in a deterministic way by the program depending on the convergence behavior of the previous step.

Figure 3 shows a visualization of the corresponding data in a graph that illustrates the share of the already-simulated process time with respect to the total process time over the number of time steps. It's evident that the behavior differs from program run to program run. In particular, the two curves corresponding to the two runs with four processors (that is, for two executions of the same program with absolutely identical input data in every respect) significantly differ from each other.

We've also compared the computed changes in sheet thickness. Here, I present only some selected values. Table 1 shows the computed extreme values of the sheet thickness change. Even in this small dataset, the significant differences are

Figure 2. The part to be manufactured. Manufactured from a 0.6-mm thick sheet of steel, the part is about 350-mm long and 400-mm wide. The finite elements used for the simulation have an edge length of 4 mm.

Figure 3. Behavior of the time-step control. The behavior differs across program runs.

[Figure 3 plot: share of total process time (%) versus number of time steps, with curves for 1, 2, and 3 processors and for two runs with 4 processors.]

Example: Sheet Metal Forming
• Simulate in software (Indeed)
• Uses a finite element method
• Manufacture from 0.6mm-thick steel
• 350mm long, 400mm wide
• Care about local change in sheet thickness (threshold values)


Page 22:


clearly visible. In particular, the maximal values vary more than the minimal values. The locations where these extremal values arise on the blank are exemplified in Figure 4 by a comparison of only two program runs. The figure shows the same section of the blank in both cases, and it's evident that not only the extremal value itself changes but also its location.

A detailed investigation of the program source and the intermediate results revealed that the reason for the differences that we observed was the nondeterministic behavior of the Pardiso subroutine [10] that was linked to the program code as a component of the Intel Math Kernel Library (http://software.intel.com/en-us/intel-mkl). Pardiso is a parallel direct (that is, noniterative) solver for sparse linear systems that, due to its reliability and efficiency, is popular and frequently used. In particular, it's a natural subroutine for a finite element program such as Indeed, where many such systems must be solved.

In a complex simulation, such a situation is common. The required code is so voluminous and consists of subroutines from so many different areas that most researchers or research teams don't have enough time and/or specialized know-how to develop and implement the complete code on their own. Thus, users typically have to rely on more or less well-documented library routines. Of course, end users typically don't have any influence on the way in which these routines were written, and so in a case like this, the reproducibility is at least doubtful. Therefore, to be positively sure about reproducibility, a close cooperation with the developers of the employed libraries is essential.

I'll now discuss how such effects can arise and what can be done—primarily by the researchers who write the subroutine libraries—to avoid them.

The Mathematical Background
To illustrate the mathematical background of the nonreproducibility described earlier, it's not necessary to consider an extremely complicated high-performance algorithm. Rather, it's sufficient to look at a completely obvious approach for the solution of a simple problem that, strictly speaking, can be interpreted as an approximation method only because it's executed on a computer in finite precision arithmetic. As will be immediately clear, the phenomenon under investigation

Table 1. Computed extrema of the sheet thickness change.

  No. of processors  | Min. sheet thickness change (%) | Max. sheet thickness change (%)
  1                  | −29.21                          | +9.54
  2                  | −29.34                          | +9.14
  3                  | −29.04                          | +9.02
  4 (1st attempt)    | −29.04                          | +8.98
  4 (2nd attempt)    | −29.09                          | +8.96

Figure 4. Location of the computed maxima of the sheet thickness change. (a) The simulation with one processor. (b) The second run of the simulation with four processors. The darker the element is colored, the larger the corresponding sheet-thickness change. Elements colored in white have a sheet thickness change of less than 8.5 percent.

[Figure 4 panels: (a) Max = 9.54; (b) Max = 8.96.]

Results


Page 23:

Why does the number of processors matter?


Page 24:

Investigation
• Indeed uses the Pardiso subroutine
• Part of the Intel Math Kernel Library (MKL)
• Pardiso is a parallel direct (noniterative) solver for sparse linear systems that is reliable and efficient
• Pardiso is nondeterministic?


Page 25:

Generally
• How do we deal with this?
• Do we expect all scientists to drill down and figure this out?
• What solutions are there that ensure reproducibility?


Page 26:

Scalar Product Example
• Computation of the scalar product: $s = \sum_{i=1}^{n} x_i y_i$
• Now assume we have a 10^6-dimensional vector
• Parallelize! Divide the n summands among p processors; the l-th processor computes the partial sum $s_l = \sum_{i=(l-1)n/p+1}^{ln/p} x_i y_i$
• Then, compute the final sum: $s = \sum_{l=1}^{p} s_l$



is based on absolutely elementary properties of computer arithmetic. Therefore, it’s inherent to almost every numerical software system unless explicit countermeasures are applied.

A Simple Example Problem
Our prototypical example problem is the computation of the scalar product:

$s = \sum_{i=1}^{n} x_i y_i$  (1)

of two n-dimensional vectors x and y. This problem can be interesting in its own right, but it's also a part of many advanced procedures of applied mathematics, such as the conjugate gradient method [11] for the solution of linear systems. For us, it's a good example because its simple structure doesn't hide the problem's true essence.

Imagine that the dimension n of the vectors is quite large. This is a common situation in scientific computing; values of the order n > 10^6 aren't a rarity. Thus, the calculation of the scalar product is associated with a non-negligible computational effort. However, this task is highly suitable for parallelization. To keep the description as simple as possible and to avoid losing the focus on the relevant aspects by introducing unnecessary technical detail, we assume all processors work equally fast. It then makes sense to divide the n summands of the sum to be computed in Equation 1 into p parts, where p is the number of available processors. Without loss of generality, we let p be a divisor of n. (Otherwise, we could extend the vectors x and y by adding more components, all of which have the value zero.) Then, we can give the l-th processor (l = 1, 2, ..., p) the task of computing the partial sum,

$s_l = \sum_{i=(l-1)n/p+1}^{ln/p} x_i y_i$  (2)

of the required scalar product. Afterward, we’ll ask one of the processors to determine the final result:

$s = \sum_{l=1}^{p} s_l$  (3)

Fixed Number of Processors
With this algorithm, the critical point is the way in which the final sum (Equation 3) is assembled. For reasons of efficiency, a typical implementation won't add up the summands here in their natural order; rather we'll use the order in which the individual processors finish the computations of their respective partial sums. Thus, if, for example, all processors except processor #2 have completed their calculation, the processor associated with the final summation will already add up their intermediate results. As soon as processor #2 has finished, its result s_2 will be added to that result, and the final result will be available. If we chose the natural order of summation, the computation of the final sum (Equation 3) would have to be interrupted until processor #2 had completed its job, and only then could all other s_l results be added up. Thus, the time required overall for the computation of the scalar product s would become larger because significantly more operations would be required after the end of processor #2's computations than in the previous procedure.

If we now run through the same algorithm a second time with identical input data, it's by no means certain that the individual processors complete their respective tasks in the same order as in the program's first execution. This is because each processor typically must perform not only the computational tasks that we've prescribed, but also additional tasks, such as those associated with the operating system. For these additional tasks, our proper computations will be interrupted. It's generally impossible, at least for typical users of the hardware platform in question, to predict both when such an interruption will occur and how long the interruption will last. Thus, we typically won't be able to control the order in which the partial results from the respective processors will arrive—that is, the order in which the partial sums will be assembled to form the final sum. Now, however, standard computer arithmetic uses floating-point numbers with a finite precision. Therefore, the summation is commutative but—because of the potentially different rounding of the intermediate results—not associative. Hence, the final result s of the summation (Equation 3) depends on the order in which the values s_l are processed.

For a concrete example, let's use n = 8 and p = 4, so that each partial sum s_l consists of exactly two summands. The x_j and y_j are given by x_1 = 10^12, x_2 = 0, x_3 = 10^-8, x_4 = 0, x_5 = -10^12, x_6 = 0, x_7 = 10^-8, x_8 = 0, and y_j = 1 (j = 1, 2, ..., 8). Thus the partial sums s_l amount to s_1 = 10^12, s_2 = s_4 = 10^-8, s_3 = -10^12. If now the s_l arrive in the order s_1, s_3, s_2, s_4, then we first compute the intermediate result z_1 := s_1 + s_3 = 10^12 - 10^12 = 0, then z_2 := z_1 + s_2 = 0 + 10^-8 = 10^-8, and finally the final result

$s = z_2 + s_4 = 10^{-8} + 10^{-8} = 2 \cdot 10^{-8}$.  (4)



Page 27:

Fixed number of processors
• Processors complete in different orders…
• Compute the sum as the solutions come in
• n = 8, p = 4
• x_1 = 10^12, x_2 = 0, x_3 = 10^-8, x_4 = 0, x_5 = -10^12, x_6 = 0, x_7 = 10^-8, x_8 = 0, and y_j = 1 (j = 1, 2, ..., 8)
• Partial sums: s_1 = 10^12, s_2 = s_4 = 10^-8, s_3 = -10^12
• If the partial sums arrive in the order s_1, s_3, s_2, s_4:
  z_1 := s_1 + s_3 = 10^12 - 10^12 = 0
  z_2 := z_1 + s_2 = 0 + 10^-8 = 10^-8
  s = z_2 + s_4 = 10^-8 + 10^-8 = 2·10^-8
• The order s_1, s_2, s_3, s_4 instead gives s = z'_2 + s_4 = 0 + 10^-8 = 10^-8, because 10^12 + 10^-8 rounds to 10^12… (see the sketch below)
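This is straightforward to reproduce in IEEE-754 double precision; a minimal sketch (plain Python, mine rather than the paper's):

# Partial sums from the slide: s1 = 1e12, s2 = s4 = 1e-8, s3 = -1e12
parts = {"s1": 1e12, "s2": 1e-8, "s3": -1e12, "s4": 1e-8}

def accumulate(order):
    """Add the partial sums left to right in the given arrival order."""
    total = 0.0
    for name in order:
        total += parts[name]  # finite-precision addition: not associative
    return total

print(accumulate(["s1", "s3", "s2", "s4"]))  # 2e-08
print(accumulate(["s1", "s2", "s3", "s4"]))  # 1e-08 (1e-8 absorbed by 1e12)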


Page 28:

Solution?


Page 29:

Solution?
• Wait for all partial sums to come in, and then add them in a deterministic manner (sketched below)
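A minimal sketch of this idea (my illustration, not from the lecture): tasks may finish in any order, but storing each partial sum by index and reducing in a fixed order makes the result repeatable for a fixed p.

from concurrent.futures import ThreadPoolExecutor

x = [1e12, 0.0, 1e-8, 0.0, -1e12, 0.0, 1e-8, 0.0]
y = [1.0] * len(x)
p = 4
chunk = len(x) // p

def partial_sum(l):
    """Partial sum handled by (virtual) processor l."""
    return sum(x[i] * y[i] for i in range(l * chunk, (l + 1) * chunk))

with ThreadPoolExecutor(max_workers=p) as ex:
    # ex.map returns results in argument order, no matter which task finishes first
    partials = list(ex.map(partial_sum, range(p)))

total = 0.0
for s_l in partials:  # fixed reduction order => same result on every run
    total += s_l
print(total)  # 1e-08, every time (for p = 4)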


Page 30:

Varied number of processors
• What if we have only two processors? (zero terms omitted)
  s*_1 = 10^12 + 10^-8 = 10^12 (rounded)
  s*_2 = -10^12 + 10^-8 = -10^12 (rounded)
  s* = s*_1 + s*_2 = 10^12 - 10^12 = 0
• Solutions?
  - Use a number of virtual processors that is fixed no matter what the environment is (the sketch below reproduces this p-dependence)
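A sketch (again my illustration) of the same effect: the same data summed with p = 4 versus p = 2 chunks, each chunk in deterministic natural order, still yields different results because the reduction tree changes.

x = [1e12, 0.0, 1e-8, 0.0, -1e12, 0.0, 1e-8, 0.0]
y = [1.0] * len(x)

def parallel_dot(x, y, p):
    """Dot product split into p contiguous chunks, all sums in natural order."""
    chunk = len(x) // p
    partials = [sum(x[i] * y[i] for i in range(l * chunk, (l + 1) * chunk))
                for l in range(p)]
    return sum(partials)

print(parallel_dot(x, y, 4))  # 1e-08
print(parallel_dot(x, y, 2))  # 0.0: each 1e-8 is absorbed into +/-1e12 within its half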


Page 31:

Other solutions
• Understanding rounding
• Use higher precision (see the math.fsum sketch below)
• Interval arithmetic
• Fixed-point arithmetic
• Uncertainty quantification
• "The 'exact' reproducibility, i.e. identical numerical results on different number of processors, is impossible in our view, and is not our goal"
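One concrete instance of the higher-precision/rounding route (my example, not from the slides): Python's standard-library math.fsum tracks exact partial sums and returns the correctly rounded total, so its result does not depend on summation order.

import math
import random

parts = [1e12, 1e-8, -1e12, 1e-8]
for _ in range(3):
    random.shuffle(parts)
    # built-in sum depends on the order; fsum does not
    print(sum(parts), math.fsum(parts))  # fsum prints 2e-08 every time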


Page 32:

Ahn Paper
• Data-race free: the presence of data races can result in execution-order-dependent final results
• Determinate (external determinism): a system can be considered deterministic if the final results are the same regardless of the internal execution details (see the sketch below)
• Internal determinism: even the internal execution steps (traces) are required to be the same
• Functional determinism: this refers to the absence of mutable state updates, inherently giving rise to determinate outcomes
• Synchronous parallelism: some regard synchronously parallel execution models as deterministic, as opposed to asynchronous models
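A small illustration (mine, not the paper's) of external versus internal determinism: the completion trace varies from run to run, while the final, order-normalized result is determinate.

import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(i):
    time.sleep(random.random() * 0.01)  # scheduling makes completion order vary
    return i

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(work, i) for i in range(8)]
    trace = [f.result() for f in as_completed(futures)]

print(trace)          # internal trace: differs between runs
print(sorted(trace))  # external result: [0, 1, ..., 7] on every run (determinate)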


Page 33:
