Balancing efficiency and accuracy for sediment
transport simulations
Wenjie Wei1, Stuart R. Clark1, Huayou Su2, Mei Wen2 and
Xing Cai1,3
1 Department of Computational Geoscience, Simula Research Laboratory, P.O. Box 134, 1325 Lysaker, Norway
2 School of Computer Science, National University of Defense Technology, Changsha, 410073, China
3 Department of Informatics, University of Oslo, P.O. Box 1080 Blindern, 0316 Oslo, Norway
E-mail: [email protected], [email protected], [email protected],
[email protected], [email protected]
Abstract. Simulating multi-lithology sediment transport requires numerically
solving a fully-coupled system of nonlinear partial differential equations. The most
standard approach is to simultaneously update all the unknown fields. Such a fully-
implicit strategy can be computationally demanding due to the need for Newton-
Raphson iterations, each having to set up and solve a large system of linearized
algebraic equations. Fully-explicit numerical schemes that do not solve linear systems
are possible to devise, but suffer from lower numerical stability and accuracy. If we
count the total number of floating-point operations needed to achieve stable numerical
solutions with the same level of accuracy, the fully-implicit approach probably wins over
its fully-explicit counterpart. However, the latter may nevertheless win in the overall
computation time, because computers achieve higher hardware efficiency for simpler
numerical computations. Adding to this competition, there are semi-implicit numerical
schemes that lie between the two extremes. This paper has two novel contributions.
First, we devise a new semi-implicit scheme that has second-order accuracy in the
temporal direction. Second, and more importantly, we propose a simple prediction
model for the overall computation time on multicore architectures, applicable to many
numerical implementations. Based on performance prediction, appropriate numerical
schemes can be chosen by considering accuracy, stability, and computing speed at
the same time. Our methodology is tested by numerical experiments modeling the
sediment transport in Monterey Bay.
1. Motivation
Often, numerical accuracy and computing time are two conflicting goals for sediment
transport simulations, like in many other cases of solving nonlinear partial differential
equations (PDEs). Ideally, one would like to spend the least amount of time waiting
for a computer to produce solutions with a desirable level of accuracy. This is, however,
easier said than done. On the one hand, due to the absence of analytical solutions, the accuracy
of the numerical solutions can at best be estimated indirectly. On the other hand,
there may exist a variety of numerical strategies, with different properties of efficiency,
stability, and accuracy. We also need to remember that the overall computational
efficiency often arises from a combination of factors: convergence rate, numerical
stability, software implementation, and computer hardware.
While restricting the focus to a specific mathematical model of sediment transport,
this paper presents a general methodology for choosing the best-performing temporal
discretization strategy out of a collection of alternatives. Moreover, as a novel
contribution in the numerical aspect, we propose in this paper a new temporal
discretization strategy, which is based on the Crank-Nicolson method and splits the
two coupled nonlinear PDEs. This new scheme achieves second-order accuracy in
time, without having to solve systems of nonlinear algebraic equations. Our second
novel contribution is a simple mathematical model for predicting the computing time
of any numerical implementation. In connection with sedimentary basin simulations,
different numerical schemes may have very different strengths and weaknesses. Our
performance-prediction model can help to pick a best-performing code out of a collection
of alternatives, taking into account accuracy, stability, algorithmic complexity and
hardware utilization.
2. Introduction and the mathematical model
Dynamic simulations of sedimentary basin filling nowadays may involve a multitude
of equations to represent a host of physical phenomena, from sediment erosion and
water-borne transport to underwater landslides and sediment compaction. There are
many ways to categorize these simulations, see [12], but for siliclastic depositional shelf
environments, the dominating assumption is that all or part of the averaged, long-term
movement of sediment is based on a diffusion equation. In diffusion-based models, the
sediment flux occurs in proportion to the slope between neighboring cells, representing
the averaged effects of water-borne sediment, sediment slumps and erosion. Transport
coefficients represent different levels of effectiveness for the position relative to sea level
and for water-driven and gravity-driven transport. Examples of such diffusion-based
models are DEMOSTRAT [13], DIONISOS [4, 5] and SEDFLUX [15, 7]. Some models,
such as the Stanford/CSIRO model SEDSIM [16, 10], calculate the water-driven part as
advective transport based on a simplified Navier-Stokes calculation. However, SEDSIM
still applies a diffusion equation to represent gravity-driven sediment transport. Thus,
the diffusion equation is an important component in either of these two model categories.
Although derived from sediment flux along rivers, see [12], it was Jordan and
Flemings [8] who first applied the diffusion model to clastic shelf deposition environments
with eustatic sea level variation. The equation has the basic form:
\[ \frac{\partial h}{\partial t} = \nabla \cdot (K \nabla h), \]
where h(x, y, t) is the height above some arbitrary horizontal surface in the x-y plane, t
is time and K(x, y, t) is the transport or diffusion coefficient giving the effectiveness of
the diffusion.
Using a parameter for the fraction of a given sediment, Rivenæs [13] added a second
equation to calculate the ratio of two sediments in a layer of deposited material. The
modified equations include s(x, y, t) and 1−s(x, y, t) as the fractions of the two sediments.
In particular, we consider in this paper a sedimentation scenario of sand and mud, and
it should be said that the subsequent numerical strategies carry straightforwardly over
to multi-lithology cases. The following two nonlinear PDEs, derived by Rivenæs [14],
constitute our mathematical model:
\[ \frac{\partial h}{\partial t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s \nabla h) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s) \nabla h\bigr), \qquad (1) \]
\[ A\,\frac{\partial s}{\partial t} + s\,\frac{\partial h}{\partial t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s \nabla h). \qquad (2) \]
In the above model, α(x, y) and β(x, y) denote the diffusion coefficients for sand
and mud. In addition, $C_s$ and $C_m$ are the compaction ratios of the two sediment types.
Moreover, A is a constant representing the thickness of a top layer, in which sediments
are transported.
The initial conditions are of the form h(x, y, 0) = h0(x, y) and s(x, y, 0) = s0(x, y).
As boundary conditions, most of the boundary has the no-flow condition, i.e., the
homogeneous Neumann boundary condition ∂h/∂n = ∂s/∂n = 0. On the remaining
part of the boundary, the fluxes of sand and mud inflow are prescribed, i.e.,
\[ -\alpha s\,\frac{\partial h}{\partial n} = f_s, \qquad -\beta (1-s)\,\frac{\partial h}{\partial n} = f_m. \qquad (3) \]
These boundary conditions model an inflow of sediments due to, e.g., a river crossing the
boundary of the solution domain.
Although Granjeon [4] generalized these equations to handle multiple lithologies by
adding an additional equation, similar to (2), the major additional complexity in solving
these equations is in the nature of coupling (1) with (2), because of their different forms.
While the diffusion equations are prevalent, as presented above, the numerical
methods and efficiency of solving the coupled diffusion-sediment equations (1)-(2) have
not been presented in any detail in the literature. In this paper, we hope to close this
gap and, in addition, present a new semi-implicit scheme and compare it with the more
commonly used approaches.
3. Numerical strategies
This section is devoted to a description of several numerical methods for solving (1)-
(2). We will first look at different temporal discretization schemes, which constitute
the numerical core of the present paper. Thereafter, details associated with the spatial
discretization are presented.
3.1. Temporal discretization
The time domain 0 < t ≤ T is divided into a number of equally spaced discrete time
levels, with ∆t as the time step size. Let the superscript $\ell$ be the time level index, such
that $h^{\ell}$ denotes $h(x, y, \ell\Delta t)$ and $s^{\ell}$ denotes $s(x, y, \ell\Delta t)$. Then, the temporal derivatives
are simply approximated as
\[ \frac{\partial h}{\partial t} \approx \frac{h^{\ell+1} - h^{\ell}}{\Delta t}, \qquad \frac{\partial s}{\partial t} \approx \frac{s^{\ell+1} - s^{\ell}}{\Delta t}. \]
The remaining task of temporal discretization is to choose time level ` or ` + 1,
or a combination of both, at which the right-hand-side terms of (1)-(2) are to be
evaluated. Different strategies will give rise to fully-explicit, semi-implicit and fully-
implicit schemes.
3.1.1. Fully-explicit scheme To avoid solving systems of nonlinear algebraic equations,
the right-hand-side terms of (1)-(2) can use the already computed h and s values. More
specifically, (1)-(2) are transformed as follows, by a fully-explicit temporal discretization:
\[ \frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell}\bigr), \]
\[ A\,\frac{s^{\ell+1} - s^{\ell}}{\Delta t} + s^{\ell+1}\,\frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell+1}). \]
It should be noted that h is to be updated before s during each time step. This is
why the newly computed $h^{\ell+1}$ (from the first equation) is immediately used to compute
$s^{\ell+1}$ (in the second equation). Another remark is that $s^{\ell+1}$, instead of $s^{\ell}$, is used in
the $s\,\partial h/\partial t$ term on the left-hand side of (2). Numerical experiments show that this simple
trick improves the numerical stability of this fully-explicit scheme, in which both $h^{\ell+1}$
and $s^{\ell+1}$ are computed straightforwardly. The scheme has first-order accuracy in time.
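To make the structure concrete, the following minimal Python/NumPy sketch performs one such fully-explicit step on the interior points of a uniform grid. It is an illustration under simplifying assumptions, not the paper's implementation: boundary conditions are ignored, centered differences are used throughout (whereas Section 3.2 applies upwinding to the s-equation), and the function names are ours.

```python
import numpy as np

def div_k_grad(k, h, dx, dy):
    """Centered-difference approximation of div(k * grad(h)) at interior
    points, with k averaged arithmetically onto the cell faces."""
    kx = 0.5 * (k[1:-1, 1:] + k[1:-1, :-1])        # faces between x-neighbours
    ky = 0.5 * (k[1:, 1:-1] + k[:-1, 1:-1])        # faces between y-neighbours
    fx = kx * (h[1:-1, 1:] - h[1:-1, :-1]) / dx    # fluxes across x-faces
    fy = ky * (h[1:, 1:-1] - h[:-1, 1:-1]) / dy    # fluxes across y-faces
    return (fx[:, 1:] - fx[:, :-1]) / dx + (fy[1:, :] - fy[:-1, :]) / dy

def explicit_step(h, s, alpha, beta, Cs, Cm, A, dt, dx, dy):
    """One fully-explicit step: update h first, then use the new h (and the
    new s in the s*dh/dt term) when updating s, as described above."""
    rhs_h = div_k_grad(alpha * s, h, dx, dy) / Cs \
          + div_k_grad(beta * (1.0 - s), h, dx, dy) / Cm
    h_new = h.copy()
    h_new[1:-1, 1:-1] += dt * rhs_h
    # A (s_new - s)/dt + s_new (h_new - h)/dt = rhs_s  =>  solve point-wise:
    rhs_s = div_k_grad(alpha * s, h_new, dx, dy) / Cs
    dh = h_new[1:-1, 1:-1] - h[1:-1, 1:-1]
    s_new = s.copy()
    s_new[1:-1, 1:-1] = (A * s[1:-1, 1:-1] + dt * rhs_s) / (A + dh)
    return h_new, s_new
```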
3.1.2. Semi-implicit scheme 1 (backward Euler version) If $h^{\ell+1}$ and $s^{\ell+1}$ are used,
respectively, on the right-hand side of (1) and (2), a semi-implicit scheme based on the
backward Euler method can be derived as follows:
\[ \frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell+1}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell+1}\bigr), \]
\[ A\,\frac{s^{\ell+1} - s^{\ell}}{\Delta t} + s^{\ell+1}\,\frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1} \nabla h^{\ell+1}). \]
Like the preceding fully-explicit scheme, $h^{\ell+1}$ and $s^{\ell+1}$ are computed separately
within each time step. However, each substep now requires solving a linear system,
hence the name semi-implicit. Also like the fully-explicit scheme,
this scheme has first-order accuracy in time. Numerical stability is, however, the strong
feature of this semi-implicit scheme. We remark that these two schemes were proposed
and studied in our earlier work [2].
3.1.3. Semi-implicit scheme 2 (Crank-Nicolson version) To improve the accuracy of
the above semi-implicit scheme, we adopt the Crank-Nicolson strategy and propose in
this paper a new scheme:
\[ \frac{h^{\ell+1,k} - h^{\ell}}{\Delta t} = \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1,k-1} \nabla h^{\ell+1,k}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell+1,k-1}) \nabla h^{\ell+1,k}\bigr) \right) + \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell}\bigr) \right), \]
\[ A\,\frac{s^{\ell+1,k} - s^{\ell}}{\Delta t} + \frac{s^{\ell+1,k} + s^{\ell}}{2}\,\frac{h^{\ell+1,k} - h^{\ell}}{\Delta t} = \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1,k} \nabla h^{\ell+1,k}) + \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) \right). \]
As can be seen in the above two equations, the h solution is still to be computed
separately from s, and the Crank-Nicolson method is adopted in the temporal direction.
Careful readers will notice that inner iterations, with index k = 1, 2, . . ., are introduced
within each time step. Before starting the inner iterations, we assign $s^{\ell+1,0} = s^{\ell}$, which
is needed in the first semi-discretized equation for computing $h^{\ell+1,1}$. During each inner
iteration k, a linear system has to be solved for computing $h^{\ell+1,k}$, and another linear
system is needed for $s^{\ell+1,k}$. Numerical experiments suggest that two inner iterations
(k = 2) are sufficient for obtaining second-order accuracy in time. By comparison, one
inner iteration (k = 1) gives only first-order accuracy.
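The inner-iteration structure can be summarised by the following hedged Python sketch, in which solve_h and solve_s are placeholders (our own names) for the two linear solves per inner iteration; their assembly is problem specific and not shown.

```python
def crank_nicolson_step(h_old, s_old, solve_h, solve_s, n_inner=2):
    """Structural sketch of semi-implicit scheme 2 within one time step.
    solve_h(s_guess, h_old, s_old) returns h^{l+1,k};
    solve_s(h_new, h_old, s_old) returns s^{l+1,k}."""
    s_new = s_old                  # s^{l+1,0} = s^l
    h_new = h_old
    for _ in range(n_inner):       # two inner iterations give 2nd order
        h_new = solve_h(s_new, h_old, s_old)
        s_new = solve_s(h_new, h_old, s_old)
    return h_new, s_new
```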
3.1.4. Fully-implicit scheme 1 (backward Euler version) Unlike the above three
schemes, fully-implicit schemes compute $h^{\ell+1}$ and $s^{\ell+1}$ simultaneously. The standard
fully-implicit scheme, which was already proposed in [14], uses the backward Euler
temporal discretization. That is, $h^{\ell+1}$ and $s^{\ell+1}$ are used in all the right-hand-side terms
of (1)-(2). Thereafter, a spatial discretization will give rise to the following system of
nonlinear algebraic equations per time step:
\[ F_h\bigl(\mathbf{h}^{\ell+1}, \mathbf{s}^{\ell+1}, \mathbf{h}^{\ell}, \mathbf{s}^{\ell}\bigr) = \mathbf{0}, \qquad F_s\bigl(\mathbf{h}^{\ell+1}, \mathbf{s}^{\ell+1}, \mathbf{h}^{\ell}, \mathbf{s}^{\ell}\bigr) = \mathbf{0}. \qquad (4) \]
Here, $F_h$ denotes the set of nonlinear algebraic equations arising from (1), whereas $F_s$
arises from (2). Vectors $\mathbf{h}^{\ell+1}$, $\mathbf{s}^{\ell+1}$, $\mathbf{h}^{\ell}$, and $\mathbf{s}^{\ell}$ contain, respectively, the values of $h^{\ell+1}$, $s^{\ell+1}$,
$h^{\ell}$, and $s^{\ell}$ on all the spatial grid points.
Newton-Raphson iterations can be used to solve the entire system of nonlinear
algebraic equations (4), for which a new system of linear equations needs to be set up
and solved in every Newton-Raphson iteration.
3.1.5. Fully-implicit scheme 2 (Crank-Nicolson version) The preceding fully-implicit
scheme uses the backward Euler method in the temporal discretization, so its accuracy
in time is of first order. To achieve second-order accuracy in time, the Crank-Nicolson
method can be adopted in the temporal discretization. That is, both $h^{\ell+1}$, $s^{\ell+1}$ and $h^{\ell}$,
$s^{\ell}$ are used and equally weighted in all the right-hand-side terms of (1)-(2). The result
is also a system of nonlinear algebraic equations, of the same form as (4), for computing
$h^{\ell+1}$ and $s^{\ell+1}$ simultaneously.
3.2. Spatial discretization
We choose finite differences to carry out the spatial discretizations. This is mostly
motivated by the numerical and programming simplicity. It can be mentioned that
other spatial discretization techniques, such as finite elements, can also use the same
temporal discretizations discussed above.
3.2.1. Treatment of diffusion It is standard to use centered difference for the two
diffusion terms on the right-hand side of (1), for obtaining second-order accuracy in
space. For example, centered differencing applied to the $\nabla \cdot (\alpha s \nabla h)$ term gives the following
discretized form:
\[ \frac{\alpha_{i+\frac{1}{2},j}\, s_{i+\frac{1}{2},j}\,(h_{i+1,j} - h_{i,j}) - \alpha_{i-\frac{1}{2},j}\, s_{i-\frac{1}{2},j}\,(h_{i,j} - h_{i-1,j})}{\Delta x^2} + \frac{\alpha_{i,j+\frac{1}{2}}\, s_{i,j+\frac{1}{2}}\,(h_{i,j+1} - h_{i,j}) - \alpha_{i,j-\frac{1}{2}}\, s_{i,j-\frac{1}{2}}\,(h_{i,j} - h_{i,j-1})}{\Delta y^2}, \]
where the subscripts $i, j$ are the mesh point indices for a 2D uniform grid with mesh
spacings $\Delta x$ and $\Delta y$. In the above formula, the half-indexed terms are to be evaluated
as, e.g., $\alpha_{i+\frac{1}{2},j}\, s_{i+\frac{1}{2},j} = (\alpha_{i,j} s_{i,j} + \alpha_{i+1,j} s_{i+1,j})/2$.
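For illustration, the stencil translates almost literally into code. The following point-wise Python function is our own hedged transcription, with the first array index taken along x and the second along y.

```python
def diffusion_term(alpha, s, h, i, j, dx, dy):
    """Centered-difference value of div(alpha*s*grad(h)) at grid point (i, j),
    with half-indexed coefficients formed by arithmetic averaging."""
    a_e = 0.5 * (alpha[i, j] * s[i, j] + alpha[i + 1, j] * s[i + 1, j])
    a_w = 0.5 * (alpha[i, j] * s[i, j] + alpha[i - 1, j] * s[i - 1, j])
    a_n = 0.5 * (alpha[i, j] * s[i, j] + alpha[i, j + 1] * s[i, j + 1])
    a_s = 0.5 * (alpha[i, j] * s[i, j] + alpha[i, j - 1] * s[i, j - 1])
    return ((a_e * (h[i + 1, j] - h[i, j]) - a_w * (h[i, j] - h[i - 1, j])) / dx**2
          + (a_n * (h[i, j + 1] - h[i, j]) - a_s * (h[i, j] - h[i, j - 1])) / dy**2)
```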
3.2.2. Treatment of convection Equation (2) is a convection equation with respect
to s, because of the ∇ · (αs∇h) term. For the sake of numerical stability, one-sided
upwind finite difference is preferred over centered difference, despite its first-order spatial
accuracy.
To this end, it is customary to move the convection term ∇ · (αs∇h) to the left-
hand side of (2) when checking the flow direction. That is, −∇h gives the convection
velocity. The x-component, $-\partial h/\partial x$, is approximated by $(h_{i-1,j} - h_{i+1,j})/(2\Delta x)$, the
sign of which determines how the x-component of the convection term is discretized by
one-sided upwind differencing. More specifically, the
\[ \frac{\partial}{\partial x}\left( \alpha s \frac{\partial h}{\partial x} \right) \]
term is approximated by
\[ \left( \frac{\alpha_{i,j} s_{i,j} - \alpha_{i-1,j} s_{i-1,j}}{\Delta x} \right) \times \left( \frac{h_{i+1,j} - h_{i-1,j}}{2\Delta x} \right) \]
if we have $h_{i-1,j} > h_{i+1,j}$. Otherwise, the following approximation is used:
\[ \left( \frac{\alpha_{i+1,j} s_{i+1,j} - \alpha_{i,j} s_{i,j}}{\Delta x} \right) \times \left( \frac{h_{i+1,j} - h_{i-1,j}}{2\Delta x} \right). \]
The discretization in the y-direction is done similarly.
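A corresponding point-wise sketch of the x-direction upwind choice is given below (the y-direction is handled analogously); the function name and index convention are again ours.

```python
def upwind_convection_x(alpha, s, h, i, j, dx):
    """Upwind discretization of d/dx(alpha*s*dh/dx) at (i, j): the sign of
    the convection velocity -dh/dx decides which one-sided difference of
    alpha*s is used."""
    dh_dx_centered = (h[i + 1, j] - h[i - 1, j]) / (2.0 * dx)
    if h[i - 1, j] > h[i + 1, j]:        # flow towards +x: backward difference
        d_as = (alpha[i, j] * s[i, j] - alpha[i - 1, j] * s[i - 1, j]) / dx
    else:                                # flow towards -x: forward difference
        d_as = (alpha[i + 1, j] * s[i + 1, j] - alpha[i, j] * s[i, j]) / dx
    return d_as * dh_dx_centered
```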
3.2.3. Treatment of the boundary conditions Second-order accurate treatment of the
homogeneous Neumann condition ∂h/∂n = ∂s/∂n = 0 follows the standard approach
by using one layer of ghost boundary points. Attention is however needed for the
inhomogeneous influx conditions (3). In fact, we rewrite the two conditions in the
following equivalent form:
\[ \frac{\partial h}{\partial n} = -\frac{f_s}{\alpha} - \frac{f_m}{\beta}, \qquad s = \frac{\beta f_s}{\beta f_s + \alpha f_m}. \]
That is, h has an inhomogeneous Neumann boundary condition, and s takes a Dirichlet
boundary condition.
4. Case Study: Monterey Bay
Monterey Bay is our region of study because of the publicly available, high-resolution
bathymetric data and its interesting features: the largest undersea canyons of the North
American West Coast. Our region covers the area from south of San Francisco Bay in
the north, to Davidson Seamount in the south and west. To the east, the area takes in
the mouth of the Salinas River (Figure 1). To approximate the fluvial sediment load
of the Salinas River, we use an inflow boundary condition at the river mouth with an
average sediment flux of 1.8 ton/yr‡. This influx value is based on the average for the
years 1932 to 1999, and neglects the four most significant events during that period [3].
For our purposes, we use a rate of 20% sand (the coarse-grained fraction of the influx), a little higher
than the 10% value used in [3]. For marine sediment transport rates, we use an average
of approximately 3000 cm2/yr for sand-sized particles and 6000 cm2/yr for silt-sized
particles. Rather than setting only a depth dependence in the formula as in [13], we
instead increase or reduce the transport coefficient depending on both the depth and
the curvature. The dependence on curvature is designed to capture higher rates of
sediment transport typical for submarine canyons and channels in the region [9]. To
pick out the short-wavelength features of the canyons, we filtered the topography in the
Fourier domain, using a high-pass filter. The resulting α values are shown in the colour
overlay of Figure 1. Afterwards, the result was smoothed to improve numerical stability
and we arrived at transport coefficients along the Monterey Canyon of roughly 30,000
and 60,000 cm2/yr for α and β, respectively.
‡ Assuming average sediment density of 2.3 g/cm3.
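For readers who want to experiment with a similar coefficient map, the following NumPy sketch shows one possible way to high-pass filter a bathymetry grid in the Fourier domain and boost the transport coefficient where the short-wavelength relief is strong. The function, its parameters and the numerical values are illustrative assumptions only, not the exact procedure behind Figure 1.

```python
import numpy as np

def curvature_boosted_alpha(topo, base_alpha=3000.0, boost=10.0, cutoff=0.1):
    """Illustrative sketch: remove long wavelengths from the bathymetry in
    the Fourier domain, then scale alpha by the remaining short-wavelength
    relief (normalised to [0, 1]).  All parameter values are placeholders."""
    ny, nx = topo.shape
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    k = np.hypot(kx, ky)
    spec = np.fft.fft2(topo)
    spec[k < cutoff * k.max()] = 0.0          # high-pass filter
    relief = np.abs(np.fft.ifft2(spec))       # short-wavelength amplitude
    relief /= relief.max() + 1e-12
    return base_alpha * (1.0 + boost * relief)
```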
Figure 1. Monterey Bay and the surrounding region; the rectangular study region
is bounded by the coordinates from 121.80◦W to 123.16◦W and from 35.50◦N to
37.16◦N. Inside the domain, the subsea bathymetric data are obtained from the
NOAA Autochart web-facility based on 2128 survey lines of ship-gathered bathymetric
data [11]. The colour overlay shows the transport coefficient α for sand and in
particular highlights the canyons. Two of them are indicated; the Monterey Canyon,
which extends from the red star out to the oceanward boundary of the domain at the
bottom of the image; and the Sur Canyon. Subaerial topography (shown in green)
is derived from ETOPO1 [1]. Six east-west and three north-south profiles of the
computational domain are shown, with colours indicating the percentage of silts and
clays. For example, light brown indicates a high sand percentage but low silt or clay.
The mouth of the Salinas River is at the red-starred location. The small reference plot
shows the West Coast of California.
In the analysis of the model, we use results from a model run of 250,000 years, on
a uniform 850× 700 mesh. Figure 1 shows profiles of the model along lines of latitude
and longitude. Our attention is focused on profiles A to F, since these are parallel to
the inflow point (the red star) and the Monterey Canyon. Profiles A and B sit almost
parallel to the shelf with a number of canyons transporting sediment directly into low-
lying areas. The sediment is sandy along much of the profile, particularly in profile A,
with silts filling areas towards the edge of the domain. Profiles C and D show a strong
layering, from well-mixed lower layers to a sandy deposit in the upper layers. Profiles
E and F show significant well-mixed layers only in the rightmost lower layers where the
Monterey Canyon intersects the profiles, and high silt percentages in the middle layers
stretching oceanward. Using the total change in height through the model run, we have
also calculated the average erosion and deposition rates in Figure 2. We find deposition
rates in the Monterey Canyon in the range of 0.25 cm/yr to 0.5 cm/yr.
Figure 2. Average rate of sediment deposition (positive values in blue colours) or
erosion (negative values in red colours). The contours show the bathymetry in meters.

The well-mixed nature of the lower layers of profiles C and D in Figure 1 shows the
effectiveness of the canyons in transporting both sand and smaller-sized particles, at
least until the topography becomes relatively flat and the silts are transported further
than the sands, creating a distinct sand to silt transition close to the seaward edge
of the model. The deposition rates of Figure 2 for the Monterey Canyon region are
in good agreement with Farnsworth [3], who estimated sediment accumulation rates
in the canyon at 0.35 cm/yr at 3000m depth. Although we have neglected long-shore
currents, the effect is likely to be limited to the immediate coast as the topographic
highs surrounding the Monterey Canyon do not allow deep longshore currents. Finally,
although the model allows for higher transport rates in the canyons, these canyons
still fill with sediment and the gradients in h are therefore reduced, changing the
Monterey Bay region from canyon-dominated sediment transport to a prograding shelf-
type deposition, particularly noticeable in the upper sand layers of profiles B to F in Figure 1.
However, during the length of the model run, 250,000 years, the sea-level variation would
be a significant effect in renewing erosion of the present-day submarine bathymetry as
rivers cut into the landscape again.
5. Comparing the different methods
All the five temporal discretization schemes from Section 3.1 are capable of running
the 250,000-year simulation of the Monterey Bay case, using the rather coarse 850 × 700 mesh. Human eyes cannot detect any differences among the five simulations.
Nevertheless, the five schemes have different strengths and weaknesses, which will be
of great importance for high-resolution simulations. It is the purpose of this section
to take a closer look at their temporal accuracy, numerical stability and computational
speed.
5.1. Temporal accuracy
A very important property of any numerical method is its accuracy, in particular, how
fast the numerical error decreases with refining mesh spacing: ∆x, ∆y, ∆t. Our focus
now is on the relationship between ∆t and accuracy, because this is where the five
schemes differ. A good understanding of this issue can allow more accurate schemes to
use larger ∆t and thus save time.
For studying the temporal accuracy of a numerical method, we assume that its
numerical error has three leading terms:
\[ |u_{\mathrm{true}} - u_{\Delta}| \approx C_t \Delta t^{\gamma} + C_x \Delta x^{\nu} + C_y \Delta y^{\nu}, \qquad (5) \]
where u∆ denotes the numerical solution of h or s, while Ct, Cx and Cy are constants
independent of ∆t, ∆x, and ∆y. The constant γ gives the order of temporal accuracy,
and similarly ν is for the two spatial directions. Here, we have assumed that the
numerical method achieves the same order of accuracy in x- and y-directions.
Fixing a spatial discretization scheme and the values of ∆x and ∆y, we can
generate a series of numerical solutions with decreasing time step sizes: ∆t, ∆t/2,
∆t/4, and so on. If the difference between consecutive pairs of numerical solutions
decreases with ∆t, it indicates that the numerical method is convergent in the temporal
direction. (Consistency is needed in both temporal and spatial discretizations to ensure
convergence toward the true solution.) In particular, if the consecutive differences are
observed to reduce by a factor of $2^{\gamma}$, we can then establish that γ is the order of
temporal accuracy for the numerical method under study.

Table 1. Temporal error analysis for the five schemes at T = 100 years. The spatial
mesh is fixed as 850 × 700, and the value of ∆t is 0.02 yr.

                          Fully-explicit  Semi-implicit 1  Semi-implicit 2  Fully-implicit 1  Fully-implicit 2
‖h_∆t − h_∆t/2‖_L2          1.158e+02       1.260e+02        5.024e-02        1.277e+02         1.329e-02
‖h_∆t/2 − h_∆t/4‖_L2        5.790e+01       6.300e+01        1.251e-02        6.387e+01         3.321e-03
‖h_∆t/4 − h_∆t/8‖_L2        2.895e+01       3.150e+01        3.024e-03        3.193e+01         8.283e-04
‖h_∆t/8 − h_∆t/16‖_L2       1.448e+01       1.575e+01        7.961e-04        1.597e+01         2.076e-04
‖h_∆t/16 − h_∆t/32‖_L2      7.238e+00       7.876e+00        1.937e-04        7.984e+00         5.190e-05
‖s_∆t − s_∆t/2‖_L2          4.246e+00       9.922e-02        4.190e-03        9.678e-02         1.262e-03
‖s_∆t/2 − s_∆t/4‖_L2        2.119e+00       5.018e-02        1.036e-03        4.847e-02         3.152e-04
‖s_∆t/4 − s_∆t/8‖_L2        1.059e+00       2.527e-02        2.549e-04        2.425e-02         7.816e-05
‖s_∆t/8 − s_∆t/16‖_L2       5.292e-01       1.268e-02        6.501e-05        1.213e-02         1.974e-05
‖s_∆t/16 − s_∆t/32‖_L2      2.645e-01       6.351e-03        1.634e-05        6.067e-03         4.927e-06
Using the Monterey Bay case, we experimented with a series of six decreasing time
step sizes: ∆t = 0.02/2^k yr, 0 ≤ k ≤ 5, for a 100-year simulation. Table 1 thus shows, for
the five numerical schemes, the differences (in discrete L2-norm) between consecutive
pairs of h or s solutions. First-order temporal accuracy can clearly be observed for
the fully-explicit scheme and the backward-Euler versions of both the semi- and fully-implicit
schemes. The Crank-Nicolson versions of the semi- and fully-implicit schemes are second-order
accurate in time. In addition, Table 1 also reveals the actual magnitude of the numerical
errors. Semi-implicit scheme 1 and fully-implicit scheme 1 have roughly the same level
of accuracy. The fully-explicit scheme has relatively poor accuracy for the s solutions.
The two second-order schemes are considerably more accurate than all three first-order
schemes. Between the two second-order schemes, the fully-implicit version is about four
times more accurate.
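The observed order can be read off such a table directly. A minimal Python sketch, applied to the h-differences of semi-implicit scheme 2 from Table 1:

```python
import numpy as np

def observed_order(diffs):
    """gamma_k = log2(d_k / d_{k+1}) for consecutive solution differences d_k."""
    d = np.asarray(diffs, dtype=float)
    return np.log2(d[:-1] / d[1:])

print(observed_order([5.024e-2, 1.251e-2, 3.024e-3, 7.961e-4, 1.937e-4]))
# approximately [2.01, 2.05, 1.93, 2.04], i.e. second-order accuracy in time
```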
5.2. Stability
Numerical stability is another important property of any scheme. Here, we consider
a numerical solution as unstable if $s^{\ell}_{i,j}$ falls outside the physically valid range [0, 1]
at any mesh point. Typically, stability depends on the mesh spacings ∆x, ∆y, and ∆t. In the
absence of a theoretical analysis of the stability condition, we can also resort to numerical
experiments.
More specifically, we used a 100-year simulation of the Monterey Bay case on four
spatial mesh resolutions. For each spatial resolution, we adopted a binary search to
find the largest admissible ∆t, which maintained $0 \le s^{\ell}_{i,j} \le 1$ throughout the entire
simulation. As shown in Table 2, the fully-explicit scheme is the least stable, while
semi-implicit scheme 1 is the most stable. We can also see that smaller values of ∆x
and ∆y require smaller ∆t. Although the actual values of admissible ∆t in Table 2
cannot be blindly used in other simulations, the table demonstrates the comparative
stability among the five schemes.

Table 2. The largest admissible ∆t values (in years) for a 100-year simulation of the
Monterey Bay case.

Mesh size          850 × 700   1700 × 1400   3400 × 2800   6800 × 5600
Fully-explicit        0.22         0.17          0.04          0.01
Semi-implicit 1      77.19        49.29         24.76         14.15
Semi-implicit 2       2.02         1.43          0.70          0.37
Fully-implicit 1      7.24         3.97          2.76          1.07
Fully-implicit 2      2.00         1.43          0.60          0.37
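The bisection itself is straightforward; a hedged Python sketch is given below, where run_is_stable is a placeholder for a routine that runs the full simulation with a given ∆t and reports whether s stayed within [0, 1].

```python
def largest_stable_dt(run_is_stable, dt_lo, dt_hi, rel_tol=0.01, max_iter=50):
    """Bisection for the largest admissible dt.  run_is_stable(dt) must
    return True if 0 <= s <= 1 held at every mesh point and time level."""
    assert run_is_stable(dt_lo), "lower bracket must be stable"
    for _ in range(max_iter):
        if dt_hi - dt_lo < rel_tol * dt_lo:
            break
        dt_mid = 0.5 * (dt_lo + dt_hi)
        if run_is_stable(dt_mid):
            dt_lo = dt_mid        # stable: move the lower bracket up
        else:
            dt_hi = dt_mid        # unstable: move the upper bracket down
    return dt_lo
```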
5.3. Computational speed
An understanding of the temporal accuracy and numerical stability is an important
first step toward choosing a best-performing numerical scheme. Once the spatial mesh
resolution is prescribed via ∆x and ∆y, each scheme can in principle estimate its
matching time step size ∆t, so that temporal errors and spatial errors are balanced.
(This issue will be discussed in Section 6.) If the estimated ∆t violates the stability
requirement, the value of ∆t has to be decreased accordingly. Now the following question
arises: Which scheme can finish the simulation most quickly, using its largest admissible
∆t value that satisfies both accuracy and stability?
We therefore turn our focus to predicting the computational time for each numerical
scheme, provided that the numbers of spatial mesh points and time steps are known.
This prediction relies on two types of information: (1) the number of floating-point
operations and volume of data traffic, (2) the main hardware features of an intended
computer.
5.3.1. Work load and data traffic What we are interested in is more than just the
conventional O(N) algorithmic complexity model, which is too crude for predicting
the actual time usage on real-world hardware. Instead, we count the actual numbers
of numerical operations and data read/write operations invoked. The latter factor is
particularly important for understanding the performance of a software implementation.
Let us recall that the spatial domain is gridded into a 2D uniform mesh. Except
for the boundary points, which only count for a very small percentage, the volume of
computation work and data traffic is the same for every mesh point. Therefore, we will
in the following only investigate the point-wise work load and data traffic.
Noting that all the five numerical schemes carry out several actions per time step,
we count for each action the work load and data traffic per mesh point. The reason for
action-wise counts is that the speed of some actions may be limited by the floating-
point operations, while other actions may be constrained by the data traffic, either
between the registers and L1 cache or between the main memory and the entire cache
hierarchy. Let us consider semi-implicit scheme 1 as an example. It has four actions per
time step, namely, a linear system is first set up and then solved for computing $h^{\ell+1}$,
followed by two similar actions for computing $s^{\ell+1}$. While setting up the h linear system
is likely determined by a computer’s floating-point capability, as suggested by Table 3,
the speed of the three other actions likely depends on the data movement capability
within the entire memory-cache hierarchy.
Table 3. Counts of floating-point operations (FLOP), data loads (LD) and stores (ST)
between L1 cache and registers, sum of loads and stores (MEM) that touch the main
memory. All the counts are per time step and per mesh point. The counts associated
with linear system solves are for one CG or GMRES iteration.
Scheme             Action               FLOP    LD    ST   MEM
Fully-explicit     Compute h_{i,j}        57    43     1     5
                   Compute s_{i,j}        37    24     1     5
Semi-implicit 1    Set up h system        62    21     8    10
                   Solve h system         15    52     9    21
                   Set up s system        35    35    10    10
                   Solve s system         15    68    14    21
Semi-implicit 2    Set up h system       262   158    24    20
                   Solve h system         15    52     9    21
                   Set up s system       117   125    29    20
                   Solve s system         15    68    14    21
Fully-implicit 1   Set up h-s system     150   150    92    27
                   Solve h-s system       46   264    52    62
Fully-implicit 2   Set up h-s system     225   223   138    28
                   Solve h-s system       46   264    52    62
Counting the number of floating-point operations and volume of data traffic can be
done in two ways. The first approach is to manually accumulate this info by reading
through the computer program line by line. Such a manual count is doable but can
often be cumbersome and even inaccurate. The inaccuracy arises if a compiler, in
order to optimize performance, re-orders computations and introduces new intermediate
variables. For this particular reason, we adopt in this paper another counting approach.
More specifically, the PAPI tool§ is used to profile a compiled code. The actual numbers
of floating-point operations and data transfers between a CPU’s registers and its L1
cache are namely recordable by the CPU’s hardware performance counters.
§ PAPI: http://icl.cs.utk.edu/papi/.
Table 3 reports the numbers of floating-point and data load/store operations,
needed by the different actions of the five numerical schemes. PAPI-v4.1.4 was used on
an Intel Xeon E5504 processor. All the computer programs were compiled by the GNU
C++ compiler (version 4.4.3) with -O3 optimization. The “FLOP” column of Table 3
reports PAPI’s PAPI FP INS event, the “LD” column is associated with PAPI L1 DCR,
while PAPI L1 DCW gives the “ST” column. In addition, column “MEM” contains
manual counts or estimates of the minimum required volume of bi-directional data traffic
(loads+stores) that touches the main memory. This is because PAPI unfortunately does
not collect this information. We have always assumed that the CPU’s cache hierarchy
is not large enough to hold the entire data structure.
5.3.2. Relating computational speed to hardware To predict the computing time of a
numerical method, it is not enough to only know the numbers of involved floating-
point and data load/store operations, as reported in Table 3. First of all, a computer
program often involves other operations than those reported in the table. An example is
the preparation work of an iterative linear system solver before starting the iterations.
Secondly, the time usage associated with solving linear systems must be estimated
together with the number of iterations needed. Thirdly, for the two fully-implicit
schemes, Newton-Raphson iterations work as an outer loop, where a new linear system is
set up and solved during each Newton-Raphson iteration. An estimate of the number of
Newton-Raphson iterations is thus needed. Fourthly, and most importantly, prediction
of time usage has to consider the hardware capabilities of an intended computer, also
depending on whether the computer is run in serial or parallel mode.
Here, we want to make an attempt at predicting a lower bound of time usage by
the numerical schemes, based on information from Table 3 and the hardware (peak)
capabilities of a multicore-based parallel computer. Our assumptions are as follows:
(i) Due to hardware technologies such as pipelining of operations and prefetching data
into caches, modern CPUs are able to avoid, to a great extent, stall of the data
and/or instruction streams.
(ii) We only focus on three sources of performance limitation: (1) CPU’s clock rate,
(2) data transfer bandwidth between registers and the L1 cache, (3) data transfer
bandwidth between the last-level cache and main memory.
(iii) A lower bound of time usage is thus the maximum value among (1) time needed
by the CPU core(s) to execute the floating-point operations, (2) time needed by
the L1 cache(s) to load data into the registers, (3) time needed by the registers to
store data back to L1, and (4) time needed by the main memory to execute its data
loads and stores.
It should be remarked that the above assumptions are motivated by simplicity. Ideally,
the cache miss rates at different levels and the volumes of data traffic within the cache
hierarchy should be considered. However, accurate counts of the cache misses and
volumes of inter-cache data traffic are in general extremely difficult to quantify. These
are therefore not included in our modeling philosophy, which is easily put into practice
while still predicting a useful lower bound of the computing time.
As hardware capabilities of a multicore CPU, the following parameters are assumed
known:
• The peak capability of a single CPU core to execute floating-point operations is
denoted by F—max number of floating-point operations per second.
• The bandwidths between a CPU core's private L1 cache and its registers are denoted by
$B^{r}_{L1}$ (the number of bytes readable from L1 per second) and $B^{w}_{L1}$ (the number of bytes
writable to L1 per second). Two bandwidths are distinguished because a dedicated
channel is assumed for the data loads, while another is dedicated for the data stores.
• The bandwidth of the main memory is represented by $B_{M}$ (the number of bytes
transferred per second). Here we assume that load and store operations share the
same channel(s), which is also shared among multiple cores.
5.3.3. Simple models for predicting computational speed We denote by $n_{\mathrm{FLOP}}$ the
number of floating-point operations, $n_{\mathrm{load}}$ the number of bytes loaded from L1, and
$n_{\mathrm{store}}$ the number of bytes stored to L1. Similarly, $n_{\mathrm{Mload}}$ denotes the number of bytes
loaded from the main memory, while $n_{\mathrm{Mstore}}$ is for the stores.
Serial computing time When only a single CPU core is used, the lower bound of serial
computing time is described by the following simple formula:
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{F},\; \frac{n_{\mathrm{load}}}{B^{r}_{L1}},\; \frac{n_{\mathrm{store}}}{B^{w}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B_{M}} \right). \qquad (6) \]
Since $n_{\mathrm{store}}$ can safely be assumed to be smaller than $n_{\mathrm{load}}$, while we typically have
$B^{r}_{L1} = B^{w}_{L1}$, the above formula can be further simplified as
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{F},\; \frac{n_{\mathrm{load}}}{B^{r}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B_{M}} \right). \qquad (7) \]
Parallel computing time In a typical multicore architecture, the L1 cache is private to
each CPU core, so the aggregate effect of employing multiple $B^{r}_{L1}$ and $B^{w}_{L1}$ channels scales linearly
with the number of CPU cores in use. Using multiple CPU cores also means a linear
expansion of the floating-point capability $F$. On the other hand, the aggregate value of
the main memory bandwidth $B_{M}$ depends on the actual memory hierarchy, often not
scaling linearly with the number of CPU cores. If $p$ denotes the number of CPU
cores used, we let $B^{p}_{M}$ denote the aggregate main memory bandwidth. Now, the
lower bound of parallel computing time can be found as
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{pF},\; \frac{n_{\mathrm{load}}}{p\,B^{r}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B^{p}_{M}} \right). \qquad (8) \]
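In code, models (7) and (8) reduce to a single maximum over three ratios. A minimal sketch (all argument names are ours) follows.

```python
def predicted_time_lower_bound(n_flop, n_load_bytes, n_mem_bytes,
                               F, B_L1_read, B_M_aggregate, p=1):
    """Lower-bound wall time from model (7) (p=1) or model (8) (p>1).
    F and B_L1_read are per-core capabilities; B_M_aggregate is the main
    memory bandwidth available to the p cores together."""
    return max(n_flop / (p * F),
               n_load_bytes / (p * B_L1_read),
               n_mem_bytes / B_M_aggregate)
```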
Some comments are in order here. First, both the prediction models (7) and (8)
are based on a set of simplifications, making them easily applicable, but also with
the possibility of gravely under-estimating the actual computing time. Second, neither
model considers the impact of inter-cache data traffic, i.e., L1↔L2 and L2↔L3. One
reason is that this traffic is often not the bottleneck. Another reason is that
estimating the actual volumes of inter-cache data traffic would make the prediction models
unbearably complex. Third, the overhead of synchronization and data communication
between cores/sockets/nodes is ignored for simplicity.

Table 4. Total numbers of time steps, Newton-Raphson iterations, CG iterations and
GMRES iterations used by the five schemes for the example of using a Nehalem-EP.

Scheme             #time   #Newton   #CG    #GMRES
Fully-explicit       100     N/A     N/A      N/A
Semi-implicit 1      100     N/A     335      234
Semi-implicit 2      100     N/A     612      485
Fully-implicit 1     100     200     N/A     1676
Fully-implicit 2     100     200     N/A     1375
5.3.4. The example of using a Nehalem-EP To check the quality of our prediction
models (7) and (8), we ran all five numerical schemes for 100 time steps on a
1700 × 1400 mesh. The hardware used is a Nehalem-EP that consists of two sockets, each
being a quad-core Xeon 2.0 GHz E5504 processor. The values of $F = 4$ Gflops/s (no code
vectorization) and $B^{r}_{L1} = 16$ GB/s are deduced from Intel's hardware specification, while
the values of $B^{p}_{M}$ are taken from the STREAM‖ benchmark's "copy" rates, measured on
this particular computer. More specifically, we have $B^{1}_{M} = 6.71$ GB/s, $B^{2}_{M} = 13.18$ GB/s,
$B^{4}_{M} = 16.90$ GB/s, and $B^{8}_{M} = 17.11$ GB/s.
Using the information given in Tables 3 and 4, we can calculate the values of $n_{\mathrm{FLOP}}$,
$n_{\mathrm{load}}$, and $n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}$, which are needed in the prediction models (7) and (8). Table 5
compares the predicted time usages TP against the actual time usages TA. We remark
that the high-quality Trilinos software package [6] was used to implement all the linear
system solvers.
It can be seen from Table 5 that our simple prediction models (7) and (8)
consistently under-estimate the time usage. This is an expected behavior because the
models are meant to give a lower bound. In general, the prediction accuracy is slightly
better for the fully-explicit scheme, while roughly the same for the four non-explicit
schemes. This means that the predicted TP value is helpful in practice, because the
comparative speed difference between the five schemes is correctly anticipated.
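As an illustration of where these predictions come from (our own back-of-the-envelope reading of model (7), assuming 8-byte double-precision values), consider the fully-explicit scheme on one core: the 1700 × 1400 mesh and 100 time steps give 2.38 × 10^8 point-steps, and Table 3 lists 57 + 37 = 94 floating-point operations, 43 + 24 = 67 L1 loads and 5 + 5 = 10 main-memory accesses per point per step. This yields
\[ \max\left( \frac{94 \cdot 2.38\times 10^{8}}{4\times 10^{9}},\; \frac{67 \cdot 8 \cdot 2.38\times 10^{8}}{16\times 10^{9}},\; \frac{10 \cdot 8 \cdot 2.38\times 10^{8}}{6.71\times 10^{9}} \right) \approx \max(5.6,\; 8.0,\; 2.8)\ \mathrm{s}, \]
i.e., the L1-to-register load traffic is the predicted bottleneck, reproducing the 7.97 s entry in Table 5.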
‖ STREAM: Sustainable Memory Bandwidth in High Performance Computers,
http://www.cs.virginia.edu/stream/.
Table 5. Comparing the actual time TA with predicted time TP (in seconds) on a Nehalem-EP.

Scheme             Time    1 core   2 cores   4 cores   8 cores
Fully-explicit      TA      13.04      6.73      3.56      1.43
                    TP       7.97      3.99      1.99      1.20
Semi-implicit 1     TA      82.30     44.44     29.05     25.17
                    TP      47.52     23.76     15.72     15.52
Semi-implicit 2     TA     178.68     96.92     61.38     50.12
                    TP     110.79     55.40     34.37     30.21
Fully-implicit 1    TA     625.75    322.20    213.25    176.19
                    TP     562.23    281.12    140.56    121.64
Fully-implicit 2    TA     537.02    286.37    183.83    145.15
                    TP     485.04    242.52    121.26    101.50
6. Putting everything together
So far, the reader should have realized that there are many factors that affect the
computing time of a particular numerical scheme:
(i) The spatial problem size in form of the number of mesh points.
(ii) The number of floating-point operations needed per mesh point and per time step
(and per linear solver iteration).
(iii) The volumes of data traffic, which touch the L1 cache and main memory, per mesh
point and per time step (and per linear solver iteration).
(iv) The hardware capabilities of a multicore-based parallel computer, in form of $F$,
$B^{r}_{L1}$, $B^{w}_{L1}$, and $B^{p}_{M}$.
(v) The total number of time steps needed.
Factor 1 is often prescribed a priori, the second and third factors are static
properties of a numerical scheme, while the fourth factor regarding the hardware is easily
obtainable. The last factor thus deserves our attention, because different numerical
schemes may require very different values of ∆t to achieve the same level of accuracy.
Moreover, numerical stability will impose an additional requirement on ∆t. Predicting
the actual time usage therefore relies on a good estimate of the largest admissible ∆t.
This requires a quantification of the numerical errors as described below.
6.1. Quantifying the error model
Let us recall the model of numerical errors (5) from Section 5.1. There, we have
assumed that the numerical errors have two independent contributions: $C_t \Delta t^{\gamma}$ and
$C_x \Delta x^{\nu} + C_y \Delta y^{\nu}$. In order to find the constant values $C_t$, $\gamma$, $C_x$, $C_y$, and $\nu$, numerical
experiments are needed. Table 1 from Section 5.1 gives an example of how to determine
the values of Ct and γ, which depend on the temporal discretization chosen, and which
also differ for the h and s equations. To determine the values of Cx, Cy, and ν, another
set of numerical experiments is needed. This time, the value of ∆t is fixed, while a series
of different ∆x and ∆y values are tried. We remark that all such numerical experiments
can use a short simulation time length T and relatively coarse mesh spacings, to be
able to quickly establish (5). It should be remarked that the values of $C_t$, $C_x$ and $C_y$
are typically functions of T. Nevertheless, our hope is that the ratio between the three
constants remains the same, so that we can compare the magnitudes of error between
time and space.
6.2. Finding the largest admissible ∆t
For real-world sediment transport simulations, it is not unusual that the spatial mesh
spacing (∆x, ∆y) is prescribed as the starting point. This can come from earlier
experiences and/or considerations for the capacities of a target computer.
For each temporal discretization scheme, once ∆x and ∆y are given, we can use
the established error model (5) to estimate the largest value of ∆t, such that
\[ C_t \Delta t^{\gamma} \le C_x \Delta x^{\nu} + C_y \Delta y^{\nu} \]
holds for both h and s. Then, the already established information about numerical
stability, in form of Table 2, is extrapolated to check whether the estimated ∆t above
satisfies the stability requirement. If not, ∆t is decreased to ensure stability.
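Rearranging the balance condition gives the candidate ∆t directly; a one-line Python sketch (names ours):

```python
def balanced_dt(Ct, gamma, Cx, Cy, nu, dx, dy):
    """Largest dt for which the temporal error term of model (5) does not
    exceed the spatial error terms."""
    return ((Cx * dx**nu + Cy * dy**nu) / Ct) ** (1.0 / gamma)
```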
6.3. Predicting time usage
So far, we have found for each numerical scheme its largest admissible ∆t, such that
the numerical error contributed by the temporal discretization is guaranteed to not
exceed that of the spatial discretization. The stability condition is satisfied as well.
What remains is to predict the time usage for each numerical scheme. To this end, we
also need to estimate the iteration numbers of Newton-Raphson and/or linear solver(s)
for the non-explicit schemes. These are typically estimated by extrapolating known
iteration counts.
Finally, after obtaining the hardware capability parameters $F$, $B^{r}_{L1}$, $B^{w}_{L1}$, and $B^{p}_{M}$,
we are ready to apply the prediction models (7) and (8).
6.4. A large-scale example
To synthesize a realistic scenario, we used the case of Monterey Bay again. This time, we
started by prescribing ∆x = ∆y = 20 m, which gave a 9206 × 6108 spatial mesh. Then,
the largest admissible ∆t value was determined for the fully-explicit scheme and the
two semi-implicit schemes. The two fully-implicit schemes were not considered, because
we knew from before that fully-implicit scheme 1 has no advantage over semi-implicit
scheme 1, while fully-implicit scheme 2 is much slower than semi-implicit scheme 2.
Table 6. Comparing the actual time usage TA with predicted time usage TP (in seconds) on
Tianhe-1A, for a 100-year simulation on a 9206 × 6108 spatial mesh. The rows of $F^{p}_{A}$
report the achieved Gflops/s rates by using p CPU cores.

Scheme             p         240       480       960      1920
Fully-explicit     TA      150.76     78.13     36.57     17.92
                   TP       72.19     35.23     17.41      8.65
                   F^p_A   701.20   1353.04   2890.70   5899.16
Semi-implicit 1    TA      273.23    142.52     66.13     32.29
                   TP      178.79     89.40     44.70     22.35
                   F^p_A    41.80     80.14    172.71    353.72
Semi-implicit 2    TA      648.98    356.87    137.54     78.82
                   TP      429.58    214.79    107.40     53.70
                   F^p_A    50.99     92.72    240.58    419.81
Balancing the temporal and spatial errors, the fully-explicit scheme chose ∆t = 0.81
yr, and semi-implicit scheme 1 chose ∆t = 0.74 yr, whereas the second-order semi-
implicit scheme 2 chose ∆t = 22.9 yr. However, both the fully-explicit scheme and
semi-implicit scheme 2 had to decrease their choices of ∆t for the sake of numerical
stability. Finally, while keeping a small safety margin, we decided to use ∆t = 0.005 yr
for the fully-explicit scheme, ∆t = 0.5 yr for semi-implicit scheme 1, and ∆t = 0.25 yr
for semi-implicit scheme 2.
For a 100-year simulation of Monterey Bay on the 9206 × 6108 mesh, we estimated
that the total numbers of floating-point operations needed would be $106 \times 10^{12}$, $11 \times 10^{12}$,
and $33 \times 10^{12}$ for the fully-explicit scheme, semi-implicit scheme 1, and semi-implicit
scheme 2, respectively.
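As a back-of-the-envelope consistency check (our own), the fully-explicit figure follows directly from Table 3: 9206 × 6108 ≈ 5.6 × 10^7 mesh points, 100/0.005 = 20 000 time steps, and 57 + 37 = 94 floating-point operations per point per step give roughly 5.6 × 10^7 × 2 × 10^4 × 94 ≈ 1.06 × 10^14 ≈ 106 × 10^12 operations.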
As a large-scale hardware testbed, we used Tianhe-1A Hunan Solution¶—the
world’s No. 28 supercomputer, according to the Top500 list published in June 2012. Each
compute node of this supercomputer has two six-core Xeon X5670 CPUs and one Nvidia
Tesla M2050 GPU. Since there are no GPU implementations for the two semi-implicit
schemes, only the CPU part of the supercomputer was used for our time measurements.
The hardware parameters needed for the prediction model (8) are $F = 5.86$ Gflops/s,
$B^{r}_{L1} = B^{w}_{L1} = 23.44$ GB/s, and $B^{12}_{M} = 32.86$ GB/s (i.e., when all the twelve cores per
compute node are in use). The compiler used was icc of version 11.0 using the -O3
optimization flag.
Table 6 lists the actual time usages TA and the achieved Gflops/s rates $F^{p}_{A}$, which
were measured on Tianhe-1A Hunan Solution. The predicted time usages TP are
also listed for comparison. Despite the fact that the fully-explicit scheme used the
most floating-point operations, its actual time usage was the lowest among the three
candidates. This was correctly anticipated by the prediction model (8). All three parallel
implementations scaled nicely between 240 and 1920 CPU cores. The highest $F^{p}_{A}$ rate
¶ http://i.top500.org/system/177448
of 5899.16 Gflops/s was, not surprisingly, achieved by the fully-explicit scheme when
using 1920 cores.
7. Concluding remarks
It is not trivial to achieve the best possible computing speed, while maintaining
a desirable level of accuracy and avoiding numerical instability. It becomes more
complicated when there exists a collection of candidate numerical schemes. This paper
has outlined a systematic methodology, which involves two main
tasks. First, small-scale and short-time-length experiments can be used to establish the
error model (5) and the numerical stability requirements in form of Table 2. Such
information helps in choosing the largest admissible ∆t value when the spatial mesh spacing
is given. Second, the prediction models (7) and (8) can rank the candidate numerical
schemes with respect to the overall computing time. The two performance prediction
models are easy to use, because the needed hardware parameters are readily obtainable
for any computing system. Moreover, the static properties of a particular numerical
scheme, in form of Table 3, can be established by using, e.g., profiling tools such as PAPI.
More importantly, this methodology should be applicable to many other numerical
simulations.
The measurements presented in this paper may give the impression that the fully-
explicit scheme is always the winner with respect to the overall computing time. Such a
conclusion would be wrong, because the balanced relationship between ∆t and ∆x, ∆y will change
from case to case. It may even happen that the ranking of the schemes changes on a
different hardware platform. Therefore, the prediction models (7) and (8) are helpful
when planning really challenging and huge-scale simulations of marine sedimentary basin
filling.
One particular reason for the inferior computing speed of the two semi-implicit
schemes, in comparison with their fully-explicit counterpart, is the relatively large
numbers of CG or GMRES iterations needed to solve the linear systems per time
step. So far, we have not applied any preconditioner to the linear solvers. It remains
to be seen whether suitable preconditioners can sufficiently decrease the number of
CG/GMRES iterations, so that the overall time usage is reduced despite the extra
computing effort incurred by the preconditioners. On the other hand, the fully-explicit
scheme is, relatively speaking, better suited to GPU platforms, because this scheme is easily
implemented and is the least sensitive to data traffic bandwidths.
Acknowledgments
We thank the National Supercomputing Center in Changsha for the access to the
Tianhe-1A Hunan Solution supercomputer. Dr. Nan Wu at the National University
of Defense Technology is acknowledged for his assistance with using the supercomputer.
Computing facilities from the Norwegian Metacenter for Computational Science
(NOTUR) were used to carry out some of the numerical experiments of this paper.
References
[1] C. Amante and B. W. Eakins. ETOPO1 1 arc-minute global relief model: Procedures, data
sources and analysis. Technical report, National Oceanic and Atmospheric Administration,
2009. NOAA Technical Memorandum, NESDIS NGDC-24.
[2] S. R. Clark, W. Wei, and X. Cai. Numerical analysis of a dual-sediment transport model applied
to Lake Okeechobee, Florida. In Proceedings of the 9th International Symposium on Parallel
and Distributed Computing, pages 189–194. IEEE Computer Society Press, 2010.
[3] K. L. Farnsworth. Monterey Canyon as a conduit for sediment to the deep ocean. Technical
report, Virginia Institute of Marine Science, 2000.
[4] D. Granjeon. Deterministic stratigraphic modeling; conception and applications of a
multilithological 3D diffusive model. Mem. Geosci. Rennes, 78, 1997.
[5] D. Granjeon and P. Joseph. Concepts and applications of a 3-D multiple lithology, diffusive
model in stratigraphic modeling. Numerical Experiments in Stratigraphy: Recent Advances
in Stratigraphic and Sedimentologic Computer Simulations, SEPM Special Publication No. 62,
pages 197–210, 1999.
[6] Michael Heroux, Roscoe Bartlett, Vicki Howle, Robert Hoekstra, Jonathan Hu, Tamara Kolda,
Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, Andrew Salinger, Heidi
Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. An Overview of Trilinos.
Technical Report SAND2003-2927, Sandia National Laboratories, 2003.
[7] Eric W.H. Hutton and James P.M. Syvitski. Sedflux 2.0: An advanced process-response model
that generates three-dimensional stratigraphy. Computers & Geosciences, 34(10):1319–1337,
2008.
[8] T. E. Jordan and P. B. Flemings. Large-Scale stratigraphic architecture, eustatic variation, and
unsteady tectonism: A theoretical evaluation. Journal of Geophysical Research, 96(B4):6681–
6699, 1991.
[9] I. Klaucke, D. G. Masson, N. H. Kenyon, and J. V. Gardner. Sedimentary processes of the lower
Monterey Fan channel and channel-mouth lobe. Marine Geology, 206:181–194, 2004.
[10] F. Li, C. Dyt, and C. Griffiths. 3D modelling of the isostatic flexural deformation. Computers and
Geosciences, 30:1105–1115, 2004.
[11] NOAA. Autochart bathymetric map production, 2012. http://www.ngdc.noaa.gov/autochart/.
National Oceanic and Atmospheric Administration. Accessed: October, 2011.
[12] Chris Paola. Quantitative models of sedimentary basin filling. Sedimentology, 47:121–178, 2000.
[13] Jan C. Rivenæs. Application of a dual-lithology, depth-dependent diffusion equation in
stratigraphic simulation. Basin Research, 4(2):133–146, 1992.
[14] Jan C. Rivenæs. A computer simulation model for siliclastic basin stratigraphy. PhD thesis,
University of Trondheim, 1993.
[15] James P.M. Syvitski and Eric W.H. Hutton. 2D SEDFLUX 1.0C: An advanced process-response
numerical model for the fill of marine sedimentary basins. Computers & Geosciences, 27(6):731–
753, 2001.
[16] D. M. Tetzlaff and J. W. Harbaugh. Simulating Clastic Sedimentation. Van Nostrand Reinhold,
New York, 1989.