Balancing efficiency and accuracy for sediment
transport simulations
Wenjie Wei1, Stuart R. Clark1, Huayou Su2, Mei Wen2 and
Xing Cai1,3
1 Department of Computational Geoscience, Simula Research Laboratory, P.O. Box 134, 1325 Lysaker, Norway
2 School of Computer Science, National University of Defense Technology, Changsha, 410073, China
3 Department of Informatics, University of Oslo, P.O. Box 1080 Blindern, 0316 Oslo, Norway
E-mail: [email protected], [email protected], [email protected],
[email protected], [email protected]
Abstract. Simulating multi-lithology sediment transport requires numerically
solving a fully-coupled system of nonlinear partial differential equations. The most
standard approach is to simultaneously update all the unknown fields. Such a fully-
implicit strategy can be computationally demanding due to the need for Newton-
Raphson iterations, each having to set up and solve a large system of linearized
algebraic equations. Fully-explicit numerical schemes that do not solve linear systems
are possible to devise, but suffer from lower numerical stability and accuracy. If we
count the total number of floating-point operations needed to achieve stable numerical
solutions with the same level of accuracy, the fully-implicit approach probably wins over
its fully-explicit counterpart. However, the latter may nevertheless win in the overall
computation time, because computers achieve higher hardware efficiency for simpler
numerical computations. Adding to this competition, there are semi-implicit numerical
schemes that lie between the two extremes. This paper has two novel contributions.
First, we devise a new semi-implicit scheme that has second-order accuracy in the
temporal direction. Second, and more importantly, we propose a simple prediction
model for the overall computation time on multicore architectures, applicable to many
numerical implementations. Based on performance prediction, appropriate numerical
schemes can be chosen by considering accuracy, stability, and computing speed at
the same time. Our methodology is tested by numerical experiments modeling the
sediment transport in Monterey Bay.
1. Motivation
Often, numerical accuracy and computing time are two conflicting goals for sediment
transport simulations, like in many other cases of solving nonlinear partial differential
equations (PDEs). Ideally, one would like to spend the least amount of time waiting
for a computer to produce solutions with a desirable level of accuracy. This is, however,
easier said than done. On the one hand, due to the absence of analytical solutions, the accuracy
of the numerical solutions can at best be estimated indirectly. On the other hand,
there may exist a variety of numerical strategies, with different properties of efficiency,
stability, and accuracy. We also need to remember that the overall computational
efficiency often arises from a combination of factors: convergence rate, numerical
stability, software implementation, and computer hardware.
While restricting the focus to a specific mathematical model of sediment transport,
this paper presents a general methodology for choosing the best-performing temporal
discretization strategy out of a collection of alternatives. Moreover, as a novel
contribution in the numerical aspect, we propose in this paper a new temporal
discretization strategy, which is based on the Crank-Nicolson method and splits the
two coupled nonlinear PDEs. This new scheme achieves second-order accuracy in
time, without having to solve systems of nonlinear algebraic equations. Our second
novel contribution is a simple mathematical model for predicting the computing time
of any numerical implementation. In connection with sedimentary basin simulations,
different numerical schemes may have very different strengths and weaknesses. Our
performance-prediction model can help to pick a best-performing code out of a collection
of alternatives, taking into account accuracy, stability, algorithmic complexity and
hardware utilization.
2. Introduction and the mathematical model
Dynamic simulations of sedimentary basin filling nowadays may involve a multitude
of equations to represent a host of physical phenomena, from sediment erosion and
water-borne transport to underwater landslides and sediment compaction. There are
many ways to categorize these simulations, see [12], but for siliclastic depositional shelf
environments, the dominating assumption is that all or part of the averaged, long-term
movement of sediment is based on a diffusion equation. In diffusion-based models, the
sediment flux occurs in proportion to the slope between neighboring cells, representing
the averaged effects of water-borne sediment, sediment slumps and erosion. Transport
coefficients represent different levels of effectiveness for the position relative to sea level
and for water-driven and gravity-driven transport. Examples of such diffusion-based
models are DEMOSTRAT [13], DIONISOS [4, 5] and SEDFLUX [15, 7]. Some models,
such as the Stanford/CSIRO model SEDSIM [16, 10], calculate the water-driven part as
advective transport based on a simplified Navier-Stokes calculation. However, SEDSIM
still applies a diffusion equation to represent gravity-driven sediment transport. Thus,
the diffusion equation is an important component in either of these two model categories.
Although derived from sediment flux along rivers, see [12], it was Jordan and
Flemings [8] who first applied the diffusion model to clastic shelf deposition environments
with eustatic sea level variation. The equation has the basic form:
\[ \frac{\partial h}{\partial t} = \nabla \cdot (K \nabla h), \]
where h(x, y, t) is the height above some arbitrary horizontal surface in the x-y plane, t
is time and K(x, y, t) is the transport or diffusion coefficient giving the effectiveness of
the diffusion.
Using a parameter for the fraction of a given sediment, Rivenæs [13] added a second
equation to calculate the ratio of two sediments in a layer of deposited material. The
modified equations include s(x, y, t) and 1−s(x, y, t) as the fractions of the two sediments.
In particular, we consider in this paper a sedimentation scenario of sand and mud, and
it should be said that the subsequent numerical strategies carry straightforwardly over
to multi-lithology cases. The following two nonlinear PDEs, derived by Rivenæs [14],
constitute our mathematical model:
\[ \frac{\partial h}{\partial t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s \nabla h) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s) \nabla h\bigr), \qquad (1) \]
\[ A\,\frac{\partial s}{\partial t} + s\,\frac{\partial h}{\partial t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s \nabla h). \qquad (2) \]
In the above model, α(x, y) and β(x, y) denote the diffusion coefficients for sand
and mud. In addition, $C_s$ and $C_m$ are the compaction ratios of the two sediment types.
Moreover, A is a constant representing the thickness of a top layer, in which sediments
are transported.
The initial conditions are of the form h(x, y, 0) = h0(x, y) and s(x, y, 0) = s0(x, y).
As boundary conditions, most of the boundary has the no-flow condition, i.e., the
homogeneous Neumann boundary condition ∂h/∂n = ∂s/∂n = 0. On the remaining
part of the boundary, the fluxes of sand and mud inflow are prescribed, i.e.,
\[ -\alpha s\,\frac{\partial h}{\partial n} = f_s, \qquad -\beta (1-s)\,\frac{\partial h}{\partial n} = f_m. \qquad (3) \]
These boundary conditions model an inflow of sediments due to, e.g., a river crossing the
boundary of the solution domain.
Although Granjeon [4] generalized these equations to handle multiple lithologies by
adding an additional equation, similar to (2), the major additional complexity in solving
these equations is in the nature of coupling (1) with (2), because of their different forms.
While the diffusion equations are prevalent, as presented above, the numerical
methods and efficiency of solving the coupled diffusion-sediment equations (1)-(2) have
not been presented in any detail in the literature. In this paper, we hope to close this
gap and, in addition, present a new semi-implicit scheme and compare it with the more
commonly used approaches.
3. Numerical strategies
This section is devoted to a description of several numerical methods for solving (1)-
(2). We will first look at different temporal discretization schemes, which constitute
the numerical core of the present paper. Thereafter, details associated with the spatial
discretization are presented.
3.1. Temporal discretization
The time domain 0 < t ≤ T is divided into a number of equally spaced discrete time
levels, with ∆t as the time step size. Let the superscript $\ell$ be the time level index, such
that $h^{\ell}$ denotes $h(x, y, \ell\Delta t)$ and $s^{\ell}$ denotes $s(x, y, \ell\Delta t)$. Then, the temporal derivatives
are simply approximated as
\[ \frac{\partial h}{\partial t} \approx \frac{h^{\ell+1} - h^{\ell}}{\Delta t}, \qquad \frac{\partial s}{\partial t} \approx \frac{s^{\ell+1} - s^{\ell}}{\Delta t}. \]
The remaining task of temporal discretization is to choose time level ` or ` + 1,
or a combination of both, at which the right-hand-side terms of (1)-(2) are to be
evaluated. Different strategies will give rise to fully-explicit, semi-implicit and fully-
implicit schemes.
3.1.1. Fully-explicit scheme To avoid solving systems of nonlinear algebraic equations,
the right-hand-side terms of (1)-(2) can use the already computed h and s values. More
specifically, (1)-(2) are transformed as follows, by a fully-explicit temporal discretization:
\[ \frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell}\bigr), \]
\[ A\,\frac{s^{\ell+1} - s^{\ell}}{\Delta t} + s^{\ell+1}\,\frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell+1}). \]
It should be noted that h is to be updated before s during each time step. This is
why the newly computed $h^{\ell+1}$ (from the first equation) is immediately used to compute
$s^{\ell+1}$ (in the second equation). Another remark is that $s^{\ell+1}$, instead of $s^{\ell}$, is used in
the $s\,\partial h/\partial t$ term on the left-hand side of (2). Numerical experiments show that this simple
trick improves the numerical stability of this fully-explicit scheme, in which both $h^{\ell+1}$
and $s^{\ell+1}$ are computed straightforwardly. The scheme has first-order accuracy in time.
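To make the structure concrete, the following minimal Python/NumPy sketch performs one such fully-explicit step on the interior points of a uniform grid. It is an illustration under simplifying assumptions, not the paper's implementation: boundary conditions are ignored, centered differences are used throughout (whereas Section 3.2 applies upwinding to the s-equation), and the function names are ours.

```python
import numpy as np

def div_k_grad(k, h, dx, dy):
    """Centered-difference approximation of div(k * grad(h)) at interior
    points, with k averaged arithmetically onto the cell faces."""
    kx = 0.5 * (k[1:-1, 1:] + k[1:-1, :-1])        # faces between x-neighbours
    ky = 0.5 * (k[1:, 1:-1] + k[:-1, 1:-1])        # faces between y-neighbours
    fx = kx * (h[1:-1, 1:] - h[1:-1, :-1]) / dx    # fluxes across x-faces
    fy = ky * (h[1:, 1:-1] - h[:-1, 1:-1]) / dy    # fluxes across y-faces
    return (fx[:, 1:] - fx[:, :-1]) / dx + (fy[1:, :] - fy[:-1, :]) / dy

def explicit_step(h, s, alpha, beta, Cs, Cm, A, dt, dx, dy):
    """One fully-explicit step: update h first, then use the new h (and the
    new s in the s*dh/dt term) when updating s, as described above."""
    rhs_h = div_k_grad(alpha * s, h, dx, dy) / Cs \
          + div_k_grad(beta * (1.0 - s), h, dx, dy) / Cm
    h_new = h.copy()
    h_new[1:-1, 1:-1] += dt * rhs_h
    # A (s_new - s)/dt + s_new (h_new - h)/dt = rhs_s  =>  solve point-wise:
    rhs_s = div_k_grad(alpha * s, h_new, dx, dy) / Cs
    dh = h_new[1:-1, 1:-1] - h[1:-1, 1:-1]
    s_new = s.copy()
    s_new[1:-1, 1:-1] = (A * s[1:-1, 1:-1] + dt * rhs_s) / (A + dh)
    return h_new, s_new
```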
3.1.2. Semi-implicit scheme 1 (backward Euler version) If $h^{\ell+1}$ and $s^{\ell+1}$ are used,
respectively, on the right-hand side of (1) and (2), a semi-implicit scheme based on the
backward Euler method can be derived as follows:
\[ \frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell+1}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell+1}\bigr), \]
\[ A\,\frac{s^{\ell+1} - s^{\ell}}{\Delta t} + s^{\ell+1}\,\frac{h^{\ell+1} - h^{\ell}}{\Delta t} = \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1} \nabla h^{\ell+1}). \]
Like the preceding fully-explicit scheme, $h^{\ell+1}$ and $s^{\ell+1}$ are computed separately
within each time step. However, each substep now requires solving a linear system,
hence the name semi-implicit. Also like the fully-explicit scheme,
this scheme has first-order accuracy in time. Numerical stability is, however, the strong
feature of this semi-implicit scheme. We remark that these two schemes were proposed
and studied in our earlier work [2].
3.1.3. Semi-implicit scheme 2 (Crank-Nicolson version) To improve the accuracy of
the above semi-implicit scheme, we adopt the Crank-Nicolson strategy and propose in
this paper a new scheme:
\[ \frac{h^{\ell+1,k} - h^{\ell}}{\Delta t} = \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1,k-1} \nabla h^{\ell+1,k}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell+1,k-1}) \nabla h^{\ell+1,k}\bigr) \right) + \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) + \frac{1}{C_m}\,\nabla \cdot \bigl(\beta (1-s^{\ell}) \nabla h^{\ell}\bigr) \right), \]
\[ A\,\frac{s^{\ell+1,k} - s^{\ell}}{\Delta t} + \frac{s^{\ell+1,k} + s^{\ell}}{2}\,\frac{h^{\ell+1,k} - h^{\ell}}{\Delta t} = \frac{1}{2}\left( \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell+1,k} \nabla h^{\ell+1,k}) + \frac{1}{C_s}\,\nabla \cdot (\alpha s^{\ell} \nabla h^{\ell}) \right). \]
As can be seen in the above two equations, the h solution is still to be computed
separately from s, and the Crank-Nicolson method is adopted in the temporal direction.
Careful readers will notice that inner iterations, with index k = 1, 2, . . ., are introduced
within each time step. Before starting the inner iterations, we assign $s^{\ell+1,0} = s^{\ell}$, which
is needed in the first semi-discretized equation for computing $h^{\ell+1,1}$. During each inner
iteration k, a linear system has to be solved for computing $h^{\ell+1,k}$, and another linear
system is needed for $s^{\ell+1,k}$. Numerical experiments suggest that two inner iterations
(k = 2) are sufficient for obtaining second-order accuracy in time. By comparison, one
inner iteration (k = 1) gives only first-order accuracy.
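The inner-iteration structure can be summarised by the following hedged Python sketch, in which solve_h and solve_s are placeholders (our own names) for the two linear solves per inner iteration; their assembly is problem specific and not shown.

```python
def crank_nicolson_step(h_old, s_old, solve_h, solve_s, n_inner=2):
    """Structural sketch of semi-implicit scheme 2 within one time step.
    solve_h(s_guess, h_old, s_old) returns h^{l+1,k};
    solve_s(h_new, h_old, s_old) returns s^{l+1,k}."""
    s_new = s_old                  # s^{l+1,0} = s^l
    h_new = h_old
    for _ in range(n_inner):       # two inner iterations give 2nd order
        h_new = solve_h(s_new, h_old, s_old)
        s_new = solve_s(h_new, h_old, s_old)
    return h_new, s_new
```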
3.1.4. Fully-implicit scheme 1 (backward Euler version) Unlike the above three
schemes, fully-implicit schemes compute $h^{\ell+1}$ and $s^{\ell+1}$ simultaneously. The standard
fully-implicit scheme, which was already proposed in [14], uses the backward Euler
temporal discretization. That is, $h^{\ell+1}$ and $s^{\ell+1}$ are used in all the right-hand-side terms
of (1)-(2). Thereafter, a spatial discretization will give rise to the following system of
nonlinear algebraic equations per time step:
\[ F_h\bigl(\mathbf{h}^{\ell+1}, \mathbf{s}^{\ell+1}, \mathbf{h}^{\ell}, \mathbf{s}^{\ell}\bigr) = \mathbf{0}, \qquad F_s\bigl(\mathbf{h}^{\ell+1}, \mathbf{s}^{\ell+1}, \mathbf{h}^{\ell}, \mathbf{s}^{\ell}\bigr) = \mathbf{0}. \qquad (4) \]
Here, $F_h$ denotes the set of nonlinear algebraic equations arising from (1), whereas $F_s$
arises from (2). Vectors $\mathbf{h}^{\ell+1}$, $\mathbf{s}^{\ell+1}$, $\mathbf{h}^{\ell}$, and $\mathbf{s}^{\ell}$ contain, respectively, the values of $h^{\ell+1}$, $s^{\ell+1}$,
$h^{\ell}$, and $s^{\ell}$ on all the spatial grid points.
Newton-Raphson iterations can be used to solve the entire system of nonlinear
algebraic equations (4), for which a new system of linear equations needs to be set up
and solved in every Newton-Raphson iteration.
3.1.5. Fully-implicit scheme 2 (Crank-Nicolson version) The preceding fully-implicit
scheme uses the backward Euler method in the temporal discretization, so its accuracy
in time is of first order. To achieve second-order accuracy in time, the Crank-Nicolson
method can be adopted in the temporal discretization. That is, both $h^{\ell+1}$, $s^{\ell+1}$ and $h^{\ell}$,
$s^{\ell}$ are used and equally weighted in all the right-hand-side terms of (1)-(2). The result
is also a system of nonlinear algebraic equations, of the same form as (4), for computing
$h^{\ell+1}$ and $s^{\ell+1}$ simultaneously.
3.2. Spatial discretization
We choose finite differences to carry out the spatial discretizations. This is mostly
motivated by the numerical and programming simplicity. It can be mentioned that
other spatial discretization techniques, such as finite elements, can also use the same
temporal discretizations discussed above.
3.2.1. Treatment of diffusion It is standard to use centered difference for the two
diffusion terms on the right-hand side of (1), for obtaining second-order accuracy in
space. For example, centered differencing applied to the $\nabla \cdot (\alpha s \nabla h)$ term gives the following
discretized form:
\[ \frac{\alpha_{i+\frac{1}{2},j}\, s_{i+\frac{1}{2},j}\,(h_{i+1,j} - h_{i,j}) - \alpha_{i-\frac{1}{2},j}\, s_{i-\frac{1}{2},j}\,(h_{i,j} - h_{i-1,j})}{\Delta x^2} + \frac{\alpha_{i,j+\frac{1}{2}}\, s_{i,j+\frac{1}{2}}\,(h_{i,j+1} - h_{i,j}) - \alpha_{i,j-\frac{1}{2}}\, s_{i,j-\frac{1}{2}}\,(h_{i,j} - h_{i,j-1})}{\Delta y^2}, \]
where the subscripts $i, j$ are the mesh point indices for a 2D uniform grid with mesh
spacings $\Delta x$ and $\Delta y$. In the above formula, the half-indexed terms are to be evaluated
as, e.g., $\alpha_{i+\frac{1}{2},j}\, s_{i+\frac{1}{2},j} = (\alpha_{i,j} s_{i,j} + \alpha_{i+1,j} s_{i+1,j})/2$.
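For illustration, the stencil translates almost literally into code. The following point-wise Python function is our own hedged transcription, with the first array index taken along x and the second along y.

```python
def diffusion_term(alpha, s, h, i, j, dx, dy):
    """Centered-difference value of div(alpha*s*grad(h)) at grid point (i, j),
    with half-indexed coefficients formed by arithmetic averaging."""
    a_e = 0.5 * (alpha[i, j] * s[i, j] + alpha[i + 1, j] * s[i + 1, j])
    a_w = 0.5 * (alpha[i, j] * s[i, j] + alpha[i - 1, j] * s[i - 1, j])
    a_n = 0.5 * (alpha[i, j] * s[i, j] + alpha[i, j + 1] * s[i, j + 1])
    a_s = 0.5 * (alpha[i, j] * s[i, j] + alpha[i, j - 1] * s[i, j - 1])
    return ((a_e * (h[i + 1, j] - h[i, j]) - a_w * (h[i, j] - h[i - 1, j])) / dx**2
          + (a_n * (h[i, j + 1] - h[i, j]) - a_s * (h[i, j] - h[i, j - 1])) / dy**2)
```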
3.2.2. Treatment of convection Equation (2) is a convection equation with respect
to s, because of the ∇ · (αs∇h) term. For the sake of numerical stability, one-sided
upwind finite difference is preferred over centered difference, despite its first-order spatial
accuracy.
To this end, it is customary to move the convection term ∇ · (αs∇h) to the left-
hand side of (2) when checking the flow direction. That is, −∇h gives the convection
velocity. The x-component, $-\partial h/\partial x$, is approximated by $(h_{i-1,j} - h_{i+1,j})/(2\Delta x)$, the
sign of which determines how the x-component of the convection term is discretized by
one-sided upwind differencing. More specifically, the
\[ \frac{\partial}{\partial x}\left( \alpha s \frac{\partial h}{\partial x} \right) \]
term is approximated by
\[ \left( \frac{\alpha_{i,j} s_{i,j} - \alpha_{i-1,j} s_{i-1,j}}{\Delta x} \right) \times \left( \frac{h_{i+1,j} - h_{i-1,j}}{2\Delta x} \right) \]
if we have $h_{i-1,j} > h_{i+1,j}$. Otherwise, the following approximation is used:
\[ \left( \frac{\alpha_{i+1,j} s_{i+1,j} - \alpha_{i,j} s_{i,j}}{\Delta x} \right) \times \left( \frac{h_{i+1,j} - h_{i-1,j}}{2\Delta x} \right). \]
The discretization in the y-direction is done similarly.
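A corresponding point-wise sketch of the x-direction upwind choice is given below (the y-direction is handled analogously); the function name and index convention are again ours.

```python
def upwind_convection_x(alpha, s, h, i, j, dx):
    """Upwind discretization of d/dx(alpha*s*dh/dx) at (i, j): the sign of
    the convection velocity -dh/dx decides which one-sided difference of
    alpha*s is used."""
    dh_dx_centered = (h[i + 1, j] - h[i - 1, j]) / (2.0 * dx)
    if h[i - 1, j] > h[i + 1, j]:        # flow towards +x: backward difference
        d_as = (alpha[i, j] * s[i, j] - alpha[i - 1, j] * s[i - 1, j]) / dx
    else:                                # flow towards -x: forward difference
        d_as = (alpha[i + 1, j] * s[i + 1, j] - alpha[i, j] * s[i, j]) / dx
    return d_as * dh_dx_centered
```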
3.2.3. Treatment of the boundary conditions Second-order accurate treatment of the
homogeneous Neumann condition ∂h/∂n = ∂s/∂n = 0 follows the standard approach
by using one layer of ghost boundary points. Attention is however needed for the
inhomogeneous influx conditions (3). In fact, we rewrite the two conditions in the
following equivalent form:
\[ \frac{\partial h}{\partial n} = -\frac{f_s}{\alpha} - \frac{f_m}{\beta}, \qquad s = \frac{\beta f_s}{\beta f_s + \alpha f_m}. \]
That is, h has an inhomogeneous Neumann boundary condition, and s takes a Dirichlet
boundary condition.
4. Case Study: Monterey Bay
Monterey Bay is our region of study because of the publicly available, high-resolution
bathymetric data and its interesting features: the largest undersea canyons of the North
American West Coast. Our region covers the area from south of San Francisco Bay in
the north, to Davidson Seamount in the south and west. To the east, the area takes in
the mouth of the Salinas River (Figure 1). To approximate the fluvial sediment load
of the Salinas River, we use an inflow boundary condition at the river mouth with an
average sediment flux of 1.8 ton/yr‡. This influx value is based on the average for the
years 1932 to 1999, and neglects the four most significant events during that period [3].
For our purposes, we use a rate of 20% sand (the coarse-grained fraction of the influx), a little higher
than the 10% value used in [3]. For marine sediment transport rates, we use an average
of approximately 3000 cm2/yr for sand-sized particles and 6000 cm2/yr for silt-sized
particles. Rather than setting only a depth dependence in the formula as in [13], we
instead increase or reduce the transport coefficient depending on both the depth and
the curvature. The dependence on curvature is designed to capture higher rates of
sediment transport typical for submarine canyons and channels in the region [9]. To
pick out the short-wavelength features of the canyons, we filtered the topography in the
Fourier domain, using a high-pass filter. The resulting α values are shown in the colour
overlay of Figure 1. Afterwards, the result was smoothed to improve numerical stability
and we arrived at transport coefficients along the Monterey Canyon of roughly 30,000
and 60,000 cm2/yr for α and β, respectively.
‡ Assuming average sediment density of 2.3 g/cm3.
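For readers who want to experiment with a similar coefficient map, the following NumPy sketch shows one possible way to high-pass filter a bathymetry grid in the Fourier domain and boost the transport coefficient where the short-wavelength relief is strong. The function, its parameters and the numerical values are illustrative assumptions only, not the exact procedure behind Figure 1.

```python
import numpy as np

def curvature_boosted_alpha(topo, base_alpha=3000.0, boost=10.0, cutoff=0.1):
    """Illustrative sketch: remove long wavelengths from the bathymetry in
    the Fourier domain, then scale alpha by the remaining short-wavelength
    relief (normalised to [0, 1]).  All parameter values are placeholders."""
    ny, nx = topo.shape
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    k = np.hypot(kx, ky)
    spec = np.fft.fft2(topo)
    spec[k < cutoff * k.max()] = 0.0          # high-pass filter
    relief = np.abs(np.fft.ifft2(spec))       # short-wavelength amplitude
    relief /= relief.max() + 1e-12
    return base_alpha * (1.0 + boost * relief)
```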
Figure 1. Monterey Bay and the surrounding region; the rectangular study region
is bounded by the coordinates from 121.80◦W to 123.16◦W and from 35.50◦N to
37.16◦N. Inside the domain, the subsea bathymetric data are obtained from the
NOAA Autochart web-facility based on 2128 survey lines of ship-gathered bathymetric
data [11]. The colour overlay shows the transport coefficient α for sand and in
particular highlights the canyons. Two of them are indicated; the Monterey Canyon,
which extends from the red star out to the oceanward boundary of the domain at the
bottom of the image; and the Sur Canyon. Subaerial topography (shown in green)
is derived from ETOPO1 [1]. Six east-west and three north-south profiles of the
computational domain are shown, with colours indicating the percentage of silts and
clays. For example, light brown indicates a high sand percentage but low silt or clay.
The mouth of the Salinas River is at the red-starred location. The small reference plot
shows the West Coast of California.
In the analysis of the model, we use results from a model run of 250,000 years, on
a uniform 850× 700 mesh. Figure 1 shows profiles of the model along lines of latitude
and longitude. Our attention is focused on profiles A to F, since these are parallel to
the inflow point (the red star) and the Monterey Canyon. Profiles A and B sit almost
parallel to the shelf with a number of canyons transporting sediment directly into low-
lying areas. The sediment is sandy along much of the profile, particularly in profile A,
with silts filling areas towards the edge of the domain. Profiles C and D show a strong
layering, from well-mixed lower layers to a sandy deposit in the upper layers. Profiles
E and F show significant well-mixed layers only in the rightmost lower layers where the
Monterey Canyon intersects the profiles, and high silt percentages in the middle layers
stretching oceanward. Using the total change in height through the model run, we have
also calculated the average erosion and deposition rates in Figure 2. We find deposition
rates in the Monterey Canyon in the range of 0.25 cm/yr to 0.5 cm/yr.
Figure 2. Average rate of sediment deposition (positive values in blue colours) or
erosion (negative values in red colours). The contours show the bathymetry in meters.

The well-mixed nature of the lower layers of profiles C and D in Figure 1 shows the
effectiveness of the canyons in transporting both sand and smaller-sized particles, at
least until the topography becomes relatively flat and the silts are transported further
than the sands, creating a distinct sand to silt transition close to the seaward edge
of the model. The deposition rates of Figure 2 for the Monterey Canyon region are
in good agreement with Farnsworth [3], who estimated sediment accumulation rates
in the canyon at 0.35 cm/yr at 3000m depth. Although we have neglected long-shore
currents, the effect is likely to be limited to the immediate coast as the topographic
highs surrounding the Monterey Canyon do not allow deep longshore currents. Finally,
although the model allows for higher transport rates in the canyons, these canyons
still fill with sediment and the gradients in h are therefore reduced, changing the
Monterey Bay region from canyon-dominated sediment transport to a prograding shelf-
type deposition, particularly noticeable in the upper sand layers of profiles B to F in Figure 1.
However, during the length of the model run, 250,000 years, the sea-level variation would
be a significant effect in renewing erosion of the present-day submarine bathymetry as
rivers cut into the landscape again.
5. Comparing the different methods
All the five temporal discretization schemes from Section 3.1 are capable of running
the 250,000-year simulation of the Monterey Bay case, using the rather coarse 850 × 700 mesh. Human eyes cannot detect any differences among the five simulations.
Nevertheless, the five schemes have different strengths and weaknesses, which will be
of great importance for high-resolution simulations. It is the purpose of this section
to take a closer look at their temporal accuracy, numerical stability and computational
speed.
5.1. Temporal accuracy
A very important property of any numerical method is its accuracy, in particular, how
fast the numerical error decreases with refining mesh spacing: ∆x, ∆y, ∆t. Our focus
now is on the relationship between ∆t and accuracy, because this is where the five
schemes differ. A good understanding of this issue can allow more accurate schemes to
use larger ∆t and thus save time.
For studying the temporal accuracy of a numerical method, we assume that its
numerical error has three leading terms:
\[ |u_{\mathrm{true}} - u_{\Delta}| \approx C_t \Delta t^{\gamma} + C_x \Delta x^{\nu} + C_y \Delta y^{\nu}, \qquad (5) \]
where u∆ denotes the numerical solution of h or s, while Ct, Cx and Cy are constants
independent of ∆t, ∆x, and ∆y. The constant γ gives the order of temporal accuracy,
and similarly ν is for the two spatial directions. Here, we have assumed that the
numerical method achieves the same order of accuracy in x- and y-directions.
Fixing a spatial discretization scheme and the values of ∆x and ∆y, we can
generate a series of numerical solutions with decreasing time step sizes: ∆t, ∆t/2,
∆t/4, and so on. If the difference between consecutive pairs of numerical solutions
decreases with ∆t, it indicates that the numerical method is convergent in the temporal
direction. (Consistency is needed in both temporal and spatial discretizations to ensure
convergence toward the true solution.) In particular, if the consecutive differences are
observed to reduce by a factor of $2^{\gamma}$, we can then establish that γ is the order of
temporal accuracy for the numerical method under study.

Table 1. Temporal error analysis for the five schemes at T = 100 years. The spatial
mesh is fixed as 850 × 700, and the value of ∆t is 0.02 yr.

                          Fully-explicit  Semi-implicit 1  Semi-implicit 2  Fully-implicit 1  Fully-implicit 2
‖h_∆t − h_∆t/2‖_L2          1.158e+02       1.260e+02        5.024e-02        1.277e+02         1.329e-02
‖h_∆t/2 − h_∆t/4‖_L2        5.790e+01       6.300e+01        1.251e-02        6.387e+01         3.321e-03
‖h_∆t/4 − h_∆t/8‖_L2        2.895e+01       3.150e+01        3.024e-03        3.193e+01         8.283e-04
‖h_∆t/8 − h_∆t/16‖_L2       1.448e+01       1.575e+01        7.961e-04        1.597e+01         2.076e-04
‖h_∆t/16 − h_∆t/32‖_L2      7.238e+00       7.876e+00        1.937e-04        7.984e+00         5.190e-05
‖s_∆t − s_∆t/2‖_L2          4.246e+00       9.922e-02        4.190e-03        9.678e-02         1.262e-03
‖s_∆t/2 − s_∆t/4‖_L2        2.119e+00       5.018e-02        1.036e-03        4.847e-02         3.152e-04
‖s_∆t/4 − s_∆t/8‖_L2        1.059e+00       2.527e-02        2.549e-04        2.425e-02         7.816e-05
‖s_∆t/8 − s_∆t/16‖_L2       5.292e-01       1.268e-02        6.501e-05        1.213e-02         1.974e-05
‖s_∆t/16 − s_∆t/32‖_L2      2.645e-01       6.351e-03        1.634e-05        6.067e-03         4.927e-06
Using the Monterey Bay case, we experimented with a series of six decreasing time
step sizes: ∆t = 0.02/2^k yr, 0 ≤ k ≤ 5, for a 100-year simulation. Table 1 thus shows, for
the five numerical schemes, the differences (in discrete L2-norm) between consecutive
pairs of h or s solutions. First-order temporal accuracy can clearly be observed for
the fully-explicit scheme and the backward-Euler versions of both the semi- and fully-implicit
schemes. The Crank-Nicolson versions of the semi- and fully-implicit schemes are second-order
accurate in time. In addition, Table 1 also reveals the actual magnitude of the numerical
errors. Semi-implicit scheme 1 and fully-implicit scheme 1 have roughly the same level
of accuracy. The fully-explicit scheme has relatively poor accuracy for the s solutions.
The two second-order schemes are considerably more accurate than all three first-order
schemes. Between the two second-order schemes, the fully-implicit version is about four
times more accurate.
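The observed order can be read off such a table directly. A minimal Python sketch, applied to the h-differences of semi-implicit scheme 2 from Table 1:

```python
import numpy as np

def observed_order(diffs):
    """gamma_k = log2(d_k / d_{k+1}) for consecutive solution differences d_k."""
    d = np.asarray(diffs, dtype=float)
    return np.log2(d[:-1] / d[1:])

print(observed_order([5.024e-2, 1.251e-2, 3.024e-3, 7.961e-4, 1.937e-4]))
# approximately [2.01, 2.05, 1.93, 2.04], i.e. second-order accuracy in time
```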
5.2. Stability
Numerical stability is another important property of any scheme. Here, we consider
a numerical solution as unstable if $s^{\ell}_{i,j}$ falls outside the physically valid range [0, 1]
at any mesh point. Typically, stability depends on the mesh spacings ∆x, ∆y, and ∆t. In the
absence of a theoretical analysis of the stability condition, we can also resort to numerical
experiments.
More specifically, we used a 100-year simulation of the Monterey Bay case on four
spatial mesh resolutions. For each spatial resolution, we adopted a binary search to
find the largest admissible ∆t, which maintained $0 \le s^{\ell}_{i,j} \le 1$ throughout the entire
simulation. As shown in Table 2, the fully-explicit scheme is the least stable, while
semi-implicit scheme 1 is the most stable. We can also see that smaller values of ∆x
and ∆y require smaller ∆t. Although the actual values of admissible ∆t in Table 2
cannot be blindly used in other simulations, the table demonstrates the comparative
stability among the five schemes.

Table 2. The largest admissible ∆t values (in years) for a 100-year simulation of the
Monterey Bay case.

Mesh size          850 × 700   1700 × 1400   3400 × 2800   6800 × 5600
Fully-explicit        0.22         0.17          0.04          0.01
Semi-implicit 1      77.19        49.29         24.76         14.15
Semi-implicit 2       2.02         1.43          0.70          0.37
Fully-implicit 1      7.24         3.97          2.76          1.07
Fully-implicit 2      2.00         1.43          0.60          0.37
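The bisection itself is straightforward; a hedged Python sketch is given below, where run_is_stable is a placeholder for a routine that runs the full simulation with a given ∆t and reports whether s stayed within [0, 1].

```python
def largest_stable_dt(run_is_stable, dt_lo, dt_hi, rel_tol=0.01, max_iter=50):
    """Bisection for the largest admissible dt.  run_is_stable(dt) must
    return True if 0 <= s <= 1 held at every mesh point and time level."""
    assert run_is_stable(dt_lo), "lower bracket must be stable"
    for _ in range(max_iter):
        if dt_hi - dt_lo < rel_tol * dt_lo:
            break
        dt_mid = 0.5 * (dt_lo + dt_hi)
        if run_is_stable(dt_mid):
            dt_lo = dt_mid        # stable: move the lower bracket up
        else:
            dt_hi = dt_mid        # unstable: move the upper bracket down
    return dt_lo
```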
5.3. Computational speed
An understanding of the temporal accuracy and numerical stability is an important
first step toward choosing a best-performing numerical scheme. Once the spatial mesh
resolution is prescribed via ∆x and ∆y, each scheme can in principle estimate its
matching time step size ∆t, so that temporal errors and spatial errors are balanced.
(This issue will be discussed in Section 6.) If the estimated ∆t violates the stability
requirement, the value of ∆t has to be decreased accordingly. Now the following question
arises: Which scheme can finish the simulation most quickly, using its largest admissible
∆t value that satisfies both accuracy and stability?
We therefore turn our focus to predicting the computational time for each numerical
scheme, provided that the numbers of spatial mesh points and time steps are known.
This prediction relies on two types of information: (1) the number of floating-point
operations and volume of data traffic, (2) the main hardware features of an intended
computer.
5.3.1. Work load and data traffic What we are interested in is more than just the
conventional O(N) algorithmic complexity model, which is too crude for predicting
the actual time usage on real-world hardware. Instead, we count the actual numbers
of numerical operations and data read/write operations invoked. The latter factor is
particularly important for understanding the performance of a software implementation.
Let us recall that the spatial domain is gridded into a 2D uniform mesh. Except
for the boundary points, which only count for a very small percentage, the volume of
computation work and data traffic is the same for every mesh point. Therefore, we will
in the following only investigate the point-wise work load and data traffic.
Noting that all the five numerical schemes carry out several actions per time step,
we count for each action the work load and data traffic per mesh point. The reason for
action-wise counts is that the speed of some actions may be limited by the floating-
point operations, while other actions may be constrained by the data traffic, either
between the registers and L1 cache or between the main memory and the entire cache
hierarchy. Let us consider semi-implicit scheme 1 as an example. It has four actions per
time step, namely, a linear system is first set up and then solved for computing $h^{\ell+1}$,
followed by two similar actions for computing $s^{\ell+1}$. While setting up the h linear system
is likely determined by a computer’s floating-point capability, as suggested by Table 3,
the speed of the three other actions likely depends on the data movement capability
within the entire memory-cache hierarchy.
Table 3. Counts of floating-point operations (FLOP), data loads (LD) and stores (ST)
between L1 cache and registers, sum of loads and stores (MEM) that touch the main
memory. All the counts are per time step and per mesh point. The counts associated
with linear system solves are for one CG or GMRES iteration.
Scheme             Action               FLOP    LD    ST   MEM
Fully-explicit     Compute h_{i,j}        57    43     1     5
                   Compute s_{i,j}        37    24     1     5
Semi-implicit 1    Set up h system        62    21     8    10
                   Solve h system         15    52     9    21
                   Set up s system        35    35    10    10
                   Solve s system         15    68    14    21
Semi-implicit 2    Set up h system       262   158    24    20
                   Solve h system         15    52     9    21
                   Set up s system       117   125    29    20
                   Solve s system         15    68    14    21
Fully-implicit 1   Set up h-s system     150   150    92    27
                   Solve h-s system       46   264    52    62
Fully-implicit 2   Set up h-s system     225   223   138    28
                   Solve h-s system       46   264    52    62
Counting the number of floating-point operations and volume of data traffic can be
done in two ways. The first approach is to manually accumulate this info by reading
through the computer program line by line. Such a manual count is doable but can
often be cumbersome and even inaccurate. The inaccuracy arises if a compiler, in
order to optimize performance, re-orders computations and introduces new intermediate
variables. For this particular reason, we adopt in this paper another counting approach.
More specifically, the PAPI tool§ is used to profile a compiled code. The actual numbers
of floating-point operations and data transfers between a CPU’s registers and its L1
cache are namely recordable by the CPU’s hardware performance counters.
§ PAPI: http://icl.cs.utk.edu/papi/.
Table 3 reports the numbers of floating-point and data load/store operations,
needed by the different actions of the five numerical schemes. PAPI-v4.1.4 was used on
an Intel Xeon E5504 processor. All the computer programs were compiled by the GNU
C++ compiler (version 4.4.3) with -O3 optimization. The “FLOP” column of Table 3
reports PAPI’s PAPI FP INS event, the “LD” column is associated with PAPI L1 DCR,
while PAPI L1 DCW gives the “ST” column. In addition, column “MEM” contains
manual counts or estimates of the minimum required volume of bi-directional data traffic
(loads+stores) that touches the main memory. This is because PAPI unfortunately does
not collect this information. We have always assumed that the CPU’s cache hierarchy
is not large enough to hold the entire data structure.
5.3.2. Relating computational speed to hardware To predict the computing time of a
numerical method, it is not enough to only know the numbers of involved floating-
point and data load/store operations, as reported in Table 3. First of all, a computer
program often involves other operations than those reported in the table. An example is
the preparation work of an iterative linear system solver before starting the iterations.
Secondly, the time usage associated with solving linear systems must be estimated
together with the number of iterations needed. Thirdly, for the two fully-implicit
schemes, Newton-Raphson iterations work as an outer loop, where a new linear system is
set up and solved during each Newton-Raphson iteration. An estimate of the number of
Newton-Raphson iterations is thus needed. Fourthly, and most importantly, prediction
of time usage has to consider the hardware capabilities of an intended computer, also
depending on whether the computer is run in serial or parallel mode.
Here, we want to make an attempt at predicting a lower bound of time usage by
the numerical schemes, based on information from Table 3 and the hardware (peak)
capabilities of a multicore-based parallel computer. Our assumptions are as follows:
(i) Due to hardware technologies such as pipelining of operations and prefetching data
into caches, modern CPUs are able to avoid, to a great extent, stall of the data
and/or instruction streams.
(ii) We only focus on three sources of performance limitation: (1) CPU’s clock rate,
(2) data transfer bandwidth between registers and the L1 cache, (3) data transfer
bandwidth between the last-level cache and main memory.
(iii) A lower bound of time usage is thus the maximum value among (1) time needed
by the CPU core(s) to execute the floating-point operations, (2) time needed by
the L1 cache(s) to load data into the registers, (3) time needed by the registers to
store data back to L1, and (4) time needed by the main memory to execute its data
loads and stores.
It should be remarked that the above assumptions are motivated by simplicity. Ideally,
the cache miss rates at different levels and the volumes of data traffic within the cache
hierarchy should be considered. However, accurate counts of the cache misses and
volumes of inter-cache data traffic are in general extremely difficult to quantify. These
are therefore not included in our modeling philosophy, which is easily put into practice
while still predicting a useful lower bound of the computing time.
As hardware capabilities of a multicore CPU, the following parameters are assumed
known:
• The peak capability of a single CPU core to execute floating-point operations is
denoted by F—max number of floating-point operations per second.
• The bandwidths between a CPU core's private L1 cache and its registers are denoted by
$B^{r}_{L1}$ (the number of bytes readable from L1 per second) and $B^{w}_{L1}$ (the number of bytes
writable to L1 per second). Two bandwidths are distinguished because a dedicated
channel is assumed for the data loads, while another is dedicated for the data stores.
• The bandwidth of the main memory is represented by $B_{M}$ (the number of bytes
transferred per second). Here we assume that load and store operations share the
same channel(s), which is also shared among multiple cores.
5.3.3. Simple models for predicting computational speed We denote by $n_{\mathrm{FLOP}}$ the
number of floating-point operations, $n_{\mathrm{load}}$ the number of bytes loaded from L1, and
$n_{\mathrm{store}}$ the number of bytes stored to L1. Similarly, $n_{\mathrm{Mload}}$ denotes the number of bytes
loaded from the main memory, while $n_{\mathrm{Mstore}}$ is for the stores.
Serial computing time When only a single CPU core is used, the lower bound of serial
computing time is described by the following simple formula:
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{F},\; \frac{n_{\mathrm{load}}}{B^{r}_{L1}},\; \frac{n_{\mathrm{store}}}{B^{w}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B_{M}} \right). \qquad (6) \]
Since $n_{\mathrm{store}}$ can safely be assumed to be smaller than $n_{\mathrm{load}}$, while we typically have
$B^{r}_{L1} = B^{w}_{L1}$, the above formula can be further simplified as
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{F},\; \frac{n_{\mathrm{load}}}{B^{r}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B_{M}} \right). \qquad (7) \]
Parallel computing time In a typical multicore architecture, the L1 cache is private to
each CPU core, so the aggregate effect of employing multiple $B^{r}_{L1}$ and $B^{w}_{L1}$ channels scales linearly
with the number of CPU cores in use. Using multiple CPU cores also means a linear
expansion of the floating-point capability $F$. On the other hand, the aggregate value of
the main memory bandwidth $B_{M}$ depends on the actual memory hierarchy, often not
scaling linearly with the number of CPU cores. If $p$ denotes the number of CPU
cores used, we let $B^{p}_{M}$ denote the aggregate main memory bandwidth. Now, the
lower bound of parallel computing time can be found as
\[ \max\left( \frac{n_{\mathrm{FLOP}}}{pF},\; \frac{n_{\mathrm{load}}}{p\,B^{r}_{L1}},\; \frac{n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}}{B^{p}_{M}} \right). \qquad (8) \]
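In code, models (7) and (8) reduce to a single maximum over three ratios. A minimal sketch (all argument names are ours) follows.

```python
def predicted_time_lower_bound(n_flop, n_load_bytes, n_mem_bytes,
                               F, B_L1_read, B_M_aggregate, p=1):
    """Lower-bound wall time from model (7) (p=1) or model (8) (p>1).
    F and B_L1_read are per-core capabilities; B_M_aggregate is the main
    memory bandwidth available to the p cores together."""
    return max(n_flop / (p * F),
               n_load_bytes / (p * B_L1_read),
               n_mem_bytes / B_M_aggregate)
```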
Some comments are in order here. First, both the prediction models (7) and (8)
are based on a set of simplifications, making them easily applicable, but also with
the possibility of gravely under-estimating the actual computing time. Second, neither
model considers the impact of inter-cache data traffic, i.e., L1↔L2 and L2↔L3. One
reason is that this traffic is often not the bottleneck. Another reason is that
estimating the actual volumes of inter-cache data traffic would make the prediction models
unbearably complex. Third, the overhead of synchronization and data communication
between cores/sockets/nodes is ignored for simplicity.

Table 4. Total numbers of time steps, Newton-Raphson iterations, CG iterations and
GMRES iterations used by the five schemes for the example of using a Nehalem-EP.

Scheme             #time   #Newton   #CG    #GMRES
Fully-explicit       100     N/A     N/A      N/A
Semi-implicit 1      100     N/A     335      234
Semi-implicit 2      100     N/A     612      485
Fully-implicit 1     100     200     N/A     1676
Fully-implicit 2     100     200     N/A     1375
5.3.4. The example of using a Nehalem-EP To check the quality of our prediction
models (7) and (8), we ran all five numerical schemes for 100 time steps on a
1700 × 1400 mesh. The hardware used is a Nehalem-EP that consists of two sockets, each
being a quad-core Xeon 2.0 GHz E5504 processor. The values of $F = 4$ Gflops/s (no code
vectorization) and $B^{r}_{L1} = 16$ GB/s are deduced from Intel's hardware specification, while
the values of $B^{p}_{M}$ are taken from the STREAM‖ benchmark's "copy" rates, measured on
this particular computer. More specifically, we have $B^{1}_{M} = 6.71$ GB/s, $B^{2}_{M} = 13.18$ GB/s,
$B^{4}_{M} = 16.90$ GB/s, and $B^{8}_{M} = 17.11$ GB/s.
Using the information given in Tables 3 and 4, we can calculate the values of $n_{\mathrm{FLOP}}$,
$n_{\mathrm{load}}$, and $n_{\mathrm{Mload}} + n_{\mathrm{Mstore}}$, which are needed in the prediction models (7) and (8). Table 5
compares the predicted time usages TP against the actual time usages TA. We remark
that the high-quality Trilinos software package [6] was used to implement all the linear
system solvers.
It can be seen from Table 5 that our simple prediction models (7) and (8)
consistently under-estimate the time usage. This is an expected behavior because the
models are meant to give a lower bound. In general, the prediction accuracy is slightly
better for the fully-explicit scheme, while roughly the same for the four non-explicit
schemes. This means that the predicted TP value is helpful in practice, because the
comparative speed difference between the five schemes is correctly anticipated.
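As an illustration of where these predictions come from (our own back-of-the-envelope reading of model (7), assuming 8-byte double-precision values), consider the fully-explicit scheme on one core: the 1700 × 1400 mesh and 100 time steps give 2.38 × 10^8 point-steps, and Table 3 lists 57 + 37 = 94 floating-point operations, 43 + 24 = 67 L1 loads and 5 + 5 = 10 main-memory accesses per point per step. This yields
\[ \max\left( \frac{94 \cdot 2.38\times 10^{8}}{4\times 10^{9}},\; \frac{67 \cdot 8 \cdot 2.38\times 10^{8}}{16\times 10^{9}},\; \frac{10 \cdot 8 \cdot 2.38\times 10^{8}}{6.71\times 10^{9}} \right) \approx \max(5.6,\; 8.0,\; 2.8)\ \mathrm{s}, \]
i.e., the L1-to-register load traffic is the predicted bottleneck, reproducing the 7.97 s entry in Table 5.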
‖ STREAM: Sustainable Memory Bandwidth in High Performance Computers,
http://www.cs.virginia.edu/stream/.
Table 5. Comparing the actual time TA with predicted time TP (in seconds) on a Nehalem-EP.

Scheme             Time    1 core   2 cores   4 cores   8 cores
Fully-explicit      TA      13.04      6.73      3.56      1.43
                    TP       7.97      3.99      1.99      1.20
Semi-implicit 1     TA      82.30     44.44     29.05     25.17
                    TP      47.52     23.76     15.72     15.52
Semi-implicit 2     TA     178.68     96.92     61.38     50.12
                    TP     110.79     55.40     34.37     30.21
Fully-implicit 1    TA     625.75    322.20    213.25    176.19
                    TP     562.23    281.12    140.56    121.64
Fully-implicit 2    TA     537.02    286.37    183.83    145.15
                    TP     485.04    242.52    121.26    101.50
6. Putting everything together
So far, the reader should have realized that there are many factors that affect the
computing time of a particular numerical scheme:
(i) The spatial problem size in form of the number of mesh points.
(ii) The number of floating-point operations needed per mesh point and per time step
(and per linear solver iteration).
(iii) The volumes of data traffic, which touch the L1 cache and main memory, per mesh
point and per time step (and per linear solver iteration).
(iv) The hardware capabilities of a multicore-based parallel computer, in form of $F$,
$B^{r}_{L1}$, $B^{w}_{L1}$, and $B^{p}_{M}$.
(v) The total number of time steps needed.
Factor 1 is often prescribed a priori, the second and third factors are static
properties of a numerical scheme, while the fourth factor regarding the hardware is easily
obtainable. The last factor thus deserves our attention, because different numerical
schemes may require very different values of ∆t to achieve the same level of accuracy.
Moreover, numerical stability will impose an additional requirement on ∆t. Predicting
the actual time usage therefore relies on a good estimate of the largest admissible ∆t.
This requires a quantification of the numerical errors as described below.
6.1. Quantifying the error model
Let us recall the model of numerical errors (5) from Section 5.1. There, we have
assumed that the numerical errors have two independent contributions: $C_t \Delta t^{\gamma}$ and
$C_x \Delta x^{\nu} + C_y \Delta y^{\nu}$. In order to find the constant values $C_t$, $\gamma$, $C_x$, $C_y$, and $\nu$, numerical
experiments are needed. Table 1 from Section 5.1 gives an example of how to determine
the values of Ct and γ, which depend on the temporal discretization chosen, and which
also differ for the h and s equations. To determine the values of Cx, Cy, and ν, another
set of numerical experiments is needed. This time, the value of ∆t is fixed, while a series
of different ∆x and ∆y values are tried. We remark that all such numerical experiments
can use a short simulation time length T and relatively coarse mesh spacings, to be
able to quickly establish (5). It should be remarked that the values of $C_t$, $C_x$ and $C_y$
are typically functions of T. Nevertheless, our hope is that the ratio between the three
constants remains the same, so that we can compare the magnitudes of error between
time and space.
6.2. Finding the largest admissible ∆t
For real-world sediment transport simulations, it is not unusual that the spatial mesh
spacing (∆x, ∆y) is prescribed as the starting point. This can come from earlier
experiences and/or considerations for the capacities of a target computer.
For each temporal discretization scheme, once ∆x and ∆y are given, we can use
the established error model (5) to estimate the largest value of ∆t, such that
\[ C_t \Delta t^{\gamma} \le C_x \Delta x^{\nu} + C_y \Delta y^{\nu} \]
holds for both h and s. Then, the already established information about numerical
stability, in form of Table 2, is extrapolated to check whether the estimated ∆t above
satisfies the stability requirement. If not, ∆t is decreased to ensure stability.
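Rearranging the balance condition gives the candidate ∆t directly; a one-line Python sketch (names ours):

```python
def balanced_dt(Ct, gamma, Cx, Cy, nu, dx, dy):
    """Largest dt for which the temporal error term of model (5) does not
    exceed the spatial error terms."""
    return ((Cx * dx**nu + Cy * dy**nu) / Ct) ** (1.0 / gamma)
```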
6.3. Predicting time usage
So far, we have found for each numerical scheme its largest admissible ∆t, such that
the numerical error contributed by the temporal discretization is guaranteed to not
exceed that of the spatial discretization. The stability condition is satisfied as well.
What remains is to predict the time usage for each numerical scheme. To this end, we
also need to estimate the iteration numbers of Newton-Raphson and/or linear solver(s)
for the non-explicit schemes. These are typically estimated by extrapolating known
iteration counts.
Finally, after obtaining the hardware capability parameters $F$, $B^{r}_{L1}$, $B^{w}_{L1}$, and $B^{p}_{M}$,
we are ready to apply the prediction models (7) and (8).
6.4. A large-scale example
To synthesize a realistic scenario, we used the case of Monterey Bay again. This time, we
started by prescribing ∆x = ∆y = 20 m, which gave a 9206 × 6108 spatial mesh. Then,
the largest admissible ∆t value was determined for the fully-explicit scheme and the
two semi-implicit schemes. The two fully-implicit schemes were not considered, because
we knew from before that fully-implicit scheme 1 has no advantage over semi-implicit
scheme 1, while fully-implicit scheme 2 is much slower than semi-implicit scheme 2.
Table 6. Comparing the actual time usage TA with predicted time usage TP (in seconds) on
Tianhe-1A, for a 100-year simulation on a 9206 × 6108 spatial mesh. The rows of $F^{p}_{A}$
report the achieved Gflops/s rates by using p CPU cores.

Scheme             p         240       480       960      1920
Fully-explicit     TA      150.76     78.13     36.57     17.92
                   TP       72.19     35.23     17.41      8.65
                   F^p_A   701.20   1353.04   2890.70   5899.16
Semi-implicit 1    TA      273.23    142.52     66.13     32.29
                   TP      178.79     89.40     44.70     22.35
                   F^p_A    41.80     80.14    172.71    353.72
Semi-implicit 2    TA      648.98    356.87    137.54     78.82
                   TP      429.58    214.79    107.40     53.70
                   F^p_A    50.99     92.72    240.58    419.81
Balancing the temporal and spatial errors, the fully-explicit scheme chose ∆t = 0.81
yr, and semi-implicit scheme 1 chose ∆t = 0.74 yr, whereas the second-order semi-
implicit scheme 2 chose ∆t = 22.9 yr. However, both the fully-explicit scheme and
semi-implicit scheme 2 had to decrease their choices of ∆t for the sake of numerical
stability. Finally, while keeping a small safety margin, we decided to use ∆t = 0.005 yr
for the fully-explicit scheme, ∆t = 0.5 yr for semi-implicit scheme 1, and ∆t = 0.25 yr
for semi-implicit scheme 2.
For a 100-year simulation of Monterey Bay on the 9206 × 6108 mesh, we estimated
that the total numbers of floating-point operations needed would be $106 \times 10^{12}$, $11 \times 10^{12}$,
and $33 \times 10^{12}$ for the fully-explicit scheme, semi-implicit scheme 1, and semi-implicit
scheme 2, respectively.
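As a back-of-the-envelope consistency check (our own), the fully-explicit figure follows directly from Table 3: 9206 × 6108 ≈ 5.6 × 10^7 mesh points, 100/0.005 = 20 000 time steps, and 57 + 37 = 94 floating-point operations per point per step give roughly 5.6 × 10^7 × 2 × 10^4 × 94 ≈ 1.06 × 10^14 ≈ 106 × 10^12 operations.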
As a large-scale hardware testbed, we used Tianhe-1A Hunan Solution¶—the
world’s No. 28 supercomputer, according to the Top500 list published in June 2012. Each
compute node of this supercomputer has two six-core Xeon X5670 CPUs and one Nvidia
Tesla M2050 GPU. Since there are no GPU implementations for the two semi-implicit
schemes, only the CPU part of the supercomputer was used for our time measurements.
The hardware parameters needed for the prediction model (8) are $F = 5.86$ Gflops/s,
$B^{r}_{L1} = B^{w}_{L1} = 23.44$ GB/s, and $B^{12}_{M} = 32.86$ GB/s (i.e., when all the twelve cores per
compute node are in use). The compiler used was icc of version 11.0 using the -O3
optimization flag.
Table 6 lists the actual time usages TA and the achieved Gflops/s rates $F^{p}_{A}$, which
were measured on Tianhe-1A Hunan Solution. The predicted time usages TP are
also listed for comparison. Despite the fact that the fully-explicit scheme used the
most floating-point operations, its actual time usage was the lowest among the three
candidates. This was correctly anticipated by the prediction model (8). All three parallel
implementations scaled nicely between 240 and 1920 CPU cores. The highest $F^{p}_{A}$ rate
¶ http://i.top500.org/system/177448
of 5899.16 Gflops/s was, not surprisingly, achieved by the fully-explicit scheme when
using 1920 cores.
7. Concluding remarks
It is not trivial to achieve the best possible computing speed, while maintaining
a desirable level of accuracy and avoiding numerical instability. It becomes more
complicated when there exists a collection of candidate numerical schemes. This paper
has outlined a systematic methodology, which involves two main
tasks. First, small-scale and short-time-length experiments can be used to establish the
error model (5) and the numerical stability requirements in form of Table 2. Such
information helps in choosing the largest admissible ∆t value when the spatial mesh spacing
is given. Second, the prediction models (7) and (8) can rank the candidate numerical
schemes with respect to the overall computing time. The two performance prediction
models are easy to use, because the needed hardware parameters are readily obtainable
for any computing system. Moreover, the static properties of a particular numerical
scheme, in form of Table 3, can be established by using, e.g., profiling tools such as PAPI.
More importantly, this methodology should be applicable to many other numerical
simulations.
The measurements presented in this paper may give the impression that the fully-
explicit scheme is always the winner with respect to the overall computing time. Such a
conclusion would be wrong, because the balanced relationship between ∆t and ∆x, ∆y will change
from case to case. It may even happen that the ranking of the schemes changes on a
different hardware platform. Therefore, the prediction models (7) and (8) are helpful
when planning really challenging and huge-scale simulations of marine sedimentary basin
filling.
One particular reason for the inferior computing speed of the two semi-implicit
schemes, in comparison with their fully-explicit counterpart, is the relatively large
numbers of CG or GMRES iterations needed to solve the linear systems per time
step. So far, we have not applied any preconditioner to the linear solvers. It remains
to be seen whether suitable preconditioners can sufficiently decrease the number of
CG/GMRES iterations, so that the overall time usage is reduced despite the extra
computing effort incurred by the preconditioners. On the other hand, the fully-explicit
scheme is, relatively speaking, better suited to GPU platforms, because this scheme is easily
implemented and is the least sensitive to data traffic bandwidths.
Acknowledgments
We thank the National Supercomputing Center in Changsha for the access to the
Tianhe-1A Hunan Solution supercomputer. Dr. Nan Wu at the National University
of Defense Technology is acknowledged for his assistance with using the supercomputer.
Computing facilities from the Norwegian Metacenter for Computational Science
(NOTUR) were used to carry out some of the numerical experiments of this paper.
References
[1] C. Amante and B. W. Eakins. ETOPO1 1 arc-minute global relief model: Procedures, data
sources and analysis. Technical report, National Oceanic and Atmospheric Administration,
2009. NOAA Technical Memorandum, NESDIS NGDC-24.
[2] S. R. Clark, W. Wei, and X. Cai. Numerical analysis of a dual-sediment transport model applied
to Lake Okeechobee, Florida. In Proceedings of the 9th International Symposium on Parallel
and Distributed Computing, pages 189–194. IEEE Computer Society Press, 2010.
[3] K. L. Farnsworth. Monterey Canyon as a conduit for sediment to the deep ocean. Technical
report, Virginia Institute of Marine Science, 2000.
[4] D. Granjeon. Deterministic stratigraphic modeling; conception and applications of a
multilithological 3D diffusive model. Mem. Geosci. Rennes, 78, 1997.
[5] D. Granjeon and P. Joseph. Concepts and applications of a 3-D multiple lithology, diffusive
model in stratigraphic modeling. Numerical Experiments in Stratigraphy: Recent Advances
in Stratigraphic and Sedimentologic Computer Simulations, SEPM Special Publication No. 62,
pages 197–210, 1999.
[6] Michael Heroux, Roscoe Bartlett, Vicki Howle, Robert Hoekstra, Jonathan Hu, Tamara Kolda,
Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, Andrew Salinger, Heidi
Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. An Overview of Trilinos.
Technical Report SAND2003-2927, Sandia National Laboratories, 2003.
[7] Eric W.H. Hutton and James P.M. Syvitski. Sedflux 2.0: An advanced process-response model
that generates three-dimensional stratigraphy. Computers & Geosciences, 34(10):1319–1337,
2008.
[8] T. E. Jordan and P. B. Flemings. Large-Scale stratigraphic architecture, eustatic variation, and
unsteady tectonism: A theoretical evaluation. Journal of Geophysical Research, 96(B4):6681–
6699, 1991.
[9] I. Klaucke, D. G. Masson, N. H. Kenyon, and J. V. Gardner. Sedimentary processes of the lower
Monterey Fan channel and channel-mouth lobe. Marine Geology, 206:181–194, 2004.
[10] F. Li, C. Dyt, and C. Griffiths. 3D modelling of the isostatic flexural deformation. Computers and
Geosciences, 30:1105–1115, 2004.
[11] NOAA. Autochart bathymetric map production, 2012. http://www.ngdc.noaa.gov/autochart/.
National Oceanic and Atmospheric Administration. Accessed: October, 2011.
[12] Chris Paola. Quantitative models of sedimentary basin filling. Sedimentology, 47:121–178, 2000.
[13] Jan C. Rivenæs. Application of a dual-lithology, depth-dependent diffusion equation in
stratigraphic simulation. Basin Research, 4(2):133–146, 1992.
[14] Jan C. Rivenæs. A computer simulation model for siliclastic basin stratigraphy. PhD thesis,
University of Trondheim, 1993.
[15] James P.M. Syvitski and Eric W.H. Hutton. 2D SEDFLUX 1.0C: An advanced process-response
numerical model for the fill of marine sedimentary basins. Computers & Geosciences, 27(6):731–
753, 2001.
[16] D. M. Tetzlaff and J. W. Harbaugh. Simulating Clastic Sedimentation. Van Nostrand Reinhold,
New York, 1989.