bayesian analysis and decisions in nuclear power plant ...popova/popova_morton_damien_hanson.pdf ·...

The Handbook of Applied Bayesian Analysis, Eds: Tony O’Hagan & Mike West, Oxford UniversityPress

Bayesian analysis and decisions in nuclear power plant

maintenance

Elmira Popova, David Morton, Paul Damien

The University of Texas at Austin

Austin, TX 78712, USA

Tim Hanson

University of Minnesota

Minneapolis, MN 55455, USA

1 Introduction

It is somewhat true that in most mainstream statistical literature the transition from in-

ference to a formal decision model is seldom explicitly considered. Since the Markov chain

Monte Carlo (MCMC) revolution in Bayesian statistics, focus has generally been on the

development of novel algorithmic methods to enable comprehensive inference in a variety of

applications, or to tackle realistic problems which naturally fit into the Bayesian paradigm.

In this chapter, the primary focus is on the formal decision or optimization model. The

statistical input needed to solve the decision problem is tackled via Bayesian parametric

and semi-parametric models. The optimization/Bayesian models are then applied to solving

an important problem in a nuclear power plant system at the South Texas Project (STP)

Electric Generation Station.

STP is one of the newest and largest nuclear power plants in the US, and is an industry

leader in safety, reliability and efficiency. STP has two nuclear reactors that together can

produce 2,500 megawatts of electric power. The reactors went online in August 1988 and

June 1989, and are the sixth and fourth youngest, respectively, of more than 100 such

reactors operating nationwide. STP consistently leads all US nuclear plants in the amount

1

of electricity its reactors produce.

The STP Nuclear Operating Company (STPNOC) manages the plant for its owners,

who share its energy in proportion to their ownership interests, which as of July 2004 are:

Austin Energy, The City of Austin, 16%, City Public Service of San Antonio, 40%, and

Texas Genco LP, 44%. All decisions that the board of directors make are of finite time

since every nuclear reactor is given a license to operate. In the case of STP, 25 and 27 years

remain on the licenses for the two reactors, respectively.

Equipment used to support production in long-lived (more than a few years) installations

such as those at STP requires maintenance. While maintenance is being carried out, the

associated equipment is typically out of service and the system operator may be required to

reduce or completely curtail production. In general, maintenance costs include labor, parts

and overhead. In some cases, there are safety concerns associated with certain types of plant

disruptions. Overhead costs include hazardous environment monitoring and mitigation,

disposal fees, license fees, indirect costs associated with production loss such as wasted feed

stock, and so forth.

STPNOC works actively with the Electric Power Research Institute (EPRI) to develop

robust methods to plan maintenance activities and prioritize maintenance options. Ex-

isting nuclear industry guidelines, Bridges and Worledge (August 2002); Gann and Wells

(2004); Hudson and Richards (2004); INPO (2002), recommend estimating reliability and

safety performance based on evaluating changes taken one at a time, using risk importance

measures supplemented by heuristics to prioritize maintenance. In our view, the nuclear

industry can do better. For example, Liming et al. (2003) propose instead investigating

“packages” of changes, and in their study of a typical set of changes at STPNOC, the pro-

jected plant reliability and nuclear safety estimates were found to be significantly different

compared to changes evaluated one at a time. STPNOC is working with EPRI to improve

its preventive maintenance reliability database with more accurate probability models to

aid in better quantification of preventive maintenance.

Here, we consider a single-item maintenance problem. That said, this chapter’s model

forms the basis for higher fidelity multi-item maintenance models that we are currently

2

investigating. Furthermore, as we describe below, even though the model is mathematically

a single-item model, in practice it could easily be applied to multiple items within an

equipment class.

Equipment can undergo either preventive maintenance (PM) or corrective maintenance

(CM). PM can include condition-based, age-based, or calendar-based equipment replace-

ment or major overhaul. Also included in some strategies is equipment redesign. In all

these categories of PM, the equipment is assumed to be replaced or brought back to the as-

purchased condition. CM is performed when equipment has failed unexpectedly in service

at a more-or-less random time (i.e., the out-of-service time is not the operator’s choice as

in PM).

Because such systems are generally required to meet production-run or calendar-based

production goals, as well as safety goals, installation operators want assurance that the plant

equipment will support these goals. On the other hand, the operator is usually constrained

by a limited maintenance budget. As a consequence, operators are faced with the important

problem of balancing maintenance costs against production and safety goals.

Thus, the primary decision problem with which we are confronted is interlinked, and

could be stated in the form of a question: “How should one minimize the total cost of

maintaining a nuclear power plant while ensuring that its reliability meets the standards

defined by the United States Department of Energy?”

There is an enormous literature on optimal maintenance policies for a single item that

dates back to the early 1950s. The majority of the work covers maintenance optimization

over an infinite horizon, see Valdez-Florez and Feldman (1989) for an extensive review.

The problem that we address in this chapter is over a finite planning horizon, which comes

from the fact that every nuclear power plant has a license to operate that expires in a

finite predefined time. In addition the form of the policy is effectively predefined by the

industry as a combination of preventive and corrective maintenance, as we describe below.

Marquez and Heguedas (2002) present an excellent review of the more recent research on

maintenance policies and solve the problem of periodic replacement in the context of a semi-

Markov decision processes methodology. Su and Chang (2000) find the periodic maintenance

3

policies that minimize the life cycle cost over a predefined finite horizon.

A review of the Bayesian approaches to maintenance intervention is presented in Wilson

and Popova (1998). Chen and Popova (2000) propose two types of Bayesian policies that

learn from the failure history and adapt the next maintenance point accordingly. They

find that the optimal time to observe the system depends on the underlying failure dis-

tribution. A combination of Monte Carlo simulation and optimization methodologies is

used to obtain the problem’s solution. In Popova (2004), the optimal structure of Bayesian

group-replacement policies for a parallel system of n items with exponential failure times

and random failure parameter is presented. The paper shows that it is optimal to observe

the system only at failure times. For the case of two items operating in parallel the exact

form of the optimal policy is derived.

With respect to the form of the maintenance policy that we consider, commercial nuclear

power industry practice establishes a set PM schedule for many major equipment classes.

In an effort to avoid mistakenly removing from service sets of equipment that would cause

production loss or safety concerns, effectively all equipment that can be safely maintained

together are grouped and assigned a base calendar frequency. The grouping policy ensures

PM (as well as most CM) will occur at the same time for equipment within a group. The

base frequency is typically either set by the refueling schedule (equipment for which at-

power maintenance is either impossible or undesirable) or calendar-based. The current

thinking for PM is typified by, for example, the INPO guidance, INPO (2002), whereby a

balance is sought between maintenance cost and production loss. By properly taking into

account the probability of production loss, the cost of lost production, the CM cost and PM

cost, the simple model described in this chapter captures industry practice. Not accounted

for in the model is the probability and cost of production loss due to PM during at-power

operation as well as potential critical path extension for PM during scheduled outages (such

as refueling outage). This is partly justified by the fact that PM is only performed during

the equipment’s group assigned outage time (unlike CM when the equipment outage time

is not predetermined) and because outage planning will (generally) assure PM is done off

critical path. In practice, there is normally one or two major equipment activities (main

4

turbine and generator, reactor vessel inspection, steam generator inspection, main condenser

inspection) that, along with refueling, govern critical path during planned outages.

2 Maintenance Model

The model we develop has the following constructs. The problem has a finite horizon of

length L (e.g., 25 or 27 years). We consider the following (positive) costs: Cpm – preventive

maintenance cost, Ccm – corrective maintenance cost, and Cd – downtime cost, which

includes all lost production costs due to a disruption of power generation. Let N(t) be the

counting process for the number of failures in the interval (0, t).

Let the random time to failure of the item from its as-new state be governed by dis-

tribution F with density f . Further assume that each failure of the item causes a loss of

production (i.e., a plant trip) with probability p and in that case a downtime cost, Cd > Ccm,

is instead incurred (this cost can include Ccm, if appropriate).

We consider the following form of a maintenance policy, which we denote (P1):

(P1): Bring the item to the “as-good-as-new” state every T units of time (preventive main-

tenance) at a cost of Cpm. If it fails meanwhile then repair it to the “as-good-as-old” state

(corrective maintenance) for a cost of Ccm or Cd, depending on whether the failure induces

a production loss. The optimization model has a single decision variable T , which is the

time between PMs, i.e., we assume constant interval lengths between PMs. The goal is to

find T ∈ A ⊂ [0, L] that minimizes the total expected cost, i.e.,

z(T ) = minT∈A

[CpmdL/T e+ {pCd + (1− p)Ccm} bL/T cE {N(T )}

+Cpm + {pCd + (1− p)Ccm} bL/T cE{N(L− T bL

Tc)}]

, (1)

where A = {i − integer, i ∈ [0, L]}, b·c is the “floor” (round-down to nearest integer)

operator, d·e is the “ceiling” (round-up to nearest integer) operator, and E {N(T )} is the

expected number of failures, taken with respect to the failure distribution, in the interval

(0, T ). Barlow and Hunter (1960), showed that for the above defined policy, the number

of CMs in an interval of length t follows a nonhomogeneous Poisson process with expected

5

number of failures in the interval (0, t),

E{N(t)} =∫ t

0q(u)du (2)

where q(u) = f(u)1−F (u) is the associated failure rate function. First we will describe an

algorithm to find the optimal T when the failure distribution is IFR (increasing failure

rate). Then we model the failure rate nonparametrically.

3 Optimization Results

The development of successful optimization algorithms in the IFR context detailed in the

previous section requires a closer examination of the objective function z(T ). This exam-

ination will culminate in certain key conditions that will guarantee a solution to the cost

minimization problem posed in equation (1). To this end, consider the following three

propositions in which the form of the IFR function is left unspecified, that is, the theory

below will hold for any arbitrary choice of an IFR function; for example, one could choose

a Weibull failure rate.

The proofs of the following propositions are given in Appendix B. Also the optimization

algorithm corresponding to these propositions, and which was used in the decision analysis

aspect of this chapter appears in Appendix B.

Proposition 1

Assume that we follow maintenance policy (P), and that the time between failures is a

random variable with an IFR distribution, i.e., the failure rate function q(t) is increasing.

Then, the objective function z(T ) is:

(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, . . ., and

(ii) increasing and convex on each interval ( Ln+1 ,

Ln ), n = 1, 2, . . ..

Proposition 2

Let D = {d : d = L/n, n = 1, 2, . . .} denote the set of discontinuities of z(T ) (cf. Proposi-

tion 1). Then,

(i) zc(T ) is quasiconvex on [0, L].

Furthermore if d ∈ D then

6

(ii) zc(d) = z(d), and

(iii) limT→d−

[z(T )− zc(T )] = Cpm.

Proposition 3

Let

T ∗c ∈ arg minT∈[0,L]

zc(T )

and let n∗ be the positive integer satisfying Ln∗+1 ≤ T

∗c ≤ L

n∗ . Then,

T ∗ ∈ arg minT∈{ L

n∗+1, Ln∗ }

zc(T )

solves minT∈[0,L]

z(T ).

4 Data and Bayesian Models

The assumption of constant failure rates with the associated exponential failure distribu-

tion and homogeneous Poisson process pervades analysis in today’s nuclear power industry.

Often times, the data gathered in industry are the number of failures in a given time pe-

riod, which is sufficient for exponential (or constant) failure times. In general, we cannot

expect to have a “complete data set” but rather a collection of failure times and “success”

times (a.k.a. right-censored times). Recognizing the limitations of the exponential failure

rate model, the Risk Management Department at STP has implemented the procedure now

recommended by the U.S. Department of Energy, see Blanchard (1993a). The constant

failure rate assumption is dropped in favor of a Bayesian scheme in which the failure rate is

assumed to be a gamma random variable, which is the conjugate prior for the exponential

distribution, and an updating procedure is defined. This methodology requires gathering,

at a local level, the number of failed items in a given month only. The mean of the posterior

distribution, combining the local and industry-wide history, of the failure rate is then being

used as a forecast of the frequency of failures for the upcoming month.

The hazard function associated with a gamma random variable is monotonically in-

creasing or decreasing to unity as time tends to infinity. However, a Weibull family allows

hazards to decay to zero or monotonically increase, and is therefore a common choice in

7

reliability studies when one is uncertain a priori that the instantaneous risk of compo-

nent failure becomes essentially constant after some time, as in the case of exponential and

gamma random variables. Yu et al. (2004a, 2005) apply Bayesian estimation procedures

when the failure rate distribution is assumed Weibull. Thus a Weibull failure rate model

is an improvement on the constant failure rate model. But parametric models such as the

exponential, gamma, and Weibull models all imply that the hazard rate function is uni-

modal and skewed right. This has serious consequences for the estimation of the expected

value in equation (1) if the true underlying failure rate is multi-modal with varying levels

of skewness. A semi-parametric approach that relaxes such stringent requirements on data

while retaining the flavor of these parametric families is therefore warranted.

We first describe the nuclear power plant data to be used in the Bayesian analysis.

4.1 Data

The failure data analyzed in this chapter are for 2 different systems, the Auxiliary Feedwater

System, part of the safety system, and the Electro-Hydraulic Control System, part of the

nuclear power generation reactor.

The primary function of the Auxiliary Feedwater System (AF) is to supply cooling water

during emergency operation. The cooling water supplied is boiled off in Steam Generators to

remove decay heat created by products of nuclear fission. There are four pumps provided in

the system, three electric motor driven and one turbine driven. Each of these pumps supplies

its own Steam Generator although, if required, cross connection capability is provided such

that any steam generator could be cooled by any pump. In addition, isolation valves, flow

control valves, and back flow prevention valves called check valves are installed to control

flow through the system and to provide capability to isolate the pumping system, if needed,

to control an accident. A simplified schematic diagram is shown in Figure 1.

We are investigating the performance of the following five groups:

• the electric motor driven pumps (six in total, three in each of two STP Units), labeled

GS1

8

• the flow control valves for all the pumps (eight in total, four in each of two STP

Units), labeled GS2

• the cross connect valves (eight in total, four in each of two STP Units), labeled GS3

• the accident isolation valves’ motor driver (eight in total, four in each of two STP

Units), labeled GS4

• the check valves (eight in total, four in each of two STP Units), labeled GS5.

We have failure time observations for the 5 different groups described above starting

from 1988 until May 2008. There are 56 exact observations for GS1, 2144 for GS2, 202 for

GS3, 232 for GS4, and 72 for GS5.

Figure 1: Diagram of the Auxiliary Feedwater System

The Electro-Hydraulic Control System (EHC) consists of the high pressure Electro-

Hydraulic fluid system and the electronic controllers, which are used for the control of the

main electrical generating steam turbine as well as the steam turbines that drive the normal

steam generator feed water supply pump. The high pressure steam used to turn the turbines

described above is throttled for speed control and shut off with steam valves using hydraulic

operators (pistons). The control functions include limiting the top speed of the turbines

up to and including shut down of speeds become excessive. EHC provides high pressure

hydraulic oil which is used as the motive force and the electrical signals that adjust the

9

speed control and steam shut-off valves for the turbines. Because only one main electrical

generating steam turbine is provided for each plant, production of electricity from the plant

is stopped or reduced for some failures in the EHC system.

We would like to study the performance of:

• Particular pressure switches that control starting and stopping of the hydraulic fluid

supply pumps; the turbine shutoff warning; and temperature control of steam to the

main turbine from a heat exchanger, labeled GN1.

• Oil cleaning filters located at the outlet of the hydraulic fluid supply pumps, labeled

GN2.

• Solenoid valves (referred to as servo valves) which control the operation of the steam

valves to the main electrical generating steam turbine, labeled GN3.

• The two hydraulic supply pumps (one is required for the system to operate), labeled

GN4.

• Solenoid valves which, when operating properly, respond to certain failures and stop

the main electrical generating steam turbine as necessary to prevent damage, labeled

GN5.

We have failure time observations for the five different groups described above starting

from 1988 until May 2008. There are 221 exact and 1 censored observations for GN1, 243

exact and 1 censored for GN2, 133 exact for GN3, 223 exact for GN4, and 91 exact for

GN5.

4.2 Bayesian Parametric Model

Here we present a Bayesian model for the failure times assuming that the lifetime distribu-

tion is Weibull with random parameters λ and β.

The sampling plan is as follows: If we have n items under observation,s of which have

failed at ordered times T1, T2, . . . , Ts, then n− s have operated without failing. If there are

10

no withdrawals then the total time on test is: ω = nT βs , which is also the sufficient statistic

for estimating λ (also known as the rescaled total time on test).

Assume that λ has an inverted gamma prior distribution with hyper-parameters ν0, µ0

and β has uniform prior distribution with hyper-parameters α0, β0.

We use the generic information (lognormal prior for λ) to assess the values of α0 and β0.

First we calculate the mean and variance for the lognormal distribution with parameters:

M = lnMn − S2/2 (3)

S2 =(

lnEF1.645

)2

(4)

where Mn is the mean and EF is the error factor as derived in Blanchard (1993b). Then

we compute the parameters of the Gamma prior distribution, α0, β0:

α0 = Mn/β0 (5)

β0 = e2MeS2(eS

2 − 1)/Mn. (6)

We will follow the development from Martz and Waller (1982). The posterior expecta-

tions of λ and β are:

E[λ|z] =J2

(s+ ν0 − 1)J1(7)

E[β|z] =J3

J1(8)

where z is a summary of the sample evidence, ν =∏si=1 Ti, ω1 = nTs + µ0, J1 =∫ β0

α0

[βsνβ

ωs+ν01

]dβ, J2 =

∫ β0

α0

[βsνβ

ωs+ν0−11

]dβ, and J3 =

∫ β0

α0

[βs+1νβ

ωs+ν01

]dβ.

We use numerical integration to solve the above integrals.

The choice of the hyper-parameters values is not random. We used the values given in

the DOE database (see Blanchard, 1993b) for the values of the inverted gamma parameters,

and empirically assessed the parameters of the uniform prior distribution, for details see Yu

et al. (2004b). Table 1 shows the Bayes estimates of the model parameters for each group

of data.

In the Appendix we offer screen shots that are used by the engineers at STP to implement

the above estimation model in combination with the optimization model described in §3 on

a routine basis.

11

Table 1: Bayes estimates of the parametric Bayes modelGroup λ β

GN1 0.000152 4.763118GN2 0.000448 1.994206GN3 0.000294 9.980280GN4 0.001594 1.006348GN5 0.000298 5.931251GS1 0.000313 9.910241GS2 0.001499 0.348347GS3 0.000932 0.655784GS4 0.001667 0.723488GS5 0.000308 11.109310

4.3 Bayesian Semi-parametric Model

The failure data is modeled using an accelerated failure time (AFT) specification, along

with the Cox proportional hazard regression model:

Sx(t) = S0(ex′βt),

where x is a p dimensional vector of covariates, Sx(·) the corresponding reliability function,

and S0 is the “baseline” reliability function.

The mixture of Polya trees models (MPTs) of Hanson (2006) are used to model the

reliability or survival data. These models are “centered” at the corresponding parametric

models but allow significant data-driven deviations from parametric assumptions while re-

taining predictive power associated with parametric modeling. This point will be elaborated

in Appendix A. This more flexible approach in essence assumes a nonparametric model for

S0; i.e., an MPT prior is placed on S0. For the power plant data considered in this chapter,

without loss of generality, we take the first group from each system to be represented by

S0. Then, the covariates are indicator variables corresponding to each of the remaining five

groups in the data. Note that, if needed, you could easily incorporate other continuous,

time-dependent covariates as well in the above formulation. Details of the MPT model are

provided in Appendix A.

The mixture of Polya trees (MPT) prior provides an intermediate choice between a

12

strictly parametric analysis and allowing S0 to be completely arbitrary. In some ways

it provides the best of both worlds. In areas where data are sparse, such as the tails,

the MPT prior places relatively more posterior mass on the underlying parametric family

{Gθ : θ ∈ Θ}. In areas where data are plentiful the posterior is more data driven; and

features not allowed in the strictly parametric model, such as left-skew and multimodality,

become apparent. The user-specified weight w controls how closely the posterior follows

{Gθ : θ ∈ Θ} with larger values of w yielding inference closer to that obtained from the

underlying parametric model. Hanson (2006) describes priors for w; we simply fix w to be

some small value, typically w = 1.

Assume standard, right-censored reliability data D = {(xi, ti, δi)}ni=1, and let Di denote

the ith triple (xi, ti, δi). Let Ti ∼ Sxi(·). As usual, δi = 0 indicates that ti is a censoring

time, Ti > ti, and δi = 1 denotes that ti is a survival time, Ti = ti.

Given S0 (through Y and θ) and β, the survival function for covariates x is

Sx(t|Y, θ, β) = S0(ex′βt|Y, θ), (9)

and the pdf is

fx(t|Y, θ, β) = ex′βf0(ex

′βt|Y, θ), (10)

where S0(t|Y, θ) and f0(t|Y, θ) are given by (16) and (17).

Assuming uninformative censoring, the likelihood for right censored data is then given

by

L(Y, θ, β) =n∏i=1

fxi(ti|Y, θ, β)δiSxi(ti|Y, θ, β)1−δi . (11)

The MCMC algorithm alternately samples [β, θ|Y,D] and [Y|β, θ,D]. The vector [β, θ|Y,D]

is efficiently sampled with a random walk Metropolis Hastings step (Tierney, 1994). A pro-

posal that has worked very well in practice is obtained from using the large sample estimated

covariance matrix from fitting the parametric log-logistic model via maximum likelihood.

A simple Metropolis-Hastings step for updating the components (Yε0, Yε1) one at a time

first samples a candidate (Y ∗ε0, Y∗ε1) from a Dirichlet(mYε0,mYε1) distribution, where m > 0,

13

typically m = 20. This candidate is accepted as the “new” (Yε0, Yε1) with probability

ρ = min

{1,

Γ(mYε0)Γ(mYε1)(Yε0)mY∗ε0−wj2(Yε1)mY

∗ε1−wj2L(Y∗, θ, β)

Γ(mY ∗ε0)Γ(mY ∗ε1)(Y ∗ε0)mYε0−wj2(Y ∗ε1)mYε1−wj2L(Y, θ, β)

},

where j is the number of digits in the binary number ε0 and Y∗ is the set Y with (Y ∗ε0, Y∗ε1)

replacing (Yε0, Yε1). The resulting Markov chain has mixed reasonably well for many data

sets unless w is set to be very close to zero.

Figure 2: Estimated hazard curves for the AF system using the MPT model

Based on Figures 2, 3, and 4 it is clear that the assumption of IFR is inappropriate.

The 10 groups depict different hazard patterns that is nicely captured by the MPT model.

Also, the survival probabilities are widely different. For example, for the EHC system, the

group GN3 is by far the most reliable, whereas the group GS1 performs best for the AF

system.

In contrast, the parametric Weibull model produces different results. Figures 5 and

14

Figure 3: Estimated hazard curves for the EHC system using the MPT model

6 show the corresponding hazard curves using the Bayes point estimators of the model

parameters.

15

Figure 4: Estimated survival curves for the AF and EHC systems using the MPT model

Figure 5: Estimated hazard curves for the EHC system using the parametric Bayes model

16

Figure 6: Estimated hazard curves for the AF system using the parametric Bayes model

17

Appendix A

In this chapter we have introduced Bayesian decision analysis to enable engineers reach

an optimal decision in the context of maintaining nuclear power plants. Typical Bayesian

analysis usually stops at inference. However, as noted by De Groot, Smith, and Berger,

for Bayesian methods to gain acceptance, one needs to extend posterior inference to actual

decisions.

In this chapter we have used Polya trees and mixtures of Polya trees to carry out

inference. We provide some details of this methodology.

Extending the work of Lavine (1992, 1994), mixtures of Polya trees have been considered

by Berger and Guglielmi (2001), Walker and Mallick (1999), Hanson and Johnson (2002),

Hanson (2006), Hanson et al. (2006), and Hanson (2007).

The last four references deal with survival or reliability data. This prior is attractive in

that it is easily centered at a parametric family such as the class of Weibull distributions.

When sample sizes are small, posterior inference has much more of a parametric flavor due to

the centering family. When sample sizes increase, the data take over the prior and features

such as multi-modality and left skew in the reliability distributions are better modeled.

Consider a mixture of Polya trees prior on S0,

S0|θ ∼ PT (c, ρ,Gθ), (12)

θ ∼ p(θ), (13)

where (12) is shorthand for a particular Polya tree prior (Hanson and Johnson, 2002). We

briefly describe the prior but leave details to the references above.

Let J be a fixed, positive integer and let Gθ denote the family of Weibull cumula-

tive distribution functions, Gθ(t) = 1 − exp{−(t/λ)α} for t ≥ 0, where θ = (α, λ)′. The

distribution Gθ serves to center the distribution of the Polya tree prior. A Polya tree

prior is constructed from a set of partitions Πθ and a family A of positive real num-

bers. Consider the family of partitions Πθ = {Bθ(ε) : ε ∈⋃Jl=1{0, 1}l}. If j is the

base-10 representation of the binary number ε = ε1 · · · εk at level k, then Bθ(ε1 · · · εk) is

18

defined to be the interval (G−1θ (j/2k), G−1

θ ((j + 1)/2k)). For example, with k = 3, and

ε = 000, then j = 0 and Bθ(000) = (0, G−1θ (1/8)), and with ε = 010, then j = 2 and

Bθ(010) = (G−1θ (2/8), G−1

θ (3/8)), etc.

Note then that at each level k, the class {Bθ(ε) : ε ∈ {0, 1}k} forms a partition of

the positive reals and furthermore Bθ(ε1 · · · εk) = Bθ(ε1 · · · εk0)⋃Bθ(ε1 · · · εk1) for k =

1, 2, . . . ,M − 1. We take the family A = {αε : ε ∈⋃Mj=1{0, 1}j} to be defined by αε1···εk =

wk2 for some w > 0 (Walker and Mallick, 1999; Hanson and Johnson, 2002). As w tends to

zero the posterior baseline is almost entirely data-driven. As w tends to infinity we obtain

a fully parametric analysis.

Given Πθ and A, the Polya tree prior is defined up to level J by the random vectors

Y = {(Yε0, Yε1) : ε ∈⋃M−1j=0 {0, 1}j} through the product

S0{Bθ(ε1 · · · εk)|Y, θ} =k∏j=1

Yε1···εj , (14)

for k = 1, 2, . . . ,M , where we define S0(A) to be the baseline measure of any set A. The

vectors (Yε0, Yε1) are independent Dirichlet:

(Yε0, Yε1) ∼ Dirichlet(αε0, αε1), ε ∈M−1⋃j=0

{0, 1}j . (15)

The Polya tree parameters “adjust” conditional probabilities, and hence the shape of

the survival density f0, relative to a parametric centering family of distributions. If the

data are truly distributed Gθ observations should be on average evenly distributed among

partition sets at any level j. Under the Polya tree posterior, if more observations fall into

interval Bθ(ε0) ⊂ R+ than its companion set Bθ(ε1), the conditional probability Yε0 of

Bθ(ε0) is accordingly stochastically “increased” relative to Yε1. This adaptability makes

the Polya tree attractive in its flexibility, but also anchors the random S0 firmly about the

family {Sθ : θ ∈ Θ}.

Beyond sets at the level J in Πθ we assume S0|Y, θ follows the baseline Gθ. Hanson

and Johnson (2002) show that this assumption yields predictive distributions that are the

same as from a fully specified (infinite) Polya tree for large enough J ; this assumption also

19

avoids a complication involving infinite probability in the tail of the density f0|Y, θ that

arises from taking f0|Y, θ to be flat on these sets.

Define the vector of probabilities p = p(Y) = (p1, p2, . . . , p2J )′ as pj+1 = S0{Bθ(ε1 · · · εJ)|Y, θ} =∏Ji=1 Yε1···εi where ε1 · · · εJ is the base-2 representation of j, j = 0, . . . , 2J − 1. After simpli-

fication, the baseline survival function is,

S0(t|Y, θ) = pN[N − 2JGθ(t)

]+

2J∑j=N+1

pj , (16)

where N denotes the integer part of 2JGθ(t)+1 and where gθ(·) is the density corresponding

to Gθ. The density associated with S0(t|Y, θ) is given by

f0(t|Y, θ) =2J∑j=1

2Jpjgθ(t)IBθ(εJ (j−1))(t) = 2JpNgθ(t), (17)

where εJ(i) is the binary representation ε1 · · · εJ of the integer i and N as is in (16). Note

that the number of elements of Y may be moderate. This number is∑J

j=1 2j = 2J+1 − 2.

For J = 5, a typical level, this is 62, of which half need to be sampled in an MCMC scheme.

Appendix B: Implementation in STP

Proposition 1 Assume that we follow maintenance policy (P), and that the time between

failures is a random variable with an IFR distribution, i.e., the failure rate function q(t) is

increasing. Then, the objective function z(T ) is:

(i) lower semicontinuous with discontinuities at T = L/n, n = 1, 2, . . ., and

(ii) increasing and convex on each interval ( Ln+1 ,

Ln ), n = 1, 2, . . ..

Proof

We first show (ii). Assume that T ∈ ( Ln+1 ,

Ln ) for some n. Within this interval

z′(T ) ≡ dz(T )dT

= nC̄cm {q(T )− q(L− nT )]} .

Here, we have used q(t) = ddtE[N(t)] and the fact that bL/T c = n is constant within

this n-th interval. That z increases on this interval follows from the fact that z′(T ) is

positive since q is increasing and L − nT < T for T ∈ ( Ln+1 ,

Ln ). To show convexity, we

20

show that z′(T ) is increasing. Let T + ∆T ∈ ( Ln+1 ,

Ln ). Then, q(T + ∆T ) ≥ q(T ) and

q(L− n(T + ∆T )) ≤ q(L− nT ), and hence z′(T + ∆T ) ≥ z′(T ). This completes the proof

of (ii).

The convexity result for z(T ) from (ii) shows that T = L,L/2, L/3, . . . are the only

possible points of discontinuity in z(T ). The first term in z(T ) (i.e., the PM term) is lower

semicontinuous because of the ceiling operator. The second term (i.e., z(T ) when Cpm = 0)

is continuous in T . To show this it suffices to verify that

bL/T cE [N(T )] + E

[N

(L− T

⌊L

T

⌋)]has limit nE[N(d)] for both T → d− and T → d+, where d = L/n for any n = 1, 2, . . .. This

follows in a straightforward manner using the bounded convergence theorem, e.g., Cinlar

(1975, pp. 33). Hence, we have (i).

Let

zc(T ) = Cpm(L/T ) + C̄cm(L/T )E [N(T )] ,

i.e., zc(T ) is z(T ) with bL/T c and dL/T e replaced by L/T . The following proposition

characterizes zc(T ) and its relationship with z(T ).

Proposition 2

Let D = {d : d = L/n, n = 1, 2, . . .} denote the set of discontinuities of z(T ) (cf. Proposi-

tion 1). Then,

(i) zc(T ) is quasiconvex on [0, L].

Furthermore if d ∈ D then

(ii) zc(d) = z(d), and

(iii) limT→d−

[z(T )− zc(T )] = Cpm.

Proof

(i): We have

zc(T ) =Cpm + C̄cmE [N(T )]

T/L. (18)

The numerator of (18) is convex in T because E [N(T )] has increasing derivative q(T ). As

a result, zc(T ) has convex level sets and is hence quasiconvex.

21

(ii): The result is immediate by evaluating z and zc at points of D.

(iii): Let d ∈ D. Then, there exists and positive integer n with d = L/n. Note that due to

the ceiling operator, we have the following result

limT→d−

{⌈L

T

⌉− L

T

}= 1. (19)

Based on (19) and the definitions of z and zc it suffices to show

bL/T cE [N(T )] + E

[N

(L− T

⌊L

T

⌋)]has limit nE[N(d)] as T → d−. This was established in the proof of part (i) in Proposition

1.

Proposition 2 shows that at its points of discontinuity, our objective function z drops by

magnitude Cpm to agree with zc. Before developing the associated algorithm, the following

proposition shows how to solve a simpler variant of our model in which the set A of feasible

replacement intervals is replaced by [0, L].

Proposition 3

Let

T ∗c ∈ arg minT∈[0,L]

zc(T )

and let n∗ be the positive integer satisfying Ln∗+1 ≤ T

∗c ≤ L

n∗ . Then,

T ∗ ∈ arg minT∈{ L

n∗+1, Ln∗ }

zc(T )

solves minT∈[0,L]

z(T ).

Proof

Proposition 1 says that z(T ) increases on each interval ( Ln+1 ,

Ln ), and is lower semicon-

tinuous. As a result, minT∈[0,L] z(T ) is equivalent to minT∈D z(T ), where D = {d : d =

L/n, n = 1, 2, . . .} is the set of discontinuities of z. This optimization model is, in turn,

equivalent to minT∈D zc(T ) since, part (ii) of Proposition 2 establishes that z(T ) = zc(T )

for T ∈ D. Finally, since zc(T ) is quasiconvex (see Proposition 2 part (i)), we know zc(T )

is nondecreasing on [T ∗c , L] and nonincreasing on [0, T ∗c ]. Proposition 3 follows.

22

The Optimization Algorithm

Proposition 3 shows how to solve minT∈[0,L] z(T ). The key idea is to first solve the

simpler problem minT∈[0,L] zc(T ), which can be accomplished efficiently, e.g., via a Fibonacci

search. Let T ∗c denote the optimal solution to the latter problem. Then, the optimal solution

to the former problem is either the first point in D to the right of T ∗c or the first point in

D to the left of T ∗c , whichever yields a smaller value of z(T ) (or zc(T ) since they are equal

on D).

As described above, our model has the optimization restricted over a finite grid of points

A ⊂ [0, L]. Unfortunately, it is not true that the optimal solution is given by either the

first point in A to the right of T ∗c or the first point in A to the left of T ∗c . However, the

results of Proposition 1, part (ii) establish that within the set A ∩ ( Ln+1 ,

Ln ) we can restrict

attention to the left-most point (or ignore the segment if the intersection is empty). To

simplify the discussion that follows we will assume that A has been redefined using this

restriction. Furthermore, given a feasible point TA ∈ A with cost z(TA) we can eliminate

from consideration all points in A to the right of min{T ∈ D : T ≥ T ∗c , zc(T ) ≥ z(TA)}

and all points in A to the left of max{T ∈ D : T ≤ T ∗c , zc(T ) ≥ z(TA)}. This simplification

follows from the fact that z increases on each of its intervals and is equal to the quasiconvex

zc on D.

Our algorithm therefore consists of:

• Step 0: Let T ∗c ∈ arg minT∈[0,L] zc(T ).

• Step 1: Let T ∗ ∈ A be the first point in A to the right of T ∗c and let z∗ = z(T ∗). Let

TA = T ∗.

• Step 2: Increment TA to be the next point in A to the right. If z(TA) < z∗ then

z∗ = z(TA) and T ∗ = TA. Let d be the next point to the right of TA in D. If

zc(d) ≥ z∗ then let TA to be the first point in A to the left of T ∗c and proceed to the

next step. Otherwise, repeat this step.

• Step 3: If z(TA) < z∗ then z∗ = z(TA) and T ∗ = TA. Let d be the next point to the

23

left of TA in D. If zc(d) ≥ z∗ then output T ∗ and z∗ and stop. Otherwise, increment

TA to be the next point in A to the left and repeat this step.

Step 0 solves the simpler continuous problem for T ∗c , as described above and Step 1

simply finds an initial candidate solution. Then, Steps 2 and 3, respectively, increment to

the right and left of T ∗c , moving from the point of A in, say, A ∩ ( Ln+1 ,

Ln ) to the point (if

any) in the next interval. Increments to the right and to the left stop when further points

in that direction are provably suboptimal. The algorithm is ensured to terminate with an

optimal solution.

Acknowledgements

The authors would like to thank the OR/IE graduate students Alexander Galenko and

Dmitriy Belyi for their help in proving the propositions and computational work. Thank

you to Ernie Kee for providing the data and helping with the nuclear engineering part of

the paper. This research has been partially supported by the Electric Power Research Insti-

tute under contract EP-P22134/C10834, STPNOC under grant B02857, and the National

Science Foundation under grants CMMI-0457558 and CMMI-0653916.

24

References

Barlow, R., Hunter, L., 1960. Optimum preventive maintenance policies. Operations Re-

search 8, 90–100.

Berger, J. O., Guglielmi, A., 2001. Bayesian testing of a parametric model versus nonpara-

metric alternatives. Journal of the American Statistical Association 96, 174–184.

Blanchard, A., 1993a. Savannah river site generic data base development. Tech. Rep. WSRC-

TR 93-262, National Technical Information Service, U.S. Department of Commerce, 5285

Port Royal Road, Springfield, VA 22161, Savannah River Site, Aiken, South Carolina

29808.

Blanchard, A., 1993b. Savannah river site generic data base development. Tech. Rep.

WSRC-TR 93-262, National Technical Information Service, U.S. Department of Com-

merce, 5285 Port Royal Road, Springfield, VA 22161, Savannah River Site, Aiken, South

Carolina 29808.

Bridges, M., Worledge, D., August 2002. Reliability and risk significance for maintenance

and reliability professionals at nuclear power plants. Tech. rep., EPRI Technical Report

1007079.

Chen, T.-M., Popova, E., 2000. Bayesian maintenance policies during a warranty period.

Communications in Statistics - Stochastic Models 16 (1), 121–142.

Cinlar, E., 1975. Introduction to stochastic processes. Prentice Hall.

Gann, C., Wells, J., 2004. Work Process Program. STPNOC plant procedure 0PGP03-ZA-

0090, Revision 28.

Hanson, T., 2006. Inferences on mixtures of finite Polya tree models. Journal of the Amer-

ican Statistical Association, 101, 1548–1565.

Hanson, T., 2007. Polya trees and their use in reliability and survival analysis. In: Ency-

clopedia of Statistics in Quality and Reliability. John Wiley & Sons, in press.

25

Hanson, T., Johnson, W., Laud, P., 2006. A semiparametric accelerated failure time model

for survival data with time dependent covariates. In: Upadhyay, S., Singh, U., Dey,

D. (Eds.), Bayesian Statistics and its Applications. New Delhi:Anamaya Publishers, pp.

254–269.

Hanson, T., Johnson, W. O., 2002. Modeling regression error with a mixture of Polya trees.

Journal of the American Statistical Association 97, 1020–1033.

Hudson, E., Richards, D., 2004. Configuration Risk Management Program. STPNOC plant

procedure 0PGP03-ZA-0091, Revision 6.

INPO, November 2002. Equipment Reliability Process Description. Institute of Nuclear

Power Operations process description, AP 913, Revision 1.

Lavine, M., 1992. Some aspects of Polya tree distributions for statistical modeling. Annals

of Statistics 20, 1222–1235.

Lavine, M., 1994. More aspects of Polya trees for statistical modeling. Annals of Statistics

22, 1161–1176.

Liming, J., Kee, E., Johnson, D., Sun, A., April 20-30 2003. Practical application of decision

support metrics for nuclear power plant risk-informed asset management. In: Proceedings

of the 11th International Conference on Nuclear Engineering. Tokyo, Japan.

Marquez, A., Heguedas, A., 2002. Models for maintenance optimization: a study for re-

pairable systems and finite time periods. Reliability Engineering and System Safety 75,

367–377.

Martz, H. F., Waller, R. A., 1982. Bayesian reliability analysis. John Wiley and Sons.

Popova, E., 2004. Basic optimality results for Bayesian group replacement policies. Opera-

tions Research Letters 32, 283–287.

Su, C.-T., Chang, C.-C., 2000. Minimization of the life cycle cost for a multistate system

under periodic maintenance. International Journal of Systems Science 31, 217–227.

26

Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of Statis-

tics 22 (4), 1701–1762.

Valdez-Florez, C., Feldman, R., 1989. A survey of preventive maintenance models for

stochastically deteriorating single-unit systems. Naval Research Logistics, 419–446.

Walker, S. G., Mallick, B. K., 1999. Semiparametric accelerated life time model. Biometrics

55, 477–483.

Wilson, J. G., Popova, E., 1998. Bayesian approaches to maintenance intervention. In:

Proceedings of the section on Bayesian Science of the American Statistical Association.

pp. 278–284.

Yu, W., Popova, E., Kee, E., Sun, A., Grantom, R., May 16-20 2005. Basic factors to

forecast maintenance cost for nuclear power plants. In: Proceedings of 13th International

Conference on Nuclear Engineering. Beijing, China.

Yu, W., Popova, E., Kee, E., Sun, A., Richards, D., Grantom, R., July 2004a. Equipment

data development case study-Bayesian Weibull analysis. Tech. rep., STPNOC.

Yu, W., Popova, E., Kee, E., Sun, A., Richards, D., Grantom, R., July 2004b. Equipment

data development case study-Bayesian Weibull analysis. Tech. rep., STPNOC.

27

bayesian analysis and decisions in nuclear power plant ...popova/popova_morton_damien_hanson.pdf ·...

Documents