
DESIGNING AND MANAGING FOR A RELIABILITY OF ZERO

Mike Hurley (1), Bill Purdy (2)

(1) Naval Research Laboratory, 4555 Overlook Avenue, SW, Washington DC 20375, +001 202-767-0528, [email protected]

(2) Purdy Engineering, 4555 Overlook Avenue, SW, Washington DC 20375, +001 202-767-0529, [email protected]

ABSTRACT

The goal of this paper is to provoke the reader to evaluate their thoughts on reliability. All space

systems desire high reliability; however, if a program delivers late, then the true reliability for

users is zero for every year it is late. Ironically, efforts to achieve high reliability often prove

counterproductive. The lack of a sound understanding about reliability can lead to unnecessary

design complexity, excessive process controls, increased costs, and unacceptable schedule

delays, resulting in reduced availability to the user. This paper discusses what reliability is and

what it is not, while highlighting common misunderstandings that often mislead designers and

managers. We examine historical space systems and on-orbit data to confirm key points. We

apply this study to the typical small satellite challenge of providing good reliability within a

small budget and short schedule. Ultimately, this paper strives to advance the industry-wide

understanding necessary to better achieve reliable, available systems for users.

1. APPROACH

In this paper we discuss reliability on three levels – philosophy, technical understanding, and

evaluation of reliability practices and benefits. We then discuss where small satellites and

systems are particularly good at improving aspects of reliability.

2. PHILOSOPHY

The space industry’s philosophy and management understanding of reliability may be one of the

most important drivers in space programs today. Larger space programs especially make many

decisions and implement many processes, consciously or subconsciously, that are rooted in the

“requirement” that the program “be reliable.” From a management perspective, this usually

means that in the end the system must work. While the system’s cost and schedule are essential

at the beginning of a program, these factors almost inevitably give way to technical and

programmatic factors often tied to both actual and perceived reliability. This reliability is heavily

influenced by the perspective of the space system program office and/or developers, and rarely by

the perspective of the end users. If a program delivers late, then the true reliability is zero for

every day (usually every year) it is late. Similarly, if a program’s cost doubles, then theoretically

the users have lost the benefit of an additional future space system.

Ironically, efforts to achieve high technical and process reliability often prove counterproductive

to schedule and cost, which are essential elements of reliability, especially from a user’s


perspective. The authors have witnessed many schedules extended and costs increased for the

sake of nominally robust processes. In fact, program cancellation due to budget overruns can and

does occur, resulting in a permanent reliability of zero. Unfortunately, the processes intended to

enhance reliability often are executed by personnel with little experience or technical knowledge

and are sometimes performed at great time and expense to address relatively low risk items.

Without examples and more technical discussion, comments like this are too general to be

productive. So, to conclude this philosophy section, let’s just emphasize that schedule and cost

should be viewed as critical elements of a space system’s reliability, and that true reliability for

user operations is what ultimately counts.

Technically - What Is Reliability?

True reliability is a measure of how well a system performs in its operational environment.

Regardless of the number of parts, design reviews, quality inspections, etc. used to develop the

system, in the end the system may prove highly reliable or terribly unreliable. By necessity, all

space system development has to deal with estimated and perceived reliability. According to the

Reliability Analysis Center, reliability is “the probability that an item can perform its intended

function for a specified interval under stated conditions without failure.” There is great value in

performing reliability related analyses and best practices, yet it can be surprising to watch how

much weight a program will put on a specific number resulting from a reliability analysis. This

single reliability number is often highly politically charged and is often presented to outside

organizations in an attempt to provide a simplified understanding of how likely the system is to

work. At a System Requirements and Design Review (SRDR), the authors witnessed a program

office order that the reliability analysis be completed by the Preliminary Design Review (PDR)

and at the same time announce that the reliability for the space system including launch will be

90%. While the resulting analysis may have been useful in creating a perceived reliability for

sponsors or others necessary to politically support the program, clearly this pre-determined

“analysis” did not add value to the design or final operations. Such political emphasis and

simplified understanding is undoubtedly one of the issues in applying good reliability analysis

and balanced processes to space systems.

What is the Reliability Prediction?

The reliability prediction is a calculated likelihood of the avoidance of part failures that induce

loss of a spacecraft or mission. In other words, 1 minus the reliability number equals the

probability of mission loss or degradation due to a part failure. The reliability analysis draws upon

mean-time-between-failure data for standardized parts and electrical connections. In the U.S.,

MIL-HDBK-217 is the guideline for performing such analyses. The analyses consider the failure

of each electronic part and connection in a system and the moving mechanical assemblies. It is

especially important to understand the contributors to true reliability that are and are not

considered in a reliability prediction:


Failure Modes Considered in Reliability Prediction

• Electronic part failure
• Solder joint failure
• Connector / pin failure
• Mechanical element failure, e.g. bearing failure

Failure Modes Not Considered in Reliability Prediction

• Design failure
• Software failure
• Operator error
• Improper build, assembly & workmanship
• Late launch
• Insufficient funds

As shown, the reliability prediction is based primarily on electronics and in particular electronic

part failure. This is useful, but the failure data are gathered from a database of the reliability of

mass produced electronics operated primarily on earth. The problem with such a database,

particularly in the rapidly evolving world of electronics, is that time is required to accumulate

statistically valid datasets and much of the supporting data are outdated.

Limitations of Reliability Prediction

Some of the best explanations of the limitations of reliability predictions come from MIL-HDBK-217 revision F, section 3. Direct extracts from the handbook follow.

3.3 Limitations of Reliability Predictions – This handbook provides a common basis for

reliability predictions, based on analysis of the best possible data at the time of issue. It is

intended to make reliability prediction as good a tool as possible. However, like any tool,

reliability prediction must be used intelligently, with due consideration of its limitations.

The first limitation is that the failure rate models are point estimates which are based on

available data.

Even when used in similar environments, the differences between system applications can

be significant. Predicted and achieved reliability have always been closer for ground

electronic systems than for avionic systems, because the environmental stresses vary less

from system to system on the ground and hence the field conditions are, in general, closer

to the environment under which the data was collected for the prediction model.

However, failure rates are also impacted by operational scenarios, operator

characteristics, maintenance practices, measurement techniques and differences in

definition of failure. Hence, a reliability prediction should never be assumed to represent

the expected field reliability as measured by the user … note that none of the

applications discussed above requires the predicted reliability to match the field measurement.

So the handbook itself points out that it was never intended to be an accurate predictor of

operational reliability or the probability of success. On-orbit data tend to strongly confirm this

point. For example, predicted reliability levels for generally single string, small satellites would

indicate that they are highly unlikely to last more than a couple of years. However,

satellites that survive the first year often last for 5 to 10 years. NASA’s Earth Observing-1 (EO-1)

satellite is a single string system that launched in November 2000. The EO-1 spacecraft bus

had a predicted reliability probability of success (Ps) of 75% for 1 year, 32% for 4 years, etc.

Note, this is only the bus reliability – the predicted payload reliability reduces these numbers


much further. Fig. 1 shows the ash plume from the April 2010 eruption of the Eyjafjallajökull

volcano as imaged by the “statistically impossible” EO-1 spacecraft nearly 9.5 years after launch. The

WindSat/Coriolis mission is an example of a moderate (but not high) reliability mission. The

WindSat payload is a 22 channel radiometer with mostly a single string design; its calculated

reliability predicted only a 3% probability of success in 2010, 7 years after launch, yet WindSat

is operating 24-7 today. The Coriolis spacecraft upon which the WindSat payload resides is fully

redundant yet in 7 years it has not experienced a failure on its primary side, thus functioning as if

it were single string and lasting 7 years. Conversely, even the highest reliability commercial

communication spacecraft with a Ps >95% for five years and >90% for 10 years regularly have

their mission lives substantially reduced. The “Satellites & Launches Trend Down” article by the

Teal Group [1] lists over a dozen high reliability geosynchronous spacecraft which had their

mission life reduced or ended within 5-10 years after launch.

Fig. 1. April 2010 Eruption of Eyjafjallajökull Volcano from the EO-1 spacecraft [2]
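
As a rough check on how far EO-1 outran its prediction, the quoted bus figures can be extrapolated with a constant-failure-rate (exponential) model. This is only a sketch: the exponential form and the extrapolation to 9.5 years are assumptions made here for illustration, not values taken from the EO-1 reliability analysis.

```latex
% Assumption: constant failure rate, so P_s(t) = P_s(1\,\text{yr})^{t} with t in years
P_s(4\,\text{yr}) = 0.75^{4} \approx 0.32
\qquad
P_s(9.5\,\text{yr}) = 0.75^{9.5} = e^{9.5 \ln 0.75} \approx e^{-2.73} \approx 0.065
```

The 4-year value reproduces the quoted 32%, which suggests the published numbers are consistent with this simple model. Under the same assumption, the bus alone would have had roughly a 6.5% predicted chance of surviving to the date of the image in Fig. 1.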

Value of Reliability Prediction

Some of the best explanations on the value of reliability predictions also come from MIL-

HDBK-217F:

3.2 The Role of Reliability Prediction - Reliability prediction provides the quantitative

baseline needed to assess progress in reliability engineering. A prediction made of a

proposed design may be used in several ways. Once a design is selected, the reliability

prediction may be used as a guide to improvement by showing the highest contributors to

failure. If the part stress analysis method is used, it may also reveal other fruitful areas

for change (e.g., over stressed parts). The impact of proposed design changes on

reliability can be determined only by comparing the reliability predictions of the existing

and proposed designs. The ability of the design to maintain an acceptable reliability level

under environmental extremes may be assessed through reliability predictions. The

predictions may be used to evaluate the need for environmental control systems.


So, reliability prediction analysis is a strong quantitative approach for comparing and improving

designs. This prediction analysis, along with associated reliability analyses such as the failure

modes and effects analysis (FMEA) and parts stress analysis over temperature, are excellent for

identifying weak links in a design and making corrections. Design improvements include

electronics parts substitution, implementation of select redundancy, changes in thermal design

requirements, and simplification to reduce complexity.
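
The comparison role described above can be illustrated with a small sketch. In a simple series model the part failure rates add, so ranking the contributions shows where a design change buys the most. The part names and FIT values below are hypothetical placeholders chosen for illustration, not handbook data, and the series assumption ignores any redundancy.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days; a convention assumed for this sketch

# Hypothetical failure rates in FIT (failures per 1e9 hours); not handbook values.
baseline = {
    "power converter": 900.0,
    "flight computer": 600.0,
    "reaction wheel": 500.0,
    "transceiver": 400.0,
}

def total_fit(parts):
    """Series model: any single part failure ends the mission, so rates add."""
    return sum(parts.values())

def reliability(parts, years):
    """P_s = exp(-lambda * t), with lambda converted from FIT to failures per hour."""
    lam_per_hour = total_fit(parts) * 1e-9
    return math.exp(-lam_per_hour * years * HOURS_PER_YEAR)

def ranked_contributors(parts):
    """Parts sorted from largest to smallest share of the total failure rate."""
    total = total_fit(parts)
    shares = [(name, fit / total) for name, fit in parts.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)

# Proposed design: substitute a (hypothetical) lower-rate power converter.
proposed = {**baseline, "power converter": 300.0}

if __name__ == "__main__":
    for name, share in ranked_contributors(baseline):
        print(f"{name:16s} {share:6.1%} of total failure rate")
    print(f"baseline 5-year P_s: {reliability(baseline, 5):.3f}")
    print(f"proposed 5-year P_s: {reliability(proposed, 5):.3f}")
```

Used this way, the absolute numbers matter less than the difference between the two designs, which is the use the handbook itself recommends.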

3. LINKING RELIABILITY TO SCHEDULE

A reliability prediction and true reliability are inextricably linked to the variable of time. The

user’s perspective on reliability cares not if the satellite will work eventually in the distant future,

but only that it is available today (or when promised).

A reliability requirement is typically defined as a probability of success of XX% at some number

of years after launch. This equates to R1 at year 1, R2 at year 2, etc. Writing the reliability

requirement by calendar year makes the impacts of program decisions and delays more visible. For

example, if R1 = 0.90 for 2012, R2 = 0.87 for 2013, R3 = 0.85 for 2014, etc., then, if a program’s

schedule slips a year to 2013, the reliability delivered to users in 2012 becomes R1 = 0.

The fundamental reliability equation will be used to illustrate the reliability/time relationship:

$P_s = e^{-\lambda t}$ (1)

where Ps is the probability of success, with the value 1.0 representing no chance of failure

whatsoever; λ is the failure rate, expressed in Failures In Time (FIT), where 1 FIT denotes one failure

per billion hours of operation (about 114,100 years); and t is time in hours. With λ in FIT and t in hours, the exponent in Eq. 1 is λt × 10^-9.
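
Eq. 1 can also be inverted to find the failure rate implied by a reliability requirement. As a worked sketch, assume a requirement of Ps = 0.90 at 5 years and take one year as 8,766 hours (365.25 days). Both the example requirement and the hours-per-year convention are assumptions chosen for illustration.

```latex
\lambda = -\frac{\ln P_s}{t}
        = -\frac{\ln 0.90}{5 \times 8766\ \text{h}}
        \approx \frac{0.1054}{43\,830\ \text{h}}
        \approx 2.4 \times 10^{-6}\ \text{h}^{-1}
        \approx 2400\ \text{FIT}
```

In other words, if every part is treated as a single point of failure, a requirement of 90% at 5 years leaves a total budget of roughly 2,400 FIT across the whole spacecraft.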

Let’s examine the probability of success for a single spacecraft from a user’s perspective. Our

user has been promised a space based capability to start in year zero and continue through year

five. We will study the results of two variables using Eq. 1: the predicted reliability and the

launch delay in years. These results describe the predicted reliability at the end of 5 years as the

reliability requirement is typically worded in spacecraft specifications. See Table 1 below.

Reliability analyses for four bounding cases are presented in Table 1: high, medium,

and low reliability systems delivered on schedule, and a fourth case of a high reliability system

delivered three years late, at the start of year 4. Case 4 shows the problematic situation of a user who is delivered

a wonderfully reliable spacecraft 3 years late. For the full first three years the user has no

capability; a reliability of zero has been achieved. This is no different from a launch failure

followed by the building of a second spacecraft over the following 3 years. The users are the

ultimate arbiters of value, and the authors suspect that most users would prefer cases one, two, or

even three over case four. Not many sponsors would agree to the case three spacecraft

requirement, “The spacecraft shall have a predicted reliability of 17% at the end of 5 years,” nor

would most sponsors agree to launch 3 years late. So perhaps it’s important to ask your user if

they would rather have low reliability on time or high reliability late.


Table 1. Reliability prediction analysis using equation 1 compares the expected probability of

success as a function of reliability level and launch delay

      Predicted       Delivery
      Reliability     Date; start    Reliability at End of Year
Case  at 5 Years      of Year        1     2     3     4     5     Comment
1     90%             0              98%   96%   94%   92%   90%   High Reliability, deliver on time
2     70%             0              93%   87%   81%   76%   70%   Medium Reliability, deliver on time
3     17%             0              70%   50%   35%   25%   17%   Low Reliability, deliver on time
4     90%             4              0%    0%    0%    98%   96%   High Reliability, deliver late
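
The year-by-year values in Table 1 can be approximated with a few lines of code. The sketch below assumes Eq. 1 with a constant failure rate derived from the 5-year prediction, and it assumes the user has no capability at all before launch. The function name and the `delay_years` parameter are illustrative, and the results agree with the table only to within about a percentage point because the rounding used for the published values is not known.

```python
import math

def annual_reliability(ps_at_5_years, delay_years=0, horizon_years=5):
    """Reliability seen by the user at the end of each program year.

    ps_at_5_years: predicted probability of success after 5 years on orbit.
    delay_years:   whole years of launch delay (0 means delivered on time).
    """
    # Invert Eq. 1, P_s = exp(-lambda * t), using years as the time unit.
    lam_per_year = -math.log(ps_at_5_years) / 5.0
    values = []
    for year in range(1, horizon_years + 1):
        years_on_orbit = year - delay_years
        if years_on_orbit <= 0:
            values.append(0.0)  # not yet delivered: a reliability of zero
        else:
            values.append(math.exp(-lam_per_year * years_on_orbit))
    return values

# The four bounding cases: (predicted reliability at 5 years, years of delay).
CASES = {1: (0.90, 0), 2: (0.70, 0), 3: (0.17, 0), 4: (0.90, 3)}

if __name__ == "__main__":
    for case, (ps, delay) in CASES.items():
        row = "  ".join(f"{value:4.0%}" for value in annual_reliability(ps, delay))
        print(f"case {case}: {row}")
```

Changing `delay_years` for any case makes the cost of a schedule slip visible immediately: each year of delay replaces a non-zero entry with a zero, regardless of how good the predicted reliability is.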

4. TRUE RELIABILITY FOR SPACE SYSTEMS – HISTORY AND EXAMPLES

“A study of on-orbit spacecraft failures” by Mak

Tafazoli [3] studied more than 4000 spacecraft

launched in the past 25 years and identified 156 on-

orbit failures that occurred on 129 different spacecraft

from 1980 to 2005. In this study it was observed that

41% of all failures happen within 1 year of on-orbit

activities. He observes that the data suggest

insufficient testing and inadequate modeling of the

spacecraft and its environment. See Fig. 2. The study

concludes that the reliability lessons learned are that

adequate testing, redundancy, and flexibility are the

keys to a reliable spacecraft failure recovery system. An example of flexibility would be the

ability to upload new software to run new attitude control modes without using a failed sensor.

Notice that only 1 of these 3 characteristics, redundancy, is even included in reliability

prediction. Based on this study, adequate testing and flexibility in the design are two of the three

most important things space programs should do to realize true operational reliability. Therefore,

although testing and flexibility are not included in reliability predictions, programs must consider

them carefully to realize a truly reliable space system.

Comparing the real world results versus the theoretical reliability mathematics reinforces the

point that the reliability prediction is not a prediction of true on-orbit reliability. The actual data

and calculated reliability do not even have the same failure trends. Referring to the commonly

used reliability equation (Eq. 1) and Table 2, the mathematical prediction of reliability inherently

decreases over time. Stated another way, the cumulative probability of failure (1 - Reliability) grows

steadily over time at a constant failure rate, predicting that failures will be spread across the life of a spacecraft

rather than concentrated early. Yet the data consistently show the opposite.

Fig. 2. Spacecraft Failures over Time


Table 2. Spacecraft Failure Distribution Grouped by Years On-Orbit

Years on orbit       0 - 1   1 - 3   3 - 5   5 - 8   >8
Share of failures    41%     17%     20%     16%     6%

Comment: On-orbit failures per Tafazoli’s study [3] of 4000 spacecraft from 1980 to 2005

Far more satellite failures occur in the first year than the mathematics of reliability indicate. Also

contrary to the mathematics, fewer failures occur in later years and, once functioning, spacecraft

life often far exceeds predictions. Tafazoli’s results indicate that infant mortality holds true for

spacecraft. The belief that the preponderance of failures in the first year is driven by design faults

that expose themselves early in a mission is corroborated in a variety of reports. [3,4]
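
Because the bins in Table 2 have different widths, the infant-mortality trend is easier to see when each share is divided by the number of years in its bin. The sketch below does only that arithmetic on the published percentages. It is not normalized by how many spacecraft were still operating at each age, and the open-ended ">8" bin is left out because it has no defined width.

```python
# Failure shares from Table 2, keyed by (start_year, end_year) on orbit.
# The ">8" bin (6%) is omitted because its width is undefined.
observed_share = {(0, 1): 0.41, (1, 3): 0.17, (3, 5): 0.20, (5, 8): 0.16}

def share_per_year(bins):
    """Average share of all recorded failures per year within each bin."""
    return {span: share / (span[1] - span[0]) for span, share in bins.items()}

rates = share_per_year(observed_share)

if __name__ == "__main__":
    for (start, end), rate in rates.items():
        print(f"years {start}-{end}: {rate:.1%} of recorded failures per year")
```

On this basis the first year accounts for several times more failures per year than any later period, which is the trend the text describes.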

However, one should be aware that more advanced reliability calculations can be performed to

better represent the data collected. MIL-HDBK-217F does support use of a Weibull distribution

function, which can be used to better predict on-orbit results. The failure rate of the Voyager

spacecraft shown in Fig. 3 was calculated using a Weibull distribution. This figure comes from

“The Cosmos on a Shoestring, Small Spacecraft for Space and Earth Science, Appendix B” by

Liam Sarsfield. [4] For scale, 160,000 hours is a bit over 18 years. The result is a decreasing

failure rate over time which is consistent with on-orbit data. Unfortunately, space system

reliability calculations typically use more basic equations, resulting in pessimistic reliability estimates and

predicted lifetimes.

Fig. 3. Failure rate predictions based on Weibull distribution reliability analysis
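
The difference between the basic model and a Weibull model can be sketched in a few lines. With a shape parameter of 1 the Weibull form reduces to Eq. 1, while a shape parameter below 1 gives a failure rate that falls with time. The shape and scale values here are hypothetical, chosen only to show the behavior, and are not fitted to Voyager or to any other data set.

```python
import math

def weibull_reliability(t_hours, shape, scale_hours):
    """Survival probability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t_hours / scale_hours) ** shape))

def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1).

    Requires t_hours > 0 when shape < 1, because the rate is unbounded at t = 0.
    """
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)

# Hypothetical parameters for illustration only.
SHAPE = 0.5              # shape < 1 gives a failure rate that decreases with time
SCALE_HOURS = 200_000.0

if __name__ == "__main__":
    for t in (1_000, 10_000, 40_000, 160_000):
        hazard = weibull_hazard(t, SHAPE, SCALE_HOURS)
        survival = weibull_reliability(t, SHAPE, SCALE_HOURS)
        print(f"t = {t:>7} h  failure rate = {hazard:.2e} per hour  R = {survival:.3f}")
```

Setting `SHAPE` to 1.0 makes the failure rate constant at 1 / `SCALE_HOURS`, which recovers the basic exponential model for comparison.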


Launch Vehicle Contribution to True Reliability for Space Systems

A related and key factor in all space systems is the launch vehicle. While it is not the intent of

this paper to discuss launch vehicle reliability at any depth, the authors want to highlight that the

true, measured reliability of proven launch vehicles is only 90-95%. “Of the 4378 space launches

conducted worldwide between 1957 and 1999, 390 launches failed (the success rate was 91.1

percent), with an associated loss or significant reduction of service life of 455 satellites (some

launches included multiple payloads).” [5] Regardless of the attempted reliability for all the other

aspects of the spacecraft, the space system reliability cannot exceed the launch vehicle’s

reliability. This fact should have implications often in the favor of small to medium size

spacecraft for the simple reason that large spacecraft, no matter how well designed or thoroughly

tested, will have approximately a 1 in 10 chance of never reaching orbit. The bigger the

spacecraft, the bigger the impacts to the cost, the schedule, and the users. Unlike many spacecraft

failures, which can occur years into a mission or still allow degraded mission performance, a

launch failure usually results in total mission loss, R=0. Users, of course, do not care what the

source of mission loss is; all the user knows is that the capability they were depending on is not

available.
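
The launch vehicle ceiling can be made concrete with the historical figures quoted above. The sketch treats launch and spacecraft as independent elements in series, which is an assumption made here for illustration. The 0.90 spacecraft value in the usage lines is likewise only an example.

```python
# Historical launch record quoted in the text (1957 to 1999).
LAUNCHES = 4378
FAILURES = 390

launch_success = (LAUNCHES - FAILURES) / LAUNCHES  # about 0.911

def system_reliability(spacecraft_ps, launch_ps=launch_success):
    """Series combination: the mission needs a successful launch AND a working spacecraft.

    Assumes launch and spacecraft failures are independent.
    """
    return launch_ps * spacecraft_ps

if __name__ == "__main__":
    print(f"launch success rate: {launch_success:.1%}")
    print(f"perfect spacecraft:  {system_reliability(1.0):.1%}")
    print(f"spacecraft at 0.90:  {system_reliability(0.90):.1%}")
```

Even a spacecraft with a predicted reliability of 1.0 cannot lift the system above the launch success rate, which is why additional spending on spacecraft reliability yields diminishing returns at the system level.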

Spacecraft Compared to Aircraft

While on the topic of the reliability of space systems, one sometimes hears “spacecraft need to

become reliable like aircraft.” While this is true in principle, the space industry needs to realize it is a

relatively mature industry and that demand for spacecraft is fundamentally much smaller than for

aircraft. The space industry has matured to between 80 and 125 launches per year during the last

2 decades, actually with a downward trend [1]. By comparison, the commercial airlines flew

over 10,000,000 flights in 2009 alone [6] and this number does not include military or private

flights. This high quantity demand allows the airline industry to manage reliability differently

and predict reliability much more accurately. Mass production with multiple iterations on every

model, regular maintenance, proven flight simulation modeling, and highly matured operations

are tools used by the aircraft industry that have limited benefits to the spacecraft industry. The

spacecraft industry should maximize use of these tools and best practices to improve reliability

where applicable. However, the space industry must remain cognizant of the major differences

between the aircraft and spacecraft industries and, as a result, address space system reliability

quite differently. As an aside, readers who fly frequently will be comforted to know that

the odds of dying while flying on any of the top 25 airlines are only 1 in 9.2 million! [7]

5. DISCUSSION SPECIFIC TO SMALL SATELLITES AND SYSTEMS

Small satellites and systems have some inherent benefits toward addressing reliability both

mathematically and in real terms. At a macro level, the quantity of small satellites tends to be

larger and the cost smaller. Missions that require more than one satellite typically degrade

gracefully, and single small satellite failures tend to have no appreciable effect on a national

scale as the investment is inherently modest. Small satellites also tend to have shorter schedules.

The shorter schedules allow newer, often better technologies to be used in the design. On a

technology note, the Technology Readiness Level (TRL) concept provides a sound process to

mature a new technology into a flight qualified item; however, this concept can be misused by

programs to eliminate valuable technologies, for example by demanding, “We must have TRL-7 technologies or greater.”


This type of blanket requirement eliminates new technologies that can make the job much easier,

or fundamentally enable the mission. Instead, the technology should be developed and qualified

on a timeframe consistent with the program. Conversely, this type of TRL requirement can

pressure space programs to pretend that needed technologies are at a higher TRL than they are to

get a mission go-ahead. This situation causes an artificial feeling of comfort often resulting in a

lack of focus and resources needed early on to ensure that the development does not affect the

schedule. Smaller and larger satellites complement each other well in many ways, including

reliability. On a mission level, larger systems allow for larger aperture, power, etc., enabling

capabilities simply not possible on smaller systems due to physics. In terms of reliability, larger

systems and programs can often afford and justify more thorough quality assurance, testing (such

as parts radiation testing), independent reviewers, etc. A mix of both small and large space

systems can best address the wide range of space missions, users, and reliability needs.

6. CONSIDERATIONS ON ACHIEVING TRUE RELIABILITY

It has been shown that reliability analysis is great for improving a design, but fundamentally

misapplied as a predictor of spacecraft success on orbit. So let’s consider true reliability and how

best to achieve it with limited resources.

True reliability depends upon many factors: meeting mission performance, surviving the

environments, avoiding parts failures, proper manufacture such that the spacecraft is built as

designed, staying on budget or at least close enough that the project is not cancelled, meeting the

schedule promised to users, proper operations, and robust software.

These factors are addressed by space programs through a variety of practices. There are many

practices that can be applied towards the goal of true reliability. Every program must select,

prioritize and pay for a combination of reliability practices that best support their mission. These

practices include good design, thorough testing, large margins and design/operational flexibility,

redundancy, use of mass produced components (although this is rarely available in the low

production world of spacecraft), reliability analysis, manufacturing and process controls

including quality assurance, preparation of mission simulators, and budgeting for operations

training.

The authors have provided their opinions on the relative contributions of each of these practices

to each of the elements of true reliability in Table 3. As this is a matter of judgment, one can

argue with any of the individual positions. However, the value of this analysis is in the

observation of the broad trends between practice and result.


Table 3. Authors’ opinion - effects of development practices on true spacecraft reliability

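Each reliability practice is listed with its effect on each failure mode:

Good Design
  Meets Mission Performance: strong benefit
  Survives Environments - Stress & Thermal: strong benefit
  Avoidance of parts failure, radiation, & wear out: weak benefit through simplicity
  Built as Designed: NA
  Meets Budget: weak opposition through higher cost
  Meets Schedule: strong benefit through simplicity
  Operator Error: weak benefit through simplicity
  Software Failure: strong benefit

Good Testing
  Meets Mission Performance: strong benefit
  Survives Environments - Stress & Thermal: strong benefit
  Avoidance of parts failure, radiation, & wear out: NA
  Built as Designed: strong benefit
  Meets Budget: weak opposition through higher cost
  Meets Schedule: weak opposition through longer schedule
  Operator Error: strong benefit if test like you fly
  Software Failure: strong benefit

Flexibility & Margins
  Meets Mission Performance: NA
  Survives Environments - Stress & Thermal: ability to survive after component failures
  Avoidance of parts failure, radiation, & wear out: weak benefit - margins provide additional robustness against some part failures
  Built as Designed: NA
  Meets Budget: weak opposition through higher cost
  Meets Schedule: strong benefit
  Operator Error: more likely can recover from operator errors
  Software Failure: NA

Redundancy
  Meets Mission Performance: NA
  Survives Environments - Stress & Thermal: ability to survive after component failures
  Avoidance of parts failure, radiation, & wear out: strong benefit
  Built as Designed: NA
  Meets Budget: strong opposition because of cost of parts & complexity
  Meets Schedule: weak-to-strong opposition through increased build and test schedule
  Operator Error: NA
  Software Failure: NA

Use of Mass Production Components if Available
  Meets Mission Performance: NA
  Survives Environments - Stress & Thermal: weak benefit because capabilities known in advance
  Avoidance of parts failure, radiation, & wear out: strong benefit because true reliability data exists & learning curve complete
  Built as Designed: weak benefit
  Meets Budget: strong benefit through production efficiency
  Meets Schedule: strong benefit through production efficiency or truly off the shelf
  Operator Error: weak benefit - ops of component often well understood
  Software Failure: applicability depends on specific component type

Reliability Analysis
  Meets Mission Performance: weak benefit through circuit improvements
  Survives Environments - Stress & Thermal: weak benefit through parts thermal stress analysis
  Avoidance of parts failure, radiation, & wear out: strong benefit
  Built as Designed: NA
  Meets Budget: weak-to-strong opposition because of cost of Hi-REL parts if chosen
  Meets Schedule: weak-to-strong opposition because of lead time of Hi-REL parts if chosen
  Operator Error: NA
  Software Failure: NA

Rigorous Manufacturing & QA Controls
  Meets Mission Performance: NA
  Survives Environments - Stress & Thermal: NA
  Avoidance of parts failure, radiation, & wear out: strong benefit
  Built as Designed: strong benefit
  Meets Budget: strong opposition through cost of paperwork
  Meets Schedule: weak opposition through longer schedule
  Operator Error: weak benefit from QA & config control of operations procedures
  Software Failure: strong benefit through software QA

Mission Simulation & Training
  Meets Mission Performance: strong benefit "flying" scenarios before launch, increases on-orbit availability
  Survives Environments - Stress & Thermal: NA
  Avoidance of parts failure, radiation, & wear out: NA
  Built as Designed: NA
  Meets Budget: weak opposition, additional cost for mission simulator & training
  Meets Schedule: weak benefit from allowing parallel testing
  Operator Error: strong benefit
  Software Failure: strong benefit in wringing out errors & inefficiencies in both ground and flight software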

Let’s study each practice and the resulting trends in true reliability shown in the table above.

Reliability Effects of Good Design

Good design is crucial to any product, but let’s examine the specific benefits and costs as applied

to spacecraft reliability. Good design is essential to meeting mission performance and surviving

the environment. A good design is simple yet meets all functions. The simplicity of a good

design actually has important reliability benefits through schedule control because of the

efficiency with which a good design can be built and tested and in the reduced quantity of time-

consuming test failures. As a design element, software derives great reliability benefits from

good design. Because nothing is free, good design takes time and

funding, increasing schedule and cost. Still, the broad trend is that good design pays many

reliability dividends, indicating that a program desiring true reliability gets much value for the

application of resources towards good design.

Reliability Effects of Good Testing

A perfect design will work perfectly without any testing. However, the next perfect design will

be the first in recorded history. Testing is crucial in finding problems on the ground before they

become reliability problems on orbit. Good testing pays large dividends, including exposing

problems in meeting thermal and launch environments, and in finding manufacturing flaws and

software flaws. Additionally, when one follows the “test like you fly” adage, flight operations

reliability ensues from the refinement and practice of flight operational ground software and

procedures. Because nothing is free, good testing requires time and funding, increasing schedule

and cost. Testing carries the broad trend of large reliability benefits with low negative impacts.

Reliability Effects of Flexibility and Margins

A spacecraft with inherent flexibility in its operation, either through forms of

redundancy or through a design that can perform core functions in multiple ways, can

continue operating in the face of component failures or operations errors. Similarly, large

margins either at the component or spacecraft level allow a spacecraft to continue operation in

the face of failures. Let’s examine how the power subsystem design affects the spacecraft:

• Case 1 is a spacecraft with modest power margins and a single array on a single gimbal. A

gimbal failure will disable this spacecraft and end its mission.

• Case 2 is the same spacecraft with large power margins and a single array on a single gimbal.

A gimbal failure will limit the capability of this spacecraft, but the high margins may create

enough power without gimballing the array to continue operations with modest degradation.

• Case 3 is the same spacecraft with large power margins but, instead of a single large array on a

single gimbal, it has two half size arrays on two gimbals. Should a gimbal failure occur, the

high margins, coupled with the second gimbaled array, are likely to give ground operators the

flexibility needed to maintain most, if not all, of the mission capability.

These three cases illustrate the benefits and costs of margins and flexibility. Larger margins

increase cost and mass for the solar array, as does a second gimbaled array. The broad trend of

flexibility and margins is that both the benefits and costs will be unique for each mission and

each subsystem. A program’s reliability will benefit from a continuous search for opportunities

to provide flexibility and margins wherever available with a tolerable cost.
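
The three cases above can be made concrete with a small numerical sketch. Every number below is hypothetical and chosen only for illustration: a 100 W load, an assumption that an array stuck on a failed gimbal still yields 40% of its sun-tracking output on orbit average, and array sizes picked to represent modest and large margins. This is not a power budget for any real spacecraft.

```python
# All values are hypothetical and exist only to illustrate Cases 1-3.
LOAD_W = 100.0          # power needed for full mission operations
FIXED_FRACTION = 0.4    # assumed orbit-average output of a stuck array
                        # relative to the same array when sun-tracking


def power_after_one_gimbal_failure(array_powers_w):
    """Orbit-average power (W) when the first array's gimbal has failed.

    array_powers_w lists the sun-tracking output of each gimbaled array.
    The first array is treated as stuck; the rest keep tracking.
    """
    stuck, *tracking = array_powers_w
    return stuck * FIXED_FRACTION + sum(tracking)


def capability_fraction(array_powers_w):
    """Fraction of the full load still supplied, capped at 1.0."""
    return min(1.0, power_after_one_gimbal_failure(array_powers_w) / LOAD_W)


cases = {
    "Case 1 (modest margin, one array)": [110.0],
    "Case 2 (large margin, one array)": [200.0],
    "Case 3 (large margin, two half-size arrays)": [100.0, 100.0],
}

for name, arrays in cases.items():
    print(f"{name}: {capability_fraction(arrays):.0%} of load after one gimbal failure")
```

Under these assumed numbers, Case 1 supplies less than half the load, Case 2 supplies most of it, and Case 3 still supplies all of it. The ranking, not the specific percentages, is the point; a real trade would also weigh the added cost and mass noted above.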

Reliability Effects of Redundancy

Redundancy has large benefits for reliability by providing insurance against bad parts or

environmental failure. While redundancy can be one of the best ways to improve true reliability,

it is also one of the most expensive options due to financial cost, weight, modest increases in

production and test schedules, and software complexity. Redundancy will not protect against

fundamental design flaws; the redundant item will have the same flaw. The authors have

frequently witnessed spacecraft redundancy extend operational mission life by many years. The

authors have also seen several cases where entire missions were lost, or greatly degraded, due to

the blind application of a redundancy requirement. The latter case has been especially prevalent

for mechanisms. Often mechanisms can be made more reliable by testing a simple non-redundant

design rather than building the necessarily more complex design that redundancy requires. For example, redundant solar

array drives require the complexity of an additional clutching mechanism to switch between

drives. Perhaps the most important point is to use redundancy smartly. If applied well,

redundancy is highly effective at increasing true reliability; if applied poorly, redundancy leads

to unnecessary cost and complexity that substantially reduce true reliability.
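
A short sketch shows how the added switching mechanism can erase the gain from redundancy. It uses the standard textbook expression for two parallel units in series with a switch, and it assumes the two units fail independently, which is exactly the assumption that breaks down for a shared design flaw. All probabilities are hypothetical and serve only to compare options, not to predict on-orbit reliability.

```python
def redundant_pair_reliability(r_unit, r_switch):
    """Probability that the switch works AND at least one of two units works.

    Assumes the two units fail independently of each other and of the switch.
    """
    return r_switch * (1.0 - (1.0 - r_unit) ** 2)


# Hypothetical values for a solar array drive trade.
simple_drive = 0.98    # well-tested, simple non-redundant drive
complex_unit = 0.95    # each drive in the more complex redundant design

# Redundancy applied poorly: the clutch itself is a notable failure source.
pair_poor = redundant_pair_reliability(complex_unit, r_switch=0.97)

# Redundancy applied well: the switching element is very reliable.
pair_good = redundant_pair_reliability(complex_unit, r_switch=0.999)

print(f"simple non-redundant drive: {simple_drive:.4f}")
print(f"redundant pair, weak clutch: {pair_poor:.4f}")
print(f"redundant pair, strong clutch: {pair_good:.4f}")
```

With these assumed values the redundant pair behind a weak clutch comes out below the simple drive, while the same pair behind a highly reliable switch comes out above it. The sketch leaves out the cost, mass, and schedule penalties of redundancy discussed above.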

Reliability Effects of the Use of Mass Production Spacecraft Components

Aircraft and automotive reliability gain tremendous benefits from the fact that mass-produced

components are available for key functions. Mass-produced components have typically

completed their learning curve so that most, if not all, reliability weaknesses have been removed. Mass

production provides great benefits in manufacturing by consistently producing the intended

design and driving down the cost and schedule. Unfortunately, very few spacecraft components

are mass produced other than electronics piece parts and similar items. The authors hope that, as

the space industry continues to mature, commonality can be driven into component

specifications and production quantities can be sufficient to support mass production

components. While there are large reliability benefits available from mass production with

minimal additional cost, mass-produced components are quite hard to find for spacecraft.

Reliability Effects of Reliability Analysis

Reliability analysis helps avoid parts failures and detect weaknesses in designs at a relatively modest

economic cost. Reliability prediction analysis, along with associated reliability analyses such as

the failure modes and effects analysis (FMEA) and parts stress analysis over temperature, are

excellent for identifying weak links in a design and making improvements. If misused as a

predictor of true spacecraft reliability, the analytical predictions often drive programs towards

very expensive, long-lead class 1 maximum-reliability parts. Both MIL-STD-217F and

on-orbit data confirm that using these calculations to predict on-orbit reliability is inappropriate and

inaccurate. Therefore, the spacecraft community must avoid this tendency for misuse and use

reliability analyses only to drive decisions for which they are applicable. Proper reliability

analysis can be one of the most economical practices for improving spacecraft reliability;

however, its misuse as a predictor of on-orbit spacecraft reliability can lead to great cost and

schedule expense with little-to-no true reliability increase and potentially even major reliability

decreases such as program cancellation.

Reliability Effects of Rigorous Manufacturing Controls

Manufacturing a spacecraft under rigorous production process control provides strong

reliability benefits by keeping bad parts from slipping into the build, ensuring that the built

spacecraft meets the engineered design, and implementing thorough software quality assurance.

These practices are always necessary at some level as they can prevent costly errors at the system

level and catch some items which testing is simply unable to screen, such as parts radiation

hardness. The price one pays for strict process control is cost and schedule increases. The degree

of process control is best decided on the basis of careful considerations of the costs and benefits

relative to the unique circumstances of a program’s reliability needs.

Reliability Effects of Mission Simulation and Training

Sound flight operations carry large increases in true reliability through increased spacecraft

availability and avoidance of mission ending operations failures. The preparation of a realistic

mission simulator, supported by sufficient training, provides these benefits. The economic cost to

design and build the simulator is often a substantial program decision made relatively early in the

program. The authors have consistently seen programs that chose not to develop a good

simulator pay much greater cost downstream due to technically inaccurate testing, serial testing

constraints, excessive testing on the spacecraft due to lack of alternatives, and poor spacecraft

availability during the first year on orbit. Further, such a simulator fundamentally enables sound

training at relatively small cost. For almost all spacecraft programs, the authors recommend a

simulator that, at a minimum, enables commands and telemetry to be sent, and represents attitude

determination and control events (e.g., slews and ground contacts). A simulator and associated

training costs are almost always worth the operational reliability benefits.

Selection of Practices for True Reliability

The analysis above shows relative merits and impacts of major practices on true reliability. Each

program has unique circumstances which adjust the relative merits and costs of each practice.

Therefore, each program must create its own reliability plan given its unique

circumstances.
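
One way to frame such a value judgment from the user's perspective is to count the expected years of service actually delivered within the period the user needs the system. The sketch below is our simplification, not a method defined in this paper: it treats every year of delay as a year of zero service and multiplies the remaining years by an assumed probability of mission success. The eight-year need, the delays, and the success probabilities are all hypothetical.

```python
def expected_service_years(need_years, delay_years, p_success):
    """Expected years of service delivered inside the user's window of need.

    Each year of delay counts as zero service; the remaining years are
    weighted by the assumed probability that the mission succeeds.
    """
    years_available = max(0.0, need_years - delay_years)
    return years_available * p_success


NEED_YEARS = 8.0  # hypothetical period over which the user needs the system

# Plan A: deliver on time with a lower assumed probability of success.
on_time_plan = expected_service_years(NEED_YEARS, delay_years=0.0, p_success=0.90)

# Plan B: extra process and parts controls raise the assumed probability
# of success but deliver two years late.
late_plan = expected_service_years(NEED_YEARS, delay_years=2.0, p_success=0.97)

print(f"on-time plan: {on_time_plan:.2f} expected service years")
print(f"late plan:    {late_plan:.2f} expected service years")
```

Under these assumptions the on-time plan delivers more expected service even though its assumed probability of success is lower. A program would substitute its own need period, delay estimates, and cost constraints before drawing any conclusion.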

7. SUMMARY

• The title of this paper, Designing and Managing for a Reliability of Zero, is meant to bring

attention to the counterintuitive fact that overly or improperly applied reliability practices often

decrease true reliability. This occurs most notoriously by causing schedule delays, and

sometimes even cancellations due to budget overruns, providing an effective reliability of zero

for users.

• Reliability analysis is great for improving a design, but is fundamentally misapplied when used as a

predictor of spacecraft success on orbit. Both MIL-STD-217F and on-orbit data confirm that the

use of these calculations to predict on-orbit reliability is inappropriate and inaccurate.

Therefore, the spacecraft community must avoid this tendency for misuse and use reliability

analyses only to drive decisions for which they are applicable.

• Proper reliability analysis can be one of the most economical practices for improving true

spacecraft reliability.

• Misuse of reliability analysis as a predictor of on-orbit spacecraft performance can lead to

great cost and schedule expense with little-to-no true reliability increase and potentially even

major reliability decreases such as program cancellation. Examples include implementing full

redundancy as a hard requirement and mandating all class 1 electronics parts.

• Small satellites and systems have some inherent benefits toward addressing reliability both

mathematically and in real terms. Small satellites also tend to have shorter schedules, allowing

newer, often better technologies to be used in the design. Industry-wide, a mix of both small

and large space systems can best address the wide range of space missions, users, and

reliability needs.

• Each space mission developer should create program reliability plans based on conscious value

judgments of the true reliability provided by each of the available practices given its unique

program circumstances.

8. CONCLUSION

Ultimately this paper strives to advance the industry-wide understanding necessary to better

achieve reliable, available systems for users. We hope these thoughts have been useful and

appreciate in advance your efforts toward this lofty goal.

9. REFERENCES

[1] “Satellites & Launches Trend Down,” Aerospace America, January 2004, Marco Cáceres,

Teal Group, http://www.aiaa.org/aerospace/images/articleimages/pdf/insightsjanuary04.pdf

[2] http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=43688

[3] “A study of on-orbit spacecraft failures” by Mak Tafazoli, Canadian Space Agency, Canada

Received 5 December 2007; available online 31 October 2008

[4] “The Cosmos on a Shoestring, Small Spacecraft for Space and Earth Science, Appendix B”

by Liam Sarsfield, a RAND Publication

[5] “Space Launch Vehicle Reliability”, I-Shih Chang, Crosslink the Aerospace Corporation

Magazine. http://www.aero.org/publications/crosslink/winter2001/03.html

[6] Bureau of Transportation Statistics, Research and Innovative Technology Administration,

http://www.transtats.bts.gov/Data_Elements.aspx?Data=2

[7] OAG Aviation & PlaneCrashInfo.com accident database, 1985 – 2009