
Planning to Fail: Incorporating Reliability into Design

and Mission Planning for Mobile Robots

Stephen B. Stancliff

CMU-RI-TR-09-38

Submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Robotics.

The Robotics Institute

Carnegie Mellon University

Pittsburgh, Pennsylvania 15213

September, 2009

Thesis Committee:

John Dolan, Chair

Brett Browning

Michael Nechyba

Ashitey Trebi-Ollennu, California Institute of Technology, JPL

Copyright © 2009 by Stephen B. Stancliff. All rights reserved.

ABSTRACT

Current mobile robots generally fall into one of two categories as far as reliability

is concerned – highly unreliable, or very expensive. Most fall into the first category,

requiring teams of graduate students or staff engineers to coddle them in the days

and hours before a brief demonstration. The few robots that exhibit very high

reliability, such as those used by NASA for planetary exploration, are very expensive.

In order for mobile robots to become more widely used in real-world environments,

they will need to have reliability in between these two extremes. In many applications

some amount of unreliability is acceptable if it results in reduced costs. Even in

applications where a failure probability very near zero is desired (such as planetary

exploration), the ability to design robots to a specific reliability goal should allow

us to reduce the costs of these highly reliable robots by designing them to be “just

reliable enough” to complete the mission, rather than designing them to be “as

reliable as possible.”

In order to design mobile robots with respect to reliability, we need quantitative

models for predicting robot reliability and for relating reliability to other design

parameters such as cost. To date, however, there has been very little formal

discussion of reliability in the mobile robotics literature, and no general method

has been presented for quantitatively predicting the reliability of mobile robots.


This thesis focuses on this problem of predicting reliability for mobile robots

and in particular for teams of mobile robots, and proposes solutions for using

reliability as a design input for several mobile robot design problems:

• Given a choice of components from which to assemble a robot, how do we

select the ones that will optimize the tradeoff of reliability against other

factors such as cost?

• Given a choice of robots from which to assemble a multirobot team, how do

we select the ones which will optimize the reliability tradeoffs for the entire

robot team?

• Given a multirobot team and a list of mission tasks, how do we assign tasks

to team members in order to maximize the probability of completing the

mission?


Table of Contents

List of Tables
List of Figures

Chapter 1. INTRODUCTION
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Outline

Chapter 2. SINGLE-ROBOT RELIABILITY
  2.1 Related Work
  2.2 Reliability Background
    2.2.1 Types of robot failures
    2.2.2 Reliability model
    2.2.3 Consequences of constant hazard rate
  2.3 Robots and Tasks
    2.3.1 Robot decomposition
    2.3.2 Module–task and robot–task reliability
    2.3.3 Single-robot example
  2.4 Summary

Chapter 3. MULTIROBOT RELIABILITY
  3.1 Related Work
  3.2 Analytical Solutions for Simple Multirobot Missions
  3.3 Stochastic Simulation for Complex Multirobot Missions
  3.4 Example Results for a Complex Multirobot Mission
    3.4.1 Comparing teams having different numbers of robots
    3.4.2 Comparing teams with robots having different reliabilities
  3.5 Example – Repairable vs. Nonrepairable Robot Teams
  3.6 Summary

Chapter 4. DESIGN TRADEOFFS
  4.1 Cost
    4.1.1 Cost of reliability
    4.1.2 Expected mission reward
    4.1.3 Overall cost–reliability relationship
  4.2 Example – Multirobot Team Size
    4.2.1 Introduction
    4.2.2 Analysis
  4.3 Operating Conditions
    4.3.1 Extrapolation of MTTF to other operating points
    4.3.2 Operating envelope
  4.4 Summary

Chapter 5. MISSION PLANNING
  5.1 Background
  5.2 Illustrating Example
  5.3 Simulation Results
    5.3.1 Minimax utility function
    5.3.2 Differences in plan durations
    5.3.3 Overall planner performance metric
    5.3.4 Minisum utility function
  5.4 Summary

Chapter 6. INCOMPLETE MISSION PLANNERS
  6.1 Greedy Planner
    6.1.1 Incorporating reliability
    6.1.2 Incorporating reliability – revised method
  6.2 Less-Greedy Planner
  6.3 Compromise Planner
  6.4 Summary

Chapter 7. CONCLUSIONS
  7.1 Summary
  7.2 Contributions
  7.3 Future Research

Appendix A. Subsystem Reliability Data
Appendix B. Expected Value Calculation
Bibliography


List of Tables

2.1 Module usage during sampling task
2.2 Components comprising power subsystem
2.3 Robot subsystem reliabilities
2.4 Module reliabilities during sampling task
3.1 Subsystem usage by task
3.2 Module–task reliabilities
4.1 Baseline team costs and rewards
5.1 Robot and target parameters
5.2 Plan durations
5.3 Naive and expected durations
A.1 Power
A.2 Communications
A.3 Computation & Sensing
A.4 Mobility
A.5 Manipulator
B.1 Robot and target parameters
B.2 Plan durations and probabilities
B.3 Plan durations – expected (minimax)
B.4 Plan durations – minimax (expected)


List of Figures

2.1 The bathtub curve
2.2 Failure rates making up bathtub curve
2.3 Modular robot concept
2.4 NASA Hierarchical System Terminology
3.1 Possible paths for simple mission
3.2 Same mission as Figure 3.1 but with one repair allowed
3.3 State–transition diagram for complex mission
3.4 Different numbers of robots
3.5 Closeup of area of interest from Figure 3.4
3.6 Different component reliabilities
3.7 Total work completed; two-component robots
3.8 Improvement of repairable team over nonrepairable team
3.9 Total work completed; six-component robots
3.10 Effect of failure rate on repairable team superiority
4.1 Relative cost of rovers as function of component reliability
4.2 Expected value of mission as a function of component reliability
4.3 Net expected gain
4.4 Comparison of equal-cost teams
4.5 Effect of operating conditions on bearing MTTF
4.6 Lines of constant MTTF
5.1 Exploration mission
5.2 Chosen plan and backups
5.3 Plan with shortest expected duration
5.4 Suboptimal allocations
5.5 Suboptimal allocations
5.6 Average increase in mission duration as a function of target count
5.7 Expected increase in duration as a function of target count
5.8 Expected increase in mission duration
5.9 Expected increase in power consumption
6.1 Planner comparison (first approach)
6.2 Planner comparison (second approach)
6.3 How often reliability-enhanced planner chooses a better plan
6.4 Expected increase in mission duration (greedy planner)
6.5 Expected increase in mission duration (less-greedy planner)
6.6 Comparison of greedy and less-greedy planners
6.7 Comparison of less-greedy and compromise planners
6.8 How the number of plans tested affects performance


Chapter 1

INTRODUCTION

1.1 Motivation

Many of the most promising applications for mobile robots are those that reduce

or eliminate the need for humans to perform tasks in dangerous environments.

Examples include space exploration, mining, and toxic waste cleanup. For mobile

robots to succeed in keeping humans from these dangers, these robots must be

highly reliable so that people do not have to enter the dangerous area to repair or

replace failed robots.

Unfortunately, most current mobile robots have poor reliability, requiring frequent

maintenance and repair. Historical failure data for small field robots reveal that

they are either broken or under repair approximately half of the time [1].

Notable exceptions to this observation are the planetary rovers built and operated

for NASA by the Jet Propulsion Laboratory (JPL). The current Mars Exploration

Rovers (MER), for instance, have now been in operation on Mars for more than


five years. There are few, if any, other mobile robots that have operated for as long

as a year without repair.

The reliability of NASA rovers is achieved through the use of highly robust

components as well as component redundancy, both of which lead to the robots

being very expensive. The cost of the first MER rover was approximately $150M

[2]. Other than space exploration and perhaps a few military applications, it is

hard to imagine many applications for which a robot price tag in the hundreds

of millions of dollars will be acceptable. Therefore, the current NASA design

paradigm of making robots as reliable as possible is not broadly applicable.

Even in the realm of planetary exploration, the current design paradigm may not

be able to provide the reliability required for future missions. In the near future,

NASA intends to send rovers to Mars for missions lasting an order of magnitude

longer than the original MER mission. Using the current design paradigm,

increasing the mission duration by an order of magnitude requires that the rover

be built using components with failure rates an order of magnitude lower. Since

NASA rovers already make use of some of the most reliable components available,

it is doubtful whether components with an order-of-magnitude greater reliability

are available, let alone affordable.

Both of these situations – the unreliability of mobile robots in general and the high

cost of reliable NASA rovers – reveal the need for principled consideration of

reliability as a design parameter for robots and robot missions. To date, however,


there has been very little formal discussion of reliability in the robotics literature,

and no general methods have been presented for using quantitative reliability as an

input for robot design or mission planning.


1.2 Overview

This thesis first addresses the question of how to apply existing quantitative reliability

estimation methods to mobile robots. Second, we present methods for using reliability

as an input parameter in the design of robots and multirobot teams. Finally, we

consider how knowledge of robot reliability can be used to improve the performance

of multirobot planners.

The methods developed in this thesis have their roots in the field of reliability

engineering. We have developed a formal representation for mobile robots that

allows us to apply common reliability engineering models in a systematic way to

determine the probability that a robotic mission will be successfully completed.

As is often the case when applying outside fields of knowledge to mobile robots,

we have discovered areas where the existing methods fall short and must be modified

to deal with the complexities of mobile robot systems. In particular, traditional

methods of combining component reliabilities into system reliability assume that

the components are independent in terms of reliability. This assumption fails for

many multirobot missions. We present a framework for dealing with the complexity

that these dependencies add to the problem.

We then apply these methods to several single-robot and multirobot design problems,

examining the tradeoffs between cost and reliability, between repairable and


nonrepairable robots, and between teams of many low-reliability robots versus

teams of fewer high-reliability robots.

Finally, we examine the role of reliability in multirobot task allocation. Specifically,

we evaluate the hypothesis that ignoring robot reliability information when generating

initial task allocations leads to suboptimal performance. Our results show that this

is indeed the case and that the difference in performance is substantial.


1.3 Contributions

The contributions of this dissertation to the robotics community are the following:

• Introduction of models for reliability prediction from the reliability

engineering literature into the mobile robotics literature.

• A general theoretical framework for applying these reliability engineering

methods to robots and multirobot teams.

• The first quantitative analysis of cost–reliability tradeoffs for planetary rover

missions.

• The first quantitative analysis of the tradeoff between robot reliability and

team size.

• The first analysis of the benefit of using reliability knowledge a priori in

multirobot mission planning, providing strong evidence that planners which

do not use this information choose suboptimal plans.

• A model for how reliability knowledge can be used to improve task allocation

for incomplete multirobot planners.

Overall, these contributions allow us to begin considering robot reliability as an

input parameter for robot design and operation, rather than as an uncontrolled

output resulting from decisions made without regard to reliability.


1.4 Outline

Chapter 2 of this document introduces the relevant terminology and models we

borrow from the reliability engineering literature and describes the framework we

have developed for applying reliability engineering to the design of mobile robots.

This chapter presents an example problem in which we calculate the probability

that a planetary exploration rover will successfully complete a sampling mission.

Chapter 3 considers the design of multirobot teams. We first present a

straightforward method for finding analytical solutions to multirobot missions.

Because such methods are impractical for missions of significant complexity, we

then introduce a method that uses stochastic simulation for evaluation of more

complex missions. In this chapter we evaluate a multirobot mission in which

planetary rovers must work cooperatively to install a solar panel array. We also

analyze a problem introduced in [3] that compares the performance of a team of

repairable robots with that of a nonrepairable team.

In Chapter 4 we demonstrate how reliability can be integrated with other design

parameters in order to optimize robot design across multiple design constraints.

The bulk of this chapter examines the relationship between reliability and cost

in the context of a planetary exploration mission. We also examine how operating

conditions affect reliability and how single-point reliability data can be extrapolated

to off-design operating conditions.


Chapters 5 and 6 examine the role of quantitative reliability in multirobot mission

planning. Specifically, in Chapter 5 we test the hypothesis that it is necessary to

consider robot reliability when generating initial task allocations, rather than, as

is currently practiced, dealing with reliability only after the fact, by reallocation of

tasks after robot failure occurs. In Chapter 6 we extend these results by

demonstrating that reliability information can be used to improve plan selection

for heuristic planners.

Finally, in Chapter 7 we summarize the contributions of this thesis and discuss

future directions for this research.


Chapter 2

SINGLE-ROBOT RELIABILITY

This section provides an overview of methods and models from the reliability

engineering literature, introduces the representation we use for modeling mobile

robots using these methods, and shows how these methods can be applied in order

to predict the probability that a single robot will complete a given task.

2.1 Related Work

The reliability engineering literature (e.g., [4, 5]) provides methods for predicting

the reliability of simple electrical and mechanical devices and also for combining

these reliabilities to predict the reliability of complex systems. These methods can

be applied in a straightforward fashion to make predictions about the reliability of

simple robots executing simple missions. For many robotic applications, however,

there are violations of the assumptions upon which the basic reliability engineering

methods are based. We address these shortcomings in Chapters 3 and 4.

In the mobile robotics literature there is little formal discussion of reliability and


failure. When reliability is mentioned, it is usually qualitatively, and in passing.

Reference [6], for example, mentions intermittent hardware failures as an explanation

for gaps in experimental data but makes no attempt at characterizing the failures.

A handful of prior papers ([1, 7, 8, 9]) make use of reliability engineering for

analysis of mobile robot failure rates. Reference [1] provides an overview of robot

failure rates at the system level (i.e., robot model X failed Y times in Z hours of

operation) and also breaks down failures according to the subsystem that failed

(actuators, control system, power, or communications). Reference [7] extends the

work in [1] both by the inclusion of additional failure data of the same type and

also by addition of new categories of failure – those due to human error. Reference

[9] provides a detailed analysis of failures experienced by some of the robots used

in searching the World Trade Center wreckage in 2001. Reference [10] provides

failure data for robots used in long-term experiments as museum guides. While

these papers help us to begin to identify the causes of mobile robot failure, they do

not provide methods for predicting failures.

In contrast to the mobile robot literature, there is considerable work in the area of

reliability of robotic manipulators. Examples include [11] and [12]. This work in

manipulator reliability has the same shortcomings with respect to mobile robots

as the basic reliability methods, in that manipulators are generally simpler devices

than mobile robots and are used in fairly static environments. There is some relevant

work in the manipulator literature describing how environmental conditions affect


reliability (e.g., [13]), although here the environmental factor involved is a constant

rather than varying with time and task, as is often the case with mobile robots.

There is also a significant body of mobile robot research that deals tangentially

with reliability by describing methods for detecting and recovering from failures.

An example is [14], in which fault detection is used to discard faulty sensor readings

among a group of redundant sensors. Our work differs from these in that we are

developing methods to predict the probability of failure occurring rather than

to respond to failure after it occurs. Our methods are complementary to these

since an a priori understanding of the relative probabilities of different failures

is helpful for failure diagnosis.


2.2 Reliability Background

Reliability is “the ability of a system or component to perform its required functions

under stated conditions for a specified period of time” [15, p. 170]. In other words,

reliability is the probability that no failures will occur before a given time. When

evaluating the reliability of a system, we must first identify the ways in which the

system may fail and then determine the probabilities of those failures occurring.

2.2.1 Types of robot failures

Mobile robots are complex systems, and as a result there are many factors that

can cause the failure of a robotic mission. The laboratory robots with which most

researchers are familiar usually fail due to errors in design, manufacturing, or

usage. The hardware breaks down due to being poorly designed or constructed;

the software has bugs that are revealed only under the stress of a demonstration;

and both hardware and software fail because the robots are used in situations

beyond the intentions of their designers.

While these types of failures are significant and in fact are the dominating failure

modes for most mobile robots today ([1], [7], [8]), we contend that these failure

modes are not in need of modeling so much as they are in need of correction.

These failures are the result of errors that can be reduced, if not eliminated, through

process control. Methods for reducing errors in design, manufacturing, software

development, and operation are widely used in industry (e.g., ISO 9001 Quality


Management). As mobile robots become more common and are produced in a

manufacturing rather than a research environment, these engineering methods will

be applied, yielding a reduction in failures due to errors.

We can see that this is possible because some of today’s mobile robots are already

built with a high degree of quality control in design, construction, and operation.

For instance, the planetary rovers built for NASA by JPL are built to very high

standards of quality and controlled by highly trained operators, resulting in a very

low incidence of failures due to errors. This is largely because much greater care

is given to their design, construction, and operation in comparison with most other

current mobile robots.

Once failures due to errors are largely eliminated, as with the NASA rovers, the

remaining failures are due mostly to inherent properties of the materials from

which the robot is constructed. An example of such a failure is the degradation

of the lubricant in a bearing and the subsequent failure of the bearing. There is no

process control that will change the physical reality that lubricants break down

and unlubricated bearings fail. Instead, the robot must be designed taking into

account the possibility of bearing failure so as to guarantee that there is only a

small chance of failure during the mission.

The need to address such failures is suggested by the long-term robot museum

guide experiments described in [10]. The robots described in that paper possessed

self-diagnostic and self-resetting capabilities that allowed them to overcome many


design and implementation errors. The “remaining failures were eventually stochastic

and unpredictable, a tire failing here, and a light bulb failing there” [10, p. 4].

It is this latter type of failure with which we are primarily concerned. The reliability

engineering literature provides well-established models for this type of failure.

In the rest of this chapter we demonstrate how these models can be used for the

prediction of mobile robot failures and for choosing an optimal set of robot

components with respect to reliability requirements.

It is possible that some of the other types of failure mentioned above can also be

incorporated into these predictions. For instance, models for predicting software

errors have been proposed in the literature (e.g., [16], [17]). Incorporation of such

models would allow us to provide a more complete picture of mobile robot failure.

However, these models have been in existence for a much shorter time than hardware

reliability models and have been applied in very few cases, so their ability to predict

software failures is unproven. In addition, our goal is to produce tools that can be

used in the early stages of mission design. Most of the available software prediction

models require input data that are not available in those early stages. We therefore

confine ourselves in this work to the category of hardware failures described above.

2.2.2 Reliability model

Reliability models are descriptions of how the instantaneous failure rate (or hazard

rate) for a device changes over time. For many electronic and mechanical devices,


when the hazard rate is plotted as a function of time, the resulting curve resembles

Figure 2.1 [4, p. 109]. This characteristic shape is referred to as the bathtub curve.

The bathtub curve arises from the superposition of three distinct failure patterns.

The first is an exponentially decreasing failure rate which is high at the beginning

of the product life (Figure 2.2a). This corresponds to the period during which

items fail largely due to defects in materials or construction. There are many early

failures, but as defective items drop out of the population, the remaining population

has a lower hazard rate. This is referred to as the burn-in or infant mortality period.

The second pattern (Figure 2.2b) is an exponentially increasing failure rate which

becomes high when components have reached the ends of their useful lives and

begin to fail due to deterioration. This is referred to as the wearout phase.

The third failure pattern (Figure 2.2c) is a constant failure rate due to random

failures. In the middle section of the bathtub curve this failure pattern dominates.

This period is referred to as the service life or useful life.

Figure 2.1. The bathtub curve

Figure 2.2. Failure rates making up bathtub curve: (a) infant mortality, (b) wearout, (c) random failures
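The superposition of the three failure patterns described above can be sketched numerically. The following is a minimal illustration only: the function name, the exponential forms chosen for the early and late terms, and every parameter value are assumptions picked to produce a bathtub shape, not values taken from this thesis or from [4].

```python
import math


def bathtub_hazard(t, infant=(0.05, 0.01), random_rate=0.001,
                   wearout=(0.05, 0.01), life=10_000.0):
    """Hazard rate (failures/hour) at time t as the sum of three patterns.

    All defaults are hypothetical illustration values.
    infant  = (initial infant-mortality rate, decay constant per hour)
    wearout = (wearout rate reached at t == life, growth constant per hour)
    """
    a, b = infant
    c, d = wearout
    early = a * math.exp(-b * t)          # exponentially decreasing term
    late = c * math.exp(d * (t - life))   # exponentially increasing term
    return early + random_rate + late     # constant term dominates mid-life


# Sample the curve at the start, middle, and end of the hypothetical life.
for hours in (0.0, 5_000.0, 10_000.0):
    print(f"t = {hours:>8.0f} h  hazard = {bathtub_hazard(hours):.6f} per hour")
```

With these assumed parameters the early and late terms are negligible in mid-life, so the hazard rate there is essentially the constant random-failure rate.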

In applying the bathtub model to robots, we assume that there will be a period of

initial testing which allows burn-in failures to be dealt with before components are

placed into service. This is standard procedure for manufacturing of products with

small production runs or for products that use cutting-edge technology [18].

At the other end of the bathtub curve, we assume that the service life of components

will be specified by their manufacturers and observed in robot design and mission

planning so that robot modules will not wear out before the completion of the

mission for which they are being designed.

Given these two assumptions, the hazard rate of a robot component needs to be

known only during the service life phase. This hazard rate is modeled as a constant,

which is represented in the literature by λ. It is also important to know when the

end of the service life is reached. The reliability of a module can therefore be

modeled with just two parameters – the (constant) hazard rate and the service life

length.

2.2.3 Consequences of constant hazard rate

The reliability of a device with a constant hazard rate is

$$R(t) = e^{-\lambda t}. \tag{2.1}$$


Thus, the reliability of a device with a constant hazard rate is equal to one at the

beginning of the service life and decays exponentially towards zero.

Manufacturers usually specify the reliability of a device in terms of mean time to

failure (MTTF). During the service life, the hazard rate and MTTF are related as

$$\mathrm{MTTF} = \frac{1}{\lambda}. \tag{2.2}$$

The relationships in Eqs. 2.1 and 2.2 allow us to calculate the probability of failure

of a component from the manufacturer’s published MTTF. It is important to remember

that this MTTF applies only during the constant-hazard-rate portion of the bathtub

curve. It is a common mistake to assume that MTTF, since it has units of time,

measures how long an item will last. Most components will fail due to wearout

long before the time corresponding to MTTF is reached. Reference [19] has this to

say about the confusion:

Note that there is no direct connection or correlation between service life and failure rate. It is possible to design a very reliable product with a short life. A typical example is a missile for example: it has to be very, very reliable ([MTTF] of several million hours), but its service life is only 0.06 hours (4 minutes)! 25 year old humans have an [MTTF] of about 800 years (about 0.1%/year) but not many have a comparable service life. Just because something has a good [MTTF], it does not necessarily have a long service life as well. [19, p. 5]
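As an illustration of Eqs. 2.1 and 2.2, the following minimal Python sketch converts a published MTTF into a reliability over a given operating time. The function names and the numerical values (a 2,000,000-hour MTTF and a 0.06-hour mission, chosen to echo the missile example in the quotation) are assumptions for illustration, not data from this thesis.

```python
import math


def hazard_rate_from_mttf(mttf_hours):
    """Eq. 2.2 rearranged: lambda = 1 / MTTF (valid only during the service life)."""
    return 1.0 / mttf_hours


def reliability(mttf_hours, t_hours):
    """Eq. 2.1: probability of no failure before t under a constant hazard rate."""
    return math.exp(-hazard_rate_from_mttf(mttf_hours) * t_hours)


# Hypothetical missile-like device: very large MTTF, very short mission.
r_missile = reliability(2_000_000.0, 0.06)

# Reliability evaluated at t equal to the MTTF itself (hypothetical 1000-hour MTTF).
r_at_mttf = reliability(1_000.0, 1_000.0)

print(f"R(0.06 h) with MTTF = 2e6 h: {r_missile:.10f}")
print(f"R(MTTF): {r_at_mttf:.4f}")
```

Under the constant-hazard-rate model, the reliability at t = MTTF is e^{-1}, roughly 0.37, so even before wearout is considered fewer than four in ten devices would be expected to survive as long as the MTTF. This is consistent with the warning above that MTTF does not measure how long an item will last.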

One reason that the constant hazard rate model is commonly used is that

many reliability calculations are much simpler under this model than under other models.


This model is closed under the operations of combining devices in serial and

parallel, while most other reliability models are not [20, p. 47]. Another useful

property is the “lack of memory” of the exponential function; i.e., the probability

that a device will fail in the next hour of operation is the same at any point within

the constant-failure-rate portion of the bathtub curve [20, p. 43].
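The lack-of-memory property follows directly from Eq. 2.1. As a short worked derivation, let T denote the time to failure (a symbol introduced here for illustration) and let Δ be any additional operating interval within the service life:

```latex
P(T > t + \Delta \mid T > t)
  = \frac{R(t + \Delta)}{R(t)}
  = \frac{e^{-\lambda (t + \Delta)}}{e^{-\lambda t}}
  = e^{-\lambda \Delta}
  = R(\Delta)
```

The result does not depend on t, so a device that has already survived to time t has the same probability of surviving a further interval Δ as a device just entering its service life.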

Some devices used in mobile robots do not follow the constant-failure-rate model.

Devices that fail due to mechanical wearout, such as bearings, are better fitted by

more complex reliability models. However, the reliability of these devices can be

approximated piecewise by regions of constant failure rate. This allows for the

simpler calculations of the exponential model to be used within each segment of

the approximation [20, p. 44].
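The piecewise approximation can be sketched as follows: within each segment the hazard rate is constant, so the cumulative hazard is a sum of rate times duration, and the reliability is the exponential of its negative. The segment boundaries and rates for the bearing are hypothetical numbers chosen only to illustrate a rising wearout rate.

```python
import math


def piecewise_reliability(t_hours, segments):
    """Survival probability at t_hours for a piecewise-constant hazard rate.

    segments is a list of (end_time_hours, hazard_rate_per_hour) pairs sorted
    by end time; the last segment should extend at least to t_hours.
    """
    cumulative_hazard = 0.0
    start = 0.0
    for end, rate in segments:
        if t_hours <= start:
            break
        # Only the part of this segment that lies before t_hours contributes.
        cumulative_hazard += rate * (min(t_hours, end) - start)
        start = end
    return math.exp(-cumulative_hazard)


# Hypothetical bearing: low hazard rate early, rising as wearout sets in.
bearing = [(5_000.0, 1e-5), (8_000.0, 5e-5), (10_000.0, 2e-4)]
print(piecewise_reliability(6_000.0, bearing))
```

With a single segment the function reduces to the ordinary exponential model of Eq. 2.1.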

2.3 Robots and Tasks

2.3.1 Robot decomposition

In order to allow for a systematic evaluation of mobile robot reliability, we have

developed a formal method for representing robots and their subsystems. For

our analyses we consider robots to be made of multiple modules, as in Figure

2.3. We use module here to refer to a specific instantiation of a robot subsystem.

A subsystem is a functional division of the robot that can be conceived as being

engineered, assembled and tested independently of other subsystems (Figure 2.4).

The methods presented here are not dependent on this particular definition of

module or subsystem, but this definition makes it possible to consider modules

as interchangeable building blocks for robots, allowing us to use reliability and

other criteria to choose the best set of modules for a given mission.

Figure 2.3. Modular robot concept

Figure 2.4. NASA Hierarchical System Terminology [21]

Combining module reliabilities to obtain the reliability of an entire robot is

straightforward when the constant-hazard-rate model is used. Modules are considered

to be either in series or parallel. In a series combination, all modules must be

functioning for the system to function. In a parallel combination, only one module

must be functioning for the system to function.

For a series combination the overall reliability is the product of the component

reliabilities, i.e.,

\[ R_s = \prod_{i=1}^{N} R_i, \tag{2.3} \]

and the overall hazard rate is the sum of the hazard rates for the modules, i.e.,

\[ \lambda_s = \sum_{i=1}^{N} \lambda_i. \tag{2.4} \]

For modules in parallel, the overall unreliability (1 minus the reliability) is the

product of the component unreliabilities:

\[ (1 - R_S) = \prod_{i=1}^{N} (1 - R_i). \tag{2.5} \]

If the modules are identical (which is usually the case), then the overall hazard

rate for the parallel combination is

\[ \lambda_S = \lambda \cdot \left( 1 + \frac{1}{2} + \dots + \frac{1}{N} \right)^{-1}. \tag{2.6} \]
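The combination rules in Eq. 2.3 through 2.6 can be written as a few small helpers. This is a minimal sketch; the function names are illustrative and not from the thesis. Eq. 2.6 is the reciprocal of the MTTF of N identical exponential units in parallel, so it is best read as an equivalent constant rate.

```python
import math


def series_reliability(reliabilities):
    """Eq. 2.3: every module must function, so reliabilities multiply."""
    return math.prod(reliabilities)


def series_hazard_rate(hazard_rates):
    """Eq. 2.4: hazard rates of modules in series add."""
    return sum(hazard_rates)


def parallel_reliability(reliabilities):
    """Eq. 2.5: the system fails only if every module fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_hazard_rate_identical(hazard_rate, n):
    """Eq. 2.6: equivalent hazard rate of n identical modules in parallel."""
    return hazard_rate / sum(1.0 / k for k in range(1, n + 1))


# Two modules, each with 90% reliability over some task (illustrative values).
print(series_reliability([0.9, 0.9]))    # about 0.81
print(parallel_reliability([0.9, 0.9]))  # about 0.99
```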

2.3.2 Module–task and robot–task reliability

We use task completion as our fundamental utility measure. We assume that the

mission can be decomposed into distinct tasks and that these tasks are assigned to

particular robots. Using task completion as our fundamental measure allows us to

compare different robot and team configurations based on how many tasks they

can complete, how quickly they can complete tasks, the percentage of a complex

mission that they can complete, etc.

To calculate the probability that a module will survive a mission task (module–task

reliability), the MTTF of the module must be known, along with the expected

usage of the module during that task. For instance, we might be told that Task

1 will take six hours, using modules A and B for the entire six hours and using

module C for three hours.

In order to discretize the calculations, we evaluate the probability of failure only

at the end of a task. We assume that the entire task is completed whether there is a

failure or not; i.e., all failures occur after completion of the task. This assumption

does not limit the usefulness of our method because if one needs to know whether

a robot failed in the middle of a task, the tasks can simply be restated into subtasks

to provide a desired level of granularity.

Given the module–task reliability for each module, we can use the equations for

combining reliabilities (given in Section 2.3.1) to determine the probability that

the robot will survive the task (robot–task reliability).

2.3.3 Single-robot example

We now apply the formulas from the preceding sections to predict the probability

that a robot will complete a mission task. Consider a planetary exploration rover

that is tasked to extract core samples. The rover is composed of five modules:

• Power

• Computation and Sensing

• Mobility

• Communications

• Manipulator

Table 2.1. Module usage during sampling task

Module                   Usage (h)
Power                    8
Computation & Sensing    8
Mobility                 6
Communications           2
Manipulator              4

The duration of the task is eight hours, and the amount of time each module is

used during the task is given in Table 2.1.

For each module, we obtained reliability data from JPL that are representative

of components used in NASA’s planetary robots. As an example, the breakdown

of components and reliabilities for the power module is shown in Table 2.2. The

entire list of component reliabilities is provided in Appendix A.

Table 2.2. Components comprising power subsystem

Component                  Quantity   MTTF (h)
Battery                    2          4.8M
Battery control board      2          2.5M
Mission clock              1          10M
Power distribution unit    1          588k
Power control unit         1          5.3M
Shunt limiter              1          88k
Electrical heater          2          333k
Radioisotope heater        2          73k
Thermal switch             2          11k

Table 2.3. Robot subsystem reliabilities

Module                   MTTF (h)
Power                    4.20k
Computation & Sensing    4.77k
Mobility                 19.7k
Communications           11.9k
Manipulator              13.8k

These component reliabilities were combined for each module according to Eq.

2.4, giving the module MTTFs listed in Table 2.3.
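The series combination of Eq. 2.4 can be sketched for the power module using the values printed in Table 2.2. The sketch assumes that every listed unit, including duplicated ones, counts as a series element; the text does not state how the quantities were treated. With the rounded table values this gives roughly 4.35k h rather than the 4.20k h listed in Table 2.3. The source does not explain the gap, which may simply reflect rounding in the printed MTTFs.

```python
# (quantity, MTTF in hours) for the power module, as printed in Table 2.2.
POWER_COMPONENTS = {
    "Battery": (2, 4.8e6),
    "Battery control board": (2, 2.5e6),
    "Mission clock": (1, 10e6),
    "Power distribution unit": (1, 588e3),
    "Power control unit": (1, 5.3e6),
    "Shunt limiter": (1, 88e3),
    "Electrical heater": (2, 333e3),
    "Radioisotope heater": (2, 73e3),
    "Thermal switch": (2, 11e3),
}


def series_mttf(components):
    """Module MTTF when all units are in series (Eq. 2.2 and 2.4).

    Assumption: each of the `quantity` units contributes its own hazard rate.
    """
    total_hazard = sum(qty / mttf for qty, mttf in components.values())
    return 1.0 / total_hazard


print(f"Power module MTTF: {series_mttf(POWER_COMPONENTS):.0f} h")
```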

Using these overall module failure rates and Eq. 2.1, we can calculate the probability

that each module will still be functioning at the end of the task. For the power

module, this gives

\[ R = e^{-8/4202} = 99.810\%. \tag{2.7} \]

The reliabilities for the other modules for this task are found similarly and are

shown in Table 2.4.

Table 2.4. Module reliabilities during sampling task

Module                   Module–Task Reliability
Power                    99.810%
Computation & Sensing    99.832%
Mobility                 99.970%
Communications           99.983%
Manipulator              99.971%

Finally, we combine all of the module reliabilities using Eq. 2.3 to give an overall

robot–task reliability of 99.567%.
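The whole single-robot calculation can be reproduced in a few lines. The sketch below uses the module MTTFs from Table 2.3 (with 4202 h for the power module, as in Eq. 2.7) and the usage times from Table 2.1; the variable names are illustrative. Because the table MTTFs are rounded to three significant figures, agreement with Table 2.4 is to the precision shown there.

```python
import math

# Module MTTFs in hours (Table 2.3; power value as used in Eq. 2.7).
MODULE_MTTF_H = {
    "Power": 4202.0,
    "Computation & Sensing": 4770.0,
    "Mobility": 19700.0,
    "Communications": 11900.0,
    "Manipulator": 13800.0,
}

# Hours each module is used during the sampling task (Table 2.1).
USAGE_H = {
    "Power": 8,
    "Computation & Sensing": 8,
    "Mobility": 6,
    "Communications": 2,
    "Manipulator": 4,
}

# Module-task reliability from Eq. 2.1 with lambda = 1 / MTTF (Eq. 2.2).
module_task_reliability = {
    m: math.exp(-USAGE_H[m] / MODULE_MTTF_H[m]) for m in MODULE_MTTF_H
}

# Robot-task reliability: all modules in series (Eq. 2.3).
robot_task_reliability = math.prod(module_task_reliability.values())

for module, r in module_task_reliability.items():
    print(f"{module:<22} {100 * r:.3f}%")
print(f"Robot-task reliability: {100 * robot_task_reliability:.3f}%")
```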

2.4 Summary

In this chapter, we introduced definitions and models from the reliability engineering

literature and provided a representation that can be used to apply these models

to mobile robots. We then demonstrated how our representation can be used to

predict the probability that a single robot will complete a given task.

This type of calculation is useful for selecting components from which to build a

robot to meet mission requirements. For example, given several mobility modules

with different reliabilities and costs, we can calculate the robot–task reliabilities

for robots using each alternative and then select the lowest-cost module that meets

the mission requirements.

Chapter 3

MULTIROBOT RELIABILITY

The reliability engineering methods presented in the previous section fall short

when applied to multirobot teams. The equations for combining reliabilities of

subsystems (Eq. 2.3–2.6) assume that the failure of one subsystem is independent

of the failure of other subsystems. This is a reasonable assumption when combining

component reliabilities to create larger assemblies, and even when combining

assemblies to produce an entire robot. When combining robots to make a robot

team, however, this assumption is not reasonable in many cases. For most multirobot

missions, the failure of one robot will affect the tasking of other robots so that

their reliabilities are not independent. In this chapter we present a method that

overcomes this limitation, allowing us to calculate the probability of completing a

multirobot mission.

3.1 Related Work

There is considerable work in the multirobot domain that examines how to diagnose

and/or recover from robot failures. For example, [22] describes a behavior-based

robot control architecture that is able to adapt to robot failures and communication

failures, and [23] discusses detection and recovery from multiple types of failure

in a market-based planner. As in the single-robot domain, our work differs from

these in that we are developing methods to predict the probability of failure before

it occurs rather than to respond to failure after it occurs.

The only known work preceding ours in the area of predicting mobile robot team

reliability is [3]. That paper’s methods are similar to ours in that they are based in

the reliability engineering literature, but that work has a narrow focus on teams of

robots with cannibalistic repair capability. In contrast, we are developing a general

methodology that can be applied to a wide variety of robot teams and missions.

We revisit [3] in more depth in Section 3.5.

3.2 Analytical Solutions for Simple Multirobot Missions

For very simple missions, it is possible to enumerate by hand all of the possible

outcomes. One way of doing this is by drawing a tree diagram such as in

Figure 3.1. We can use such a tree to derive an analytical solution for the probability

of mission completion (PoMC).

For the two-task, two-robot mission shown in Figure 3.1, the analytical solution is

\[ \mathrm{PoMC} = P(R_1 T_1)\, P(R_2 T_1)\, P(R_1 T_2)\, P(R_2 T_2), \tag{3.1} \]

where $P(R_n T_m)$ is the probability that robot $n$ survives task $m$. If the robots are

identical, then this becomes

\[ \mathrm{PoMC} = P(T_1)^2\, P(T_2)^2. \tag{3.2} \]

Figure 3.1. Possible paths for simple mission. (R1+ = Robot 1 alive; R1− = Robot 1 dead)

3.3 Stochastic Simulation for Complex Multirobot Missions

In more realistic mission scenarios, the failure of one robot will have an impact on

the probability of failure of the other robots on the team so that the probability

of mission completion cannot be calculated in a straightforward manner. The

simplest example of such dependence is when there are a fixed number of tasks

to be completed and the tasks will be allocated among available robots until all

tasks are completed or all robots have failed. In this case, when one robot fails,

there is a greater amount of work to be performed by the remaining robots, which

increases the probability that they will fail.

Robot reliabilities are also interdependent when robot tasks are not executed

independently. This is the case, for instance, when there are tasks that require two

or more robots to work together. If one of the robots performing a joint task fails,

perhaps the remaining robots can still complete the task, but with increased stress

on their components, which then increases their chance of failure. Or perhaps that

task is abandoned, in which case the remaining robots have a decreased chance of

failure.

Another type of reliability interdependence is introduced if the robot team is capable

of repairing a failed team member. Since repairing a failed robot requires action

on the part of other robots, the failed robot is repaired at the cost of increased

probability of failure for the robots executing the repair. Repairing a failed team

member may therefore in some cases decrease the probability of mission completion.

Figure 3.2 illustrates how mission complexity increases when such interdependence

is introduced. This figure represents the same mission as Figure 3.1, but with the

addition of the ability to repair one failed robot. The addition of this single repair

capability has increased the number of leaf nodes from 7 to 25. For a realistic

scenario with several robots, multiple tasks, and perhaps dozens of spare parts,

the tree becomes complex enough that a direct analytical solution is infeasible.

For these more complex missions, we have developed a method of estimating

mission reliability using stochastic simulation. In this method, we represent the

mission using a state–transition diagram, as in Figure 3.3. (Details of the mission

represented by Figure 3.3 are given in Section 3.4.)

The state machines represented by these diagrams can be implemented in software

in order to explore the space stochastically. At each task node, the state of the

robot team is evaluated by choosing a random value between zero and one for

each module and comparing that value with the module–task reliability for that

module for the current task. The branch in the diagram corresponding to the resulting

team state is followed, and the process continues until the simulation reaches

either Success or Failure.

Figure 3.2. Same mission as Figure 3.1 but with one repair allowed

Figure 3.3. State–transition diagram for complex mission

The simulation is repeated many times, with each Success result being assigned

a score of one and each Failure result being assigned a score of zero. The average

score of a large number of trials then gives the overall probability of mission

completion.
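The scoring procedure can be sketched on the simple two-robot, two-task mission of Section 3.2, where the analytical answer (Eq. 3.2) is available as a check. This is a simplified illustration: it draws one random number per robot per task rather than one per module, and the two robot–task reliabilities are hypothetical values, not figures from the thesis.

```python
import random


def simulate_once(task_reliabilities, n_robots, rng):
    """One trial: score 1 if every robot survives every task, else 0."""
    for p_survive in task_reliabilities:
        for _ in range(n_robots):
            if rng.random() > p_survive:  # this robot fails the task
                return 0
    return 1


def estimate_pomc(task_reliabilities, n_robots, trials, seed=0):
    """Average score over many trials, an estimate of the PoMC."""
    rng = random.Random(seed)
    total = sum(
        simulate_once(task_reliabilities, n_robots, rng) for _ in range(trials)
    )
    return total / trials


# Hypothetical robot-task reliabilities for tasks T1 and T2.
p_t1, p_t2 = 0.95, 0.90
analytic = (p_t1 ** 2) * (p_t2 ** 2)  # Eq. 3.2 for two identical robots
estimate = estimate_pomc([p_t1, p_t2], n_robots=2, trials=100_000)
print(f"analytic = {analytic:.4f}, simulated = {estimate:.4f}")
```

The estimate should approach the analytical value as the number of trials grows; the sampling error shrinks roughly with the square root of the trial count.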

While this method has computational limitations, it is a significant improvement

over the direct analytical method, which can require days of tedious hand calculations

and has a high potential for human error.

3.4 Example Results for a Complex Multirobot Mission

Consider a planetary exploration mission where a team of robots is tasked to install

a solar panel array for a measurement and observation outpost. The mission consists

of carrying solar panels from the landing site to the outpost and then assembling

them. The size of the solar panels is such that two robots are needed to carry and

assemble one panel.

For the purposes of this analysis, the task of assembling a solar panel is broken

down into three subtasks:

• Transit to the outpost;

• Assemble the panel; and

• Return to the landing site.

The state–transition diagram for this mission was shown in Figure 3.3. Working

through that figure from the top, we see that if there are fewer than two robots

then the mission is a failure. If there are at least two robots and no panels are

left to be installed, then the mission is a success. If there are at least two

robots, and there are panels still remaining to be installed, then the robots will pair

off and carry panels to the outpost (Transit task). After the Transit task, if there

are fewer than two robots alive and if there are spare robots at the landing site,

then the spares will Transit to the outpost until at least two robots are available to

Assemble or until there are no more spare robots (in the latter case, the mission

fails). The robots then pair off to Assemble the panels, and any robots that survive

that task Return to the landing zone.

For this example all of the robots on the team are identical. The usage times for

each module for each task are shown in Table 3.1. These usage times along with

the subsystem reliabilities from Table 2.3 are used to calculate the module–task

reliabilities for this mission, which are shown in Table 3.2.
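A simplified version of this mission can be simulated as sketched below, using the MTTFs from Table 2.3 and the usage times from Table 3.1. Several details are assumptions made for illustration and are not taken from Figure 3.3: only one pair of robots works at a time while the rest wait at the landing site as spares, a task always completes and failures take effect afterwards (as in Section 2.3.2), and a robot stranded alone at the outpost ends the mission. The results are therefore not expected to reproduce the thesis figures.

```python
import math
import random

# Module MTTFs (h) from Table 2.3 and per-task usage (h) from Table 3.1.
MTTF_H = {"power": 4200.0, "comp_sensing": 4770.0, "mobility": 19700.0,
          "comms": 11900.0, "manipulator": 13800.0}
USAGE_H = {
    "transit": {"power": 6, "comp_sensing": 6, "mobility": 6,
                "comms": 2, "manipulator": 0},
    "assemble": {"power": 8, "comp_sensing": 4, "mobility": 8,
                 "comms": 4, "manipulator": 8},
    "return": {"power": 6, "comp_sensing": 6, "mobility": 6,
               "comms": 2, "manipulator": 0},
}


def module_task_reliabilities(mttf_scale=1.0):
    """Module-task reliabilities (Eq. 2.1) per task, with scaled MTTFs."""
    return {
        task: [math.exp(-hours / (MTTF_H[m] * mttf_scale))
               for m, hours in usage.items()]
        for task, usage in USAGE_H.items()
    }


def survives(task, rel, rng):
    """Random draw per module; the robot survives only if all modules do."""
    return all(rng.random() <= r for r in rel[task])


def simulate_mission(n_robots, n_panels, rel, rng):
    """Return True if all panels are installed (one working pair at a time)."""
    at_base, panels_left = n_robots, n_panels
    while True:
        if at_base < 2:
            return False  # fewer than two robots: mission fails
        if panels_left == 0:
            return True   # nothing left to install: mission succeeds
        at_base -= 2      # one pair carries a panel to the outpost
        at_outpost = sum(survives("transit", rel, rng) for _ in range(2))
        while at_outpost < 2 and at_base > 0:  # call up spares one by one
            at_base -= 1
            at_outpost += survives("transit", rel, rng)
        if at_outpost < 2:
            return False  # no spares left to complete the pair
        panels_left -= 1  # the panel is assembled; failures apply afterwards
        alive = sum(survives("assemble", rel, rng) for _ in range(2))
        at_base += sum(survives("return", rel, rng) for _ in range(alive))


def estimate_pomc(n_robots, n_panels, trials=4000, mttf_scale=1.0, seed=0):
    """Fraction of simulated missions that end in success."""
    rel = module_task_reliabilities(mttf_scale)
    rng = random.Random(seed)
    wins = sum(simulate_mission(n_robots, n_panels, rel, rng)
               for _ in range(trials))
    return wins / trials


results = {n: estimate_pomc(n, n_panels=30) for n in (2, 3, 4)}
for n, pomc in results.items():
    print(f"{n} robots, 30 panels: PoMC ~ {100 * pomc:.1f}%")
```

The `mttf_scale` argument (a hypothetical name) multiplies every module MTTF by the same factor, which is one way to explore lower-reliability components under the same assumptions.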

For the example mission scenario described above, once the tasks, the task durations,

and the baseline module reliabilities are established, then the input variables for

the model are

• the number of robots on the team,

• the reliability of the components used, and

• the mission duration (number of panels to be installed).

Table 3.1. Subsystem usage by task (h)

Subsystem                Transit   Assemble   Return
Power                    6         8          6
Computation & Sensing    6         4          6
Mobility                 6         8          6
Communications           2         4          2
Manipulator              0         8          0

By examining how the probability of mission success varies as these inputs are

changed, we can answer questions such as

• For a given mission duration and component reliability, what is the fewest

number of robots needed to meet a certain probability of mission completion?

and

• If additional robots are added beyond the minimum number, can we use

lower reliability components, and if so, how much lower?

We explore these questions in Sections 3.4.1 and 3.4.2, respectively.

3.4.1 Comparing teams having different numbers of robots

Figure 3.4 compares the simulation results for teams with different numbers of

robots, with all robots having the component reliabilities listed in the above tables.

We see from this figure that adding even one robot beyond the minimum (two)

increases the probability of mission success dramatically, even for relatively short

missions. However, there is a diminishing improvement as additional robots are

added to the team. We can use this figure to answer the first question above. For

example, for a mission specifying that 30 panels are to be installed with a probability

of mission completion of at least 95%, the team must include at least four

robots (Figure 3.5).

Table 3.2. Module–task reliabilities

Subsystem                Transit   Assemble   Return
Power                    99.86%    99.81%     99.86%
Computation & Sensing    99.87%    99.92%     99.87%
Mobility                 99.97%    99.96%     99.97%
Communications           99.98%    99.97%     99.98%
Manipulator              100%      99.94%     100%

Figure 3.4. Different numbers of robots. [Plot: probability of mission completion (%) versus mission duration (number of panels); curves for 2, 3, 4, and 5 robots.]

Figure 3.5. Closeup of area of interest from Figure 3.4. [Plot: probability of mission completion (%) for missions of 26 to 34 panels; curves for 2, 3, 4, and 5 robots, with the design point marked.]

3.4.2 Comparing teams with robots having different reliabilities

If additional robots are added beyond the minimum required, it should be possible

to use less-reliable components in those robots and still achieve a required mission

reliability. Figure 3.6 shows the simulation results for teams of four robots with

component reliabilities ranging from 10% to 100% of the baseline amounts from

Table 2.3.

When varying the reliability of the components, we apply a constant multiplier

to all of the subsystem MTTF values in Table 2.3. For instance, when we refer

to a team with 10% of the MTTF of the baseline team, we are multiplying all the

values in Table 2.3 by 10%.

Figure 3.6 shows that for very short missions a team of four robots with only 10%

of the reliability of the baseline team can provide a higher probability of mission

completion compared to the baseline two-robot team. As the length of the mission

increases, the reliability required for the four-robot team to equal the performance

of the baseline team increases, but the four-robot, 50%-lower-MTTF team still

outperforms the baseline team even for fairly long missions (on the order of a

year).

Figure 3.6. Different component reliabilities. [Plot: probability of mission completion (%); curves for 2 robots (100), 4 robots (50), 4 robots (25), and 4 robots (10).]

3.5 Example – Repairable vs. Nonrepairable Robot Teams

As mentioned earlier, there is one previous paper ([3]) in the literature that looks

at reliability as a design parameter for mobile robot teams. In this section we

compare our method to the one in that paper by analyzing the example mission

given in that paper.

The mission considered in [3] is one in which a team of robots is moving dirt. The

dirt-moving task is a continuous task, where the amount of dirt moved is proportional

to the total robot lifetime, where total robot lifetime is the sum of the lifetimes of

all robots on the team.

The robots making up a team are identical and are made of discrete modules.

When an individual module fails, a robot is dead. During its lifetime each robot

moves dirt at a constant rate.

The basic comparison made in [3] is between teams of repairable and nonrepairable

robots. For repairable teams, a robot can be repaired by a teammate using spare

modules. The spare modules are taken from other failed robots – at the beginning

of the mission there are no spares. Two conditions are therefore necessary for

repair to take place: There must be a functional robot to execute the repair, and

there must be spare modules of the correct type available. No time elapses

during a repair, and the repair task does not itself contribute to robot failure.

Using the method described in Section 3.3, we simulated this mission. Figures

3.7, 3.8, and 3.9 show, on the left, the results presented in [3] and, on the right, our

results. These figures show that, qualitatively, our results are very similar to those

in the previous paper.

One thing that is not specified in [3], and that makes exact comparison difficult, is

the failure rate, λ. Figure 3.10 shows the same results as Figure 3.8 for several

values of λ. While the overall conclusion (that repairable teams are superior)

remains the same, the degree of superiority depends highly on the failure rate. The

effects of varying failure rate are not addressed in [3].

Figure 3.7. Total work completed; two-component robots (left figure from [3]). [Plot: units of work completed versus number of robots; curves for nonrepairable and repairable teams.]

Figure 3.8. Percent improvement of repairable team over nonrepairable team; two-component robots (left figure from [3]). [Plot: percent increase in work completed (repairable/nonrepairable) versus number of robots.]

Figure 3.9. Total work completed, six-component robots (left figure from [3]). [Plot: units of work completed versus number of robots; curves for nonrepairable and repairable teams.]

Figure 3.10. Effect of failure rate on repairable team superiority. [Plot: percent increase in work completed (repairable/nonrepairable) versus number of robots; curves for λ = 0.80, 0.84, 0.88, 0.92, 0.96, and 0.99.]

These results show that our method is capable of achieving results similar to the

method in [3]. What is different is that the method used in that paper is an analytical

method, similar to that presented in Section 3.2 of this document and with all the

shortcomings of that method. The mission scenarios addressed in [3] are very

simplistic, and that paper fails to address the difficulty of using analytical methods

for complex missions. The most complex mission scenario presented in that paper

considers a team with three robots and two nonidentical modules, for which the

solution is given as

\[ \frac{18 l_1^2 + 49 l_1 \cdot l_2 + 18 l_2^2}{(l_1 + l_2)(3 l_1 + 2 l_2)(2 l_1 + 3 l_2)}. \tag{3.3} \]

The amount of time required to develop such analytical solutions, and the significant

likelihood for human error in their derivations, makes these methods undesirable

even for fairly simple missions. They become impractical for missions of any

significant complexity.

3.6 Summary

In this chapter, we showed how reliability prediction for multirobot teams is often

a different type of problem than for single robots due to the interdependence of

robot reliabilities, making analytical reliability solutions impractical for multirobot

missions that have significant complexity. We introduced a method using stochastic

simulation to estimate mission reliabilities for such missions, and we demonstrated

the use of this method to determine the optimal team size for a multirobot mission.

Finally, we used this method to analyze the relative effectiveness of repairable and

nonrepairable robot teams in revisiting a problem previously introduced into the

literature by [3]. Our results here demonstrate that our method can produce similar

results to the prior work, while also allowing for analysis beyond that shown in the

prior work.

Chapter 4

DESIGN TRADEOFFS

The methods presented in the previous chapters provide estimates of the probabilities

of task and mission completion. We have shown how these estimates can be used

to compare the performance of different robot teams. However, these reliability

estimates by themselves are not terribly useful for mission design. If reliability

existed in a vacuum, then we would simply build the most reliable robots possible

for every mission. In designing a real-world mission it is necessary to consider

other performance metrics and trade them off against reliability. In this chapter we

explore some of the possible tradeoffs that can be made.

4.1 Cost

One of the most important factors in robot mission design is cost. For a given

mission, we would like to be able to determine which team configuration will

meet the mission specifications, including reliability, at the lowest cost.

The reliability of planetary rovers is related to overall mission cost in two ways.

First, there is the increased cost associated with building higher-reliability rovers.

Second, there is the increased expected value of the mission when using

higher-reliability rovers due to a higher probability of mission success.

4.1.1 Cost of reliability

In choosing components from which to build rovers, a designer would usually

make choices among a small number of alternative components, each providing a

certain reliability for a certain cost. In the early stages of mission design, however,

the mission designer may not yet have information about specific components. In

this case, it is useful to have a parametric model of the cost–reliability relationship.

Reference [24] provides a general model for this relationship, which is given as

\[ c = \exp\left\{ (1 - f) \cdot \frac{R_i - R_{\min}}{R_{\max} - R_i} \right\}, \tag{4.1} \]

where $R_i$ is a reliability of interest between $R_{\min}$ and $R_{\max}$; $f$ is the feasibility of

reliability improvement (a number between 0 and 1); and $c$ is the ratio of the cost

of $R_i$ to the cost of $R_{\min}$.
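The cost model of Eq. 4.1 is easy to evaluate numerically. The sketch below is a direct transcription; the function name and the sample reliabilities are illustrative assumptions. It shows how the relative cost grows without bound as the target reliability approaches the maximum, and how a lower feasibility makes the same improvement more expensive.

```python
import math


def relative_cost(r_i, f, r_min=0.0, r_max=1.0):
    """Cost of reliability r_i relative to the cost of r_min (Eq. 4.1).

    f is the feasibility of reliability improvement, between 0 and 1.
    """
    if not (r_min <= r_i < r_max):
        raise ValueError("r_i must satisfy r_min <= r_i < r_max")
    return math.exp((1.0 - f) * (r_i - r_min) / (r_max - r_i))


for r in (0.5, 0.9, 0.99):
    print(f"R = {r}: cost ratio = {relative_cost(r, f=0.95):.3f}")
```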

Figure 4.1 shows the relative cost of rovers with differing component reliabilities.

The costs are plotted as a percentage of the baseline rover cost, using $R_{\min} = 0$,

$R_{\max} = 1$ and $f = 0.95$.

Launch costs are also significantly affected by rover reliability. More-reliable

rovers will weigh more, due to the generally larger size of more-reliable components

and also due to increased component redundancy. We have not found a model for

the reliability–weight relationship in the literature. As an initial approximation we

assume that the relationship between weight and reliability is directly linear and

that the relationship between launch costs and weight is also directly linear.

4.1.2 Expected mission reward

Any robotic mission must have some inherent value to it. For some missions there

will be an obvious economic or strategic value to which a dollar amount can be

assigned. For a mission that lacks such an obvious dollar value, the cost of the

mission itself can be used as a lower bound for this inherent mission value, since

the sponsors presumably expect some positive return on their investment.

Multiplying the probability of mission success by the inherent value of the mission

gives an expected reward for a given team configuration. For example, Figure 4.2

shows the relationship between component reliability and expected mission value

for a six-rover team performing the solar-panel-assembly mission described in

Chapter 3.

Figure 4.1. Relative cost of rovers as function of component reliability. [Plot: rover cost (% of baseline team) versus component reliability (% of baseline); curves for f = 0.95, 0.90, and 0.70.]

4.1.3 Overall cost–reliability relationship

Taking the expected mission value calculated above and subtracting the rover

development and launch costs gives an estimate of the net expected gain for the

mission. We ignore operating costs here since we expect them to be roughly constant

with respect to rover reliability (probably slightly higher for lower-reliability

rovers due to the increased need for human intervention).

Figure 4.2. Expected value of mission as a function of component reliability. [Plot: expected value (% of max value) versus component reliability (% of Table 3 values).]

In order to combine these costs meaningfully, we assign real dollar values to the

various costs for the baseline team (Table 4.1). These values are estimated from

the costs of the MER mission, along with the assumption that the rovers for this

mission would be somewhat cheaper and smaller than the MER rovers due to

advances in technology and also because they are single-purpose machines.
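The net expected gain described in this section is the expected reward minus the rover and launch costs. A minimal sketch using the baseline dollar values from Table 4.1 follows; the function name and the 95% probability of mission completion are illustrative assumptions. With these baseline values the costs sum to the mission value, so the baseline gain cannot exceed zero, which is consistent with the mission cost being used as a lower bound on its value.

```python
def net_expected_gain(pomc, mission_value, rover_cost, launch_cost):
    """Expected reward minus rover and launch costs (all in $ millions)."""
    return pomc * mission_value - rover_cost - launch_cost


# Baseline team values from Table 4.1 ($ millions).
ROVER_COST = 150.0
LAUNCH_COST = 300.0
MISSION_VALUE = 450.0

# Hypothetical probability of mission completion, for illustration only.
gain = net_expected_gain(0.95, MISSION_VALUE, ROVER_COST, LAUNCH_COST)
print(f"Net expected gain: {gain:.1f} $M")
```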

These values are used to calculate the net expected gain, which is plotted in

Figure 4.3a along with its constituent parts. The most significant thing revealed

by this figure is that there is clearly an optimal reliability range with respect to the

expected gain of the mission and that this optimal reliability is significantly lower

than the reliability of the baseline legacy design.

Figure 4.3a shows that for low-reliability rovers the cost of failure drives the net

expected gain down, while for very-high-reliability rovers the high cost of the

rovers themselves drives the expected gain down. The optimal reliability range

therefore lies in a middle region where neither of these costs is as high.

In order to evaluate the effects of some of our assumptions, we repeated the above

analysis for different values of the feasibility constant (since this value was arbitrary)

and of the mission inherent value (since we used a lower-bound estimate for this

value). These results are shown in Figures 4.3b and 4.3c. These figures show that

while the shape of the expected gain curve changes with these parameters, the

overall trends remain the same: Both figures support the argument that the optimal

range for mission reliability with respect to mission gain is at a lower level than

we would intuitively expect.

Table 4.1. Baseline team costs and rewards

Item                         Cost ($ Millions)
Robot cost (entire team)     150
Launch cost (entire team)    300
Inherent value of mission    450

Figure 4.3. Net expected gain. [Three plots of expected value, rover cost, launch cost, and expected gain in $ (millions): (a) f = 0.95, value = $450M; (b) f = 0.5, value = $450M; (c) f = 0.95, value = $900M.]

4.2 Example – Multirobot Team Size

Using the reliability–cost relationship presented in Section 4.1, we revisit the solar

panel mission from Chapter 3, with the goal of addressing a claim that has been

made in the literature about one benefit of multirobot systems.

4.2.1 Introduction

Applications of multirobot systems can be divided into two categories: those

where multiple robots are necessary for task completion and those where a single

robot could complete the task but where multiple robots are desirable for reasons

other than task completion. An example application falling into the first category

is soccer – a single robot cannot play soccer. An example application in the second

category is area coverage – while in many cases an area can be covered by a single

robot, it may be preferable to use more than one robot in order to cover the area

more quickly.

When the mission itself does not dictate a particular robot team configuration,

there are multiple requirements that a mission designer must consider. Three

important factors that we consider here are time, cost, and reliability.

Time can be a reason for using more robots than the minimum required because,

for some tasks, having extra robots can reduce the time required to complete the

task. For instance, in an area coverage task, multiple robots can work in parallel in

order to accomplish the task more quickly.

Cost is an important consideration in team size. There is the cost of additional

robots. There is the cost of robot components; more robust components cost more.

There are operating costs such as transportation and maintenance, which may be

higher for a larger team. Infrastructure costs are likely to be greater for a larger

team; for instance, a larger team may require more communications bandwidth.

The third performance criterion we consider here is reliability, expressed as the

probability of mission completion (PoMC). A requirement for a mission to have a

certain probability of successful completion can dictate the minimum number of

robots required for the mission. For example, if one robot has a 90% probability

of surviving a task, but the mission requirement is for a 97% probability of having

one robot survive the task, then one way to meet this requirement is by sending

two robots (giving a 99% chance that one would survive).
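The arithmetic in this example can be checked with a short sketch. It assumes that robot failures are independent and that the requirement is met when at least one robot survives; the function names are illustrative and do not come from the thesis.

```python
def prob_at_least_one_survives(p_single: float, n_robots: int) -> float:
    """Probability that at least one of n independent robots survives."""
    return 1.0 - (1.0 - p_single) ** n_robots


def min_robots_for_target(p_single: float, p_required: float) -> int:
    """Smallest team size whose survival probability meets the requirement."""
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    if not 0.0 < p_required < 1.0:
        raise ValueError("p_required must be strictly between 0 and 1")
    n = 1
    while prob_at_least_one_survives(p_single, n) < p_required:
        n += 1
    return n


# Values from the text: 90% per-robot survival, 97% requirement.
two_robot_survival = prob_at_least_one_survives(0.90, 2)
team_size = min_robots_for_target(0.90, 0.97)
print(f"{two_robot_survival:.2f}", team_size)
```

With the numbers from the text, two robots give a survival probability of 0.99, which exceeds the 97% requirement, so the minimum team size is two.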

These criteria (time, cost, reliability) are highly interdependent. As an example,

adding more robots to a mission increases the cost, but it can also reduce the amount

of time required to complete the mission. Reducing the mission duration means

that the robots don’t need to survive as long, so they can be built of lower-reliability

components, which reduces the cost.

These relationships among team size, component reliability, cost, time, and mission

success have been mentioned in the robotics literature, but only in passing and

only in qualitative terms. In particular, researchers often claim that multirobot

systems provide greater reliability than single-robot systems (e.g., [25], [26], [27],

[28]).

Superficially, such a claim seems obviously true – if three robots are sent to do a

task instead of one, there is a greater chance of completing the task. When one

examines the above claim in greater depth, however, finding the answer can be

complicated. In this example, the cost of completing the task has been tripled

by sending three robots. If these same additional funds were instead invested

to improve the reliability of a single robot, then which would be more likely to

complete the task – the three robots or the single superior robot? The answer is no

longer obvious.

4.2.2 Analysis

We briefly remind the reader of the mission previously described in Chapter 2:

• A team of robots is tasked to transport and assemble solar panels.

• The solar panels are large, so that two robots are required to carry and assemble

each panel.

The baseline team consists of a pair of highly reliable robots. Using the cost–reliability

relationship in Eq. 4.1, we can determine alternative team configurations with

the same overall cost. For example, we find that a team with four robots, each

made of components with 40% of the MTTF of the baseline components, would

cost about the same, using a feasibility of 0.5. Figure 4.4 shows the simulation

results for these two teams. We see here that the team with four lower-reliability

robots has a higher mission reliability than the baseline team for missions shorter

than 85 panels. The larger, lower-reliability team would therefore be the more

cost-effective solution for shorter missions, while the smaller, high-reliability team

would be more cost-effective for longer missions.

[Figure 4.4. Comparison of equal-cost teams. Axis label: PoMC (%). Series: 2R (100%), 4R (40%).]

4.3 Operating Conditions

The reliability engineering methods presented in Chapter 2 lack an explicit accounting

of operating and environmental conditions. Much of reliability engineering was

originally developed for analysis of systems installed in fairly static environments

such as nuclear power plants. Mobile robot components are exposed to dynamic

operating and environmental conditions, particularly in the case of planetary exploration

rovers, which, for example, are subjected to temperature differences of hundreds

of degrees between day and night. The reliabilities of many of the components

in a mobile robot will vary under different operating conditions. It is therefore

necessary to examine how the standard reliability engineering methods can be

adapted to take into account varying operating conditions.

4.3.1 Extrapolation of MTTF to other operating points

The MTTF provided by a device manufacturer represents the hazard rate under a

single set of operating conditions. In order to make reliability predictions over a

range of operating conditions, we need to extrapolate MTTF at different operating

conditions from the single-point MTTF. Models relating how operating conditions

affect reliability are available for many components. These relationships are used,

for instance, in accelerated-life testing, where devices are subjected to extreme

operating conditions in order to induce failure, and the observed failure rates are

then extrapolated back to normal operating conditions.

An example of a robot component whose reliability is affected by operating

conditions is a mechanical bearing. Such bearings are often found in robot motors

and joints. The failure rate of mechanical bearings is significantly affected by

operating conditions such as temperature, rotational speed, and load. Here we

show how the single-point MTTF for a mechanical bearing can be extrapolated

over a range of temperature and load conditions.

Reliability of bearings is often expressed by the L10 life, which is the time at which

10% of the population has failed. For a mechanical bearing the L10 value is given

by

$$L_{10} = \left(\frac{C}{P}\right)^{d} \cdot \left(\frac{10^{6}}{60\,n}\right), \qquad (4.2)$$

where C is the rated bearing load, P is the actual bearing load, d reflects the type

of bearing (d = 3.0 for a ball bearing, d = 3.3 for a roller bearing), and n is the

rotational speed [29].
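Eq. 4.2 can be evaluated directly, as in the following minimal sketch. It assumes the usual convention for this formula, in which n is in revolutions per minute and the result is in hours; the text does not state the units, and the load and speed values below are hypothetical.

```python
def l10_life_hours(rated_load: float, actual_load: float,
                   speed_rpm: float, exponent: float = 3.0) -> float:
    """Eq. 4.2: L10 = (C / P)^d * 10^6 / (60 n).

    rated_load and actual_load must share the same units.
    Assumes speed in rpm, which makes the result a life in hours.
    """
    if rated_load <= 0 or actual_load <= 0 or speed_rpm <= 0:
        raise ValueError("loads and speed must be positive")
    return (rated_load / actual_load) ** exponent * 1e6 / (60.0 * speed_rpm)


# Hypothetical ball bearing: C = 10 kN, P = 2 kN, n = 1000 rpm.
life = l10_life_hours(10.0, 2.0, 1000.0)
print(f"L10 life: {life:.0f} hours")
```

Because the load ratio is raised to the power d, doubling the applied load on a ball bearing (d = 3.0) cuts the L10 life by a factor of eight.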

Holding the speed constant and using d = 3.0, we find that the life is related to the

applied load as

$$\frac{L_{10}}{L_{10,0}} = \left(\frac{P_0}{P}\right)^{3}, \qquad (4.3)$$

where the subscript 0 indicates the manufacturer’s published reliability data.

To relate L10 life and hazard rate, we use Eq. 2.1 with R = 90%, giving

$$\lambda = -\frac{\ln(0.9)}{L_{10}}. \qquad (4.4)$$
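The conversion in Eq. 4.4 is sketched below. It assumes that Eq. 2.1 (not reproduced in this section) is the constant-hazard-rate model R(t) = exp(-λt), so that MTTF = 1/λ; the L10 value used is hypothetical.

```python
import math


def hazard_rate_from_l10(l10: float) -> float:
    """Eq. 4.4: hazard rate at which 90% of the population survives to L10."""
    if l10 <= 0:
        raise ValueError("l10 must be positive")
    return -math.log(0.9) / l10


def mttf_from_l10(l10: float) -> float:
    """MTTF = 1 / lambda, assuming a constant hazard rate."""
    return 1.0 / hazard_rate_from_l10(l10)


l10 = 2000.0  # hypothetical L10 life, in hours
lam = hazard_rate_from_l10(l10)
print(f"lambda = {lam:.3e} per hour, MTTF = {mttf_from_l10(l10):.0f} hours")
```

Under this assumption the MTTF is roughly 9.5 times the L10 life, since 1 / (-ln 0.9) ≈ 9.49.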

Combining Eq. 4.3 with Eq. 4.4 gives the relationship between hazard rate and

operating load:

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3}. \qquad (4.5)$$

Bearing life is also greatly affected by temperature since the lubricant in the bearing

breaks down faster at higher temperatures. The approximate relationship used for

the effect of temperature on bearing failure is that every 10 °C rise in temperature

doubles the failure rate [30], or

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = 2^{(T - T_0)/10}. \qquad (4.6)$$

We can combine multiple environmental factors, assuming that they are independent.

In this case we can determine the effect of combined load and temperature changes

on the MTTF of a bearing, which is

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3} \cdot 2^{(T - T_0)/10}. \qquad (4.7)$$

Eq. 4.7 is plotted in Figure 4.5. This figure shows that MTTF varies greatly even

over a fairly small range of temperatures and loads. This illustrates why the

single-point MTTF provided by manufacturers is inadequate to describe the reliability

of devices operating under significantly different conditions from those under

which the MTTF was established.
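To give a sense of the magnitudes involved, the following sketch evaluates the combined load and temperature scaling of Eq. 4.7 for a ball bearing (exponent 3). The function name and the operating point are illustrative assumptions, not values from the thesis.

```python
def mttf_ratio(load_ratio: float, delta_temp_c: float,
               exponent: float = 3.0) -> float:
    """MTTF / MTTF0 from Eq. 4.7.

    load_ratio   -- P / P0, actual load over the reference load
    delta_temp_c -- T - T0 in degrees Celsius
    """
    if load_ratio <= 0:
        raise ValueError("load_ratio must be positive")
    hazard_ratio = load_ratio ** exponent * 2.0 ** (delta_temp_c / 10.0)
    return 1.0 / hazard_ratio


# Hypothetical operating point: 20% more load and 20 degrees C hotter
# than the conditions behind the published MTTF.
print(f"MTTF / MTTF0 = {mttf_ratio(1.2, 20.0):.3f}")
```

At this operating point the hazard rate is 1.2^3 × 2^2 ≈ 6.9 times the reference value, so the MTTF drops to about 14% of the published figure.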

4.3.2 Operating envelope

Figure 4.6 shows some of the lines of constant MTTF resulting from Eq. 4.7.

These lines illustrate how operating conditions can be traded off against one another

as well as against reliability. For instance, if a robot is to be operated in a

high-temperature environment, it may be desirable to operate the robot motors

at lower speeds in order to compensate for the increased ambient temperature.

On the other hand, if the speed of the robot was a critical mission requirement,

then we could continue to operate the robot at full speed, but with a quantitative

understanding of the tradeoff being made with respect to reliability.

Such tradeoffs could be automated in a sophisticated rover that would monitor

ambient conditions and modify its mission profile in order to maintain a target

mission reliability, in much the same way that human workers will slow down

when working under adverse environmental conditions.
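One way such a tradeoff might be computed is to solve Eq. 4.7 for the load that holds the hazard rate at a chosen multiple of its reference value as the temperature changes. The sketch below does this for load and temperature only, since those are the two variables in Eq. 4.7; the function name and the example values are assumptions made for illustration.

```python
def load_ratio_for_hazard_ratio(delta_temp_c: float,
                                hazard_ratio: float = 1.0,
                                exponent: float = 3.0) -> float:
    """Solve Eq. 4.7 for P / P0 given T - T0 and a target lambda / lambda0.

    hazard_ratio = 1.0 keeps the MTTF at its published value.
    """
    if hazard_ratio <= 0:
        raise ValueError("hazard_ratio must be positive")
    return (hazard_ratio * 2.0 ** (-delta_temp_c / 10.0)) ** (1.0 / exponent)


# Hypothetical case: the environment is 30 degrees C hotter than the
# reference conditions and the published MTTF must be preserved.
print(f"allowed P / P0 = {load_ratio_for_hazard_ratio(30.0):.2f}")
```

In this case the temperature term contributes a factor of 2^3 = 8 to the hazard rate, so the load must be halved to compensate, since 0.5^3 = 1/8.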

Figure 4.5. Effect of operating conditions on bearing MTTF

Figure 4.6. Lines of constant MTTF

4.4 Summary

In this chapter, we showed how reliability can be traded off against other mission

design parameters. We first presented a cost–reliability relationship from the literature

and used this to examine how the various costs of a planetary mission contribute

to the overall expected value of the mission. Our results suggest that building

planetary rovers to the highest levels of reliability may not be cost-effective. We

also made use of the cost–reliability relationship to provide a quantitative evaluation

of the claim that teams with more lower-reliability robots are more reliable than

teams with fewer higher-reliability robots. Our results in this case show that this

claim is not universally true but must be evaluated in the context of specific mission

parameters. Finally, we looked at how operating conditions affect reliability –

specifically looking at how temperature and operating load affect the expected

life of a mechanical bearing.

Chapter 5

MISSION PLANNING

The previous chapters demonstrate how reliability can be used in the design of

robots and multirobot teams. In this chapter and the next we consider the role

of robot reliability in the process of mission planning for mobile robot teams.

Specifically, we examine here how knowledge of robot reliabilities can be used

to improve task allocation in the context of the multirobot exploration problem.

We take a simple exhaustive planner and compare the plan it chooses against the

optimal plan that takes into account robot failures and the backup plans that occur

after failure. Our results show that for this problem domain, making an initial plan

without regard to individual robot reliabilities results in choosing a suboptimal

plan most of the time and that the difference in mission performance between the

chosen plan and the optimal plan is usually substantial.

5.1 Background

For multirobot missions, it is necessary to allocate tasks among the team members

as part of the mission planning process. The specific task allocation chosen affects

the probabilities of failure for the robots, and thus the mission, since failure is a

function of usage.

In reviewing the robot mission planning literature, we find that there has been

substantial work in the area of detecting and recovering from robot failures (e.g.,

[14, 22, 31, 32]) and that several multirobot mission planning systems provide

mechanisms for reallocation of tasks among surviving team members after a robot

failure (e.g., [23, 33, 34]). However, all of these methods are reactive rather than

predictive, dealing with failure only after it occurs. Reference [23], for example,

describes a mission planning system that is able to recover from robot failure

because tasks can be reallocated. In this system, tasks are auctioned off to the

robot with the highest bid (or lowest, depending on the utility metric). This system

allows for task reallocation during the mission when new information changes the

valuation of tasks. For instance, if a robot suffers a component failure that impairs

its ability to perform its assigned tasks, it will change its valuation for those tasks,

and it can then subcontract tasks to another robot that has a better valuation for

those tasks.

While it is important to recover from robot failures, it would be better to minimize

the likelihood of such failures in the first place. One way to do so is to design

the robots to an appropriate level of reliability for the mission requirements, as

described in the previous chapters of this document. Another way is to operate the

robots in a way that minimizes the likelihood of failure. As discussed in

Chapter 4, operating conditions are one execution-time consideration in the probability

of robot failure.

Another operational consideration in robot failure is the assignment of tasks to

robots and the ordering of those tasks. The mission planning system influences

the likelihoods of robot failures because the initial assignment of tasks to robots

plays a role in determining the probabilities of robot failures during the mission.

For example, assigning a robot with a weak or damaged drive motor to a task that

requires it to travel a long distance results in a higher probability of that robot

failing than if the robot were assigned to a task that required less travel. This example

is intuitive because it assumes heterogeneous robots, but in this chapter we demonstrate

that even when team members are homogeneous, the assignment of tasks to robots

has a significant influence on robot failure. We are not aware of any existing work

that addresses the use of robot reliability information to improve multirobot task

allocation in this way.

One way to incorporate such reliability concerns into multirobot mission planning

would be to introduce a reliability component into the utility metric used by the

planner. Such an approach is unsatisfying for two reasons. The first is the

incommensurability of different components of a utility measure – how do we

combine dollars spent, meters traveled, and probability of failure into a single

metric? The second is that establishing a numeric reliability requirement is itself

a difficult problem that has been minimally explored for the mobile robot domain

– i.e., how do we decide if the reliability requirement for a mission should be 95%

rather than 96%?

In order to avoid these difficulties, we take a different approach: Rather than devising

new utility metrics that explicitly incorporate reliability, we instead look at how

robot reliability affects the utility metrics already being used. In plain language,

what we are not doing is taking “Find the solution with the shortest time” and

turning it into “Find the solution with the shortest time that also meets reliability

level X” but instead turning it into “Find the solution with the shortest expected

time” where the expected time takes into account the alternative outcomes that

occur when robots fail.

5.2 Illustrating Example

Consider a simple multirobot exploration mission with two identical robots and

two locations to be visited (Figure 5.1). The goal of the mission is for all target

locations to be visited in any order by any robot in the shortest total mission time.¹

Time is assumed here to be proportional to distance traversed.

[Figure 5.1. Exploration mission. Markers: robots R1, R2 and targets T1, T2.]

Each robot is defined by an (x, y) location and a reliability Pt, which is the probability

of surviving a one-unit traverse. Each target is defined by its (x, y) location. The

robot and target parameters used for this example are listed in Table 5.1 and illustrated

by Figure 5.1.
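Since Pt is defined per unit of traverse, the probability of surviving a longer traverse follows by compounding. The sketch below assumes that each unit of distance is an independent trial and that fractional distances can be handled by exponentiation; this compounding rule is an illustrative assumption, and the coordinates are those of Robot 1 and Target 1 in this example.

```python
import math


def traverse_survival(p_unit: float, distance: float) -> float:
    """Probability of surviving a traverse of the given length.

    Assumes each unit of distance is survived independently with
    probability p_unit, so survival over d units is p_unit ** d.
    """
    if not 0.0 <= p_unit <= 1.0 or distance < 0:
        raise ValueError("need 0 <= p_unit <= 1 and distance >= 0")
    return p_unit ** distance


# Robot 1 at (4, 12) travelling to Target 1 at (1, 1), with Pt = 0.99.
distance = math.dist((4, 12), (1, 1))
print(f"d = {distance:.1f}, survival = {traverse_survival(0.99, distance):.3f}")
```

Even with a per-unit reliability of 0.99, a traverse of about 11.4 units leaves only about an 89% chance of arriving, which is why the assignment of long traverses matters.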

¹ In other terms, to minimize the makespan.

Table 5.1. Robot and target parameters

            x     y     Pt
Robot 1     4     12    0.99
Robot 2     14    3     0.99
Target 1    1     1     -
Target 2    3     5     -

For a small number of robots and targets, it is feasible to exhaustively enumerate

the possible task assignments and then calculate the distance that each robot must

traverse to accomplish each plan (Table 5.2). The plan duration (dplan) is equal

to the greatest distance that any robot travels during that plan. The plan with the

smallest duration is then chosen. In this example, Plan B (Figure 5.2a) would be

chosen.
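The enumeration described above can be sketched as follows, using the coordinates from Table 5.1 and straight-line distances. This is a minimal illustration of exhaustive planning for this two-robot example, not the planner used in the thesis; the function and variable names are assumptions.

```python
import math
from itertools import permutations, product

ROBOTS = {"R1": (4, 12), "R2": (14, 3)}
TARGETS = {"T1": (1, 1), "T2": (3, 5)}


def route_length(start, route):
    """Distance travelled from start through the targets in route order."""
    total, position = 0.0, start
    for name in route:
        total += math.dist(position, TARGETS[name])
        position = TARGETS[name]
    return total


def enumerate_plans():
    """Return (routes, duration) for every assignment and visiting order."""
    robot_names = sorted(ROBOTS)
    target_names = sorted(TARGETS)
    plans = []
    # Choose an owner for each target, then every visiting order per robot.
    for owners in product(robot_names, repeat=len(target_names)):
        groups = {r: [t for t, o in zip(target_names, owners) if o == r]
                  for r in robot_names}
        orders = [list(permutations(groups[r])) for r in robot_names]
        for combo in product(*orders):
            routes = dict(zip(robot_names, combo))
            # Plan duration is the longest distance travelled by any robot.
            duration = max(route_length(ROBOTS[r], routes[r])
                           for r in robot_names)
            plans.append((routes, duration))
    return plans


best_routes, best_duration = min(enumerate_plans(), key=lambda plan: plan[1])
print(best_routes, f"{best_duration:.1f}")
```

For these coordinates the sketch yields six plans and selects the one that sends Robot 1 to Target 1 and Robot 2 to Target 2, with a duration of about 11.4, matching Plan B.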

Now consider what happens when a robot fails while executing this plan. If Robot 1

fails, then Robot 2 is assigned to visit Target 1 after reaching Target 2

(Figure 5.2b). If Robot 2 fails, then Robot 1 is assigned to Target 2 after reaching

Target 1 (Figure 5.2c). We assume here that tasks are not interrupted, so new

targets are assigned to surviving robots only after they complete their current

tasks.
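The cost of these backup plans can be estimated with a short sketch. It assumes that the failed robot fails before reaching its own target, that the surviving robot does not fail, and that the survivor finishes its current target before moving to the orphaned one, as stated above. The coordinates are those given for this example; the function name is illustrative.

```python
import math

R1, R2 = (4, 12), (14, 3)
T1, T2 = (1, 1), (3, 5)


def backup_duration(survivor_start, own_target, orphan_target):
    """Survivor completes its own target, then visits the orphaned target."""
    return (math.dist(survivor_start, own_target)
            + math.dist(own_target, orphan_target))


# Plan B: Robot 1 visits Target 1 and Robot 2 visits Target 2.
nominal = max(math.dist(R1, T1), math.dist(R2, T2))
if_r1_fails = backup_duration(R2, T2, T1)  # Robot 2 takes over Target 1
if_r2_fails = backup_duration(R1, T1, T2)  # Robot 1 takes over Target 2

print(f"nominal {nominal:.1f}, R1 fails {if_r1_fails:.1f}, "
      f"R2 fails {if_r2_fails:.1f}")
```

Under these assumptions either failure lengthens the mission from about 11.4 to roughly 15.7 or 15.9 distance units, so the duration of a plan when nothing fails does not by itself describe how the plan performs when failures are possible.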

Table 5.2. Plan durations (* indicates best plan)

Plan                 d(R1)   d(R2)   dplan
A (R1T1 + R1T2)      15.9    0       15.9
B (R1T1 + R2T2) *    11.4    11.2    11.4
C (R2T1 + R1T2)      7.62    13.2    13.2
D (R2T1 + R2T2)      0       17.6    17.6
E (R1T2 + R1T1)      11.5    0       11.5
F (R2T2 + R2T1)      0       15.7    15.7

[Figure 5.2. Panel titles: (a) Chosen plan (Plan B); (b) Backup for Plan B when Robot 1 fails (dashed line represents robot failure). Markers: robots R1, R2 and targets T1, T2.]
