Planning to Fail: Incorporating Reliability into Design
and Mission Planning for Mobile Robots
Stephen B. Stancliff
CMU-RI-TR-09-38
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Robotics.
The Robotics Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
September, 2009
Thesis Committee:
John Dolan, Chair
Brett Browning
Michael Nechyba
Ashitey Trebi-Ollennu, California Institute of Technology, JPL
Copyright © 2009 by Stephen B. Stancliff. All rights reserved.
ABSTRACT
Current mobile robots generally fall into one of two categories as far as reliability
is concerned – highly unreliable, or very expensive. Most fall into the first category,
requiring teams of graduate students or staff engineers to coddle them in the days
and hours before a brief demonstration. The few robots that exhibit very high
reliability, such as those used by NASA for planetary exploration, are very expensive.
In order for mobile robots to become more widely used in real-world environments,
they will need to have reliability in between these two extremes. In many applications
some amount of unreliability is acceptable if it results in reduced costs. Even in
applications where a failure probability very near zero is desired (such as planetary
exploration), the ability to design robots to a specific reliability goal should allow
us to reduce the costs of these highly reliable robots by designing them to be “just
reliable enough” to complete the mission, rather than designing them to be “as
reliable as possible.”
In order to design mobile robots with respect to reliability, we need quantitative
models for predicting robot reliability and for relating reliability to other design
parameters such as cost. To date, however, there has been very little formal
discussion of reliability in the mobile robotics literature, and no general method
has been presented for quantitatively predicting the reliability of mobile robots.
This thesis focuses on this problem of predicting reliability for mobile robots
and in particular for teams of mobile robots, and proposes solutions for using
reliability as a design input for several mobile robot design problems:
• Given a choice of components from which to assemble a robot, how do we
select the ones that will optimize the tradeoff of reliability against other
factors such as cost?
• Given a choice of robots from which to assemble a multirobot team, how do
we select the ones which will optimize the reliability tradeoffs for the entire
robot team?
• Given a multirobot team and a list of mission tasks, how do we assign tasks
to team members in order to maximize the probability of completing the
mission?
Table of Contents
List of Tables
List of Figures
Chapter 1. INTRODUCTION
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Outline
Chapter 2. SINGLE-ROBOT RELIABILITY
  2.1 Related Work
  2.2 Reliability Background
    2.2.1 Types of robot failures
    2.2.2 Reliability model
    2.2.3 Consequences of constant hazard rate
  2.3 Robots and Tasks
    2.3.1 Robot decomposition
    2.3.2 Module–task and robot–task reliability
    2.3.3 Single-robot example
  2.4 Summary
Chapter 3. MULTIROBOT RELIABILITY
  3.1 Related Work
  3.2 Analytical Solutions for Simple Multirobot Missions
  3.3 Stochastic Simulation for Complex Multirobot Missions
  3.4 Example Results for a Complex Multirobot Mission
    3.4.1 Comparing teams having different numbers of robots
    3.4.2 Comparing teams with robots having different reliabilities
  3.5 Example – Repairable vs. Nonrepairable Robot Teams
  3.6 Summary
Chapter 4. DESIGN TRADEOFFS
  4.1 Cost
    4.1.1 Cost of reliability
    4.1.2 Expected mission reward
    4.1.3 Overall cost–reliability relationship
  4.2 Example – Multirobot Team Size
    4.2.1 Introduction
    4.2.2 Analysis
  4.3 Operating Conditions
    4.3.1 Extrapolation of MTTF to other operating points
    4.3.2 Operating envelope
  4.4 Summary
Chapter 5. MISSION PLANNING
  5.1 Background
  5.2 Illustrating Example
  5.3 Simulation Results
    5.3.1 Minimax utility function
    5.3.2 Differences in plan durations
    5.3.3 Overall planner performance metric
    5.3.4 Minisum utility function
  5.4 Summary
Chapter 6. INCOMPLETE MISSION PLANNERS
  6.1 Greedy Planner
    6.1.1 Incorporating reliability
    6.1.2 Incorporating reliability – revised method
  6.2 Less-Greedy Planner
  6.3 Compromise Planner
  6.4 Summary
Chapter 7. CONCLUSIONS
  7.1 Summary
  7.2 Contributions
  7.3 Future Research
Appendix A. Subsystem Reliability Data
Appendix B. Expected Value Calculation
Bibliography
List of Tables
  2.1 Module usage during sampling task
  2.2 Components comprising power subsystem
  2.3 Robot subsystem reliabilities
  2.4 Module reliabilities during sampling task
  3.1 Subsystem usage by task
  3.2 Module–task reliabilities
  4.1 Baseline team costs and rewards
  5.1 Robot and target parameters
  5.2 Plan durations
  5.3 Naive and expected durations
  A.1 Power
  A.2 Communications
  A.3 Computation & Sensing
  A.4 Mobility
  A.5 Manipulator
  B.1 Robot and target parameters
  B.2 Plan durations and probabilities
  B.3 Plan durations – expected (minimax)
  B.4 Plan durations – minimax (expected)
List of Figures
  2.1 The bathtub curve
  2.2 Failure rates making up bathtub curve
  2.3 Modular robot concept
  2.4 NASA Hierarchical System Terminology
  3.1 Possible paths for simple mission
  3.2 Same mission as Figure 3.1 but with one repair allowed
  3.3 State–transition diagram for complex mission
  3.4 Different numbers of robots
  3.5 Closeup of area of interest from Figure 3.4
  3.6 Different component reliabilities
  3.7 Total work completed; two-component robots
  3.8 Improvement of repairable team over nonrepairable team
  3.9 Total work completed; six-component robots
  3.10 Effect of failure rate on repairable team superiority
  4.1 Relative cost of rovers as function of component reliability
  4.2 Expected value of mission as a function of component reliability
  4.3 Net expected gain
  4.4 Comparison of equal-cost teams
  4.5 Effect of operating conditions on bearing MTTF
  4.6 Lines of constant MTTF
  5.1 Exploration mission
  5.2 Chosen plan and backups
  5.3 Plan with shortest expected duration
  5.4 Suboptimal allocations
  5.5 Suboptimal allocations
  5.6 Average increase in mission duration as a function of target count
  5.7 Expected increase in duration as a function of target count
  5.8 Expected increase in mission duration
  5.9 Expected increase in power consumption
  6.1 Planner comparison (first approach)
  6.2 Planner comparison (second approach)
  6.3 How often reliability-enhanced planner chooses a better plan
  6.4 Expected increase in mission duration (greedy planner)
  6.5 Expected increase in mission duration (less-greedy planner)
  6.6 Comparison of greedy and less-greedy planners
  6.7 Comparison of less-greedy and compromise planners
  6.8 How the number of plans tested affects performance
Chapter 1
INTRODUCTION
1.1 Motivation
Many of the most promising applications for mobile robots are those that reduce
or eliminate the need for humans to perform tasks in dangerous environments.
Examples include space exploration, mining, and toxic waste cleanup. For mobile
robots to succeed in keeping humans from these dangers, these robots must be
highly reliable so that people do not have to enter the dangerous area to repair or
replace failed robots.
Unfortunately, most current mobile robots have poor reliability, requiring frequent
maintenance and repair. Historical failure data for small field robots reveal that
they are either broken or under repair approximately half of the time [1].
Notable exceptions to this observation are the planetary rovers built and operated
for NASA by the Jet Propulsion Laboratory (JPL). The current Mars Exploration
Rovers (MER), for instance, have now been in operation on Mars for more than
five years. There are few, if any, other mobile robots that have operated for as long
as a year without repair.
The reliability of NASA rovers is achieved through the use of highly robust
components as well as component redundancy, both of which lead to the robots
being very expensive. The cost of the first MER rover was approximately $150M
[2]. Other than space exploration and perhaps a few military applications, it is
hard to imagine many applications for which a robot price tag in the hundreds
of millions of dollars will be acceptable. Therefore, the current NASA design
paradigm of making robots as reliable as possible is not broadly applicable.
Even in the realm of planetary exploration, the current design paradigm may not
be able to provide the reliability required for future missions. In the near future,
NASA intends to send rovers to Mars for missions lasting an order of magnitude
longer than the original MER mission. Using the current design paradigm,
increasing the mission duration by an order of magnitude requires that the rover
be built using components with failure rates an order of magnitude lower. Since
NASA rovers already make use of some of the most reliable components available,
it is doubtful whether components with an order-of-magnitude greater reliability
are available, let alone affordable.
Both of these situations – the unreliability of mobile robots in general and the high
cost of reliable NASA rovers – reveal the need for principled consideration of
reliability as a design parameter for robots and robot missions. To date, however,
there has been very little formal discussion of reliability in the robotics literature,
and no general methods have been presented for using quantitative reliability as an
input for robot design or mission planning.
1.2 Overview
This thesis first addresses the question of how to apply existing quantitative reliability
estimation methods to mobile robots. Second, we present methods for using reliability
as an input parameter in the design of robots and multirobot teams. Finally, we
consider how knowledge of robot reliability can be used to improve the performance
of multirobot planners.
The methods developed in this thesis have their roots in the field of reliability
engineering. We have developed a formal representation for mobile robots that
allows us to apply common reliability engineering models in a systematic way to
determine the probability that a robotic mission will be successfully completed.
As is often the case when applying outside fields of knowledge to mobile robots,
we have discovered areas where the existing methods fall short and must be modified
to deal with the complexities of mobile robot systems. In particular, traditional
methods of combining component reliabilities into system reliability assume that
the components are independent in terms of reliability. This assumption fails for
many multirobot missions. We present a framework for dealing with the complexity
that these dependencies add to the problem.
We then apply these methods to several single-robot and multirobot design problems,
examining the tradeoffs between cost and reliability, between repairable and
nonrepairable robots, and between teams of many low-reliability robots versus
teams of fewer high-reliability robots.
Finally, we examine the role of reliability in multirobot task allocation. Specifically,
we evaluate the hypothesis that ignoring robot reliability information when generating
initial task allocations leads to suboptimal performance. Our results show that this
is indeed the case and that the difference in performance is substantial.
1.3 Contributions
The contributions of this dissertation to the robotics community are the following:
• Introduction of models for reliability prediction from the reliability
engineering literature into the mobile robotics literature.
• A general theoretical framework for applying these reliability engineering
methods to robots and multirobot teams.
• The first quantitative analysis of cost–reliability tradeoffs for planetary rover
missions.
• The first quantitative analysis of the tradeoff between robot reliability and
team size.
• The first analysis of the benefit of using reliability knowledge a priori in
multirobot mission planning, providing strong evidence that planners which
do not use this information choose suboptimal plans.
• A model for how reliability knowledge can be used to improve task allocation
for incomplete multirobot planners.
Overall, these contributions allow us to begin considering robot reliability as an
input parameter for robot design and operation, rather than as an uncontrolled
output resulting from decisions made without regard to reliability.
1.4 Outline
Chapter 2 of this document introduces the relevant terminology and models we
borrow from the reliability engineering literature and describes the framework we
have developed for applying reliability engineering to the design of mobile robots.
This chapter presents an example problem in which we calculate the probability
that a planetary exploration rover will successfully complete a sampling mission.
Chapter 3 considers the design of multirobot teams. We first present a
straightforward method for finding analytical solutions to multirobot missions.
Because such methods are impractical for missions of significant complexity, we
then introduce a method that uses stochastic simulation for evaluation of more
complex missions. In this chapter we evaluate a multirobot mission in which
planetary rovers must work cooperatively to install a solar panel array. We also
analyze a problem introduced in [3] that compares the performance of a team of
repairable robots with that of a nonrepairable team.
In Chapter 4 we demonstrate how reliability can be integrated with other design
parameters in order to optimize robot design across multiple design constraints.
The bulk of this chapter examines the relationship between reliability and cost
in the context of a planetary exploration mission. We also examine how operating
conditions affect reliability and how single-point reliability data can be extrapolated
to off-design operating conditions.
Chapters 5 and 6 examine the role of quantitative reliability in multirobot mission
planning. Specifically, in Chapter 5 we test the hypothesis that it is necessary to
consider robot reliability when generating initial task allocations, rather than, as
is currently practiced, dealing with reliability only after the fact, by reallocation of
tasks after robot failure occurs. In Chapter 6 we extend these results by
demonstrating that reliability information can be used to improve plan selection
for heuristic planners.
Finally, in Chapter 7 we summarize the contributions of this thesis and discuss
future directions for this research.
Chapter 2
SINGLE-ROBOT RELIABILITY
This section provides an overview of methods and models from the reliability
engineering literature, introduces the representation we use for modeling mobile
robots using these methods, and shows how these methods can be applied in order
to predict the probability that a single robot will complete a given task.
2.1 Related Work
The reliability engineering literature (e.g., [4, 5]) provides methods for predicting
the reliability of simple electrical and mechanical devices and also for combining
these reliabilities to predict the reliability of complex systems. These methods can
be applied in a straightforward fashion to make predictions about the reliability of
simple robots executing simple missions. For many robotic applications, however,
there are violations of the assumptions upon which the basic reliability engineering
methods are based. We address these shortcomings in Chapters 3 and 4.
In the mobile robotics literature there is little formal discussion of reliability and
failure. When reliability is mentioned, it is usually qualitatively, and in passing.
Reference [6], for example, mentions intermittent hardware failures as an explanation
for gaps in experimental data but makes no attempt at characterizing the failures.
A handful of prior papers ([1, 7, 8, 9]) make use of reliability engineering for
analysis of mobile robot failure rates. Reference [1] provides an overview of robot
failure rates at the system level (i.e., robot model X failed Y times in Z hours of
operation) and also breaks down failures according to the subsystem that failed
(actuators, control system, power, or communications). Reference [7] extends the
work in [1] both by the inclusion of additional failure data of the same type and
also by addition of new categories of failure – those due to human error. Reference
[9] provides a detailed analysis of failures experienced by some of the robots used
in searching the World Trade Center wreckage in 2001. Reference [10] provides
failure data for robots used in long-term experiments as museum guides. While
these papers help us to begin to identify the causes of mobile robot failure, they do
not provide methods for predicting failures.
In contrast to the mobile robot literature, there is considerable work in the area of
reliability of robotic manipulators. Examples include [11] and [12]. This work in
manipulator reliability has the same shortcomings with respect to mobile robots
as the basic reliability methods, in that manipulators are generally simpler devices
than mobile robots and are used in fairly static environments. There is some relevant
work in the manipulator literature describing how environmental conditions affect
reliability (e.g., [13]), although here the environmental factor involved is a constant
rather than varying with time and task, as is often the case with mobile robots.
There is also a significant body of mobile robot research that deals tangentially
with reliability by describing methods for detecting and recovering from failures.
An example is [14], in which fault detection is used to discard faulty sensor readings
among a group of redundant sensors. Our work differs from these in that we are
developing methods to predict the probability of failure occurring rather than
to respond to failure after it occurs. Our methods are complementary to these
since an a priori understanding of the relative probabilities of different failures
is helpful for failure diagnosis.
2.2 Reliability Background
Reliability is “the ability of a system or component to perform its required functions
under stated conditions for a specified period of time” [15, p. 170]. In other words,
reliability is the probability that no failures will occur before a given time. When
evaluating the reliability of a system, we must first identify the ways in which the
system may fail and then determine the probabilities of those failures occurring.
2.2.1 Types of robot failures
Mobile robots are complex systems, and as a result there are many factors that
can cause the failure of a robotic mission. The laboratory robots with which most
researchers are familiar usually fail due to errors in design, manufacturing, or
usage. The hardware breaks down due to being poorly designed or constructed;
the software has bugs that are revealed only under the stress of a demonstration;
and both hardware and software fail because the robots are used in situations
beyond the intentions of their designers.
While these types of failures are significant and in fact are the dominating failure
modes for most mobile robots today ([1],[7],[8]), we contend that these failure
modes are not in need of modeling so much as they are in need of correction.
These failures are the result of errors that can be reduced, if not eliminated, through
process control. Methods for reducing errors in design, manufacturing, software
development, and operation are widely used in industry (e.g., ISO 9001 Quality
Management). As mobile robots become more common and are produced in a
manufacturing rather than a research environment, these engineering methods will
be applied, yielding a reduction in failures due to errors.
We can see that this is possible because some of today’s mobile robots are already
built with a high degree of quality control in design, construction, and operation.
For instance, the planetary rovers built for NASA by JPL are built to very high
standards of quality and controlled by highly trained operators, resulting in a very
low incidence of failures due to errors. This is largely because much greater care
is given to their design, construction, and operation in comparison with most other
current mobile robots.
Once failures due to errors are largely eliminated, as with the NASA rovers, the
remaining failures are due mostly to inherent properties of the materials from
which the robot is constructed. An example of such a failure is the degradation
of the lubricant in a bearing and the subsequent failure of the bearing. There is no
process control that will change the physical reality that lubricants break down
and unlubricated bearings fail. Instead, the robot must be designed taking into
account the possibility of bearing failure so as to guarantee that there is only a
small chance of failure during the mission.
The need to address such failures is suggested by the long-term robot museum
guide experiments described in [10]. The robots described in that paper possessed
self-diagnostic and self-resetting capabilities that allowed them to overcome many
design and implementation errors. The “remaining failures were eventually stochastic
and unpredictable, a tire failing here, and a light bulb failing there” [10, p. 4].
It is this latter type of failure with which we are primarily concerned. The reliability
engineering literature provides well-established models for this type of failure.
In the rest of this chapter we demonstrate how these models can be used for the
prediction of mobile robot failures and for choosing an optimal set of robot
components with respect to reliability requirements.
It is possible that some of the other types of failure mentioned above can also be
incorporated into these predictions. For instance, models for predicting software
errors have been proposed in the literature (e.g., [16],[17]). Incorporation of such
models would allow us to provide a more complete picture of mobile robot failure.
However, these models have been in existence for a much shorter time than hardware
reliability models and have been applied in very few cases, so their ability to predict
software failures is unproven. In addition, our goal is to produce tools that can be
used in the early stages of mission design. Most of the available software prediction
models require input data that are not available in those early stages. We therefore
confine ourselves in this work to the category of hardware failures described above.
2.2.2 Reliability model
Reliability models are descriptions of how the instantaneous failure rate (or hazard
rate) for a device changes over time. For many electronic and mechanical devices,
when the hazard rate is plotted as a function of time, the resulting curve resembles
Figure 2.1 [4, p. 109]. This characteristic shape is referred to as the bathtub curve.
The bathtub curve arises from the superposition of three distinct failure patterns.
The first is an exponentially decreasing failure rate which is high at the beginning
of the product life (Figure 2.2a). This corresponds to the period during which
items fail largely due to defects in materials or construction. There are many early
failures, but as defective items drop out of the population, the remaining population
has a lower hazard rate. This is referred to as the burn-in or infant mortality period.
The second pattern (Figure 2.2b) is an exponentially increasing failure rate which
becomes high when components have reached the ends of their useful lives and
begin to fail due to deterioration. This is referred to as the wearout phase.
Figure 2.1. The bathtub curve
Figure 2.2. Failure rates making up bathtub curve: (a) infant mortality, (b) wearout, (c) random failures
The third failure pattern (Figure 2.2c) is a constant failure rate due to random
failures. In the middle section of the bathtub curve this failure pattern dominates.
This period is referred to as the service life or useful life.
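As an illustration only, the superposition of the three failure patterns can be sketched in Python. The functional forms follow the description above (a decaying exponential, a growing exponential, and a constant), but the function name `bathtub_hazard` and every numerical parameter are assumptions chosen to produce a bathtub-shaped curve; they are not taken from any data in this thesis.

```python
import math

def bathtub_hazard(t: float,
                   infant_scale: float = 5e-3,    # assumed initial infant-mortality rate (1/hour)
                   infant_decay: float = 1e-2,    # assumed decay constant (1/hour)
                   wearout_scale: float = 1e-6,   # assumed initial wearout rate (1/hour)
                   wearout_growth: float = 2e-3,  # assumed growth constant (1/hour)
                   random_rate: float = 1e-4) -> float:  # assumed constant random-failure rate
    """Hazard rate at time t as the sum of three failure patterns."""
    infant = infant_scale * math.exp(-infant_decay * t)       # exponentially decreasing
    wearout = wearout_scale * math.exp(wearout_growth * t)    # exponentially increasing
    return infant + wearout + random_rate                     # constant term dominates mid-life

h_early = bathtub_hazard(0.0)     # burn-in period: infant mortality dominates
h_mid = bathtub_hazard(1500.0)    # service life: close to the constant random rate
h_late = bathtub_hazard(5000.0)   # wearout phase: wearout term dominates
```

With these assumed parameters the hazard rate is high at the start, nearly flat in the middle, and high again at the end, which is the shape the service-life assumption relies on.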
In applying the bathtub model to robots, we assume that there will be a period of
initial testing which allows burn-in failures to be dealt with before components are
placed into service. This is standard procedure for manufacturing of products with
small production runs or for products that use cutting-edge technology [18].
At the other end of the bathtub curve, we assume that the service life of components
will be specified by their manufacturers and observed in robot design and mission
planning so that robot modules will not wear out before the completion of the
mission for which they are being designed.
Given these two assumptions, the hazard rate of a robot component needs to be
known only during the service life phase. This hazard rate is modeled as a constant,
which is represented in the literature by λ. It is also important to know when the
end of the service life is reached. The reliability of a module can therefore be
modeled with just two parameters – the (constant) hazard rate and the service life
length.
2.2.3 Consequences of constant hazard rate
The reliability of a device with a constant hazard rate is
$$ R(t) = e^{-\lambda t}. \tag{2.1} $$
Thus, the reliability of a device with a constant hazard rate is equal to one at the
beginning of the service life and decays exponentially towards zero.
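A minimal sketch of Eq. 2.1 in Python follows. The function name `reliability` and the example hazard rate of 1e-4 failures per hour are illustrative assumptions, not values from this thesis.

```python
import math

def reliability(hazard_rate: float, t: float) -> float:
    """Probability of surviving to time t under a constant hazard rate (Eq. 2.1)."""
    if hazard_rate < 0 or t < 0:
        raise ValueError("hazard_rate and t must be non-negative")
    return math.exp(-hazard_rate * t)

# Hypothetical module with an assumed hazard rate of 1e-4 failures per hour.
lam = 1e-4
r_start = reliability(lam, 0.0)      # equal to one at the start of the service life
r_1000h = reliability(lam, 1000.0)   # exp(-0.1), roughly 0.905
```

The time unit is arbitrary as long as the hazard rate and the elapsed time use the same one.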
Manufacturers usually specify the reliability of a device in terms of mean time to
failure (MTTF). During the service life, the hazard rate and MTTF are related as
$$ \mathrm{MTTF} = \frac{1}{\lambda}. \tag{2.2} $$
The relationships in Eq. 2.1 and 2.2 allow us to calculate the probability of failure
of a component from the manufacturer’s published MTTF. It is important to remember
that this MTTF applies only during the constant-hazard-rate portion of the bathtub
curve. It is a common mistake to assume that MTTF, since it has units of time,
measures how long an item will last. Most components will fail due to wearout
long before the time corresponding to MTTF is reached. Reference [19] has this to
say about the confusion:
    Note that there is no direct connection or correlation between service life
    and failure rate. It is possible to design a very reliable product with a
    short life. A typical example is a missile for example: it has to be very,
    very reliable ([MTTF] of several million hours), but its service life is only
    0.06 hours (4 minutes)! 25 year old humans have an [MTTF] of about 800 years
    (about 0.1%/year) but not many have a comparable service life. Just because
    something has a good [MTTF], it does not necessarily have a long service life
    as well. [19, p. 5]
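The point can be made concrete with a short Python sketch that combines Eq. 2.1 and 2.2. The function names and the example figures (a 10,000-hour MTTF and a 500-hour mission) are assumptions for illustration. The sketch also shows that, even if the constant hazard rate held indefinitely, only about 37% of devices would survive to a time equal to the MTTF, and that the 800-year MTTF in the quotation corresponds to a hazard rate of 0.125% per year, which the quotation rounds to about 0.1%.

```python
import math

def hazard_rate_from_mttf(mttf: float) -> float:
    """Constant hazard rate implied by a published MTTF (Eq. 2.2)."""
    if mttf <= 0:
        raise ValueError("mttf must be positive")
    return 1.0 / mttf

def failure_probability(mttf: float, t: float) -> float:
    """Probability of at least one failure before time t (1 minus Eq. 2.1)."""
    return 1.0 - math.exp(-hazard_rate_from_mttf(mttf) * t)

# Assumed example: component with a 10,000-hour MTTF on a 500-hour mission.
p_fail = failure_probability(10_000.0, 500.0)          # 1 - exp(-0.05), about 0.049

# Survival probability at t = MTTF under a constant hazard rate.
p_survive_to_mttf = 1.0 - failure_probability(10_000.0, 10_000.0)  # exp(-1), about 0.368

# Hazard rate implied by the 800-year MTTF in the quotation, per year.
human_rate_per_year = hazard_rate_from_mttf(800.0)     # 0.00125, i.e. 0.125% per year
```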
One reason the constant-hazard-rate model is commonly used is that
many reliability calculations are much simpler under this model than under other models.
This model is closed under the operations of combining devices in serial and
parallel, while most other reliability models are not [20, p. 47]. Another useful
property is the “lack of memory” of the exponential function; i.e., the probability
that a device will fail in the next hour of operation is the same at any point within
the constant-failure-rate portion of the bathtub curve [20, p. 43].
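The lack-of-memory property can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical hazard rate of 2e-4 failures per hour; it computes the probability of failing in the next hour given survival to a certain age, using only Eq. 2.1.

```python
import math

def conditional_failure_probability(lam: float, age: float, dt: float) -> float:
    """P(failure in (age, age + dt] | device has survived to age), constant hazard rate."""
    r_age = math.exp(-lam * age)            # R(age)
    r_later = math.exp(-lam * (age + dt))   # R(age + dt)
    return (r_age - r_later) / r_age

lam = 2e-4  # assumed hazard rate, failures per hour
p_new = conditional_failure_probability(lam, 0.0, 1.0)      # brand-new device
p_old = conditional_failure_probability(lam, 5000.0, 1.0)   # device with 5000 hours of service
```

Both values equal 1 - exp(-lam * dt) up to floating-point error, so the age of the device does not change its chance of failing in the next hour.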
Some devices used in mobile robots do not follow the constant-failure-rate model.
Devices that fail due to mechanical wearout, such as bearings, are better fitted by
more complex reliability models. However, the reliability of these devices can be
approximated piecewise by regions of constant failure rate. This allows for the
simpler calculations of the exponential model to be used within each segment of
the approximation [20, p. 44].
2.3 Robots and Tasks
2.3.1 Robot decomposition
In order to allow for a systematic evaluation of mobile robot reliability, we have
developed a formal method for representing robots and their subsystems. For
our analyses we consider robots to be made of multiple modules, as in Figure
2.3. We use module here to refer to a specific instantiation of a robot subsystem.
A subsystem is a functional division of the robot that can be conceived as being
engineered, assembled and tested independently of other subsystems (Figure 2.4).
The methods presented here are not dependent on this particular definition of
module or subsystem, but this definition makes it possible to consider modules
as interchangeable building blocks for robots, allowing us to use reliability and
other criteria to choose the best set of modules for a given mission.
Figure 2.3. Modular robot concept
Figure 2.4. NASA Hierarchical System Terminology [21]
Combining module reliabilities to obtain the reliability of an entire robot is
straightforward when the constant-hazard-rate model is used. Modules are considered
to be either in series or parallel. In a series combination, all modules must be
functioning for the system to function. In a parallel combination, only one module
must be functioning for the system to function.
For a series combination the overall reliability is the product of the component
reliabilities, i.e.,
\[ R_s = \prod_{i=1}^{N} R_i, \tag{2.3} \]
and the overall hazard rate is the sum of the hazard rates for the modules, i.e.,
\[ \lambda_s = \sum_{i=1}^{N} \lambda_i. \tag{2.4} \]
For modules in parallel, the overall unreliability (1 minus the reliability) is the
product of the component unreliabilities:
\[ (1 - R_S) = \prod_{i=1}^{N} (1 - R_i). \tag{2.5} \]
If the modules are identical (which is usually the case), then the overall hazard
rate for the parallel combination is
\[ \lambda_S = \lambda \cdot \left(1 + \frac{1}{2} + \dots + \frac{1}{N}\right)^{-1}. \tag{2.6} \]
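The combination rules in Eq. 2.3–2.6 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are our own assumptions, and module failures are assumed independent with constant hazard rates, as in the text.

```python
import math
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """Eq. 2.3: every module must work, so the reliabilities multiply."""
    return math.prod(reliabilities)


def series_hazard_rate(hazard_rates: Sequence[float]) -> float:
    """Eq. 2.4: constant hazard rates of modules in series add."""
    return sum(hazard_rates)


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """Eq. 2.5: the combination fails only if every module fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_hazard_rate(hazard_rate: float, n: int) -> float:
    """Eq. 2.6: effective hazard rate of n identical modules in parallel."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return hazard_rate / sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    print(series_reliability([0.9, 0.8]))
    print(parallel_reliability([0.9, 0.9]))
    print(parallel_hazard_rate(1.0e-3, 2))
```

For two identical modules in parallel, the harmonic sum is 1 + 1/2, so the effective hazard rate is two thirds of the single-module rate.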
2.3.2 Module–task and robot–task reliability
We use task completion as our fundamental utility measure. We assume that the
mission can be decomposed into distinct tasks and that these tasks are assigned to
particular robots. Using task completion as our fundamental measure allows us to
compare different robot and team configurations based on how many tasks they
can complete, how quickly they can complete tasks, the percentage of a complex
mission that they can complete, etc.
To calculate the probability that a module will survive a mission task (module–task
reliability), the MTTF of the module must be known, along with the expected
usage of the module during that task. For instance, we might be told that Task
1 will take six hours, using modules A and B for the entire six hours and using
module C for three hours.
In order to discretize the calculations, we evaluate the probability of failure only
at the end of a task. We assume that the entire task is completed whether there is a
failure or not; i.e., all failures occur after completion of the task. This assumption
does not limit the usefulness of our method because if one needs to know whether
a robot failed in the middle of a task, the tasks can simply be restated into subtasks
to provide a desired level of granularity.
Given the module–task reliability for each module, we can use the equations for
combining reliabilities (given in Section 2.3.1) to determine the probability that
the robot will survive the task (robot–task reliability).
2.3.3 Single-robot example
We now apply the formulas from the preceding sections to predict the probability
that a robot will complete a mission task. Consider a planetary exploration rover
that is tasked to extract core samples. The rover is composed of five modules:
• Power
• Computation and Sensing
• Mobility
• Communications
• Manipulator
Table 2.1. Module usage during sampling task
Module                   Usage (h)
Power                    8
Computation & Sensing    8
Mobility                 6
Communications           2
Manipulator              4
The duration of the task is eight hours, and the amount of time each module is
used during the task is given in Table 2.1.
For each module, we obtained reliability data from JPL that are representative
of components used in NASA’s planetary robots. As an example, the breakdown
of components and reliabilities for the power module is shown in Table 2.2. The
entire list of component reliabilities is provided in Appendix A.
Table 2.2. Components comprising power subsystem
Component                 Quantity   MTTF (h)
Battery                   2          4.8M
Battery control board     2          2.5M
Mission clock             1          10M
Power distribution unit   1          588k
Power control unit        1          5.3M
Shunt limiter             1          88k
Electrical heater         2          333k
Radioisotope heater       2          73k
Thermal switch            2          11k
Table 2.3. Robot subsystem reliabilities
Module                   MTTF (h)
Power                    4.20k
Computation & Sensing    4.77k
Mobility                 19.7k
Communications           11.9k
Manipulator              13.8k
These component reliabilities were combined for each module according to Eq.
2.4, giving the module MTTFs listed in Table 2.3.
Using these overall module failure rates and Eq. 2.1, we can calculate the probability
that each module will still be functioning at the end of the task. For the power
module, this gives
\[ R = e^{-8/4202} = 99.810\%. \tag{2.7} \]
The reliabilities for the other modules for this task are found similarly and are
shown in Table 2.4.
Table 2.4. Module reliabilities during sampling task
Module                   Module–Task Reliability
Power                    99.810%
Computation & Sensing    99.832%
Mobility                 99.970%
Communications           99.983%
Manipulator              99.971%
Finally, we combine all of the module reliabilities using Eq. 2.3 to give an overall
robot–task reliability of 99.567%.
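The single-robot calculation can be reproduced with a short script. The sketch below assumes the exponential form R = exp(−t/MTTF) used in Eq. 2.7 and takes the rounded MTTF values from Table 2.3 and the usage times from Table 2.1. The variable and function names are illustrative, and because the inputs are rounded (4.20k rather than 4202 h for the power module) the last printed digit may differ slightly from the tables.

```python
import math

# Module MTTFs in hours (rounded values from Table 2.3).
MODULE_MTTF_H = {
    "Power": 4200.0,
    "Computation & Sensing": 4770.0,
    "Mobility": 19700.0,
    "Communications": 11900.0,
    "Manipulator": 13800.0,
}

# Hours each module is used during the sampling task (Table 2.1).
USAGE_H = {
    "Power": 8.0,
    "Computation & Sensing": 8.0,
    "Mobility": 6.0,
    "Communications": 2.0,
    "Manipulator": 4.0,
}


def module_task_reliability(mttf_h: float, usage_h: float) -> float:
    """Survival probability under a constant hazard rate: exp(-t / MTTF)."""
    return math.exp(-usage_h / mttf_h)


module_reliability = {
    name: module_task_reliability(MODULE_MTTF_H[name], USAGE_H[name])
    for name in MODULE_MTTF_H
}

# All five modules are in series, so Eq. 2.3 applies.
robot_task_reliability = math.prod(module_reliability.values())

for name, r in module_reliability.items():
    print(f"{name:24s}{100 * r:8.3f}%")
print(f"{'Robot (series)':24s}{100 * robot_task_reliability:8.3f}%")
```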
2.4 Summary
In this chapter, we introduced definitions and models from the reliability engineering
literature and provided a representation that can be used to apply these models
to mobile robots. We then demonstrated how our representation can be used to
predict the probability that a single robot will complete a given task.
This type of calculation is useful for selecting components from which to build a
robot to meet mission requirements. For example, given several mobility modules
with different reliabilities and costs, we can calculate the robot–task reliabilities
for robots using each alternative and then select the lowest-cost module that meets
the mission requirements.
Chapter 3
MULTIROBOT RELIABILITY
The reliability engineering methods presented in the previous section fall short
when applied to multirobot teams. The equations for combining reliabilities of
subsystems (Eq. 2.3–2.6) assume that the failure of one subsystem is independent
of the failure of other subsystems. This is a reasonable assumption when combining
component reliabilities to create larger assemblies, and even when combining
assemblies to produce an entire robot. When combining robots to make a robot
team, however, this assumption is not reasonable in many cases. For most multirobot
missions, the failure of one robot will affect the tasking of other robots so that
their reliabilities are not independent. In this chapter we present a method that
overcomes this limitation, allowing us to calculate the probability of completing a
multirobot mission.
3.1 Related Work
There is considerable work in the multirobot domain that examines how to diagnose
and/or recover from robot failures. For example, [22] describes a behavior-based
robot control architecture that is able to adapt to robot failures and communication
failures, and [23] discusses detection and recovery from multiple types of failure
in a market-based planner. As in the single-robot domain, our work differs from
these in that we are developing methods to predict the probability of failure before
it occurs rather than to respond to failure after it occurs.
The only known work preceding ours in the area of predicting mobile robot team
reliability is [3]. That paper’s methods are similar to ours in that they are based in
the reliability engineering literature, but that work has a narrow focus on teams of
robots with cannibalistic repair capability. In contrast, we are developing a general
methodology that can be applied to a wide variety of robot teams and missions.
We revisit [3] in more depth in Section 3.5.
3.2 Analytical Solutions for Simple Multirobot Missions
For very simple missions, it is possible to enumerate by hand all of the possible
outcomes. One way of doing this is by drawing a tree diagram such as in
Figure 3.1. We can use such a tree to derive an analytical solution for the probability
of mission completion (PoMC).
For the two-task, two-robot mission shown in Figure 3.1, the analytical solution is
\[ \mathrm{PoMC} = P(R_1T_1)\,P(R_2T_1)\,P(R_1T_2)\,P(R_2T_2), \tag{3.1} \]
where $P(R_nT_m)$ is the probability that robot $n$ survives task $m$. If the robots are
identical, then this becomes
\[ \mathrm{PoMC} = P(T_1)^2\,P(T_2)^2. \tag{3.2} \]
Figure 3.1. Possible paths for simple mission. (R1+ = Robot 1 alive; R1− = Robot 1 dead)
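Eq. 3.1 and 3.2 amount to a product over every robot–task pair. The sketch below makes that explicit, assuming (as the equations do) that the mission succeeds only if every robot survives every task and that the survival events are independent. The function name and the numerical probabilities are hypothetical.

```python
import math
from typing import Sequence


def pomc_all_survive(p_survive: Sequence[Sequence[float]]) -> float:
    """Product form of Eq. 3.1: p_survive[n][m] is P(R_n T_m), the
    probability that robot n survives task m."""
    return math.prod(p for robot in p_survive for p in robot)


# Hypothetical per-task survival probabilities for two identical robots.
p_t1, p_t2 = 0.99, 0.98
identical_team = [[p_t1, p_t2], [p_t1, p_t2]]

pomc = pomc_all_survive(identical_team)
print(f"PoMC = {pomc:.6f}")

# For identical robots the product collapses to Eq. 3.2.
assert math.isclose(pomc, p_t1 ** 2 * p_t2 ** 2)
```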
3.3 Stochastic Simulation for Complex Multirobot Missions
In more realistic mission scenarios, the failure of one robot will have an impact on
the probability of failure of the other robots on the team so that the probability
of mission completion cannot be calculated in a straightforward manner. The
simplest example of such dependence is when there are a fixed number of tasks
to be completed and the tasks will be allocated among available robots until all
tasks are completed or all robots have failed. In this case, when one robot fails,
there is a greater amount of work to be performed by the remaining robots, which
increases the probability that they will fail.
Robot reliabilities are also interdependent when robot tasks are not executed
independently. This is the case, for instance, when there are tasks that require two
or more robots to work together. If one of the robots performing a joint task fails,
perhaps the remaining robots can still complete the task, but with increased stress
on their components, which then increases their chance of failure. Or perhaps that
task is abandoned, in which case the remaining robots have a decreased chance of
failure.
Another type of reliability interdependence is introduced if the robot team is capable
of repairing a failed team member. Since repairing a failed robot requires action
on the part of other robots, the failed robot is repaired at the cost of increased
probability of failure for the robots executing the repair. Repairing a failed team
member may therefore in some cases decrease the probability of mission completion.
Figure 3.2 illustrates how mission complexity increases when such interdependence
is introduced. This figure represents the same mission as Figure 3.1, but with the
addition of the ability to repair one failed robot. The addition of this single repair
capability has increased the number of leaf nodes from 7 to 25. For a realistic
scenario with several robots, multiple tasks, and perhaps dozens of spare parts,
the tree becomes complex enough that a direct analytical solution is infeasible.
For these more complex missions, we have developed a method of estimating
mission reliability using stochastic simulation. In this method, we represent the
mission using a state–transition diagram, as in Figure 3.3. (Details of the mission
represented by Figure 3.3 are given in Section 3.4.)
The state machines represented by these diagrams can be implemented in software
in order to explore the space stochastically. At each task node, the state of the
robot team is evaluated by choosing a random value between zero and one for
each module and comparing that value with the module–task reliability for that
module for the current task. The branch in the diagram corresponding to the resulting
team state is followed, and the process continues until the simulation reaches
either Success or Failure.
Figure 3.2. Same mission as Figure 3.1 but with one repair allowed
Figure 3.3. State–transition diagram for complex mission
The simulation is repeated many times, with each Success result being assigned
a score of one and each Failure result being assigned a score of zero. The average
score of a large number of trials then gives the overall probability of mission
completion.
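The following Python sketch illustrates the sampling-and-averaging procedure on a deliberately simple mission; it is not the simulator used for the results in this chapter. It assumes a fixed pool of identical tasks shared among identical robots (the first type of interdependence described above), evaluates failures only at the end of each task as in Section 2.3.2, and uses hypothetical function names and reliabilities.

```python
import random
from typing import Sequence


def robot_survives(module_reliabilities: Sequence[float],
                   rng: random.Random) -> bool:
    """Draw one uniform value in [0, 1) per module; the robot survives only
    if every draw falls below that module's module-task reliability."""
    return all(rng.random() < r for r in module_reliabilities)


def run_trial(n_robots: int, n_tasks: int,
              module_reliabilities: Sequence[float],
              rng: random.Random) -> int:
    """Simulate one mission; return 1 for Success and 0 for Failure."""
    alive, remaining = n_robots, n_tasks
    while remaining > 0:
        if alive == 0:
            return 0  # tasks remain but no robot is left to do them
        working = min(alive, remaining)
        # Failures are evaluated only at the end of a task, so every task
        # started in this round counts as completed.
        remaining -= working
        alive -= sum(not robot_survives(module_reliabilities, rng)
                     for _ in range(working))
    return 1


def estimate_pomc(n_robots: int, n_tasks: int,
                  module_reliabilities: Sequence[float],
                  trials: int = 20_000, seed: int = 0) -> float:
    """Average the 0/1 scores of many trials to estimate the PoMC."""
    rng = random.Random(seed)
    successes = sum(run_trial(n_robots, n_tasks, module_reliabilities, rng)
                    for _ in range(trials))
    return successes / trials


if __name__ == "__main__":
    # One robot, ten tasks, one module with 90% task reliability: the robot
    # must survive the first nine tasks, so the estimate should be near 0.9**9.
    print(estimate_pomc(1, 10, [0.9]))
    print(estimate_pomc(3, 10, [0.9]))
```

Fixing the seed makes a run repeatable; the statistical error of the estimate shrinks roughly with the square root of the number of trials.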
While this method has computational limitations, it is a significant improvement
over the direct analytical method, which can require days of tedious hand calculations
and has a high potential for human error.
3.4 Example Results for a Complex Multirobot Mission
Consider a planetary exploration mission where a team of robots is tasked to install
a solar panel array for a measurement and observation outpost. The mission consists
of carrying solar panels from the landing site to the outpost and then assembling
them. The size of the solar panels is such that two robots are needed to carry and
assemble one panel.
For the purposes of this analysis, the task of assembling a solar panel is broken
down into three subtasks:
• Transit to the outpost;
• Assemble the panel; and
• Return to the landing site.
The state–transition diagram for this mission was shown in Figure 3.3. Working
through that figure from the top, we see that if there are fewer than two robots
then the mission is a failure. If there are at least two robots and no
panels are left to be installed, then the mission is a success. If there are at least two
robots, and there are panels still remaining to be installed, then the robots will pair
off and carry panels to the outpost (Transit task). After the Transit task, if there
are fewer than two robots alive and if there are spare robots at the landing site,
then the spares will Transit to the outpost until at least two robots are available to
Assemble or until there are no more spare robots (in the latter case, the mission
fails). The robots then pair off to Assemble the panels, and any robots that survive
that task Return to the landing zone.
For this example all of the robots on the team are identical. The usage times for
each module for each task are shown in Table 3.1. These usage times along with
the subsystem reliabilities from Table 2.3 are used to calculate the module–task
reliabilities for this mission, which are shown in Table 3.2.
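The module–task reliabilities can be regenerated from the usage times and MTTF values with a few lines of Python. The sketch below assumes the same exponential model as Eq. 2.7 and uses the rounded MTTFs from Table 2.3; the dictionary and variable names are illustrative.

```python
import math

TASKS = ("Transit", "Assemble", "Return")

# Subsystem MTTFs in hours (rounded values from Table 2.3).
MTTF_H = {
    "Power": 4200.0,
    "Computation & Sensing": 4770.0,
    "Mobility": 19700.0,
    "Communications": 11900.0,
    "Manipulator": 13800.0,
}

# Hours of use per task, in the order given by TASKS (Table 3.1).
USAGE_H = {
    "Power": (6, 8, 6),
    "Computation & Sensing": (6, 4, 6),
    "Mobility": (6, 8, 6),
    "Communications": (2, 4, 2),
    "Manipulator": (0, 8, 0),
}

# Module-task reliability: exp(-t / MTTF) for each subsystem and task.
table = {
    name: tuple(math.exp(-t / MTTF_H[name]) for t in USAGE_H[name])
    for name in MTTF_H
}

# Robot-task reliability: all subsystems in series for each task (Eq. 2.3).
robot_task = tuple(
    math.prod(table[name][i] for name in MTTF_H) for i in range(len(TASKS))
)

print(f"{'Subsystem':24s}" + "".join(f"{t:>10s}" for t in TASKS))
for name, row in table.items():
    print(f"{name:24s}" + "".join(f"{100 * r:9.2f}%" for r in row))
print(f"{'Robot (series)':24s}" + "".join(f"{100 * r:9.2f}%" for r in robot_task))
```

The last row, the per-task robot reliability, is what a mission simulation would sample against if it did not need to track individual modules.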
For the example mission scenario described above, once the tasks, the task durations,
and the baseline module reliabilities are established, then the input variables for
the model are
• the number of robots on the team,
• the reliability of the components used, and
• the mission duration (number of panels to be installed).
Table 3.1. Subsystem usage by task (h)
Subsystem                Transit   Assemble   Return
Power                    6         8          6
Computation & Sensing    6         4          6
Mobility                 6         8          6
Communications           2         4          2
Manipulator              0         8          0
By examining how the probability of mission success varies as these inputs are
changed, we can answer questions such as
• For a given mission duration and component reliability, what is the fewest
number of robots needed to meet a certain probability of mission completion?
and
• If additional robots are added beyond the minimum number, can we use
lower reliability components, and if so, how much lower?
We explore these questions in Sections 3.4.1 and 3.4.2, respectively.
3.4.1 Comparing teams having different numbers of robots
Figure 3.4 compares the simulation results for teams with different numbers of
robots, with all robots having the component reliabilities listed in the above tables.
We see from this figure that adding even one robot beyond the minimum (two)
increases the probability of mission success dramatically, even for relatively short
missions. However, there is a diminishing improvement as additional robots are
added to the team. We can use this figure to answer the first question above. For
example, for a mission specifying that 30 panels are to be installed with a probability
of mission completion of at least 95%, the team must include at least four
robots (Figure 3.5).
Table 3.2. Module–task reliabilities
Subsystem                Transit   Assemble   Return
Power                    99.86%    99.81%     99.86%
Computation & Sensing    99.87%    99.92%     99.87%
Mobility                 99.97%    99.96%     99.97%
Communications           99.98%    99.97%     99.98%
Manipulator              100%      99.94%     100%
Figure 3.4. Different numbers of robots (x-axis: mission duration in number of panels; y-axis: probability of mission completion in %; curves: 2 robots, 3 robots, 4 robots, 5 robots)
Figure 3.5. Closeup of area of interest from Figure 3.4 (same axes and curves, with the design point marked)
3.4.2 Comparing teams with robots having different reliabilities
If additional robots are added beyond the minimum required, it should be possible
to use less-reliable components in those robots and still achieve a required mission
reliability. Figure 3.6 shows the simulation results for teams of four robots with
component reliabilities ranging from 10% to 100% of the baseline amounts from
Table 2.3.
When varying the reliability of the components, we apply a constant multiplier
to all of the subsystem MTTF values in Table 2.3. For instance, when we refer
to a team with 10% of the MTTF of the baseline team, we are multiplying all the
values in Table 2.3 by 10%.
Figure 3.6 shows that for very short missions a team of four robots with only 10%
of the reliability of the baseline team can provide a higher probability of mission
completion compared to the baseline two-robot team. As the length of the mission
increases, the reliability required for the four-robot team to equal the performance
of the baseline team increases, but the four-robot, 50%-lower-MTTF team still
outperforms the baseline team even for fairly long missions (on the order of a
year).
Figure 3.6. Different component reliabilities (x-axis: mission duration in number of panels; y-axis: probability of mission completion in %; curves: 2 robots (100), 4 robots (50), 4 robots (25), 4 robots (10))
3.5 Example – Repairable vs. Nonrepairable Robot Teams
As mentioned earlier, there is one previous paper ([3]) in the literature that looks
at reliability as a design parameter for mobile robot teams. In this section we
compare our method to the one in that paper by analyzing the example mission
given in that paper.
The mission considered in [3] is one where a team of robots are moving dirt. The
dirt-moving task is a continuous task, where the amount of dirt moved is proportional
to the total robot lifetime, where total robot lifetime is the sum of the lifetimes of
all robots on the team.
The robots making up a team are identical and are made of discrete modules.
When an individual module fails, a robot is dead. During its lifetime each robot
moves dirt at a constant rate.
The basic comparison made in [3] is between teams of repairable and nonrepairable
robots. For repairable teams, a robot can be repaired by a teammate using spare
modules. The spare modules are taken from other failed robots – at the beginning
of the mission there are no spares. Two conditions are therefore necessary for
repair to take place: There must be a functional robot to execute the repair, and
there must be spare modules of the correct type available. No time elapses
during a repair, and the repair task does not itself contribute to robot failure.
Using the method described in Section 3.3, we simulated this mission. Figures
3.7, 3.8, and 3.9 show, on the left, the results presented in [3] and, on the right, our
results. These figures show that, qualitatively, our results are very similar to those
in the previous paper.
One thing that is not specified in [3], and that makes exact comparison difficult, is
the failure rate, λ. Figure 3.10 shows the same results as Figure 3.8 for several
values of λ. While the overall conclusion (that repairable teams are superior)
remains the same, the degree of superiority depends highly on the failure rate. The
effects of varying failure rate are not addressed in [3].
Figure 3.7. Total work completed; two-component robots (left figure from [3]) (x-axis: number of robots; y-axis: units of work completed; curves: nonrepairable, repairable)
Figure 3.8. Percent improvement of repairable team over nonrepairable team; two-component robots (left figure from [3]) (x-axis: number of robots; y-axis: percent increase in work completed, repairable/nonrepairable)
Figure 3.9. Total work completed, six-component robots (left figure from [3]) (x-axis: number of robots; y-axis: units of work completed; curves: nonrepairable, repairable)
Figure 3.10. Effect of failure rate on repairable team superiority (x-axis: number of robots; y-axis: percent increase in work completed, repairable/nonrepairable; curves: λ = 0.80, 0.84, 0.88, 0.92, 0.96, 0.99)
These results show that our method is capable of achieving results similar to the
method in [3]. What is different is that the method used in that paper is an analytical
method, similar to that presented in Section 3.2 of this document and with all the
shortcomings of that method. The mission scenarios addressed in [3] are very
simplistic, and that paper fails to address the difficulty of using analytical methods
for complex missions. The most complex mission scenario presented in that paper
considers a team with three robots and two nonidentical modules, for which the
solution is given as
\[ \frac{18 l_1^2 + 49 l_1 l_2 + 18 l_2^2}{(l_1 + l_2)(3 l_1 + 2 l_2)(2 l_1 + 3 l_2)}. \tag{3.3} \]
The amount of time required to develop such analytical solutions, and the significant
likelihood of human error in their derivations, make these methods undesirable
even for fairly simple missions. They become impractical for missions of any
significant complexity.
3.6 Summary
In this chapter, we showed how reliability prediction for multirobot teams is often
a different type of problem than for single robots due to the interdependence of
robot reliabilities, making analytical reliability solutions impractical for multirobot
missions that have significant complexity. We introduced a method using stochastic
simulation to estimate mission reliabilities for such missions, and we demonstrated
the use of this method to determine the optimal team size for a multirobot mission.
Finally, we used this method to analyze the relative effectiveness of repairable and
nonrepairable robot teams in revisiting a problem previously introduced into the
literature by [3]. Our results here demonstrate that our method can produce similar
results to the prior work, while also allowing for analysis beyond that shown in the
prior work.
Chapter 4
DESIGN TRADEOFFS
The methods presented in the previous chapters provide estimates of the probabilities
of task and mission completion. We have shown how these estimates can be used
to compare the performance of different robot teams. However, these reliability
estimates by themselves are not terribly useful for mission design. If reliability
existed in a vacuum, then we would simply build the most reliable robots possible
for every mission. In designing a real-world mission it is necessary to consider
other performance metrics and trade them off against reliability. In this chapter we
explore some of the possible tradeoffs that can be made.
4.1 Cost
One of the most important factors in robot mission design is cost. For a given
mission, we would like to be able to determine which team configuration will
meet the mission specifications, including reliability, at the lowest cost.
The reliability of planetary rovers is related to overall mission cost in two ways.
First, there is the increased cost associated with building higher-reliability rovers.
Second, there is the increased expected value of the mission when using
higher-reliability rovers due to a higher probability of mission success.
4.1.1 Cost of reliability
In choosing components from which to build rovers, a designer would usually
make choices among a small number of alternative components, each providing a
certain reliability for a certain cost. In the early stages of mission design, however,
the mission designer may not yet have information about specific components. In
this case, it is useful to have a parametric model of the cost–reliability relationship.
Reference [24] provides a general model for this relationship, which is given as
\[ c = \exp\left\{ (1 - f) \cdot \frac{R_i - R_{\min}}{R_{\max} - R_i} \right\}, \tag{4.1} \]
where $R_i$ is a reliability of interest between $R_{\min}$ and $R_{\max}$; $f$ is the feasibility of
reliability improvement (a number between 0 and 1); and $c$ is the ratio of the cost
of $R_i$ to the cost of $R_{\min}$.
Figure 4.1 shows the relative cost of rovers with differing component reliabilities.
The costs are plotted as a percentage of the baseline rover cost, using $R_{\min} = 0$,
$R_{\max} = 1$ and $f = 0.95$.
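Eq. 4.1 is straightforward to evaluate numerically. The sketch below is a direct transcription under the stated definitions; the function name, the argument checks, and the choice to return infinity at the upper bound (where the denominator vanishes) are our own assumptions rather than part of the model in [24].

```python
import math


def cost_ratio(r_i: float, r_min: float = 0.0, r_max: float = 1.0,
               f: float = 0.95) -> float:
    """Eq. 4.1: cost of reliability r_i relative to the cost of r_min.

    f is the feasibility of reliability improvement, between 0 and 1.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if not r_min <= r_i <= r_max:
        raise ValueError("r_i must lie between r_min and r_max")
    if r_i == r_max:
        # The denominator of Eq. 4.1 is zero here; treat the cost as unbounded.
        return math.inf
    return math.exp((1.0 - f) * (r_i - r_min) / (r_max - r_i))


if __name__ == "__main__":
    # Hypothetical reliabilities, evaluated for the three f values in Figure 4.1.
    for f in (0.95, 0.90, 0.70):
        ratios = [cost_ratio(r, f=f) for r in (0.5, 0.9, 0.99)]
        print(f, [round(c, 3) for c in ratios])
```

A lower feasibility value makes the exponent larger, so the cost rises more steeply as the reliability approaches its upper bound.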
Launch costs are also significantly affected by rover reliability. More-reliable
rovers will weigh more, due to the generally-larger size of more-reliable components
and also due to increased component redundancy. We have not found a model for
the reliability–weight relationship in the literature. As an initial approximation we
assume that the relationship between weight and reliability is directly linear and
that the relationship between launch costs and weight is also directly linear.
4.1.2 Expected mission reward
Any robotic mission must have some inherent value to it. For some missions there
will be an obvious economic or strategic value to which a dollar amount can be
assigned. For a mission that lacks such an obvious dollar value, the cost of the
mission itself can be used as a lower bound for this inherent mission value, since
the sponsors presumably expect some positive return on their investment.
Multiplying the probability of mission success by the inherent value of the mission
gives an expected reward for a given team configuration. For example, Figure 4.2
shows the relationship between component reliability and expected mission value
for a six-rover team performing the solar-panel-assembly mission described in
Chapter 3.
Figure 4.1. Relative cost of rovers as function of component reliability (x-axis: component reliability in % of baseline; y-axis: rover cost in % of baseline team; curves: f = 0.95, f = 0.90, f = 0.70)
4.1.3 Overall cost–reliability relationship
Taking the expected mission value calculated above and subtracting the rover
development and launch costs gives an estimate of the net expected gain for the
mission. We ignore operating costs here since we expect them to be roughly constant
with respect to rover reliability (probably slightly higher for lower-reliability
rovers due to the increased need for human intervention).
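The net expected gain described here is simple arithmetic, sketched below. The function name is illustrative, the dollar figures are the baseline values listed in Table 4.1, and the probabilities of mission completion are hypothetical placeholders rather than simulation results.

```python
def net_expected_gain(pomc: float, mission_value: float,
                      rover_cost: float, launch_cost: float) -> float:
    """Expected mission value (PoMC times inherent value) minus the rover
    development and launch costs. Operating costs are ignored, as in the text."""
    if not 0.0 <= pomc <= 1.0:
        raise ValueError("pomc must be a probability")
    return pomc * mission_value - rover_cost - launch_cost


if __name__ == "__main__":
    # Baseline costs and value in $ millions (Table 4.1).
    rover_cost, launch_cost, mission_value = 150.0, 300.0, 450.0
    for pomc in (1.0, 0.9, 0.5):  # hypothetical PoMC values
        gain = net_expected_gain(pomc, mission_value, rover_cost, launch_cost)
        print(f"PoMC = {pomc:.2f}: net expected gain = {gain:.1f} $M")
```

Because the inherent value is set equal to the baseline mission cost (a lower bound), the baseline team can at best break even; a positive net gain requires either lower costs or a higher mission value.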
In order to combine these costs meaningfully, we assign real dollar values to the
various costs for the baseline team (Table 4.1). These values are estimated from
the costs of the MER mission, along with the assumption that the rovers for this
mission would be somewhat cheaper and smaller than the MER rovers due to
advances in technology and also because they are single-purpose machines.
Figure 4.2. Expected value of mission as a function of component reliability (x-axis: component reliability (% of Table 3 values); y-axis: expected value in % of max value)
These values are used to calculate the net expected gain, which is plotted in
Figure 4.3a along with its constituent parts. The most significant thing revealed
by this figure is that there is clearly an optimal reliability range with respect to the
expected gain of the mission and that this optimal reliability is significantly lower
than the reliability of the baseline legacy design.
Figure 4.3a shows that for low-reliability rovers the cost of failure drives the net
expected gain down, while for very-high-reliability rovers the high cost of the
rovers themselves drives the expected gain down. The optimal reliability range
therefore lies in a middle region where neither of these costs is as high.
In order to evaluate the effects of some of our assumptions, we repeated the above
analysis for different values of the feasibility constant (since this value was arbitrary)
and of the mission inherent value (since we used a lower-bound estimate for this
value). These results are shown in Figures 4.3b and 4.3c. These figures show that
while the shape of the expected gain curve changes with these parameters, the
overall trends remain the same: Both figures support the argument that the optimal
range for mission reliability with respect to mission gain is at a lower level than
we would intuitively expect.
Table 4.1. Baseline team costs and rewards
Item                         Cost ($ Millions)
Robot cost (entire team)     150
Launch cost (entire team)    300
Inherent value of mission    450
Figure 4.3. Net expected gain (x-axis: component reliability in % of baseline; y-axis: $ millions; curves: expected value, rover cost, launch cost, expected gain). Panels: (a) f = 0.95, value = $450M; (b) f = 0.5, value = $450M; (c) f = 0.95, value = $900M
4.2 Example – Multirobot Team Size
Using the reliability–cost relationship presented in Section 4.1, we revisit the solar
panel mission from Chapter 3, with the goal of addressing a claim that has been
made in the literature about one benefit of multirobot systems.
4.2.1 Introduction
Applications of multirobot systems can be divided into two categories: those
where multiple robots are necessary for task completion and those where a single
robot could complete the task but where multiple robots are desirable for reasons
other than task completion. An example application falling into the first category
is soccer – a single robot cannot play soccer. An example application in the second
category is area coverage – while in many cases an area can be covered by a single
robot, it may be preferable to use more than one robot in order to cover the area
more quickly.
When the mission itself does not dictate a particular robot team configuration,
there are multiple requirements that a mission designer must consider. Three
important factors that we consider here are time, cost, and reliability.
Time can be a reason for using more robots than the minimum required because,
for some tasks, having extra robots can reduce the time required to complete the
task. For instance, in an area coverage task, multiple robots can work in parallel in
order to accomplish the task more quickly.
Cost is an important consideration in team size. There is the cost of additional
robots. There is the cost of robot components; more robust components cost more.
There are operating costs such as transportation and maintenance, which may be
higher for a larger team. Infrastructure costs are likely to be greater for a larger
team; for instance, a larger team may require more communications bandwidth.
The third performance criterion we consider here is reliability, expressed as the
probability of mission completion (PoMC). A requirement for a mission to have a
certain probability of successful completion can dictate the minimum number of
robots required for the mission. For example, if one robot has a 90% probability
of surviving a task, but the mission requirement is for a 97% probability of having
one robot survive the task, then one way to meet this requirement is by sending
two robots (giving a 99% chance that one would survive).
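The arithmetic behind this example can be written out as a short sketch. It assumes that the robots are identical, that their failures are independent, and that the requirement is for at least one robot to survive; the function names are illustrative.

```python
def prob_at_least_one_survives(p_survive: float, n_robots: int) -> float:
    """Parallel combination (Eq. 2.5) of n identical, independent robots."""
    return 1.0 - (1.0 - p_survive) ** n_robots


def min_robots(p_survive: float, target: float) -> int:
    """Smallest team size for which at least one robot survives with
    probability no lower than the target."""
    if not 0.0 < p_survive <= 1.0:
        raise ValueError("p_survive must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 1
    while prob_at_least_one_survives(p_survive, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # The example from the text: 90% per-robot survival, 97% requirement.
    n = min_robots(0.90, 0.97)
    print(n, prob_at_least_one_survives(0.90, n))
```

With a 90% per-robot survival probability, one robot falls short of the 97% requirement and two robots exceed it.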
These criteria (time, cost, reliability) are highly interdependent. As an example,
adding more robots to a mission increases the cost, but it can also reduce the amount
of time required to complete the mission. Reducing the mission duration means
that the robots don’t need to survive as long, so they can be built of lower-reliability
components, which reduces the cost.
These relationships among team size, component reliability, cost, time, and mission
success have been mentioned in the robotics literature, but only in passing and
only in qualitative terms. In particular, researchers often claim that multirobot
systems provide greater reliability than single-robot systems (e.g., [25], [26], [27],
[28]).
Superficially, such a claim seems obviously true – if three robots are sent to do a
task instead of one, there is a greater chance of completing the task. When one
examines the above claim in greater depth, however, finding the answer can be
complicated. In this example, the cost of completing the task has been tripled
by sending three robots. If these same additional funds were instead invested
to improve the reliability of a single robot, then which would be more likely to
complete the task – the three robots or the single superior robot? The answer is no
longer obvious.
4.2.2 Analysis
We briefly remind the reader of the mission previously described in Chapter 2:
• A team of robots is tasked to transport and assemble solar panels.
• The solar panels are large, so that two robots are required to carry and assemble
each panel.
The baseline team consists of a pair of highly reliable robots. Using the cost–reliability
relationship in Eq. 4.1, we can determine alternative team configurations with
the same overall cost. For example, we find that a team with four robots, each
made of components with 40% of the MTTF of the baseline components, would
cost about the same, using a feasibility of 0.5. Figure 4.4 shows the simulation
results for these two teams. We see here that the team with four lower-reliability
robots has a higher mission reliability than the baseline team for missions shorter
than 85 panels. The larger, lower-reliability team would therefore be the more
cost-effective solution for shorter missions, while the smaller, high-reliability team
would be more cost-effective for longer missions.
[Plot: PoMC (%) versus mission duration (number of panels, 0 to 160) for two teams, 2R (100%) and 4R (40%)]
Figure 4.4. Comparison of equal-cost teams
4.3 Operating Conditions
The reliability engineering methods presented in Chapter 2 lack an explicit accounting
of operating and environmental conditions. Much of reliability engineering was
originally developed for analysis of systems installed in fairly static environments
such as nuclear power plants. Mobile robot components are exposed to dynamic
operating and environmental conditions, particularly in the case of planetary exploration
rovers, which, for example, are subjected to temperature differences of hundreds
of degrees between day and night. The reliabilities of many of the components
in a mobile robot will vary under different operating conditions. It is therefore
necessary to examine how the standard reliability engineering methods can be
adapted to take into account varying operating conditions.
4.3.1 Extrapolation of MTTF to other operating points
The MTTF provided by a device manufacturer represents the hazard rate under a
single set of operating conditions. In order to make reliability predictions over a
range of operating conditions, we need to extrapolate MTTF at different operating
conditions from the single-point MTTF. Models relating how operating conditions
affect reliability are available for many components. These relationships are used,
for instance, in accelerated-life testing, where devices are subjected to extreme
operating conditions in order to induce failure, and the observed failure rates are
then extrapolated back to normal operating conditions.
An example of a robot component whose reliability is affected by operating
conditions is a mechanical bearing. Such bearings are often found in robot motors
and joints. The failure rate of mechanical bearings is significantly affected by
operating conditions such as temperature, rotational speed, and load. Here we
show how the single-point MTTF for a mechanical bearing can be extrapolated
over a range of temperature and load conditions.
Reliability of bearings is often expressed by the $L_{10}$ life, which is the time at which
10% of the population has failed. For a mechanical bearing the $L_{10}$ value is given
by

$$L_{10} = \left(\frac{C}{P}\right)^{d} \cdot \left(\frac{10^{6}}{60\,n}\right), \tag{4.2}$$

where $C$ is the rated bearing load, $P$ is the actual bearing load, $d$ reflects the type
of bearing ($d = 3.0$ for a ball bearing, $d = 3.3$ for a roller bearing), and $n$ is the
rotational speed [29].
Holding the speed constant and using $d = 3.0$, we find that the life is related to the
applied load as

$$\frac{L_{10}}{L_{10,0}} = \left(\frac{P_0}{P}\right)^{3}, \tag{4.3}$$

where the subscript 0 indicates the manufacturer’s published reliability data.
To relate $L_{10}$ life and hazard rate, we use Eq. 2.1 with $R = 90\%$, giving

$$\lambda = \frac{-\ln(0.9)}{L_{10}}. \tag{4.4}$$
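As an illustration of Eqs. 4.2 and 4.4, the following sketch computes an L10 life and the corresponding hazard rate. It assumes the common bearing-catalog convention that the speed n is in revolutions per minute, so that L10 comes out in hours; the text above does not state the units, and the function names and example numbers are hypothetical.

```python
import math


def l10_life(rated_load: float, actual_load: float, speed: float, d: float = 3.0) -> float:
    """L10 life per Eq. 4.2.

    rated_load and actual_load must share the same units. With speed in rpm
    (an assumption), the result is in hours.
    """
    return (rated_load / actual_load) ** d * 1.0e6 / (60.0 * speed)


def hazard_rate_from_l10(l10: float) -> float:
    """Constant hazard rate per Eq. 4.4: 90% of the population survives to L10."""
    return -math.log(0.9) / l10


if __name__ == "__main__":
    # Hypothetical ball bearing: rated load 10 kN, applied load 2 kN, 1500 rpm.
    life = l10_life(10.0, 2.0, 1500.0)
    lam = hazard_rate_from_l10(life)
    print(f"L10 = {life:.0f}, hazard rate = {lam:.3e}")
```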
Combining Eq. 4.3 with Eq. 4.4 gives the relationship between hazard rate and
operating load:

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3}. \tag{4.5}$$
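The intermediate step can be written out as follows. This is a sketch that assumes a constant hazard rate, so that MTTF is the reciprocal of the hazard rate; the 20% load increase is an illustrative number, not taken from the source.

```latex
\begin{align*}
\frac{\lambda}{\lambda_0}
  &= \frac{-\ln(0.9)/L_{10}}{-\ln(0.9)/L_{10,0}}
   = \frac{L_{10,0}}{L_{10}}
   = \left(\frac{P}{P_0}\right)^{3}
  && \text{(Eq. 4.4, then Eq. 4.3)} \\
\frac{\mathrm{MTTF}_0}{\mathrm{MTTF}}
  &= \frac{1/\lambda_0}{1/\lambda}
   = \frac{\lambda}{\lambda_0}
  && \text{(constant hazard rate)} \\
\text{Example: } P = 1.2\,P_0
  &\;\Rightarrow\; \frac{\lambda}{\lambda_0} = 1.2^{3} = 1.728,
  \quad \frac{\mathrm{MTTF}}{\mathrm{MTTF}_0} \approx 0.58
\end{align*}
```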
Bearing life is also greatly affected by temperature since the lubricant in the bearing
breaks down faster at higher temperatures. The approximate relationship used for
the effect of temperature on bearing failure is that every 10 °C rise in temperature
doubles the failure rate [30], or

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = 2^{\left(\frac{T - T_0}{10}\right)}. \tag{4.6}$$
We can combine multiple environmental factors, assuming that they are independent.
In this case we can determine the effect of combined load and temperature changes
on the MTTF of a bearing, which is

$$\frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3} \cdot 2^{\left(\frac{T - T_0}{10}\right)}. \tag{4.7}$$
Eq. 4.7 is plotted in Figure 4.5. This figure shows that MTTF varies greatly even
over a fairly small range of temperatures and loads. This illustrates why the
single-point MTTF provided by manufacturers is inadequate to describe the reliability
of devices operating under significantly different conditions from those under
which the MTTF was established.
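The extrapolation in Eq. 4.7 can be sketched in a few lines. The function and parameter names are hypothetical, the default exponent of 3 corresponds to the ball-bearing case used above, and the example operating points are illustrative only.

```python
def mttf_ratio(load: float, temp_c: float, load_0: float, temp_0_c: float,
               load_exponent: float = 3.0) -> float:
    """Return MTTF / MTTF_0 for a bearing per Eq. 4.7.

    load_0 and temp_0_c are the conditions at which the published MTTF_0 holds.
    Load and temperature effects are assumed independent, as in the text.
    """
    hazard_ratio = (load / load_0) ** load_exponent * 2.0 ** ((temp_c - temp_0_c) / 10.0)
    return 1.0 / hazard_ratio


def extrapolate_mttf(mttf_0: float, load: float, temp_c: float,
                     load_0: float, temp_0_c: float) -> float:
    """Scale a single-point MTTF to a different load and temperature."""
    return mttf_0 * mttf_ratio(load, temp_c, load_0, temp_0_c)


if __name__ == "__main__":
    # Hypothetical bearing rated at MTTF_0 = 10000 h for 100 N at 40 degrees C.
    for load, temp in [(100.0, 40.0), (100.0, 50.0), (120.0, 60.0)]:
        print(load, temp, round(extrapolate_mttf(10000.0, load, temp, 100.0, 40.0)))
```

A 10 °C rise alone halves the MTTF, and doubling the load alone reduces it by a factor of eight, which is consistent with the wide variation the figure is described as showing.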
4.3.2 Operating envelope
Figure 4.6 shows some of the lines of constant MTTF resulting from Eq. 4.7.
These lines illustrate how operating conditions can be traded off against one another
as well as against reliability. For instance, if a robot is to be operated in a
high-temperature environment, it may be desirable to operate the robot motors
at lower speeds in order to compensate for the increased ambient temperature.
On the other hand, if the speed of the robot was a critical mission requirement,
then we could continue to operate the robot at full speed, but with a quantitative
understanding of the tradeoff being made with respect to reliability.
Such tradeoffs could be automated in a sophisticated rover that would monitor
ambient conditions and modify its mission profile in order to maintain a target
mission reliability, in much the same way that human workers will slow down
when working under adverse environmental conditions.
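One way to make such a tradeoff concrete is to invert Eq. 4.7 for the load that holds MTTF at a chosen level as temperature changes. The sketch below does this for load and temperature only, since those are the two variables in Eq. 4.7; the function name and the numbers are hypothetical.

```python
def allowable_load(temp_c: float, load_0: float, temp_0_c: float,
                   mttf_target_ratio: float = 1.0,
                   load_exponent: float = 3.0) -> float:
    """Load P at which MTTF equals mttf_target_ratio * MTTF_0 at temp_c.

    Obtained by solving Eq. 4.7 for P:
        (P / P_0)^3 = (MTTF_0 / MTTF) * 2^(-(T - T_0) / 10)
    """
    if mttf_target_ratio <= 0.0:
        raise ValueError("mttf_target_ratio must be positive")
    load_factor = (1.0 / mttf_target_ratio) * 2.0 ** (-(temp_c - temp_0_c) / 10.0)
    return load_0 * load_factor ** (1.0 / load_exponent)


if __name__ == "__main__":
    # Points along the constant-MTTF line through (100 N, 40 degrees C).
    for temp in (40.0, 50.0, 60.0, 70.0):
        print(temp, round(allowable_load(temp, 100.0, 40.0), 1))
```

Each printed pair is a point on one line of constant MTTF; a rover monitoring ambient temperature could, in principle, use such a relation to derate its load as conditions worsen.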
Figure 4.5. Effect of operating conditions on bearing MTTF
Figure 4.6. Lines of constant MTTF
4.4 Summary
In this chapter, we showed how reliability can be traded off against other mission
design parameters. We first presented a cost–reliability relationship from the literature
and used this to examine how the various costs of a planetary mission contribute
to the overall expected value of the mission. Our results suggest that building
planetary rovers to the highest levels of reliability may not be cost-effective. We
also made use of the cost–reliability relationship to provide a quantitative evaluation
of the claim that teams with more lower-reliability robots are more reliable than
teams with fewer higher-reliability robots. Our results in this case show that this
claim is not universally true but must be evaluated in the context of specific mission
parameters. Finally, we looked at how operating conditions affect reliability –
specifically looking at how temperature and operating load affect the expected
life of a mechanical bearing.
Chapter 5
MISSION PLANNING
The previous chapters demonstrate how reliability can be used in the design of
robots and multirobot teams. In this chapter and the next we consider the role
of robot reliability in the process of mission planning for mobile robot teams.
Specifically, we examine here how knowledge of robot reliabilities can be used
to improve task allocation in the context of the multirobot exploration problem.
We take a simple exhaustive planner and compare the plan it chooses against the
optimal plan that takes into account robot failures and the backup plans that occur
after failure. Our results show that for this problem domain, making an initial plan
without regard to individual robot reliabilities results in choosing a suboptimal
plan most of the time and that the difference in mission performance between the
chosen plan and the optimal plan is usually substantial.
5.1 Background
For multirobot missions, it is necessary to allocate tasks among the team members
as part of the mission planning process. The specific task allocation chosen affects
the probabilities of failure for the robots, and thus the mission, since failure is a
function of usage.
In reviewing the robot mission planning literature, we find that there has been
substantial work in the area of detecting and recovering from robot failures (e.g.,
[14, 22, 31, 32]) and that several multirobot mission planning systems provide
mechanisms for reallocation of tasks among surviving team members after a robot
failure (e.g., [23, 33, 34]). However, all of these methods are reactive rather than
predictive, dealing with failure only after it occurs. Reference [23], for example,
describes a mission planning system that is able to recover from robot failure
because tasks can be reallocated. In this system, tasks are auctioned off to the
robot with the highest bid (or lowest, depending on the utility metric). This system
allows for task reallocation during the mission when new information changes the
valuation of tasks. For instance, if a robot suffers a component failure that impairs
its ability to perform its assigned tasks, it will change its valuation for those tasks,
and it can then subcontract tasks to another robot that has a better valuation for
those tasks.
While it is important to recover from robot failures, it would be better to minimize
the likelihood of such failures in the first place. One way to do so is to design
the robots to an appropriate level of reliability for the mission requirements, as
described in the previous chapters of this document. Another way is to operate the
robots in a way that minimizes the likelihood of failure. As discussed in
Chapter 4, operating conditions are one execution-time consideration in the probability
of robot failure.
Another operational consideration in robot failure is the assignment of tasks to
robots and the ordering of those tasks. The mission planning system influences
the likelihoods of robot failures because the initial assignment of tasks to robots
plays a role in determining the probabilities of robot failures during the mission.
For example, assigning a robot with a weak or damaged drive motor to a task that
requires it to travel a long distance results in a higher probability of that robot
failing than if the robot were assigned to a task that required less travel. This example
is intuitive because it assumes heterogeneous robots, but in this chapter we demonstrate
that even when team members are homogeneous, the assignment of tasks to robots
has a significant influence on robot failure. We are not aware of any existing work
that addresses the use of robot reliability information to improve multirobot task
allocation in this way.
One way to incorporate such reliability concerns into multirobot mission planning
would be to introduce a reliability component into the utility metric used by the
planner. Such an approach is unsatisfying for two reasons. The first is the
incommensurability of different components of a utility measure – how do we
combine dollars spent, meters traveled, and probability of failure into a single
metric? The second is that establishing a numeric reliability requirement is itself
a difficult problem that has been minimally explored for the mobile robot domain
– i.e., how do we decide if the reliability requirement for a mission should be 95%
rather than 96%?
In order to avoid these difficulties, we take a different approach: Rather than devising
new utility metrics that explicitly incorporate reliability, we instead look at how
robot reliability affects the utility metrics already being used. In plain language,
what we are not doing is taking “Find the solution with the shortest time” and
turning it into “Find the solution with the shortest time that also meets reliability
level X” but instead turning it into “Find the solution with the shortest expected
time” where the expected time takes into account the alternative outcomes that
occur when robots fail.
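In generic terms, the expected time referred to above is a probability-weighted average over failure outcomes. The notation below is illustrative only and is not taken from the source; in particular, how outcomes in which the mission cannot be completed are scored is left open here.

```latex
\[
  \mathrm{E}[t_{\text{plan}}] \;=\; \sum_{o \in \mathcal{O}} \Pr(o)\, t(o),
  \qquad \sum_{o \in \mathcal{O}} \Pr(o) = 1,
\]
% \mathcal{O}: set of mutually exclusive outcomes (no failure, robot 1 fails, ...)
% t(o): mission time under outcome o, including any backup plan executed after a failure
```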
5.2 Illustrating Example
Consider a simple multirobot exploration mission with two identical robots and
two locations to be visited (Figure 5.1). The goal of the mission is for all target
locations to be visited in any order by any robot in the shortest total mission time.¹
Time is assumed here to be proportional to distance traversed.
[Plot: x-y positions of the robots (R1, R2) and targets (T1, T2)]
Figure 5.1. Exploration mission
Each robot is defined by an (x, y) location and a reliability Pt, which is the probability
of surviving a one-unit traverse. Each target is defined by its (x, y) location. The
robot and target parameters used for this example are listed in Table 5.1 and illustrated
by Figure 5.1.
Table 5.1. Robot and target parameters
           x    y    Pt
Robot 1    4    12   0.99
Robot 2    14   3    0.99
Target 1   1    1    -
Target 2   3    5    -

For a small number of robots and targets, it is feasible to exhaustively enumerate
the possible task assignments and then calculate the distance that each robot must
traverse to accomplish each plan (Table 5.2). The plan duration (dplan) is equal
to the greatest distance that any robot travels during that plan. The plan with the
smallest duration is then chosen. In this example, Plan B (Figure 5.2a) would be
chosen.

¹ In other terms, to minimize the makespan.
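The exhaustive enumeration described above can be sketched as follows, using the coordinates from Table 5.1 and straight-line distances. This is an illustrative reimplementation, not the planner used in the thesis, and the function names are hypothetical. Computed this way, the plan durations agree with Table 5.2 after rounding; for Plan C the sketch gives about 7.07 for d(R1), where the table lists 7.62, which does not change that plan's duration.

```python
import math
from itertools import permutations, product

ROBOTS = {"R1": (4.0, 12.0), "R2": (14.0, 3.0)}
TARGETS = {"T1": (1.0, 1.0), "T2": (3.0, 5.0)}


def route_length(start, stops):
    """Total straight-line distance from start through each stop in order."""
    total, here = 0.0, start
    for stop in stops:
        total += math.dist(here, stop)
        here = stop
    return total


def enumerate_plans(robots, targets):
    """Return (routes, distances, duration) for every assignment and visit order."""
    robot_names, target_names = list(robots), list(targets)
    plans = []
    # Each target is owned by exactly one robot.
    for owners in product(robot_names, repeat=len(target_names)):
        groups = {r: [t for t, o in zip(target_names, owners) if o == r]
                  for r in robot_names}
        # Every visiting order of each robot's own targets.
        orderings = [list(permutations(groups[r])) for r in robot_names]
        for combo in product(*orderings):
            routes = dict(zip(robot_names, combo))
            dists = {r: route_length(robots[r], [targets[t] for t in routes[r]])
                     for r in robot_names}
            plans.append((routes, dists, max(dists.values())))
    return plans


if __name__ == "__main__":
    plans = enumerate_plans(ROBOTS, TARGETS)
    for routes, dists, duration in sorted(plans, key=lambda p: p[2]):
        print(routes, {r: round(d, 2) for r, d in dists.items()}, round(duration, 2))
```

The number of plans grows very quickly with the number of robots and targets, so this brute-force approach is only practical for small problems such as this one.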
Now consider what happens when a robot fails while executing this plan. If Robot 1
fails, then Robot 2 is assigned to visit Target 1 after reaching Target 2
(Figure 5.2b). If Robot 2 fails, then Robot 1 is assigned to Target 2 after reaching
Target 1 (Figure 5.2c). We assume here that tasks are not interrupted, so new
targets are assigned to surviving robots only after they complete their current
tasks.
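The quantities involved in these backup cases can be sketched numerically. The code below makes two assumptions that are ours, not the source's: the probability of surviving a traverse of length d is taken as Pt raised to the power d (independent unit steps, extended to fractional distances), and each backup duration is computed as if the surviving robot completes its own leg and then the extra leg without failing. Variable names are hypothetical.

```python
import math

R1, R2 = (4.0, 12.0), (14.0, 3.0)   # robot positions from Table 5.1
T1, T2 = (1.0, 1.0), (3.0, 5.0)     # target positions from Table 5.1
P_T = 0.99                           # probability of surviving a one-unit traverse


def survival_probability(distance: float, p_unit: float = P_T) -> float:
    """Probability of surviving a traverse (assumption: p_unit ** distance)."""
    return p_unit ** distance


# Plan B: Robot 1 goes to Target 1, Robot 2 goes to Target 2.
leg_r1 = math.dist(R1, T1)
leg_r2 = math.dist(R2, T2)
hop = math.dist(T1, T2)

# Backup durations: the survivor finishes its own task, then takes the other target.
backup_if_r1_fails = leg_r2 + hop    # Robot 2: Target 2, then Target 1
backup_if_r2_fails = leg_r1 + hop    # Robot 1: Target 1, then Target 2

p_r1_ok = survival_probability(leg_r1)
p_r2_ok = survival_probability(leg_r2)
p_no_failure = p_r1_ok * p_r2_ok     # assumes the two robots fail independently

if __name__ == "__main__":
    print(f"nominal duration      {max(leg_r1, leg_r2):.2f}")
    print(f"backup, Robot 1 fails {backup_if_r1_fails:.2f}")
    print(f"backup, Robot 2 fails {backup_if_r2_fails:.2f}")
    print(f"P(no failure)         {p_no_failure:.3f}")
```

Even with a per-unit reliability of 0.99, the chance that both robots complete their primary legs is only about 80% under these assumptions, so the backup cases carry real weight in the expected mission time.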
Table 5.2. Plan durations (* indicates best plan)
Plan                 d(R1)   d(R2)   dplan
A (R1T1 + R1T2)      15.9    0       15.9
B (R1T1 + R2T2) *    11.4    11.2    11.4
C (R2T1 + R1T2)      7.62    13.2    13.2
D (R2T1 + R2T2)      0       17.6    17.6
E (R1T2 + R1T1)      11.5    0       11.5
F (R2T2 + R2T1)      0       15.7    15.7
[Plots: x-y positions of the robots (R1, R2) and targets (T1, T2), with the routes for each panel]
(a) Chosen plan (Plan B)
(b) Backup for Plan B when Robot 1 fails (dashed line represents robot failure)