

  • Planning to Fail: Incorporating Reliability into Design
    and Mission Planning for Mobile Robots

    Stephen B. Stancliff

    CMU-RI-TR-09-38

    Submitted in partial fulfillment of the requirements for the degree of
    Doctor of Philosophy in Robotics.

    The Robotics Institute
    Carnegie Mellon University
    Pittsburgh, Pennsylvania 15213

    September, 2009

    Thesis Committee:
    John Dolan, Chair
    Brett Browning
    Michael Nechyba
    Ashitey Trebi-Ollennu, California Institute of Technology, JPL

    Copyright © 2009 by Stephen B. Stancliff. All rights reserved.


  • ABSTRACT

    Current mobile robots generally fall into one of two categories as far as reliability

    is concerned – highly unreliable, or very expensive. Most fall into the first category,

    requiring teams of graduate students or staff engineers to coddle them in the days

    and hours before a brief demonstration. The few robots that exhibit very high

    reliability, such as those used by NASA for planetary exploration, are very expensive.

    In order for mobile robots to become more widely used in real-world environments,

    they will need to have reliability in between these two extremes. In many applications

    some amount of unreliability is acceptable if it results in reduced costs. Even in

    applications where a failure probability very near zero is desired (such as planetary

    exploration), the ability to design robots to a specific reliability goal should allow

    us to reduce the costs of these highly reliable robots by designing them to be “just

    reliable enough” to complete the mission, rather than designing them to be “as

    reliable as possible.”

    In order to design mobile robots with respect to reliability, we need quantitative

    models for predicting robot reliability and for relating reliability to other design

    parameters such as cost. To date, however, there has been very little formal

    discussion of reliability in the mobile robotics literature, and no general method

    has been presented for quantitatively predicting the reliability of mobile robots.

    This thesis focuses on this problem of predicting reliability for mobile robots

    and in particular for teams of mobile robots, and proposes solutions for using

    reliability as a design input for several mobile robot design problems:

    • Given a choice of components from which to assemble a robot, how do we

    select the ones that will optimize the tradeoff of reliability against other

    factors such as cost?

    • Given a choice of robots from which to assemble a multirobot team, how do

    we select the ones which will optimize the reliability tradeoffs for the entire

    robot team?

    • Given a multirobot team and a list of mission tasks, how do we assign tasks

    to team members in order to maximize the probability of completing the

    mission?


  • Table of Contents

    List of Tables
    List of Figures

    Chapter 1. INTRODUCTION
        1.1 Motivation
        1.2 Overview
        1.3 Contributions
        1.4 Outline

    Chapter 2. SINGLE-ROBOT RELIABILITY
        2.1 Related Work
        2.2 Reliability Background
            2.2.1 Types of robot failures
            2.2.2 Reliability model
            2.2.3 Consequences of constant hazard rate
        2.3 Robots and Tasks
            2.3.1 Robot decomposition
            2.3.2 Module–task and robot–task reliability
            2.3.3 Single-robot example
        2.4 Summary

    Chapter 3. MULTIROBOT RELIABILITY
        3.1 Related Work
        3.2 Analytical Solutions for Simple Multirobot Missions
        3.3 Stochastic Simulation for Complex Multirobot Missions
        3.4 Example Results for a Complex Multirobot Mission
            3.4.1 Comparing teams having different numbers of robots
            3.4.2 Comparing teams with robots having different reliabilities
        3.5 Example – Repairable vs. Nonrepairable Robot Teams
        3.6 Summary

    Chapter 4. DESIGN TRADEOFFS
        4.1 Cost
            4.1.1 Cost of reliability
            4.1.2 Expected mission reward
            4.1.3 Overall cost–reliability relationship
        4.2 Example – Multirobot Team Size
            4.2.1 Introduction
            4.2.2 Analysis
        4.3 Operating Conditions
            4.3.1 Extrapolation of MTTF to other operating points
            4.3.2 Operating envelope
        4.4 Summary

    Chapter 5. MISSION PLANNING
        5.1 Background
        5.2 Illustrating Example
        5.3 Simulation Results
            5.3.1 Minimax utility function
            5.3.2 Differences in plan durations
            5.3.3 Overall planner performance metric
            5.3.4 Minisum utility function
        5.4 Summary

    Chapter 6. INCOMPLETE MISSION PLANNERS
        6.1 Greedy Planner
            6.1.1 Incorporating reliability
            6.1.2 Incorporating reliability – revised method
        6.2 Less-Greedy Planner
        6.3 Compromise Planner
        6.4 Summary

    Chapter 7. CONCLUSIONS
        7.1 Summary
        7.2 Contributions
        7.3 Future Research

    Appendix A. Subsystem Reliability Data
    Appendix B. Expected Value Calculation
    Bibliography

  • List of Tables

    2.1 Module usage during sampling task
    2.2 Components comprising power subsystem
    2.3 Robot subsystem reliabilities
    2.4 Module reliabilities during sampling task
    3.1 Subsystem usage by task
    3.2 Module–task reliabilities
    4.1 Baseline team costs and rewards
    5.1 Robot and target parameters
    5.2 Plan durations
    5.3 Naive and expected durations
    A.1 Power
    A.2 Communications
    A.3 Computation & Sensing
    A.4 Mobility
    A.5 Manipulator
    B.1 Robot and target parameters
    B.2 Plan durations and probabilities
    B.3 Plan durations – expected (minimax)
    B.4 Plan durations – minimax (expected)

  • List of Figures

    2.1 The bathtub curve
    2.2 Failure rates making up bathtub curve
    2.3 Modular robot concept
    2.4 NASA Hierarchical System Terminology
    3.1 Possible paths for simple mission
    3.2 Same mission as Figure 3.1 but with one repair allowed
    3.3 State–transition diagram for complex mission
    3.4 Different numbers of robots
    3.5 Closeup of area of interest from Figure 3.4
    3.6 Different component reliabilities
    3.7 Total work completed; two-component robots
    3.8 Improvement of repairable team over nonrepairable team
    3.9 Total work completed; six-component robots
    3.10 Effect of failure rate on repairable team superiority
    4.1 Relative cost of rovers as function of component reliability
    4.2 Expected value of mission as a function of component reliability
    4.3 Net expected gain
    4.4 Comparison of equal-cost teams
    4.5 Effect of operating conditions on bearing MTTF
    4.6 Lines of constant MTTF
    5.1 Exploration mission
    5.2 Chosen plan and backups
    5.3 Plan with shortest expected duration
    5.4 Suboptimal allocations
    5.5 Suboptimal allocations
    5.6 Average increase in mission duration as a function of target count
    5.7 Expected increase in duration as a function of target count
    5.8 Expected increase in mission duration
    5.9 Expected increase in power consumption
    6.1 Planner comparison (first approach)
    6.2 Planner comparison (second approach)
    6.3 How often reliability-enhanced planner chooses a better plan
    6.4 Expected increase in mission duration (greedy planner)
    6.5 Expected increase in mission duration (less-greedy planner)
    6.6 Comparison of greedy and less-greedy planners
    6.7 Comparison of less-greedy and compromise planners
    6.8 How the number of plans tested affects performance

  • Chapter 1

    INTRODUCTION

    1.1 Motivation

    Many of the most promising applications for mobile robots are those that reduce

    or eliminate the need for humans to perform tasks in dangerous environments.

    Examples include space exploration, mining, and toxic waste cleanup. For mobile

    robots to succeed in keeping humans from these dangers, these robots must be

    highly reliable so that people do not have to enter the dangerous area to repair or

    replace failed robots.

    Unfortunately, most current mobile robots have poor reliability, requiring frequent

    maintenance and repair. Historical failure data for small field robots reveal that

    they are either broken or under repair approximately half of the time [1].

    Notable exceptions to this observation are the planetary rovers built and operated

    for NASA by the Jet Propulsion Laboratory (JPL). The current Mars Exploration

    Rovers (MER), for instance, have now been in operation on Mars for more than

    five years. There are few, if any, other mobile robots that have operated for as long

    as a year without repair.

    The reliability of NASA rovers is achieved through the use of highly robust

    components as well as component redundancy, both of which lead to the robots

    being very expensive. The cost of the first MER rover was approximately $150M

    [2]. Other than space exploration and perhaps a few military applications, it is

    hard to imagine many applications for which a robot price tag in the hundreds

    of millions of dollars will be acceptable. Therefore, the current NASA design

    paradigm of making robots as reliable as possible is not broadly applicable.

    Even in the realm of planetary exploration, the current design paradigm may not

    be able to provide the reliability required for future missions. In the near future,

    NASA intends to send rovers to Mars for missions lasting an order of magnitude

    longer than the original MER mission. Using the current design paradigm,

    increasing the mission duration by an order of magnitude requires that the rover

    be built using components with failure rates an order of magnitude lower. Since

    NASA rovers already make use of some of the most reliable components available,

    it is doubtful whether components with an order-of-magnitude greater reliability

    are available, let alone affordable.
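
    The order-of-magnitude claim above can be checked with a short calculation. This is only a sketch: it assumes the constant-hazard-rate model introduced later in Section 2.2.3, applied to a single component, and the symbols \lambda_1, \lambda_2, T_1, T_2, and R^* are labels introduced here for illustration rather than notation from the thesis.

```latex
% Assumption: constant hazard rate, so R(t) = e^{-\lambda t} (Section 2.2.3).
% Hold the target mission reliability R^* fixed while the duration grows.
\begin{align*}
R^* = e^{-\lambda_1 T_1} &= e^{-\lambda_2 T_2} \\
\lambda_1 T_1 &= \lambda_2 T_2 \\
\lambda_2 &= \lambda_1 \frac{T_1}{T_2}
  = \frac{\lambda_1}{10} \quad \text{when } T_2 = 10\,T_1 .
\end{align*}
```

    Under these assumptions, a tenfold longer mission at the same reliability target needs a tenfold lower failure rate.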

    Both of these situations – the unreliability of mobile robots in general and the high

    cost of reliable NASA rovers – reveal the need for principled consideration of

    reliability as a design parameter for robots and robot missions. To date, however,

    there has been very little formal discussion of reliability in the robotics literature,

    and no general methods have been presented for using quantitative reliability as an

    input for robot design or mission planning.

  • 1.2 Overview

    This thesis first addresses the question of how to apply existing quantitative reliability

    estimation methods to mobile robots. Second, we present methods for using reliability

    as an input parameter in the design of robots and multirobot teams. Finally, we

    consider how knowledge of robot reliability can be used to improve the performance

    of multirobot planners.

    The methods developed in this thesis have their roots in the field of reliability

    engineering. We have developed a formal representation for mobile robots that

    allows us to apply common reliability engineering models in a systematic way to

    determine the probability that a robotic mission will be successfully completed.

    As is often the case when applying outside fields of knowledge to mobile robots,

    we have discovered areas where the existing methods fall short and must be modified

    to deal with the complexities of mobile robot systems. In particular, traditional

    methods of combining component reliabilities into system reliability assume that

    the components are independent in terms of reliability. This assumption fails for

    many multirobot missions. We present a framework for dealing with the complexity

    that these dependencies add to the problem.
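
    To make the independence assumption concrete, the following minimal sketch shows the traditional series-system calculation that the paragraph refers to: if every component must work and failures are independent, system reliability is the product of the component reliabilities. The function name and the numeric values are hypothetical and are not taken from the thesis, and the sketch deliberately does not model the dependencies that arise in multirobot missions.

```python
from typing import Iterable


def series_reliability(component_reliabilities: Iterable[float]) -> float:
    """Reliability of a system that needs every component to work.

    Assumes component failures are statistically independent, which is
    exactly the assumption the text says can fail for multirobot missions.
    """
    result = 1.0
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
        result *= r
    return result


# Hypothetical subsystem reliabilities for one task (illustrative values only).
subsystems = {"power": 0.99, "mobility": 0.95, "communications": 0.98}
system = series_reliability(subsystems.values())
print(f"system reliability: {system:.4f}")
```

    The product is always lower than the weakest component's reliability, which is why adding components to a series system can only reduce the predicted reliability.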

    We then apply these methods to several single-robot and multirobot design problems,

    examining the tradeoffs between cost and reliability, between repairable and

    nonrepairable robots, and between teams of many low-reliability robots and

    teams of fewer high-reliability robots.

    Finally, we examine the role of reliability in multirobot task allocation. Specifically,

    we evaluate the hypothesis that ignoring robot reliability information when generating

    initial task allocations leads to suboptimal performance. Our results show that this

    is indeed the case and that the difference in performance is substantial.

  • 1.3 Contributions

    The contributions of this dissertation to the robotics community are the following:

    • Introduction of models for reliability prediction from the reliability

    engineering literature into the mobile robotics literature.

    • A general theoretical framework for applying these reliability engineering

    methods to robots and multirobot teams.

    • The first quantitative analysis of cost–reliability tradeoffs for planetary rover

    missions.

    • The first quantitative analysis of the tradeoff between robot reliability and

    team size.

    • The first analysis of the benefit of using reliability knowledge a priori in

    multirobot mission planning, providing strong evidence that planners which

    do not use this information choose suboptimal plans.

    • A model for how reliability knowledge can be used to improve task allocation

    for incomplete multirobot planners.

    Overall, these contributions allow us to begin considering robot reliability as an

    input parameter for robot design and operation, rather than as an uncontrolled

    output resulting from decisions made without regard to reliability.

  • 1.4 Outline

    Chapter 2 of this document introduces the relevant terminology and models we

    borrow from the reliability engineering literature and describes the framework we

    have developed for applying reliability engineering to the design of mobile robots.

    This chapter presents an example problem in which we calculate the probability

    that a planetary exploration rover will successfully complete a sampling mission.

    Chapter 3 considers the design of multirobot teams. We first present a

    straightforward method for finding analytical solutions to multirobot missions.

    Because such methods are impractical for missions of significant complexity, we

    then introduce a method that uses stochastic simulation for evaluation of more

    complex missions. In this chapter we evaluate a multirobot mission in which

    planetary rovers must work cooperatively to install a solar panel array. We also

    analyze a problem introduced in [3] that compares the performance of a team of

    repairable robots with that of a nonrepairable team.

    In Chapter 4 we demonstrate how reliability can be integrated with other design

    parameters in order to optimize robot design across multiple design constraints.

    The bulk of this chapter examines the relationship between reliability and cost

    in the context of a planetary exploration mission. We also examine how operating

    conditions affect reliability and how single-point reliability data can be extrapolated

    to off-design operating conditions.

    Chapters 5 and 6 examine the role of quantitative reliability in multirobot mission

    planning. Specifically, in Chapter 5 we test the hypothesis that it is necessary to

    consider robot reliability when generating initial task allocations, rather than, as

    is currently practiced, dealing with reliability only after the fact, by reallocation of

    tasks after robot failure occurs. In Chapter 6 we extend these results by

    demonstrating that reliability information can be used to improve plan selection

    for heuristic planners.

    Finally, in Chapter 7 we summarize the contributions of this thesis and discuss

    future directions for this research.


  • Chapter 2

    SINGLE-ROBOT RELIABILITY

    This chapter provides an overview of methods and models from the reliability

    engineering literature, introduces the representation we use for modeling mobile

    robots using these methods, and shows how these methods can be applied in order

    to predict the probability that a single robot will complete a given task.

    2.1 Related Work

    The reliability engineering literature (e.g., [4, 5]) provides methods for predicting

    the reliability of simple electrical and mechanical devices and also for combining

    these reliabilities to predict the reliability of complex systems. These methods can

    be applied in a straightforward fashion to make predictions about the reliability of

    simple robots executing simple missions. For many robotic applications, however,

    there are violations of the assumptions upon which the basic reliability engineering

    methods are based. We address these shortcomings in Chapters 3 and 4.

    In the mobile robotics literature there is little formal discussion of reliability and

    failure. When reliability is mentioned, it is usually qualitatively, and in passing.

    Reference [6], for example, mentions intermittent hardware failures as an explanation

    for gaps in experimental data but makes no attempt at characterizing the failures.

    A handful of prior papers ([1, 7, 8, 9]) make use of reliability engineering for

    analysis of mobile robot failure rates. Reference [1] provides an overview of robot

    failure rates at the system level (i.e., robot model X failed Y times in Z hours of

    operation) and also breaks down failures according to the subsystem that failed

    (actuators, control system, power, or communications). Reference [7] extends the

    work in [1] both by the inclusion of additional failure data of the same type and

    also by addition of new categories of failure – those due to human error. Reference

    [9] provides a detailed analysis of failures experienced by some of the robots used

    in searching the World Trade Center wreckage in 2001. Reference [10] provides

    failure data for robots used in long-term experiments as museum guides. While

    these papers help us to begin to identify the causes of mobile robot failure, they do

    not provide methods for predicting failures.

    In contrast to the mobile robot literature, there is considerable work in the area of

    reliability of robotic manipulators. Examples include [11] and [12]. This work in

    manipulator reliability has the same shortcomings with respect to mobile robots

    as the basic reliability methods, in that manipulators are generally simpler devices

    than mobile robots and are used in fairly static environments. There is some relevant

    work in the manipulator literature describing how environmental conditions affect

    reliability (e.g., [13]), although here the environmental factor involved is a constant

    rather than varying with time and task, as is often the case with mobile robots.

    There is also a significant body of mobile robot research that deals tangentially

    with reliability by describing methods for detecting and recovering from failures.

    An example is [14], in which fault detection is used to discard faulty sensor readings

    among a group of redundant sensors. Our work differs from these in that we are

    developing methods to predict the probability of failure occurring rather than

    to respond to failure after it occurs. Our methods are complementary to these

    since an a priori understanding of the relative probabilities of different failures

    is helpful for failure diagnosis.

  • 2.2 Reliability Background

    Reliability is “the ability of a system or component to perform its required functions

    under stated conditions for a specified period of time” [15, p. 170]. In other words,

    reliability is the probability that no failures will occur before a given time. When

    evaluating the reliability of a system, we must first identify the ways in which the

    system may fail and then determine the probabilities of those failures occurring.

    2.2.1 Types of robot failures

    Mobile robots are complex systems, and as a result there are many factors that

    can cause the failure of a robotic mission. The laboratory robots with which most

    researchers are familiar usually fail due to errors in design, manufacturing, or

    usage. The hardware breaks down due to being poorly designed or constructed;

    the software has bugs that are revealed only under the stress of a demonstration;

    and both hardware and software fail because the robots are used in situations

    beyond the intentions of their designers.

    While these types of failures are significant and in fact are the dominating failure

    modes for most mobile robots today ([1],[7],[8]), we contend that these failure

    modes are not in need of modeling so much as they are in need of correction.

    These failures are the result of errors that can be reduced, if not eliminated, through

    process control. Methods for reducing errors in design, manufacturing, software

    development, and operation are widely used in industry (e.g., ISO 9001 Quality

    Management). As mobile robots become more common and are produced in a

    manufacturing rather than a research environment, these engineering methods will

    be applied, yielding a reduction in failures due to errors.

    We can see that this is possible because some of today’s mobile robots are already

    built with a high degree of quality control in design, construction, and operation.

    For instance, the planetary rovers built for NASA by JPL are built to very high

    standards of quality and controlled by highly trained operators, resulting in a very

    low incidence of failures due to errors. This is largely because much greater care

    is given to their design, construction, and operation in comparison with most other

    current mobile robots.

    Once failures due to errors are largely eliminated, as with the NASA rovers, the

    remaining failures are due mostly to inherent properties of the materials from

    which the robot is constructed. An example of such a failure is the degradation

    of the lubricant in a bearing and the subsequent failure of the bearing. There is no

    process control that will change the physical reality that lubricants break down

    and unlubricated bearings fail. Instead, the robot must be designed taking into

    account the possibility of bearing failure so as to guarantee that there is only a

    small chance of failure during the mission.

    The need to address such failures is suggested by the long-term robot museum

    guide experiments described in [10]. The robots described in that paper possessed

    self-diagnostic and self-resetting capabilities that allowed them to overcome many

    design and implementation errors. The “remaining failures were eventually stochastic

    and unpredictable, a tire failing here, and a light bulb failing there” [10, p. 4].

    It is this latter type of failure with which we are primarily concerned. The reliability

    engineering literature provides well-established models for this type of failure.

    In the rest of this chapter we demonstrate how these models can be used for the

    prediction of mobile robot failures and for choosing an optimal set of robot

    components with respect to reliability requirements.

    It is possible that some of the other types of failure mentioned above can also be

    incorporated into these predictions. For instance, models for predicting software

    errors have been proposed in the literature (e.g., [16],[17]). Incorporation of such

    models would allow us to provide a more complete picture of mobile robot failure.

    However, these models have been in existence for a much shorter time than hardware

    reliability models and have been applied in very few cases, so their ability to predict

    software failures is unproven. In addition, our goal is to produce tools that can be

    used in the early stages of mission design. Most of the available software prediction

    models require input data that are not available in those early stages. We therefore

    confine ourselves in this work to the category of hardware failures described above.

    2.2.2 Reliability model

    Reliability models are descriptions of how the instantaneous failure rate (or hazard

    rate) for a device changes over time. For many electronic and mechanical devices,

    when the hazard rate is plotted as a function of time, the resulting curve resembles

    Figure 2.1 [4, p. 109]. This characteristic shape is referred to as the bathtub curve.

    The bathtub curve arises from the superposition of three distinct failure patterns.

    The first is an exponentially decreasing failure rate which is high at the beginning

    of the product life (Figure 2.2a). This corresponds to the period during which

    items fail largely due to defects in materials or construction. There are many early

    failures, but as defective items drop out of the population, the remaining population

    has a lower hazard rate. This is referred to as the burn-in or infant mortality period.

    The second pattern (Figure 2.2b) is an exponentially increasing failure rate which

    becomes high when components have reached the ends of their useful lives and

    begin to fail due to deterioration. This is referred to as the wearout phase.

    The third failure pattern (Figure 2.2c) is a constant failure rate due to random

    failures. In the middle section of the bathtub curve this failure pattern dominates.

    This period is referred to as the service life or useful life.

    Figure 2.1. The bathtub curve

    Figure 2.2. Failure rates making up bathtub curve: (a) infant mortality, (b) wearout, (c) random failures
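
    The superposition described above can be sketched numerically. The snippet below is an illustration only: it adds an exponentially decreasing term, a constant term, and an exponentially increasing term, as the text describes, but the function name, the parameter names, and every numeric value are hypothetical and chosen just to produce a recognizable bathtub shape.

```python
import math


def bathtub_hazard(t: float,
                   infant=(0.05, 0.5),     # (initial rate, decay constant), assumed
                   random_rate=0.002,      # constant random-failure rate, assumed
                   wearout=(1e-6, 0.1)):   # (initial rate, growth constant), assumed
    """Hazard rate at time t as the sum of three failure patterns."""
    infant_rate, decay = infant
    wearout_rate, growth = wearout
    infant_term = infant_rate * math.exp(-decay * t)     # burn-in / infant mortality
    wearout_term = wearout_rate * math.exp(growth * t)   # wearout
    return infant_term + random_rate + wearout_term


# Sample the curve at a few times (arbitrary time units).
for t in (0, 10, 50, 100, 150):
    print(f"t = {t:3d}  hazard = {bathtub_hazard(t):.6f}")
```

    With these assumed values the printed hazard starts high, settles close to the constant random-failure rate in the middle, and climbs again at the end, which is the behavior the three panels of Figure 2.2 combine to produce.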

    In applying the bathtub model to robots, we assume that there will be a period of

    initial testing which allows burn-in failures to be dealt with before components are

    placed into service. This is standard procedure for manufacturing of products with

    small production runs or for products that use cutting-edge technology [18].

    At the other end of the bathtub curve, we assume that the service life of components

    will be specified by their manufacturers and observed in robot design and mission

    planning so that robot modules will not wear out before the completion of the

    mission for which they are being designed.

    Given these two assumptions, the hazard rate of a robot component needs to be

    known only during the service life phase. This hazard rate is modeled as a constant,

    which is represented in the literature by λ. It is also important to know when the

    end of the service life is reached. The reliability of a module can therefore be

    modeled with just two parameters – the (constant) hazard rate and the service life

    length.

    2.2.3 Consequences of constant hazard rate

    The reliability of a device with a constant hazard rate is

    $$R(t) = e^{-\lambda t}. \tag{2.1}$$

  • Thus, the reliability of a device with a constant hazard rate is equal to one at the

    beginning of the service life and decays exponentially towards zero.

    Manufacturers usually specify the reliability of a device in terms of mean time to

    failure (MTTF). During the service life, the hazard rate and MTTF are related as

    $$\mathrm{MTTF} = \frac{1}{\lambda}. \tag{2.2}$$

    The relationships in Eqs. 2.1 and 2.2 allow us to calculate the probability of failure

    of a component from the manufacturer’s published MTTF. It is important to remember

    that this MTTF applies only during the constant-hazard-rate portion of the bathtub

    curve. It is a common mistake to assume that MTTF, since it has units of time,

    measures how long an item will last. Most components will fail due to wearout

    long before the time corresponding to MTTF is reached. Reference [19] has this to

    say about the confusion:

    Note that there is no direct connection or correlation between service life and failure rate. It is possible to design a very reliable product with a short life. A typical example is a missile for example: it has to be very, very reliable ([MTTF] of several million hours), but its service life is only 0.06 hours (4 minutes)! 25 year old humans have an [MTTF] of about 800 years (about 0.1%/year) but not many have a comparable service life. Just because something has a good [MTTF], it does not necessarily have a long service life as well. [19, p. 5]
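    As a rough numerical illustration of Eqs. 2.1 and 2.2 and of the distinction drawn in the quotation, the following sketch converts an MTTF into a probability of failure over a given operating interval. The function name `failure_probability` is hypothetical. The 800-year and 0.06-hour figures come from the quotation, while the two-million-hour missile MTTF is an assumed stand-in for "several million hours".

```python
import math


def failure_probability(mttf: float, duration: float) -> float:
    """Probability of failing within `duration` under a constant hazard rate.

    `mttf` and `duration` must be expressed in the same time unit.
    """
    hazard_rate = 1.0 / mttf                          # Eq. 2.2: lambda = 1 / MTTF
    reliability = math.exp(-hazard_rate * duration)   # Eq. 2.1
    return 1.0 - reliability


# 25-year-old human from the quotation: MTTF of about 800 years, one year of exposure.
p_human_year = failure_probability(mttf=800.0, duration=1.0)

# Missile: assumed MTTF of 2 million hours, service life of 0.06 hours.
p_missile = failure_probability(mttf=2.0e6, duration=0.06)

print(f"human, one year:     {p_human_year:.4%}")
print(f"missile, one flight: {p_missile:.2e}")
```

    Both items have a small probability of failure over the interval of interest, even though neither is expected to remain in service for anything like its MTTF.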

    One of the reasons that the constant hazard rate model is commonly used is because

    many reliability calculations are much simpler under this model than other models.


  • This model is closed under the operations of combining devices in serial and

    parallel, while most other reliability models are not [20, p. 47]. Another useful

    property is the “lack of memory” of the exponential function; i.e., the probability

    that a device will fail in the next hour of operation is the same at any point within

    the constant-failure-rate portion of the bathtub curve [20, p. 43].
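    The lack-of-memory property can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the 4200-hour MTTF is an assumed value. It compares the probability of surviving one more hour for devices of different ages.

```python
import math


def reliability(hazard_rate: float, t: float) -> float:
    """Eq. 2.1: probability of surviving to time t under a constant hazard rate."""
    return math.exp(-hazard_rate * t)


def conditional_survival(hazard_rate: float, age: float, extra: float) -> float:
    """P(survive to age + extra | already survived to age)."""
    return reliability(hazard_rate, age + extra) / reliability(hazard_rate, age)


hazard_rate = 1.0 / 4200.0  # assumed MTTF of 4200 hours
for age in (0.0, 100.0, 1000.0):
    # The next hour is equally risky regardless of the hours already logged.
    print(age, conditional_survival(hazard_rate, age, extra=1.0))
```

    Each printed value agrees with `reliability(hazard_rate, 1.0)` up to floating-point rounding, which is the property described above.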

    Some devices used in mobile robots do not follow the constant-failure-rate model.

    Devices that fail due to mechanical wearout, such as bearings, are better fitted by

    more complex reliability models. However, the reliability of these devices can be

    approximated piecewise by regions of constant failure rate. This allows for the

    simpler calculations of the exponential model to be used within each segment of

    the approximation [20, p. 44].
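    One way to read the piecewise approximation is that the exponent of Eq. 2.1 becomes a sum of (hazard rate × time) contributions, one per segment. The sketch below assumes that reading. The function name and the bearing hazard rates are hypothetical and are not taken from [20].

```python
import math


def piecewise_reliability(segments, t):
    """Reliability at time t for a piecewise-constant hazard rate.

    `segments` is a list of (end_time, hazard_rate) pairs sorted by end_time;
    each hazard rate applies from the previous end_time up to its own.
    Beyond the last end_time no further hazard is accumulated.
    """
    cumulative_hazard = 0.0
    start = 0.0
    for end, hazard_rate in segments:
        if t <= start:
            break
        dwell = min(t, end) - start  # time spent in this segment before t
        cumulative_hazard += hazard_rate * dwell
        start = end
    return math.exp(-cumulative_hazard)


# Hypothetical bearing: the hazard rate rises in steps as wearout sets in.
bearing = [(1000.0, 1e-5), (2000.0, 5e-5), (3000.0, 2e-4)]
print(piecewise_reliability(bearing, 2500.0))
```

    Within any single segment the calculation reduces to the exponential model, so the series and parallel formulas can still be applied segment by segment.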


  • 2.3 Robots and Tasks

    2.3.1 Robot decomposition

    In order to allow for a systematic evaluation of mobile robot reliability, we have

    developed a formal method for representing robots and their subsystems. For

    our analyses we consider robots to be made of multiple modules, as in Figure

    2.3. We use module here to refer to a specific instantiation of a robot subsystem.

    A subsystem is a functional division of the robot that can be conceived as being

    engineered, assembled and tested independently of other subsystems (Figure 2.4).

    The methods presented here are not dependent on this particular definition of

    module or subsystem, but this definition makes it possible to consider modules

    as interchangeable building blocks for robots, allowing us to use reliability and

    other criteria to choose the best set of modules for a given mission.

    Figure 2.3. Modular robot concept

  • Figure 2.4. NASA Hierarchical System Terminology [21]

    Combining module reliabilities to obtain the reliability of an entire robot is

    straightforward when the constant-hazard-rate model is used. Modules are considered

    to be either in series or parallel. In a series combination, all modules must be

    functioning for the system to function. In a parallel combination, only one module

    must be functioning for the system to function.

    For a series combination the overall reliability is the product of the component

    reliabilities, i.e.,

    $$R_s = \prod_{i=1}^{N} R_i, \tag{2.3}$$

    and the overall hazard rate is the sum of the hazard rates for the modules, i.e.,

    $$\lambda_s = \sum_{i=1}^{N} \lambda_i. \tag{2.4}$$

  • For modules in parallel, the overall unreliability (1 minus the reliability) is the

    product of the component unreliabilities:

    $$(1 - R_S) = \prod_{i=1}^{N} (1 - R_i). \tag{2.5}$$

    If the modules are identical (which is usually the case), then the overall hazard

    rate for the parallel combination is

    $$\lambda_S = \lambda \cdot \left(1 + \frac{1}{2} + \dots + \frac{1}{N}\right)^{-1}. \tag{2.6}$$
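    Equations 2.3 to 2.6 translate directly into a few helper functions. The sketch below is a minimal illustration; the function names are hypothetical and the numeric inputs are made up for the example.

```python
import math


def series_reliability(reliabilities):
    """Eq. 2.3: every module must function, so reliabilities multiply."""
    return math.prod(reliabilities)


def series_hazard_rate(hazard_rates):
    """Eq. 2.4: hazard rates of modules in series add."""
    return sum(hazard_rates)


def parallel_reliability(reliabilities):
    """Eq. 2.5: the system fails only if every module fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_hazard_rate(hazard_rate, n):
    """Eq. 2.6: overall hazard rate for n identical modules in parallel."""
    return hazard_rate / sum(1.0 / k for k in range(1, n + 1))


print(series_reliability([0.99, 0.98]))     # two modules in series
print(parallel_reliability([0.90, 0.90]))   # two redundant modules
print(parallel_hazard_rate(1e-4, 2))        # two identical redundant modules
```

    Note that `math.prod` requires Python 3.8 or later.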

    2.3.2 Module–task and robot–task reliability

    We use task completion as our fundamental utility measure. We assume that the

    mission can be decomposed into distinct tasks and that these tasks are assigned to

    particular robots. Using task completion as our fundamental measure allows us to

    compare different robot and team configurations based on how many tasks they

    can complete, how quickly they can complete tasks, the percentage of a complex

    mission that they can complete, etc.

    To calculate the probability that a module will survive a mission task (module–task

    reliability), the MTTF of the module must be known, along with the expected

    usage of the module during that task. For instance, we might be told that Task

    1 will take six hours, using modules A and B for the entire six hours and using

    module C for three hours.


  • In order to discretize the calculations, we evaluate the probability of failure only

    at the end of a task. We assume that the entire task is completed whether there is a

    failure or not; i.e., all failures occur after completion of the task. This assumption

    does not limit the usefulness of our method because if one needs to know whether

    a robot failed in the middle of a task, the tasks can simply be restated into subtasks

    to provide a desired level of granularity.

    Given the module–task reliability for each module, we can use the equations for

    combining reliabilities (given in Section 2.3.1) to determine the probability that

    the robot will survive the task (robot–task reliability).

    2.3.3 Single-robot example

    We now apply the formulas from the preceding sections to predict the probability

    that a robot will complete a mission task. Consider a planetary exploration rover

    that is tasked to extract core samples. The rover is composed of five modules:

    • Power

    • Computation and Sensing

    • Mobility

    • Communications

    • Manipulator

  • Table 2.1. Module usage during sampling task

    Module                   Usage (h)
    Power                    8
    Computation & Sensing    8
    Mobility                 6
    Communications           2
    Manipulator              4

    The duration of the task is eight hours, and the amount of time each module is

    used during the task is given in Table 2.1.

    For each module, we obtained reliability data from JPL that are representative

    of components used in NASA’s planetary robots. As an example, the breakdown

    of components and reliabilities for the power module is shown in Table 2.2. The

    entire list of component reliabilities is provided in Appendix A.

    Table 2.2. Components comprising power subsystem

    Component                  Quantity    MTTF (h)
    Battery                    2           4.8M
    Battery control board      2           2.5M
    Mission clock              1           10M
    Power distribution unit    1           588k
    Power control unit         1           5.3M
    Shunt limiter              1           88k
    Electrical heater          2           333k
    Radioisotope heater        2           73k
    Thermal switch             2           11k

  • Table 2.3. Robot subsystem reliabilities

    Module                   MTTF (h)
    Power                    4.20k
    Computation & Sensing    4.77k
    Mobility                 19.7k
    Communications           11.9k
    Manipulator              13.8k

    These component reliabilities were combined for each module according to Eq.

    2.4, giving the module MTTFs listed in Table 2.3.
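    The combination step for the power module can be sketched as follows. This is an illustration under a stated assumption, not a reproduction of the original calculation: every unit counted in Table 2.2 is treated as being in series, so a quantity of two contributes twice the hazard rate. The names `POWER_COMPONENTS` and `module_mttf` are hypothetical.

```python
# (quantity, MTTF in hours) for each component type in Table 2.2.
POWER_COMPONENTS = {
    "Battery": (2, 4.8e6),
    "Battery control board": (2, 2.5e6),
    "Mission clock": (1, 10e6),
    "Power distribution unit": (1, 588e3),
    "Power control unit": (1, 5.3e6),
    "Shunt limiter": (1, 88e3),
    "Electrical heater": (2, 333e3),
    "Radioisotope heater": (2, 73e3),
    "Thermal switch": (2, 11e3),
}


def module_mttf(components):
    """Sum the component hazard rates (Eq. 2.4), then invert (Eq. 2.2)."""
    hazard_rate = sum(qty / mttf for qty, mttf in components.values())
    return 1.0 / hazard_rate


print(f"power module MTTF: {module_mttf(POWER_COMPONENTS):.0f} h")
```

    With the rounded values printed in Table 2.2 this gives roughly 4.35k hours, close to but not identical to the 4.20k hours listed in Table 2.3. The gap may be due to rounding of the tabulated component MTTFs, but that is a guess.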

    Using these overall module failure rates and Eq. 2.1, we can calculate the probability

    that each module will still be functioning at the end of the task. For the power

    module, this gives

    $$R = e^{-8/4202} = 99.810\%. \tag{2.7}$$

    The reliabilities for the other modules for this task are found similarly and are

    shown in Table 2.4.

    Table 2.4. Module reliabilities during sampling task

    Module                   Module–Task Reliability
    Power                    99.810%
    Computation & Sensing    99.832%
    Mobility                 99.970%
    Communications           99.983%
    Manipulator              99.971%

  • Finally, we combine all of the module reliabilities using Eq. 2.3 to give an overall

    robot–task reliability of 99.567%.
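    The whole single-robot example can be reproduced in a few lines. The sketch below uses the module MTTFs from Table 2.3 and the usage times from Table 2.1; the variable and function names are hypothetical.

```python
import math

# Module MTTFs in hours (Table 2.3) and usage during the sampling task (Table 2.1).
MODULE_MTTF_H = {
    "Power": 4200.0,
    "Computation & Sensing": 4770.0,
    "Mobility": 19700.0,
    "Communications": 11900.0,
    "Manipulator": 13800.0,
}
USAGE_H = {
    "Power": 8.0,
    "Computation & Sensing": 8.0,
    "Mobility": 6.0,
    "Communications": 2.0,
    "Manipulator": 4.0,
}


def module_task_reliability(mttf_h: float, usage_h: float) -> float:
    """Eq. 2.1 with lambda = 1 / MTTF (Eq. 2.2)."""
    return math.exp(-usage_h / mttf_h)


module_reliabilities = {
    name: module_task_reliability(MODULE_MTTF_H[name], USAGE_H[name])
    for name in MODULE_MTTF_H
}
# Eq. 2.3: the modules are in series, so their reliabilities multiply.
robot_task_reliability = math.prod(module_reliabilities.values())

for name, value in module_reliabilities.items():
    print(f"{name:24s}{value:.3%}")
print(f"{'Robot':24s}{robot_task_reliability:.3%}")
```

    Up to rounding of the tabulated MTTFs, the printed values agree with Table 2.4 and with the overall figure of 99.567%.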


  • 2.4 Summary

    In this chapter, we introduced definitions and models from the reliability engineering

    literature and provided a representation that can be used to apply these models

    to mobile robots. We then demonstrated how our representation can be used to

    predict the probability that a single robot will complete a given task.

    This type of calculation is useful for selecting components from which to build a

    robot to meet mission requirements. For example, given several mobility modules

    with different reliabilities and costs, we can calculate the robot–task reliabilities

    for robots using each alternative and then select the lowest-cost module that meets

    the mission requirements.
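    The selection procedure described above might look like the following sketch. The three candidate mobility modules, their costs, and the 99.5% requirement are all hypothetical. The remaining module MTTFs and usage times are those of Tables 2.3 and 2.1.

```python
import math

# (MTTF in hours, usage in hours) for the four modules that are held fixed.
OTHER_MODULES = [(4200.0, 8.0), (4770.0, 8.0), (11900.0, 2.0), (13800.0, 4.0)]
MOBILITY_USAGE_H = 6.0

# Hypothetical catalogue: (name, MTTF in hours, relative cost).
CANDIDATES = [("A", 5000.0, 1.0), ("B", 19700.0, 2.5), ("C", 50000.0, 6.0)]


def robot_task_reliability(mobility_mttf_h: float) -> float:
    """Series combination of all five modules for the sampling task."""
    exponent = MOBILITY_USAGE_H / mobility_mttf_h
    exponent += sum(usage / mttf for mttf, usage in OTHER_MODULES)
    return math.exp(-exponent)


def cheapest_feasible(candidates, required: float):
    """Return the lowest-cost candidate meeting the requirement, or None."""
    feasible = [c for c in candidates if robot_task_reliability(c[1]) >= required]
    return min(feasible, key=lambda c: c[2]) if feasible else None


choice = cheapest_feasible(CANDIDATES, required=0.995)
print(choice)
```

    In this made-up catalogue the cheapest module falls just short of the requirement, so the search settles on the next cheapest one.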


  • Chapter 3

    MULTIROBOT RELIABILITY

    The reliability engineering methods presented in the previous section fall short

    when applied to multirobot teams. The equations for combining reliabilities of

    subsystems (Eqs. 2.3–2.6) assume that the failure of one subsystem is independent

    of the failure of other subsystems. This is a reasonable assumption when combining

    component reliabilities to create larger assemblies, and even when combining

    assemblies to produce an entire robot. When combining robots to make a robot

    team, however, this assumption is not reasonable in many cases. For most multirobot

    missions, the failure of one robot will affect the tasking of other robots so that

    their reliabilities are not independent. In this chapter we present a method that

    overcomes this limitation, allowing us to calculate the probability of completing a

    multirobot mission.

    3.1 Related Work

    There is considerable work in the multirobot domain that examines how to diagnose

    and/or recover from robot failures. For example, [22] describes a behavior-based

    robot control architecture that is able to adapt to robot failures and communication

    failures, and [23] discusses detection and recovery from multiple types of failure

    in a market-based planner. As in the single-robot domain, our work differs from

    these in that we are developing methods to predict the probability of failure before

    it occurs rather than to respond to failure after it occurs.

    The only known work preceding ours in the area of predicting mobile robot team

    reliability is [3]. That paper’s methods are similar to ours in that they are based in

    the reliability engineering literature, but that work has a narrow focus on teams of

    robots with cannibalistic repair capability. In contrast, we are developing a general

    methodology that can be applied to a wide variety of robot teams and missions.

    We revisit [3] in more depth in Section 3.5.

  • 3.2 Analytical Solutions for Simple Multirobot Missions

    For very simple missions, it is possible to enumerate by hand all of the possible

    outcomes. One way of doing this is by drawing a tree diagram such as in

    Figure 3.1. We can use such a tree to derive an analytical solution for the probability

    of mission completion (PoMC).

    For the two-task, two-robot mission shown in Figure 3.1, the analytical solution is

    $$\mathrm{PoMC} = P(R_1T_1)\,P(R_2T_1)\,P(R_1T_2)\,P(R_2T_2), \tag{3.1}$$

    where $P(R_nT_m)$ is the probability that robot $n$ survives task $m$. If the robots are

    identical, then this becomes

    $$\mathrm{PoMC} = P(T_1)^2\,P(T_2)^2. \tag{3.2}$$

    Figure 3.1. Possible paths for simple mission. (R1+ = Robot 1 alive; R1− = Robot 1 dead)

  • 3.3 Stochastic Simulation for Complex Multirobot Missions

    In more realistic mission scenarios, the failure of one robot will have an impact on

    the probability of failure of the other robots on the team so that the probability

    of mission completion cannot be calculated in a straightforward manner. The

    simplest example of such dependence is when there are a fixed number of tasks

    to be completed and the tasks will be allocated among available robots until all

    tasks are completed or all robots have failed. In this case, when one robot fails,

    there is a greater amount of work to be performed by the remaining robots, which

    increases the probability that they will fail.

    Robot reliabilities are also interdependent when robot tasks are not executed

    independently. This is the case, for instance, when there are tasks that require two

    or more robots to work together. If one of the robots performing a joint task fails,

    perhaps the remaining robots can still complete the task, but with increased stress

    on their components, which then increases their chance of failure. Or perhaps that

    task is abandoned, in which case the remaining robots have a decreased chance of

    failure.

    Another type of reliability interdependence is introduced if the robot team is capable

    of repairing a failed team member. Since repairing a failed robot requires action

    on the part of other robots, the failed robot is repaired at the cost of increased

    probability of failure for the robots executing the repair. Repairing a failed team

    member may therefore in some cases decrease the probability of mission completion.

  • Figure 3.2 illustrates how mission complexity increases when such interdependence

    is introduced. This figure represents the same mission as Figure 3.1, but with the

    addition of the ability to repair one failed robot. The addition of this single repair

    capability has increased the number of leaf nodes from 7 to 25. For a realistic

    scenario with several robots, multiple tasks, and perhaps dozens of spare parts,

    the tree becomes complex enough that a direct analytical solution is infeasible.

    For these more complex missions, we have developed a method of estimating

    mission reliability using stochastic simulation. In this method, we represent the

    mission using a state–transition diagram, as in Figure 3.3. (Details of the mission

    represented by Figure 3.3 are given in Section 3.4.)

    The state machines represented by these diagrams can be implemented in software

    in order to explore the space stochastically. At each task node, the state of the

    robot team is evaluated by choosing a random value between zero and one for

    each module and comparing that value with the module–task reliability for that

    module for the current task. The branch in the diagram corresponding to the resulting

    team state is followed, and the process continues until the simulation reaches

    either Success or Failure.

    Figure 3.2. Same mission as Figure 3.1 but with one repair allowed

    Figure 3.3. State–transition diagram for complex mission

  • The simulation is repeated many times, with each Success result being assigned

    a score of one and each Failure result being assigned a score of zero. The average

    score of a large number of trials then gives the overall probability of mission

    completion.
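    The sampling scheme can be sanity-checked on the simple two-robot, two-task mission of Section 3.2, for which Eq. 3.1 gives the exact answer. The sketch below is illustrative only: the function names are hypothetical, and the module–task reliabilities are made up and deliberately low so that failures occur often enough to be visible.

```python
import math
import random


def robot_survives(module_task_reliabilities, rng) -> bool:
    """Draw one random value per module; the robot survives only if all modules do."""
    return all(rng.random() < r for r in module_task_reliabilities)


def simulate_once(tasks, n_robots, rng) -> int:
    """Score 1 (Success) if every robot survives every task, else 0 (Failure)."""
    for module_rels in tasks:
        for _ in range(n_robots):
            if not robot_survives(module_rels, rng):
                return 0
    return 1


def estimate_pomc(tasks, n_robots, trials, seed=0) -> float:
    """Average score over many trials approximates the PoMC."""
    rng = random.Random(seed)
    return sum(simulate_once(tasks, n_robots, rng) for _ in range(trials)) / trials


# Hypothetical module-task reliabilities: two modules per robot, two tasks.
TASKS = [[0.95, 0.97], [0.90, 0.99]]

# Eq. 3.2 for identical robots: PoMC = P(T1)^2 * P(T2)^2.
analytic = math.prod(math.prod(rels) ** 2 for rels in TASKS)
estimate = estimate_pomc(TASKS, n_robots=2, trials=20000)
print(f"analytic {analytic:.4f}, simulated {estimate:.4f}")
```

    The two values should agree to within sampling error, which shrinks roughly as the inverse square root of the number of trials.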

    While this method has computational limitations, it is a significant improvement

    over the direct analytical method, which can require days of tedious hand calculations

    and has a high potential for human error.


  • 3.4 Example Results for a Complex Multirobot Mission

    Consider a planetary exploration mission where a team of robots is tasked to install

    a solar panel array for a measurement and observation outpost. The mission consists

    of carrying solar panels from the landing site to the outpost and then assembling

    them. The size of the solar panels is such that two robots are needed to carry and

    assemble one panel.

    For the purposes of this analysis, the task of assembling a solar panel is broken

    down into three subtasks:

    • Transit to the outpost;

    • Assemble the panel; and

    • Return to the landing site.

    The state–transition diagram for this mission was shown in Figure 3.3. Working

    through that figure from the top, we see that if there are fewer than two robots

    then the mission is a failure. If there are at least two robots, then if there are no

    panels left to be installed, then the mission is a success. If there are at least two

    robots, and there are panels still remaining to be installed, then the robots will pair

    off and carry panels to the outpost (Transit task). After the Transit task, if there

    are fewer than two robots alive and if there are spare robots at the landing site,

    then the spares will Transit to the outpost until at least two robots are available to

    Assemble or until there are no more spare robots (in the latter case, the mission

    fails). The robots then pair off to Assemble the panels, and any robots that survive

    that task Return to the landing zone.

    For this example all of the robots on the team are identical. The usage times for

    each module for each task are shown in Table 3.1. These usage times along with

    the subsystem reliabilities from Table 2.3 are used to calculate the module–task

    reliabilities for this mission, which are shown in Table 3.2.
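    A heavily simplified version of this mission can be simulated as sketched below. This is not the state machine of Figure 3.3. It assumes that only one pair of robots works at a time while the others wait as spares, that spares are sent one at a time, and that each robot is sampled once per task using its series-combined reliability rather than once per module. All function names and the `mttf_scale` parameter are hypothetical.

```python
import math
import random

MTTF_H = {"power": 4200.0, "compute": 4770.0, "mobility": 19700.0,
          "comms": 11900.0, "manipulator": 13800.0}            # Table 2.3
USAGE_H = {                                                     # Table 3.1
    "transit":  {"power": 6, "compute": 6, "mobility": 6, "comms": 2, "manipulator": 0},
    "assemble": {"power": 8, "compute": 4, "mobility": 8, "comms": 4, "manipulator": 8},
    "return":   {"power": 6, "compute": 6, "mobility": 6, "comms": 2, "manipulator": 0},
}


def robot_task_reliability(task, mttf_scale=1.0):
    """Series combination of the module-task reliabilities for one robot."""
    exponent = sum(h / (mttf_scale * MTTF_H[m]) for m, h in USAGE_H[task].items())
    return math.exp(-exponent)


def survivors(count, reliability, rng):
    """Number of robots, out of `count`, that survive a task."""
    return sum(rng.random() < reliability for _ in range(count))


def simulate_mission(n_robots, n_panels, rng, mttf_scale=1.0):
    rel = {task: robot_task_reliability(task, mttf_scale) for task in USAGE_H}
    alive, panels_left = n_robots, n_panels
    while True:
        if alive < 2:                 # checked first, as in the text
            return False
        if panels_left == 0:
            return True
        spares = alive - 2
        at_outpost = survivors(2, rel["transit"], rng)
        while at_outpost < 2 and spares > 0:    # send spares until a pair is present
            spares -= 1
            at_outpost += survivors(1, rel["transit"], rng)
        if at_outpost < 2:
            return False
        panels_left -= 1              # task completes; failures are assessed afterwards
        at_outpost = survivors(2, rel["assemble"], rng)
        alive = spares + survivors(at_outpost, rel["return"], rng)


def estimate_pomc(n_robots, n_panels, trials=2000, seed=0, mttf_scale=1.0):
    rng = random.Random(seed)
    wins = sum(simulate_mission(n_robots, n_panels, rng, mttf_scale) for _ in range(trials))
    return wins / trials


for n in (2, 3, 4, 5):
    print(n, "robots:", estimate_pomc(n, n_panels=30))
```

    Because of the simplifications, the numbers printed by this sketch should not be expected to match the thesis figures; it is meant only to show how the loop of checks and tasks can be turned into code.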

    For the example mission scenario described above, once the tasks, the task durations,

    and the baseline module reliabilities are established, then the input variables for

    the model are

    • the number of robots on the team,

    • the reliability of the components used, and

    • the mission duration (number of panels to be installed).

    Table 3.1. Subsystem usage by task (h)

    Subsystem                Transit    Assemble    Return
    Power                    6          8           6
    Computation & Sensing    6          4           6
    Mobility                 6          8           6
    Communications           2          4           2
    Manipulator              0          8           0

  • By examining how the probability of mission success varies as these inputs are

    changed, we can answer questions such as

    • For a given mission duration and component reliability, what is the fewest

    number of robots needed to meet a certain probability of mission completion?

    and

    • If additional robots are added beyond the minimum number, can we use

    lower reliability components, and if so, how much lower?

    We explore these questions in Sections 3.4.1 and 3.4.2, respectively.

    3.4.1 Comparing teams having different numbers of robots

    Figure 3.4 compares the simulation results for teams with different numbers of

    robots, with all robots having the component reliabilities listed in the above tables.

    We see from this figure that adding even one robot beyond the minimum (two)

    increases the probability of mission success dramatically, even for relatively short

    missions. However, there is a diminishing improvement as additional robots are

    added to the team. We can use this figure to answer the first question above. For

    example, for a mission specifying that 30 panels are to be installed with a probability

    of mission completion of at least 95%, the team must include at least four

    robots (Figure 3.5).

    Table 3.2. Module–task reliabilities

    Subsystem                Transit    Assemble    Return
    Power                    99.86%     99.81%      99.86%
    Computation & Sensing    99.87%     99.92%      99.87%
    Mobility                 99.97%     99.96%      99.97%
    Communications           99.98%     99.97%      99.98%
    Manipulator              100%       99.94%      100%

    Figure 3.4. Different numbers of robots (probability of mission completion (%) versus mission duration (number of panels); curves for 2, 3, 4, and 5 robots)

    Figure 3.5. Closeup of area of interest from Figure 3.4 (same axes and curves, with the design point marked)

  • 3.4.2 Comparing teams with robots having different reliabilities

    If additional robots are added beyond the minimum required, it should be possible

    to use less-reliable components in those robots and still achieve a required mission

    reliability. Figure 3.6 shows the simulation results for teams of four robots with

    component reliabilities ranging from 10% to 100% of the baseline amounts from

    Table 2.3.

    When varying the reliability of the components, we apply a constant multiplier

    to all of the subsystem MTTF values in Table 2.3. For instance, when we refer

    to a team with 10% of the MTTF of the baseline team, we are multiplying all the

    values in Table 2.3 by 10%.

    Figure 3.6 shows that for very short missions a team of four robots with only 10%

    of the reliability of the baseline team can provide a higher probability of mission

    completion compared to the baseline two-robot team. As the length of the mission

    increases, the reliability required for the four-robot team to equal the performance

    of the baseline team increases, but the four-robot, 50%-lower-MTTF team still

    outperforms the baseline team even for fairly long missions (on the order of a

    year).

    Figure 3.6. Different component reliabilities (probability of mission completion (%) versus mission duration (number of panels); curves for 2 robots (100), 4 robots (50), 4 robots (25), and 4 robots (10))

  • 3.5 Example – Repairable vs. Nonrepairable Robot Teams

    As mentioned earlier, there is one previous paper ([3]) in the literature that looks

    at reliability as a design parameter for mobile robot teams. In this section we

    compare our method to the one in that paper by analyzing the example mission

    given in that paper.

    The mission considered in [3] is one where a team of robots is moving dirt. The

    dirt-moving task is a continuous task, where the amount of dirt moved is proportional

    to the total robot lifetime, where total robot lifetime is the sum of the lifetimes of

    all robots on the team.

    The robots making up a team are identical and are made of discrete modules.

    When an individual module fails, a robot is dead. During its lifetime each robot

    moves dirt at a constant rate.

    The basic comparison made in [3] is between teams of repairable and nonrepairable

    robots. For repairable teams, a robot can be repaired by a teammate using spare

    modules. The spare modules are taken from other failed robots – at the beginning

    of the mission there are no spares. Two conditions are therefore necessary for

    repair to take place: There must be a functional robot to execute the repair, and

    there must be spare modules of the correct type available. No time elapses

    during a repair, and the repair task does not itself contribute to robot failure.

  • Using the method described in Section 3.3, we simulated this mission. Figures

    3.7, 3.8, and 3.9 show, on the left, the results presented in [3] and, on the right, our

    results. These figures show that, qualitatively, our results are very similar to those

    in the previous paper.

    Figure 3.7. Total work completed; two-component robots (left figure from [3]) (units of work completed versus number of robots; curves for nonrepairable and repairable teams)

    Figure 3.8. Percent improvement of repairable team over nonrepairable team; two-component robots (left figure from [3]) (percent increase in work completed, repairable/nonrepairable, versus number of robots)

    One thing that is not specified in [3], and that makes exact comparison difficult, is

    the failure rate, λ. Figure 3.10 shows the same results as Figure 3.8 for several

    values of λ. While the overall conclusion (that repairable teams are superior)

    remains the same, the degree of superiority depends highly on the failure rate. The

    effects of varying failure rate are not addressed in [3].

    Figure 3.9. Total work completed, six-component robots (left figure from [3]) (units of work completed versus number of robots; curves for nonrepairable and repairable teams)

    Figure 3.10. Effect of failure rate on repairable team superiority (percent increase in work completed, repairable/nonrepairable, versus number of robots; curves for λ = 0.80, 0.84, 0.88, 0.92, 0.96, and 0.99)

    These results show that our method is capable of achieving results similar to the

    method in [3]. What is different is that the method used in that paper is an analytical

    method, similar to that presented in Section 3.2 of this document and with all the

    shortcomings of that method. The mission scenarios addressed in [3] are very

    simplistic, and that paper fails to address the difficulty of using analytical methods

    for complex missions. The most complex mission scenario presented in that paper

    considers a team with three robots and two nonidentical modules, for which the

    solution is given as

    $$\frac{18 l_1^2 + 49 l_1 \cdot l_2 + 18 l_2^2}{(l_1 + l_2)(3 l_1 + 2 l_2)(2 l_1 + 3 l_2)}. \tag{3.3}$$

    The amount of time required to develop such analytical solutions, and the significant

    likelihood of human error in their derivations, make these methods undesirable

    even for fairly simple missions. They become impractical for missions of any

    significant complexity.


  • 3.6 Summary

    In this chapter, we showed how reliability prediction for multirobot teams is often

    a different type of problem than for single robots due to the interdependence of

    robot reliabilities, making analytical reliability solutions impractical for multirobot

    missions that have significant complexity. We introduced a method using stochastic

    simulation to estimate mission reliabilities for such missions, and we demonstrated

    the use of this method to determine the optimal team size for a multirobot mission.

    Finally, we used this method to analyze the relative effectiveness of repairable and

    nonrepairable robot teams in revisiting a problem previously introduced into the

    literature by [3]. Our results here demonstrate that our method can produce similar

    results to the prior work, while also allowing for analysis beyond that shown in the

    prior work.


  • Chapter 4

    DESIGN TRADEOFFS

    The methods presented in the previous chapters provide estimates of the probabilities

    of task and mission completion. We have shown how these estimates can be used

    to compare the performance of different robot teams. However, these reliability

    estimates by themselves are not terribly useful for mission design. If reliability

    existed in a vacuum, then we would simply build the most reliable robots possible

    for every mission. In designing a real-world mission it is necessary to consider

    other performance metrics and trade them off against reliability. In this chapter we

    explore some of the possible tradeoffs that can be made.

    4.1 Cost

    One of the most important factors in robot mission design is cost. For a given

    mission, we would like to be able to determine which team configuration will

    meet the mission specifications, including reliability, at the lowest cost.

    The reliability of planetary rovers is related to overall mission cost in two ways.

  • First, there is the increased cost associated with building higher-reliability rovers.

    Second, there is the increased expected value of the mission when using

    higher-reliability rovers due to a higher probability of mission success.

    4.1.1 Cost of reliability

    In choosing components from which to build rovers, a designer would usually

    make choices among a small number of alternative components, each providing a

    certain reliability for a certain cost. In the early stages of mission design, however,

    the mission designer may not yet have information about specific components. In

    this case, it is useful to have a parametric model of the cost–reliability relationship.

    Reference [24] provides a general model for this relationship, which is given as

    $$c = \exp\left\{ (1 - f) \cdot \frac{R_i - R_{min}}{R_{max} - R_i} \right\}, \tag{4.1}$$

    where $R_i$ is a reliability of interest between $R_{min}$ and $R_{max}$; $f$ is the feasibility of

    reliability improvement (a number between 0 and 1); and $c$ is the ratio of the cost

    of $R_i$ to the cost of $R_{min}$.

    Figure 4.1 shows the relative cost of rovers with differing component reliabilities.

    The costs are plotted as a percentage of the baseline rover cost, using $R_{min} = 0$,

    $R_{max} = 1$ and $f = 0.95$.
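    Equation 4.1 is simple to evaluate directly. The sketch below is a minimal illustration with a hypothetical function name. How the "percentage of baseline" axis of Figure 4.1 maps onto the reliability of interest is not reproduced here.

```python
import math


def relative_cost(r_i: float, f: float, r_min: float = 0.0, r_max: float = 1.0) -> float:
    """Eq. 4.1: cost of achieving reliability r_i relative to the cost of r_min."""
    if not (r_min <= r_i < r_max):
        raise ValueError("r_i must satisfy r_min <= r_i < r_max")
    if not (0.0 <= f <= 1.0):
        raise ValueError("f must lie between 0 and 1")
    return math.exp((1.0 - f) * (r_i - r_min) / (r_max - r_i))


for r in (0.5, 0.9, 0.99, 0.999):
    print(r, relative_cost(r, f=0.95))
```

    The cost ratio equals one at the minimum reliability and grows without bound as the reliability of interest approaches the maximum, which is why the upper bound is excluded from the valid range.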

    Launch costs are also significantly affected by rover reliability. More-reliable

    rovers will weigh more, due to the generally larger size of more-reliable components

    and also due to increased component redundancy. We have not found a model for

    the reliability–weight relationship in the literature. As an initial approximation we

    assume that the relationship between weight and reliability is directly linear and

    that the relationship between launch costs and weight is also directly linear.

    4.1.2 Expected mission reward

    Any robotic mission must have some inherent value to it. For some missions there

    will be an obvious economic or strategic value to which a dollar amount can be

    assigned. For a mission that lacks such an obvious dollar value, the cost of the

    mission itself can be used as a lower bound for this inherent mission value, since

    the sponsors presumably expect some positive return on their investment.

    Figure 4.1. Relative cost of rovers as function of component reliability (rover cost (% of baseline team) versus component reliability (% of baseline); curves for f = 0.95, f = 0.90, and f = 0.70)

    Multiplying the probability of mission success by the inherent value of the mission

    gives an expected reward for a given team configuration. For example, Figure 4.2

    shows the relationship between component reliability and expected mission value

    for a six-rover team performing the solar-panel-assembly mission described in

    Chapter 3.

    4.1.3 Overall cost–reliability relationship

    Taking the expected mission value calculated above and subtracting the rover

    development and launch costs gives an estimate of the net expected gain for the

    mission. We ignore operating costs here since we expect them to be roughly constant

    with respect to rover reliability (probably slightly higher for lower-reliability

    rovers due to the increased need for human intervention).

    Figure 4.2. Expected value of mission as a function of component reliability.

    [Plot: expected value (% of max value) versus component reliability (% of Table 3 values).]

    In order to combine these costs meaningfully, we assign real dollar values to the

    various costs for the baseline team (Table 4.1). These values are estimated from

    the costs of the MER mission, along with the assumption that the rovers for this

    mission would be somewhat cheaper and smaller than the MER rovers due to

    advances in technology and also because they are single-purpose machines.

    These values are used to calculate the net expected gain, which is plotted in

    Figure 4.3a along with its constituent parts. The most significant thing revealed

    by this figure is that there is clearly an optimal reliability range with respect to the

    expected gain of the mission and that this optimal reliability is significantly lower

    than the reliability of the baseline legacy design.

    Figure 4.3a shows that for low-reliability rovers the cost of failure drives the net

    expected gain down, while for very-high-reliability rovers the high cost of the

    rovers themselves drives the expected gain down. The optimal reliability range

    therefore lies in a middle region where neither of these costs is as high.
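    The net expected gain described above is simple arithmetic, and a minimal sketch may make the bookkeeping concrete. The function name and the probability argument are illustrative assumptions. The dollar figures are the baseline values from Table 4.1; rover and launch costs at other reliability levels would have to come from the cost model of Section 4.1.1, which this sketch does not reproduce.

```python
def net_expected_gain(p_success, mission_value, rover_cost, launch_cost):
    """Expected mission value minus rover and launch costs (all in $ millions).

    p_success is the probability of mission completion for the team being
    evaluated; it is an input here, not something this sketch computes.
    """
    expected_value = p_success * mission_value
    return expected_value - rover_cost - launch_cost


# Baseline team from Table 4.1: a mission that is certain to succeed only
# breaks even, because the inherent value equals the total mission cost.
baseline_gain = net_expected_gain(1.0, mission_value=450, rover_cost=150, launch_cost=300)

# Same costs with the inherent value doubled to $900M (the value used in Figure 4.3c).
doubled_value_gain = net_expected_gain(1.0, mission_value=900, rover_cost=150, launch_cost=300)
```

    Any probability of success below 1 makes the baseline gain negative under these figures, which is consistent with treating the mission cost as a lower bound on the inherent value.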

    Table 4.1. Baseline team costs and rewards

    Item                         Cost ($ millions)
    Robot cost (entire team)     150
    Launch cost (entire team)    300
    Inherent value of mission    450

    In order to evaluate the effects of some of our assumptions, we repeated the above

    analysis for different values of the feasibility constant (since this value was arbitrary)

    and of the mission inherent value (since we used a lower-bound estimate for this

    value). These results are shown in Figures 4.3b and 4.3c. These figures show that

    while the shape of the expected gain curve changes with these parameters, the

    overall trends remain the same: Both figures support the argument that the optimal

    range for mission reliability with respect to mission gain is at a lower level than

    we would intuitively expect.

    Figure 4.3. Net expected gain. [Three plots of $ (millions) versus component

    reliability (% of baseline), each showing expected value, rover cost, launch cost,

    and expected gain: (a) f = 0.95, value = $450M; (b) f = 0.5, value = $450M;

    (c) f = 0.95, value = $900M.]

  • 4.2 Example – Multirobot Team Size

    Using the reliability–cost relationship presented in Section 4.1, we revisit the solar

    panel mission from Chapter 2, with the goal of addressing a claim that has been

    made in the literature about one benefit of multirobot systems.

    4.2.1 Introduction

    Applications of multirobot systems can be divided into two categories: those

    where multiple robots are necessary for task completion and those where a single

    robot could complete the task but where multiple robots are desirable for reasons

    other than task completion. An example application falling into the first category

    is soccer – a single robot cannot play soccer. An example application in the second

    category is area coverage – while in many cases an area can be covered by a single

    robot, it may be preferable to use more than one robot in order to cover the area

    more quickly.

    When the mission itself does not dictate a particular robot team configuration,

    there are multiple requirements that a mission designer must consider. Three

    important factors that we consider here are time, cost, and reliability.

    Time can be a reason for using more robots than the minimum required because,

    for some tasks, having extra robots can reduce the time required to complete the

    task. For instance, in an area coverage task, multiple robots can work in parallel in

    order to accomplish the task more quickly.

    Cost is an important consideration in team size. There is the cost of additional

    robots. There is the cost of robot components – more robust components cost more.

    There are operating costs such as transportation and maintenance, which may be

    higher for a larger team. Infrastructure costs are likely to be greater for a larger

    team; for instance, a larger team may require more communications bandwidth.

    The third performance criterion we consider here is reliability, expressed as the

    probability of mission completion (PoMC). A requirement for a mission to have a

    certain probability of successful completion can dictate the minimum number of

    robots required for the mission. For example, if one robot has a 90% probability

    of surviving a task, but the mission requirement is for a 97% probability of having

    one robot survive the task, then one way to meet this requirement is by sending

    two robots (giving a 99% chance that one would survive).
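    The arithmetic behind this example can be sketched as follows. This is only an illustration: it assumes identical robots whose failures are independent, and the function names are hypothetical.

```python
def prob_at_least_one_survives(p_single, n_robots):
    """Probability that at least one of n independent, identical robots survives."""
    return 1.0 - (1.0 - p_single) ** n_robots


def min_robots_for_target(p_single, target):
    """Smallest team size whose at-least-one-survivor probability meets the target."""
    if not 0.0 < p_single <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < p_single <= 1 and 0 <= target < 1")
    n = 1
    while prob_at_least_one_survives(p_single, n) < target:
        n += 1
    return n


# The example from the text: 90% single-robot survival, 97% requirement.
robots_needed = min_robots_for_target(0.90, 0.97)
achieved = prob_at_least_one_survives(0.90, robots_needed)
```

    With these inputs the sketch returns a team of two robots with a survival probability of about 99%, matching the figures quoted above.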

    These criteria (time, cost, reliability) are highly interdependent. As an example,

    adding more robots to a mission increases the cost, but it can also reduce the amount

    of time required to complete the mission. Reducing the mission duration means

    that the robots don’t need to survive as long, so they can be built of lower-reliability

    components, which reduces the cost.

    These relationships among team size, component reliability, cost, time, and mission

    success have been mentioned in the robotics literature, but only in passing and

    only in qualitative terms. In particular, researchers often claim that multirobot

    systems provide greater reliability than single-robot systems (e.g., [25], [26], [27],

    [28]).

    Superficially, such a claim seems obviously true – if three robots are sent to do a

    task instead of one, there is a greater chance of completing the task. When one

    examines the above claim in greater depth, however, finding the answer can be

    complicated. In this example, the cost of completing the task has been tripled

    by sending three robots. If these same additional funds were instead invested

    to improve the reliability of a single robot, then which would be more likely to

    complete the task – the three robots or the single superior robot? The answer is no

    longer obvious.

    4.2.2 Analysis

    We briefly remind the reader of the mission previously described in Chapter 2:

    • A team of robots is tasked to transport and assemble solar panels.

    • The solar panels are large, so that two robots are required to carry and assemble

    each panel.

    The baseline team consists of a pair of highly reliable robots. Using the cost–reliability

    relationship in Eq. 4.1, we can determine alternative team configurations with

    the same overall cost. For example, we find that a team with four robots, each

    made of components with 40% of the MTTF of the baseline components, would

    cost about the same, using a feasibility of 0.5. Figure 4.4 shows the simulation

    results for these two teams. We see here that the team with four lower-reliability

    robots has a higher mission reliability than the baseline team for missions shorter

    than 85 panels. The larger, lower-reliability team would therefore be the more

    cost-effective solution for shorter missions, while the smaller, high-reliability team

    would be more cost-effective for longer missions.

    Figure 4.4. Comparison of equal-cost teams. [Plot: PoMC (%) versus mission duration

    (number of panels) for the two-robot team with 100% of baseline component MTTF (2R)

    and the four-robot team with 40% of baseline component MTTF (4R).]

  • 4.3 Operating Conditions

    The reliability engineering methods presented in Chapter 2 lack an explicit accounting

    of operating and environmental conditions. Much of reliability engineering was

    originally developed for analysis of systems installed in fairly static environments

    such as nuclear power plants. Mobile robot components are exposed to dynamic

    operating and environmental conditions, particularly in the case of planetary exploration

    rovers, which, for example, are subjected to temperature differences of hundreds

    of degrees between day and night. The reliabilities of many of the components

    in a mobile robot will vary under different operating conditions. It is therefore

    necessary to examine how the standard reliability engineering methods can be

    adapted to take into account varying operating conditions.

    4.3.1 Extrapolation of MTTF to other operating points

    The MTTF provided by a device manufacturer represents the hazard rate under a

    single set of operating conditions. In order to make reliability predictions over a

    range of operating conditions, we need to extrapolate MTTF at different operating

    conditions from the single-point MTTF. Models of how operating conditions

    affect reliability are available for many components. These relationships are used,

    for instance, in accelerated-life testing, where devices are subjected to extreme

    operating conditions in order to induce failure, and the observed failure rates are

    then extrapolated back to normal operating conditions.

    An example of a robot component whose reliability is affected by operating

    conditions is a mechanical bearing. Such bearings are often found in robot motors

    and joints. The failure rate of mechanical bearings is significantly affected by

    operating conditions such as temperature, rotational speed, and load. Here we

    show how the single-point MTTF for a mechanical bearing can be extrapolated

    over a range of temperature and load conditions.

    Reliability of bearings is often expressed by the L10 life, which is the time at which

    10% of the population has failed. For a mechanical bearing the L10 value is given

    by

    \[ L_{10} = \left(\frac{C}{P}\right)^{d} \cdot \left(\frac{10^{6}}{60\,n}\right), \tag{4.2} \]

    where C is the rated bearing load, P is the actual bearing load, d reflects the type

    of bearing (d = 3.0 for a ball bearing, d = 3.3 for a roller bearing), and n is the

    rotational speed [29].

    Holding the speed constant and using d = 3.0, we find that the life is related to the

    applied load as

    \[ \frac{L_{10}}{L_{10,0}} = \left(\frac{P_0}{P}\right)^{3}, \tag{4.3} \]

    where the subscript 0 indicates the manufacturer’s published reliability data.

    To relate L10 life and hazard rate, we use Eq. 2.1 with R = 90%, giving

    \[ \lambda = \frac{-\ln(0.9)}{L_{10}}. \tag{4.4} \]

    Combining Eq. 4.3 with Eq. 4.4 gives the relationship between hazard rate and

    operating load:

    \[ \frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3}. \tag{4.5} \]
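    The chain from L10 life to hazard rate can be checked numerically with a short sketch. The function names and the bearing figures below are hypothetical. The sketch also assumes that the rotational speed is given in revolutions per minute, in which case the 10^6 / (60 n) factor yields a life in hours.

```python
import math


def l10_life(rated_load, actual_load, speed, exponent=3.0):
    """L10 life from Eq. 4.2.

    Assumes speed is in revolutions per minute, so the result is in hours.
    The default exponent of 3.0 corresponds to a ball bearing.
    """
    return (rated_load / actual_load) ** exponent * 1e6 / (60.0 * speed)


def hazard_rate(l10):
    """Constant hazard rate implied by an L10 life (Eq. 4.4)."""
    return -math.log(0.9) / l10


# Hypothetical ball bearing: rated load 10 kN, run at 1 kN and then at 2 kN, 1000 rpm.
lam_ref = hazard_rate(l10_life(10.0, 1.0, 1000.0))
lam_heavy = hazard_rate(l10_life(10.0, 2.0, 1000.0))

# Eq. 4.5 predicts that doubling the load multiplies the hazard rate by 2 ** 3.
ratio = lam_heavy / lam_ref
```

    The rated load and speed cancel in the ratio, which is why Eq. 4.5 needs only the load relative to the reference load.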

    Bearing life is also greatly affected by temperature since the lubricant in the bearing

    breaks down faster at higher temperatures. The approximate relationship used for

    the effect of temperature on bearing failure is that every 10 °C rise in temperature

    doubles the failure rate [30], or

    \[ \frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = 2^{\left(\frac{T - T_0}{10}\right)}. \tag{4.6} \]

    We can combine multiple environmental factors, assuming that they are independent.

    In this case we can determine the effect of combined load and temperature changes

    on the MTTF of a bearing, which is

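    \[ \frac{\lambda}{\lambda_0} = \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} = \left(\frac{P}{P_0}\right)^{3} \cdot 2^{\left(\frac{T - T_0}{10}\right)}. \tag{4.7} \]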

    Eq. 4.7 is plotted in Figure 4.5. This figure shows that MTTF varies greatly even

    over a fairly small range of temperatures and loads. This illustrates why the

    single-point MTTF provided by manufacturers is inadequate to describe the reliability

    of devices operating under significantly different conditions from those under

    which the MTTF was established.
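    As a rough illustration of how quickly the MTTF changes, Eq. 4.7 can be evaluated directly. The function name and the reference values below are assumptions chosen for the example, not data from a particular bearing.

```python
def mttf_at_conditions(mttf0, load, load0, temp_c, temp0_c, exponent=3.0):
    """Extrapolate a single-point MTTF to another load and temperature (Eq. 4.7).

    mttf0 is the MTTF at the reference load load0 and reference temperature
    temp0_c (degrees Celsius). The load and temperature effects are treated
    as independent, as in the text.
    """
    hazard_ratio = (load / load0) ** exponent * 2.0 ** ((temp_c - temp0_c) / 10.0)
    return mttf0 / hazard_ratio


# Hypothetical reference point: MTTF of 1000 hours at unit load and 40 degrees C.
hotter = mttf_at_conditions(1000.0, load=1.0, load0=1.0, temp_c=50.0, temp0_c=40.0)
heavier = mttf_at_conditions(1000.0, load=2.0, load0=1.0, temp_c=40.0, temp0_c=40.0)
both = mttf_at_conditions(1000.0, load=1.5, load0=1.0, temp_c=60.0, temp0_c=40.0)
```

    In this sketch a 10 °C rise halves the MTTF and a doubled load cuts it to one eighth, while a 50% load increase combined with a 20 °C rise leaves roughly 7% of the reference MTTF.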


  • 4.3.2 Operating envelope

    Figure 4.6 shows some of the lines of constant MTTF resulting from Eq. 4.7.

    These lines illustrate how operating conditions can be traded off against one another

    as well as against reliability. For instance, if a robot is to be operated in a

    high-temperature environment, it may be desirable to operate the robot motors

    at lower speeds in order to compensate for the increased ambient temperature.

    On the other hand, if the speed of the robot was a critical mission requirement,

    then we could continue to operate the robot at full speed, but with a quantitative

    understanding of the tradeoff being made with respect to reliability.
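    One way to make such a tradeoff quantitative is to solve Eq. 4.7 for the load that keeps the MTTF at a chosen level as temperature changes. The sketch below does this for the load and temperature factors only, since those are the two variables in Eq. 4.7. The function name and the example numbers are hypothetical.

```python
def load_ratio_for_target_mttf(mttf_ratio, delta_temp_c, exponent=3.0):
    """Load ratio P/P0 that yields MTTF/MTTF0 = mttf_ratio at T - T0 = delta_temp_c.

    Obtained by rearranging Eq. 4.7:
        (P/P0) ** exponent = (MTTF0/MTTF) / 2 ** ((T - T0) / 10)
    """
    hazard_ratio = 1.0 / mttf_ratio
    return (hazard_ratio / 2.0 ** (delta_temp_c / 10.0)) ** (1.0 / exponent)


# Load needed to keep the reference MTTF when operating 30 degrees C hotter.
same_mttf_hot = load_ratio_for_target_mttf(mttf_ratio=1.0, delta_temp_c=30.0)

# Load permitted at the reference temperature if half the MTTF is acceptable.
half_mttf = load_ratio_for_target_mttf(mttf_ratio=0.5, delta_temp_c=0.0)
```

    Under these assumptions, holding the MTTF constant through a 30 °C rise requires halving the load, and accepting half the MTTF at the reference temperature permits a load about 26% higher.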

    Such tradeoffs could be automated in a sophisticated rover that would monitor

    ambient conditions and modify its mission profile in order to maintain a target

    mission reliability, in much the same way that human workers will slow down

    when working under adverse environmental conditions.

    Figure 4.5. Effect of operating conditions on bearing MTTF

    Figure 4.6. Lines of constant MTTF

  • 4.4 Summary

    In this chapter, we showed how reliability can be traded off against other mission

    design parameters. We first presented a cost–reliability relationship from the literature

    and used this to examine how the various costs of a planetary mission contribute

    to the overall expected value of the mission. Our results suggest that building

    planetary rovers to the highest levels of reliability may not be cost-effective. We

    also made use of the cost–reliability relationship to provide a quantitative evaluation

    of the claim that teams with more lower-reliability robots are more reliable than

    teams with fewer higher-reliability robots. Our results in this case show that this

    claim is not universally true but must be evaluated in the context of specific mission

    parameters. Finally, we looked at how operating conditions affect reliability –

    specifically looking at how temperature and operating load affect the expected

    life of a mechanical bearing.


  • Chapter 5

    MISSION PLANNING

    The previous chapters demonstrate how reliability can be used in the design of

    robots and multirobot teams. In this chapter and the next we consider the role

    of robot reliability in the process of mission planning for mobile robot teams.

    Specifically, we examine here how knowledge of robot reliabilities can be used

    to improve task allocation in the context of the multirobot exploration problem.

    We take a simple exhaustive planner and compare the plan it chooses against the

    optimal plan that takes into account robot failures and the backup plans that occur

    after failure. Our results show that for this problem domain, making an initial plan

    without regard to individual robot reliabilities results in choosing a suboptimal

    plan most of the time and that the difference in mission performance between the

    chosen plan and the optimal plan is usually substantial.


  • 5.1 Background

    For multirobot missions, it is necessary to allocate tasks among the team members

    as part of the mission planning process. The specific task allocation chosen affects

    the probabilities of failure for the robots, and thus the mission, since failure is a

    function of usage.

    In reviewing the robot mission planning literature, we find that there has been

    substantial work in the area of detecting and recovering from robot failures (e.g.,

    [14, 22, 31, 32]) and that several multirobot mission planning systems provide

    mechanisms for reallocation of tasks among surviving team members after a robot

    failure (e.g., [23, 33, 34]). However, all of these methods are reactive rather than

    predictive, dealing with failure only after it occurs. Reference [23], for example,

    describes a mission planning system that is able to recover from robot failure

    because tasks can be reallocated. In this system, tasks are auctioned off to the

    robot with the highest bid (or lowest, depending on the utility metric). This system

    allows for task reallocation during the mission when new information changes the

    valuation of tasks. For instance, if a robot suffers a component failure that impairs

    its ability to perform its assigned tasks, it will change its valuation for those tasks,

    and it can then subcontract tasks to another robot that has a better valuation for

    those tasks.

    While it is important to recover from robot failures, it would be better to minimize

    the likelihood of such failures in the first place. One way to do so is to design

    the robots to an appropriate level of reliability for the mission requirements, as

    described in the previous chapters of this document. Another way is to operate the

    robots in a way that minimizes the likelihood of failure. As discussed in

    Chapter 4, operating conditions are one execution-time consideration in the probability

    of robot failure.

    Another operational consideration in robot failure is the assignment of tasks to

    robots and the ordering of those tasks. The mission planning system influences

    the likelihoods of robot failures because the initial assignment of tasks to robots

    plays a role in determining the probabilities of robot failures during the mission.

    For example, assigning a robot with a weak or damaged drive motor to a task that

    requires it to travel a long distance results in a higher probability of that robot

    failing than if the robot were assigned to a task that required less travel. This example

    is intuitive because it assumes heterogeneous robots, but in this chapter we demonstrate

    that even when team members are homogeneous, the assignment of tasks to robots

    has a significant influence on robot failure. We are not aware of any existing work

    that addresses the use of robot reliability information to improve multirobot task

    allocation in this way.

    One way to incorporate such reliability concerns into multirobot mission planning

    would be to introduce a reliability component into the utility metric used by the

    planner. Such an approach is unsatisfying for two reasons. The first is the

    incommensurability of different components of a utility measure – how do we

    combine dollars spent, meters traveled, and probability of failure into a single

    metric? The second is that establishing a numeric reliability requirement is itself

    a difficult problem that has been minimally explored for the mobile robot domain

    – i.e., how do we decide if the reliability requirement for a mission should be 95%

    rather than 96%?

    In order to avoid these difficulties, we take a different approach: Rather than devising

    new utility metrics that explicitly incorporate reliability, we instead look at how

    robot reliability affects the utility metrics already being used. In plain language,

    what we are not doing is taking “Find the solution with the shortest time” and

    turning it into “Find the solution with the shortest time that also meets reliability

    level X” but instead turning it into “Find the solution with the shortest expected

    time” where the expected time takes into account the alternative outcomes that

    occur when robots fail.


  • 5.2 Illustrating Example

    Consider a simple multirobot exploration mission with two identical robots and

    two locations to be visited (Figure 5.1). The goal of the mission is for all target

    locations to be visited in any order by any robot in the shortest total mission time

    (in other terms, to minimize the makespan). Time is assumed here to be proportional

    to distance traversed.

    Figure 5.1. Exploration mission. [Plot: locations of the robots (R1, R2) and

    targets (T1, T2) in the x–y plane.]

    Each robot is defined by an (x, y) location and a reliability Pt, which is the probability

    of surviving a one-unit traverse. Each target is defined by its (x, y) location. The

    robot and target parameters used for this example are listed in Table 5.1 and illustrated

    by Figure 5.1.

    Table 5.1. Robot and target parameters

                x     y     Pt
    Robot 1     4     12    0.99
    Robot 2     14    3     0.99
    Target 1    1     1     -
    Target 2    3     5     -

    For a small number of robots and targets, it is feasible to exhaustively enumerate

    the possible task assignments and then calculate the distance that each robot must

    traverse to accomplish each plan (Table 5.2). The plan duration (dplan) is equal

    to the greatest distance that any robot travels during that plan. The plan with the

    smallest duration is then chosen. In this example, Plan B (Figure 5.2a) would be

    chosen.
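    The exhaustive enumeration described above can be sketched in a few lines. This is an illustrative reconstruction rather than the planner used in this work: the function names are hypothetical, the robot and target locations are those of Table 5.1, and distances are assumed to be straight-line (Euclidean).

```python
from itertools import permutations, product
from math import dist

# Locations from Table 5.1.
robots = {"R1": (4, 12), "R2": (14, 3)}
targets = {"T1": (1, 1), "T2": (3, 5)}


def enumerate_plans(robots, targets):
    """Return every plan as a dict mapping robot name -> ordered tuple of targets."""
    plans = []
    target_names = list(targets)
    # Choose an owner for each target, then every visiting order for each robot.
    for owners in product(robots, repeat=len(target_names)):
        assigned = {r: [t for t, o in zip(target_names, owners) if o == r] for r in robots}
        orderings = [list(permutations(ts)) for ts in assigned.values()]
        for combo in product(*orderings):
            plans.append(dict(zip(assigned, combo)))
    return plans


def route_length(start, stops):
    """Total straight-line distance from start through the stops in order."""
    total, here = 0.0, start
    for stop in stops:
        total += dist(here, stop)
        here = stop
    return total


def plan_duration(plan):
    """Plan duration: the greatest distance traveled by any robot."""
    return max(route_length(robots[r], [targets[t] for t in ts]) for r, ts in plan.items())


plans = enumerate_plans(robots, targets)
best_plan = min(plans, key=plan_duration)
```

    For two robots and two targets this yields six plans, and the one with the smallest duration sends Robot 1 to Target 1 and Robot 2 to Target 2, which is Plan B. The enumeration grows very quickly with team size, so it is only practical for small problems like this one.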

    Now consider what happens when a robot fails while executing this plan. If Robot 1

    fails, then Robot 2 is assigned to visit Target 1 after reaching Target 2

    (Figure 5.2b). If Robot 2 fails, then Robot 1 is assigned to Target 2 after reaching

    Target 1 (Figure 5.2c). We assume here that tasks are not interrupted, so new

    targets are assigned to surviving robots only after they complete their current

    tasks.
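    The two backup cases can be put into numbers with a short sketch. It makes two assumptions that are not stated in the text: that a robot survives each unit of travel independently, so that the probability of surviving a traverse of length d is Pt raised to the power d, and that the failed robot fails before reaching its own target. The names are hypothetical, and the sketch does not attempt the full expected-duration calculation.

```python
from math import dist

# Locations and per-unit survival probability from Table 5.1.
robots = {"R1": (4, 12), "R2": (14, 3)}
targets = {"T1": (1, 1), "T2": (3, 5)}
P_T = 0.99


def survival_probability(distance, p_unit=P_T):
    """Probability of surviving a traverse, assuming independent unit steps."""
    return p_unit ** distance


# Plan B legs: R1 -> T1 and R2 -> T2, plus the hop between the two targets.
leg_r1 = dist(robots["R1"], targets["T1"])
leg_r2 = dist(robots["R2"], targets["T2"])
hop = dist(targets["T1"], targets["T2"])

# The surviving robot finishes its own target, then takes over the other one.
backup_if_r1_fails = leg_r2 + hop  # R2 -> T2 -> T1
backup_if_r2_fails = leg_r1 + hop  # R1 -> T1 -> T2

p_r1_survives_leg = survival_probability(leg_r1)
p_r2_survives_leg = survival_probability(leg_r2)
```

    Under these assumptions each robot has roughly an 11% chance of failing on its first leg, and either backup raises the mission duration from about 11.4 to between 15 and 16 distance units.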

    Table 5.2. Plan durations (best plan marked with *)

    Plan                 d(R1)    d(R2)    dplan
    A (R1T1 + R1T2)      15.9     0        15.9
    B (R1T1 + R2T2) *    11.4     11.2     11.4
    C (R2T1 + R1T2)      7.62     13.2     13.2
    D (R2T1 + R2T2)      0        17.6     17.6
    E (R1T2 + R1T1)      11.5     0        11.5
    F (R2T2 + R2T1)      0        15.7     15.7

    [Figure 5.2 panels: plots of the robot (R1, R2) and target (T1, T2) locations in the x–y plane.]

    (a) Chosen plan (Plan B)

    (b) Backup for Plan B when Robot 1 fails (dashed line represents robot failure)