
Team Exploration vs Exploitation with Finite Budgets

David Castañón

Boston University

Center for Information and Systems Engineering

Introduction

• Exploration vs exploitation is a classic tradeoff in decision problems with uncertainty
  - Exploit available information vs improving information
  - Numerous applications in finance, adaptive control, machine learning, …

• Interested in paradigms for teams of agents
  - Search and exploit information with limited resources
  - Task partitioning, motivated by applications (e.g. surveillance)

• Objective: techniques for team control of activities
  - Improve coordination among team members
  - Allow for variations on human roles in team
  - Understand aspects of task partitioning for mixed human/automata teams

Experimental Platform

• Multiple robots search for and perform tasks
  - Can provide varying levels of operator control: human-automata teams
  - Control information displayed, risk to each operator using video

• Possible model for interesting problems

[Figure labels: Boeing, BU]

Illustration: Dynamic Search/Classify Problems

• Multiple agents with different fields of regard

• Multiple sites to search for potential objects using noisy measurements, classify them

• Agents have search modes/exploitation modes, finite budget

Related Work

• Quickest detection problem
  - Noisy measurements of alternative hypotheses
  - Trade off decision accuracy versus time (cost) to make decision

• Results: optimal strategy is to make decision when log-likelihood ratio leaves a threshold interval
  - Sequential Probability Ratio Test
  - Log-likelihood ratio drift-diffusion model (Cohen-Holmes, …)
  - Correlates well with human decision strategies on sequential repeated tasks

• Single sensing mode, single site
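The threshold rule above can be sketched in a few lines. This is a minimal illustration, assuming Gaussian measurements with known standard deviation and two candidate means; the function name `sprt`, its parameters, and the Wald threshold approximations are choices made for this example and are not taken from the slides.

```python
import math

def sprt(samples, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Sequential Probability Ratio Test for mean mu0 (H0) vs mu1 (H1).

    Assumes i.i.d. Gaussian samples with known sigma (an illustrative model).
    Returns (decision, n_used); decision is "H0", "H1", or None if the
    samples run out before the log-likelihood ratio leaves the interval.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0
    n = 0
    for n, y in enumerate(samples, start=1):
        # Log-likelihood ratio increment log p1(y)/p0(y) for Gaussian noise.
        llr += ((y - mu0) ** 2 - (y - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return None, n
```

Tighter error targets `alpha` and `beta` widen the threshold interval, which is the accuracy versus time tradeoff named above.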

Multi-Armed Bandits

• Robbins (50s), Gittins (70s), many others

• Independent Machines
  - Each machine has individual state with random dynamics
  - State evolves when machine is played, stationary otherwise
  - Random payoff depending on state of machine

• Objective: infinite horizon sum of discounted rewards
  - Repeated decisions among finite alternatives

• Result: Optimal policy based on Gittins indices, computed independently for each machine based on current state
  - Select machine with largest current Gittins index to play next
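For a machine with a small finite state space, the index can be computed numerically. The sketch below uses the restart-in-state formulation (Katehakis and Veinott), which is one standard way to compute Gittins indices and is not a method named in the slides. The transition matrix `P`, expected rewards `r`, and discount `beta` are illustrative inputs, and plain value iteration is used for simplicity.

```python
import numpy as np

def gittins_indices(P, r, beta, iters=2000):
    """Gittins index of every state of one machine (reward-rate units).

    P[x, y]: transition probability when the machine is played in state x.
    r[x]:    expected payoff for playing in state x.
    For each state s, solve the "restart in s" problem by value iteration:
        V(x) = max( r[x] + beta * P[x] @ V,   # continue from x
                    r[s] + beta * P[s] @ V )  # restart from s
    The index of s is then (1 - beta) * V(s).
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(r)
    index = np.zeros(n)
    for s in range(n):
        V = np.zeros(n)
        for _ in range(iters):
            continue_value = r + beta * (P @ V)
            restart_value = r[s] + beta * (P[s] @ V)
            V = np.maximum(continue_value, restart_value)
        index[s] = (1 - beta) * V[s]
    return index
```

With several machines, the policy plays the machine whose current state has the largest index and recomputes nothing for the machines left idle, since their states do not change.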

Human Exploration/Exploitation in Multi-armed Bandit Problems

• Multi-armed bandit paradigm
  - Complex choice task with simple structure to normative solution
  - Can model human choice with heuristics/simple strategies that approximate Gittins indices and include parameters for human variability

• Daw et al (05,06): Time varying environments, propose alternative models for exploration/exploitation using soft-max and other random decision rules

• Steyvers et al (09), Zhang et al (09): Finite horizon total reward paradigm, with models using a latent variable to encourage exploration vs exploitation

• Yu-Dayan (05), Aston-Jones & Cohen (05), others: Propose mechanisms underlying brain activity in aspects of exploration vs exploitation

• Cohen et al (07): Highlight limitations of the multi-armed bandit paradigm and discuss directions for future work
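A soft-max decision rule of the kind mentioned above can be sketched as follows. This is a generic Boltzmann choice rule, not the specific model of any cited paper; the `temperature` parameter and the function names are assumptions for illustration.

```python
import math
import random

def softmax_probs(values, temperature):
    """Choice probabilities proportional to exp(value / temperature)."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_choice(values, temperature, rng=random):
    """Sample one option index according to the soft-max probabilities."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

A low temperature makes the rule nearly greedy (exploitation), while a high temperature spreads choices across options (exploration).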

Limits of Bandit Paradigm

• Stationary environments
  - Tasks do not evolve unless acted on
  - Likelihood of success at task plus follow-on task is time invariant
  - Set of possible tasks is time-invariant

• Single action per time
  - Not geared to team activities

• Infinite horizon objective
  - Unbounded resources

• Single type of action per bandit
  - Can’t vary choice

Interested in other paradigms that remove these limitations

A Different Paradigm: Spatial Resource Allocation

• Multiple agents with finite resources
  - Multiple actions per agent
  - Different action types per agent/location
  - Similar to classical search theory
  - Focus on effectiveness of collective team actions

• Problem: M team members, N locations, actions xij from member j to location i:

Solution mechanism: Pricing of resources + Duality
  - Allocation, but no tradeoff in exploration vs exploitation
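To illustrate the pricing idea, here is a minimal sketch for a simplified case that is assumed for the example and not taken from the slides: a single agent with budget R, and a classical search-theory payoff v_i (1 - exp(-a_i x_i)) at location i. For a given resource price, each location picks its own best effort independently; bisection then finds the price at which total effort matches the budget.

```python
import math

def allocate_at_price(values, rates, price):
    """Per-location best response: maximize v*(1 - exp(-a*x)) - price*x over x >= 0."""
    return [max(0.0, math.log(v * a / price) / a) for v, a in zip(values, rates)]

def price_allocation(values, rates, budget, iters=100):
    """Bisect on the resource price until total effort meets the budget.

    Assumes all values and rates are strictly positive.
    """
    hi = max(v * a for v, a in zip(values, rates))  # no location uses effort at this price
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(allocate_at_price(values, rates, mid)) > budget:
            lo = mid   # price too low: demand exceeds the budget
        else:
            hi = mid   # price high enough: allocation is feasible
    return hi, allocate_at_price(values, rates, hi)
```

Low-value locations drop out entirely once the price exceeds their marginal return at zero effort, so the price alone coordinates which locations receive resources.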

Extension: Dynamics

• Allow learning of information based on outcomes of action
  - Actions on cell i from agent j at time t: xij(t) result in information as to the content of cell
  - Different actions different quality of information, different effort
  - Bayesian integration of information from multiple actions, times
  - Exploration: inexpensive actions to identify valuable locations (e.g. detect activity)
  - Exploitation: expensive actions to collect value (e.g. identify objects, etc)

• Allow dynamic arrivals and departures of objects in cells

• Problem: Normative solution of such problems requires full stochastic dynamic programming
  - Very large information space: probability measures on product space of possible cell contents per time
  - Combinatorial explosion in number of control actions
  - Complex problem to solve, even for simple versions

Problem Setup

• N cells, T decision stages
  - State of cell i at stage t: xi(t) can evolve according to a Markov chain (arrivals, departures) independently across cells

• M agents, each capable of allocating resource Rj(t) in each stage t by choosing actions (modes) to act on cells
  - Decisions uijm(t) = 1 if agent j uses action m on cell i at stage t, 0 otherwise
  - cijm are resources needed for action; total use by agent j at stage t is constrained by Rj(t)
  - Some actions m are exploratory, others exploit

• Action m on cell i at stage t yields noisy measurement yijm(t) of current cell content
  - Evolves information state: πi(t) is probability distribution over content of cell i at stage t
  - Conditionally independent likelihoods p(yijm(t)|xi(t))

• Additional decision at stage t: identify
  - For each cell, can identify content correctly and get reward (content-dependent)
  - Objective: Collect total cumulative reward given resources at each time
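The evolution of a single cell's information state can be sketched as a standard two-step Bayes filter: a prediction through the cell's Markov chain, then an update with the measurement likelihood. The function names and the two-state (empty/occupied) numbers in the usage note are illustrative assumptions.

```python
import numpy as np

def predict(belief, transition):
    """Propagate the belief one stage through the cell's Markov chain.

    transition[x, x2] is the probability of moving from content x to x2
    (arrivals and departures).
    """
    return np.asarray(belief, dtype=float) @ np.asarray(transition, dtype=float)

def update(belief, likelihood):
    """Bayes update with likelihood[x] = p(y | x) for the observed measurement y."""
    posterior = np.asarray(belief, dtype=float) * np.asarray(likelihood, dtype=float)
    return posterior / posterior.sum()
```

For example, with states (empty, occupied), `update([0.5, 0.5], [0.1, 0.8])` gives the posterior after a detection from a sensor that fires with probability 0.1 on an empty cell and 0.8 on an occupied one. Because likelihoods are conditionally independent, measurements from several actions can be folded in by calling `update` repeatedly.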

Constraint Relaxation

• As stated previously, optimal solution requires feedback strategies based on the product space of information states across all cells
  - State of cell i at stage t: xi(t) can evolve according to a Markov chain (arrivals, departures) independently across cells
  - Can solve conceptually, but hard to do except for very small number of cells

• Reformulate problem: relax resource constraints at future times
  - Replace sample path constraints with average constraints
  - Can be optimistic: will allocate more resources on difficult problems than available, balance by allocating fewer resources on easy problems
  - Resulting problem has special structure that allows for easier solution
  - Solution provides optimistic bounds on performance of original problem
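The relaxation can be written out using the symbols of the problem setup. The forms below are one natural reading, assumed here rather than copied from the slides: the original constraint must hold on every sample path, while the relaxed constraint at future stages only has to hold in expectation.

```latex
% Sample-path constraint: holds for every realization of the measurements
\sum_{i=1}^{N} \sum_{m} c_{ijm}\, u_{ijm}(t) \;\le\; R_j(t)
\qquad \text{for all } j,\ t .

% Relaxed (average) constraint at future stages
\mathbb{E}\!\left[ \sum_{i=1}^{N} \sum_{m} c_{ijm}\, u_{ijm}(t) \right] \;\le\; R_j(t)
\qquad \text{for all } j \text{ and future } t .
```

Every strategy that satisfies the first constraint also satisfies the second, so the relaxed problem has a larger feasible set and its optimal value is an upper (optimistic) bound on the original.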

Results: Duality

• Theorem:
  - Given resource prices λj(t) for agents at each stage, stochastic optimization problem decouples into N independent cell problems
  - Optimal solution can be found in terms of feedback strategies that use only information on the current information state of a cell to select actions for that cell
  - Overall information state is product of marginal information state for each cell

• Implication: efficient solution algorithms
  - Merge pricing approaches from resource allocation with single cell subproblem solutions that use stochastic dynamic optimization
  - Reduce joint N-cell optimization problem into N decoupled problems, coordinated by prices
  - Replaces combinatorial optimization across cells by pricing mechanisms

• May provide tractable models for human choice in resource allocation and optional stopping
  - Replace detailed enumeration of outcomes with price estimates
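Once prices are fixed, each cell can be solved on its own by dynamic programming over its information state. The sketch below is a simplified illustration with assumptions that are not from the slides: a single agent, static cell content, discrete observations, and a terminal identification step in which the declared content earns `rewards[x]` only when it is correct.

```python
import numpy as np

def cell_value(belief, t, prices, costs, likelihoods, rewards):
    """Optimal priced value of one cell from stage t to the end of the horizon.

    prices[t]:         resource price at stage t
    costs[m]:          resource needed by sensing mode m
    likelihoods[m]:    array with likelihoods[m][x, y] = p(y | content x, mode m)
    rewards[x]:        reward for correctly identifying content x (assumed form)
    """
    belief = np.asarray(belief, dtype=float)
    if t == len(prices):
        # Terminal identification: declare the content with best expected reward.
        return float(np.max(belief * np.asarray(rewards, dtype=float)))
    # Option 1: spend nothing on this cell at stage t.
    best = cell_value(belief, t + 1, prices, costs, likelihoods, rewards)
    # Option 2: use a sensing mode, pay its priced cost, average over observations.
    for cost, lik in zip(costs, likelihoods):
        lik = np.asarray(lik, dtype=float)
        total = -prices[t] * cost
        for y in range(lik.shape[1]):
            p_y = float(belief @ lik[:, y])
            if p_y > 0:
                posterior = belief * lik[:, y] / p_y
                total += p_y * cell_value(posterior, t + 1, prices, costs,
                                          likelihoods, rewards)
        best = max(best, total)
    return best
```

The recursion enumerates observation sequences, so it is practical only for short horizons. The point is that its inputs are the cell's own belief and the prices, with no dependence on other cells.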

Feasible Decisions: Model-Predictive Control

• Strategies designed using relaxed constraints may run out of resources in future stages
  - Approximate dynamic programming technique requires on-line computation of decisions instead of off-line computation of strategies

• Restore feasibility by re-solving with receding horizon given most recent information
  - Adds robustness to model errors through replanning

[Figure: block diagram. A Resource Price Update block sends prices to the Cell 1 … Cell N subproblems, which return resource utilization.]
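The loop between price updates and cell subproblems can be sketched with a projected subgradient step: raise the price when the cells request more than the budget, lower it otherwise. This is a toy illustration under stated assumptions (one resource, and each cell either uses `cost` units to gain `value` or does nothing); the step size 1/k and the function names are choices made for the example, not the deck's algorithm.

```python
def cell_response(value, cost, price):
    """Toy cell subproblem: act only if the value exceeds the priced cost."""
    return 1 if value > price * cost else 0

def negotiate_price(values, costs, budget, iters=50):
    """Projected subgradient iteration on a single resource price."""
    price = 0.0
    for k in range(1, iters + 1):
        use = [cell_response(v, c, price) for v, c in zip(values, costs)]
        usage = sum(u * c for u, c in zip(use, costs))
        # Subgradient of the dual is (usage - budget); keep the price nonnegative.
        price = max(0.0, price + (usage - budget) / k)
    use = [cell_response(v, c, price) for v, c in zip(values, costs)]
    return price, use
```

The dual function is non-differentiable, so the responses at the final price are not guaranteed to respect the budget in general; that gap is what truncation heuristics and receding-horizon re-solving have to handle.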

Team Computation

• Interesting problem: agents negotiate based on local problems to agree on prices
  - Concave maximization problem, but non-differentiable: challenging to establish converging algorithms
  - Can truncate negotiation with heuristics (current approach)
  - Topic of current PhD effort

• Mixed Initiative Variations
  - Human team members as equals: control subset of resources, negotiate on prices with automata
  - Human team members as leaders: select actions for own resources, automata select complementary actions for others
  - Humans as controllers: impose constraints (e.g. cell responsibility allocations) on automata

• Algorithms have been developed to implement above variations and explore potential experiments

Experiment 1

• Team of 3 agents, 100 cells, partial overlap in coverage

• 3 levels of resources per agent

• 3 sensing modes per agent
  - Search, low-res ID, high-res ID

• 3 object types, 1 of them high-valued
  - Vary relative error for missing high-value
  - Potential initial arrival per cell, no departures

• Discrete-valued observations
  - Tentative classifications

• Strategy Parameters
  - Horizon, truncation strategy

Experiment 2: Adaptive Radar Search

• Problem definition studied by Hero et al (07) using stochastic dynamic programming

• Large number of cells, few of them occupied
  - Want to find occupied cells
  - Cell i state Ii = 1 if occupied, 0 otherwise
  - Prior probability that Ii = 1 is given

• Energy allocated to cells controls signal/noise ratio of measurement

• Two-stage problem: initial search, followed by refined location of occupied cells

• Adapt second stage allocations based on first stage search

• Limited energy budget

Algorithms and Results

• Equal allocation among cells
  - Motivated by standard operations in wide-area Synthetic Aperture Radar (SAR)

• Model Predictive Control using relaxed resource constraints
  - Two stage allocation: low energy image to detect pixels of interest (high intensity reflections), followed by higher energy in pixels of interest

• Results similar to prior work by Hero et al (2007), with a simpler algorithm that is 2 orders of magnitude faster

Paradigm Extension: Mobile Agents

• Viewable sites depend on agent positions
  - Slower time scale control
  - Focus on trajectory selection and mode
  - Sequencing of sites critical to set up future sites

• Mobile agents: trajectory and focus of attention control
  - Models where electronic steering is not feasible
  - Sequence-dependent setup cost for activities

• Additional uncertainty: risk of travel
  - Visiting a site accomplishes task that gains task value
  - Traversing among sites can result in vehicle failure and loss
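The effect of travel risk on site sequencing can be illustrated with a small brute-force evaluation. The model below is an assumption made for the example, not the deck's formulation: each leg between sites has a survival probability, a site's value is collected only if the vehicle arrives, and all orderings of a handful of sites are enumerated.

```python
from itertools import permutations

def route_value(route, site_value, survive_prob, start="base"):
    """Expected value collected along a route with risky legs.

    survive_prob[(a, b)] is the probability of surviving the leg from a to b;
    a site's value counts only if the vehicle reaches it.
    """
    value, alive, here = 0.0, 1.0, start
    for site in route:
        alive *= survive_prob[(here, site)]
        value += alive * site_value[site]
        here = site
    return value

def best_route(sites, site_value, survive_prob, start="base"):
    """Enumerate every visiting order (feasible only for a few sites)."""
    return max(permutations(sites),
               key=lambda r: route_value(r, site_value, survive_prob, start))
```

Because survival multiplies along the route, visiting a low-value site first can be worthwhile when it offers a safer approach to a high-value site, which is the sequence dependence described above.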

Experimental Platform for Research

• Multiple robots search for and perform tasks
  - Can provide varying levels of operator control: human-automata teams
  - Control information displayed, risk to each operator using video

Future Activities

• Evaluate algorithms on experiments with dynamic arrivals/departures

• Develop algorithms for motion-constrained mobile agents

• Implement experiments involving tasks with performance uncertainty in robot test facility
  - Vary tempo, size, uncertainty, information

• Implement autonomous team control algorithms to interact with humans in alternative roles
  - Supervisory control, Team partners, others

• Extend existing algorithms to different classes of tasks
  - Area search, task discovery, risk to platforms

• Collaborate with MURI team to design and analyze experiments involving alternative structures for human-automata teams
