
Team Exploration vs Exploitation with Finite Budgets

David Castañón

Boston University

Center for Information and Systems Engineering

Introduction

• Exploration vs exploitation is a classic tradeoff in decision problems with uncertainty
  - Exploit available information vs. improving information
  - Numerous applications in finance, adaptive control, machine learning, …

• Interested in paradigms for teams of agents
  - Search and exploit information with limited resources
  - Task partitioning, motivated by applications (e.g., surveillance)

• Objective: techniques for team control of activities
  - Improve coordination among team members
  - Allow for variations on human roles in team
  - Understand aspects of task partitioning for mixed human/automata teams

Experimental Platform

• Multiple robots search for and perform tasks
  - Can provide varying levels of operator control: human-automata teams
  - Control information displayed, risk to each operator using video

• Possible model for interesting problems


Illustration: Dynamic Search/Classify Problems

• Multiple agents with different fields of regard

• Multiple sites to search for potential objects using noisy measurements, classify them

• Agents have search modes/exploitation modes, finite budget

Related Work

• Quickest detection problem
  - Noisy measurements of alternative hypotheses
  - Trade off decision accuracy versus time (cost) to make decision

• Results: optimal strategy is to make decision when log-likelihood ratio leaves a threshold interval
  - Sequential Probability Ratio Test
  - Log-likelihood ratio drift-diffusion model (Cohen-Holmes, …)
  - Correlates well with human decision strategies on sequential repeated tasks

• Single sensing mode, single site
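As a concrete illustration of the threshold rule above, a minimal Python sketch of the Sequential Probability Ratio Test. The Gaussian observation model and the threshold values are illustrative assumptions, not details from the talk.

```python
import math
import random

def sprt(sample, logl_h1, logl_h0, a=-2.2, b=2.2, max_steps=10000):
    """Accumulate the log-likelihood ratio of H1 vs H0 until it leaves
    the interval (a, b), then declare a decision.
    a, b of about +/- log(9) correspond to roughly 10% error rates."""
    llr = 0.0
    for t in range(1, max_steps + 1):
        y = sample()                      # draw one noisy measurement
        llr += logl_h1(y) - logl_h0(y)    # update log-likelihood ratio
        if llr >= b:
            return "H1", t                # upper threshold crossed
        if llr <= a:
            return "H0", t                # lower threshold crossed
    return "undecided", max_steps

# Example: H0 is N(0,1), H1 is N(1,1); the true distribution is H1.
def norm_logpdf(y, mu):
    return -0.5 * (y - mu) ** 2 - 0.5 * math.log(2 * math.pi)

decision, steps = sprt(
    sample=lambda: random.gauss(1.0, 1.0),
    logl_h1=lambda y: norm_logpdf(y, 1.0),
    logl_h0=lambda y: norm_logpdf(y, 0.0),
)
print(decision, "after", steps, "measurements")
```

Widening the interval (a, b) lowers the error rates but lengthens the test, which is exactly the accuracy-versus-time tradeoff described above.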

Multi-Armed Bandits

• Robbins (50s), Gittins (70s), many others

• Independent Machines
  - Each machine has individual state with random dynamics
  - State evolves when machine is played, stationary otherwise
  - Random payoff depending on state of machine

• Objective: infinite horizon sum of discounted rewards
  - Repeated decisions among finite alternatives

• Result: Optimal policy based on Gittins indices, computed independently for each machine based on current state
  - Select machine with largest current Gittins index to play next
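A minimal sketch of the index-policy structure for Bernoulli machines with Beta posteriors: each machine's index depends only on its own state, and the largest index is played next. Computing exact Gittins indices requires a per-machine dynamic program; the `index` method below substitutes a simple posterior-mean-plus-bonus surrogate, a plainly labeled stand-in rather than the true Gittins computation.

```python
import random

class BernoulliMachine:
    """Bandit arm with unknown success probability and Beta(a, b) posterior."""
    def __init__(self, p_true):
        self.p_true, self.a, self.b = p_true, 1.0, 1.0

    def index(self, bonus=1.0):
        # Stand-in for the Gittins index: posterior mean plus an
        # exploration bonus that shrinks as the arm accumulates plays.
        n = self.a + self.b
        return self.a / n + bonus / n

    def play(self):
        r = 1.0 if random.random() < self.p_true else 0.0
        self.a += r           # count successes
        self.b += 1.0 - r     # count failures
        return r

machines = [BernoulliMachine(p) for p in (0.3, 0.5, 0.7)]
total = 0.0
for _ in range(1000):
    best = max(machines, key=lambda m: m.index())  # largest index plays next
    total += best.play()
print("total reward:", total)
```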

Human Exploration/Exploitation in Multi-armed Bandit Problems

• Multi-armed bandit paradigm
  - Complex choice task with simple structure to normative solution
  - Can model human choice with heuristics/simple strategies that approximate Gittins indices and include parameters for human variability

• Daw et al (05,06): Time-varying environments, propose alternative models for exploration/exploitation using soft-max and other random decision rules

• Steyvers et al (09), Zhang et al (09): Finite horizon total reward paradigm, with models using a latent variable to encourage exploration vs exploitation

• Yu-Dayan (05), Aston-Jones & Cohen (05), others: Propose mechanisms underlying brain activity in aspects of exploration vs exploitation

• Cohen et al (07): Highlight limitations of the multi-armed bandit paradigm and discuss directions for future work

Limits of Bandit Paradigm

• Stationary environments
  - Tasks do not evolve unless acted on
  - Likelihood of success at task plus follow-on task is time-invariant
  - Set of possible tasks is time-invariant

• Single action per time
  - Not geared to team activities

• Infinite horizon objective
  - Unbounded resources

• Single type of action per bandit
  - Can't vary choice

Interested in other paradigms that remove these limitations

A Different Paradigm: Spatial Resource Allocation

• Multiple agents with finite resources
  - Multiple actions per agent
  - Different action types per agent/location
  - Similar to classical search theory
  - Focus on effectiveness of collective team actions

• Problem: M team members, N locations, actions x_ij from member j to location i:
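A standard static allocation problem consistent with the description above (per-member budgets, collective effectiveness across locations), given here as an assumed reconstruction rather than the slide's original formulation:

```latex
% Assumed reconstruction: f_i is a concave location-effectiveness
% function, R_j is member j's resource budget.
\max_{x \ge 0} \; \sum_{i=1}^{N} f_i\Big(\sum_{j=1}^{M} x_{ij}\Big)
\qquad \text{s.t.} \qquad
\sum_{i=1}^{N} x_{ij} \le R_j, \quad j = 1, \dots, M
```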

• Solution mechanism: Pricing of resources + Duality
  - Allocation, but no tradeoff in exploration vs exploitation

Extension: Dynamics

• Allow learning of information based on outcomes of actions
  - Actions on cell i from agent j at time t, x_ij(t), result in information as to the content of the cell
  - Different actions give different quality of information and require different effort
  - Bayesian integration of information from multiple actions and times
  - Exploration: inexpensive actions to identify valuable locations (e.g., detect activity)
  - Exploitation: expensive actions to collect value (e.g., identify objects, etc.)

• Allow dynamic arrivals and departures of objects in cells

• Problem: Normative solution of such problems requires full stochastic dynamic programming
  - Very large information space: probability measures on product space of possible cell contents per time
  - Combinatorial explosion in number of control actions
  - Complex problem to solve, even for simple versions

Problem Setup

• N cells, T decision stages
  - State of cell i at stage t, x_i(t), can evolve according to a Markov chain (arrivals, departures) independently across cells

• M agents, each capable of allocating resource R_j(t) in each stage t by choosing actions (modes) to act on cells
  - Decisions u_ijm(t) = 1 if agent j uses action m on cell i at stage t, 0 otherwise
  - c_ijm are the resources needed for the action, giving the per-stage constraint shown after this list
  - Some actions m are exploratory, others exploit

• Action m on cell i at stage t yields noisy measurement y_ijm(t) of current cell content
  - Evolves information state: π_i(t) is the probability distribution over the content of cell i at stage t
  - Conditionally independent likelihoods p(y_ijm(t) | x_i(t))

• Additional decision at stage t: identify
  - For each cell, can identify content correctly and get reward (content-dependent)
  - Objective: Collect total cumulative reward given resources at each time
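One consistent way to write this setup in symbols (an assumed reconstruction; the slide's own formulas are not shown here):

```latex
% Per-agent, per-stage resource constraint (assumed reconstruction):
\sum_{i=1}^{N} \sum_{m} c_{ijm}\, u_{ijm}(t) \le R_j(t),
\qquad u_{ijm}(t) \in \{0, 1\}

% Objective: expected cumulative identification reward, with d_i(t) = 1
% if cell i's content is declared at stage t (d_i and r are assumed names):
\max \; \mathbb{E}\Big[\sum_{t=1}^{T} \sum_{i=1}^{N} r\big(x_i(t)\big)\, d_i(t)\Big]
```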

Constraint Relaxation

• As stated previously, the optimal solution requires feedback strategies based on the product space of information states across all cells
  - State of cell i at stage t, x_i(t), can evolve according to a Markov chain (arrivals, departures) independently across cells
  - Can solve conceptually, but hard to do except for a very small number of cells

• Reformulate problem: relax resource constraints at future times
  - Replace sample path constraints with average constraints
  - Can be optimistic: will allocate more resources on difficult problems than available, balanced by allocating fewer resources on easy problems
  - Resulting problem has special structure that allows for easier solution
  - Solution provides optimistic bounds on performance of original problem
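In symbols, the relaxation replaces the per-sample-path budget with its expectation (notation as in the assumed formulation above):

```latex
% Sample-path budget relaxed to an average (expected) budget:
\sum_{i,m} c_{ijm}\, u_{ijm}(t) \le R_j(t) \ \text{a.s.}
\quad \longrightarrow \quad
\mathbb{E}\Big[\sum_{i,m} c_{ijm}\, u_{ijm}(t)\Big] \le R_j(t)
```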

Results: Duality

• Theorem:
  - Given resource prices λ_j(t) for agents at each stage, the stochastic optimization problem decouples into N independent cell problems
  - Optimal solution can be found in terms of feedback strategies that use only information on the current information state of a cell to select actions for that cell
  - Overall information state is product of marginal information states for each cell

• Implication: efficient solution algorithms
  - Merge pricing approaches from resource allocation with single-cell subproblem solutions that use stochastic dynamic optimization
  - Reduce joint N-cell optimization problem into N decoupled problems, coordinated by prices
  - Replaces combinatorial optimization across cells by pricing mechanisms

• May provide tractable models for human choice in resource allocation and optional stopping
  - Replace detailed enumeration of outcomes with price estimates
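A minimal sketch of the pricing coordination for a single resource: a projected subgradient update on the dual price, with each cell subproblem solved independently given the price. `solve_cell` is a hypothetical placeholder for the per-cell stochastic dynamic program; only the coordination loop is illustrated.

```python
def coordinate_prices(cells, budget, solve_cell, steps=200, step0=0.5):
    """Dual price coordination: each cell subproblem is solved independently
    given the current price; the price rises when total usage exceeds the
    budget and falls otherwise (projected subgradient ascent on the dual)."""
    price = 0.0
    for k in range(1, steps + 1):
        usage = sum(solve_cell(cell, price) for cell in cells)
        price = max(0.0, price + (step0 / k) * (usage - budget))
    return price

# Toy subproblem: a cell buys effort while its marginal value exceeds the price.
def solve_cell(value, price):
    return max(0.0, value - price)   # resource usage of the cell's best plan

cells = [0.2, 0.5, 0.9, 1.4]         # per-cell marginal values (toy data)
lam = coordinate_prices(cells, budget=1.0, solve_cell=solve_cell)
print("equilibrium price: %.3f" % lam)
```

The price rises when the cells collectively demand more than the budget and falls otherwise, so the joint N-cell coupling is replaced by a one-dimensional price search per resource.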

Feasible Decisions: Model-Predictive Control

• Strategies designed using relaxed constraints may run out of resources in future stages
  - Approximate dynamic programming technique requires on-line computation of decisions instead of off-line computation of strategies

• Restore feasibility by re-solving with receding horizon given most recent information
  - Adds robustness to model errors through replanning
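A minimal sketch of the receding-horizon loop; `solve_relaxed`, `execute`, and `update_belief` are hypothetical placeholders standing in for the relaxed planner, the actual sensing actions, and the Bayesian information-state update.

```python
def model_predictive_control(belief, budget, horizon, T,
                             solve_relaxed, execute, update_belief):
    """Receding-horizon control: re-solve the relaxed problem with the
    most recent information, apply only the first-stage actions, then
    observe outcomes and replan with the remaining budget."""
    for t in range(T):
        # Plan over the remaining stages with relaxed future constraints.
        plan = solve_relaxed(belief, budget, min(horizon, T - t))
        first_stage = plan[0]                  # commit only stage-t actions
        outcomes, spent = execute(first_stage) # act; feasibility enforced here
        budget -= spent                        # resources actually consumed
        belief = update_belief(belief, first_stage, outcomes)
    return belief, budget

# Toy usage with trivial placeholders:
final = model_predictive_control(
    belief={}, budget=10.0, horizon=3, T=5,
    solve_relaxed=lambda b, r, h: [["search"]] * h,
    execute=lambda actions: ({}, 1.0),
    update_belief=lambda b, a, o: b,
)
print(final)   # ({}, 5.0): five stages executed, half the budget left
```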

[Diagram: a resource-price update block exchanges prices and resource utilization with the Cell 1 … Cell N subproblems]

Team Computation

• Interesting problem: agents negotiate based on local problems to agree on prices
  - Concave maximization problem, but non-differentiable, making it challenging to establish convergent algorithms
  - Can truncate negotiation with heuristics (current approach)
  - Topic of current PhD effort

• Mixed Initiative Variations
  - Human team members as equals: control subset of resources, negotiate on prices with automata
  - Human team members as leaders: select actions for own resources, automata select complementary actions for others
  - Humans as controllers: impose constraints (e.g., cell responsibility allocations) on automata

• Algorithms have been developed to implement the above variations and explore potential experiments

Experiment 1

• Team of 3 agents, 100 cells, partial overlap in coverage

• 3 levels of resources per agent

• 3 sensing modes per agent
  - Search, low-res ID, high-res ID

• 3 object types, 1 of them high-valued
  - Vary relative error for missing high-value objects
  - Potential initial arrival per cell, no departures

• Discrete-valued observations
  - Tentative classifications

• Strategy Parameters
  - Horizon, truncation strategy

Experiment 2: Adaptive Radar Search

• Problem definition studied by Hero et al (07) using stochastic dynamic programming

• Large number of cells, few of them occupied
  - Want to find occupied cells
  - Cell i state I_i = 1 if occupied, 0 otherwise
  - Prior probability that I_i = 1 is given

• Allocated energy to cells controls the signal/noise ratio of the measurement

• Two-stage problem: initial search, followed by refined location of occupied cells

• Adapt second-stage allocations based on first-stage search

• Limited energy budget

Algorithms and Results

• Equal allocation among cells
  - Motivated by standard operations in wide-area Synthetic Aperture Radar (SAR)

• Model Predictive Control using relaxed resource constraints
  - Two-stage allocation: low-energy image to detect pixels of interest (high-intensity reflections), followed by higher energy in pixels of interest

• Results similar to prior work by Hero et al (2007), with a simpler algorithm that is 2 orders of magnitude faster
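A minimal sketch of the two-stage allocation described above, with an assumed Gaussian measurement model in which allocated energy controls the signal-to-noise ratio; the stage split and thresholds are illustrative choices, not values from the experiments.

```python
import random

def two_stage_search(occupied, total_energy, frac_stage1=0.3, thresh=1.0):
    """Stage 1: spread a low-energy scan over all cells to flag candidates.
    Stage 2: spend the remaining energy only on flagged cells."""
    n = len(occupied)

    def measure(occ, energy):
        # Energy controls SNR: signal sqrt(energy) if occupied, plus unit noise.
        return (energy ** 0.5 if occ else 0.0) + random.gauss(0.0, 1.0)

    e1 = frac_stage1 * total_energy / n            # uniform low-energy scan
    stage1 = [measure(o, e1) for o in occupied]
    candidates = [i for i, y in enumerate(stage1) if y > thresh]

    if not candidates:
        return []
    e2 = (1 - frac_stage1) * total_energy / len(candidates)
    stage2 = {i: measure(occupied[i], e2) for i in candidates}
    return [i for i, y in stage2.items() if y > thresh]

random.seed(0)
occupied = [random.random() < 0.05 for _ in range(100)]   # sparse occupancy
found = two_stage_search(occupied, total_energy=200.0)
print("declared occupied:", found)
```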

Paradigm Extension: Mobile Agents

• Viewable sites depend on agent positions
  - Slower time-scale control
  - Focus on trajectory selection and mode
  - Sequencing of sites critical to set up future sites

• Mobile agents: trajectory and focus-of-attention control
  - Models where electronic steering is not feasible
  - Sequence-dependent setup cost for activities

• Additional uncertainty: risk of travel
  - Visiting a site accomplishes a task that gains task value
  - Traversing among sites can result in vehicle failure and loss

Experimental Platform for Research

• Multiple robots search for and perform tasks
  - Can provide varying levels of operator control: human-automata teams
  - Control information displayed, risk to each operator using video

Future Activities

• Evaluate algorithms on experiments with dynamic arrivals/departures

• Develop algorithms for motion-constrained mobile agents

• Implement experiments involving tasks with performance uncertainty in robot test facility
  - Vary tempo, size, uncertainty, information

• Implement autonomous team control algorithms to interact with humans in alternative roles
  - Supervisory control, team partners, others

• Extend existing algorithms to different classes of tasks
  - Area search, task discovery, risk to platforms

• Collaborate with MURI team to design and analyze experiments involving alternative structures for human-automata teams
