team exploration vs exploitation with finite budgets david castañón boston university center for...
Post on 29-Dec-2015
217 Views
Preview:
TRANSCRIPT
Team Exploration vs Exploitation with Finite Budgets
David Castañón
Boston University
Center for Information and Systems Engineering
Introduction
• Exploration vs exploitation is a classic tradeoff in decision problems with uncertainty- Exploit available information vs improving information- Numerous applications in finance, adaptive control, machine
learning, …
• Interested in paradigms for teams of agents - Search and exploit information with limited resources- Task partitioning, motivated by applications (e.g. surveillance)
• Objective: techniques for team control of activities- Improve coordination among team members- Allow for variations on human roles in team- Understand aspects of task partitioning for mixed
human/automata teams
Experimental Platform
• Multiple robots search for and perform tasks - Can provide varying levels of operator control: human-automata teams- Control information displayed, risk to each operator using video
• Possible model for interesting problems
Boeing BU
Illustration: Dynamic Search/Classify Problems
• Multiple agents with different fields of regard
• Multiple sites to search for potential objects using noisy measurements, classify them
• Agents have search modes/exploitation modes, finite budget
Related Work
• Quickest detection problem - Noisy measurements of alternative hypotheses
- Trade off decision accuracy versus time (cost) to make decision
• Results: optimal strategy is to make decision when log-likelihood ratio leaves a threshold interval- Sequential Probability Ratio Test- Log-likelihood ratio drift-diffusion model (Cohen-Holmes, …)- Correlates well with human decision strategies on sequential
repeated tasks
• Single sensing mode, single site
Multi-Armed Bandits
• Robbins (50s), Gittins (70s), many others
• Independent Machines- Each machine has individual state with random dynamics- State evolves when machine is played, stationary otherwise- Random payoff depending on state of machine
• Objective: infinite horizon sum of discounted rewards- Repeated decisions among finite alternatives
• Result: Optimal policy based on Gittins indices, computed independently for each machine based on current state- Select machine with largest current Gittins index to play next
Human Exploration/Exploitation inMulti-armed Bandit Problems
• Multi-armed bandit paradigm- Complex choice task with simple structure to normative solution- Can model human choice with heuristics/simple strategies that
approximate Gittins indices and include parameters for human variability
• Daw et al (05,06): Time varying environments, propose alternative models for exploration/ exploitation using soft-max and other random decision rules
• Steyvers et al (09), Zhang et al (09): Finite horizon total reward paradigm, with models using a latent variable to encourage exploration vs exploitation
• Yu-Dayan (05), Aston-Jones & Cohen (05), others: Propose mechanisms underlying brain activity in aspects of exploration vs exploitation
• Cohen et al (07): Highlight limitations of the multi-armed bandit paradigm and discuss directions for future work
Limits of Bandit Paradigm
• Stationary environments- Tasks do not evolve unless acted on- Likelihood of success at task plus follow-on task is time invariant- Set of possible tasks is time-invariant
• Single action per time- Not geared to team activities
• Infinite horizon objective- Unbounded resources
• Single type of action per bandit- Can’t vary choice
Interested in other paradigms that remove these limitations
A Different Paradigm: Spatial Resource Allocation
• Multiple agents with finite resources- Multiple actions per agent- Different action types per agent/location- Similar to classical search theory- Focus on effectiveness of collective team actions
• Problem: M team members, N locations, actions xij from member j to location i:
Solution mechanism: Pricing of resources + Duality- Allocation, but no tradeoff in exploration vs exploitation
Extension: Dynamics
• Allow learning of information based on outcomes of action- Actions on cell i from agent j at time t: xij(t) result in information as to the content of
cell- Different actions different quality of information, different effort- Bayesian integration of information from multiple actions, times- Exploration: inexpensive actions to identify valuable locations (e.g. detect activity)- Exploitation: expensive actions to collect value (e.g. identify objects, etc)
• Allow dynamic arrivals and departures of objects in cells
• Problem: Normative solution of such problems requires full stochastic dynamic programming - Very large information space: probability measures on product space of possible
cell contents per time- Combinatorial explosion in number of control actions Complex problem to solve, even for simple versions
Problem Setup
• N cells, T decision stages- State of cell i at stage t: xi(t) can evolve according to a Markov chain (arrivals,
departures) independently across cells
• M agents, each capable of allocating resource Rj(t) in each stage t by choosing actions (modes) to act on cells- Decisions uijm(t) = 1 if agent j uses action m on cell i at stage t, 0 otherwise- cijm are resources needed for action: constraint- Some actions m are exploratory, others exploit
• Action m on cell i at stage t yields noisy measurement yijm(t) of current cell content- Evolves information state: i(t) is probability distribution over content of cell i at
stage t- Conditionally independent likelihoods p(yijm(t)|xi(t))
• Additional decision at stage t: identify- For each cell, can identify content correctly and get reward (content-dependent)- Objective: Collect total cumulative reward given resources at each time
Constraint Relaxation
• As stated previously, optimal solution requires feedback strategies based on the product space of information states across all states- State of cell i at stage t: xi(t) can evolve according to a Markov chain (arrivals,
departures) independently across cells- Can solve conceptually, but hard to do except for very small number of cells
• Reformulate problem: relax resource constraints at future times- Replace sample path constraints with average constraints- Can be optimistic: will allocate more resources on difficult problems than available,
balance by allocating less resources on easy problems
- Resulting problem has special structure that allows for easier solution- Solution provides optimistic bounds on performance of original problem
Results: Duality
• Theorem: - Given resource prices j(t) for agents at each stage, stochastic optimization
problem decouples into N independent cell problems- Optimal solution can be found in terms of feedback strategies that use only
information on the current information state of a cell to select actions for that cell- Overall information state is product of marginal information state for each cell
• Implication: efficient solution algorithms- Merge pricing approaches from resource allocation with single cell subproblem
solutions that use stochastic dynamic optimization- Reduce joint N-cell optimization problem into N decoupled problems,
coordinated by prices- Replaces combinatorial optimization across cells by pricing mechanisms
• May provide tractable models for human choice in resource allocation and optional stopping- Replace detailed enumeration of outcomes with price estimates
Feasible Decisions: Model-Predictive Control
• Strategies designed using relaxed constraints may run out of resources in future stages- Approximate dynamic programming technique requires on-line computation of
decisions instead of off-line computation of strategies
• Restore feasibility by re-solving with receding horizon given most recent information- Adds robustness to model errors through replanning
ResourcePrice
Update
Cell 1Subproblem
Cell NSubproblem
PricesResourceUtilization
Team Computation
• Interesting problem: agents negotiate based on local problems to agree on prices - Concave maximization problem, but non-differentiable challenging to
establish converging algorithms- Can truncate negotiation with heuristics current approach- Topic of current PhD effort
• Mixed Initiative Variations- Human team members as equals control subset of resources, negotiate on
prices with automata- Human team members as leaders select actions for own resources, automata
select complementary actions for others- Humans as controllers impose constraints (e.g. cell responsibility allocations) on
automata
• Algorithms have been developed to implement above variations and explore potential experiments
Experiment 1
• Team of 3 agents, 100 cells, partial overlap in coverage
• 3 levels of resources per agent
• 3 sensing modes per agent- Search, low-res ID, high-res ID
• 3 object types, 1 of them high-valued- Vary relative error for
missing high-value- Potential initial arrival per cell
no departures
• Discrete-valued observations- Tentative classifications
• Strategy Parameters- Horizon, truncation strategy
• Problem definition studied by Hero et al (07)using stochastic dynamic programming
• Large number of cells, few of them occupied- Want to find occupied cells- Cell i state Ii = 1 if occupied, 0 otherwise- Prior probability Ii = 1 given
• Allocate energy to cells controls signal/noiseratio of measurement
• Two-stage problem: initial search, followed by refined
location of occupied cells • Adapt second stage allocations based on first stage search• Limited energy budget
Experiment 2: Adaptive Radar Search
Algorithms and Results
• Equal allocation among cells- Motivated by standard operations in wide-area Synthetic Aperture Radar (SAR)
• Model Predictive Control using relaxed resource constraints- Two stage allocation: low energy image to detect pixels of interest (high intensity
reflections), followed by higher energy in pixels of interest
• Results similar to prior work by Hero et al (2007) with simpler algorithm 2 orders of magnitude faster
Paradigm Extension: Mobile Agents
• Viewable sites depend on agent positions- Slower time scale control- Focus on trajectory selection and mode- Sequencing of sites critical to set up future sites
• Mobile agents: trajectory and focus of attention control- Models where electronic steering is not feasible- Sequence-dependent setup cost for activities
• Additional uncertainty: risk of travel- Visiting a site accomplishes task that gains task value- Traversing among sites can result in vehicle failure and loss
Experimental Platform for Research
• Multiple robots search for and perform tasks - Can provide varying levels of operator control: human-automata teams- Control information displayed, risk to each operator using video
Future Activities
• Evaluate algorithms on experiments with dynamic arrivals/departures
• Develop algorithms for motion-constrained mobile agents
• Implement experiments involving tasks with performance uncertainty in robot test facility- Vary tempo, size, uncertainty, information
• Implement autonomous team control algorithms to interact with humans in alternative roles- Supervisory control, Team partners, others
• Extend existing algorithms to different classes of tasks- Area search, task discovery, risk to platforms
• Collaborate with MURI team to design and analyze experiments involving alternative structures for human-automata teams
top related