1
Learning of Mediation Strategies for Heterogeneous Agents Cooperation
R. Charton, A. Boyer and F. Charpillet
Maia Team - LORIA – France
ICTAI'03 – Sacramento, CA, USA – November 4th, 2003
2
Context of our work
Industrial collaboration on the design of adaptive, interactive multimedia services for the general public. Focus on Information Retrieval assistance.
Constraints:
• User: occasional, novice
• Information Source: ownership, costs
Goal: to enhance the service quality
3
Cooperation in heterogeneous multi-agent systems
Agents of different nature (human, software, robots, ...) inhabit physical and virtual environments and are tied by interaction links. They fall into three classes:
• Controllable agents (C)
• Partially controllable agents (P)
• Non-controllable agents (N)
How to make these agents cooperate, i.e. achieve together applicative goals that satisfy a subset of the agents?
[Figure: C, P, and N agents distributed over the physical and virtual environments, connected by interaction links.]
4
Presentation Overview
• Typical example of interaction
• Mediator and Mediation Strategies
• Towards an MDP-based Mediation
• Our prototype of Mediator
• Experiments and results
5
An example of problem: a flight-booking system
The Customer (goal: book a flight from Paris to Sacramento) interacts with a Mediator; the Mediator sends a Query to the Information Source and gets Results back.
• Customer side: doesn't know how to formulate a request
• Source side: too many / raw results...
[Figure: Customer, Mediator, and Information Source linked by the interaction, query, and results flows.]
6
Role of the mediator agent
The mediator has to perform a useful task:
• Build a query that best matches the user's goal
• Provide relevant results to the user
• Maximize a utility approximation:
– User satisfaction, to be maximized
– Information Source cost, to be minimized
At any time, the mediator can:
• Ask the user a question about the query
• Send the query to the information source
• Propose a limited number of results to the user
In return, it perceives the other agents' answers: values, results, selections, rejections, ...
7
Mediation and Mediation Strategies
A mediation is a sequence of question-and-answer interactions between the agents, directed by the mediator. It is successful if the user gets the relevant results, or if the mediator discovers that the source can't give any result.
A mediation strategy specifies which action the mediator must select to control the mediation, according to the current state of the interactions.
Now, the question is how to:
• produce the mediator's behavior?
• optimize the service quality?
This requires finding an optimal mediation strategy.
8
Progressive query building
[Figure: query precision vs. number of interactions. The query goes from totally unknown, to partially specified, to sufficiently specified, to fully specified. "Sufficiently specified" is a good compromise: useful answers have been obtained; pushing on to "fully specified" only adds unuseful interactions.]
9
Requirements for the Mediator
Management of uncertainty and of imperfect knowledge:
• agents:
– users may misunderstand the questions
– users may have only partial knowledge of their needs
• environment:
– noise during communication
– imperfect sensors (for instance: speech recognition)
This requires an adaptive behavior.
We propose to model the mediation problem as an MDP and to compute a stochastic behavior for the mediator.
10
Markov Decision Process (MDP)
• Stochastic model <S, A, T, R>:
– States S = {s0, s1, s2}
– Actions A = {a0, a1}
– Transition T : S × A × S → [0, 1], with T(s, a, s') = P(s' | s, a)
– Reward R : S × A × S → ℝ
• Take a decision according to a policy π : S × A → [0, 1]
• Optimize the expected discounted reward R = Σ_{i ≥ 0} γ^i r_i, where γ is the discount factor
[Figure: transition graph over states s0, s1, s2 with actions a0, a1 and their transition probabilities.]
Computing a mediation strategy thus amounts to computing a stochastic policy.
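To make the <S, A, T, R> notation concrete, here is a minimal tabular MDP sketch with value iteration. The states, transition probabilities, and rewards are illustrative (loosely echoing the three-state example on the slide), not taken from the paper:

```python
# Minimal tabular MDP <S, A, T, R> with value iteration.
# All probabilities and rewards below are made up for illustration.

S = ["s0", "s1", "s2"]
A = ["a0", "a1"]

# T[(s, a)] -> list of (next state, probability); each list sums to 1.
T = {
    ("s0", "a0"): [("s0", 0.3), ("s1", 0.7)],
    ("s0", "a1"): [("s2", 1.0)],
    ("s1", "a0"): [("s1", 0.2), ("s2", 0.8)],
    ("s1", "a1"): [("s0", 1.0)],
    ("s2", "a0"): [("s2", 1.0)],
    ("s2", "a1"): [("s0", 0.5), ("s1", 0.5)],
}

# R[(s, a)] -> immediate expected reward (illustrative values).
R = {(s, a): 0.0 for s in S for a in A}
R[("s1", "a0")] = 1.0  # e.g. reaching the applicative goal pays off

GAMMA = 0.9  # discount factor

def value_iteration(eps=1e-6):
    """Iterate V(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') V(s')]."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)])
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

values = value_iteration()
```

An optimal (here deterministic) policy then just picks, in each state, the action realizing the max.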
11
Modeling of the flight booking example
Define the model:
• S : State Space
• A : Mediator's actions
• T : Transitions
• R : Rewards
12
States : How to describe goals and objects ?
Form filling approach (Goddeau et al. 1996) :
Queries and source objects are described within a referential. The referential is built on a set of attributes :
Ref = { At1, ..., Atm }
Example of referential :
• Departure : { London, Geneva, Paris, Berlin, … }
• Arrival: { Sacramento, Beijing, Moscow, … }
• Class : { Business, Normal, Economic, ... }
13
State space
The mediator's state combines the user side and the source side: S = U × R.
R is the power set of all the objects of the information source:
R = { flight1, ..., flightr } is the set of objects that match the current query.
U is the set of partial queries:
u = { (ea1, val1), ..., (eam, valm) }
The state of the attribute Ati is a couple (eai, vali):
• Open: ea = '?', val is free
• Closed: ea = 'F', val cannot be specified
• Assigned: ea = 'A', val is already instantiated
14
State abstraction
The size of the state space is (2^n + 1) · (2 + i)^m, where:
– n: total count of objects of the information source
– m: number of attributes
– i: average number of values per attribute
An idea: use a state abstraction for the MDP and only keep:
• the binding state {?, A, F} of each attribute from U
• the response quality qr ∈ {?, 0, +, *} from R:
– qr = '?': unknown (not yet queried)
– qr = 0: no responses
– qr = '+': between 1 and nrmax responses
– qr = '*': more than nrmax responses
The size of the abstract state space: 4 · 3^m
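The gap between the two formulas is easy to check numerically; this sketch just evaluates the slide's size formulas (the values n = 20, m = 3, i = 10 are illustrative, not from the paper):

```python
# State-space sizes from the slide's formulas:
#   concrete space: (2^n + 1) * (2 + i)^m
#   abstract space:  4 * 3^m

def full_space_size(n, m, i):
    """(2^n + 1) * (2 + i)^m: the power set of the n source objects
    (plus an 'unknown' response) times (2 + i) states per attribute."""
    return (2**n + 1) * (2 + i)**m

def abstract_space_size(m):
    """4 * 3^m: 3 binding states {?, A, F} per attribute, times
    4 response qualities {?, 0, +, *}."""
    return 4 * 3**m

# Even a tiny source (n = 20 objects, m = 3 attributes, i = 10 values
# per attribute) gives a huge concrete space but only 108 abstract states.
print(full_space_size(20, 3, 10))   # 1811941056 (about 1.8e9)
print(abstract_space_size(3))       # 108
```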
15
Actions of the Mediator
• Ask the user a question about an attribute. Examples for the travel class:
– "In which class do you want to travel?"
– "Do you want to travel in business class?"
– "Are you sure you want to travel in economic class?"
• Send the current query to the information source
• Ask the user to select a response among the propositions
[Figure: the Mediator exchanges questions with the User and queries/propositions with the Source.]
16
Rewards
Rewards can be obtained:
• from the user interaction part (selection, refusal, ...):
+ R_selection: the user selects a proposition
– R_noselect: the user refuses all the propositions
– R_timeout: too long an interaction (user disconnection)
• from the information source interaction part (results):
+ R_noresp: no results for a fully specified task
– R_overnum: too many results (response quality is '*')
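The signs above can be sketched as a simple event-to-reinforcement map; the magnitudes here are invented for illustration, only the signs follow the slide:

```python
# Reward signal sketch for the mediator.
# Signs follow the slide; the magnitudes are illustrative assumptions.

REWARDS = {
    "selection": +10.0,  # user selects a proposition
    "noselect":   -5.0,  # user refuses all the propositions
    "timeout":   -10.0,  # interaction too long / user disconnects
    "noresp":     +1.0,  # no results for a fully specified task: still a
                         # success, the mediator proved the source has
                         # no matching object
    "overnum":    -2.0,  # too many results (response quality '*')
}

def reward(event):
    """Map an interaction event to a scalar reinforcement (0 otherwise)."""
    return REWARDS.get(event, 0.0)
```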
17
Example of mediation with the flight booking service
Columns: concrete State / Abstract state / Mediator Action / Answer / Reward
<?, ?, ? | ?> <?, ?, ? | ?> Ask user for departure Paris 0
<Paris, ?, ? | ?> <A, ?, ? | ?> Ask source for results 1700 flights - R Overnum
<Paris, ?, ? | {nr Max first flights} > <A, ?, ? | *> Ask user for destination Sacramento 0
<Paris, Sacramento, ? | ?> <A, A, ? | ?> Ask user for flight class I don't know 0
<Paris, Sacramento, F | ?> <A, A, F | ?> Ask source for results 4 flights 0
<Paris, Sacramento, F | {4 flights}> <A, A, F | +> Ask user for selection Selection #2 + R Selection
18
Compute the Mediation Strategy
Problem: two parts of the model are unknown!
• T = f (user, information source)
• R = f (user, information source)
Learn the Mediation Strategy by reinforcement
19
Reinforcement Learning
[Figure: the learner selects an Action, the dynamic system makes a Transition, and the learner receives back an Observation and a Reinforcement (Reward).]
20
Q-Learning (Watkins 89)
• A Reinforcement Learning method
• Can be used online
Q-Value Q : S × A → ℝ
Update rule (Bellman 57):
Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [ R(s, a) + γ max_{a'∈A} Q_t(s', a') ]
where α is the learning rate.
[Figure: the Q-Learner observes state s and reward r from the environment and chooses action a; Q(s, a0) and Q(s, a1) relate V(s) to the successor values V(s'0), ..., V(s'n).]
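The update rule above can be sketched in a few lines; the toy chain environment below is an illustrative stand-in, not the mediation task. Since Q-learning is off-policy, the sketch learns while behaving completely at random:

```python
import random

# Tabular Q-learning (Watkins 1989):
#   Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
# Toy chain: states 0..4, reaching state 4 pays +1 and ends the episode.

ALPHA, GAMMA = 0.1, 0.9
N_STATES = 5
ACTIONS = ("left", "right")
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic chain dynamics: 'right' moves toward the goal."""
    s2 = min(s + 1, N_STATES - 1) if a == "right" else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

def update(s, a, r, s2):
    """One Bellman backup of the Q-value."""
    best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

random.seed(0)
for _ in range(5000):          # random behavior policy; Q-learning is
    s, steps = 0, 0            # off-policy, so Q still converges to Q*
    while s != N_STATES - 1 and steps < 100:
        a = random.choice(ACTIONS)
        s2, r = step(s, a)
        update(s, a, r, s2)
        s, steps = s2, steps + 1
```

After training, the greedy policy (argmax over Q in each state) heads right toward the goal, matching the optimal strategy for this chain.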
21
Mediator Architecture
[Figure: inside the Mediator Agent, the Interaction Manager exchanges requests and answers/selections with the User/Client Agent, and requests and results with the Information Source Agent. It maintains the Task Manager (real state) and a User Profile (store & retrieve preferences), and feeds the Decision Module (Q-Learning) with the abstract state and rewards; the Decision Module returns the selected actions and updates its Q-values.]
22
Experimentation
on the flight-booking application. We trained the mediator task with:
• 3 attributes (cities of departure/arrival and flight class)
• 4 attributes (+ the time of day for taking off)
• 5 attributes (+ the airline)

# of Attributes (m) | # of Abstract states (4·3^m) | # of Actions (3m+2) | # of Q-Values ((12m+8)·3^m)
3 | 108 | 11 | 1,188
4 | 324 | 14 | 4,536
5 | 972 | 17 | 16,524

Complexity growth as a function of the number of attributes.
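The figures in the table follow directly from the formulas in the column headers; a quick mechanical check:

```python
# Reproduce the complexity table rows from the column-header formulas.

def table_row(m):
    abstract_states = 4 * 3**m            # 4 . 3^m
    actions = 3 * m + 2                   # 3m + 2
    q_values = abstract_states * actions  # = (12m + 8) . 3^m
    return abstract_states, actions, q_values

for m in (3, 4, 5):
    print(m, table_row(m))
# 3 (108, 11, 1188)
# 4 (324, 14, 4536)
# 5 (972, 17, 16524)
```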
23
Learning results for 3-5 attributes (% of hits)
• 3 and 4 attributes: 99% of selection (close to optimal)
• 5 attributes: 90% of selection (more time required to converge)
[Figure: % of successful mediations (selection / no response) vs. number of iterations, from 0 to 200,000, for 3, 4, and 5 attributes.]
24
Learning results for 3-5 attributes (avg. length)
• 3 and 4 attributes: the minimal length of the mediation is reached
• 5 attributes: longer mediations
[Figure: average mediation length (number of actions per mediation) vs. number of iterations, from 0 to 200,000, for 3, 4, and 5 attributes.]
25
Conclusion
Advantages
• MDP + RL makes it possible to learn mediation strategies
• Answers the needs of a majority of users (profiles)
• Designer-oriented → user-oriented
• Incremental approach
• Implemented solution
Limits
• The user is only partially observable, especially through imperfect sensors like speech recognition
• Performance degrades for more complex tasks
26
Future works
• Use other probabilistic models and methods:
– Learn from a pre-established policy
– Learn the model (Sutton's Dyna-Q, classifiers)
– POMDP approach (modified Q-learning, Baxter's gradient)
• For more generic / complex tasks:
– Abstraction & scalability: change the abstract state space for a better guidance of the process in the real state space
– Hierarchical decomposition (H-MDP & H-POMDP) with attribute-dependency management (e.g.: City → possible Company → specific options)
27
Thank you for your attention
Any questions ?
28
References
(Allen et al. 2000) Allen J., Byron D., Dzikovska M., Ferguson G., Galescu L., Stent A., An Architecture for a Generic Dialogue Shell. In Natural Language Engineering, Cambridge University Press, vol. 6, 2000.
(Young 1999) Young S., Probabilistic Methods in Spoken Dialog Systems. In Royal Society, London, September 1999.
(Levin et al. 1998) Levin E., Pieraccini R. and Eckert W., Using Markov Decision Process for Learning Dialogue Strategies. In Proceedings of ICASSP'98, Seattle, USA, 1998.
(Goddeau et al. 1996) Goddeau D., Meng H., Polifroni J., Seneff S., Busayapongchai S., A Form-Based Dialogue Manager For Spoken Language Applications. In Proceedings of ICSLP'96, Philadelphia, 1996.
(Sutton & Barto 1998) Sutton R. S. and Barto A. G., Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
(Watkins 1989) Watkins C., Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, England, 1989.
(Shardanand & Maes 1995) Shardanand U. and Maes P., Social Information Filtering: Algorithms for Automating "Word of Mouth". In Proceedings of ACM CHI'95, vol. 1, pp. 210-217, 1995.
29
A trace in the Abstract State Space
[Figure: the 36 abstract states for a 2-attribute referential, <ea1, ea2 | qr> with ea ∈ {?, A, F} and qr ∈ {?, 0, +, *}, with the trace below highlighted.]
0 - State initialization
1 - Mediator: asks the user for Attribute 1 / User: don't know
2 - Mediator: sends the query to the source / Source: 25 answers ...
3 - Mediator: asks the user for Attribute 2 / User: gives a correct value
4 - Mediator: sends the query to the source / Source: 3 answers ...