Continuous-Action Q-Learning
(C) 2003, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Continuous-Action Q-Learning
Summarized by Seung-Joon Yi
José del R. Millán et al., Machine Learning 49, 247-265 (2002)
ITPM (Incremental Topology Preserving Map)
Consists of units and edges between pairs of units.
Maps the current sensory situation x onto an action a.
Units are created incrementally and incorporate bias.
After being created, a unit's sensory component is tuned by self-organizing rules.
Its action component is updated through reinforcement learning.
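As a rough sketch, the structure described above could be represented as follows (class and field names are illustrative, not from the paper):

```python
import numpy as np

class Unit:
    """One ITPM unit: a sensory prototype plus discrete candidate actions,
    each with a Q-value. Names and fields are illustrative, not the paper's."""
    def __init__(self, x, actions):
        self.w = np.asarray(x, dtype=float)        # sensory component
        self.actions = np.asarray(actions, float)  # discrete action candidates
        self.q = np.zeros(len(self.actions))       # one Q-value per action
        self.neighbors = set()                     # edges to topological neighbors

def nearest_unit(units, x):
    """Map the current sensory situation x onto its closest unit."""
    x = np.asarray(x, dtype=float)
    return min(units, key=lambda u: float(np.linalg.norm(u.w - x)))
```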
ITPM
Units and bias
Initially the ITPM has no units; they are created as the robot uses built-in reflexes.
Units in the network have overlapping localized receptive fields.
When the neural controller makes incorrect generalizations, the reflexes take control of the robot and a new unit is added to the ITPM.
ITPM
Self-organizing rules
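The rules themselves were on a slide that did not survive extraction. A generic SOM-style tuning step of the kind the previous slide describes (the winning unit moves toward the observed situation, its topological neighbors move less; learning rates are illustrative, and the paper's exact rule may differ) looks like:

```python
import numpy as np

def self_organize(prototypes, neighbors, x, lr=0.05, lr_nb=0.01):
    """Tune sensory components toward the situation x: the nearest unit
    moves most, its topological neighbors move less. A generic SOM-style
    sketch, not necessarily the paper's exact rule."""
    x = np.asarray(x, dtype=float)
    win = min(range(len(prototypes)),
              key=lambda i: float(np.linalg.norm(prototypes[i] - x)))
    prototypes[win] += lr * (x - prototypes[win])        # winner update
    for j in neighbors.get(win, ()):                     # neighbor updates
        prototypes[j] += lr_nb * (x - prototypes[j])
    return win
```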
ITPM
Advantages
Automatically allocates units in the visited parts of the input space.
Dynamically adjusts the resolution as needed in different regions.
Experiments show that, on average, every unit is connected to 5 others at the end of the learning episodes.
ITPM
General learning algorithm
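The algorithm was shown as a figure that did not survive extraction. Structurally, the loop the preceding slides describe might be sketched as follows (all callback and argument names are placeholders, not the paper's API):

```python
def learning_episode(env, units, select_action, update_q, tune_units,
                     generalization_failed=None, make_unit=None):
    """Skeleton of one learning episode, as the slides describe it:
    select an action from the ITPM, update Q-values from the reward,
    tune the sensory components, and add a unit when the controller
    generalizes incorrectly. All callbacks are illustrative placeholders."""
    x = env.reset()
    done = False
    while not done:
        a = select_action(units, x)                 # map situation -> action
        x_next, reward, done = env.step(a)
        update_q(units, x, a, reward, x_next)       # RL update of action components
        tune_units(units, x)                        # self-organizing sensory update
        if generalization_failed and generalization_failed(x_next):
            units.append(make_unit(x_next))         # grow the ITPM incrementally
        x = x_next
```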
Discrete-Action Q-Learning
Action selection rule: ε-greedy policy
Q-value update rule
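The equation was an image in the original slides and did not survive extraction; the standard one-step discounted Q-learning update, which the paper starts from, is:

```latex
Q(x_t, a_t) \leftarrow Q(x_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t) \right]
```

where α is the learning rate and γ the discount factor.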
Continuous-Action Q-Learning
Action selection rule: an average of the discrete actions of the nearest unit, weighted by their Q-values.
The Q-value of the selected continuous action a is:
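Assuming positive Q-values, the weighted-average selection rule can be sketched as follows (a simplification; the paper's exact weighting and normalization may differ):

```python
import numpy as np

def continuous_action(actions, q_values):
    """Q-weighted average of the nearest unit's discrete actions, plus the
    interpolated Q-value of the resulting continuous action. A sketch that
    assumes positive Q-values; not necessarily the paper's exact rule."""
    a = np.asarray(actions, dtype=float)
    q = np.asarray(q_values, dtype=float)
    w = q / q.sum()               # weights proportional to Q-values
    return float(w @ a), float(w @ q)
```

For example, `continuous_action([-1.0, 1.0], [1.0, 3.0])` blends the two discrete actions toward the higher-valued one.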
Continuous-Action Q-Learning
Q-value update rule
Average-Reward RL
Q-value update rule
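This equation was also an image in the slides. The standard average-reward (R-learning-style) update, which replaces the discount factor with an estimate ρ of the average reward per step, has the form:

```latex
Q(x_t, a_t) \leftarrow Q(x_t, a_t)
  + \alpha \left[ r_{t+1} - \rho + \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t) \right]
```

with ρ itself updated incrementally from the observed rewards; the paper's rule may differ in detail.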
Experiments
Wall-following task
Reward
Experiments
Performance comparison between discrete- and continuous-action discounted-reward RL
Experiments
Performance comparison between discrete- and continuous-action average-reward RL
Experiments
Performance comparison between discounted-reward and average-reward RL, discrete-action case
Conclusion
Presented a simple Q-learning method that works in continuous domains.
The ITPM represents the continuous input space.
Compared discounted-reward RL against average-reward RL.