Incorporating Advice into Agents that Learn from Reinforcement
Presented by
Alp Sardağ
Problem of RL
• Reinforcement Learning usually requires a large number of training episodes.
HOW TO OVERCOME THIS?
• Two approaches:
• Implicit representation of the utility function
• Allowing the Q-learner to accept advice at any time and in a natural manner.
Input Generalization
• To learn to play a game such as chess (about 10^120 states), it is impossible to visit all of these states.
• Implicit representation of the function: a form that allows the output to be calculated for any input and is much more compact than a tabular form.
Example:
U(i) = w1·f1(i) + w2·f2(i) + ... + wn·fn(i)
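A minimal Python sketch of this weighted-feature idea (not the paper's code; the feature functions and weights below are invented for illustration):

def utility(state, weights, features):
    """U(i) = w1*f1(i) + ... + wn*fn(i): a weighted sum of feature values."""
    return sum(w * f(state) for w, f in zip(weights, features))

# Two made-up, chess-flavored features; real features and weights are task-specific.
features = [lambda s: s["material_balance"], lambda s: s["king_safety"]]
weights = [1.0, 0.5]
print(utility({"material_balance": 3, "king_safety": -1}, weights, features))  # 2.5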
Connectionist Q-learning
• As the function to be learned is characterized by a vector of weights w, neural networks are obvious candidates for learning those weights.
The new update rule:
w ← w + α [r + U_w(j) − U_w(i)] ∇_w U_w(i)
Note: TD-Gammon learned to play better than Neurogammon.
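A rough sketch of one weight update under the rule above, assuming a linear utility U_w(x) = w·x so the gradient is simply the feature vector (a real connectionist learner would obtain the gradient by backpropagation); the learning rate and feature vectors are illustrative:

import numpy as np

def td_update(w, x_i, x_j, r, alpha=0.1):
    """w <- w + alpha * (r + U_w(j) - U_w(i)) * grad_w U_w(i), for linear U_w."""
    td_error = r + w @ x_j - w @ x_i   # r + U_w(j) - U_w(i)
    return w + alpha * td_error * x_i  # gradient of w.x_i with respect to w is x_i

w = np.zeros(3)
w = td_update(w, x_i=np.array([1.0, 0.0, 2.0]), x_j=np.array([0.0, 1.0, 1.0]), r=1.0)
print(w)  # weights nudged toward reducing the temporal-difference error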
Example of Advice Taking
Advice: Don’t go into box canyons when opponents are in sight.
General Structure of the RL Learner
A connectionist Q-learner augmented with advice taking.
Connectionist Q-learning
• Q(a,i): the utility function that maps states and actions to numeric values.
• Given a perfect version of this function, the optimal plan is simply to choose, in each state that is reached, the action with the maximum utility.
• The utility function is implemented as a neural network whose inputs describe the current state and whose outputs are the utility of each action.
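A hedged sketch of such a network: a one-hidden-layer net whose inputs are state features and whose outputs are per-action utilities, with greedy action selection on top. The layer sizes and random initialization are placeholders, not the paper's architecture:

import numpy as np

rng = np.random.default_rng(0)
N_INPUTS, N_HIDDEN, N_ACTIONS = 8, 5, 4
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(N_ACTIONS, N_HIDDEN))  # hidden -> output weights

def q_values(state_features):
    """Estimated utility Q(a, i) of every action a in the state described by the inputs."""
    return W2 @ np.tanh(W1 @ state_features)

def greedy_action(state_features):
    """With a perfect utility function, acting optimally is just an argmax over actions."""
    return int(np.argmax(q_values(state_features)))

print(greedy_action(np.ones(N_INPUTS)))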
Step 1 in Taking Advice
• Request the advice: instead of having the learner request advice, the external observer provides advice whenever the observer feels it is appropriate. There are two reasons for this:
• It places less of a burden on the observer.
• It is an open question how to create the best mechanism for having an RL agent recognize its need for advice.
Step 2 in Taking Advice
• Convert the advice to an internal representation: due to the complexities of natural language processing, the external observer expresses its advice using a simple programming language and a list of task-specific terms.
• Example (see the sketch below):
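A guess at the flavor of such advice: an IF-THEN statement over task-specific terms, parsed into a simple internal (conditions, action) structure. The syntax and term names here are invented for illustration and are not the paper's advice language:

import re

advice_text = "IF EnemyNear AND BoxCanyonAhead THEN DoNotMoveForward"

def parse_advice(text):
    """Split a toy 'IF <conditions> THEN <action>' statement into its parts."""
    match = re.match(r"IF (.+) THEN (\w+)", text)
    conditions = [c.strip() for c in match.group(1).split("AND")]
    return {"conditions": conditions, "action": match.group(2)}

print(parse_advice(advice_text))
# {'conditions': ['EnemyNear', 'BoxCanyonAhead'], 'action': 'DoNotMoveForward'}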
Step 3 in Advice Taking
• Convert the advice into usable form: using techniques from knowledge compilation, the learner can convert high-level advice into a collection of directly interpretable statements.
Step 4 in Advice Taking
• Use ideas from knowledge-based neural networks: install the operationalized advice into the connectionist representation of the utility function.
• Convert the ruleset into a network by mapping the "target concepts" of the ruleset to output units and creating hidden units that represent the intermediate conclusions.
• Rules are installed into the network incrementally.
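A rough, KBANN-style sketch of installing one conjunctive rule as a new hidden unit: the rule's antecedents get large positive weights, and the bias is set so a sigmoid unit turns on only when all antecedents are active. The weight magnitude OMEGA and the sizes are illustrative assumptions, not the paper's values:

import numpy as np

OMEGA = 8.0   # magnitude for rule-derived weights (an illustrative choice)

def install_conjunctive_rule(antecedents, n_inputs):
    """Return (weights, bias) for a hidden unit encoding an AND of the given input indices."""
    w = np.zeros(n_inputs)
    w[antecedents] = OMEGA
    bias = -OMEGA * (len(antecedents) - 0.5)   # unit is on only when every antecedent is active
    return w, bias

w, b = install_conjunctive_rule(antecedents=[0, 3], n_inputs=8)
x = np.array([1, 0, 0, 1, 0, 0, 0, 0])           # both antecedents present
print(round(1 / (1 + np.exp(-(w @ x + b))), 3))  # activation near 1; near 0 if either is missing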
Example
Example Cont.
Advice added. Note that the inputs and outputs of the network remain unchanged; the advice only changes how the function from states to the utility of actions is calculated.
Example Cont.
A multistep plan:
Example Cont.
A multistep plan embedded in a REPEAT:
Example Cont.
Advice that involves previously defined terms:
Judge the Value of Advice
• Once the advice is inserted, the RL agent returns to exploring its environment, thereby integrating and refining the advice.
• In some circumstances, such as a game learner that can play against itself, it would be straightforward to evaluate the advice empirically.
• It would also be possible to allow the observer to retract or counteract bad advice.
Test Bed
Test environment: (a) sample configuration; (b) sample division of the environment into sectors; (c) distances measured by the agent's sensors; (d) a neural network that computes the utility of actions.
Methodology
• The agents are trained for a fixed number of episodes for each experiment.
• An episode consists of placing the agent into a randomly generated initial environment and allowing it to explore until it is captured or a threshold of 500 steps is reached.
• The environment contains a 7x7 grid with 15 obstacles, 3 enemy agents, and 10 rewards.
• 3 randomly generated environments.
• 10 randomly initialized networks.
• Average total reinforcement is measured by freezing the network and measuring the average reinforcement on a test set.
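A hedged sketch of that evaluation step: freeze the learned policy, run it greedily on held-out test environments for at most 500 steps each, and average the total reinforcement. The reset/step environment interface and DummyEnv below are assumptions for illustration, not the paper's test bed:

def average_test_reinforcement(greedy_action, test_environments, max_steps=500):
    """Average total reinforcement of a frozen (no-exploration) policy over test environments."""
    totals = []
    for env in test_environments:
        state, total = env.reset(), 0.0
        for _ in range(max_steps):
            state, reward, done = env.step(greedy_action(state))
            total += reward
            if done:                     # e.g. the agent was captured
                break
        totals.append(total)
    return sum(totals) / len(totals)

class DummyEnv:
    """Stand-in environment: +1 reinforcement per step, episode ends after 3 steps."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 3

print(average_test_reinforcement(lambda state: 0, [DummyEnv(), DummyEnv()]))  # 3.0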
Advice
Test-Set Results
Test-Set Results
The table above shows how well each piece of advice meets its intent.
Related Work
Gordon and Subramanian (1994) developed a system similar to this one. Their agent accepts high-level advice of the form IF condition THEN ACHIEVE goal. It operationalizes these rules using its background knowledge about goal achievement, and the resulting rules are then refined using genetic algorithms.