Concept of a learning robot based on VSLA
by M. Bindhammer
Rev. 2
1. Learning robot and its environment
In this first chapter we briefly discuss the concept behind the learning robot and its
environment, based on the variable structure stochastic learning automaton (VSLA).
The robot can choose from a finite number of actions (e.g. drive forwards, drive
backwards, turn right, turn left). Initially, at time t = n = 1, one of the possible
actions α is chosen by the robot at random with a given probability p. This action
is now applied to the random environment in which the robot "lives", and the response
β from the environment is observed by the sensor(s) of the robot.
The feedback β from the environment is binary, i.e. it is either favorable or
unfavorable for the given task the robot should learn. We define β = 0 as a reward
(favorable) and β = 1 as a penalty (unfavorable). Depending on the response from the
environment, the probability p_i of choosing action α_i for the next period of
time t = n + 1 is updated according to the updating rule T.
After that, another action is chosen and the response of the environment is observed.
When a certain stopping criterion is reached, the algorithm stops and the robot has
learnt some characteristics of the random environment.
Definition abstract:
• α = {α_1, α_2, ..., α_r} is the finite set of r actions/outputs of the robot. The
output (action) applied to the environment at time t = n is denoted by α(n).
• β = {β_1, β_2} is the binary set of inputs/responses from the environment. The
input (response) applied to the robot at time t = n is denoted by β(n). In our
case, the values for β are chosen to be 0 and 1; β = 0 represents a reward
and β = 1 a penalty.
• p = {p_1, p_2, ..., p_r} is the finite set of probabilities that a certain action α(n) is
chosen at time t = n, denoted by p(n).
• T is the updating function (rule) according to which the elements of the set p
are updated at each time t = n. Therefore p(n + 1) = T(α(n), β(n), p(n)),
where the i-th element of the set p(n) is p_i(n) = Prob(α(n) = α_i) with
i = 1, 2, ..., r,
∀n: Σ_{i=1}^{r} p_i(n) = p_1(n) + p_2(n) + ... + p_r(n) = 1, and
∀i: p_i(1) = 1/r.
• c = {c_1, c_2, ..., c_r} is the finite set of penalty probabilities, i.e. the
probabilities that action α_i will result in a penalty input from the random
environment. If the penalty probabilities are constant, the environment is called
a stationary random environment.
The updating functions (reinforcement schemes) are categorized based on their
linearity. The general linear scheme is given by:
If α(n) = α_i,

β = 0:  p_j(n + 1) = p_j(n) + a · (1 − p_j(n))   for j = i
        p_j(n + 1) = (1 − a) · p_j(n)            ∀ j ≠ i

β = 1:  p_j(n + 1) = (1 − b) · p_j(n)                    for j = i
        p_j(n + 1) = b/(r − 1) + (1 − b) · p_j(n)        ∀ j ≠ i

where a and b are the learning parameters, with 0 < a, b < 1.
If a = b, the scheme is called the linear reward-penalty scheme. If for β = 1 the p_j
remain unchanged (∀j), it is called the linear reward scheme.
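As an illustration (not part of the original text), the general linear scheme can be sketched as a small C function; the function name and signature are assumptions, while β, i, a and b follow the definitions above:

```c
#include <stddef.h>

/* Hypothetical helper implementing the general linear scheme: updates the
 * probability vector p[0..r-1] in place after action i has received the
 * environment response beta (0 = reward, 1 = penalty). */
void vsla_update(double p[], size_t r, size_t i, int beta, double a, double b)
{
    for (size_t j = 0; j < r; j++) {
        if (beta == 0)                        /* reward */
            p[j] = (j == i) ? p[j] + a * (1.0 - p[j])
                            : (1.0 - a) * p[j];
        else                                  /* penalty */
            p[j] = (j == i) ? (1.0 - b) * p[j]
                            : b / (double)(r - 1) + (1.0 - b) * p[j];
    }
}
```

With a = b this is the linear reward-penalty scheme; skipping the update for beta == 1 would turn it into the linear reward scheme.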
2. An example
We make two assumptions before we start with the example: for simplicity we
consider the random environment to be a stationary random environment, and we
use the linear reward-penalty scheme.
Let's say the robot roams through a room and shall learn how to avoid obstacles. A
stationary random environment then simply means that the probability that the robot
will hit an obstacle is the same everywhere in the room. We will discuss later in detail
how to obtain such penalty probabilities as a function of the position in the room.
Let us assume the robot can choose from the set α = {α_1, α_2, α_3, α_4} of actions. We
could define these actions for instance as follows: α_1: drive forwards, α_2: drive
backwards, α_3: turn right and α_4: turn left.

i = 1, 2, ..., r
r = 4
p_i(1) = 1/r = 1/4
Let a = b = 1/2
Let's assume now that the initial action α_1 (which has been selected randomly) has led
to an input β = 0 (reward) at time t = n. The new probabilities are then
calculated as follows:

β = 0:  p_j(n + 1) = p_j(n) + a · (1 − p_j(n))   for j = i
        p_j(n + 1) = (1 − a) · p_j(n)            ∀ j ≠ i

For α_1: p_j(n + 1) = p_j(n) + a · (1 − p_j(n)) = 1/4 + 1/2 · (1 − 1/4) = 5/8

For α_2, α_3, α_4: p_j(n + 1) = (1 − a) · p_j(n) = (1 − 1/2) · 1/4 = 1/8

As it is required that ∀n: Σ_{i=1}^{r} p_i(n) = 1, we check:

Σ_{j=1}^{r} p_j(n + 1) = 5/8 + 1/8 + 1/8 + 1/8 = 1
I.e. after the input β(n) = 0 from the environment, the probability that action α_1
will be chosen as action α(n + 1) has been increased to 5/8, while the probabilities
that one of the actions α_2, α_3 or α_4 will be chosen have been decreased to 1/8.
We now compute the same for the case that the initial action α_1 has led to an input
β = 1 (penalty) at time t = n.

β = 1:  p_j(n + 1) = (1 − b) · p_j(n)                    for j = i
        p_j(n + 1) = b/(r − 1) + (1 − b) · p_j(n)        ∀ j ≠ i

For α_1: p_j(n + 1) = (1 − b) · p_j(n) = (1 − 1/2) · 1/4 = 1/8

For α_2, α_3, α_4: p_j(n + 1) = b/(r − 1) + (1 − b) · p_j(n)
                             = (1/2)/(4 − 1) + (1 − 1/2) · 1/4 = 1/6 + 1/8 = 7/24

Σ_{j=1}^{r} p_j(n + 1) = 1/8 + 7/24 + 7/24 + 7/24 = 1

I.e. after the input β(n) = 1 from the environment, the probability that action α_1
will be chosen as action α(n + 1) has been decreased to 1/8, while the probabilities
that one of the actions α_2, α_3 or α_4 will be chosen have been increased to 7/24.
From this example it can also be seen immediately that the limit of a probability p_i
for n → ∞ is either 0 or 1. The robot therefore learns to choose the optimal action
asymptotically. It should be noted that it does not always converge to the correct
action, but the probability that it converges to the wrong one can be made arbitrarily
small by making the learning parameter a small.
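The asymptotic behaviour can be observed in a small simulation. The following C sketch (an illustration with assumed names, not the robot code itself) runs the linear reward-penalty scheme against a stationary random environment with penalty probabilities c_i; a tiny hand-rolled generator replaces the robot's sensors so the run is reproducible:

```c
#include <stddef.h>

static unsigned long long rng_state = 1ULL;

/* Minimal linear congruential generator returning a uniform double in [0,1). */
static double uniform01(void)
{
    rng_state = rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(rng_state >> 11) / 9007199254740992.0;   /* / 2^53 */
}

/* Draw an action index according to the probability vector p[0..r-1]. */
static size_t choose_action(const double p[], size_t r)
{
    double u = uniform01(), cum = 0.0;
    for (size_t j = 0; j < r; j++) {
        cum += p[j];
        if (u < cum)
            return j;
    }
    return r - 1;
}

/* n steps of the linear reward-penalty scheme (a = b) in a stationary
 * environment: action i is penalized with probability c[i]. */
void vsla_simulate(double p[], const double c[], size_t r, double a, int n)
{
    for (int step = 0; step < n; step++) {
        size_t i = choose_action(p, r);
        int beta = (uniform01() < c[i]) ? 1 : 0;   /* environment response */
        for (size_t j = 0; j < r; j++) {
            if (beta == 0)
                p[j] = (j == i) ? p[j] + a * (1.0 - p[j]) : (1.0 - a) * p[j];
            else
                p[j] = (j == i) ? (1.0 - a) * p[j]
                                : a / (double)(r - 1) + (1.0 - a) * p[j];
        }
    }
}
```

With e.g. c = {0.1, 0.9, 0.9, 0.9} and a small a, p_1(n) drifts towards 1 over time; note that both branches of the update preserve Σ p_j = 1 at every step.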
3. Algorithm of choice
Before we can finally start writing a basic program example, we need to find an
'algorithm of choice' that selects an action α_i tagged with the according probability
p_i.
To begin with, we consider the random number generator random(min, max),
where min is the lower bound of the random value and max the upper bound. It is
sufficient for our approach to use a pseudo-random generator: all microcontrollers
must perform mathematics to generate random numbers, so the sequence can
never be truly random. It is important, however, that the sequence of values generated by
random(min, max) differs on subsequent executions. This can be achieved by
initializing the random number generator with a fairly random input, such as the
reading of an unconnected ADC pin.
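A hedged sketch of such a generator on a PC follows; the helper names are assumptions. It emulates random(min, max) as defined here, i.e. with both bounds inclusive, on top of C's rand(). On an Arduino one would instead seed with randomSeed(analogRead(A0)) and note that Arduino's own random(min, max) excludes the upper bound.

```c
#include <stdlib.h>
#include <time.h>

/* Emulates random(min, max) as used in this text (both bounds inclusive).
 * Assumed helper name; the slight modulo bias is negligible for small ranges. */
int random_range(int min, int max)
{
    return min + rand() % (max - min + 1);
}

/* On a PC the clock stands in for the floating ADC pin as a noisy seed;
 * call once at startup so subsequent executions differ. */
void seed_rng(void)
{
    srand((unsigned)time(NULL));
}
```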
The easiest case is when the probabilities are equal (which is, for instance, initially the
case). The pseudo code looks as follows:

if(p_1==p_2==...==p_r) then rand_number=random(1,r) //if all probabilities are equal, generate a random number between 1 and r
if(rand_number==1) then do alpha_1 //if random number=1 then perform action α_1
if(rand_number==2) then do alpha_2 //if random number=2 then perform action α_2
...
if(rand_number==r) then do alpha_r //if random number=r then perform action α_r
To cover all cases, a simple approach is to first find the probability (or probabilities)
with the maximum value at time n and use a modified random number generator to
choose between the action tagged with the probability value p_max(n) and the remaining
probabilities. If the action with the maximum probability value has been selected, the
algorithm stops. If not, the action with the next smaller probability value is
determined and the modified random number generator is used again to choose. The
algorithm can be imagined as a repeated coin toss: on one side of the coin is the
action tagged with the probability p_max(n); on the other side are all the other
actions with the remaining probabilities.
We define the modified random number generator as random(1, y), where
y ∈ ℕ, y > 0. We furthermore define a variable t ∈ ℕ, t > 0.
Example:
int t=3 //assign the value 3 to the integer variable t
int y=5 //assign the value 5 to the integer variable y
rand_number=random(1,y) //generate a random number between 1 and 5
if(rand_number<=t) then do alpha_p_max //if random number has a value of 1 to 3 then perform action tagged with p_max
else
//start algorithm of choice again

The probability in the example that the action tagged with p_max will be chosen is 3/5,
while the probability that one of the actions tagged with the remaining probabilities will be
chosen is 2/5. With this method we can create any rational probability. In this
regard, we should not use an irrational number like √2 for the learning parameter
a, because then the probabilities most likely become irrational too, and we would
not be able to reproduce them with the simple random number generator introduced
above, which most microcontroller IDEs provide. We therefore define
p_i ∈ {0, 1}, otherwise p_i ∈ ℚ⁺, and 0 < a < 1 ∧ a ∈ ℚ⁺.
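The coin toss with a rational probability t/y can be checked without randomness: because the outcome depends only on whether the drawn number is ≤ t, enumerating all y equally likely outcomes of random(1, y) shows that exactly t of them select the tagged action. A small C sketch (assumed function name):

```c
/* Counts how many of the y equally likely outcomes of random(1, y)
 * satisfy rand_number <= t, i.e. select the action tagged with p_max. */
int count_selections(int t, int y)
{
    int hits = 0;
    for (int rand_number = 1; rand_number <= y; rand_number++)
        if (rand_number <= t)   /* then do alpha_p_max */
            hits++;
    return hits;
}
```

count_selections(3, 5) yields 3, matching the probability 3/5 of the example above.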
p_max(n) can now be computed by:

p_max(n) = t/y

As we defined the learning parameter a as 0 < a < 1 ∧ a ∈ ℚ⁺, we can substitute
a and p_j(n) by

a = u/v with u, v ∈ ℕ, u, v > 0,
p_j(n) = w/x with w, x ∈ ℕ, w, x > 0.
We re-write the right-hand sides of the following equations as common fractions, where we
only have integer denominators and numerators:

β = 0:  p_j(n + 1) = p_j(n) + a · (1 − p_j(n))   for j = i
        p_j(n + 1) = (1 − a) · p_j(n)            ∀ j ≠ i

β = 1:  p_j(n + 1) = (1 − a) · p_j(n)                    for j = i
        p_j(n + 1) = a/(r − 1) + (1 − a) · p_j(n)        ∀ j ≠ i

β = 0:  p_j(n + 1) = (v·w + u·(x − w)) / (v·x) = t_1/y_1   for j = i
        p_j(n + 1) = (w·(v − u)) / (v·x) = t_2/y_2         ∀ j ≠ i

β = 1:  p_j(n + 1) = (w·(v − u)) / (v·x) = t_3/y_3                          for j = i
        p_j(n + 1) = (x·u + w·(r − 1)·(v − u)) / (x·v·(r − 1)) = t_4/y_4    ∀ j ≠ i
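Written out in code, these common-fraction forms need only integer arithmetic, which is exactly what makes them reproducible with random(1, y). The sketch below uses assumed names; p_j(n) = w/x and a = u/v as above, and the resulting fraction t_k/y_k (not reduced to lowest terms) is returned through the pointers:

```c
/* Integer-only probability update.  same_action is 1 for j = i, 0 otherwise.
 * On return, *t / *y is p_j(n + 1) as a common fraction. */
void vsla_update_frac(int beta, int same_action, int w, int x, int u, int v,
                      int r, int *t, int *y)
{
    if (beta == 0) {            /* reward */
        if (same_action) { *t = v * w + u * (x - w);  *y = v * x; }
        else             { *t = w * (v - u);          *y = v * x; }
    } else {                    /* penalty */
        if (same_action) { *t = w * (v - u);          *y = v * x; }
        else             { *t = x * u + w * (r - 1) * (v - u);
                           *y = x * v * (r - 1); }
    }
}
```

With w/x = 1/4, u/v = 1/2 and r = 4 this reproduces the fractions 5/8, 1/8, 1/8 and 7/24 from the example in chapter 2.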
At the end of this chapter, we discuss the need for stopping criteria. As mentioned,
the robot learns asymptotically, i.e. a certain probability converges to 0 or 1 only for
n → ∞. The rate of convergence can be adjusted by the learning parameter a. The
learning parameter is no universal constant: to choose the right value, some
experiments need to be done, and it depends on the task the robot should learn. If
the value of the learning parameter is too small, it takes a long time until the robot
learns a task; if the value is too large, the robot might interpret data from the
environment wrongly.
As all microcontrollers that support floating-point math have a limited number
of digits after the decimal point, the convergence criterion is not n → ∞; it just
depends on the number of digits after the decimal point. If, for instance, 5 digits after
the decimal point can be calculated, the microcontroller will interpret 0.000001 as 0
and 0.999999 as 1. That isn't a problem for our project and can be used instead of
implementing stopping criteria.
Stopping criteria only make sense if the environment does not change. Imagine a
robot arm that should learn to lift a cup from a fixed position. After the robot has
learned to lift the cup, the learning process can be stopped; it just repeats what it has
learned. But if the position of the cup changes unpredictably, the learning process
never stops.