
Concept of a learning robot based on VSLA

by M. Bindhammer

1. Learning robot and its environment

In this first chapter we briefly discuss the concept behind the learning robot and its environment, based on the variable structure stochastic learning automaton (VSLA). The robot can choose from a finite number of actions (e.g. drive forwards, drive backwards, turn right, turn left). Initially, at time t = n = 1, one of the possible actions α is chosen by the robot at random with a given probability p. This action is applied to the random environment in which the robot "lives", and the response β from the environment is observed by the sensor(s) of the robot.

The feedback β from the environment is binary, i.e. it is either favorable or unfavorable for the given task the robot should learn. We define β = 0 as a reward (favorable) and β = 1 as a penalty (unfavorable). If the response from the environment is favorable (β = 0), then the probability p_i of choosing that action α_i for the next period of time t = n + 1 is updated according to the updating rule T. After that, another action is chosen and the response of the environment observed. When a certain stopping criterion is reached, the algorithm stops and the robot has learnt some characteristics of the random environment.

Definition abstract:

- α = {α_1, α_2, ..., α_r} is the finite set of r actions/outputs of the robot. The output (action) applied to the environment at time t = n is denoted by α(n).

- β = {β_1, β_2} is the binary set of inputs/responses from the environment. The input (response) applied to the robot at time t = n is denoted by β(n). In our case the values for β are chosen to be 0 and 1: β = 0 represents a reward and β = 1 a penalty.

- p = {p_1, p_2, ..., p_r} is the finite set of probabilities that a certain action α(n) is chosen at time t = n, denoted by p(n).

- T is the updating function (rule) according to which the elements of the set p are updated at each time t = n. Therefore

  $$p(n+1) = T(\alpha(n), \beta(n), p(n)),$$

  where the i-th element of the set p(n) is $p_i(n) = \mathrm{Prob}(\alpha(n) = \alpha_i)$ with i = 1, 2, ..., r,

  $$\forall n: \sum_{i=1}^{r} p_i(n) = p_1(n) + p_2(n) + \dots + p_r(n) = 1 \quad \text{and} \quad \forall i: p_i(1) = \frac{1}{r}.$$

- c = {c_1, c_2, ..., c_r} is the finite set of penalty probabilities that the action α_i will result in a penalty input from the random environment. If the penalty probabilities are constant, the environment is called a stationary random environment.

The updating functions (reinforcement schemes) are categorized based on their

linearity. The general linear scheme is given by:

If α(n) = α_i,

β = 0:
$$p_j(n+1) = \begin{cases} p_j(n) + a\,(1 - p_j(n)) & j = i \\ (1-a)\,p_j(n) & \forall j \neq i \end{cases}$$

β = 1:
$$p_j(n+1) = \begin{cases} (1-b)\,p_j(n) & j = i \\ \dfrac{b}{r-1} + (1-b)\,p_j(n) & \forall j \neq i \end{cases}$$

where a and b are the learning parameters with 0 < a, b < 1.

If a = b, the scheme is called the linear reward-penalty scheme. If for β = 1 all p_j remain unchanged (∀j, i.e. b = 0), it is called the linear reward scheme.
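
To make the updating rule T concrete, the following minimal C++ sketch implements the general linear scheme given above. It is not part of the original text: the function name, the 0-based action index i and the use of double precision are my own assumptions.

const int r = 4;                              // number of actions
double p[r] = {0.25, 0.25, 0.25, 0.25};       // action probabilities p_1..p_r, sum to 1

// Apply the updating rule T after action i (0-based) received the
// response beta from the environment (0 = reward, 1 = penalty).
// a and b are the learning parameters with 0 < a, b < 1.
void updateProbabilities(int i, int beta, double a, double b) {
    for (int j = 0; j < r; ++j) {
        if (beta == 0) {                                      // reward
            p[j] = (j == i) ? p[j] + a * (1.0 - p[j])
                            : (1.0 - a) * p[j];
        } else {                                              // penalty
            p[j] = (j == i) ? (1.0 - b) * p[j]
                            : b / (r - 1) + (1.0 - b) * p[j];
        }
    }
}

Calling updateProbabilities(0, 0, 0.5, 0.5) on the initial distribution reproduces the numbers of the example in chapter 2 (5/8 for α_1 and 1/8 for the other actions).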


2. An Example

We make two assumptions before we start with the example. For simplicity we consider the random environment to be a stationary random environment, and we use the linear reward-penalty scheme.

Let's say the robot roams through a room and shall learn how to avoid obstacles. A stationary random environment then simply means that the probability that the robot will hit an obstacle is the same everywhere in the room. We will discuss later in detail how to obtain such penalty probabilities as a function of the position in the room.

Let us assume the robot can choose from the set α = {α_1, α_2, α_3, α_4} of actions. We could define these actions for instance as follows: α_1: drive forwards, α_2: drive backwards, α_3: turn right and α_4: turn left.

Here i = 1, 2, ..., r and r = 4, so the initial probabilities are

$$p_i(1) = \frac{1}{r} = \frac{1}{4}.$$

Let $a = b = \frac{1}{2}$.

Let's assume now that the initial action α_1 (which has been selected randomly) has led to an input β = 0 (reward) at time t = n. The new probabilities are then calculated as follows:

β = 0:
$$p_j(n+1) = \begin{cases} p_j(n) + a\,(1 - p_j(n)) & j = i \\ (1-a)\,p_j(n) & \forall j \neq i \end{cases}$$

For α_1:
$$p_j(n+1) = p_j(n) + a\,(1 - p_j(n)) = \frac{1}{4} + \frac{1}{2}\left(1 - \frac{1}{4}\right) = \frac{5}{8}$$

For α_2, α_3, α_4:
$$p_j(n+1) = (1-a)\,p_j(n) = \left(1 - \frac{1}{2}\right)\cdot\frac{1}{4} = \frac{1}{8}$$

As it is required that $\forall n: \sum_{i=1}^{r} p_i(n) = 1$, we check:

$$\sum_{j=1}^{r} p_j(n+1) = \frac{5}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = 1$$

I.e. after the input β(n) = 0 from the environment, the probability that action α_1 will be chosen as action α(n+1) has increased to 5/8, while the probabilities that one of the actions α_2, α_3 or α_4 will be chosen have decreased to 1/8.

We now compute the same for the case that the initial action α_1 has led to an input β = 1 (penalty) at time t = n.

β = 1:
$$p_j(n+1) = \begin{cases} (1-b)\,p_j(n) & j = i \\ \dfrac{b}{r-1} + (1-b)\,p_j(n) & \forall j \neq i \end{cases}$$

For α_1:
$$p_j(n+1) = (1-b)\,p_j(n) = \left(1 - \frac{1}{2}\right)\cdot\frac{1}{4} = \frac{1}{8}$$

For α_2, α_3, α_4:
$$p_j(n+1) = \frac{b}{r-1} + (1-b)\,p_j(n) = \frac{1/2}{4-1} + \left(1 - \frac{1}{2}\right)\cdot\frac{1}{4} = \frac{7}{24}$$

$$\sum_{j=1}^{r} p_j(n+1) = \frac{1}{8} + \frac{7}{24} + \frac{7}{24} + \frac{7}{24} = 1$$

I.e. after the input β(n) = 1 from the environment, the probability that action α_1 will be chosen as action α(n+1) has decreased to 1/8, while the probabilities that one of the actions α_2, α_3 or α_4 will be chosen have increased to 7/24.

From this example it can also be seen immediately that the limits of a probability p_i for n → ∞ are either 0 or 1. The robot therefore learns to choose the optimal action asymptotically. It should be noted that it does not always converge to the correct action, but the probability that it converges to the wrong one can be made arbitrarily small by making the learning parameter a small.
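
The following self-contained C++ simulation is my own sketch of this behaviour, not part of the original text. It assumes a stationary environment with penalty probabilities c = {0.1, 0.8, 0.8, 0.8} (so α_1 is the best action) and runs the linear reward-penalty scheme for many steps; the probability of the low-penalty action typically ends up much larger than the others.

#include <cstdio>
#include <random>

int main() {
    const int r = 4;
    double p[r] = {0.25, 0.25, 0.25, 0.25};       // initial action probabilities
    const double c[r] = {0.10, 0.80, 0.80, 0.80}; // assumed penalty probabilities
    const double a = 0.05;                        // learning parameter, a = b

    std::mt19937 rng(12345);                      // fixed seed for reproducibility
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    for (int n = 0; n < 10000; ++n) {
        // choose an action according to the current probabilities p
        double u = uni(rng), cum = 0.0;
        int i = r - 1;
        for (int j = 0; j < r; ++j) {
            cum += p[j];
            if (u <= cum) { i = j; break; }
        }

        int beta = (uni(rng) < c[i]) ? 1 : 0;     // response of the environment

        // linear reward-penalty update (a = b)
        for (int j = 0; j < r; ++j) {
            if (beta == 0)
                p[j] = (j == i) ? p[j] + a * (1.0 - p[j]) : (1.0 - a) * p[j];
            else
                p[j] = (j == i) ? (1.0 - a) * p[j] : a / (r - 1) + (1.0 - a) * p[j];
        }
    }

    for (int j = 0; j < r; ++j)
        std::printf("p_%d = %.3f\n", j + 1, p[j]);
    return 0;
}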


3. Algorithm of choice

Before we can finally start writing a basic program example, we need to find an 'algorithm of choice' that selects an action α_i, tagged with the corresponding probability p_i.

To begin with, we consider the random number generator random(min, max), where min is the lower bound of the random value and max the upper bound. It is sufficient for our approach to use a pseudo-random generator: since a microcontroller must perform mathematics to generate random numbers, the sequence can never be truly random, but it is important that the sequence of values generated by random(min, max) differs on subsequent executions. This can be achieved by initializing the random number generator with a fairly random input, such as the reading of an unconnected ADC pin.
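
On an Arduino-style microcontroller this could look roughly as follows. This fragment is my own illustration, assuming analog pin A0 is left unconnected; note that Arduino's random(min, max) excludes the upper bound, unlike the notation random(min, max) used in this text.

const int r = 4;                       // number of actions

void setup() {
    // noise on the floating ADC pin gives a different seed on every power-up
    randomSeed(analogRead(A0));
}

void loop() {
    long rand_number = random(1, r + 1);   // integer in 1..r (upper bound is exclusive)
    // ... choose and perform an action based on rand_number ...
}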

The easiest case is when all probabilities are equal (which is, for instance, initially the case). The pseudo code looks as follows:

if(p_1==p_2==...==p_r) then rand_number=random(1,r)
    //if all probabilities are equal, generate a random number between 1 and r
if(rand_number==1) then do alpha_1   //if random number = 1, perform action α1
if(rand_number==2) then do alpha_2   //if random number = 2, perform action α2
...
if(rand_number==r) then do alpha_r   //if random number = r, perform action αr

To cover all cases, a simple approach is to first find the probability (or probabilities) with the maximum value at time n and to use a modified random number generator to choose the action according to the probability value of p_max(n) and the remaining probabilities. If the action with the maximum probability value has been selected, the algorithm stops. If not, the action with the next smaller probability value is determined and the modified random number generator is used again to choose. The algorithm can be imagined as a repeated coin toss: on one side of the coin is the action tagged with the probability p_max(n), on the other side are all the other actions with the remaining probabilities.

We define the modified random number generator as random(1, y), where $y \in \mathbb{N}_{>0}$. We furthermore define a variable $t \in \mathbb{N}_{>0}$.

Example:


int t=3                    //assign the value 3 to the integer variable t
int y=5                    //assign the value 5 to the integer variable y
rand_number=random(1,y)    //generate a random number between 1 and 5
if(rand_number<=t) then do alpha_p_max
                           //if the random number has a value of 1 to 3,
                           //perform the action tagged with p_max
else
    //start the algorithm of choice again

The probability in the example that the action tagged with p_max will be chosen is 3/5, while the probability that an action tagged with one of the remaining probabilities will be chosen is 2/5. With this method we can create any rational probability. In this regard, we should not use an irrational number like √2 for the learning parameter a, because the probabilities would then most likely become irrational too and we would not be able to reproduce them with the simple random number generator introduced above, which most microcontroller IDEs provide. We therefore define $p_i \in \{0, 1\}$, otherwise $p_i \in \mathbb{Q}^{+}$, and $0 < a < 1 \wedge a \in \mathbb{Q}^{+}$.

$p_{\max}(n)$ can now be computed by:

$$p_{\max}(n) = \frac{t}{y}$$
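
As a concrete, desktop-testable realization, the repeated coin toss described above can be collapsed into a single draw: if every p_j is kept as an integer numerator over one common denominator, a single call of random(1, denominator) already picks an action with exactly the right probability. The following C++ sketch is my own simplification along these lines, not the original author's routine; the values 5/8, 1/8, 1/8, 1/8 are taken from the example in chapter 2.

#include <cstdio>
#include <cstdlib>

const int r = 4;
long num[r] = {5, 1, 1, 1};   // numerators of p_1..p_4 over the common denominator
long den    = 8;              // common denominator, num[0]+...+num[r-1] == den

// Return the index (0-based) of the chosen action.
int chooseAction() {
    long roll = std::rand() % den + 1;   // stands in for random(1, den)
    long cum = 0;
    for (int j = 0; j < r; ++j) {
        cum += num[j];
        if (roll <= cum) return j;       // roll fell into the slice of action j
    }
    return r - 1;                        // not reached if the numerators sum to den
}

int main() {
    std::srand(42);                      // on a microcontroller: seed from an ADC pin
    int counts[r] = {0};
    for (int k = 0; k < 8000; ++k) counts[chooseAction()]++;
    for (int j = 0; j < r; ++j)
        std::printf("alpha_%d chosen %d times\n", j + 1, counts[j]);
    return 0;
}

With 8000 draws the counts come out close to 5000, 1000, 1000 and 1000, i.e. the ratios 5/8 and 1/8.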

As we defined the learning parameter a as $0 < a < 1 \wedge a \in \mathbb{Q}^{+}$, we can substitute a and $p_j(n)$ by

$$a = \frac{u}{v} \quad \text{with } u, v \in \mathbb{N}_{>0}, \qquad p_j(n) = \frac{w}{x} \quad \text{with } w, x \in \mathbb{N}_{>0}.$$

We re-write the right-hand terms of the following equations as common fractions that have only integer numerators and denominators:

β = 0:
$$p_j(n+1) = \begin{cases} p_j(n) + a\,(1 - p_j(n)) & j = i \\ (1-a)\,p_j(n) & \forall j \neq i \end{cases}$$

β = 1:
$$p_j(n+1) = \begin{cases} (1-a)\,p_j(n) & j = i \\ \dfrac{a}{r-1} + (1-a)\,p_j(n) & \forall j \neq i \end{cases}$$

β = 0:
$$p_j(n+1) = \begin{cases} \dfrac{v\,w + u\,(x-w)}{v\,x} = \dfrac{t_1}{y_1} & j = i \\[2ex] \dfrac{w\,(v-u)}{v\,x} = \dfrac{t_2}{y_2} & \forall j \neq i \end{cases}$$

β = 1:
$$p_j(n+1) = \begin{cases} \dfrac{w\,(v-u)}{v\,x} = \dfrac{t_3}{y_3} & j = i \\[2ex] \dfrac{x\,u + w\,(r-1)\,(v-u)}{x\,v\,(r-1)} = \dfrac{t_4}{y_4} & \forall j \neq i \end{cases}$$
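
A small C++ sketch of this integer-only bookkeeping is given below; it is my own illustration, not the original author's code. The learning parameter a = u/v and every probability p_j = w_j/x_j are stored as integer pairs, so the update never leaves the rational numbers; reducing by the greatest common divisor afterwards keeps the numerators and denominators from growing too quickly.

#include <cstdio>
#include <numeric>                 // std::gcd (C++17)

const int r = 4;
long w[r] = {1, 1, 1, 1};          // numerators of p_1..p_4, initially 1/4 each
long x[r] = {4, 4, 4, 4};          // denominators of p_1..p_4
const long u = 1, v = 2;           // learning parameter a = u/v = 1/2

static void reduce(int j) {
    long g = std::gcd(w[j], x[j]);
    if (g > 1) { w[j] /= g; x[j] /= g; }
}

// Apply the fraction form of the updating rule after action i (0-based)
// received the response beta (0 = reward, 1 = penalty).
void updateFractions(int i, int beta) {
    for (int j = 0; j < r; ++j) {
        if (beta == 0) {
            if (j == i) { w[j] = v * w[j] + u * (x[j] - w[j]); x[j] = v * x[j]; }
            else        { w[j] = w[j] * (v - u);               x[j] = v * x[j]; }
        } else {
            if (j == i) { w[j] = w[j] * (v - u);               x[j] = v * x[j]; }
            else        { w[j] = x[j] * u + w[j] * (r - 1) * (v - u);
                          x[j] = x[j] * v * (r - 1); }
        }
        reduce(j);
    }
}

int main() {
    updateFractions(0, 0);         // reward for alpha_1, as in the example of chapter 2
    for (int j = 0; j < r; ++j)
        std::printf("p_%d = %ld/%ld\n", j + 1, w[j], x[j]);
    return 0;                      // prints 5/8, 1/8, 1/8, 1/8
}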

At the end of this chapter, we discuss the need for stopping criteria. As mentioned, the robot learns asymptotically, i.e. a certain probability converges to 0 or 1 only for n → ∞. The rate of convergence can be adjusted by the learning parameter a. The learning parameter is not a universal constant: to choose the right value, some experiments need to be done, and it depends on the task the robot should learn. If the value of the learning parameter is too small, it takes a long time until the robot learns a task; if the value of the learning parameter is too large, the robot might interpret data from the environment wrongly.

As all microcontrollers that support floating-point math have a limited number of digits after the decimal point, the convergence criterion is not n → ∞; it just depends on the number of digits after the decimal point. If, for instance, 5 digits after the decimal point can be calculated, the microcontroller will interpret 0.000001 as 0 and 0.999999 as 1. That is not a problem for our project and can be used instead of implementing stopping criteria.
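
If an explicit check is preferred over relying on the float type's limited precision, a probability can simply be treated as converged once it lies within a small tolerance of 1. A minimal fragment (my own addition, with an assumed tolerance) could look like this:

const double EPS = 1e-5;           // tolerance; depends on the available precision

// true once one action has (almost) absorbed all the probability mass
bool hasConverged(const double p[], int r) {
    for (int j = 0; j < r; ++j)
        if (p[j] > 1.0 - EPS) return true;
    return false;
}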

Stopping criteria only make sense if the environment does not change. Imagine a robot arm that should learn to lift a cup from a fixed position. After the robot has learned to lift the cup, the learning process can be stopped; the robot just repeats what it has learned. But if the position of the cup changes unpredictably, the learning process never stops.