Towards a Theoretical Framework for LCS

DESCRIPTION
Alwyn Barry introduces the theoretical framework for LCS that Jan Drugowitsch is currently working on.

TRANSCRIPT
Towards a Theoretical Framework for LCS
Jan Drugowitsch
Dr Alwyn Barry

May 2006 © A.M. Barry and J. Drugowitsch, 2006
Overview

Big Question: Why use LCS?
1. Aims and Approach
2. Function Approximation
3. Dynamic Programming
4. Classifier Replacement
5. Future Work
Research Aims

• To establish a formal model for Accuracy-based LCS
• To use established models, methods and notation
• To transfer method knowledge (analysis / improvement)
• To create links to other ML approaches and disciplines
Approach

• Reduce LCS to components:
  – Function Approximation
  – Dynamic Programming
  – Classifier Replacement
• Model the components
• Model their interactions
• Verify models by empirical studies
• Extend models from related domains
Function Approximation

• Architecture Outline:
  – Set of K classifiers, matching states S_k ⊆ S
  – Every classifier k approximates the value function V over S_k independently
• Assumptions:
  – Function V : S → ℜ is time-invariant
  – Time-invariant set of classifiers {S_1, …, S_K}
  – Fixed sampling distribution over S given by π : S → [0, 1]
  – Minimising MSE:

    ∑_{i ∈ S_k} π(i) (V(i) − Ṽ_k(i))²
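A minimal numerical sketch of this π-weighted MSE, with purely hypothetical toy values (4 states, one classifier matching two of them):

```python
import numpy as np

# Toy setup: 4 states, classifier k matches states {0, 1} (its S_k).
# pi is the fixed sampling distribution over S, V the target values,
# V_tilde_k the classifier's independent approximation over S_k.
pi = np.array([0.4, 0.3, 0.2, 0.1])
V = np.array([1.0, 2.0, 3.0, 4.0])
V_tilde_k = np.array([1.1, 1.9, 0.0, 0.0])   # only meaningful on S_k
S_k = np.array([True, True, False, False])    # matching indicator for S_k

# MSE restricted to matched states, weighted by pi:
#   sum_{i in S_k} pi(i) * (V(i) - V~_k(i))^2
mse_k = np.sum(pi[S_k] * (V[S_k] - V_tilde_k[S_k]) ** 2)
print(mse_k)  # 0.4*0.01 + 0.3*0.01 = 0.007
```

Note that states outside S_k contribute nothing: each classifier is assessed only over its own matched region.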
Function Approximation

• Linear approximation architecture
  – Feature vector φ : S → ℜ^L
  – Approximation parameters vector ω_k ∈ ℜ^L
  – Approximation: Ṽ_k(i) = φ(i)′ ω_k
• Averaging classifiers: φ(i) = (1), ∀i ∈ S
• Linear estimators: φ(i) = (1, i)′, ∀i ∈ S
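A sketch of fitting the "linear estimator" case, φ(i) = (1, i)′, by π-weighted least squares over the matched states (all numbers are hypothetical; the slides do not prescribe this particular solver):

```python
import numpy as np

# Fit classifier k's parameters omega_k so that V~_k(i) = phi(i)' omega_k
# minimises the pi-weighted squared error over its matched states S_k.
states = np.array([0.0, 1.0, 2.0, 3.0])
V = 2.0 * states + 1.0                      # toy target value function
pi = np.array([0.25, 0.25, 0.25, 0.25])
S_k = np.array([True, True, True, False])   # classifier matches first 3 states

Phi = np.column_stack([np.ones_like(states), states])  # rows phi(i)'
w = np.sqrt(pi[S_k])                        # weighted least squares
omega_k, *_ = np.linalg.lstsq(Phi[S_k] * w[:, None], V[S_k] * w, rcond=None)

V_tilde_k = Phi @ omega_k                   # V~_k(i) = phi(i)' omega_k
print(omega_k)  # recovers (1, 2) since V is exactly linear here
```

An averaging classifier is the special case Φ = (1, …, 1)′, whose least-squares fit is simply the weighted mean of V over S_k.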
Function Approximation

• Approximation to MSE for k at time t:

    ε_{k,t} = (1 / c_{k,t}) ∑_{m=0}^{t} I_{S_k}(i_m) (V(i_m) − φ(i_m)′ ω_{k,t})²

  – Matching: I_{S_k} : S → {0, 1}
  – Match count: c_{k,t} = ∑_{m=0}^{t} I_{S_k}(i_m)
• Since the sequence of states i_m is dependent on π, the MSE is automatically weighted by this distribution.
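The time-indexed quantities above can be sketched as a simple accumulation over an observed state sequence (toy values throughout; the matched-only sum is what makes the estimate implicitly π-weighted):

```python
# eps_{k,t} = (1/c_{k,t}) * sum_{m=0..t} I_Sk(i_m) * (V(i_m) - V~_k(i_m))^2
def sample_mse(states_seen, in_Sk, V, V_tilde_k):
    c_kt = 0            # match count c_{k,t}
    err = 0.0           # accumulated matched squared error
    for i in states_seen:
        if in_Sk(i):    # I_Sk(i_m) = 1 only for matched states
            c_kt += 1
            err += (V[i] - V_tilde_k[i]) ** 2
    return err / c_kt if c_kt > 0 else 0.0

# Hypothetical example: 3 states, classifier matches {0, 1} only.
V = {0: 1.0, 1: 2.0, 2: 3.0}
V_tilde_k = {0: 1.2, 1: 1.8, 2: 0.0}
seq = [0, 2, 1, 0, 2]   # observed i_0 .. i_4, drawn according to pi
eps = sample_mse(seq, lambda i: i in (0, 1), V, V_tilde_k)
print(eps)  # (0.04 + 0.04 + 0.04) / 3 = 0.04
```

Because the i_m arrive according to π, frequently sampled states appear more often in the sum, which is exactly the automatic π-weighting the slide notes.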
Function Approximation

• Mixing classifiers
  – A classifier k covers S_k ⊆ S
  – Need to mix estimates to recover Ṽ(i):

    Ṽ(i) = ∑_{k=1}^{K} ψ_k(i) φ(i)′ ω_k

    where ψ_k : S → [0, 1] and ∑_{k=1}^{K} ψ_k(i) = 1
  – Requires mixing parameter (will be ‘accuracy’ … see later!)
  – ψ_k(i) = 0 where i ∉ S_k … classifiers that do not match do not contribute
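A toy sketch of this mixing, using two overlapping averaging classifiers and, for simplicity, equal mixing weights wherever both match (an assumption here, not the accuracy-based scheme introduced later):

```python
import numpy as np

# Global estimate V~(i) = sum_k psi_k(i) * phi(i)' omega_k, with psi_k(i) = 0
# off S_k and sum_k psi_k(i) = 1 wherever some classifier matches.
N = 4
phi = lambda i: np.array([1.0])            # averaging classifiers, phi(i) = (1)
omega = [np.array([2.0]), np.array([5.0])] # local parameters, one per classifier
S = [np.array([1, 1, 1, 0]),               # I_S1: matches states 0..2
     np.array([0, 0, 1, 1])]               # I_S2: matches states 2..3

# Equal weight where both match (state 2), full weight where only one matches.
psi = np.array([s / np.maximum(np.sum(S, axis=0), 1) for s in S], dtype=float)

V_tilde = sum(psi[k] * np.array([phi(i) @ omega[k] for i in range(N)])
              for k in range(2))
print(V_tilde)  # [2.0, 2.0, 3.5, 5.0]
```

On the overlap (state 2) the mixed estimate sits between the two local estimates; elsewhere only the matching classifier contributes.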
Function Approximation

• Matrix notation:

    V = (V(1), …, V(N))′
    Φ = (φ(1), …, φ(N))′   (rows φ(i)′)
    D = diag(π(1), …, π(N))
    I_{S_k} = diag(I_{S_k}(1), …, I_{S_k}(N))
    D_k = D I_{S_k}
    Ψ_k = diag(ψ_k(1), …, ψ_k(N))
    Ṽ_k = Φ ω_k
    Ṽ = ∑_{k=1}^{K} Ψ_k Ṽ_k = ∑_{k=1}^{K} Ψ_k Φ ω_k
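The matrix form can be built explicitly for a tiny toy problem (all values hypothetical): D, I_{S_k} and Ψ_k are diagonal, Φ stacks the feature rows, and the mixed estimate is ∑_k Ψ_k Φ ω_k.

```python
import numpy as np

# Tiny instance of the matrix notation: N = 3 states, L = 2 features.
N, L = 3, 2
pi = np.array([0.5, 0.3, 0.2])
D = np.diag(pi)                            # sampling distribution, diagonal
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])               # rows phi(1)' .. phi(N)'
I_S1 = np.diag([1.0, 1.0, 0.0])            # classifier 1 matches states 1, 2
Psi_1 = np.diag([1.0, 1.0, 0.0])           # sole matcher there -> weight 1
omega_1 = np.array([0.5, 1.0])

V_tilde_1 = Phi @ omega_1                  # classifier estimate V~_1 = Phi omega_1
V_tilde = Psi_1 @ V_tilde_1                # mixed estimate (single classifier here)
print(V_tilde)  # [0.5, 1.5, 0.0]
```

With more classifiers the mixed estimate is just the sum of the corresponding Ψ_k Φ ω_k terms.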
Function Approximation

• Achievements
  – Proof that the MSE of a classifier k at time t is the normalised sample variance of the value function over the states matched until time t
  – Proof that the expression developed for sample-based approximated MSE is a suitable approximation for the actual MSE, for finite state spaces
  – Proof of optimality of approximated MSE from the vector-space viewpoint (relates classifier error to the Principle of Orthogonality)
Function Approximation

• Achievements (cont’d)
  – Elaboration of algorithms using the model:
    • Steepest Gradient Descent
    • Least Mean Square
      – Show sensitivity to step size and input range hold in LCS
    • Normalised LMS
    • Kalman Filter
      – … and its inverse co-variance form
      – Equivalence to Recursive Least Squares
      – Derivation of computationally cheap implementation
  – For each, derivation of:
    • Stability criteria
      – For LMS, identification of the excess mean squared estimation error
    • Time constant bounds
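The gradient-descent algorithms listed above have well-known textbook forms; a sketch of the standard LMS and normalised-LMS weight updates (the slide's LCS-specific derivations may differ in detail, and the step sizes here are arbitrary toy choices):

```python
import numpy as np

# Textbook per-sample updates for a classifier's parameter vector omega:
#   LMS:   omega += eta * phi * (V - phi' omega)
#   NLMS:  omega += eta * phi * (V - phi' omega) / ||phi||^2
def lms_step(omega, phi, target, eta=0.1):
    return omega + eta * phi * (target - phi @ omega)

def nlms_step(omega, phi, target, eta=0.5):
    return omega + eta * phi * (target - phi @ omega) / (phi @ phi)

# Fit V(i) = 1 + 2i with a linear estimator, phi(i) = (1, i)'.
omega = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(2000):
    i = rng.uniform(0.0, 1.0)
    phi = np.array([1.0, i])
    omega = nlms_step(omega, phi, 1.0 + 2.0 * i)
print(omega)  # approaches (1, 2)
```

Normalising by ‖φ‖² is what removes the plain LMS sensitivity to the input range that the slide mentions.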
Function Approximation

• Achievements (cont’d)
  – Investigation:
Function Approximation

• How to set the mixing weights ψ_k(i)?
  – Accuracy-based mixing (YCS-like!):

    ψ_k(i) = I_{S_k}(i) ε_k^{−ν} / ∑_{p=1}^{K} I_{S_p}(i) ε_p^{−ν}

  – Divorce mixing for fn. approx. from fitness
  – Investigating power factor ν ∈ ℜ⁺
  – Demonstrate, using the model, that we obtain the MLE of the mixed classifiers when ν = 1
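A direct sketch of these accuracy-based weights for a single state, using hypothetical per-classifier errors ε_k:

```python
import numpy as np

# psi_k(i) = I_Sk(i) * eps_k^{-nu} / sum_p I_Sp(i) * eps_p^{-nu}
def mixing_weights(eps, match, nu=1.0):
    """eps: per-classifier errors; match: I_Sk(i) for one fixed state i."""
    raw = match * eps ** (-nu)
    return raw / raw.sum()

eps = np.array([0.1, 0.4])                 # classifier 1 is 4x more accurate
match = np.array([1.0, 1.0])               # both classifiers match state i

print(mixing_weights(eps, match, nu=1.0))  # [0.8, 0.2]
print(mixing_weights(eps, match, nu=3.0))  # sharper: low-error side dominates
```

Raising ν sharpens the weighting toward the lowest-error classifier, which is exactly the trade-off explored in the empirical results that follow.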
Function Approximation

• Mixing – Empirical Results
  – Higher ν, more importance to low error
  – Found ν = 1 best compromise

  [Two plots: classifiers covering 0..π/2, π/2..π, 0..π and classifiers covering 0..0.5, 0.5..1, 0..1]
Dynamic Programming

• Dynamic Programming:

    V*(i) = max_μ E[ ∑_{t=0}^{∞} γ^t r(i_t, a_t, i_{t+1}) | i_0 = i, a_t = μ(i_t) ]

• Bellman’s Equation:

    V*(i) = max_{a∈A} E[ r(i, a, j) + γ V*(j) | j ∼ p(j | i, a) ]

• Applying the DP operator T to the value estimate:

    (TV)(i) = max_{a∈A} E[ r(i, a, j) + γ V(j) | j ∼ p(j | i, a) ]
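The DP operator T can be applied repeatedly as tabular value iteration; a sketch on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards are toy values):

```python
import numpy as np

# (TV)(i) = max_a sum_j p(j|i,a) * (r(i,a,j) + gamma * V(j))
gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # p(j|i,a), indexed [i, a, j]
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[[0.0, 0.0], [0.0, 1.0]],   # r(i, a, j)
              [[0.0, 0.0], [0.0, 2.0]]])

def T(V):
    # Expected immediate reward plus discounted expected next value, per (i, a).
    Q = np.einsum('iaj,iaj->ia', P, R) + gamma * np.einsum('iaj,j->ia', P, V)
    return Q.max(axis=1)

V = np.zeros(2)
for _ in range(200):                       # contraction -> unique fixed point V*
    V = T(V)
print(V)  # V*(1) solves V*(1) = 2 + 0.9 V*(1) -> 20, and V*(0) = 1 + 0.9*20 = 19
```

Because T is a max-norm contraction with modulus γ, the iterates converge geometrically to the unique fixed point V* = TV*.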
Dynamic Programming

• T is a contraction w.r.t. maximum norm
  – Converges to unique fixed point V* = TV*
• With function approximation:
  – Approximation after every update:

    Ṽ_{t+1} = f(TṼ_t)

  – Might not converge for more complex approximation architectures
• Is LCS function approximation a non-expansion w.r.t. maximum norm?
Dynamic Programming

• LCS Value Iteration
  – Assume:
    • constant population
    • averaging classifiers, φ(i) = (1), ∀i ∈ S
  – Classifier is approximating: V_{t+1} = TṼ_t
  – Based on minimising MSE:

    ∑_{i∈S_k} (V_{t+1}(i) − Ṽ_{k,t+1}(i))² = ‖V_{t+1} − Ṽ_{k,t+1}‖²_{I_{S_k}} = ‖TṼ_t − Ṽ_{k,t+1}‖²_{I_{S_k}}
Dynamic Programming

• LCS Value Iteration (cont’d)
  – Minimum is orthogonal projection on the approximation:

    Ṽ_{k,t+1} = Π_{I_{S_k}} V_{t+1} = Π_{I_{S_k}} TṼ_t

  – Need also to determine new mixing weight by evaluating the new approximation error:

    ε_{k,t+1} = (1 / Tr(I_{S_k})) ‖V_{t+1} − Ṽ_{k,t+1}‖²_{I_{S_k}} = (1 / Tr(I_{S_k})) ‖TṼ_t − Π_{I_{S_k}} TṼ_t‖²_{I_{S_k}}

  – ∴ the only time dependency is on Ṽ_t
  – Thus:

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_{k,t+1} Π_{I_{S_k}} TṼ_t
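One step of this projected iteration can be sketched numerically. With averaging classifiers, Π_{I_{S_k}} reduces to the mean of TṼ over the matched states; the toy MDP, disjoint classifiers, and fixed mixing weights below are assumptions for illustration only:

```python
import numpy as np

gamma = 0.9
def T(V):                                  # toy deterministic chain 0 -> 1 -> 1
    r = np.array([1.0, 2.0])
    nxt = np.array([1, 1])
    return r + gamma * V[nxt]

S = [np.array([True, False]), np.array([False, True])]  # disjoint classifiers
psi = np.array([[1.0, 0.0], [0.0, 1.0]])                # fixed mixing weights

V_tilde = np.zeros(2)
for _ in range(300):
    TV = T(V_tilde)
    # V~_{k,t+1} = Pi_{I_Sk} T V~_t : for averaging classifiers, the mean over S_k
    V_k = np.array([np.where(S[k], TV[S[k]].mean(), 0.0) for k in range(2)])
    V_tilde = (psi * V_k).sum(axis=0)      # V~_{t+1} = sum_k Psi_k V~_{k,t+1}
print(V_tilde)  # converges to the fixed point (19, 20) of this toy chain
```

With disjoint single-state classifiers the projection is the identity, so the sketch reproduces exact value iteration, consistent with the convergence result stated later for disjoint classifiers.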
Dynamic Programming

• LCS Asynchronous Value Iteration
  – Only update V_k for the matched state
  – Minimise:

    ∑_{m=0}^{t} I_{S_k}(i_m) (TṼ(i_m) − φ(i_m)′ ω_{k,t})²

  – Thus, approximation costs weighted by state distribution:

    ∑_{i∈S_k} π(i) (TṼ(i) − φ(i)′ ω_k)² = ‖TṼ − Φ ω_k‖²_{D_k}

    ∴ Ṽ_{k,t+1} = Π_{D_k} TṼ_t

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_{k,t+1} Π_{D_k} TṼ_t
Dynamic Programming

• Also derived:
  – LCS Q-Learning
    • LMS Implementation
    • Kalman Filter Implementation
    • Demonstrate that the weight update uses the normalised form of local gradient descent
  – Model-based Policy Iteration:

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_k Π_{D_k} T_μ Ṽ_t

  – Step-wise Policy Iteration
  – TD(λ) is not possible in LCS due to re-evaluation of mixing weights
Dynamic Programming

• Is LCS function approximation a non-expansion w.r.t. maximum norm?
  – We derive that:
    • Constant Mixing: Yes
    • Arbitrary Mixing: Not necessarily
    • Accuracy-based Mixing: non-expansion (dependent upon a conjecture)
• Is LCS function approximation a non-expansion w.r.t. weighted norm?
  – We know T_μ is a contraction
  – We show Π_{D_k} is a non-expansion
  – We demonstrate:
    • Single Classifier: convergence to a fixed pt
    • Disjoint Classifiers: convergence to a fixed pt
    • Arbitrary Mixing: spectral radius can be ≥ 1 even for some fixed weightings
Classifier Replacement

• How can we define the optimal population formally?
  – What do we want to reach?
  – How does the global fitness of one classifier differ from its local fitness (i.e. accuracy)?
  – What is the quality of a population?
• Methodology
  – Examine methods to define the optimal population
  – Relate to generalised hill-climbing
  – Map to other search criteria
Future Work

• Formalising and analysing classifier replacement
• Interaction of replacement with function approximation
• Interaction of replacement with DP
• Handling stochastic environments
• Error bounds and rates of convergence
• …

Big Answer: LCS is an integrative framework,
… but we need the framework definition!