Towards a Theoretical Framework for LCS

DESCRIPTION
Alwyn Barry introduces the theoretical framework for LCS that Jan Drugowitsch is currently working on.

TRANSCRIPT
Towards a Theoretical Framework for LCS
Jan Drugowitsch
Dr Alwyn Barry

May 2006 © A.M. Barry and J. Drugowitsch, 2006
Overview

Big Question: Why use LCS?
1. Aims and Approach
2. Function Approximation
3. Dynamic Programming
4. Classifier Replacement
5. Future Work
Research Aims

• To establish a formal model for Accuracy-based LCS
• To use established models, methods and notation
• To transfer method knowledge (analysis / improvement)
• To create links to other ML approaches and disciplines
Approach

• Reduce LCS to components:
  – Function Approximation
  – Dynamic Programming
  – Classifier Replacement
• Model the components
• Model their interactions
• Verify models by empirical studies
• Extend models from related domains
Function Approximation

• Architecture Outline:
  – Set of K classifiers, matching states S_k ⊆ S
  – Every classifier k approximates the value function V over S_k independently
• Assumptions:
  – Function V : S → ℜ is time-invariant
  – Time-invariant set of classifiers {S_1, …, S_K}
  – Fixed sampling distribution over S given by π : S → [0, 1]
  – Minimising MSE:

    ∑_{i ∈ S_k} π(i) (V(i) − Ṽ_k(i))²
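A minimal numerical sketch of this π-weighted MSE, with purely hypothetical toy values (4 states, one classifier matching two of them):

```python
import numpy as np

# Toy setup: 4 states, classifier k matches states {0, 1} (its S_k).
# pi is the fixed sampling distribution over S, V the target values,
# V_tilde_k the classifier's independent approximation over S_k.
pi = np.array([0.4, 0.3, 0.2, 0.1])
V = np.array([1.0, 2.0, 3.0, 4.0])
V_tilde_k = np.array([1.1, 1.9, 0.0, 0.0])   # only meaningful on S_k
S_k = np.array([True, True, False, False])    # matching indicator for S_k

# MSE restricted to matched states, weighted by pi:
#   sum_{i in S_k} pi(i) * (V(i) - V~_k(i))^2
mse_k = np.sum(pi[S_k] * (V[S_k] - V_tilde_k[S_k]) ** 2)
print(mse_k)  # 0.4*0.01 + 0.3*0.01 = 0.007
```

Note that states outside S_k contribute nothing: each classifier is assessed only over its own matched region.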
Function Approximation

• Linear approximation architecture
  – Feature vector φ : S → ℜ^L
  – Approximation parameters vector ω_k ∈ ℜ^L
  – Approximation: Ṽ_k(i) = φ(i)′ ω_k
• Averaging classifiers: φ(i) = (1), ∀i ∈ S
• Linear estimators: φ(i) = (1, i)′, ∀i ∈ S
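A sketch of fitting the "linear estimator" case, φ(i) = (1, i)′, by π-weighted least squares over the matched states (all numbers are hypothetical; the slides do not prescribe this particular solver):

```python
import numpy as np

# Fit classifier k's parameters omega_k so that V~_k(i) = phi(i)' omega_k
# minimises the pi-weighted squared error over its matched states S_k.
states = np.array([0.0, 1.0, 2.0, 3.0])
V = 2.0 * states + 1.0                      # toy target value function
pi = np.array([0.25, 0.25, 0.25, 0.25])
S_k = np.array([True, True, True, False])   # classifier matches first 3 states

Phi = np.column_stack([np.ones_like(states), states])  # rows phi(i)'
w = np.sqrt(pi[S_k])                        # weighted least squares
omega_k, *_ = np.linalg.lstsq(Phi[S_k] * w[:, None], V[S_k] * w, rcond=None)

V_tilde_k = Phi @ omega_k                   # V~_k(i) = phi(i)' omega_k
print(omega_k)  # recovers (1, 2) since V is exactly linear here
```

An averaging classifier is the special case Φ = (1, …, 1)′, whose least-squares fit is simply the weighted mean of V over S_k.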
Function Approximation

• Approximation to MSE for k at time t:

    ε_{k,t} = (1 / c_{k,t}) ∑_{m=0}^{t} I_{S_k}(i_m) (V(i_m) − φ(i_m)′ ω_{k,t})²

  – Matching: I_{S_k} : S → {0, 1}
  – Match count: c_{k,t} = ∑_{m=0}^{t} I_{S_k}(i_m)
• Since the sequence of states i_m is dependent on π, the MSE is automatically weighted by this distribution.
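The time-indexed quantities above can be sketched as a simple accumulation over an observed state sequence (toy values throughout; the matched-only sum is what makes the estimate implicitly π-weighted):

```python
# eps_{k,t} = (1/c_{k,t}) * sum_{m=0..t} I_Sk(i_m) * (V(i_m) - V~_k(i_m))^2
def sample_mse(states_seen, in_Sk, V, V_tilde_k):
    c_kt = 0            # match count c_{k,t}
    err = 0.0           # accumulated matched squared error
    for i in states_seen:
        if in_Sk(i):    # I_Sk(i_m) = 1 only for matched states
            c_kt += 1
            err += (V[i] - V_tilde_k[i]) ** 2
    return err / c_kt if c_kt > 0 else 0.0

# Hypothetical example: 3 states, classifier matches {0, 1} only.
V = {0: 1.0, 1: 2.0, 2: 3.0}
V_tilde_k = {0: 1.2, 1: 1.8, 2: 0.0}
seq = [0, 2, 1, 0, 2]   # observed i_0 .. i_4, drawn according to pi
eps = sample_mse(seq, lambda i: i in (0, 1), V, V_tilde_k)
print(eps)  # (0.04 + 0.04 + 0.04) / 3 = 0.04
```

Because the i_m arrive according to π, frequently sampled states appear more often in the sum, which is exactly the automatic π-weighting the slide notes.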
Function Approximation

• Mixing classifiers
  – A classifier k covers S_k ⊆ S
  – Need to mix estimates to recover Ṽ(i):

    Ṽ(i) = ∑_{k=1}^{K} ψ_k(i) φ(i)′ ω_k

    where ψ_k : S → [0, 1] and ∑_{k=1}^{K} ψ_k(i) = 1
  – Requires mixing parameter (will be ‘accuracy’ … see later!)
  – ψ_k(i) = 0 where i ∉ S_k … classifiers that do not match do not contribute
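A toy sketch of this mixing, using two overlapping averaging classifiers and, for simplicity, equal mixing weights wherever both match (an assumption here, not the accuracy-based scheme introduced later):

```python
import numpy as np

# Global estimate V~(i) = sum_k psi_k(i) * phi(i)' omega_k, with psi_k(i) = 0
# off S_k and sum_k psi_k(i) = 1 wherever some classifier matches.
N = 4
phi = lambda i: np.array([1.0])            # averaging classifiers, phi(i) = (1)
omega = [np.array([2.0]), np.array([5.0])] # local parameters, one per classifier
S = [np.array([1, 1, 1, 0]),               # I_S1: matches states 0..2
     np.array([0, 0, 1, 1])]               # I_S2: matches states 2..3

# Equal weight where both match (state 2), full weight where only one matches.
psi = np.array([s / np.maximum(np.sum(S, axis=0), 1) for s in S], dtype=float)

V_tilde = sum(psi[k] * np.array([phi(i) @ omega[k] for i in range(N)])
              for k in range(2))
print(V_tilde)  # [2.0, 2.0, 3.5, 5.0]
```

On the overlap (state 2) the mixed estimate sits between the two local estimates; elsewhere only the matching classifier contributes.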
Function Approximation

• Matrix notation:

    V = (V(1), …, V(N))′
    Φ = (φ(1), …, φ(N))′   (rows φ(i)′)
    D = diag(π(1), …, π(N))
    I_{S_k} = diag(I_{S_k}(1), …, I_{S_k}(N))
    D_k = D I_{S_k}
    Ψ_k = diag(ψ_k(1), …, ψ_k(N))
    Ṽ_k = Φ ω_k
    Ṽ = ∑_{k=1}^{K} Ψ_k Ṽ_k = ∑_{k=1}^{K} Ψ_k Φ ω_k
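The matrix form can be built explicitly for a tiny toy problem (all values hypothetical): D, I_{S_k} and Ψ_k are diagonal, Φ stacks the feature rows, and the mixed estimate is ∑_k Ψ_k Φ ω_k.

```python
import numpy as np

# Tiny instance of the matrix notation: N = 3 states, L = 2 features.
N, L = 3, 2
pi = np.array([0.5, 0.3, 0.2])
D = np.diag(pi)                            # sampling distribution, diagonal
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])               # rows phi(1)' .. phi(N)'
I_S1 = np.diag([1.0, 1.0, 0.0])            # classifier 1 matches states 1, 2
Psi_1 = np.diag([1.0, 1.0, 0.0])           # sole matcher there -> weight 1
omega_1 = np.array([0.5, 1.0])

V_tilde_1 = Phi @ omega_1                  # classifier estimate V~_1 = Phi omega_1
V_tilde = Psi_1 @ V_tilde_1                # mixed estimate (single classifier here)
print(V_tilde)  # [0.5, 1.5, 0.0]
```

With more classifiers the mixed estimate is just the sum of the corresponding Ψ_k Φ ω_k terms.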
Function Approximation

• Achievements
  – Proof that the MSE of a classifier k at time t is the normalised sample variance of the value function over the states matched until time t
  – Proof that the expression developed for sample-based approximated MSE is a suitable approximation for the actual MSE, for finite state spaces
  – Proof of optimality of approximated MSE from the vector-space viewpoint (relates classifier error to the Principle of Orthogonality)
Function Approximation

• Achievements (cont’d)
  – Elaboration of algorithms using the model:
    • Steepest Gradient Descent
    • Least Mean Square
      – Show sensitivity to step size and input range hold in LCS
    • Normalised LMS
    • Kalman Filter
      – … and its inverse co-variance form
      – Equivalence to Recursive Least Squares
      – Derivation of computationally cheap implementation
  – For each, derivation of:
    • Stability criteria
      – For LMS, identification of the excess mean squared estimation error
    • Time constant bounds
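The gradient-descent algorithms listed above have well-known textbook forms; a sketch of the standard LMS and normalised-LMS weight updates (the slide's LCS-specific derivations may differ in detail, and the step sizes here are arbitrary toy choices):

```python
import numpy as np

# Textbook per-sample updates for a classifier's parameter vector omega:
#   LMS:   omega += eta * phi * (V - phi' omega)
#   NLMS:  omega += eta * phi * (V - phi' omega) / ||phi||^2
def lms_step(omega, phi, target, eta=0.1):
    return omega + eta * phi * (target - phi @ omega)

def nlms_step(omega, phi, target, eta=0.5):
    return omega + eta * phi * (target - phi @ omega) / (phi @ phi)

# Fit V(i) = 1 + 2i with a linear estimator, phi(i) = (1, i)'.
omega = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(2000):
    i = rng.uniform(0.0, 1.0)
    phi = np.array([1.0, i])
    omega = nlms_step(omega, phi, 1.0 + 2.0 * i)
print(omega)  # approaches (1, 2)
```

Normalising by ‖φ‖² is what removes the plain LMS sensitivity to the input range that the slide mentions.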
Function Approximation

• Achievements (cont’d)
  – Investigation:
Function Approximation

• How to set the mixing weights ψ_k(i)?
  – Accuracy-based mixing (YCS-like!):

    ψ_k(i) = I_{S_k}(i) ε_k^{−ν} / ∑_{p=1}^{K} I_{S_p}(i) ε_p^{−ν}

  – Divorce mixing for fn. approx. from fitness
  – Investigating power factor ν ∈ ℜ⁺
  – Demonstrate, using the model, that we obtain the MLE of the mixed classifiers when ν = 1
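A direct sketch of these accuracy-based weights for a single state, using hypothetical per-classifier errors ε_k:

```python
import numpy as np

# psi_k(i) = I_Sk(i) * eps_k^{-nu} / sum_p I_Sp(i) * eps_p^{-nu}
def mixing_weights(eps, match, nu=1.0):
    """eps: per-classifier errors; match: I_Sk(i) for one fixed state i."""
    raw = match * eps ** (-nu)
    return raw / raw.sum()

eps = np.array([0.1, 0.4])                 # classifier 1 is 4x more accurate
match = np.array([1.0, 1.0])               # both classifiers match state i

print(mixing_weights(eps, match, nu=1.0))  # [0.8, 0.2]
print(mixing_weights(eps, match, nu=3.0))  # sharper: low-error side dominates
```

Raising ν sharpens the weighting toward the lowest-error classifier, which is exactly the trade-off explored in the empirical results that follow.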
Function Approximation

• Mixing – Empirical Results
  – Higher ν, more importance to low error
  – Found ν = 1 best compromise

  [Two plots: classifiers covering 0..π/2, π/2..π, 0..π and classifiers covering 0..0.5, 0.5..1, 0..1]
Dynamic Programming

• Dynamic Programming:

    V*(i) = max_μ E[ ∑_{t=0}^{∞} γ^t r(i_t, a_t, i_{t+1}) | i_0 = i, a_t = μ(i_t) ]

• Bellman’s Equation:

    V*(i) = max_{a∈A} E[ r(i, a, j) + γ V*(j) | j ∼ p(j | i, a) ]

• Applying the DP operator T to the value estimate:

    (TV)(i) = max_{a∈A} E[ r(i, a, j) + γ V(j) | j ∼ p(j | i, a) ]
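The DP operator T can be applied repeatedly as tabular value iteration; a sketch on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards are toy values):

```python
import numpy as np

# (TV)(i) = max_a sum_j p(j|i,a) * (r(i,a,j) + gamma * V(j))
gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # p(j|i,a), indexed [i, a, j]
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[[0.0, 0.0], [0.0, 1.0]],   # r(i, a, j)
              [[0.0, 0.0], [0.0, 2.0]]])

def T(V):
    # Expected immediate reward plus discounted expected next value, per (i, a).
    Q = np.einsum('iaj,iaj->ia', P, R) + gamma * np.einsum('iaj,j->ia', P, V)
    return Q.max(axis=1)

V = np.zeros(2)
for _ in range(200):                       # contraction -> unique fixed point V*
    V = T(V)
print(V)  # V*(1) solves V*(1) = 2 + 0.9 V*(1) -> 20, and V*(0) = 1 + 0.9*20 = 19
```

Because T is a max-norm contraction with modulus γ, the iterates converge geometrically to the unique fixed point V* = TV*.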
Dynamic Programming

• T is a contraction w.r.t. maximum norm
  – Converges to unique fixed point V* = TV*
• With function approximation:
  – Approximation after every update:

    Ṽ_{t+1} = f(TṼ_t)

  – Might not converge for more complex approximation architectures
• Is LCS function approximation a non-expansion w.r.t. maximum norm?
Dynamic Programming

• LCS Value Iteration
  – Assume:
    • constant population
    • averaging classifiers, φ(i) = (1), ∀i ∈ S
  – Classifier is approximating: V_{t+1} = TṼ_t
  – Based on minimising MSE:

    ∑_{i∈S_k} (V_{t+1}(i) − Ṽ_{k,t+1}(i))² = ‖V_{t+1} − Ṽ_{k,t+1}‖²_{I_{S_k}} = ‖TṼ_t − Ṽ_{k,t+1}‖²_{I_{S_k}}
Dynamic Programming

• LCS Value Iteration (cont’d)
  – Minimum is orthogonal projection on the approximation:

    Ṽ_{k,t+1} = Π_{I_{S_k}} V_{t+1} = Π_{I_{S_k}} TṼ_t

  – Need also to determine new mixing weight by evaluating the new approximation error:

    ε_{k,t+1} = (1 / Tr(I_{S_k})) ‖V_{t+1} − Ṽ_{k,t+1}‖²_{I_{S_k}} = (1 / Tr(I_{S_k})) ‖TṼ_t − Π_{I_{S_k}} TṼ_t‖²_{I_{S_k}}

  – ∴ the only time dependency is on Ṽ_t
  – Thus:

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_{k,t+1} Π_{I_{S_k}} TṼ_t
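One step of this projected iteration can be sketched numerically. With averaging classifiers, Π_{I_{S_k}} reduces to the mean of TṼ over the matched states; the toy MDP, disjoint classifiers, and fixed mixing weights below are assumptions for illustration only:

```python
import numpy as np

gamma = 0.9
def T(V):                                  # toy deterministic chain 0 -> 1 -> 1
    r = np.array([1.0, 2.0])
    nxt = np.array([1, 1])
    return r + gamma * V[nxt]

S = [np.array([True, False]), np.array([False, True])]  # disjoint classifiers
psi = np.array([[1.0, 0.0], [0.0, 1.0]])                # fixed mixing weights

V_tilde = np.zeros(2)
for _ in range(300):
    TV = T(V_tilde)
    # V~_{k,t+1} = Pi_{I_Sk} T V~_t : for averaging classifiers, the mean over S_k
    V_k = np.array([np.where(S[k], TV[S[k]].mean(), 0.0) for k in range(2)])
    V_tilde = (psi * V_k).sum(axis=0)      # V~_{t+1} = sum_k Psi_k V~_{k,t+1}
print(V_tilde)  # converges to the fixed point (19, 20) of this toy chain
```

With disjoint single-state classifiers the projection is the identity, so the sketch reproduces exact value iteration, consistent with the convergence result stated later for disjoint classifiers.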
Dynamic Programming

• LCS Asynchronous Value Iteration
  – Only update V_k for the matched state
  – Minimise:

    ∑_{m=0}^{t} I_{S_k}(i_m) (TṼ(i_m) − φ(i_m)′ ω_{k,t})²

  – Thus, approximation costs weighted by state distribution:

    ∑_{i∈S_k} π(i) (TṼ(i) − φ(i)′ ω_k)² = ‖TṼ − Φ ω_k‖²_{D_k}

    ∴ Ṽ_{k,t+1} = Π_{D_k} TṼ_t

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_{k,t+1} Π_{D_k} TṼ_t
Dynamic Programming

• Also derived:
  – LCS Q-Learning
    • LMS Implementation
    • Kalman Filter Implementation
    • Demonstrate that the weight update uses the normalised form of local gradient descent
  – Model-based Policy Iteration:

    Ṽ_{t+1} = ∑_{k=1}^{K} Ψ_k Π_{D_k} T_μ Ṽ_t

  – Step-wise Policy Iteration
  – TD(λ) is not possible in LCS due to re-evaluation of mixing weights
Dynamic Programming

• Is LCS function approximation a non-expansion w.r.t. maximum norm?
  – We derive that:
    • Constant Mixing: Yes
    • Arbitrary Mixing: Not necessarily
    • Accuracy-based Mixing: non-expansion (dependent upon a conjecture)
• Is LCS function approximation a non-expansion w.r.t. weighted norm?
  – We know T_μ is a contraction
  – We show Π_{D_k} is a non-expansion
  – We demonstrate:
    • Single Classifier: convergence to a fixed pt
    • Disjoint Classifiers: convergence to a fixed pt
    • Arbitrary Mixing: spectral radius can be ≥ 1 even for some fixed weightings
Classifier Replacement

• How can we define the optimal population formally?
  – What do we want to reach?
  – How does the global fitness of one classifier differ from its local fitness (i.e. accuracy)?
  – What is the quality of a population?
• Methodology
  – Examine methods to define the optimal population
  – Relate to generalised hill-climbing
  – Map to other search criteria
Future Work

• Formalising and analysing classifier replacement
• Interaction of replacement with function approximation
• Interaction of replacement with DP
• Handling stochastic environments
• Error bounds and rates of convergence
• …

Big Answer: LCS is an integrative framework,
… but we need the framework definition!