BioIntelligence Lab. 1
Learning to Trade via Direct Reinforcement
John Moody and Matthew Saffell, IEEE Trans Neural Networks 12(4), pp. 875-889, 2001
Summarized by Jangmin O
Author
J. Moody
- Director of the Computational Finance Program and Professor of CSEE at Oregon Graduate Institute of Science and Technology
- Founder & President of Nonlinear Prediction Systems
- Program Co-Chair of Computational Finance 2000
- A past General Chair and Program Chair of NIPS
- A member of the editorial board of Quantitative Finance
I. Introduction
Optimizing Investment Performance
Characteristic
- Path-dependent
Methods: Direct Reinforcement learning (DR)
- Recurrent Reinforcement Learning [1, 2]
- No need for a forecasting model
- Single security or asset allocation
Recurrent Reinforcement Learning (RRL)
- Adaptive policy search
- Learns the investment strategy on-line
- No need to learn a value function
- Immediate rewards are available in financial markets
Difference between RRL and Q or TD
Financial decision making problems: well suited to RRL
- Immediate feedback available
Performance criteria: risk-adjusted investment returns
- Sharpe ratio
- Downside risk minimization
- Used in differential form
Experimental Data
U.S. dollar/British Pound foreign exchange market
S&P 500 Stock Index and Treasury Bills
RRL vs. Q: Bellman's curse of dimensionality
II. Trading Systems and Performance Criteria
Structure of Trading Systems (1)
An agent: assumptions
- Trades a fixed position size in a single market
- Trader's position at time t: F_t ∈ {+1, 0, -1} (Long: buy; Neutral: stand aside; Short: sell short)
- Profit R_t is realized at the end of the interval (t-1, t]: the gain or loss from holding position F_{t-1}, plus the transaction cost of switching from F_{t-1} to F_t
- A recurrent structure is required in order to make decisions that account for transaction costs, market impact, taxes, etc.
Structure of Trading Systems (2)
A single asset trading system
- θ_t: system parameters at time t
- I_t: information set at time t
- z_t: price series; y_t: other external variable series

F_t = F(θ_t; F_{t-1}, I_t)
I_t = {z_t, z_{t-1}, z_{t-2}, ...; y_t, y_{t-1}, y_{t-2}, ...}

Simple example:

F_t = sign(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + ... + v_m r_{t-m} + w)
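This decision rule can be sketched directly in Python; the function name and the parameter values in the usage line are illustrative, not fitted:

```python
import numpy as np

def decide_position(u, v, w, prev_F, returns):
    """One step of the simple recurrent trader:
    F_t = sign(u*F_{t-1} + v_0*r_t + ... + v_m*r_{t-m} + w).
    `returns` holds the recent returns [r_t, r_{t-1}, ..., r_{t-m}]."""
    s = u * prev_F + float(np.dot(v, returns)) + w
    return int(np.sign(s))

# Illustrative parameters: a trader that weighs its previous position
# and the two most recent returns.
F = decide_position(u=0.5, v=[1.0, 0.5], w=0.0, prev_F=1, returns=[0.02, -0.01])
```

Note that the position F_{t-1} feeds back into the next decision, which is exactly the recurrence motivating the RRL training algorithm below.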
Profit and Wealth for Trading Systems (1)
Performance function U() for a risk-insensitive trader = profit
Additive profits
- Trading a fixed number μ of shares or contracts of the security
- r_t = z_t - z_{t-1}: return of the risky asset; r_t^f: return of the risk-free asset (e.g., T-bills)
- δ: transaction cost rate
- Trader's wealth: W_T = W_0 + P_T

P_T = sum_{t=1}^T R_t,   R_t = μ {r_t^f + F_{t-1}(r_t - r_t^f) - δ|F_t - F_{t-1}|}
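A minimal sketch of this additive accounting (the helper name and example inputs are mine, not the paper's):

```python
def additive_profit(F, r, rf, mu=1.0, delta=0.0):
    """Cumulative additive profit P_T = sum_t R_t with
    R_t = mu*(rf_t + F_{t-1}*(r_t - rf_t) - delta*|F_t - F_{t-1}|).
    F[t], r[t], rf[t] refer to period t; F[0] is the initial position."""
    P = 0.0
    for t in range(1, len(F)):
        P += mu * (rf[t] + F[t - 1] * (r[t] - rf[t])
                   - delta * abs(F[t] - F[t - 1]))
    return P
```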
Profit and Wealth for Trading Systems (2)
Multiplicative profits
- A fixed fraction μ > 0 of accumulated wealth is invested; r_t = (z_t / z_{t-1}) - 1
- In case of no short sales, when μ = 1:

W_T = W_0 prod_{t=1}^T {1 + R_t}
{1 + R_t} = {1 + (1 - F_{t-1}) r_t^f + F_{t-1} r_t} {1 - δ|F_t - F_{t-1}|}
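The multiplicative case can be sketched the same way (again, the helper name and inputs are illustrative):

```python
def multiplicative_wealth(F, r, rf, W0=1.0, delta=0.0):
    """W_T = W0 * prod_t (1 + R_t), where
    1 + R_t = (1 + (1 - F_{t-1})*rf_t + F_{t-1}*r_t)
              * (1 - delta*|F_t - F_{t-1}|)."""
    W = W0
    for t in range(1, len(F)):
        gross = 1.0 + (1 - F[t - 1]) * rf[t] + F[t - 1] * r[t]
        cost = 1.0 - delta * abs(F[t] - F[t - 1])
        W *= gross * cost
    return W
```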
Performance Criteria
U_T in the general form U(R_T, ..., R_t, ..., R_2, R_1; W_0)
- Simple form U(W_T): standard economic utility
- Path-dependent performance functions: Sharpe ratio, etc.
Moody's interest: the marginal increase of U_t caused by R_t at each time step
Differential performance criterion:

D_t ≡ U_t - U_{t-1}
Differential Sharpe Ratio (1)
Sharpe ratio: risk-adjusted return

S_T = Average(R_t) / Standard Deviation(R_t)

Differential Sharpe ratio
- For on-line learning, the influence of R_t at time t must be computable; exponential moving averages are used
- First-order Taylor expansion in the adaptation rate η:

S_t ≈ S_{t-1} + η dS_t/dη |_{η=0} + O(η²)

If η = 0 then S_t = S_{t-1}.
Differential Sharpe Ratio (2)
Exponential moving averages with adaptation rate η:

A_t = A_{t-1} + η(R_t - A_{t-1}) = η R_t + (1 - η) A_{t-1}
B_t = B_{t-1} + η(R_t² - B_{t-1}) = η R_t² + (1 - η) B_{t-1}

Sharpe ratio:

S_t = A_t / (K_η (B_t - A_t²)^{1/2}),   K_η = ((1 - η/2) / (1 - η))^{1/2}

From the Taylor expansion:

D_t ≡ dS_t/dη |_{η=0} = (B_{t-1} ΔA_t - (1/2) A_{t-1} ΔB_t) / (B_{t-1} - A_{t-1}²)^{3/2}

with ΔA_t = R_t - A_{t-1} and ΔB_t = R_t² - B_{t-1}.
- R_t > A_{t-1}: increased reward
- R_t² > B_{t-1}: increased risk
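The (D_t, A_t, B_t) update is cheap to compute on-line; a sketch (η and the initial moment values used in testing are assumptions, not the paper's settings):

```python
def dsr_step(R, A_prev, B_prev, eta=0.01):
    """One on-line update of the differential Sharpe ratio.
    Returns (D_t, A_t, B_t) with
      dA = R_t - A_{t-1},  dB = R_t**2 - B_{t-1}
      D_t = (B_{t-1}*dA - 0.5*A_{t-1}*dB) / (B_{t-1} - A_{t-1}**2)**1.5
      A_t = A_{t-1} + eta*dA,  B_t = B_{t-1} + eta*dB."""
    dA = R - A_prev
    dB = R * R - B_prev
    D = (B_prev * dA - 0.5 * A_prev * dB) / (B_prev - A_prev ** 2) ** 1.5
    return D, A_prev + eta * dA, B_prev + eta * dB
```

Only the two scalars A_{t-1} and B_{t-1} must be carried between steps, which is what makes recursive, on-line use possible.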
Differential Sharpe Ratio (3)
Derivative with respect to R_t:

dD_t/dR_t = (B_{t-1} - A_{t-1} R_t) / (B_{t-1} - A_{t-1}²)^{3/2}

D_t is maximal at R_t = B_{t-1}/A_{t-1}
Meaning of the differential Sharpe ratio
- Makes on-line learning possible: easily computed from A_{t-1} and B_{t-1}
- Recursive updating is possible
- Recent returns receive a strong weighting
- Interpretability: the contribution of each R_t becomes visible
III. Learning to Trade
Reinforcement Framework
RL
- Maximizing the expected reward
- Trial-and-error exploration of the environment
Comparison with supervised learning [1, 2]
- Problematic with transaction costs
- Structural credit assignment vs. temporal credit assignment
Types of RL
- DR: policy search
- Q-learning: value function
- Actor-critic methods
Recurrent Reinforcement Learning (1)
Goal
- For a trading system F_t(θ), find the parameters θ that maximize U_T
Example trading system:

F_t = F(θ_t; F_{t-1}, I_t)
I_t = {z_t, z_{t-1}, z_{t-2}, ...; y_t, y_{t-1}, y_{t-2}, ...}

Trading return:

P_T = sum_t R_t,   R_t = μ {F_{t-1} r_t - δ|F_t - F_{t-1}|}

Derivative after a sequence of T periods:

dU_T(θ)/dθ = sum_{t=1}^T dU_T/dR_t { dR_t/dF_t · dF_t/dθ + dR_t/dF_{t-1} · dF_{t-1}/dθ }
Recurrent Reinforcement Learning (2)
Training techniques
- Back-propagation through time (BPTT)
- Temporal dependencies:

dF_t/dθ = ∂F_t/∂θ + ∂F_t/∂F_{t-1} · dF_{t-1}/dθ

- Stochastic (on-line) version: keep only the terms that depend on R_t:

dU_t(θ_t)/dθ_t ≈ dU_t/dR_t { dR_t/dF_t · dF_t/dθ_t + dR_t/dF_{t-1} · dF_{t-1}/dθ_{t-1} }

With the differential performance criterion D_t:

dD_t(θ_t)/dθ_t ≈ dD_t/dR_t { dR_t/dF_t · dF_t/dθ_t + dR_t/dF_{t-1} · dF_{t-1}/dθ_{t-1} }
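Putting the pieces together, one on-line RRL step can be sketched for a single tanh trader (tanh standing in for sign(·) so that F_t is differentiable during training; the feature layout, learning rates, and initial moment estimates here are assumptions of the sketch, not the paper's exact settings):

```python
import numpy as np

def rrl_online_step(theta, x, F_prev, dF_prev, r, A, B,
                    delta=0.001, rho=0.01, eta=0.01):
    """One on-line RRL step for a tanh trader F_t = tanh(theta . x_t),
    where the last element of x_t is the recurrent input F_{t-1}.
    A, B are the moving moment estimates of the differential Sharpe
    ratio. Returns (theta, F_t, dF_t/dtheta, A, B)."""
    F = np.tanh(float(np.dot(theta, x)))
    # total derivative dF_t/dtheta, keeping the one-step recurrent term (BPTT)
    dF = (1.0 - F ** 2) * (x + theta[-1] * dF_prev)
    # trading return R_t = F_{t-1} r_t - delta |F_t - F_{t-1}| and its partials
    R = F_prev * r - delta * abs(F - F_prev)
    sgn = np.sign(F - F_prev)
    dR_dF, dR_dFprev = -delta * sgn, r + delta * sgn
    # differential Sharpe ratio sensitivity dD_t/dR_t
    dD_dR = (B - A * R) / (B - A ** 2) ** 1.5
    # gradient ascent on D_t
    theta = theta + rho * dD_dR * (dR_dF * dF + dR_dFprev * dF_prev)
    # update moment estimates
    A, B = A + eta * (R - A), B + eta * (R * R - B)
    return theta, F, dF, A, B
```

Carrying dF_{t-1}/dθ forward between calls is what distinguishes this recurrent update from a purely memoryless gradient step.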
Recurrent Reinforcement Learning (3)
Reminder
- Moody focuses on optimizing D_t, an immediate measure of each action's effect
- [1, 2]: portfolio optimization, etc.
Value Function (1)
Implicitly learning correct actions through value iteration
Value function
- Discounted future rewards received from state x by following policy π:

V^π(x) = sum_a π(x, a) sum_y p_xy(a) {D(x, y, a) + γ V^π(y)}

- π(x, a): probability of taking action a in state x
- p_xy(a): probability of the transition x → y under action a
- D(x, y, a): immediate reward for the transition x → y under action a
- γ: discount factor between immediate and future rewards
Value Function (2)
Optimal value function and Bellman's optimality equation:

V*(x) = max_π V^π(x)
V*(x) = max_a sum_y p_xy(a) {D(x, y, a) + γ V*(y)}

Value iteration update (converges to the optimal solution):

V_{t+1}(x) = max_a sum_y p_xy(a) {D(x, y, a) + γ V_t(y)}

Optimal policy:

a*(x) = argmax_a sum_y p_xy(a) {D(x, y, a) + γ V*(y)}
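The value iteration update above can be sketched for tabular p and D (the array layout p[a, x, y] and the fixed iteration count are assumptions of the sketch):

```python
import numpy as np

def value_iteration(p, D, gamma=0.9, iters=200):
    """Tabular value iteration.
    p[a, x, y]: transition probability x -> y under action a.
    D[a, x, y]: immediate reward for that transition."""
    V = np.zeros(p.shape[1])
    for _ in range(iters):
        # Q[a, x] = sum_y p_xy(a) * (D(x, y, a) + gamma * V(y))
        Q = np.einsum('axy,axy->ax', p, D + gamma * V[None, None, :])
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)  # V*(x) and the greedy policy a*(x)
```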
Q-Learning
Q-function: the future reward of taking the current action in the current state:

Q*(x, a) = sum_y p_xy(a) {D(x, y, a) + γ max_b Q*(y, b)}

Value iteration update (converges to the optimal Q-function):

Q_{t+1}(x, a) = sum_y p_xy(a) {D(x, y, a) + γ max_b Q_t(y, b)}

Choosing the best action (no need to know p_xy(a)):

a*(x) = argmax_a Q*(x, a)

Error function for a function approximator (e.g., a neural network):

(1/2) (D(x, y, a) + γ max_b Q(y, b) - Q(x, a))²
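A minimal tabular sketch of the update implied by this error function (the dict-based Q table is an assumption; alpha is the gradient step size):

```python
def q_update(Q, x, a, y, reward, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: a gradient step of size alpha on
    0.5*(D(x,y,a) + gamma*max_b Q(y,b) - Q(x,a))**2.
    Q is a dict mapping (state, action) -> value."""
    target = reward + gamma * max(v for (s, b), v in Q.items() if s == y)
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q
```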
IV. Empirical Results
1. Artificial price series
2. U.S. Dollar/British Pound Exchange rate
3. Monthly S&P 500 stock index
A trading system based on DR
Artificial price series
Data: autoregressive trend process, 10,000 samples
Questions examined:
- Is RRL a suitable tool for learning trading strategies?
- How does the number of trades behave as transaction costs increase?

p(t) = p(t-1) + β(t-1) + κ ε(t)
β(t) = α β(t-1) + ν(t)
z(t) = exp(p(t) / R)
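This process can be generated as follows; α, κ, and the unit-variance noises are illustrative choices, not necessarily the paper's exact settings (R is taken as the range of p, used as a normalization):

```python
import numpy as np

def artificial_prices(n=10000, alpha=0.9, kappa=3.0, seed=0):
    """Autoregressive trend process:
      p(t) = p(t-1) + beta(t-1) + kappa*eps(t)
      beta(t) = alpha*beta(t-1) + nu(t)
      z(t) = exp(p(t)/R),  R = max(p) - min(p)
    with eps, nu i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    eps, nu = rng.standard_normal(n), rng.standard_normal(n)
    p = np.zeros(n)
    beta = 0.0
    for t in range(1, n):
        p[t] = p[t - 1] + beta + kappa * eps[t]   # uses beta(t-1)
        beta = alpha * beta + nu[t]
    return np.exp(p / (p.max() - p.min()))
```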
Figure: 10,000 samples
- {long, short} positions only
- Performance degrades over the first ~2,000 periods
Figure: zoom of the region from sample 9,000 onward
δ = 0.01
Figure: number of trades, cumulative profit, and Sharpe ratio
- Results over 100 runs: 100 epochs of training plus on-line adaptation
- Transaction costs: 0.2%, 0.5%, 1%
U.S. Dollar/British Pound Foreign Exchange Trading
{long, neutral, short} trading system
- 30-minute U.S. Dollar/British Pound foreign exchange (FX) rate data
- The market trades 24 hours a day, 5 days a week; data from January-August 1996
Strategy
- Train on 2,000 data points, trade on the next 480 (two weeks), slide the window, and retrain
Results
- 15% annualized return with an annualized Sharpe ratio of 2.3
- One trade every 5 hours on average
Not considered
- Peaked (bursty) trading; market illiquidity
S&P 500/T-Bill Asset Allocation (1)
Introduction
- Long position: invested in the S&P 500, earning no T-Bill interest
- Short position: earns twice the T-Bill rate
- Dividends are reinvested
Figure: T-Bill yields and S&P 500 dividends
S&P 500/T-Bill Asset Allocation (2)
Simulation
- Data (1950-1994): initial training (through 1969) + test (1970 onward)
- Training window: 10 years of training + 10 years of validation
- Input features: 84 (financial + macroeconomic) series
RRL-trader
- A single tanh unit, with weight decay
Q-trader
- Uses bootstrap samples
- Two-layer feedforward network (30 tanh units)
- Bias/variance tradeoff: selected among models with 10, 20, 30, 40 units
Voting methods
- RRL: 30 runs; Q: 10 runs
- Transaction cost 0.5%; profits reinvested; multiplicative profit ratio
- Buy and Hold: 1348%; Q-Trader: 3359%; RRL-Trader: 5860%
Underlying premise: over the 25 years 1970-1994, the U.S. stock/Treasury markets were predictable.
Figure annotations: oil shock, monetary tightening, market correction, market crash, Gulf War
Statistically significant
Sensitivity Analysis

S_i ≡ |dF/dx_i| / max_j |dF/dx_j|

Figure: inflation expectations
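The normalized sensitivity can be computed from any gradient estimate of the trading function (a short sketch; the function name is mine):

```python
import numpy as np

def sensitivities(grad):
    """S_i = |dF/dx_i| / max_j |dF/dx_j| for one gradient of F
    with respect to its inputs."""
    g = np.abs(np.asarray(grad, dtype=float))
    return g / g.max()
```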
V. Learn the Policy or Learn the Value?
Immediate vs. Future Rewards
Reinforcement signal
- Immediate (RRL) or delayed (Q-learning, dynamic programming, or TD)
RRL
- The policy is represented directly; learning a value function is bypassed
Q
- The policy is represented indirectly
Policies vs. Values
Some limitations of the value function approach
- The original formulation of Q-learning assumes discrete action and state spaces
- Curse of dimensionality
- Policies derived from Q-learning tend to be brittle: small changes in the value function may lead to large changes in the policy
- Large-scale noise and non-stationarity may cause severe problems
RRL's advantages
- The policy is represented directly: a simpler functional form is sufficient
- Can produce real-valued actions
- More robust in noisy environments; quick adaptation to non-stationarity
An Example
Simple trading system: {buy, sell} a single asset
- Assumption: r_{t+1} is known in advance
- No need for future rewards: γ = 0
- The policy function is trivial: a_t = r_{t+1}
- 1 tanh unit is sufficient
- A value function must be able to represent XOR: 2 tanh units are needed
Conclusion
- How to train trading systems via DR: the RRL algorithm
- Differential Sharpe ratio and differential downside deviation ratio
- RRL is more efficient than Q-learning in the financial domain