
Market Making: Comparison between Reinforcement Learning and Analytical Benchmarks

José E. Figueroa-López

Washington University in St. Louis

[email protected]

Stochastic Control in Finance, NUS, Singapore
July 22, 2019

Joint work with Agostino Capponi and Chuyi Yu


What is Market Making?

- Market makers (MMs) continually quote bid and ask prices at which they are willing to buy and sell on both sides of a limit order book (LOB).

- If both orders get executed (lifted), the MM earns her quoted spread, which is at least as wide as the market spread.

- There is a natural trade-off when placing orders deeper in the book: the profit, if executed, is larger, but the chances of execution are reduced.

- In addition, there is considerable inventory risk: the MM may accumulate a large inventory that must be cleared aggressively at the end of the day at suboptimal prices, or even, worse, at market prices.


Very Selective Literature on Market Making

Avellaneda and Stoikov (2008):

- Maximization of the exponential utility of the terminal trading profit/loss $W_T$ and residual inventory liquidation $I_T S_T$: $\mathbb{E}\big[-e^{-\gamma(W_T + I_T S_T)}\big]$.

- Optimize bid/ask LO placements $S_t \pm \delta^{\pm}_t$ of one share unit under a Brownian midprice dynamics $S_t = \sigma B_t$ and Poisson MO arrival times, with lifting rates $\pi^S(\delta^-)$ and $\pi^B(\delta^+)$.

Cartea and Jaimungal (2015):

- Optimize terminal wealth $W_T$ and residual inventory liquidation while penalizing terminal and running inventory:
  $$\sup_{(\delta^{\pm}_s)_{t \le s \le T}} \mathbb{E}\Big[ W_T + I_T S_T - \lambda I_T^2 - \phi \int_t^T I_s^2\, ds \;\Big|\; W_{t^-} = w,\ I_{t^-} = i,\ S_{t^-} = s \Big].$$
  Still optimize bid/ask LO placements $S_t \pm \delta^{\pm}_t$ of one unit, which is lifted with probability $e^{-\kappa^{\pm}\delta^{\pm}_t}$ when a Poisson MO arrives.

- Brownian midprice dynamics between MOs, but with i.i.d. jumps at MO arrival times (price impact of MOs).


Very Selective Literature on Market Making

Adrian, T., Capponi, A., Vogt, E., & Zhang, H. (2018):

- Optimize $\sup_{(\delta^{\pm}_s)_{t \le s \le T}} \mathbb{E}_t\big[ W_T + I_T S_T - \lambda I_T^2 \big]$, but allow the placement of orders of arbitrary (but fixed) volume $\vartheta$.

- Use exogenously specified linear demand and supply functions to determine the number of lifted shares at the arrival of MOs.

- Brownian midprice dynamics and Poissonian arrival times of MOs.

Chan and Shelton (2001):

- Apply reinforcement learning techniques in a simulated market environment; the midprice is a difference of Poisson processes: $S_t = (N^+_t - N^-_t) \times \text{spread}$.

- Focus on an information-based model, with Poissonian arrivals of informed and uninformed traders.

Spooner, T., Fearnley, J., Savani, R., & Koukorinis, A. (2018):

- Train the reinforcement learning market-making agent in a realistic, data-driven limit order book; actions are taken at every LOB event.


Our Objectives

- Assess the performance of reinforcement learning techniques under modeled market dynamics that admit analytical optimal policies.

- Specifically, we devise a discrete-time version of the model in Adrian, Capponi, Vogt, & Zhang (2018) with actions taking place either at the arrival times of MOs or at prespecified discrete times $t_j$.

- Apply reinforcement learning techniques to deal with nonlinear demand functions, and compare the RL policies with approximate analytical benchmarks.


A Simple Market Making Model I

- Finite time horizon $[0, T]$.

- The fundamental price is denoted by $\{S_t\}_{t \ge 0}$.

- Market orders (MOs), either buy or sell, arrive in the market at times $0 < \tau_1 < \tau_2 < \dots$. Let
  $$\alpha_j = \begin{cases} 1, & \text{if the MO at } \tau_j \text{ is to sell},\\ -1, & \text{if the MO at } \tau_j \text{ is to buy}.\end{cases}$$

- The counting processes of sell and buy MOs are denoted by $N^S_t$ and $N^B_t$, respectively:
  $$N^S_t = \#\{j : \tau_j \le t,\ \alpha_j = 1\}, \qquad N^B_t = \#\{j : \tau_j \le t,\ \alpha_j = -1\}, \qquad N_t = N^S_t + N^B_t.$$

- We consider two cases for the times $t_j$ of action or control (see the simulation sketch below):

  Case 1: actions at the arrivals of MOs: $t_0 = 0$, $t_j = \tau_j \wedge T$.

  Case 2: actions at $n$ fixed regular times: $t_j = \frac{T}{n}\, j$.
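As an illustration (not from the slides), here is a minimal Python sketch of how the merged MO flow and the two action-time grids can be simulated; the rates λ^S = λ^B = 10 and horizon T = 10 seconds are the parameter values quoted later for the numerical experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mo_arrivals(T=10.0, lam_sell=10.0, lam_buy=10.0, rng=rng):
    """Merge two independent Poisson flows of sell (+1) and buy (-1) MOs on [0, T]."""
    times, signs = [], []
    for lam, sign in [(lam_sell, +1), (lam_buy, -1)]:
        t = 0.0
        while True:
            t += rng.exponential(1.0 / lam)   # exponential inter-arrival times
            if t > T:
                break
            times.append(t)
            signs.append(sign)
    order = np.argsort(times)
    return np.array(times)[order], np.array(signs)[order]

T = 10.0
tau, alpha = simulate_mo_arrivals(T)

# Case 1: act at t_0 = 0 and at each MO arrival time (capped at T).
t_case1 = np.concatenate(([0.0], np.minimum(tau, T)))

# Case 2: act at n fixed regular times t_j = (T/n) * j.
n = 100
t_case2 = (T / n) * np.arange(n + 1)

print(len(tau), "MOs; first arrival times:", np.round(tau[:5], 3))
```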


A Simple Market Making Model II

The following gives the details of the first setting (the second case is similar).

- At times $t_j < T$ ($j = 0, 1, \dots, N_T$), the market maker places simultaneous bid and ask limit orders (LOs) of constant volume $\vartheta$ at levels
  $$b_{t_j} = S_{t_j} - \delta^-_j, \qquad a_{t_j} = S_{t_j} + \delta^+_j,$$
  where $0 \le \delta^-_j, \delta^+_j \in \mathcal{F}_{t_j}$, the information available at $t_j$.

- The LOs placed at time $t_j$ stay in the market until the next MO arrival or until the terminal time $T$. We set
  $$a_t = a_{t_j}, \qquad b_t = b_{t_j}, \qquad t \in [t_j, t_{j+1}).$$
  (Remark: note that $a_{t_j^-} = a_{t_{j-1}}$ and $b_{t_j^-} = b_{t_{j-1}}$, for $j = 1, 2, \dots$.)


A Simple Market Making Model III

- After placing LOs at time $t_j$, there are three possibilities:

  (i) If $t_{j+1} = T$ (i.e., $t_j$ is the arrival time of the last MO before termination), the market maker liquidates her inventory $I_T$ at the midprice for a cash flow of $S_T I_T$;

  (ii) If $t_{j+1} < T$ and $\alpha_{j+1} = 1$ (i.e., $t_{j+1}$ is the time of a sell MO), then $Q^S(S_{t_{j+1}}, b_{t_j})$ shares are filled at the bid price $b_{t_j}$;

  (iii) If $t_{j+1} < T$ and $\alpha_{j+1} = -1$ (i.e., $t_{j+1}$ is the time of a buy MO), then $Q^B(S_{t_{j+1}}, a_{t_j})$ shares are filled at the ask price $a_{t_j}$.

- $Q^S(s', b)$ and $Q^B(s', a)$ are given deterministic supply/demand functions.

- Remark: in cases (ii)-(iii), the unexecuted LOs are "immediately" canceled and new LO placements are made right after, at levels $a_{t_{j+1}}$ and $b_{t_{j+1}}$.


Optimal Market Making Problem

- The agent aims to maximize her expected final wealth at time $T$:
  $$V^*_0 = \sup_{\{\delta^+_j, \delta^-_j\}} \mathbb{E}\big[ W_T + S_T I_T \,\big|\, S_0 = s,\ I_0 = i,\ W_0 = w \big],$$
  where, for $t \in (0, T]$,
  $$W_t = W_0 + \int_0^t a_{u^-}\, Q^B(S_u, a_{u^-})\, dN^B_u - \int_0^t b_{u^-}\, Q^S(S_u, b_{u^-})\, dN^S_u,$$
  $$I_t = I_0 - \int_0^t Q^B(S_u, a_{u^-})\, dN^B_u + \int_0^t Q^S(S_u, b_{u^-})\, dN^S_u.$$


Linear Supply/Demand Functions

In terms of a given fixed constant $c$, the linear supply/demand functions ([Adrian et al., 2018], [Hendershott and Menkveld, 2014]) are given by
$$Q^S(s', b) = \vartheta + c(b - s'), \qquad Q^B(s', a) = \vartheta + c(s' - a).$$

- At the arrival of a sell MO at time $t_{j+1}$, all of the agent's volume $\vartheta$ is filled if $S_{t_{j+1}} = b_{t_j}$; otherwise, the filled amount decays linearly as $S_{t_{j+1}}$ moves farther away (above) from $b_{t_j}$.

- Similarly, at the arrival of a buy MO at time $t_{j+1}$, the agent sells all of her volume $\vartheta$ if $S_{t_{j+1}} = a_{t_j}$; otherwise, the filled amount decays linearly as $S_{t_{j+1}}$ moves farther away (below) from $a_{t_j}$.

In what follows, it will be useful to write $Q^S$ and $Q^B$ in terms of $p := \vartheta / c$ and $c$ (instead of $\vartheta$ and $c$):
$$Q^S(s', b) = c(p + b - s'), \qquad Q^B(s', a) = c(p + s' - a).$$


Optimal Placement Under Martingale Asset Prices

Theorem (J. F-L, A. Capponi, & C. Yu ’19)

Assume the following conditions:

Linear demand functions:
$$Q^S(s', b) = c(p + b - s'), \qquad Q^B(s', a) = c(p + s' - a).$$

Martingale condition:
$$\mathbb{E}\big[ \mathbf{1}_{\{t_{j+1} < T,\ \alpha_{j+1} = \pm 1\}} \big( S_{t_{j+1}} - S_{t_j} \big) \,\big|\, \mathcal{F}_{t_j} \big] = 0.$$

Then, the optimal placement policy and the achieved maximal expected wealth are given by
$$a^*_j = S_{t_j} + \tfrac{1}{2}p, \qquad b^*_j = S_{t_j} - \tfrac{1}{2}p, \qquad j = 0, 1, \dots, N_T,$$
$$V^*_0 = W_0 + \frac{c p^2}{4}\, \mathbb{E}[N_T] - c\, \mathbb{E}\Bigg[ \sum_{i=0}^{N_T - 1} \big( S_{t_{i+1}} - S_{t_i} \big)^2 \Bigg].$$
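As a numerical sanity check (not part of the slides), the following Python sketch simulates Case 1 (actions at MO arrivals) under a Bachelier midprice and Poisson MOs with the parameter values quoted later (λ^S = λ^B = 10, c = 30, p = 0.017, σ = 0.0375, T = 10, W_0 = I_0 = 0), quotes at the optimal half-spread p/2, and compares the simulated average of W_T + S_T I_T with the theorem's formula, with both expectations estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
T, lam, c, p, sigma, S0 = 10.0, 10.0, 30.0, 0.017, 0.0375, 100.0   # lam per side

def episode(delta=0.5):
    """One Case-1 episode with linear demands and quotes S_{t_j} ± delta * p."""
    n = rng.poisson(2 * lam * T)                        # number of MOs on [0, T]
    tau = np.sort(rng.uniform(0.0, T, size=n))          # arrival times given N_T = n
    alpha = rng.choice([1, -1], size=n)                 # +1 sell MO, -1 buy MO
    t = np.concatenate(([0.0], tau))                    # action times t_0 = 0, t_j = tau_j
    dS = sigma * np.sqrt(np.diff(t)) * rng.standard_normal(n)
    S = S0 + np.concatenate(([0.0], np.cumsum(dS)))     # Bachelier midprice at the action times
    W, I, sum_dS2 = 0.0, 0.0, 0.0
    for j in range(n):
        a, b = S[j] + delta * p, S[j] - delta * p       # quotes standing over (t_j, t_{j+1}]
        sum_dS2 += (S[j + 1] - S[j]) ** 2
        if alpha[j] == 1:                               # sell MO hits our bid: we buy q shares at b
            q = c * (p + b - S[j + 1]); W -= b * q; I += q
        else:                                           # buy MO lifts our ask: we sell q shares at a
            q = c * (p + S[j + 1] - a); W += a * q; I -= q
    S_T = S[-1] + sigma * np.sqrt(T - t[-1]) * rng.standard_normal()
    return W + S_T * I, n, sum_dS2

sims = np.array([episode() for _ in range(5000)])
simulated = sims[:, 0].mean()
formula = c * p**2 / 4 * sims[:, 1].mean() - c * sims[:, 2].mean()
print(f"simulated E[W_T + S_T I_T] = {simulated:.4f}   theorem formula = {formula:.4f}")
```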


Nonlinear Demand Functions

- Even though the linear supply/demand functions
  $$Q^S(S_{t_{j+1}}, b_{t_j}) = c(p + b_{t_j} - S_{t_{j+1}}), \qquad Q^B(S_{t_{j+1}}, a_{t_j}) = c(p + S_{t_{j+1}} - a_{t_j})$$
  allow us to find explicit optimal policies, they are not realistic: $Q^S$ and $Q^B$ take negative values when $S_{t_{j+1}} - p > b_{t_j}$ and $S_{t_{j+1}} + p < a_{t_j}$, respectively.

- It is then natural to constrain the demand functions to be nonnegative:
  $$\bar{Q}^S(S_{t_{j+1}}, b_{t_j}) = \max\{0,\ c(p + b_{t_j} - S_{t_{j+1}})\}, \qquad \bar{Q}^B(S_{t_{j+1}}, a_{t_j}) = \max\{0,\ c(p + S_{t_{j+1}} - a_{t_j})\}.$$

- Similarly, if we think of $\vartheta = cp$ as the initial volume of the placed orders, it is not natural to have demand/supply larger than $cp$. Hence, we also consider the capped versions (see the code sketch below)
  $$\bar{\bar{Q}}^S(S_{t_{j+1}}, b_{t_j}) = \min\{\max\{0,\ c(p + b_{t_j} - S_{t_{j+1}})\},\ cp\}, \qquad \bar{\bar{Q}}^B(S_{t_{j+1}}, a_{t_j}) = \min\{\max\{0,\ c(p + S_{t_{j+1}} - a_{t_j})\},\ cp\}.$$
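For concreteness, the three families of demand functions can be written as short Python helpers (an illustrative sketch; the function names are ours, not the slides').

```python
def Q_linear(side, s_next, quote, c=30.0, p=0.017):
    """Linear demand: side='S' for a sell MO hitting the bid, 'B' for a buy MO lifting the ask."""
    gap = (quote - s_next) if side == "S" else (s_next - quote)
    return c * (p + gap)

def Q_lower_bounded(side, s_next, quote, c=30.0, p=0.017):
    """Lower-bounded demand: never negative."""
    return max(0.0, Q_linear(side, s_next, quote, c, p))

def Q_capped(side, s_next, quote, c=30.0, p=0.017):
    """Capped demand: bounded between 0 and the posted volume c*p."""
    return min(Q_lower_bounded(side, s_next, quote, c, p), c * p)

# Example: a sell MO arrives and the midprice has moved above our bid by more than p,
# so the linear demand goes negative while the bounded versions are clipped at 0.
print(Q_linear("S", s_next=100.02, quote=100.0),
      Q_lower_bounded("S", s_next=100.02, quote=100.0),
      Q_capped("S", s_next=100.02, quote=100.0))
```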


Alternative Optimal Placement Policies

- There is no closed-form solution for the optimal placement policy under the previous modified demand functions.

- Two ways to deal with this are to impose simplifying constraints on the policies $a_{t_j} = S_{t_j} + \delta^+_j p$ and $b_{t_j} = S_{t_j} - \delta^-_j p$, such as

  - Time invariance: $\delta^+_j \equiv \delta^+$ and $\delta^-_j \equiv \delta^-$, for all $j$.

  - Myopic (one-step-ahead) optimality: maximize the immediate expected reward
    $$\max_{\delta^{\pm}_j \in \mathcal{F}_{t_j}} \mathbb{E}\big[ R_{t_{j+1}} \,\big|\, \mathcal{F}_{t_j} \big], \qquad R_{t_{j+1}} = (W_{t_{j+1}} - W_{t_j}) + (S_{t_{j+1}} I_{t_{j+1}} - S_{t_j} I_{t_j}).$$

  While we expect the second strategy to yield better policies, it is also more computationally expensive, so in practice the first strategy may suffice.


Bachelier/Poisson Model: Time-Invariant Policies for $\bar{Q}(\cdot)$

- When imposing the restriction $\bar{Q}^{S,B} \ge 0$, the immediate reward at time $t_{j+1} < T$ (ignoring $I_{t_j}(S_{t_{j+1}} - S_{t_j})$) takes the form
  $$R_{t_{j+1}} = \mathbf{1}_{\{\alpha_{j+1}(S_{t_{j+1}} - S_{t_j}) \le (1 - \delta^{\alpha_{j+1}})p\}} \Big[ cp^2 \delta^{\alpha_{j+1}}\big(1 - \delta^{\alpha_{j+1}}\big) - c\big(S_{t_{j+1}} - S_{t_j}\big)^2 + \alpha_{j+1}\, cp\big(1 - 2\delta^{\alpha_{j+1}}\big)\big(S_{t_{j+1}} - S_{t_j}\big) \Big],$$
  depending on whether $\alpha_{j+1} = 1$ (sell) or $\alpha_{j+1} = -1$ (buy).

- It can be shown that the optimal placement is symmetric: $\delta := \delta^+ = \delta^-$.

- Assuming that $\{N_t\}$ and $\{S_t\}$ are independent and conditioning on $\{N_t\}$, we get the expected reward
  $$\bar{V}^*_0 = \mathbb{E}\Bigg[ \sum_{j=0}^{N_T - 1} \bigg\{ cp^2 \delta(1-\delta)\, \Phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) - c\sigma^2 (t_{j+1} - t_j)\, \Phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) + cp\sigma\delta \sqrt{t_{j+1} - t_j}\, \phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) \bigg\} \Bigg].$$
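Below is a minimal sketch (ours, not from the slides) of how this expectation can be estimated by Monte Carlo over the Poisson arrival times for a grid of time-invariant δ, under Case 1 and the parameters quoted with the table two slides ahead.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
T, lam_total, c, p, sigma = 10.0, 20.0, 30.0, 0.017, 0.0375   # lam_total = lam_S + lam_B

def expected_reward(delta, n_paths=2000):
    """MC estimate of the expected total reward for a time-invariant delta, lower-bounded demand."""
    vals = np.empty(n_paths)
    for k in range(n_paths):
        n = rng.poisson(lam_total * T)
        t = np.concatenate(([0.0], np.sort(rng.uniform(0.0, T, size=n))))  # t_0 = 0, t_j = tau_j
        dt = np.diff(t)
        z = (1.0 - delta) * p / (sigma * np.sqrt(dt))
        vals[k] = np.sum(c * p**2 * delta * (1 - delta) * norm.cdf(z)
                         - c * sigma**2 * dt * norm.cdf(z)
                         + c * p * sigma * delta * np.sqrt(dt) * norm.pdf(z))
    return vals.mean()

for delta in np.arange(0.1, 1.001, 0.05):
    print(f"delta = {delta:.2f}, estimated expected reward = {expected_reward(delta):.4f}")
```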


Bachelier/Poisson Model: Time-Invariant Policies for $\bar{\bar{Q}}(\cdot)$

- Recall that $\bar{\bar{Q}}^{S,B}$ imposes both conditions $0 \le \bar{\bar{Q}}^{S,B} \le cp$.

- The immediate reward at time $t_{j+1} < T$ (ignoring $I_{t_j}(S_{t_{j+1}} - S_{t_j})$) takes the form
  $$R_{t_{j+1}} = \mathbf{1}_{\{-\delta^{\pm}p \le \pm(S_{t_{j+1}} - S_{t_j}) \le (1 - \delta^{\pm})p\}} \Big[ cp^2 \delta^{\pm}\big(1 - \delta^{\pm}\big) - c\big(S_{t_{j+1}} - S_{t_j}\big)^2 \pm cp\big(1 - 2\delta^{\pm}\big)\big(S_{t_{j+1}} - S_{t_j}\big) \Big] + \mathbf{1}_{\{\pm(S_{t_{j+1}} - S_{t_j}) \le -\delta^{\pm}p\}}\, cp\big[ \pm(S_{t_{j+1}} - S_{t_j}) + \delta^{\pm} p \big],$$
  depending on whether $\alpha_{j+1} = 1$ (sell) or $\alpha_{j+1} = -1$ (buy).

- Again, the optimal policy is symmetric (i.e., $\delta := \delta^+ = \delta^-$), so we can focus on such policies.


Bachelier/Poisson Model: Time-Invariant Policies for $\bar{\bar{Q}}(\cdot)$ (cont.)

- By conditioning on $\{N_t\}_{t \ge 0}$ (assuming independence between $N$ and $S$), the expected reward now takes the form
  $$\bar{\bar{V}}^*_0 = \bar{V}^*_0 + \mathbb{E}\Bigg[ \sum_{j=0}^{N_T - 1} \bigg\{ cp^2 \delta^2\, \Phi\Big( \tfrac{-\delta p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) + c\sigma^2 (t_{j+1} - t_j)\, \Phi\Big( \tfrac{-\delta p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) - cp\sigma\delta \sqrt{t_{j+1} - t_j}\, \phi\Big( \tfrac{\delta p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) \bigg\} \Bigg],$$
  where $\bar{V}^*_0$ is the expected total reward from the previous (lower-bounded) case.


Bachelier/Poisson Model: Time-Invariant Policies by Monte Carlo

Under Poisson arrival rates, we compute $\bar{V}^*_0$ and $\bar{\bar{V}}^*_0$ by Monte Carlo (parameters in the footnote below):

  Demand functions                                             Expected total reward (V*)   Optimal placement (δ*)
  Linear: $Q^{S,B}$                                            0.016                        0.50
  Lower-bounded: $\bar{Q}^{S,B} = \max\{0,\ Q^{S,B}\}$         0.163                        0.65
  Capped: $\bar{\bar{Q}}^{S,B} = \min\{\bar{Q}^{S,B},\ cp\}$   0.196                        0.60

Footnote: parameters as in [Adrian et al., 2018]: $T = 10$ (seconds), $\lambda^S = \lambda^B = 10$, $c = 30$, $p = 0.017$, $\sigma = 0.0375$, $S_0 = \$100$; the initial inventory and cash holdings are both 0.


Bachelier Model: Myopic Policies for $\bar{Q}^{S,B}(\cdot)$ and $\bar{\bar{Q}}^{S,B}(\cdot)$

- Goal: maximize the immediate expected reward
  $$\max_{\delta^{\pm}_j \in \mathcal{F}_{t_j}} \mathbb{E}\big[ R_{t_{j+1}} \,\big|\, \mathcal{F}_{t_j} \big].$$

- This is equivalent to maximizing
  $$\mathbb{E}\Bigg[ \mathbf{1}_{\{t_{j+1} < T\}} \bigg\{ cp^2 \delta(1-\delta)\, \Phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) - c\sigma^2 (t_{j+1} - t_j)\, \Phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) + cp\sigma\delta \sqrt{t_{j+1} - t_j}\, \phi\Big( \tfrac{(1-\delta)p}{\sigma\sqrt{t_{j+1} - t_j}} \Big) \bigg\} \,\Bigg|\, \mathcal{F}_{t_j} \Bigg].$$

- For most values of $t$ ($t < 0.9\,T$), $\delta^* \approx 0.65$. As $t$ gets very close to $T$, $\delta^*$ moves quickly to $0.5$. A similar phenomenon is observed when using the capped demand functions $\bar{\bar{Q}}^{S,B}$: for $t < 0.9\,T$, $\delta^* \approx 0.6$, and as $t$ gets very close to $T$, $\delta^*$ moves quickly to $0.5$.


Reinforcement Learning (State and Action)

- The agent in reinforcement learning (RL) learns to make optimal decisions by interacting with the environment.

- No knowledge of the market environment, such as the order flow or the price process, is assumed.

- Trajectory: $S^e_0, A_0, R_1, S^e_1, A_1, R_2, S^e_2, A_2, R_3, \dots$

- In most cases, a Markov Decision Process is assumed:
  $$p(s', r \mid s, a) = \mathbb{P}\big[ S^e_{t+1} = s',\ R_{t+1} = r \,\big|\, S^e_t = s,\ A_t = a \big].$$


Policy and Action-Value Functions

- Policy $\pi(a \mid s)$: the probability of selecting action $a$ in state $s$.

- The goal of the RL agent is to choose the policy $\pi$ that maximizes the expected discounted cumulative reward:
  $$v_*(s) := \max_{\pi} v_{\pi}(s) := \max_{\pi} \mathbb{E}_{\pi}\Bigg[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Bigg|\, S^e_t = s \Bigg].$$

- Equivalently, we can first find the q-function
  $$q_*(s, a) := \max_{\pi} \mathbb{E}_{\pi}\Bigg[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Bigg|\, S^e_t = s,\ A_t = a \Bigg],$$
  and then obtain the optimal value and policy as
  $$v_*(s) = \sup_a q_*(s, a), \qquad \pi_*(s) = \arg\max_a q_*(s, a).$$


Bellman Optimality Equation

- For the value function,
  $$v_*(s) = \max_a \mathbb{E}\big[ R_{t+1} + \gamma\, v_*(S^e_{t+1}) \,\big|\, S^e_t = s,\ A_t = a \big].$$

- For the q-function,
  $$q_*(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a'} q_*(S^e_{t+1}, a') \,\Big|\, S^e_t = s,\ A_t = a \Big].$$

- Reinforcement learning algorithms build on the Bellman equations above and a combination of Monte Carlo and fixed-point steps.


Monte Carlo Step

Suppose we want to estimate $\mu := \mathbb{E}(X)$ based on independent replicas $X_1, X_2, \dots$. A useful iterative method starts with $\mu_0 = 0$ and sets
$$\mu_1 = \mu_0 + \alpha[X_1 - \mu_0] = \alpha X_1,$$
$$\mu_2 = \mu_1 + \alpha[X_2 - \mu_1] = \alpha X_2 + \alpha(1-\alpha) X_1, \quad \dots$$
$$\mu_k = \mu_{k-1} + \alpha[X_k - \mu_{k-1}] = \alpha \sum_{i=0}^{k-1} (1-\alpha)^i X_{k-i}.$$

First- and second-order properties of $\mu_k$:
$$\mathbb{E}[\mu_k] = \alpha \sum_{i=0}^{k-1} (1-\alpha)^i \mu = \big(1 - (1-\alpha)^k\big)\mu \xrightarrow{k\to\infty} \mu,$$
$$\mathrm{Var}(\mu_k) = \alpha^2 \sum_{i=0}^{k-1} (1-\alpha)^{2i} \sigma^2_X = \frac{\alpha\big(1 - (1-\alpha)^{2k}\big)}{2-\alpha}\, \sigma^2_X \xrightarrow{k\to\infty} \frac{\alpha}{2-\alpha}\, \sigma^2_X.$$


Monte Carlo Step (Step-Dependent Rate)

To ensure consistency, we need $\alpha = \alpha_i \searrow 0$:
$$\mu_1 = \mu_0 + \alpha_1[X_1 - \mu_0] = \alpha_1 X_1,$$
$$\mu_2 = \mu_1 + \alpha_2[X_2 - \mu_1] = \alpha_2 X_2 + \alpha_1(1-\alpha_2) X_1,$$
$$\mu_3 = \mu_2 + \alpha_3[X_3 - \mu_2] = \alpha_3 X_3 + \alpha_2(1-\alpha_3) X_2 + \alpha_1(1-\alpha_2)(1-\alpha_3) X_1,$$
$$\mu_k = \mu_{k-1} + \alpha_k[X_k - \mu_{k-1}] = \sum_{i=1}^{k} \alpha_i \prod_{j=i+1}^{k} (1-\alpha_j)\, X_i.$$

We need that
$$\mathbb{E}[\mu_k] = \sum_{i=1}^{k} \alpha_i \prod_{j=i+1}^{k} (1-\alpha_j)\, \mu \xrightarrow{k\to\infty} \mu, \qquad \mathrm{Var}(\mu_k) = \sum_{i=1}^{k} \alpha_i^2 \prod_{j=i+1}^{k} (1-\alpha_j)^2\, \sigma^2_X \xrightarrow{k\to\infty} 0.$$

Fact: $\alpha_i = \frac{1}{i}$ suffices.
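A minimal sketch (ours) illustrating both recursions on simulated data: with a constant step size α the estimator retains a residual variance of about ασ²_X/(2 − α), while α_k = 1/k reproduces the ordinary sample mean and is therefore consistent.

```python
import numpy as np

rng = np.random.default_rng(3)

def incremental_mean(xs, step):
    """Run mu_k = mu_{k-1} + step(k) * (x_k - mu_{k-1}) starting from mu_0 = 0."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += step(k) * (x - mu)
    return mu

mu_true, sigma_X, n = 2.0, 1.0, 100_000
xs = rng.normal(mu_true, sigma_X, size=n)

alpha = 0.1
est_const = incremental_mean(xs, step=lambda k: alpha)      # constant step: residual variance floor
est_decay = incremental_mean(xs, step=lambda k: 1.0 / k)    # alpha_k = 1/k: equals the running sample mean

print(f"constant alpha estimate: {est_const:.3f} (residual std ~ {np.sqrt(alpha / (2 - alpha)) * sigma_X:.3f})")
print(f"alpha_k = 1/k estimate:  {est_decay:.4f} vs sample mean {xs.mean():.4f}")
```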


Fixed Point Step

In this step, we evaluate $q_*$ using a type of fixed-point algorithm:
$$q_{*,0}(s, a) = \text{an initial q-function},$$
$$q_{*,1}(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a'} q_{*,0}(S^e_{t+1}, a') \,\Big|\, S^e_t = s,\ A_t = a \Big],$$
$$q_{*,2}(s, a) = \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a'} q_{*,1}(S^e_{t+1}, a') \,\Big|\, S^e_t = s,\ A_t = a \Big],$$
$$\vdots$$


Reinforcement Q-Learning Algorithm

Q-learning Algorithm

Initialize Q(s, a) for all states s and actions a.
Repeat (for each episode):
    Set the initial state S^e_0.
    Repeat for each step t = 0, 1, ... of the episode:
        Choose action A_t from S^e_t based on Q by the ε-greedy method:
            A_t = argmax_a Q(S^e_t, a)        with probability 1 − ε,
            A_t = any other action a, each    with probability ε / (k − 1)   (k = number of available actions).
        Take action A_t and observe R_{t+1}, S^e_{t+1}.
        Q(S^e_t, A_t) ← Q(S^e_t, A_t) + α [ R_{t+1} + γ max_{a'} Q(S^e_{t+1}, a') − Q(S^e_t, A_t) ].
        S^e_t ← S^e_{t+1}.
    until S^e_t is terminal.
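Below is a minimal, generic Python version of this loop (an illustrative sketch, not the authors' code). The env object with gym-style reset()/step() methods and hashable (e.g. discretized) states is an assumption; for simplicity, exploration here picks uniformly over all actions with probability ε, a common variant of the ε-greedy rule above.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=1.0, eps=0.1, n_actions=10, seed=0):
    """Tabular Q-learning; `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), with hashable states."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update with the max over next-state actions
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```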


Action-Value Function Approximation by Tile Coding

- When the state space is continuous, we apply the tile coding method to approximate the action-value function $Q(s, a)$.

- Tile coding is a partition of the continuous space, but with more than one layer (tiling).

- It maps a low-dimensional continuous state $s$ to a high-dimensional binary feature vector $x(s) = [x_1(s), \dots, x_{\ell}(s)]$. Example: tiling the unit square with 4 tilings of $4 \times 4$ tiles each gives
  $$[0, 0] \to [1, 1, 1, 1, 0, 0, \dots, 0]_{64}, \qquad [0, 0.1] \to [1, 0, 1, 1, 1, 0, 0, \dots, 0]_{64}.$$
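A minimal 2D tile-coder sketch (illustrative; the function name, indexing convention, and offset scheme are ours): each of the n_tilings grids is shifted by a fraction of a tile width, and the encoder returns the global indices of the active tiles, e.g. 4 indices out of 64 for the example above.

```python
import numpy as np

def tile_indices(state, lows, highs, n_tilings=4, tiles_per_dim=4):
    """Return the flat indices of the active tiles (one per tiling) for a continuous 2D state."""
    state = np.asarray(state, dtype=float)
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    scaled = (state - lows) / (highs - lows)                 # rescale state to [0, 1]^2
    tile_width = 1.0 / tiles_per_dim
    active = []
    for k in range(n_tilings):
        offset = (k / n_tilings) * tile_width                # each tiling shifted by a fraction of a tile
        coords = np.floor((scaled + offset) / tile_width).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        flat = coords[0] * tiles_per_dim + coords[1]         # tile index within this tiling
        active.append(k * tiles_per_dim**2 + flat)           # shift by tiling number -> global index
    return active

print(tile_indices([0.0, 0.0], lows=[0, 0], highs=[1, 1]))   # e.g. [0, 16, 32, 48]
print(tile_indices([0.0, 0.1], lows=[0, 0], highs=[1, 1]))   # one active tile changes
```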


Action-Value Function Approximation by Tile Coding

- The action-value function can then be approximated by assigning a weight to each tile:
  $$\hat{q}(s, a, w) = \sum_i w_i\, x_i(s, a).$$

- Close points share tiles and hence have close values of $\hat{q}$.

- Instead of updating the action-value function for each state and action, the tile coding method only needs to update the weights of the active tiles.

- Apply stochastic gradient descent (SGD) to update the weight parameter $w$ during training; i.e., use
  $$w_{t+1} \leftarrow w_t + \alpha \big[ R_{t+1} + \gamma \max_{a'} \hat{q}(S^e_{t+1}, a', w_t) - \hat{q}(S^e_t, A_t, w_t) \big] \nabla_w\, \hat{q}(S^e_t, A_t, w_t)$$
  instead of
  $$Q(S^e_t, A_t) \leftarrow Q(S^e_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a'} Q(S^e_{t+1}, a') - Q(S^e_t, A_t) \big].$$
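Since q̂ is linear in the binary features, ∇_w q̂(S^e_t, A_t, w) is just the indicator of the active features, so the SGD update only touches the weights of the active tiles. A minimal sketch (ours), keeping one weight vector per action and reusing indices like those produced by the tile coder above:

```python
import numpy as np

class TileCodedQ:
    """Linear action-value approximation: q(s, a, w) = sum of the weights of the active tiles for action a."""
    def __init__(self, n_features, n_actions):
        self.w = np.zeros((n_actions, n_features))

    def value(self, active, a):
        return self.w[a, active].sum()

    def update(self, active, a, target, alpha):
        # gradient of a linear function of binary features = the features themselves,
        # so SGD adds the scaled TD error to the active tiles only
        td_error = target - self.value(active, a)
        self.w[a, active] += alpha * td_error

# One Q-learning-style update, with hypothetical active-tile indices for S_t and S_{t+1}.
q = TileCodedQ(n_features=64, n_actions=10)
active_t, active_t1 = [0, 16, 32, 48], [0, 16, 32, 49]
a_t, r, gamma, alpha = 3, 0.02, 1.0, 0.1
target = r + gamma * max(q.value(active_t1, a) for a in range(10))
q.update(active_t, a_t, target, alpha)
print(q.w[a_t, active_t])
```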


Reinforcement Learning Implementation of Market Making

- State space: $S^e_t = (S_t, I_t, t)$.

- Action space: 10 available actions for the RL agent, placing bid and ask LOs simultaneously at prices
  $$a_t = S_t + \delta p \ \text{(ask LO)}, \qquad b_t = S_t - \delta p \ \text{(bid LO)},$$
  where $\delta$ is chosen from $\{0.1, 0.2, \dots, 1\}$.

- Design of the immediate reward ($\lambda = 0$):
  $$R_j = (W_{t_{j+1}} - W_{t_j}) + (S_{t_{j+1}} I_{t_{j+1}} - S_{t_j} I_{t_j}), \qquad j = 0, 1, \dots,$$
  and discount rate $\gamma = 1$, so that
  $$R = \sum_{j=0}^{\infty} \big\{ (W_{t_{j+1}} - W_{t_j}) + (S_{t_{j+1}} I_{t_{j+1}} - S_{t_j} I_{t_j}) \big\} = W_T + S_T I_T.$$
  The objective maximized in RL is therefore the same as in the previous control problem.
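A gym-style sketch of this simulated environment (an illustrative reconstruction under stated assumptions, not the authors' code): Case 1 action times, Bachelier/Poisson dynamics with the parameters quoted earlier, and the capped demand functions. Its continuous state (S_t, I_t, t) would be discretized, or fed through the tile coder above, before plugging it into the tabular q_learning sketch.

```python
import numpy as np

class MarketMakingEnv:
    """Case 1 market-making environment: the agent re-quotes S_t ± delta*p at t = 0 and at each MO arrival."""
    DELTAS = np.arange(0.1, 1.01, 0.1)                         # the 10 available actions

    def __init__(self, T=10.0, lam=10.0, c=30.0, p=0.017, sigma=0.0375, S0=100.0, seed=0):
        self.T, self.lam, self.c, self.p, self.sigma, self.S0 = T, lam, c, p, sigma, S0
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.S, self.I, self.W = 0.0, self.S0, 0.0, 0.0
        return (self.S, self.I, self.t)

    def step(self, action):
        delta = self.DELTAS[action]
        a_quote, b_quote = self.S + delta * self.p, self.S - delta * self.p
        dt = self.rng.exponential(1.0 / (2 * self.lam))        # time to the next MO of either side
        old_value = self.W + self.S * self.I                   # mark-to-market value before the step
        if self.t + dt >= self.T:                              # no further MO: liquidate at midprice at T
            self.S += self.sigma * np.sqrt(self.T - self.t) * self.rng.standard_normal()
            self.t = self.T
            return (self.S, self.I, self.t), (self.W + self.S * self.I) - old_value, True
        self.t += dt
        self.S += self.sigma * np.sqrt(dt) * self.rng.standard_normal()
        if self.rng.random() < 0.5:                            # sell MO hits our bid (capped demand)
            q = np.clip(self.c * (self.p + b_quote - self.S), 0.0, self.c * self.p)
            self.W -= b_quote * q; self.I += q
        else:                                                  # buy MO lifts our ask (capped demand)
            q = np.clip(self.c * (self.p + self.S - a_quote), 0.0, self.c * self.p)
            self.W += a_quote * q; self.I -= q
        return (self.S, self.I, self.t), (self.W + self.S * self.I) - old_value, False
```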


RL Policy for the Linear Demand $Q^{S,B}(\cdot)$

- $Q^S(S_{t_{j+1}}, b_{t_j}) = c(p + b_{t_j} - S_{t_{j+1}})$, $\quad Q^B(S_{t_{j+1}}, a_{t_j}) = c(p + S_{t_{j+1}} - a_{t_j})$.

- Benchmark: $\delta^* = 0.5$, $V^* = 0.016$.

[Figure: optimal policy given by RL at different times t.]


RL Policy for the Lower-Bounded Demand $\bar{Q}^{S,B}(\cdot)$

- $\bar{Q}^S(S_{t_{j+1}}, b_{t_j}) = \max\{0,\ c(p + b_{t_j} - S_{t_{j+1}})\}$, $\quad \bar{Q}^B(S_{t_{j+1}}, a_{t_j}) = \max\{0,\ c(p + S_{t_{j+1}} - a_{t_j})\}$.

- Benchmarks: time-invariant case: $\delta^* \approx 0.65$, $V^* \approx 0.163$ (for $t < 0.9\,T$); myopic case: $\delta^* \approx 0.65$ most of the time, moving quickly to $0.5$ as $t \to T$.

[Figure: optimal policy given by RL at different times t.]


RL Policy for the Capped Demand $\bar{\bar{Q}}^{S,B}(\cdot)$

- $\bar{\bar{Q}}^S(S_{t_{j+1}}, b_{t_j}) = \min\{\max\{0,\ c(p + b_{t_j} - S_{t_{j+1}})\},\ cp\}$, $\quad \bar{\bar{Q}}^B(S_{t_{j+1}}, a_{t_j}) = \min\{\max\{0,\ c(p + S_{t_{j+1}} - a_{t_j})\},\ cp\}$.

- Benchmarks: time-invariant case: $\delta^* \approx 0.6$, $V^* \approx 0.196$; myopic case: $\delta^* \approx 0.6$ most of the time, moving quickly to $0.5$ as $t \to T$.

[Figure: optimal policy given by RL at different times t.]


Conclusion

Market-making models with linear/nonlinear demand functions:

- For linear demand functions, we obtain an analytical optimal placement policy under only a martingale condition on the fundamental price.

- For nonlinear demand functions bounded below by zero, and for those with both upper and lower bounds, we present approximate optimal placement policies under the time-invariance assumption and the myopic assumption, respectively.

We train an RL agent in an environment simulated from the market-making model above:

- Overall, the optimal RL policies are consistent with the benchmarks. However, states near the boundary are seldom visited by the RL agent, so the optimal policies at those states have not been learned yet.

- Flat total rewards across different actions lead to "islands" of similar RL policies in some states.


Future Work I

Improvements on the market-making model:

- Penalty on large inventory:

  - $\max \mathbb{E}\big( W_T + I_T S_T - \lambda I_T^2 \big)$: penalty on end-of-day inventory;

  - $\max \mathbb{E}\big( W_T + I_T S_T - \lambda \int_0^T I_t^2\, dt \big)$: penalty on intraday inventory.

- Model the bid and ask prices in the LOB instead of the fundamental price.

- Consider the impact of our agent's placements on the market:

  - the arrival rate and volume of incoming MOs;
  - price impact.


Future Work II

Improvements on the reinforcement learning technique:

- When the underlying model gets more complex:

  - incorporate actor-critic methods to allow a continuous action space;
  - apply more flexible models (e.g., neural networks) to approximate the action-value function $q(s, a)$ in high-dimensional problems.

- Can reinforcement learning help improve the model?

  - Use the reinforcement learning result as a benchmark to check the validity of the proposed model on real-world problems.


References

Avellaneda, M., & Stoikov, S. (2008). High-frequency trading in a limit order book. Quantitative Finance, 8(3), 217-224.

Cartea, A., & Jaimungal, S. (2015). Risk metrics and fine tuning of high-frequency trading strategies. Mathematical Finance, 25(3), 576-611.

Adrian, T., Capponi, A., Vogt, E., & Zhang, H. (2018). Intraday market making with overnight inventory costs.

Chan, N. T., & Shelton, C. (2001). An electronic market-maker.

Spooner, T., Fearnley, J., Savani, R., & Koukorinis, A. (2018). Market making via reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (pp. 434-442).

Hendershott, T., & Menkveld, A. J. (2014). Price pressures. Journal of Financial Economics, 114(3), 405-423.


Appendix: Outline Of Theorem Proof

Set $W_0 = I_0 = 0$ and write $b_{t_j} = S_{t_j} - \delta^-_j p$ and $a_{t_j} = S_{t_j} + \delta^+_j p$ with $\delta^{\pm}_j \in \mathcal{F}_{t_j}$. The final reward can be decomposed into a sequence of immediate rewards:
$$W_T + S_T I_T = \sum_{j=0}^{\infty} \big\{ (W_{t_{j+1}} - W_{t_j}) + (S_{t_{j+1}} I_{t_{j+1}} - S_{t_j} I_{t_j}) \big\} =: \sum_{j=0}^{\infty} R_{t_{j+1}}.$$

If $t_j = T$, then $R_{t_{j+1}} = 0$. If $t_{j+1} = T$ and $t_j < T$, then $R_{t_{j+1}} = I_{t_j}(S_{t_{j+1}} - S_{t_j})$, which has mean $0$. If $t_{j+1} < T$, then
$$R_{t_{j+1}} = cp^2 \big(1 - \delta^{\alpha_{j+1}}_j\big)\delta^{\alpha_{j+1}}_j - c\big(S_{t_{j+1}} - S_{t_j}\big)^2 + cp\,\alpha_{j+1}\big(1 - 2\delta^{\alpha_{j+1}}_j\big)\big(S_{t_{j+1}} - S_{t_j}\big) + I_{t_j}\big(S_{t_{j+1}} - S_{t_j}\big).$$

The "martingale condition" implies that the contribution of the last two terms is $0$. The first term is maximal when $\delta^{\pm}_j = \tfrac{1}{2}$, while the second term does not depend on the $\delta_j$'s.
