Application of Reinforcement Learning in Network Routing
By Chaopin Zhu
Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
Feature: Learning with a teacher
Phases
• Training phase
• Testing phase
Application
• Pattern recognition
• Function approximation
Unsupervised Learning
Feature
• Learning without a teacher
Application
• Feature extraction
• Other preprocessing
Reinforcement Learning
Feature: Learning with a critic
Application
• Optimization
• Function approximation
Elements of Reinforcement Learning
• Agent
• Environment
• Policy
• Reward function
• Value function
• Model of environment (optional)
Reinforcement Learning Problem
Figure: the agent-environment interaction loop. At step n the agent observes state x_n and takes action a_n; the environment returns reward r_n and the next state x_{n+1} with reward r_{n+1}.
Markov Decision Process (MDP)
Definition:
A reinforcement learning task that satisfies the Markov property
Transition probabilities:
$$P_{xy}^{a} = \Pr\{\, x_{n+1} = y \mid x_n = x,\ a_n = a \,\}$$
An Example of MDP
Figure: an example MDP with two states, High and Low; the available actions are search, wait, and recharge.
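A sketch of how the transition probabilities from the previous slide can be tabulated for this two-state example. The probability values below are made-up assumptions (the slide gives the structure, not numbers), as is the restriction of recharge to the Low state:

```python
# Transition probabilities for the two-state example, stored as
# P[x][a][y] = Pr{x_{n+1} = y | x_n = x, a_n = a}. The numeric values are
# made-up assumptions, and "recharge" is assumed available only in Low.
P = {
    "High": {"search": {"High": 0.7, "Low": 0.3},
             "wait":   {"High": 1.0}},
    "Low":  {"search":   {"High": 0.2, "Low": 0.8},
             "wait":     {"Low": 1.0},
             "recharge": {"High": 1.0}},
}

def transition_prob(x, a, y):
    """Pr{x_{n+1} = y | x_n = x, a_n = a}; unlisted transitions have probability 0."""
    return P[x][a].get(y, 0.0)

# Each (state, action) pair must define a probability distribution over next states.
for state_actions in P.values():
    for dist in state_actions.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```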
Markov Decision Process (cont.)
Parameters
• state x_n
• action a_n
• reward r_n
• discount factor γ
• policy π
Value functions
$$V^{\pi}(x) = E_{\pi}\Bigl[\sum_{k=0}^{\infty} \gamma^{k} r_{n+k+1} \,\Big|\, x_n = x\Bigr]$$
$$Q^{\pi}(x,a) = E_{\pi}\Bigl[\sum_{k=0}^{\infty} \gamma^{k} r_{n+k+1} \,\Big|\, x_n = x,\ a_n = a\Bigr]$$
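For concreteness, the discounted sum inside these expectations can be evaluated for a short reward sequence. The rewards and discount below are made-up numbers:

```python
# Discounted return inside the value-function expectation:
#   G_n = sum_{k=0}^{inf} gamma^k * r_{n+k+1},
# computed for a short, made-up reward sequence (rewards beyond it are taken as 0).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # r_{n+1}, r_{n+2}, r_{n+3}, r_{n+4}

G = sum(gamma**k * r for k, r in enumerate(rewards))
# G = 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```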
Elementary Methods for the Reinforcement Learning Problem
• Dynamic programming
• Monte Carlo methods
• Temporal-difference learning
Bellman’s Equations
$$Q^{*}(x,a) = E\bigl[r_{n+1} + \gamma \max_{a'} Q^{*}(x_{n+1},a') \,\big|\, x_n = x,\ a_n = a\bigr]$$
$$V^{*}(x) = \max_{a} E\bigl[r_{n+1} + \gamma V^{*}(x_{n+1}) \,\big|\, x_n = x,\ a_n = a\bigr]$$
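These fixed-point equations suggest value iteration: repeatedly apply the right-hand side as a backup until V stops changing. A minimal sketch on a tiny made-up MDP (states, transitions, and rewards are illustrative assumptions):

```python
# Value-iteration sketch: repeatedly apply the Bellman optimality backup
#   V(x) <- max_a [ R(x,a) + gamma * sum_y P(y|x,a) * V(y) ]
# on a made-up two-state MDP until convergence.
gamma = 0.9
P = {"s0": {"go": {"s1": 1.0}, "stay": {"s0": 1.0}},   # P[x][a][y]
     "s1": {"go": {"s0": 1.0}, "stay": {"s1": 1.0}}}
R = {"s0": {"go": 0.0, "stay": 0.0},                   # R[x][a]
     "s1": {"go": 0.0, "stay": 1.0}}

V = {x: 0.0 for x in P}
for _ in range(200):
    V = {x: max(R[x][a] + gamma * sum(p * V[y] for y, p in P[x][a].items())
                for a in P[x])
         for x in P}
# Staying in s1 pays 1 forever: V(s1) = 1/(1-gamma) = 10, V(s0) = gamma*10 = 9.
```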
Dynamic Programming Methods
Policy evaluation
$$V_{k+1}(x) = E_{\pi}\bigl[r_{n+1} + \gamma V_{k}(x_{n+1}) \,\big|\, x_n = x\bigr]$$
Policy improvement
$$Q^{\pi}(x,a) = E\bigl[r_{n+1} + \gamma V^{\pi}(x_{n+1}) \,\big|\, x_n = x,\ a_n = a\bigr]$$
$$\pi'(x) = \arg\max_{a} Q^{\pi}(x,a)$$
Dynamic Programming (cont.)
E ---- policy evaluation
I ---- policy improvement
Policy iteration
$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$$
Value iteration
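The E/I alternation can be sketched directly. The deterministic two-state MDP below is a made-up example, not anything from the slides:

```python
# Policy iteration sketch: alternate evaluation (E) and greedy improvement (I)
# until the policy is stable. The MDP ("go"/"stay", rewards, gamma) is made up.
gamma = 0.9
states, actions = ["s0", "s1"], ["go", "stay"]
nxt = {"s0": {"go": "s1", "stay": "s0"},      # deterministic transitions
       "s1": {"go": "s0", "stay": "s1"}}
R = {"s0": {"go": 0.0, "stay": 0.0},
     "s1": {"go": 0.0, "stay": 1.0}}

policy = {x: "go" for x in states}
while True:
    # E: iterative policy evaluation, V(x) <- R(x, pi(x)) + gamma * V(next state)
    V = {x: 0.0 for x in states}
    for _ in range(200):
        V = {x: R[x][policy[x]] + gamma * V[nxt[x][policy[x]]] for x in states}
    # I: greedy improvement, pi'(x) = argmax_a [ R(x,a) + gamma * V(next state) ]
    improved = {x: max(actions, key=lambda a: R[x][a] + gamma * V[nxt[x][a]])
                for x in states}
    if improved == policy:                    # stable policy -> optimal
        break
    policy = improved
```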
Monte Carlo Methods
Feature
• Learning from experience
• Does not need complete transition probabilities
Idea
• Partition experience into episodes
• Average the sample returns
• Update on an episode-by-episode basis
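A first-visit Monte Carlo sketch of these ideas, using a made-up random-walk task (states 0 to 4, with 4 terminal and a reward of 1 on termination):

```python
import random

# First-visit Monte Carlo evaluation: partition experience into episodes,
# average the sample returns, and update once per episode. The environment
# and the uniform random policy are made-up assumptions.
gamma = 0.9
returns = {x: [] for x in range(4)}          # sample returns observed per state

def episode(start=2):
    """Generate one episode: step left/right uniformly, reward 1 on reaching 4."""
    x, traj = start, []
    while x != 4:
        x_next = max(0, x + random.choice([-1, 1]))
        traj.append((x, 1.0 if x_next == 4 else 0.0))
        x = x_next
    return traj

random.seed(0)
for _ in range(2000):
    traj = episode()
    G, first = 0.0, {}
    for x, r in reversed(traj):              # back the return up through the episode
        G = r + gamma * G
        first[x] = G                         # earliest visit overwrites later ones
    for x, g in first.items():
        returns[x].append(g)                 # episode-by-episode update

V = {x: sum(g) / len(g) for x, g in returns.items() if g}
```

States nearer the terminal state accumulate larger average discounted returns, so V[3] comes out above V[0].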
Temporal-Difference Learning
Features (combination of Monte Carlo and DP ideas)
• Learn from experience (Monte Carlo)
• Update estimates based in part on other learned estimates (DP)
The TD(λ) algorithm seamlessly integrates TD and Monte Carlo methods
TD(0) Learning
Initialize V(x) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        a ← action given by π for x
        Take action a; observe reward r and next state x'
        V(x) ← V(x) + α[r + γV(x') − V(x)]
        x ← x'
    until x is terminal
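The algorithm above, sketched in Python. The random-walk environment, the uniform policy being evaluated, and the values of α and γ are all illustrative assumptions:

```python
import random

# TD(0) sketch for the update V(x) <- V(x) + alpha*[r + gamma*V(x') - V(x)].
# Environment: a made-up random walk on states 0..4; state 4 is terminal and
# entering it yields reward 1.
alpha, gamma = 0.1, 0.9
V = {x: 0.0 for x in range(5)}               # V(terminal) stays 0

random.seed(1)
for _ in range(3000):                        # repeat (for each episode)
    x = 2                                    # initialize x
    while x != 4:                            # repeat (for each step of episode)
        x_next = max(0, x + random.choice([-1, 1]))  # policy: uniform left/right
        r = 1.0 if x_next == 4 else 0.0      # take action; observe r and x'
        V[x] += alpha * (r + gamma * V[x_next] - V[x])
        x = x_next                           # x <- x'
```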
Q-Learning
Initialize Q(x,a) arbitrarily
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        Choose a from x using a policy derived from Q
        Take action a; observe r, x'
        Q(x,a) ← Q(x,a) + α[r + γ max_{a'} Q(x',a') − Q(x,a)]
        x ← x'
    until x is terminal
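A sketch of the algorithm with an ε-greedy policy derived from Q. The corridor task and all constants are made-up assumptions:

```python
import random

# Q-learning sketch: Q(x,a) <- Q(x,a) + alpha*[r + gamma*max_a' Q(x',a') - Q(x,a)]
# with an epsilon-greedy behavior policy. The corridor task (states 0..4,
# terminal state 4, reward 1 on arrival) is made up.
alpha, gamma, eps = 0.1, 0.9, 0.1
actions = [+1, -1]                            # move right / move left
Q = {(x, a): 0.0 for x in range(5) for a in actions}

random.seed(2)
for _ in range(3000):                         # repeat (for each episode)
    x = 0                                     # initialize x
    while x != 4:                             # repeat (for each step of episode)
        # choose a from x using the policy derived from Q (epsilon-greedy)
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(x, b)])
        x_next = max(0, x + a)                # take action a; observe r, x'
        r = 1.0 if x_next == 4 else 0.0
        best_next = max(Q[(x_next, b)] for b in actions)
        Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
        x = x_next

greedy = {x: max(actions, key=lambda b: Q[(x, b)]) for x in range(4)}
```

The greedy policy extracted from the learned Q moves right toward the terminal state from every non-terminal state.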
Q-Routing
Q_x(y,d) ---- estimated time for a packet to reach the destination node d from the current node x via x's neighbor node y
T_y(d) ---- y's estimate of the time remaining in the trip
q_y ---- queueing time at node y
T_xy ---- transmission time between x and y
$$T_y(d) = \min_{z \in N(y)} Q_y(z,d)$$
Algorithm of Q-Routing
1. Set initial Q-values for each node
2. Get the first packet from the packet queue of node x
3. Choose the best neighbor node $\hat{y} = \arg\min_{y \in N(x)} Q_x(y,d)$ and forward the packet to node $\hat{y}$
4. Get the estimated value $T_{\hat{y}}(d)$ from node $\hat{y}$
5. Update $Q_x(\hat{y},d) \leftarrow Q_x(\hat{y},d) + \eta\bigl[(q_{\hat{y}} + T_{x\hat{y}} + T_{\hat{y}}(d)) - Q_x(\hat{y},d)\bigr]$
6. Go to 2.
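Steps 1 through 6 can be sketched on a toy topology. The four-node graph, queueing times, uniform per-link transmission time, and learning rate η below are all made-up assumptions:

```python
# Q-routing sketch on a toy topology. Q[x][(y, d)] estimates delivery time from
# x to destination d via neighbor y; a single transmission time t stands in for
# the per-link times T_xy. All numbers are illustrative assumptions.
eta = 0.5
neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
q = {"A": 1.0, "B": 3.0, "C": 0.5, "D": 0.0}    # queueing time at each node
t = 1.0                                          # transmission time per link
Q = {x: {(y, d): 10.0 for y in neighbors[x] for d in neighbors}   # step 1
     for x in neighbors}

def T(y, d):
    """y's estimate of the time remaining: T_y(d) = min_{z in N(y)} Q_y(z, d)."""
    return 0.0 if y == d else min(Q[y][(z, d)] for z in neighbors[y])

def forward(x, d):
    """Steps 3-5: pick the best neighbor, update Q_x, and forward the packet."""
    y = min(neighbors[x], key=lambda n: Q[x][(n, d)])             # step 3
    Q[x][(y, d)] += eta * ((q[y] + t + T(y, d)) - Q[x][(y, d)])   # steps 4-5
    return y

# Step 6: keep processing packets; here, route 50 packets from A to D.
for _ in range(50):
    node = "A"
    while node != "D":
        node = forward(node, "D")
# The estimates settle to the delivery times along the best route A -> C -> D:
# Q_C(D,D) -> q_D + t = 1.0 and Q_A(C,D) -> q_C + t + 1.0 = 2.5.
```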
Dual Reinforcement Q-Routing
Figure: dual reinforcement Q-routing. A packet traveling from source s to destination d over the link from x to y drives two updates: forward exploration updates Q_x(y,d), and backward exploration updates Q_y(x,s).
Network Model
Figure: network model consisting of two subnets, Subnet1 and Subnet2.
Network Model (cont.)
Figure: network topology of 36 nodes, numbered 1 through 36, arranged in a 6×6 grid.
Node Model
Figure: node model. Each node contains a packet generator, a packet destroyer, a routing controller, four input queues, and four output queues.
Routing Controller
Figure: routing-controller state machine with states Init and Idle; Arrival, Departure, and Default transitions dispatch the corresponding event handlers.
Initialization/ Termination Procedures
Initialization
• Initialize and/or register global variables
• Initialize the routing table
Termination
• Destroy the routing table
• Release memory
Arrival Procedure
Data packet arrival
• Update the routing table
• Route the packet with control information, or destroy the packet if it has reached its destination
Control-information packet arrival
• Update the routing table
• Destroy the packet
Departure Procedure
• Set all fields of the packet
• Get a shortest route
• Send the packet according to the route
References
[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
[2] Chengan Guo, Applications of Reinforcement Learning in Sequence Detection and Network Routing
[3] Simon Haykin, Neural Networks: A Comprehensive Foundation