UAV Route Planning in Delay Tolerant Networks
Daniel Henkel, Timothy X Brown
University of Colorado, Boulder
Infotech @ Aerospace ‘07
May 8, 2007
Familiar: Dial-A-Ride
• Receive calls
• Pick up and drop off passengers
• Minimize overall transit time
[Photo: the Dial-A-Ride bus]
Dial-A-Ride: curb-to-curb, shared ride transportation service
Optimal route not trivial!
In context: Dial-A-UAV
• Sparsely distributed sensors, limited radios
• TSP solution not optimal
• Our approach: queueing and MDP theory
[Figure: UAV shuttling between Sensor-1 through Sensor-6 and a Monitoring Station; annotation: "Delay tolerant traffic!"]
Complication: infinite data at sensors; potentially two-way traffic
Talk tomorrow at 8am: Sensor Data Collection
TSP’s Problem
• One cycle visits every node
• Problem: far-away nodes with little data to send → visit them less often
[Figure: Traveling Salesman solution. UAV cycles between nodes A and B and the hub, with flow rates f_A, f_B and distances d_A, d_B]
New: cycle defined by visit frequencies p_i
Queueing Approach
Idea: express delay in terms of p_i, then minimize over the set {p_i}
• p_i as probability distribution: Σ_i p_i = 1, p_i ≥ 0
• Expected service time of any packet: T̄ = Σ_i p_i T_i
• Inter-service time: exponential distribution with mean T̄/p_i
• Weighted delay: D = Σ_i (f_i/F)(T̄/p_i), where F = Σ_i f_i is the total flow
[Figure: UAV serving nodes A, B, C, D through a hub, with flow rates f_A…f_D, distances d_A…d_D, and visit probabilities p_A…p_D]
Goal: minimize average delay

minimize D = (1/F) Σ_i (f_i/p_i) Σ_j p_j T_j   subject to Σ_i p_i = 1, p_i ≥ 0
Solution and Algorithm
Probability of choosing node i for next visit:
p_i = √(f_i/T_i) / Σ_j √(f_j/T_j)
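The slide states this result without the intermediate step; one way to obtain it is the Cauchy–Schwarz inequality applied to the objective from the previous slide:

F·D = (Σ_j p_j T_j)(Σ_i f_i/p_i) ≥ (Σ_i √(f_i T_i))²

with equality exactly when p_i T_i ∝ f_i/p_i, i.e. p_i ∝ √(f_i/T_i); normalizing to Σ_i p_i = 1 gives the formula above.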
Implementation: deterministic algorithm
1. Set c_i = 0
2. c_i = c_i + p_i while max{c_i} < 1
3. k = argmax{c_i}
4. Visit node k; c_k = c_k - 1
5. Go to 2.
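As a concrete illustration, here is a minimal Python sketch of both the closed-form probabilities and the deterministic schedule above; the two-node flow rates and trip times at the bottom are made-up example values, not figures from the talk.

```python
import numpy as np

def visit_probabilities(f, T):
    """Closed-form solution from the slide: p_i proportional to sqrt(f_i/T_i)."""
    w = np.sqrt(np.asarray(f, dtype=float) / np.asarray(T, dtype=float))
    return w / w.sum()

def visit_sequence(p, n_visits):
    """Deterministic schedule from the slide: credit counters c_i grow by p_i
    per step; the node whose counter first reaches 1 is visited next."""
    c = np.zeros(len(p))
    order = []
    for _ in range(n_visits):
        while c.max() < 1.0:       # step 2: accumulate credit
            c += p
        k = int(c.argmax())        # step 3: most-overdue node
        order.append(k)
        c[k] -= 1.0                # step 4: charge the visit
    return order

# Hypothetical example (flows and round-trip times are invented):
p = visit_probabilities(f=[4.0, 1.0], T=[1.0, 4.0])   # -> [0.8, 0.2]
print(visit_sequence(p, 10))  # node 0 served ~4x as often as node 1
```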
Performance improvement over TSP!
Unknown Environment
• What is RL?
  • Learning what to do without prior training
  • Given: high-level goal; NOT: how to reach it
  • Improving actions on the go
• Distinguishing Features:
  • Interaction with environment
  • Trial & Error Search
  • Concept of Rewards & Punishments
• Example: training a dog; it learns a model of its environment
The Framework
• Agent: performs Actions
• Environment: gives rise to Rewards and puts the Agent in situations called States
Elements of RL
[Diagram: elements of RL: Policy, Reward, Value, Model of Environment]
• Policy: what to do (depending on state)
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what
Source: Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press, 1998
UA Path Planning - Simple
• Service traffic from A and B to hub H
• Goal: minimize average packet delay
• State: traffic waiting at nodes: (t_A, t_B)
• Actions: fly to A; fly to B
• Reward: # packets delivered
• Optimal policy: # visits to A and B; depends on flow rates and distances (see sketch below)
[Figure: UAV serving nodes A and B through the hub, with flow rates f_A, f_B, distances d_A, d_B, and visit probabilities p_A, p_B]
Goal: minimize average delay → find p_A and p_B
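A minimal Python sketch of this toy problem, usable as an RL environment later; the class name, flow rates, and trip times are illustrative assumptions, not values from the talk.

```python
class TwoNodeFerry:
    """Toy model of the slide's MDP: a UAV ferries packets from nodes A and B
    to a hub. State: traffic waiting at the nodes (t_A, t_B). Actions: 0 = fly
    to A, 1 = fly to B. Reward: packets delivered. All numbers are assumed."""

    FLOWS = (0.8, 0.2)      # packet arrival rates at A and B (invented)
    TRIP_TIME = (1.0, 2.0)  # round-trip times hub <-> node (invented)

    def reset(self):
        self.waiting = [0.0, 0.0]
        return (0.0, 0.0)

    def step(self, action):
        dt = self.TRIP_TIME[action]
        for i in (0, 1):                       # traffic accumulates during flight
            self.waiting[i] += self.FLOWS[i] * dt
        reward = self.waiting[action]          # deliver everything at that node
        self.waiting[action] = 0.0
        state = (round(self.waiting[0], 1),    # round so tabular methods can
                 round(self.waiting[1], 1))    # treat states as discrete
        return state, reward
```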
MDP
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  • state and action sets
  • one-step "dynamics" defined by transition probabilities:
    P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }   for all s, s' ∈ S, a ∈ A(s)
  • reward expectation:
    R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }   for all s, s' ∈ S, a ∈ A(s)
• Policy: mapping from set of States to set of Actions, π : S → A
• Return: the (discounted) sum of rewards from this time onwards,
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}
• Value function (of a state): expected return when starting in s and following policy π. For an MDP,
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
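To make the one-step dynamics concrete, a finite MDP can be written as nested dictionaries; this tiny two-state example is hypothetical (states, actions, and numbers are invented), and the same encoding is assumed by the sketches that follow.

```python
# Hypothetical 2-state, 2-action finite MDP (all names/numbers invented):
# P[s][a][s'] = Pr{ s_{t+1}=s' | s_t=s, a_t=a },  R[s][a][s'] = E{ r_{t+1} }.
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0},            "a1": {"s1": 1.0}},
}
R = {
    "s0": {"a0": {"s0": 0.0, "s1": 1.0}, "a1": {"s1": 2.0}},
    "s1": {"a0": {"s0": 0.0},            "a1": {"s1": 0.5}},
}
```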
RL approach to solving MDPs
Bellman Equation for Policy π
• Evaluating E{·} and assuming a deterministic policy π gives the solution:
  V^π(s) = Σ_{s'} P^{π(s)}_{ss'} [ R^{π(s)}_{ss'} + γ V^π(s') ]
• Action-Value Function: value of taking action a in state s. For an MDP,
  Q^π(s,a) = E_π{ R_t | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
• V and Q both have a partial ordering on them, since they are real-valued; policies π are ordered as well:
• Concept of V* and Q*:
• Concept of π*: The policy π which maximizes Qπ(s,a) for all states s.
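A minimal sketch of iterative policy evaluation, the standard fixed-point iteration for the Bellman equation above (an illustration, not code from the talk); it assumes the nested-dict P/R encoding introduced earlier.

```python
def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman equation
    V(s) = sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V(s')), a = policy[s],
    until the value function stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]
            v = sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in P[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Example on the toy MDP above: always take "a1" in s0, "a0" in s1.
# V = evaluate_policy(P, R, policy={"s0": "a1", "s1": "a0"})
```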
Optimality
π ≥ π' if and only if V^π(s) ≥ V^{π'}(s) for all s ∈ S

V*(s) = max_π V^π(s)   for all s ∈ S

Q*(s,a) = max_π Q^π(s,a)   for all s ∈ S and a ∈ A(s)

π*(s) = argmax_{a ∈ A(s)} Q*(s,a)
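Read together, these definitions yield value iteration with greedy policy extraction; a sketch under the same assumed P/R encoding (not code from the talk):

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute V*(s) = max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V*(s'))
    by successive sweeps, then extract the greedy policy pi*."""
    def q(s, a, V):
        return sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])

    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(q(s, a, V) for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    pi = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    return V, pi
```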
Reinforcement Learning - Methods
• To find π*, all methods try to evaluate V/Q value functions
• Different Approaches:
  • Dynamic Programming Approach
    • Policy evaluation, improvement, iteration
  • Monte-Carlo Methods
    • Decisions are taken based on averaging sample returns
• Temporal Difference Methods (!!)
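Q-learning is the best-known temporal-difference method and a natural fit when flow rates and distances are unknown; this tabular sketch is an illustration, not the authors' implementation. The reset()/step() interface matches the TwoNodeFerry sketch above (an assumed convention), and states must be discrete or discretized for the table to stay finite.

```python
import random
from collections import defaultdict

def q_learning(env, actions=(0, 1), episodes=500, steps=100,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: after each transition, nudge Q(s,a) toward the
    temporal-difference target r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            if random.random() < eps:                     # explore
                a = random.choice(actions)
            else:                                         # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r = env.step(a)
            target = r + gamma * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# Usage with the earlier toy environment:
# Q = q_learning(TwoNodeFerry())
```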