Post on 28-Mar-2015
Efficient Sequential Decision-Making in Structured Problems
Adam Tauman Kalai
Georgia Institute of Technology / Weizmann Institute / Toyota Technological Institute / National Institute of Corrections
BANDITS AND REGRET
[Figure: several slot machines played over time, with per-round payoffs shown; the best machine averages reward 8 while the algorithm averages 5.]
REGRET = (avg reward of best decision) − (avg reward of algorithm) = 8 − 5 = 3
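The regret arithmetic above can be checked in a few lines (the payoff numbers here are hypothetical stand-ins for the figure):

```python
# Regret = (avg reward of best single decision) - (avg reward actually earned).
algorithm_rewards = [5, 5, 5, 5]      # hypothetical per-round rewards earned
machine_rewards = {                   # hypothetical per-round rewards per machine
    "best machine": [8, 8, 8, 8],
    "other machine": [6, 6, 6, 6],
}
avg = sum(algorithm_rewards) / len(algorithm_rewards)
best_avg = max(sum(r) / len(r) for r in machine_rewards.values())
regret = best_avg - avg
print(regret)  # 8 - 5 = 3
```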
TWO APPROACHES
Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine.
Thm: Maximize (discounted) expected reward by pulling the arm of largest “Gittins index”.

Nonstochastic setting [Auer, Cesa-Bianchi, Freund, Schapire95]:
Thm: For any sequence of [0,1] costs on N machines, their algorithm achieves expected regret of O(√(N log N / T)).
[Figure: routing example — a table of alternative routes with travel times 25 min, 17 min, and 44 min.]
STRUCTURED COMB-OPT
[Figure: clustering example — alternative clusterings with 40, 55, and 19 errors.]
Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification.
Problems not included: portfolio selection (nonlinear), online sudoku.
STRUCTURED COMB-OPT
Known decision set S.
Known LINEAR cost function c: S × [0,1]^d → [0,1].
Unknown w1, w2, …, wT ∈ [0,1]^d.
On period t = 1, 2, …, T:
  Alg. picks st ∈ S.
  Alg. pays and finds out c(st, wt).
REGRET = (1/T) ∑_{t≤T} c(st, wt) − min_{s∈S} (1/T) ∑_{t≤T} c(s, wt)
MAIN POINTS
Offline optimization M: [0,1]^d → S, M(w) = argmin_{s∈S} c(s, w), e.g. shortest path. Easier than sequential decision-making!?
EXPLORATION: automatically find an “exploration basis” using M.
LOW REGRET: dimension matters more than # of decisions.
EFFICIENCY: online algorithm uses the offline black-box optimizer M.
MAIN RESULT
An algorithm that achieves:
For any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w1, …, wT ∈ [0,1]^d,
  E[regret of alg] ≤ 15 d T^(−1/3).
Each update requires linear time and calls the offline optimizer M with probability O(d T^(−1/3)).
[AK04,MB04,DH06]
EXPLORE vs EXPLOIT
Find a good “exploration basis” using M.
On period t = 1, 2, …, T:
  EXPLORE with probability ε:
    Play st := a random element of the exploration basis.
    Estimate vt somehow.
  EXPLOIT with probability 1−ε:
    Play st := M(∑_{i<t} vi + p), where p is a random perturbation [Hannan57].
    vt := 0.
Key property: E[vt] = wt.
E[calls to M] = εT (the exploit choice only changes after an explore period).
[AK04, MB04]
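A hedged sketch of this loop on a toy problem (everything here is hypothetical: the decision set is the d coordinate vectors, so the standard basis doubles as the exploration basis, and c(s, w) = s · w; the perturbation scale is a guess, not the tuned value from the papers):

```python
import random

d, T = 3, 2000
eps = 0.3  # exploration probability (the epsilon of the slides)

def M(w):
    """Offline optimizer: argmin over S of c(s, w); here just the cheapest coordinate."""
    return min(range(d), key=lambda i: w[i])

random.seed(0)
w_seq = [[random.random() for _ in range(d)] for _ in range(T)]  # hidden cost vectors

v_sum = [0.0] * d                                   # running sum v_1 + ... + v_{t-1}
p = [random.uniform(0, T ** 0.5) for _ in range(d)]  # Hannan-style random perturbation
total = 0.0
for w in w_seq:
    if random.random() < eps:                        # EXPLORE
        i = random.randrange(d)                      # random basis element
        total += w[i]
        v_sum[i] += w[i] * d / eps                   # unbiased: E[v_t] = w_t
    else:                                            # EXPLOIT (v_t := 0)
        i = M([v_sum[j] + p[j] for j in range(d)])
        total += w[i]

best = min(sum(w[i] for w in w_seq) for i in range(d))
print(total / T - best / T)                          # average regret, should be small
```

The unbiasedness check: an explore period happens with probability ε, picks coordinate i with probability 1/d, and scales by d/ε, so each coordinate of v_t has expectation w_t in that coordinate, matching the key property above.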
REMAINDER OF TALK
EXPLORATION: good “exploration basis” — definition; finding one.
EXPLOITATION: perturbation (randomized regularization); stability analysis.
OTHER DIRECTIONS: approximation algorithms; convex problems.
EXPLORATION
GOING TO d-DIMENSIONS
Linear cost function c: S × [0,1]^d → [0,1]. Mapping S → [0,1]^d:
  s̄ = (c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1))), so that c(s,w) = s̄ · w.
S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄); WLOG dim(S̄) = d.
[Figure: the convex hull K of the embedded decision set.]
EXPLORATION BASIS
Def: An exploration basis b1, b2, …, bd ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = ∑i λi b̄i for some λ1, λ2, …, λd ∈ [−2, 2].
Possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s, w).
[AK04]
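The spanner condition itself is easy to test once the embedding is explicit. A minimal sketch on a hypothetical 2-D decision set (the vectors and candidate basis are made up for illustration; this checks the property, it is not the [AK04] construction):

```python
# Each tuple is the embedding s-bar of one decision; (b1, b2) is a candidate
# exploration basis. The 2-barycentric-spanner condition requires
# s-bar = lam1*b1 + lam2*b2 with lam1, lam2 in [-2, 2] for every decision.
S_bar = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
b1, b2 = (1.0, 0.0), (0.0, 1.0)   # candidate basis

def coeffs(s):
    # Solve lam1*b1 + lam2*b2 = s (a 2x2 linear system) via Cramer's rule.
    det = b1[0] * b2[1] - b2[0] * b1[1]
    lam1 = (s[0] * b2[1] - b2[0] * s[1]) / det
    lam2 = (b1[0] * s[1] - s[0] * b1[1]) / det
    return lam1, lam2

is_spanner = all(all(abs(l) <= 2 for l in coeffs(s)) for s in S_bar)
print(is_spanner)  # True for this toy set
```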
[Figure: the hull K again, contrasting a bad exploration basis with a good one.]
EXPLOITATION
INSTABILITY
Define zt = M(∑_{i≤t} wi) = argmin_{s∈S} ∑_{i≤t} c(s, wi).
Natural idea: use z_{t−1} on period t?
REGRET = Ω(1)!
Adversarial cost sequence on two decisions:
  w1 = (½, 0), w2 = (0, 1), w3 = (1, 0), w4 = (0, 1), w5 = (1, 0), …
Following the leader pays cost 1 every period after the first, while the best fixed decision averages ½.
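The failure is easy to see by simulating follow-the-leader on exactly this sequence (T = 1000 rounds, two decisions):

```python
# Follow-the-leader (FTL) on the alternating cost sequence from the slide:
# w1 = (1/2, 0), then (0,1), (1,0), (0,1), ...  FTL always switches to the arm
# that was best so far, which the adversary then charges cost 1.
T = 1000
costs = [(0.5, 0.0)] + [((0.0, 1.0) if t % 2 == 0 else (1.0, 0.0))
                        for t in range(2, T + 1)]
cum = [0.0, 0.0]
ftl_total = 0.0
for w in costs:
    leader = 0 if cum[0] <= cum[1] else 1   # FTL: best arm on past costs
    ftl_total += w[leader]
    cum[0] += w[0]
    cum[1] += w[1]
best_fixed = min(cum)                        # best single decision in hindsight
print(ftl_total / T, best_fixed / T)         # ~1.0 vs ~0.5
```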
STABILITY ANALYSIS [KV03]
Define zt = M(∑_{i≤t} wi) = argmin_{s∈S} ∑_{i≤t} c(s, wi).
Lemma: Regret of using zt on period t is ≤ 0.
Proof:
  min_{s∈S} c(s,w1) + c(s,w2) + … + c(s,wT)
  = c(zT,w1) + … + c(zT,w_{T−1}) + c(zT,wT)
  ≥ c(z_{T−1},w1) + … + c(z_{T−1},w_{T−1}) + c(zT,wT)
  ≥ …
  ≥ c(z1,w1) + c(z2,w2) + … + c(zT,wT).
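The lemma can be sanity-checked numerically on a toy linear problem (decisions are coordinate vectors; the cost data is random and hypothetical):

```python
import random

# "Be the leader": playing z_t = argmin_s sum_{i<=t} c(s, w_i) on period t
# (note z_t peeks at w_t) has total cost at most that of the best fixed
# decision in hindsight, exactly as the chain of inequalities above shows.
random.seed(1)
d, T = 4, 200
w_seq = [[random.random() for _ in range(d)] for _ in range(T)]

cum = [0.0] * d
btl_total = 0.0
for w in w_seq:
    for i in range(d):
        cum[i] += w[i]                     # cumulative cost through period t
    z_t = min(range(d), key=lambda i: cum[i])
    btl_total += w[z_t]
best_fixed = min(cum)
print(btl_total <= best_fixed)             # the lemma says this is always True
```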
STABILITY ANALYSIS [KV03]
Define zt = M(∑_{i≤t} wi) = argmin_{s∈S} ∑_{i≤t} c(s, wi).
Lemma: Regret of using zt on period t is ≤ 0.
⇒ Regret of z_{t−1} on period t ≤ (1/T) ∑_{t≤T} c(z_{t−1},wt) − c(zt,wt).
Idea: regularize to achieve stability.
Let yt = M(∑_{i≤t} wi + p), for random p ∈ [0,1]^d.
E[Regret of y_{t−1} on t] ≤ (1/T) ∑_{t≤T} E[c(y_{t−1},wt) − c(yt,wt)] + (a term bounded by the size of the perturbation p).
Strange: randomized regularization!
yt can be computed using M.
OTHER DIRECTIONS
BANDIT CONVEX OPT.
Convex feasible set S ⊆ R^d.
Unknown sequence of concave functions f1, …, fT: S → [0,1] (rewards, to be maximized).
On period t = 1, 2, …, T:
  Algorithm chooses xt ∈ S.
  Algorithm pays and finds out ft(xt).
Thm: ∀ concave f1, f2, …: S → [0,1] and ∀ T0, T ≥ 1, the bacterial ascent algorithm achieves low expected regret.
MOTIVATING EXAMPLE
Company has to decide how much to advertise among d channels, within budget.
Feedback is total profit, affected by external factors.
[Figure: profit ft(xt) plotted against advertising spend xt for periods t = 1, …, 4, with the optimum x* marked.]
BACTERIAL ASCENT
[Figure sequence: within the feasible set S, the algorithm alternates EXPLORE and EXPLOIT steps, generating iterates x0, x1, x2, x3 that climb toward the optimum.]
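The slides convey bacterial ascent only through pictures. As a rough, hypothetical stand-in (not the actual algorithm or its guarantee), here is generic zeroth-order hill climbing on a made-up concave profit function, alternating probe (explore) and move (exploit) steps using only function evaluations:

```python
import random

# Zeroth-order ascent sketch: probe a few nearby points, move to the best one,
# and shrink the search radius over time. f is a hypothetical concave profit.
random.seed(2)

def f(x):                      # hypothetical concave profit, peak at x = 0.7
    return 1.0 - (x - 0.7) ** 2

x, step = 0.1, 0.2
for _ in range(100):
    probes = [max(0.0, min(1.0, x + random.uniform(-step, step)))
              for _ in range(5)]            # EXPLORE: sample nearby points
    best = max(probes, key=f)
    if f(best) > f(x):                      # EXPLOIT: move only on improvement
        x = best
    step *= 0.97                            # shrink the search radius
print(round(x, 2))                          # should approach 0.7
```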
APPROXIMATION ALG’s
What if offline optimization is NP-hard? Example: repeated traveling salesman problem.
Suppose you have an α-approximation algorithm A:
  c(A(w), w) ≤ α · min_{s∈S} c(s, w) for all w ∈ [0,1]^d.
Would like to achieve low α-regret = our cost − α · (min cost of best s ∈ S).
Possible using the convex optimization approach above and transformations of approximation algorithms. [KKL07]
CONCLUSIONS
Can extend bandit algorithms to structured problems:
  Guarantee worst-case low regret.
  Linear combinatorial optimization problems.
  Convex optimization.
Remarks:
  Works against adaptive adversaries as well.
  Online efficiency ≈ offline efficiency.
  Can handle approximation algorithms.
  Can achieve cost ≤ (1+ε) · min cost + O(1/ε).