
Efficient Sequential Decision-Making in Structured Problems

Adam Tauman Kalai
Georgia Institute of Technology / Weizmann Institute / Toyota Technological Institute / National Institute of Corrections

BANDITS AND REGRET

[Slide figure: slot machines with per-period rewards and running averages; the best machine averages 8 per period, while the player averages 5.]

REGRET = AVG REWARD OF BEST DECISION − AVG REWARD = 8 − 5 = 3

TWO APPROACHES

Bayesian setting [Robbins52]: independent prior probability distribution over payoff sequences for each machine. Thm: maximize (discounted) expected reward by pulling the arm of largest "Gittins index".

Nonstochastic [Auer, Cesa-Bianchi, Freund, Schapire 95]: Thm: for any sequence of [0,1] costs on N machines, their algorithm achieves expected regret of O(√(N log N / T)).
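For reference, the nonstochastic guarantee is achieved by exponential-weights bandit algorithms. Below is a minimal Exp3-style sketch in Python (a textbook variant with a fixed exploration rate gamma, not a transcription of the paper's exact pseudocode):

```python
import math
import random

def exp3(num_arms, T, cost_of, gamma=0.1):
    """Exp3-style bandit for adversarial [0,1] costs.
    cost_of(arm, t) -> cost in [0,1], revealed only for the pulled arm."""
    weights = [1.0] * num_arms
    total_cost = 0.0
    for t in range(T):
        total_w = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        c = cost_of(arm, t)               # bandit feedback: only this arm's cost
        total_cost += c
        est = c / probs[arm]              # unbiased importance-weighted estimate
        weights[arm] *= math.exp(-gamma * est / num_arms)
    return total_cost / T                 # average cost; compare to the best arm
```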

[Slide figure: a map with three candidate routes and their times: 25 min, 17 min, 44 min.]

STRUCTURED COMB-OPT

[Slide figure: three clusterings with 40, 55, and 19 errors.]

Online examples: routing, compression, binary search trees, PCFGs, pruning decision trees, poker, auctions, classification.

Problems not included: portfolio selection (nonlinear), online Sudoku.

STRUCTURED COMB-OPT

Known decision set S.

Known LINEAR cost function c: S × [0,1]^d → [0,1].

Unknown w1, w2, …, wT ∈ [0,1]^d.

On period t = 1, 2, …, T:
  Alg. picks st ∈ S.
  Alg. pays and finds out c(st, wt).

REGRET = (1/T) Σ_{t≤T} c(st, wt) − min_{s∈S} (1/T) Σ_{t≤T} c(s, wt)
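The protocol and the regret definition translate directly into a small harness; a minimal sketch (the choose/observe interface of `alg` is an illustrative assumption):

```python
def run_protocol(S, c, alg, ws):
    """Structured online decision protocol; returns average regret.
    S: finite decision set; c(s, w): linear cost in [0,1]; ws: hidden w_1..w_T.
    alg.choose() -> s_t;  alg.observe(s_t, cost): bandit feedback only."""
    T = len(ws)
    alg_cost = 0.0
    for w in ws:
        s = alg.choose()          # the algorithm commits to s_t before seeing w_t
        cost = c(s, w)            # it pays and finds out only c(s_t, w_t)
        alg.observe(s, cost)
        alg_cost += cost
    best = min(sum(c(s, w) for w in ws) for s in S)
    return (alg_cost - best) / T
```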

MAIN POINTS

Offline optimization M: [0,1]^d → S, M(w) = argmin_{s∈S} c(s,w), e.g. shortest path. Easier than sequential decision-making!?

EXPLORATION: automatically find an "exploration basis" using M.

LOW REGRET: dimension matters more than the number of decisions.

EFFICIENCY: the online algorithm uses the offline black-box optimizer M.

MAIN RESULT

An algorithm that achieves: for any set S, any linear c: S × [0,1]^d → [0,1], any T ≥ 1, and any sequence w1, …, wT ∈ [0,1]^d,

E[regret of alg] ≤ 15 d T^{-1/3}.

Each update requires linear time and calls the offline optimizer M with probability O(d T^{-1/3}).

[AK04, MB04, DH06]

EXPLORE vs EXPLOIT

Find a good "exploration basis" using M.

On period t = 1, 2, …, T:

  Explore, with probability γ:
    Play st := a random element of the exploration basis.
    Estimate vt somehow.

  Exploit, with probability 1 − γ:
    Play st := M(Σ_{i<t} vi + p), where p is a random perturbation [Hannan57].
    vt := 0.

Key property: E[vt] = wt.

E[calls to M] = γT (on exploit periods Σ_{i<t} vi is unchanged, so the previous solution can be reused).

[AK04, MB04]
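Putting the pieces together, here is a minimal runnable sketch of this scheme, a reconstruction in the spirit of [AK04, MB04] rather than the authors' exact pseudocode. The decision set is embedded as rows of `S_bar`, the offline optimizer M is simulated by a brute-force argmin (standing in for, e.g., a shortest-path solver), and the explore-period estimate is scaled by 1/γ so that unconditionally E[vt] = wt. The perturbation scale and γ are untuned parameters here:

```python
import numpy as np

def bandit_fpl(S_bar, basis_idx, ws, gamma, rng=np.random.default_rng(0)):
    """Explore/exploit with an exploration basis and a perturbed leader.
    S_bar: (n, d) array, one embedded decision s-bar per row.
    basis_idx: row indices of the exploration basis (a 2-barycentric spanner).
    ws: hidden cost vectors w_t in [0,1]^d, used only through observed costs."""
    n, d = S_bar.shape
    B = S_bar[basis_idx]                        # d x d, invertible by assumption
    B_inv = np.linalg.inv(B)
    p = rng.uniform(0, 1, d)                    # Hannan-style random perturbation
    v_sum = np.zeros(d)                         # running sum of estimates v_i
    cached = None                               # M's answer, reused between explores
    total = 0.0
    for w in ws:
        if rng.random() < gamma:                # EXPLORE
            j = int(rng.integers(d))
            s = basis_idx[j]
            cost = S_bar[s] @ w                 # observed scalar cost c(s_t, w_t)
            # Scaled by 1/gamma so that unconditionally E[v_t] = w_t.
            v_sum += (d / gamma) * cost * B_inv[:, j]
            cached = None                       # the sum changed; must re-solve
        else:                                   # EXPLOIT (v_t = 0)
            if cached is None:                  # black-box offline optimizer M
                cached = int(np.argmin(S_bar @ (v_sum + p)))
            s = cached
            cost = S_bar[s] @ w
        total += cost
    return total / len(ws)
```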

REMAINDER OF TALK

EXPLORATION: good "exploration basis" definition; finding one.

EXPLOITATION: perturbation (randomized regularization); stability analysis.

OTHER DIRECTIONS: approximation algorithms; convex problems.

EXPLORATION

GOING TO d-DIMENSIONS

Linear cost function c: S × [0,1]^d → [0,1]. Map S → [0,1]^d by

s̄ = (c(s,(1,0,…,0)), c(s,(0,1,…,0)), …, c(s,(0,…,0,1))), so that c(s,w) = s̄ · w.

S̄ = { s̄ | s ∈ S }, K = convex-hull(S̄); WLOG dim(S̄) = d.

[Slide figure: the embedded set S̄ and its convex hull K.]
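Concretely, for the routing example the embedding can be read off directly; a tiny hypothetical sketch (the routes, edges, and delays are made up for illustration):

```python
import numpy as np

# Hypothetical embedding: each route is a 0/1 edge-incidence vector in [0,1]^d,
# so c(s, w) = s_bar . w, where w holds per-edge delays.
# (Normalize appropriately to keep costs in [0,1] in the formal model.)
routes = np.array([[1, 1, 0, 0],       # route 1 uses edges 0 and 1
                   [1, 0, 1, 0],       # route 2 uses edges 0 and 2
                   [0, 0, 1, 1]])      # route 3 uses edges 2 and 3
w = np.array([0.2, 0.05, 0.3, 0.4])    # today's edge delays
print(routes @ w)                      # each route's cost as a dot product
```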

EXPLORATION BASIS

Def: an exploration basis b1, b2, …, bd ∈ S is a 2-barycentric spanner if, for every s ∈ S, s̄ = Σi αi b̄i for some α1, α2, …, αd ∈ [−2, 2].

It is possible to find an exploration basis efficiently using the offline optimizer M(w) = argmin_{s∈S} c(s,w). [AK04]

[Slide figure: S̄ and K again, contrasting a bad and a good exploration basis.]
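For intuition only, here is a brute-force sketch that verifies the spanner property and searches for a basis over a small explicit S̄. The efficient construction of [AK04] instead works through calls to M and is not shown here:

```python
import numpy as np
from itertools import combinations

def is_2_spanner(S_bar, idx):
    """Check the 2-barycentric-spanner property for basis rows idx of S_bar."""
    B = S_bar[list(idx)]                    # rows are the basis vectors b_i
    for s in S_bar:
        alpha = np.linalg.solve(B.T, s)     # solve s = sum_i alpha_i * b_i
        if np.max(np.abs(alpha)) > 2 + 1e-9:
            return False
    return True

def find_2_spanner(S_bar):
    """Brute force over all d-subsets; fine for tiny S, exponential in general."""
    n, d = S_bar.shape
    for idx in combinations(range(n), d):
        B = S_bar[list(idx)]
        if abs(np.linalg.det(B)) > 1e-12 and is_2_spanner(S_bar, idx):
            return list(idx)
    return None
```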

EXPLOITATION


INSTABILITY

Define zt = M(Σ_{i≤t} wi) = argmin_{s∈S} Σ_{i≤t} c(s, wi).

Natural idea: use zt−1 on period t?

REGRET = Ω(1)!

Per-period costs of the two decisions:

  t:          1    2    3    4    5   …
  decision 1: ½    0    1    0    1   …
  decision 2: 0    1    0    1    0   …

Following the leader switches every period and pays 1 per period, while the best fixed decision pays only about ½ per period.
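A few lines of simulation make the instability concrete, using the cost table above continued for 1000 more periods:

```python
# Per-period cost pairs from the slide's table, then the alternating pattern.
costs = [(0.5, 0.0)] + [(0.0, 1.0), (1.0, 0.0)] * 500
cum = list(costs[0])                        # cumulative costs after period 1
ftl_cost = 0.0
for ca, cb in costs[1:]:
    leader = 0 if cum[0] <= cum[1] else 1   # z_{t-1}: leader on past periods
    ftl_cost += (ca, cb)[leader]            # follow-the-leader's cost this period
    cum[0] += ca
    cum[1] += cb
print(ftl_cost / (len(costs) - 1))          # ~1.0: pays the maximum every period
print(min(cum) / len(costs))                # ~0.5: best fixed decision's average
```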

STABILITY ANALYSIS [KV03]

Define zt = M(Σ_{i≤t} wi) = argmin_{s∈S} Σ_{i≤t} c(s, wi).

Lemma: the regret of using zt on period t is 0 ("be the leader").

Proof:

  min_{s∈S} c(s,w1) + c(s,w2) + … + c(s,wT)
    = c(zT,w1) + … + c(zT,wT−1) + c(zT,wT)
    ≥ c(zT−1,w1) + … + c(zT−1,wT−1) + c(zT,wT)
    ≥ …
    ≥ c(z1,w1) + c(z2,w2) + … + c(zT,wT).

STABILITY ANALYSIS [KV03]

Define zt = M(Σ_{i≤t} wi) = argmin_{s∈S} Σ_{i≤t} c(s, wi).

Lemma: the regret of using zt on period t is 0.

⇒ Regret of using zt−1 on period t ≤ Σ_{t≤T} [c(zt−1,wt) − c(zt,wt)].

Idea: regularize to achieve stability.

Let yt = M(Σ_{i≤t} wi + p), for random p ∈ [0,1]^d.

E[regret of using yt−1 on period t] ≤ Σ_{t≤T} E[c(yt−1,wt) − c(yt,wt)] + (a penalty term for the size of the perturbation).

Strange: randomized regularization!

yt can be computed using M.
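On the cost sequence from the instability example, the perturbed leader is stable; a minimal follow-the-perturbed-leader sketch (the perturbation scale of 50 is an arbitrary demo choice, not the tuned value from the analysis):

```python
import random

def fpl_avg_cost(costs, scale, trials=200):
    """Follow the perturbed leader on a 2-decision cost sequence."""
    total = 0.0
    for _ in range(trials):
        # One time-invariant random perturbation p per run, as on the slide.
        p = [random.uniform(0, scale), random.uniform(0, scale)]
        cum = [0.0, 0.0]
        run = 0.0
        for ca, cb in costs:
            # y_{t-1} = argmin over decisions of (cumulative cost + perturbation).
            pick = 0 if cum[0] + p[0] <= cum[1] + p[1] else 1
            run += (ca, cb)[pick]
            cum[0] += ca
            cum[1] += cb
        total += run / len(costs)
    return total / trials

costs = [(0.5, 0.0)] + [(0.0, 1.0), (1.0, 0.0)] * 500
print(fpl_avg_cost(costs, scale=50))   # close to the best fixed decision's ~0.5
```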

OTHER DIRECTIONS

BANDIT CONVEX OPT.

Convex feasible set S ⊆ R^d.

Unknown sequence of concave functions f1, …, fT: S → [0,1].

On period t = 1, 2, …, T:
  Algorithm chooses xt ∈ S.
  Algorithm receives and finds out only ft(xt).

Thm: for all concave f1, f2, …: S → [0,1] and all T ≥ 1, the bacterial ascent algorithm achieves expected average regret of O(T^{-1/4}).

MOTIVATING EXAMPLE

Company has to decide how much to advertise among d channels, within a budget.

Feedback is the total profit, which is affected by external factors.

[Slide figure: a concave profit curve over advertising spend (axes: $ADVERTISING vs. $PROFIT); over successive periods the algorithm plays x1, x2, x3, x4 and observes f1(x1), f2(x2), f3(x3), f4(x4); x* marks the optimum.]

BACTERIAL ASCENT

[Slide figures: an explore/exploit animation inside the feasible set S, tracing the iterates x0, x1, x2, x3.]
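The transcript does not preserve the details of bacterial ascent, but the explore/exploit picture matches the standard one-point gradient estimator from the bandit convex optimization literature. Below is a hedged sketch of that estimator, not necessarily the slides' exact algorithm; the toy function, step sizes, and the feasible set S = [0,1]^d are illustrative assumptions:

```python
import numpy as np

def one_point_ascent(f, x0, T, delta=0.05, eta=0.1, rng=np.random.default_rng(0)):
    """Bandit concave maximization: only f(x_t) is observed each period.
    Probe a random nearby point (EXPLORE), then step along an unbiased
    one-point gradient estimate (EXPLOIT)."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    baseline = 0.0                    # subtracting a constant keeps E[g_hat] unchanged
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)        # uniform random direction on the sphere
        payoff = f(x + delta * u)     # EXPLORE: the single bandit observation
        g_hat = (d / delta) * (payoff - baseline) * u  # one-point gradient estimate
        baseline = payoff             # variance-reduction baseline
        x = np.clip(x + eta * g_hat, 0.0, 1.0)  # EXPLOIT: ascend, stay in [0,1]^d
    return x

# Toy concave profit, peaked at 0.6 per channel (made up; tuning matters).
f = lambda x: 1.0 - float(np.mean((x - 0.6) ** 2))
print(one_point_ascent(f, x0=[0.2, 0.9], T=5000))
```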

APPROXIMATION ALG'S

What if offline optimization is NP-hard? Example: the repeated traveling salesman problem. Suppose you have an α-approximation algorithm A:

c(A(w), w) ≤ α · min_{s∈S} c(s, w) for all w ∈ [0,1]^d.

Would like to achieve low α-regret = our cost − α · (min cost of the best s ∈ S).

Possible using the convex optimization approach above and transformations of approximation algorithms. [KKL07]

CONCLUSIONS

Can extend bandit algorithms to structured problems: worst-case low-regret guarantees for linear combinatorial optimization problems and for convex optimization.

Remarks: works against adaptive adversaries as well; online efficiency ≈ offline efficiency; can handle approximation algorithms; can achieve cost ≤ (1 + ε) · min cost + O(1/ε).
