Machine Learning for Online Decision Making + applications to Online Pricing
Machine Learning for Online Decision Making + applications to Online Pricing

Your guide: Avrim Blum, Carnegie Mellon University
Plan for Today
An interesting algorithm for online decision making. Problem of “combining expert advice”
Same problem but now with very limited feedback: the “multi-armed bandit problem”
Application to online pricing
Using “expert” advice
E.g., say we want to predict the stock market.
• We solicit n “experts” for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction.

Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight?

[“expert” = someone with an opinion. Not necessarily someone who knows anything.]
Simpler question
• We have n “experts”.
• One of these is perfect (never makes a mistake). We just don’t know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?

Answer: sure. Just take majority vote over all experts that have been correct so far. Each mistake cuts the # of available experts by a factor of 2.

Note: this means it is ok for n to be very large.
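This halving strategy can be sketched in code; the input format (one 0/1 prediction per expert per round) is an assumption for illustration.

```python
def halving(expert_preds, outcomes):
    """Majority vote over the experts that have been correct so far.

    expert_preds[t][i] = expert i's 0/1 prediction at round t;
    outcomes[t] = the true 0/1 outcome. Returns # mistakes we make.
    """
    alive = set(range(len(expert_preds[0])))     # experts with no mistakes yet
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        ones = sum(1 for i in alive if preds[i] == 1)
        guess = 1 if 2 * ones >= len(alive) else 0   # majority vote (tie -> 1)
        if guess != outcome:
            mistakes += 1   # majority was wrong: at least half of `alive` is cut
        alive = {i for i in alive if preds[i] == outcome}
    return mistakes
```

Each of our mistakes removes at least half of `alive`, so with a perfect expert present we make at most lg(n) mistakes.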
What if no expert is perfect?
One idea: just run the above protocol until all experts are crossed off, then repeat.
Makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).
Can we do better?
What if no expert is perfect?
Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting weight in half.
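The Weighted Majority algorithm as a minimal sketch (same illustrative input format as the halving sketch):

```python
def weighted_majority(expert_preds, outcomes):
    """Predict by weighted majority vote; halve the weight of each mistaken expert."""
    n = len(expert_preds[0])
    w = [1.0] * n                      # start with all weights 1
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        ones = sum(w[i] for i in range(n) if preds[i] == 1)
        zeros = sum(w[i] for i in range(n) if preds[i] == 0)
        guess = 1 if ones >= zeros else 0
        if guess != outcome:
            mistakes += 1
        for i in range(n):
            if preds[i] != outcome:
                w[i] /= 2              # penalize, but don't cross off
    return mistakes
```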
Analysis: do nearly as well as best expert in hindsight
• M = # mistakes we've made so far.
• m = # mistakes best expert has made so far.
• W = total weight (starts at n).
• After each mistake, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• Weight of best expert is (1/2)^m. So, (1/2)^m ≤ n(3/4)^M, which solves to M ≤ 2.4(m + lg n).

So, if m is small, then M is pretty small too.
Randomized Weighted Majority
M ≤ 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
• Instead of taking majority vote, use weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then pick up:down 70:30.) Idea: smooth out the worst case.
• Also, generalize ½ to 1−ε.
(M = expected # mistakes)
Analysis
• Say at time t we have fraction F_t of weight on experts that made a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
– W_final = n(1 − εF_1)(1 − εF_2)…
– ln(W_final) = ln(n) + ∑_t ln(1 − εF_t) ≤ ln(n) − ε ∑_t F_t (using ln(1−x) ≤ −x)
  = ln(n) − εM. (∑_t F_t = E[# mistakes] = M)
• If the best expert makes m mistakes, then ln(W_final) ≥ ln((1−ε)^m).
• Now solve: ln(n) − εM ≥ m ln(1−ε).
Additive regret
• So, have M ≤ OPT + εOPT + (1/ε) log(n).
• Say we know we will play for T time steps. Then can set ε = (log(n)/T)^{1/2}. Get M ≤ OPT + 2(T log(n))^{1/2}.
• If we don’t know T in advance, can guess and double.
• These are called “additive regret” bounds.
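A small numeric sketch of the tuned bound (the helper names are mine, for illustration):

```python
import math

def tuned_eps(n, T):
    """eps = (ln(n)/T)^(1/2) balances eps*OPT (at most eps*T) against (1/eps)*ln(n)."""
    return math.sqrt(math.log(n) / T)

def additive_regret(n, T):
    """The resulting regret ~ 2*(T ln n)^(1/2); per-step regret vanishes as T grows."""
    return 2 * math.sqrt(T * math.log(n))
```

Usage: with n = 100 experts and T = 10^6 rounds, per-step regret is about 2·sqrt(ln(100)/10^6) ≈ 0.004.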
Extensions
• What if experts are actions? (rows in a matrix game, choice of deterministic alg to run, …)
• At each time t, each has a loss (cost) in {0,1}.
• Can still run the algorithm:
– Rather than viewing as “pick a prediction with prob proportional to its weight”,
– view as “pick an expert with probability proportional to its weight”.
• Same analysis applies.
Extensions
• What if losses (costs) are in [0,1]? Here is a simple way to extend the results.
• Given cost vector c, view c_i as the bias of a coin. Flip to create boolean vector c′, s.t. E[c′_i] = c_i. Feed c′ to alg A. (cost′ = cost on the c′ vectors)
• For any sequence of vectors c′, we have:
– E_A[cost′(A)] ≤ min_i cost′(i) + [regret term]
– So, E_$[E_A[cost′(A)]] ≤ E_$[min_i cost′(i)] + [regret term]
• LHS is E_A[cost(A)].
• RHS ≤ min_i E_$[cost′(i)] + [r.t.] = min_i cost(i) + [r.t.]

In other words, costs between 0 and 1 just make the problem easier…
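The coin-flipping step of this reduction, sketched (the {0,1}-cost algorithm A itself is unchanged):

```python
import random

def booleanize(costs, rng):
    """Flip a coin of bias c_i per coordinate: returns c' in {0,1}^n with E[c'_i] = c_i."""
    return [1 if rng.random() < c else 0 for c in costs]
```

Feeding these c′ vectors to the {0,1}-cost algorithm, the expectations above show its regret guarantee carries over to the original [0,1] costs.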
Online pricing
• Say you are selling lemonade (or a cool new software tool, or bottles of water at the world expo).
• Protocol #1: for t = 1, 2, …, T
– Seller sets price p_t
– Buyer arrives with valuation v_t
– If v_t ≥ p_t, buyer purchases and pays p_t, else doesn't.
– v_t revealed to algorithm.
– repeat
• Protocol #2: same as Protocol #1 but without v_t revealed.
• Assume all valuations are in [1,h].
• Goal: do nearly as well as best fixed price in hindsight.
Online pricing (Protocol #1: v_t revealed to algorithm)
• Bad algorithm: “best price in past”
– What if sequence of buyers = 1, h, 1, …, 1, h, 1, …, 1, h, …
– Alg makes T/h, OPT makes T. Factor of h worse!
Online pricing (Protocol #1: v_t revealed to algorithm)
• Good algorithm: Randomized Weighted Majority!
– Define one expert for each price p ∈ {1, …, h}, so #experts = h.
– Best price of this form gives profit OPT.
– Run RWM algorithm. Get expected gain at least: OPT/(1+ε) − O(ε⁻¹ h log h)
[extra factor of h coming from range of gains]
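A sketch of this reduction for Protocol #1: one expert per integer price, and since v_t is revealed, every expert's gain is known each round (full information). The multiplicative update on gains scaled by h is one standard choice; parameter values and names here are illustrative.

```python
import random

def rwm_pricing(valuations, h, eps=0.1, seed=0):
    """One expert per integer price p in {1, ..., h}; post a price sampled by weight."""
    rng = random.Random(seed)
    prices = list(range(1, h + 1))
    w = [1.0] * h
    revenue = 0
    for v in valuations:
        r = rng.random() * sum(w)
        acc, chosen = 0.0, h
        for i, p in enumerate(prices):   # post price p with prob w[i]/W
            acc += w[i]
            if r <= acc:
                chosen = p
                break
        if chosen <= v:
            revenue += chosen            # buyer accepts iff v_t >= p_t
        for i, p in enumerate(prices):   # v_t is revealed: update every expert
            gain = p if p <= v else 0
            w[i] *= (1 + eps) ** (gain / h)   # gains scaled into [0,1] by h
    return revenue
```

With a stream of identical buyers the weights concentrate on the best fixed price, e.g. price 5 when every valuation is 5 and h = 10.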
Online pricing
• What about Protocol #2? [just see accept/reject decision]
– Now we can't run RWM directly since we don't know how to penalize the experts!
– Called the “adversarial multi-armed bandit problem”
– How can we solve that?
Multi-armed bandit problem
Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]

Exp3 wraps RWM (n = #experts):
• RWM maintains a distribution p^t over experts; Exp3 draws expert i ~ q^t, where q^t = (1−γ)p^t + γ·unif.
• Only the chosen expert's gain g_i^t is observed; feed RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0), whose entries are at most nh/γ.

1. RWM believes its gain is: p^t · ĝ^t = p_i^t (g_i^t / q_i^t) ≡ g^t_RWM.
2. ∑_t g^t_RWM ≥ ÔPT/(1+ε) − O(ε⁻¹ (nh/γ) log n), where ÔPT = max_j ∑_t ĝ_j^t.
3. Actual gain is: g_i^t = g^t_RWM (q_i^t / p_i^t) ≥ g^t_RWM (1−γ).
4. E[ÔPT] ≥ OPT, because E[ĝ_j^t] = (1 − q_j^t)·0 + q_j^t (g_j^t / q_j^t) = g_j^t, so E[max_j ∑_t ĝ_j^t] ≥ max_j E[∑_t ĝ_j^t] = OPT.
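A sketch of Exp3 under the assumption that gains lie in [0,1] (so the estimates ĝ are at most n/γ; with gains in [0,h], divide by h first). The function signature is illustrative.

```python
import random

def exp3(gain, n, T, gamma=0.1, eps=0.1, seed=0):
    """Run RWM on estimated gains; sample from q_t = (1-gamma) p_t + gamma * uniform.

    gain(t, i) returns expert i's gain in [0,1] at round t; it is queried only
    for the single chosen expert (bandit feedback).
    """
    rng = random.Random(seed)
    w = [1.0] * n
    total = 0.0
    for t in range(T):
        W = sum(w)
        q = [(1 - gamma) * wi / W + gamma / n for wi in w]
        r, acc, i = rng.random(), 0.0, n - 1
        for j in range(n):                 # sample expert i ~ q_t
            acc += q[j]
            if r <= acc:
                i = j
                break
        g = gain(t, i)
        total += g
        ghat = g / q[i]                    # unbiased estimate: E[ghat_i] = g_i
        w[i] *= (1 + eps) ** (ghat * gamma / n)   # exponent stays in [0,1]
    return total
```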
Multi-armed bandit problem: Exp3, continued

Conclusion (γ = ε): E[Exp3] ≥ OPT/(1+ε)² − O(ε⁻² n h log(n))

Quick improvement: choose expert i to be price (1+ε)^i. Gives n = log_{1+ε}(h), and only hurts OPT by at most a (1+ε) factor.
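The geometric price grid from the quick improvement, sketched (watch for floating-point edge cases when h is exactly a power of 1+ε):

```python
import math

def price_grid(h, eps):
    """Candidate prices 1, (1+eps), (1+eps)^2, ..., up to h.

    Any fixed price in [1,h] has a grid point within a (1+eps) factor below it,
    so restricting to the grid costs at most a (1+eps) factor of OPT.
    """
    k = int(math.log(h, 1 + eps))       # largest i with (1+eps)^i <= h
    return [(1 + eps) ** i for i in range(k + 1)]
```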
Multi-armed bandit problem: Exp3, continued

Conclusion (γ = ε and n = log_{1+ε}(h)): E[Exp3] ≥ OPT/(1+ε)³ − O(ε⁻² h log(h) loglog(h))

Almost as good as Protocol #1! Can even reduce ε⁻² to ε⁻¹ with more care in the analysis.
Summary
Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice.
• Application: play a repeated game against an adversary. Perform nearly as well as the best fixed strategy in hindsight.
Can apply even with very limited feedback.
• Application: online pricing, even if we only have buy/no-buy feedback.